[squid-users] squid and netdata causes squid to drop SYN?

Wed Jan 22 06:40:48 UTC 2020

On 22/01/20 6:55 pm, Amish wrote:
> On 21/01/20 9:09 pm, Alex Rousskov wrote:
>> On 1/20/20 11:28 PM, Amish wrote:
>>
>>> 2) Is calling squidclient so frequently a right thing to do by netdata?
>> The answer depends on what cache manager query (or queries) your netdata
>> is sending to Squid. Sending some queries every second is perfectly
>> fine, but there are other, "heavy" queries that should not be sent so
>> often and could, if sent with a high enough concurrency level,
>> effectively DoS a Squid instance. For example, queries that require
>> iterating all cached objects should not be sent to busy Squids.
>>
>> If netdata does not document the queries it uses, you can probably use
>> Squid access.log to figure out what queries netdata is sending (and how
>> long they take).
> 
> Thanks Matus UHLAR and Alex for responses.
> 
> I have not gone in detail through netdata sources but here is whatever I
> could find.
> 
> Squid python code that runs HTTP query on squid: (I have never coded in
> python)
> https://github.com/netdata/netdata/blob/master/collectors/python.d.plugin/squid/squid.chart.py
> 
> 
> Configuration that decides what to query. (netdata chooses one of
> options specified)
> https://github.com/netdata/netdata/blob/master/collectors/python.d.plugin/squid/squid.conf
> 
> 
> It appears that it runs a query on "counters". But I don't know if that
> is counted as a "heavy" query or not.

It is one of the light ones. So if that were all that is going on I
would not be expecting a problem.
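
To follow up on Alex's suggestion of checking access.log, here is a
minimal sketch of how the cache manager queries could be counted and
timed. It assumes the default native "squid" log format (elapsed
milliseconds in the second field, URL in the seventh) and assumes the
queries show up either as cache_object:// URLs or as
/squid-internal-mgr/ paths. The function name and the log path in the
usage comment are my own placeholders, so adjust them to your setup.

```python
import re
from collections import defaultdict

# Assumption: manager queries are logged either as
#   cache_object://host[:port]/<report>   or
#   http://host:port/squid-internal-mgr/<report>
MGR_URL = re.compile(r"(?:cache_object://[^/\s]+/|/squid-internal-mgr/)([\w.-]*)")


def summarize_mgr_queries(lines):
    """Return {report_name: (query_count, mean_elapsed_ms)}.

    Assumes the native access.log layout:
    time elapsed client code/status bytes method URL ...
    """
    stats = defaultdict(lambda: [0, 0])  # report -> [count, total_ms]
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete native-format line
        match = MGR_URL.search(fields[6])
        if not match:
            continue  # ordinary proxy traffic
        try:
            elapsed_ms = int(fields[1])
        except ValueError:
            continue  # a custom logformat is probably in use
        report = match.group(1) or "(menu)"
        stats[report][0] += 1
        stats[report][1] += elapsed_ms
    return {name: (count, total / count) for name, (count, total) in stats.items()}


# Usage (the path is an assumption, check your access_log directive):
#   with open("/var/log/squid/access.log") as log:
#       for report, (count, mean_ms) in summarize_mgr_queries(log).items():
#           print(report, count, round(mean_ms, 1))
```

If only "counters" shows up, with small elapsed times, the polling
itself is unlikely to be the problem.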

The worst case I have seen with the quick reports is tools not closing
the sockets properly and running out of FD numbers. But at 1/sec there
would only be 900 FD held up for the TCP 15min TIMEWAIT, not enough
relative to your 16K available to cause the level of issues you are seeing.
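
The arithmetic behind that estimate, as a small sketch. It takes the
15 minute hold time and the 1/sec query rate from this thread as given
and assumes "16K" means 16384 descriptors; the function name is just
for illustration.

```python
def lingering_fds(queries_per_sec, hold_seconds):
    """Steady-state number of FDs held if every query leaves one
    socket lingering for hold_seconds before it is released."""
    return int(queries_per_sec * hold_seconds)


held = lingering_fds(1, 15 * 60)   # one query per second, 15 minute hold
available = 16 * 1024              # assumption: "16K" == 16384

print(f"{held} of {available} FDs held ({held / available:.1%})")
```

Even a poller that never closes cleanly stays far below the limit at
this rate, so something else would have to be consuming descriptors.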

Your mention of intercepting traffic to port 3128 makes me wonder if the
netdata auto-detect is trying to use that port.
 If that is happening there would be some visible effects:
 A) if you have firewall protection DROP'ing direct connections to the
intercept port, that would show up exactly as SYN with no SYN+ACK on any
auto-detect probes the tool sent to that port. (This is one reason I am
so vocal about people not using port 3128 to intercept).

 B) if you are missing that mandatory firewall protection, the tool may
be triggering forwarding loops, which use many more FD than they should
(up to all of them). The visible effects would be clients not able to
connect around peak times and SYN being dropped when limits are hit
(either the loop limit, or interception port protection).
 If the tool is smart enough to detect the error state and move to
another port for a while that might explain the intermittent nature.
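
One rough way to tell the two cases apart from the netdata host is to
open a direct TCP connection to the port and see how it ends. The
sketch below is only a probe: a timeout is consistent with case A (SYN
silently dropped), but could equally mean an unreachable host, while a
successful connect means direct connections do reach the port, which
is the precondition for case B. The host and port in the usage line
are assumptions.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify the outcome of one direct TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # handshake completed
    except socket.timeout:
        return "timeout"         # no SYN+ACK (or RST) arrived in time
    except ConnectionRefusedError:
        return "refused"         # RST received: nothing listening, or REJECT rule
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host


if __name__ == "__main__":
    # Assumption: run on the netdata host, against the intercept port.
    print(probe("127.0.0.1", 3128))
```

Comparing the result with a tcpdump on the Squid box would confirm
whether the SYN arrived at all.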

> 
>> N.B. If netdata is killing the previous query when starting a new
>> would-be-concurrent query, then there should be no DoS conditions -- a
>> single "heavy" query may slow Squid down a bit but should not stall the
>> whole Squid instance. Thus, if netdata ensures that the number of
>> concurrent cache manager queries is small, then there may be a Squid bug
>> related to terminating an aborted query. Otherwise, one could argue that
>> the lack of concurrency controls is a netdata bug.
> 
> Not sure if netdata terminates previous query or not. But I do see use
> of keep-alive in netdata code.
> 
> And also I completely understand that this area needs to be looked upon
> by netdata team. I will follow up with them.
> 
> But posting here just in case a quick glance can reveal a squid bug (or
> buggy approach by netdata) somewhere.

Since Squid is sending a chunked header, the response should be chunked
like they expect and the keep-alive pipeline available for use on their
follow-up requests. If the connection is instead closed with a proper
TCP closure sequence, that is not a problem either.
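
If you want to check this yourself before I get to it, something like
the sketch below sends several requests over one connection and
records, for each response, the status, the Transfer-Encoding header,
whether the client library expects the connection to close, and the
body size. This is not netdata's code, just Python's standard
http.client. The function name is a placeholder, and the host, port
and /squid-internal-mgr/counters path in the usage comment are
assumptions about your setup.

```python
import http.client


def poll_on_one_connection(host, port, path, count=3, timeout=5.0):
    """Issue `count` GET requests on a single HTTPConnection.

    Returns a list of (status, transfer_encoding, will_close, body_len).
    will_close=True means the reply ended the persistent connection;
    http.client then reconnects transparently for the next request.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    results = []
    try:
        for _ in range(count):
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()  # must drain the body before reusing the socket
            results.append((resp.status,
                            resp.getheader("Transfer-Encoding"),
                            resp.will_close,
                            len(body)))
    finally:
        conn.close()            # proper close, no socket left half-open
    return results


# Usage (assumed values, use a non-intercept port):
#   for row in poll_on_one_connection("localhost", 3129,
#                                     "/squid-internal-mgr/counters"):
#       print(row)
```

A row showing "chunked" with will_close False would mean the report is
behaving as described above.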

(NP: Just a "should" because I have run out of time today to confirm
that particular mgr report is acting properly - some do, others are not
so nice.)

Amos