[squid-users] Cache ran out of descriptors due to ICAP service/TCP SYNs ?
Amos Jeffries
squid3 at treenet.co.nz
Wed Jul 18 15:52:33 UTC 2018
On 18/07/18 18:30, Ahmad, Sarfaraz wrote:
> Thanks for the reply. I haven't completely understood the revert and have a few more related questions.
>
> I see these messages,
> Jul 17 19:21:14 proxy2.hyd.deshaw.com squid[5747]: suspending ICAP service for too many failures
> Jul 17 19:21:14 proxy2.hyd.deshaw.com squid[5747]: optional ICAP service is suspended: icap://127.0.0.1:1344/reqmod [down,susp,fail11]
> 1) If the ICAP service is unresponsive, Squid would not exhaust its file descriptors trying to reach the service again and again right (too many TCP SYNs for trying to connect to the ICAP service )?
>
Correct. It would not exhaust resources on *that* action. Other actions
that may result from that state are another matter entirely.
>
>
> Max Connections returned by the ICAP service is 16. And given my ICAP settings,
> icap_enable on
> icap_service test_icap reqmod_precache icap://127.0.0.1:1344/reqmod bypass=on routing=off on-overload=wait
> On-overload is set to "wait". The documentation says " * wait: wait (in a FIFO queue) for an ICAP connection slot" . This means that a new TCP connection would not be attempted if max connections is reached right ?
> 2) Am I right in saying that if the ICAP service is underperforming or has failed, this won't lead a sudden increase in the open file descriptors with on-overload set to "wait" ?
>
No. The side effects of the ICAP service not being used determine the
possible outcomes there.
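FWIW, if you want that cap enforced explicitly rather than relying only on the
Max-Connections value the service advertises in its OPTIONS response, something
along these lines should work (a sketch only; the max-conn=16 value here just
mirrors what your service reports):

  icap_service test_icap reqmod_precache icap://127.0.0.1:1344/reqmod bypass=on routing=off on-overload=wait max-conn=16

With on-overload=wait the excess transactions queue for a free slot instead of
opening new connections to the ICAP service.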
>
> Also I have no way to explain the "connection reset by peer" messages.
Neither do I, given the details provided.
> Jul 13 11:23:18 <hostname> squid[13123]: Error negotiating SSL connection on FD 1292: (104) Connection reset by peer
> Jul 13 11:23:18 <hostname> squid[13123]: Error negotiating SSL connection on FD 1631: (104) Connection reset by peer
> Jul 13 11:35:17 <hostname> squid[13123]: Error negotiating SSL connection on FD 1331: (104) Connection reset by peer
>
> I have a few proxies (running in separate virtual machines). All of them went unresponsive at around the same time, leading to an outage of the internet.
> I am using WCCPv2 to redirect from firewall to these proxies. I checked the logs there and WCCP communication was not intermittent.
> The logs on the proxies are bombarded with " Error negotiating SSL connection on FD 1331: (104) Connection reset by peer " messages.
A strong sign that forwarding loops are occurring, or that something cut a
huge number of TCP connections at once.
That said, syslog recording depends on network delivery, so under heavy
network flooding its timestamps can be inaccurate or out of order.
> Since the ICAP service is not SSL-protected I think these messages mostly imply receiving TCP RSTs from remote servers. (or could it be clients somehow?).
Yes, another reason I am thinking along the lines of forwarding loops.
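An easy way to confirm or rule that out, assuming Via has not been disabled
(Squid's loop detection relies on that header):

  # squid.conf - keep Via enabled so Squid can detect its own loops
  via on

then search cache.log for "Forwarding loop detected" warnings around the time
of the outage.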
> Once I removed the WCCP redirection rules from the firewall, internet was back up.
> This hints that something in this proxy pipeline was amiss and not with the internet link itself. I don't see any outages on that.
Nod. Keep in mind though that the "proxy pipeline" includes the WCCP rules
on the router, the inbound NAT rules on the proxy machine, the proxy config,
the connection to/from the ICAP server, the outgoing NAT rules on the proxy
machine, and the WCCP rules on the router a second time.
That is a lot of parts, most of them outside of Squid, and any one of them
can break the entire pathway.
> I am pretty sure ACLs weren't changed and there was no forwarding loop.
> What could possibly explain the connection reset by peer messages ? Even if the internet was down, that won't lead to TCP RSTs.
> I cannot tie these TCP RSTs and the incoming requests getting held up and ultimately leading to FD exhaustion.
Too many possibilities to list here, and we do not have sufficient
information. You need to track down exactly which software is generating
them, and why.
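If nothing else logs it, a packet capture on the proxy is usually the quickest
way to see whether the RSTs come from clients, origin servers, or a middlebox
in between (a sketch only; the interface name and port are placeholders, and
this particular filter only matches IPv4 traffic):

  tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0 and port 443'

The source address on the RST packets tells you which box to investigate next.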
>
> You earlier said
>>> In normal operation it is not serious, but you are already into abnormal operation by the crashing. So not releasing sockets/FD fast enough makes the overall problem worse.
> If squid-1 is crashing and getting respawned, it will have its own 16K FD limit right, I wonder how the newer squid-1 serves older requests. Can you please elaborate on " So not releasing sockets/FD fast enough makes the overall problem worse." ?
>
Depending on your OS there are per-process and per-process-group limits. The
latter may apply if you are using SMP workers.
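In case it helps, the per-process limit can also be raised from squid.conf on
reasonably recent Squid versions (a sketch only; the OS per-process and
process-group limits still cap whatever value is configured here):

  # squid.conf
  max_filedescriptors 16384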
Amos