[squid-users] BUG 3556

Thu Oct 15 13:59:03 UTC 2020

On 10/15/20 4:07 AM, Stephen Borrill wrote:
> At a few installations of squid 4.12 (patched for GREASE) on NetBSD
> 9, I'm seeing that occasionally one of the listening ports no longer 
> accepts connections (it doesn't reject them, but a connection does
> not get established). The port appears random; it's not the same
> every time and isn't related to ports with SSL interception. A
> restart of squid fixes it.

> Looking through the logs, this appears to coincide with lines such as:
> 
> 2020/10/14 22:32:16 kid1| ERROR: getsockname() failed to locate local-IP
> on local=[::] remote=10.0.106.147:61996 FD 25 flags=1: (22) Invalid argument
> 2020/10/14 22:32:16 kid1| BUG 3556: FD 25 is not an open socket.
> 
> This looks similar to Alex Rousskov's recent observations:
> https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15

Please keep in mind that those "BUG 3556" messages warn us about Squid
bugs elsewhere/somewhere in Squid code. For each particular message
instance, the exact bug is unknown a priori, and several different bugs
have triggered these messages in the past. While the original bug 3556
report was for a specific bug, the log messages were not (and are not).

> However, we have also seen this at sites where there is no SSL
> interception (the above lines are from such an installation).

> 1) Am I right that triggering BUG 3556 could lead to the described symptoms?

I would rephrase this as "Failure to obtain the (intended) IP address of
an (intercepted) connection leads to BUG 3556 messages."

> 2) Is this a squid bug or triggered by a problem in the underlying OS?

Those "BUG 3556" messages indicate a Squid bug. There is no question
about that. However, the ERROR messages may indicate a Squid
bug/deficiency and/or an environment (OS configuration, etc.) problem.
In summary, you are dealing with multiple problems here. You should
focus on the ERROR messages, not "BUG 3556" messages.

> If the latter, where to start looking?

Check system log for errors. Perhaps you are exhausting some system
resource?

I would also try to map ERROR messages to client transactions in the
hope of spotting some common pattern behind those failed transactions.
Unfortunately, I do not know whether Squid (especially Squid v4) would
log these failed transactions.

> 3) What workarounds are there?

a) Monitor logs and automatically restart the Squid instance if needed.

b) Patch Squid to kill the affected process. Adding "assert(false);"
after the ERROR message is printed in Comm::TcpAcceptor::oldAccept()
will kill the process. Killing a single worker may or may not be enough
in SMP mode; it would be interesting and potentially useful to know
whether that is enough.

You may be able to easily test your workaround using the trick I
outlined in https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15
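For workaround (a), a minimal sketch of the log check, in Python. It
assumes the monitor is fed cache.log lines as strings; the function name
needs_restart is hypothetical, and the pattern matches only the stable
prefix of the ERROR line quoted above. It deliberately ignores the
"BUG 3556" lines, for the reason given in my answer to question 2. How
the restart itself is triggered depends on your service manager and is
left out.

```python
import re

# Stable prefix of the ERROR line quoted in this thread. The FD number and
# addresses that follow it vary, so they are not part of the pattern.
ERROR_RE = re.compile(r"ERROR: getsockname\(\) failed to locate local-IP")


def needs_restart(log_lines):
    """Return True if any of the given log lines contains the ERROR above.

    Hypothetical helper: the caller decides which lines to pass in
    (for example, only lines appended since the previous check).
    """
    return any(ERROR_RE.search(line) for line in log_lines)
```

A cron job or a small daemon could call this on newly appended lines and
restart the instance when it returns True.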

> Given that a restart fixes it, in some
> respects it would be better for squid to quit so it can be restarted
> rather than continue to run in a half-working state.

Yes, earlier Squids were written using the "Do whatever you can to stay
up" or "Damn the torpedoes!" principle. FWIW, I am pushing for reversing
the relevant code logic to follow the "Squid instance that cannot
provide an essential service explicitly requested by the admin should
quit with an error" principle, but it will take time to achieve that ideal.

> 4) Related to 3), are there any other ways to detect the problem when it
> is happening besides parsing logs or testing if all ports are accepting
> connections? This could be used to trigger an automated restart as a
> temporary workaround.

Yes, see suggestion 3b above.
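
If you still want the port test mentioned in the question as a stopgap,
a rough Python sketch follows. The names port_accepts and stuck_ports
are hypothetical, and so is the 5-second default timeout. It assumes
that a stuck port shows up as a failed or timed-out TCP connect, as in
the original report; I have not verified that a plain connect is always
enough to reveal the problem.

```python
import socket


def port_accepts(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port is established
    within `timeout` seconds, False on refusal, timeout, or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes timeouts and refused connections
        return False


def stuck_ports(host, ports, timeout=5.0):
    """Return the subset of `ports` that did not accept a connection."""
    return [p for p in ports if not port_accepts(host, p, timeout)]
```

A non-empty result from stuck_ports() for your configured listening
ports could then trigger the restart.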

HTH,

Alex.