[squid-users] Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20

Fri Aug 30 13:40:11 UTC 2019

On 8/30/19 8:16 AM, Ilari Laitinen wrote:

> I also noticed that the platform (in general) tries to resolve ipv6
> first, but the TCP dumps have no ipv6 packets at all. This is
> baffling, because there were indeed some unrelated open ipv6
> connections on the Squid server (reported by netstat).

You may be able to validate your packet collection rules by adjusting
them to include those known IPv6 connections/ports. Perhaps you are just
not collecting IPv6 traffic.
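
One way to build such a capture filter is sketched below in Python. It
assumes a Linux server that exposes /proc/net/tcp6 in its usual layout
on a little-endian host; the function names are made up for
illustration. As far as I know, sockets with IPv4-mapped addresses show
up as tcp6 in netstat but travel as IPv4 on the wire, so the sketch
skips them.

```python
# Hex prefix of an IPv4-mapped address (::ffff:a.b.c.d) as printed in
# /proc/net/tcp6 on a little-endian host (assumption).
V4_MAPPED_PREFIX = "0000000000000000FFFF0000"
LISTEN = "0A"  # TCP state code for listening sockets


def ipv6_remote_ports(proc_net_tcp6_text):
    """Collect remote ports of non-listening, genuinely IPv6 TCP sockets."""
    ports = set()
    for line in proc_net_tcp6_text.splitlines()[1:]:  # skip header line
        fields = line.split()
        if len(fields) < 4 or fields[3] == LISTEN:
            continue
        addr_hex, port_hex = fields[2].rsplit(":", 1)
        if addr_hex.upper().startswith(V4_MAPPED_PREFIX):
            continue  # IPv4 traffic on an IPv6 socket
        ports.add(int(port_hex, 16))
    return ports


def capture_filter(ports):
    """Build a pcap filter expression that matches IPv6 on the given ports."""
    if not ports:
        return "ip6"
    return "ip6 and (" + " or ".join(f"port {p}" for p in sorted(ports)) + ")"
```

Feeding the contents of /proc/net/tcp6 to ipv6_remote_ports() and the
result to capture_filter() gives an expression that can be passed to
tcpdump. If that capture still shows nothing while the connections are
active, the collection rules are the likely culprit.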

It is also possible that Squid gets AAAA records but never uses them
because Squid thinks that IPv6 is disabled on your server.

> I unfortunately cannot share the debug log because it contains some
> sensitive information. We nevertheless recorded what ended up being a
> huge sample.

If you hire a Squid developer to help you, they should be willing to
sign a reasonable NDA and/or view data on your servers, without copying.
IMHO, it does not make much sense to sit on likely valuable direct
information while, at the same time, spending a lot of time looking for
distant echoes of that same information elsewhere!

> I suspect Squid might be waiting for local TCP ports from the kernel
> (or something related).

IIRC, the ephemeral source port allocator is instantaneous -- Squid either
gets a port or a port allocation error, without waiting. When we
overload the server with high-performance tests (without an explicit
port manager), we see port allocation errors rather than stalled tests.
However, perhaps that is not true in your OS/environment.

> Right now, there are four different IP addresses returned for the
> target cloud service. For practical purposes, they are returned in a
> random order. The traffic would ideally be spread over all of them.
> Unfortunately it is evident both from the debug log and from the TCP
> dump that Squid is using only one of the addresses at a time. The
> number of connections in the TIME_WAIT state for that single IP
> address gets very close to the maximum defined by the
> net.ipv4.ip_local_port_range sysctl. After a while (a minute or so in
> the recording) this address changes presumably in response to a new
> DNS query result.

In theory, Squid should round-robin across all destination IP addresses
for a single host name. If your Squid v3 does not, it is probably a
Squid bug that can be fixed [by upgrading].

That said, IIRC, the notion of "round robin" is rather vague in Squid
because there are several places where an IP may be requested for the
same host name inside the same transaction. I would not be surprised if
that low-level round-robin behavior results in the same IP being used
for most transactions in some environments (until an error or a new DNS
query reshuffles the IPs). Debugging logs may expose this problem.
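
Short of full debugging logs, access.log can give a rough picture of
how destinations are actually being picked. The Python sketch below
assumes the native "squid" logformat, where the first field is an epoch
timestamp and the next-hop address appears as HIER_DIRECT/<ip>. The
function name is illustrative; adjust the pattern if you use a custom
logformat.

```python
import re
from collections import Counter, defaultdict

# Hierarchy field of the native access.log format, for example
# "HIER_DIRECT/203.0.113.10" (assumes the default "squid" logformat).
HIER_FIELD = re.compile(r"\bHIER_DIRECT/(\S+)")


def destinations_per_minute(log_lines, host):
    """Count destination IPs used for `host`, bucketed by minute."""
    buckets = defaultdict(Counter)
    for line in log_lines:
        if host not in line:
            continue
        match = HIER_FIELD.search(line)
        if not match:
            continue
        # First field is seconds since the epoch with milliseconds.
        minute = int(float(line.split()[0])) // 60
        buckets[minute][match.group(1)] += 1
    return buckets
```

If each one-minute bucket contains a single address while different
buckets contain different addresses, Squid is effectively sticking to
one IP between DNS refreshes rather than rotating per transaction.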

> Could this be the bottleneck?

I would expect that the lack of ports would lead to errors, not stalled
transactions. However, there may be some hidden dependency that I am
missing. For example, a lack of ports may lead to errors that are not
logged where you can see them but that trigger excessive DNS retries
and/or Squid bugs, which in turn cause delays.
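
To see how close the server actually gets to running out of ports, the
TIME_WAIT counts per destination can be compared with the size of the
ephemeral port range. A rough Python sketch, assuming Linux, IPv4 and a
little-endian host; the function names are illustrative only:

```python
import socket
import struct
from collections import Counter

TIME_WAIT = "06"  # TCP state code used in /proc/net/tcp


def decode_ipv4(hex_addr):
    """Decode an IPv4 address as printed in /proc/net/tcp.

    Assumes a little-endian host, where 0100007F means 127.0.0.1.
    """
    return socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))


def time_wait_by_remote(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote IPv4 address."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip header line
        fields = line.split()
        if len(fields) < 4 or fields[3] != TIME_WAIT:
            continue
        remote_hex = fields[2].split(":")[0]
        counts[decode_ipv4(remote_hex)] += 1
    return counts


def port_range_size(range_text):
    """Size of the range in /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(value) for value in range_text.split())
    return high - low + 1


def report(tcp_path="/proc/net/tcp",
           range_path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Print the busiest TIME_WAIT destinations next to the port limit."""
    with open(tcp_path) as tcp_file:
        counts = time_wait_by_remote(tcp_file.read())
    with open(range_path) as range_file:
        limit = port_range_size(range_file.read())
    for addr, count in counts.most_common(5):
        print(f"{addr}: {count} TIME_WAIT of {limit} ephemeral ports")
```

Running report() on the Squid server every few seconds while the delays
build up would show whether the stalls start when one destination
approaches the limit, or well before that.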

> One possible workaround that I can think of is setting a short
> positive_dns_ttl, but this doesn’t fully guarantee an even
> distribution, now does it?

No, it does not. Moreover, Squid v3 had some TTL handling bugs that were
fixed (in v4 and later code) by the Happy Eyeballs project. Taking all
the known problems into account, it is difficult for me to predict
the effect of changing TTLs. That said, it does not hurt to try! Maybe
you will be lucky, and a simple configuration change will remove the
cause of increasing transaction delays.

HTH,

Alex.