[squid-users] Huge amount of time_wait connections after upgrade from v2 to v3

Fri Jul 7 14:06:46 UTC 2017

Thank you for the fast reply.

> On Jul 7, 2017, at 01:10, Amos Jeffries <squid3 at treenet.co.nz> wrote:
> 
>> On 07/07/17 13:55, Ivan Larionov wrote:
>> Hi. Sorry for replying to an old thread. I was on vacation and didn't have a chance to test the proposed solution.
>> Dieter, yes, I'm on the old CentOS 6 based OS (Amazon Linux) but with a new kernel 4.9.27.
>> Amos, thank you for the suggestions about configure flags and squid config options, I fixed all issues you pointed to.
>> Unfortunately, the following workarounds didn't help:
>> * client_idle_pconn_timeout 30 seconds
>> * half_closed_clients on
>> * client_persistent_connections off
>> * server_persistent_connections off
> 
> TIME_WAIT is a sign that Squid is following the normal TCP process for closing connections, and doing so before the remote endpoint closes.
> 
> Disabling persistent connections increases the number of connections going through that process. So you definitely want those settings ON to reduce the WAIT states.
> 

I understand that. I only meant that I tried these options and they had no effect: they neither increased nor decreased the number of TIME_WAIT connections. I removed them when I started testing older versions.

> If the remote end is the one doing the closure, then you will see less TIME_WAIT, but CLOSE_WAIT will increase instead. The trick is in finding the right balance of timeouts on both client and server idle pconn to get the minimum of total WAIT states. That is network dependent.
> 
> Generally, though, forward/explicit and intercept proxies want client_idle_pconn_timeout to be shorter than server_idle_pconn_timeout. Reverse proxies want the opposite.
> 
> 
> 
>> However, I assumed that this is a bug and that I could find an older version which worked fine. I started testing from 3.1.x all the way to 3.5.26 and this is what I found:
>> * All versions before 3.5.21 work fine. There are no issues with a huge number of TIME_WAIT connections under load.
>> * 3.5.20 is the last working version.
>> * 3.5.21 is the first broken version.
>> * 3.5.23, 3.5.25, 3.5.26 are broken as well.
>> This effectively means that the bug was introduced somewhere between 3.5.20 and 3.5.21.
>> I hope this helps and I hope you'll be able to find the issue. If you can create a bug report based on this information and post it here, it would be awesome.
> 
> The changes in 3.5.21 were fixes to some common crashes and better caching behaviour. So I expect at least some of the change is due to higher traffic throughput on proxies previously restricted by those problems.
> 

I can't imagine how a throughput increase could result in a 500-fold increase in the number of TIME_WAIT connections.

In our prod environment, when we updated from 2.7.x to 3.5.25, the number of TIME_WAIT connections increased from 100 to 10000. This is 100x.

When I was load testing different versions yesterday, I always sent the same RPS to each of them. Updating from 3.5.20 to 3.5.21 resulted in a jump from 20 to 10000 TIME_WAIT connections. This is 500x.
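
For reference, counts like these can be taken by tallying socket states from /proc/net/tcp. The thread does not say which tool produced the numbers above, so this is only a minimal sketch, assuming Linux and IPv4 sockets (IPv6 sockets live in /proc/net/tcp6). The function name count_tcp_states is illustrative and not part of Squid:

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Tally TCP states from the lines of /proc/net/tcp (header included)."""
    counts = Counter()
    for line in list(lines)[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # Field 3 is the state; unknown codes are kept as raw hex.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            counts = count_tcp_states(f)
    except FileNotFoundError:
        counts = Counter()  # not a Linux host
    for state, n in counts.most_common():
        print(f"{state:12} {n}")
```

Running it once per Squid version under the same load gives directly comparable TIME_WAIT and CLOSE_WAIT numbers.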

I know that TIME_WAIT connections are fine in general, until you have too many of them.
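
To put those counts in perspective, here is a rough back-of-the-envelope sketch. It assumes Linux's fixed 60-second TIME_WAIT lifetime and a steady load, so the number of sockets in TIME_WAIT is approximately the rate at which the local side initiates closes multiplied by 60. It ignores settings that shorten or cap TIME_WAIT, and the function name is illustrative:

```python
# Assumption: a socket stays in TIME_WAIT for 60 seconds (Linux default).
TIME_WAIT_SECONDS = 60


def active_closes_per_second(time_wait_count, lifetime=TIME_WAIT_SECONDS):
    """Estimate locally initiated closes per second from a steady-state count."""
    return time_wait_count / lifetime


# TIME_WAIT counts reported for the two versions under the same load.
for version, count in [("3.5.20", 20), ("3.5.21", 10000)]:
    rate = active_closes_per_second(count)
    print(f"{version}: ~{rate:.1f} locally initiated closes per second")
```

Under these assumptions, 20 TIME_WAIT sockets correspond to about one locally initiated close every three seconds, while 10000 correspond to roughly 167 per second. At a constant request rate, that would suggest a change in which side closes connections (or in how often they are reused) rather than a change in throughput.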

> Amos