[squid-users] file descriptors leak

Amos Jeffries squid3 at treenet.co.nz
Tue Nov 24 02:54:58 UTC 2015

On 24/11/2015 7:45 a.m., André Janna wrote:
> On 22/11/2015 16:25, Eliezer Croitoru wrote:
>> Hey Andre,
>> There are a couple of things in the picture.
>> It's not only Squid that is to blame.
>> It depends on what your OS TCP stack settings are.
>> To verify a couple of things you can use the netstat tool.
>> Run the command "netstat -nto" to see the timer status.
>> You can then see how long a new connection stays in the
>> established state.
>> It might be the Squid settings, but if the client is not there it could
>> be because of some TCP tunable kernel settings.
> Hi Eliezer and Amos,
> my kernel is a regular Debian Jessie kernel using the following TCP values.
>     tcp_keepalive_time: 7200
>     tcp_keepalive_intvl: 25
>     tcp_keepalive_probes: 9
>     tcp_retries1: 3
>     tcp_retries2: 15
>     tcp_fin_timeout: 60
> So in my understanding the longest timeout is set to 2 hours and a few
> minutes for keepalive connections.
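As a cross-check, the arithmetic behind that estimate can be run directly (a sketch; the sysctl names are the standard Linux ones and the numbers are the values quoted above):

```shell
# Linux drops a dead keepalive connection after:
#   tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
# The tunables live under /proc/sys/net/ipv4/ and can be read with e.g.
#   sysctl net.ipv4.tcp_keepalive_time
echo $((7200 + 25 * 9))   # 7425 seconds, i.e. "2 hours and a few minutes"
```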

Okay. It is not always the kernel on your Squid machine. I've seen one
mobile network where the Ethernet<->radio modem was treating the radio
link being alive as a signal that the TCP connections should be kept
alive. So just having the phones connected to the network would keep
everything active.

IIRC the only fix for that scenario is reducing Squid's client_lifetime.
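A squid.conf fragment along those lines might look like this (the 8-hour value is only an illustration; Squid's default client_lifetime is 1 day):

```
# Cap how long any client connection may stay open, regardless of activity.
# Default is 1 day; 8 hours here is just an example value.
client_lifetime 8 hours
```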

FYI: unless you have a specific need for 3.5 you should be fine with the
3.4 squid3 package that is available for Jessie from Debian backports.
The alternative is going the other way and upgrading right to the latest
3.5 snapshot (and/or 4.0 snapshot) to see if it is one of the CONNECT or
TLS issues we have fixed recently.

> Today I monitored file descriptors 23 and 24 on my box during 5 hours
> and lsof always showed:
>     squid      6574           proxy   23u     IPv6 5320944     
> 0t0        TCP> (CLOSE_WAIT)
>     squid      6574           proxy   24u     IPv6 5327276     
> 0t0        TCP> (ESTABLISHED)
> while netstat always showed:
>     tcp6       1      0    
> CLOSE_WAIT  6574/(squid-1)   off (0.00/0/0)
>     tcp6       0      0   
> ESTABLISHED 6574/(squid-1)   off (0.00/0/0)
> The "off" flag in netstat output tells that for these sockets keepalive
> and retransmission timers are disabled.

Oooh. That should mean a 30sec timeout and then RST. Not even a whole
minute of idle time.

> Right now netstat shows 15,568 connections on squid port 3126 and only
> 107 have timer set to a value other than "off".
> I read that connections in the CLOSE_WAIT state don't have any TCP
> timeout; it's Squid that must close the socket.
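The state/timer breakdown above can be reproduced with a small pipeline. A sketch, run here against invented sample lines that mimic the truncated ones quoted earlier (with plain `netstat -nto`, i.e. without `-p`, the state is field 6 and the timer is field 7; adding `-p` shifts the fields):

```shell
# Count sockets on port 3126 by TCP state and timer type.
sample='tcp6       1      0 10.0.2.1:3126  10.0.2.50:40001 CLOSE_WAIT  off (0.00/0/0)
tcp6       0      0 10.0.2.1:3126  10.0.2.51:40002 ESTABLISHED off (0.00/0/0)
tcp6       0      0 10.0.2.1:3126  10.0.2.52:40003 ESTABLISHED keepalive (7100.50/0/0)'

printf '%s\n' "$sample" | awk '{print $6, $7}' | sort | uniq -c
# On the live box:
#   netstat -nto | awk '$4 ~ /:3126$/ {print $6, $7}' | sort | uniq -c
```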

Squid closes the socket/FD as soon as it receives the FIN or RST that
began the CLOSE_WAIT state. Unless it was Squid's own close that began it.

>  About the connections in the ESTABLISHED state, I monitored the
> connection to the mobile device using "tcpdump -i eth2 -n host
>" for two and a half hours.
> Tcpdump didn't record any packet and netstat is still displaying:
>     tcp6       1      0    
> CLOSE_WAIT  6574/(squid-1)   off (0.00/0/0)
>     tcp6       0      0   
> ESTABLISHED 6574/(squid-1)   off (0.00/0/0)
> So unfortunately I still don't understand why neither Squid nor the
> kernel closes these sockets.

Neither do I. So it is time to move away from lsof and start using packet
capture to get a full-body packet trace, to find out exactly what packets
are flowing on at least one affected TCP connection.

If possible, identifying one of these connections from its SYN onwards
would be great, but if not then a 20min period of activity on an
existing one might still show some hints.
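Such a capture might be started along these lines (a sketch; 192.0.2.10 is a placeholder for the affected mobile device's address, and the interface matches the earlier tcpdump run):

```
# Full-body capture (-s 0) of all traffic between Squid and one affected
# client, written to a pcap file for later inspection (e.g. in Wireshark).
tcpdump -i eth2 -s 0 -n -w squid-conn.pcap host 192.0.2.10 and port 3126
```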

