[squid-users] squid cache takes a break
Amos Jeffries
squid3 at treenet.co.nz
Sat Sep 9 12:10:33 UTC 2017
On 09/09/17 05:39, Vieri wrote:
> Hi,
>
> Sorry for the title, but I really don't know how to describe what just happened today. It's really odd.
>
> I previously posted a few similar issues which were all fixed if I increased certain parameters (ulimits, children-{max,startup,idle}, TTL, etc.).
>
> This time however, after several days trouble-free I got another show-stopper. The local squid cache stopped serving for almost half an hour. After that, it all started working again magically. I had the chance to log into the server with ssh and try a few things:
>
> - In the cache log I could see these messages:
> Starting new bllookup helpers...
> helperOpenServers: Starting 10/80 'squid_url_lookup.pl' processes
> WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.
> WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.
> WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.
>
> It doesn't say much as to why it "cannot run" the external program.
>
Looking at the code, that message only seems to get logged when there is a
TCP/UDP connection involved and it is having packet errors (which can
happen for many reasons).
> This is how the program is defined in squid.conf:
> external_acl_type bllookup ttl=86400 negative_ttl=86400 children-max=80 children-startup=40 children-idle=10 %URI /opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl [...]
>
... so far that matches: external ACLs use a private localhost TCP
connection between each helper and Squid.
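For reference, the exchange over that connection is line-based. With the
definition above, each lookup is just the %URI value on one line and a
one-line verdict back, e.g.:

  Squid -> helper:  http://example.com/some/path
  helper -> Squid:  OK

(ERR or BH instead of OK, optionally followed by key=value pairs).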
> Other than that, the log is pretty quiet.
>
> The HTTP clients do not get served at all. They keep waiting for a reply.
>
Squid is waiting for a reply from a helper about whether the request is
allowed or not.
>
> top - 13:52:38 up 9 days, 6:19, 2 users, load average: 2.04, 1.82, 1.65
> Tasks: 405 total, 1 running, 404 sleeping, 0 stopped, 0 zombie
> %Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.0 hi, 2.6 si, 0.0 st
> %Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 97.0 id, 0.0 wa, 0.0 hi, 2.6 si, 0.0 st
> %Cpu2 : 0.3 us, 0.3 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
> %Cpu3 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
> %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 3.3 si, 0.0 st
> %Cpu5 : 0.0 us, 0.3 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
> %Cpu6 : 0.3 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
> %Cpu7 : 0.3 us, 0.0 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 4.0 si, 0.0 st
> KiB Mem : 32865056 total, 12324092 free, 16396808 used, 4144156 buff/cache
> KiB Swap: 37036988 total, 35197252 free, 1839736 used. 15977208 avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 29123 squid 20 0 293460 206776 8480 S 0.7 0.6 10:31.16 squid
Starting each helper requires the parent's 206776 KB (~200 MB) of
resident RAM to be allocated again in swap/virtual memory. Copying 10x
that data (+10 helpers per occurrence) may take a while.
We normally work around that by using concurrency. Helpers that support
high levels of concurrency can handle far more load than blocking,
one-request-at-a-time helpers.
Note that concurrency is just a way of pipelining requests to a helper;
it does not require multi-threading, though helpers using MT are
naturally even better with concurrency.
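For illustration, a minimal sketch of a concurrency-aware helper (the
lookup_uri() routine here is only a stand-in for whatever check your
script already performs per URI). With concurrency=N set on the
external_acl_type line, Squid prepends a numeric channel ID to each
request line, and the helper must echo that ID back with its verdict:

  #!/usr/bin/perl
  use strict;
  use warnings;

  $| = 1;   # unbuffered stdout so Squid sees each reply immediately

  # placeholder for whatever blacklist check the real script performs
  sub lookup_uri { my ($uri) = @_; return 1; }

  while (my $line = <STDIN>) {
      chomp $line;
      # with concurrency=N each request arrives as: <channel-ID> <URI>
      my ($channel, $uri) = split / /, $line, 2;
      my $verdict = lookup_uri($uri) ? 'OK' : 'ERR';
      print "$channel $verdict\n";
  }

Because each reply carries its channel ID, replies may be sent back in
any order, which is what lets one helper process service many requests
in flight.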
> # ps aux | grep squid | grep '\-n'
> squid 805 0.0 6.4 3769624 2127988 ? S 13:53 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> squid 1452 0.0 6.4 3769624 2127988 ? S 13:55 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> root 3043 0.0 0.0 84856 1728 ? Ss Aug31 0:00 /usr/sbin/squid -YC -f /etc/squid/squid.http.conf -n squidhttp
> squid 3046 0.0 0.0 128232 31052 ? S Aug31 0:35 (squid-1) -YC -f /etc/squid/squid.http.conf -n squidhttp
> root 3538 0.0 0.0 86912 1740 ? Ss Aug31 0:00 /usr/sbin/squid -YC -f /etc/squid/squid.https.conf -n squidhttps
> squid 3540 0.0 0.1 134940 36244 ? S Aug31 1:09 (squid-1) -YC -f /etc/squid/squid.https.conf -n squidhttps
> root 28492 0.0 0.0 86908 4220 ? Ss Sep05 0:00 /usr/sbin/squid -YC -f /etc/squid/squid.owa.conf -n squidowa
> squid 28495 0.0 0.0 103616 18364 ? S Sep05 0:12 (squid-1) -YC -f /etc/squid/squid.owa.conf -n squidowa
> root 29120 0.0 0.0 86908 4220 ? Ss Sep05 0:00 /usr/sbin/squid -YC -f /etc/squid/squid.owa2.conf -n squidowa2
> squid 29123 0.2 0.6 293460 206776 ? S Sep05 10:31 (squid-1) -YC -f /etc/squid/squid.owa2.conf -n squidowa2
> squid 30330 0.0 6.4 3769624 2127928 ? S 13:42 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> squid 30866 0.0 6.4 3769624 2127928 ? S 13:44 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> squid 31507 0.0 6.4 3769624 2127928 ? S 13:47 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> squid 32055 0.0 6.4 3769624 2127928 ? S 13:49 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
> squid 32659 0.0 6.4 3769624 2127928 ? S 13:51 0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
>
> I changed the debug_options to ALL,9 and tried to restart or "reconfigure" Squid.
> I had trouble with that and decided to:
> - stop all squid instances (the other reverse proxies, eg. squidhttp, squidhttps, squidowa, squidowa2, were working fine)
> - manually kill all squid processes related to the failing local cache (including ssl_crtd)
> - made sure there were no more squid processes and that debug_options was ALL,9
> - started the local squid cache only (only 1 daemon - no reverse proxies)
>
> Surprisingly, it still didn't work...
> Clients could not browse. The squid log only had this set of messages once in a while (very quiet log):
>
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(180) ipcCreate: ipcCreate: prfd FD 94
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(181) ipcCreate: ipcCreate: pwfd FD 94
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(182) ipcCreate: ipcCreate: crfd FD 93
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(183) ipcCreate: ipcCreate: cwfd FD 93
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(196) ipcCreate: ipcCreate: FD 94 sockaddr [::1]:40905
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(212) ipcCreate: ipcCreate: FD 93 sockaddr [::1]:50647
> 2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(222) ipcCreate: ipcCreate: FD 93 listening...
> 2017/09/08 14:18:10.322 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 93
> 2017/09/08 14:18:10.322 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 93
> 2017/09/08 14:18:10.322 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 93
> 2017/09/08 14:18:10.322 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7c10 [call123]
> 2017/09/08 14:18:10.322 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 93) [call123]
> 2017/09/08 14:18:10.322 kid1| 5,9| comm.cc(602) comm_connect_addr: connecting socket FD 94 to [::1]:50647 (want family: 10)
> 2017/09/08 14:18:10.322 kid1| 21,3| tools.cc(543) leave_suid: leave_suid: PID 8303 called
> 2017/09/08 14:18:10.323 kid1| 21,3| tools.cc(636) no_suid: no_suid: PID 8303 giving up root priveleges forever
> 2017/09/08 14:18:10.323 kid1| 54,3| ipc.cc(304) ipcCreate: ipcCreate: calling accept on FD 93
Whatever helper was being started on FD 93/94, it started successfully.
We know it is a helper because the FDs are used by ipc.cc (the helper
I/O channels), but that is all we can tell.
> [...]
> 2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(644) comm_connect_addr: sock=98, addrinfo( flags=4, family=10, socktype=1, protocol=6, &addr=0xeb79a0, addrlen=28 )
> 2017/09/08 14:24:49.682 kid1| 5,9| comm.cc(645) comm_connect_addr: connect FD 98: (-1) (110) Connection timed out
> 2017/09/08 14:24:49.682 kid1| 14,9| comm.cc(646) comm_connect_addr: connecting to: [::1]:60557
> 2017/09/08 14:24:49.682 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 98
> 2017/09/08 14:24:49.682 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 98
> 2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 98
> 2017/09/08 14:24:49.682 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7e90 [call128]
> 2017/09/08 14:24:49.682 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 98) [call128]
> 2017/09/08 14:24:49.682 kid1| WARNING: Cannot run '/usr/libexec/squid/ext_wbinfo_group_acl' process.
The comm TCP timeout is for FD 98. It may or may not be related to the
wbinfo helper; the trace shown does not include any ipc.cc or fd.cc
output indicating what that FD is used for. The trace below is more
complete:
> 2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(347) comm_openex: comm_openex: Attempt open socket for: [::1]
> 2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(388) comm_openex: comm_openex: Opened socket local=[::1] remote=[::] FD 99 flags=1 : family=10, type=1, protocol=0
> 2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(420) comm_init_opened: local=[::1] remote=[::] FD 99 flags=1 is a new socket
> 2017/09/08 14:24:49.682 kid1| 51,3| fd.cc(198) fd_open: fd_open() FD 99 ext_wbinfo_group_acl
> 2017/09/08 14:24:49.682 kid1| 50,6| comm.cc(209) commBind: commBind: bind socket FD 99 to [::1]
> 2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(347) comm_openex: comm_openex: Attempt open socket for: [::1]
> 2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(388) comm_openex: comm_openex: Opened socket local=[::1] remote=[::] FD 100 flags=1 : family=10, type=1, protocol=0
> 2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(420) comm_init_opened: local=[::1] remote=[::] FD 100 flags=1 is a new socket
> 2017/09/08 14:24:49.682 kid1| 51,3| fd.cc(198) fd_open: fd_open() FD 100 ext_wbinfo_group_acl
> 2017/09/08 14:24:49.682 kid1| 50,6| comm.cc(209) commBind: commBind: bind socket FD 100 to [::1]
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(180) ipcCreate: ipcCreate: prfd FD 100
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(181) ipcCreate: ipcCreate: pwfd FD 100
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(182) ipcCreate: ipcCreate: crfd FD 99
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(183) ipcCreate: ipcCreate: cwfd FD 99
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(196) ipcCreate: ipcCreate: FD 100 sockaddr [::1]:56367
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(212) ipcCreate: ipcCreate: FD 99 sockaddr [::1]:35643
> 2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(222) ipcCreate: ipcCreate: FD 99 listening...
> 2017/09/08 14:24:49.683 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 99
> 2017/09/08 14:24:49.683 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 99
> 2017/09/08 14:24:49.683 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 99
> 2017/09/08 14:24:49.683 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7f10 [call129]
> 2017/09/08 14:24:49.683 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 99) [call129]
> 2017/09/08 14:24:49.683 kid1| 5,9| comm.cc(602) comm_connect_addr: connecting socket FD 100 to [::1]:35643 (want family: 10)
> 2017/09/08 14:24:49.683 kid1| 21,3| tools.cc(543) leave_suid: leave_suid: PID 10064 called
> 2017/09/08 14:24:49.683 kid1| 21,3| tools.cc(636) no_suid: no_suid: PID 10064 giving up root priveleges forever
> 2017/09/08 14:24:49.683 kid1| 54,3| ipc.cc(304) ipcCreate: ipcCreate: calling accept on FD 99
This ext_wbinfo_group_acl helper on FD 99/100 was started successfully
AFAIK.
>
> I also tried restarting Winbind, c-icap server, clamd, but nothing changed. They all seemed to be working properly anyway.
>
> Finally, at some point (after half an hour) right after killing all squid processes yet again, I restarted the squid services one last time (reverted to debug_options ALL,1) before giving up and deciding to let users bypass the proxy. Well, it all started working again...
>
> So now I'd really like to know what I can do the next time it stops working like this.
A couple of workarounds:
a) start fewer helpers at a time.
b) reduce cache_mem.
c) add concurrency support to the helpers (see the example below).
All of the above are aimed at reducing the amount of background
administrative work Squid has to do for helpers, or the amount of memory
consumed, and thus at reducing the duration and effects of these pauses,
even if (a) causes them to be slightly more frequent.
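For example (illustrative numbers only; tune them to your load), doing
(a) and (c) together could look like:

  external_acl_type bllookup ttl=86400 negative_ttl=86400 children-max=20 children-startup=5 children-idle=2 concurrency=50 %URI /opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl [...]

with the helper itself updated to echo the channel ID, as sketched
earlier.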
> I'm considering setting debug_options to ALL,6 while I still can, and wait to see if it fails again. When it does, I might have more information.
>
The log lines above indicate that sections "5,9 50,6 51,3 54,9" are
logging the helper startup details. You can reduce debug_options to just
those sections for smaller logs, to see whether the TCP connect timeout
is affecting helper startup.
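For example, keeping everything else at the default level:

  debug_options ALL,1 5,9 50,6 51,3 54,9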
Amos