[squid-dev] Rock store stopped accessing discs
Heiler Bemerguy
heiler.bemerguy at cinbesa.com.br
Tue Mar 14 16:43:48 UTC 2017
On 07/03/2017 20:26, Alex Rousskov wrote:
> These stuck disker responses probably explain why your disks do not
> receive any traffic. It is potentially important that both disker
> responses shown in your logs got stuck at approximately the same
> absolute time ~13 days ago (around 2017-02-22, give or take a day;
> subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
> zone to know the "exact" time when those stuck requests were queued).
>
> How can a disker response get stuck? Most likely, something unusual
> happened ~13 days ago. This could be a Squid bug and/or a kid restart.
>
> * Do all currently running Squid kid processes have about the same start
> time? [1]
>
> * Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
> non-debugging messages/warnings?
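Just to sanity-check that time math (assuming the stuck log line was from 07/03, the day of your message, and treating everything as UTC): 1136930911 ms is roughly 1136931 seconds, i.e. about 13.16 days, so something like

date -u -d '2017-03-07 15:53:05 UTC - 1136931 seconds'
Wed Feb 22 12:04:14 UTC 2017

does land on 2017-02-22, as you said.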
I searched the log files from those days and found nothing unusual; grep returns nothing for ipcIo6.381049w7 or ipcIo6.153009r8.
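(The search was along these lines, assuming the default cache.log location; adjust the path if your logs live elsewhere:

grep -E 'ipcIo6\.(381049w7|153009r8)' /var/log/squid/cache.log*

and it matched nothing in any of them.)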
On that day I didn't verify whether the kids still had the same uptime; I reformatted the /cache2, /cache3 and /cache4 partitions and started fresh with squid -z. But looking at ps right now, I think I can answer that question:
root@proxy:~# ps auxw |grep squid-
proxy    10225  0.0  0.0 13964224   21708 ?  S  Mar10   0:10 (squid-coord-10) -s
proxy    10226  0.1 12.5 14737524 8268056 ?  S  Mar10   7:14 (squid-disk-9) -s
proxy    10227  0.0 11.6 14737524 7686564 ?  S  Mar10   3:08 (squid-disk-8) -s
proxy    10228  0.1 14.9 14737540 9863652 ?  S  Mar10   7:30 (squid-disk-7) -s
proxy    18348  3.5 10.3 17157560 6859904 ?  S  Mar13  48:44 (squid-6) -s
proxy    18604  2.8  9.0 16903948 5977728 ?  S  Mar13  37:28 (squid-4) -s
proxy    18637  1.7 10.8 16836872 7163392 ?  R  Mar13  23:03 (squid-1) -s
proxy    20831 15.3 10.3 17226652 6838372 ?  S  08:50  39:51 (squid-2) -s
proxy    21189  5.3  2.8 16538064 1871788 ?  S  12:29   2:12 (squid-5) -s
proxy    21214  3.8  1.5 16448972 1012720 ?  S  12:43   1:03 (squid-3) -s
The diskers aren't dying, but the workers are, a lot, with that "assertion failed: client_side_reply.cc:1167: http->storeEntry()->objectLen() >= headers_sz" error.
Looking at df and iostat, it seems /cache3 isn't being accessed anymore right now. (I think that is disk-8 above; look at its CPU time usage.)
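To keep an eye on that, I'm basically watching something like this (device names omitted since they depend on where /cache2-/cache4 are mounted):

iostat -dx 5
df -h /cache2 /cache3 /cache4

and the device backing /cache3 shows essentially no I/O while the others stay busy.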
Another weird thing: lots of timeouts and overflows are happening during non-active hours. From 0h to 7h we have maybe 1-2% of the clients we usually have from 8h to 17h (business hours).
2017/03/14 00:26:50 kid3| WARNING: abandoning 23 /cache4/rock I/Os after at least 7.00s timeout
2017/03/14 00:26:53 kid1| WARNING: abandoning 1 /cache4/rock I/Os after at least 7.00s timeout
2017/03/14 02:14:48 kid5| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo5.68259w9
2017/03/14 06:33:43 kid3| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo3.55919w9
2017/03/14 06:57:53 kid3| ERROR: worker I/O push queue for /cache4/rock overflow: ipcIo3.58130w9
This cache4 partition is where huge files would be stored:
maximum_object_size 4 GB
cache_dir rock /cache2 110000 min-size=0 max-size=65536 max-swap-rate=150 swap-timeout=360
cache_dir rock /cache3 110000 min-size=65537 max-size=262144 max-swap-rate=150 swap-timeout=380
cache_dir rock /cache4 110000 min-size=262145 max-swap-rate=150 swap-timeout=500
I still don't know how /cache3 stopped while /cache4 is still active, even with all those warnings and errors... :/
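Maybe it's also worth comparing the three dirs through the cache manager, e.g. (assuming squidclient is installed and the proxy answers on localhost:3128):

squidclient -h 127.0.0.1 -p 3128 mgr:storedir

to see whether /cache3's per-directory numbers (current size, entries) ever change, or whether only /cache2 and /cache4 keep moving.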
--
Atenciosamente / Best Regards,
Heiler Bemerguy
Network Manager - CINBESA
55 91 98151-4894/3184-1751