[squid-dev] Rock store stopped accessing discs

Tue Mar 14 19:41:33 UTC 2017

On 03/14/2017 10:43 AM, Heiler Bemerguy wrote:
> On 07/03/2017 20:26, Alex Rousskov wrote:
>> How can a disker response get stuck? Most likely, something unusual
>> happened ~13 days ago. This could be a Squid bug and/or a kid restart.

> root@proxy:~# ps auxw |grep squid-
> proxy    10225  0.0  0.0 13964224 21708 ?      S    Mar10   0:10 (squid-coord-10) -s
> proxy    10226  0.1 12.5 14737524 8268056 ?    S    Mar10   7:14 (squid-disk-9) -s
> proxy    10227  0.0 11.6 14737524 7686564 ?    S    Mar10   3:08 (squid-disk-8) -s
> proxy    10228  0.1 14.9 14737540 9863652 ?    S    Mar10   7:30 (squid-disk-7) -s
> proxy    18348  3.5 10.3 17157560 6859904 ?    S    Mar13  48:44 (squid-6) -s
> proxy    18604  2.8  9.0 16903948 5977728 ?    S    Mar13  37:28 (squid-4) -s
> proxy    18637  1.7 10.8 16836872 7163392 ?    R    Mar13  23:03 (squid-1) -s
> proxy    20831 15.3 10.3 17226652 6838372 ?    S    08:50  39:51 (squid-2) -s
> proxy    21189  5.3  2.8 16538064 1871788 ?    S    12:29   2:12 (squid-5) -s
> proxy    21214  3.8  1.5 16448972 1012720 ?    S    12:43   1:03 (squid-3) -s

> Diskers aren't dying, but workers are, a lot.

I suspect that worker deaths may cause SMP queues to get stuck, but I
have not validated that theory. We probably need to add more code to SMP
queues so that they can recover from untimely kid deaths.

> Another weird thing: lots of timeouts and overflows are happening
> during off-peak hours. From 0h to 7h we have about 1-2% of the clients
> we usually have from 8h to 17h (business hours).

If a queue is stuck, you will see these errors and warnings as long as
there is some need for disk I/O. The traffic volume is not important.
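
To illustrate why volume does not matter, here is a toy model of a
bounded queue whose reader has stopped responding. It is only a sketch
and not Squid code: the ToyQueue class, its capacity, and the
"timeout"/"overflow" labels are assumptions made up for this example.

```python
from collections import deque


class ToyQueue:
    """Hypothetical bounded request queue; not the real Squid SMP queue."""

    def __init__(self, capacity, stuck=False):
        self.capacity = capacity
        self.stuck = stuck      # True: the reader never pops anything
        self.items = deque()

    def submit(self, request):
        """Return 'ok', 'timeout', or 'overflow' for one I/O request."""
        if len(self.items) >= self.capacity:
            return "overflow"   # no room left to queue the request
        self.items.append(request)
        if self.stuck:
            return "timeout"    # queued, but nobody will ever answer
        self.items.popleft()    # a healthy reader drains the request
        return "ok"


def failures(queue, requests):
    """Count requests that did not complete."""
    return sum(queue.submit(i) != "ok" for i in range(requests))


if __name__ == "__main__":
    print(failures(ToyQueue(4, stuck=True), 2))     # quiet hours
    print(failures(ToyQueue(4, stuck=True), 200))   # busy hours
    print(failures(ToyQueue(4, stuck=False), 200))  # healthy queue
```

In this model every request sent to a stuck queue fails, whether there
are 2 of them or 200. Only the mix of timeouts and overflows changes.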

> Still don't know how /cache3 stopped and /cache4 is still active, even
> with all those warnings and errors.. :/

Do you expect Squid to function well in the presence of assertions and
to explain what went wrong while asserting? Unfortunately, we are very
far from that kind of robustness and self-diagnosis nirvana!

I have not studied your error messages in detail, but it is possible
that there are not-yet-stuck queues that feed cache4 while all cache3
queues are stuck. There is one SMP queue for each worker:disker pair.
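
The per-pair layout can be sketched as a small table of queues. This is
a toy model under stated assumptions, not Squid code: the kid numbers
come from the ps listing above (workers 1-6, diskers 7-9), but which
queues are marked stuck is invented for the example, and the listing
does not say which disker serves which cache_dir.

```python
def build_queues(workers, diskers):
    """One queue per worker:disker pair; the value is True when stuck."""
    return {(w, d): False for w in workers for d in diskers}


def disker_is_reachable(queues, disker):
    """A disker still gets work if any queue feeding it is not stuck."""
    return any(not stuck for (w, d), stuck in queues.items() if d == disker)


workers = range(1, 7)   # squid-1 .. squid-6
diskers = (7, 8, 9)     # squid-disk-7 .. squid-disk-9
queues = build_queues(workers, diskers)

# Hypothetical damage: every queue feeding disker 7 is stuck,
# but only two of the six queues feeding disker 8 are.
for w in workers:
    queues[(w, 7)] = True
queues[(1, 8)] = True
queues[(2, 8)] = True

if __name__ == "__main__":
    print(len(queues))                     # number of worker:disker pairs
    print(disker_is_reachable(queues, 7))  # all queues stuck
    print(disker_is_reachable(queues, 8))  # some queues still healthy
```

With 6 workers and 3 diskers there are 18 such queues, so one disker can
go silent while another keeps serving the workers whose queues survived.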

Alex.