[squid-dev] Rock store stopped accessing discs

Wed Mar 15 17:57:37 UTC 2017

Hello list,

I've made a simple test here.

cache_dir rock /cache2 110000 min-size=0 max-size=65536
#max-swap-rate=150 swap-timeout=360
cache_dir rock /cache3 110000 min-size=65537 max-size=262144
#max-swap-rate=150 swap-timeout=380
cache_dir rock /cache4 110000 min-size=262145
#max-swap-rate=150 swap-timeout=500

Commented out all the max-swap-rate and swap-timeout. Guess what? No 
more errors on cache.log..

before:
root at proxy:/var/log/squid# cat cache.log.2 |grep overflow |wc -l
53

now:
root at proxy:/var/log/squid# cat cache.log |grep overflow |wc -l
1

before:
root at proxy:/var/log/squid# cat cache.log.2 |grep "7.00s timeout" |wc -l
86

now:
root at proxy:/var/log/squid# cat cache.log |grep "7.00s timeout" |wc -l
0

It seems the code that "shapes" the rockstore I/Os is kinda buggy. 
Version 4.0.18

-- 
Atenciosamente / Best Regards,

Heiler Bemerguy
Network Manager - CINBESA
55 91 98151-4894/3184-1751

Em 07/03/2017 20:26, Alex Rousskov escreveu:
> On 03/07/2017 01:08 PM, Heiler Bemerguy wrote:
>
>> Some log from right now...
> Here is my analysis:
>
>> 15:53:05.255| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
> Squid worker (kid6) is looking at the queue of disker (kid7) responses.
> There is just one response in the queue.
>
>
>> IpcIoFile.cc canWait: cannot wait: 1136930911 oldest: ipcIo6.381049w7
> Squid worker is trying to estimate whether it has enough time to queue
> (and eventually perform) more disk I/O. The expected wait is
> 1'136'930'911 milliseconds or ~13 days. That is longer than the
> configured cache_dir swap-timeout (a few hundred milliseconds) so Squid
> refuses to queue this disk I/O request:
>
>> store_io.cc storeCreate: no swapdirs for e:=sw1p2RDV/0x206e17d0*4
>
> The same story happens in your other log snippets AFAICT:
>
>> 15:53:05.302| ipc/Queue.h findOldest: peeking from 8 to 6 at 1
>> IpcIoFile.cc canWait: cannot wait: 1136930958 oldest: ipcIo6.153009r8
>> store_io.cc storeCreate: no swapdirs for e:=msw1DV/0x3a4d5870*2
> and
>
>> 15:53:05.318| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
>> IpcIoFile.cc canWait: cannot wait: 1136930974 oldest: ipcIo6.381049w7
>> store_io.cc storeCreate: no swapdirs for e:m381432=w1p2DV/0x3ace6740*4
> and
>
>> 15:53:05.793| ipc/Queue.h findOldest: peeking from 7 to 6 at 1
>> IpcIoFile.cc canWait: cannot wait: 1136931448 oldest: ipcIo6.381049w7
>> store_io.cc storeCreate: no swapdirs for e:=sw1V/0x206e17d0*1
>
> Focusing on one disker (kid7), we can see that the oldest response does
> not change: It is always ipcIo6.381049w7. This stuck response results in
> gradual increment of the expected wait time with every check, matching
> wall clock time increment:
>
>> 15:53:05.255| cannot wait: 1136930911
>> 15:53:05.318| cannot wait: 1136930974
>> 15:53:05.793| cannot wait: 1136931448
> These stuck disker responses probably explain why your disks do not
> receive any traffic. It is potentially important that both disker
> responses shown in your logs got stuck at approximately the same
> absolute time ~13 days ago (around 2017-02-22, give or take a day;
> subtract 1136930911 milliseconds from 15:53:05.255 in your Squid time
> zone to know the "exact" time when those stuck requests were queued).
>
> How can a disker response get stuck? Most likely, something unusual
> happened ~13 days ago. This could be a Squid bug and/or a kid restart.
>
> * Do all currently running Squid kid processes have about the same start
> time? [1]
>
> * Do you see ipcIo6.381049w7 or ipcIo6.153009r8 mentioned in any old
> non-debugging messages/warnings?
>
> [1]
> http://stackoverflow.com/questions/5731234/how-to-get-the-start-time-of-a-long-running-linux-process
>
>
> Thank you,
>
> Alex.
>