[squid-users] very poor performance of rock cache ipc
Julian Taylor
jtaylor.debian at googlemail.com
Mon Oct 16 20:24:48 UTC 2023
On 15.10.23 05:42, Alex Rousskov wrote:
> On 2023-10-14 12:04, Julian Taylor wrote:
>> On 14.10.23 17:40, Alex Rousskov wrote:
>>> On 2023-10-13 16:01, Julian Taylor wrote:
>>>
>>
>> The reproducer uses a single request; the very same thing can be
>> observed on a very busy squid
>
> If a busy Squid sends lots of IPC messages between worker and disker,
> then either there is a Squid bug we do not know about OR that disker is
> just not as busy as one might expect it to be.
>
> In Squid v6+, you can observe disker queues using mgr:store_queues cache
> manager report. In your environment, do those queues always have lots of
> requests when Squid is busy? Feel free to share (a pointer to) a
> representative sample of those reports from your busy Squid.
>
> N.B. Besides worker-disker IPC messages, there are also worker-worker
> cache synchronization IPC messages. They also have the same "do not send
> IPC messages if the queue has some pending items already" optimization.
>
I checked the queues running with the configuration from my initial mail
with the number of workers increased, and the queues are generally low,
around 1-10 items in the queue when sending around 100 parallel requests
reading data files of about 100 MB. Here is a sample: https://dpaste.com/8SLNRW5F8
Also, even at this higher request rate than the single curl, throughput
was more than doubled by increasing the block size.
What are the queues supposed to look like on a busy squid that is not
spending a large portion of its time doing notify IPC?
Increasing the number of parallel requests does decrease the relative
overhead, but it is still pretty large: I measured about 10%-30% CPU
overhead with 100 parallel requests served from cache, in both the worker
and the disker.
Here is a snippet of a profile:
--22.34%--JobDialer<AsyncJob>::dial(AsyncCall&)
          |
          |--21.19%--Ipc::UdsSender::start()
          |          |
          |           --21.13%--Ipc::UdsSender::write()
          |                     |
          |                     |--16.12%--Ipc::UdsOp::conn()
          |                     |          |
          |                     |           --15.84%--comm_open_uds(int, int, sockaddr_un*, int)
          |                     |                     |--1.70%--commSetCloseOnExec(int)
          |                     |                      --1.56%--commSetNonBlocking(int)
...
--12.98%--comm_close_complete(int)
Clearing and constructing the large Ipc::TypedMsgHdr is also very
noticeable.
That the overhead is so high and the maximum throughput so low for
not-so-busy squids (say 1-10 requests per second, but with requests
averaging > 1 MiB) is imo also a reason for concern and could be improved.
If I understand the way it works correctly, when the worker gets a request
it splits it into 4k blocks and enqueues read requests into the IPC queue,
and if the queue is empty it emits a notify IPC so the disker starts
popping from the queue.
On large requests that are answered immediately by the disker, the problem
seems to be that the queue is mostly empty, so an IPC ping-pong happens for
each 4k block.
So my thought was: when the request is larger than 4k, enqueue multiple
pending reads in the worker and only notify after a certain number have
been added to the queue, and vice versa in the disker.
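A very rough sketch of what I mean, again with made-up names rather than a
patch:

#include <algorithm>
#include <cstddef>
#include <deque>

struct ReadRequest { std::size_t offset; std::size_t length; };

struct DiskerQueue {
    std::deque<ReadRequest> items;
    bool empty() const { return items.empty(); }
    void push(const ReadRequest &r) { items.push_back(r); }
};

void notifyDisker() { /* the UDS notification message, as before */ }

// Hypothetical batched variant: queue several blocks of a large read and
// send a single notification for the whole batch instead of one per block.
void queueBatch(DiskerQueue &q, std::size_t offset, std::size_t remaining)
{
    const std::size_t blockSize = 4096;
    const std::size_t batchLimit = 16; // e.g. up to 64 KiB per notification
    const bool wasEmpty = q.empty();
    std::size_t queued = 0;
    while (remaining > 0 && queued < batchLimit) {
        const std::size_t len = std::min(blockSize, remaining);
        q.push(ReadRequest{offset, len});
        offset += len;
        remaining -= len;
        ++queued;
    }
    if (wasEmpty && queued > 0)
        notifyDisker(); // one notification per batch
}

int main() {
    DiskerQueue q;
    queueBatch(q, 0, 100 * 1024); // queues the first 16 blocks, one notify
    return 0;
}

With a batchLimit of 16, a 100 KiB read would pay for a single notification
for its first 16 blocks instead of one per block.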
So I messed around a bit trying to reduce the notifications by delaying
the Notify call in src/DiskIO/IpcIo/IpcIoFile.cc for larger requests, but
it ended up blocking after the first queue push with no notify. If I
understand the queue correctly, this is because the reader requires a
notify to initially start, so simply pushing multiple read requests onto
the queue without notifying will not work as trivially as I hoped.
Is this approach feasible or am I misunderstanding how it works?
I also tried to add reuse of the IPC connection between calls, so that the
major source of overhead, tearing down and re-establishing the connection,
is removed, but that also turned out to be difficult because the
connections are closed in various places and because of the general
complexity of the code.
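For reference, the pattern I was aiming for is just the usual "open once,
send many" idea; in plain POSIX terms (only an illustration of the concept,
not Squid code, and it ignores how Squid binds the socket for replies):

#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Keeps one UDS datagram socket open for the lifetime of the sender
// instead of opening and closing a socket for every notification.
class UdsNotifier {
public:
    explicit UdsNotifier(const std::string &path) {
        fd = socket(AF_UNIX, SOCK_DGRAM, 0); // opened once, not per message
        std::memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);
    }
    UdsNotifier(const UdsNotifier &) = delete;
    UdsNotifier &operator=(const UdsNotifier &) = delete;
    ~UdsNotifier() { if (fd >= 0) close(fd); }

    // every notification reuses the already-open descriptor
    bool notify(const void *msg, size_t len) const {
        return sendto(fd, msg, len, 0,
                      reinterpret_cast<const sockaddr *>(&addr),
                      sizeof(addr)) == static_cast<ssize_t>(len);
    }

private:
    int fd = -1;
    sockaddr_un addr;
};

Keeping the descriptor open like this inside Ipc::UdsSender is what I could
not get to work cleanly, given how many places close the connection.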