[squid-users] very poor performance of rock cache ipc
Alex Rousskov
rousskov at measurement-factory.com
Sun Oct 15 03:42:39 UTC 2023
On 2023-10-14 12:04, Julian Taylor wrote:
> On 14.10.23 17:40, Alex Rousskov wrote:
>> On 2023-10-13 16:01, Julian Taylor wrote:
>>
>>> When using squid for caching using the rock cache_dir setting the
>>> performance is pretty poor with multiple workers.
>>> The reason for this is due to the very high number of systemcalls
>>> involved in the IPC between the disker and workers.
>>
>> Please allow me to rephrase your conclusion to better match (expected)
>> reality and avoid misunderstanding:
>>
>> By design, a mostly idle SMP Squid should use a lot more system calls
>> per disk cache hit than a busy SMP Squid would:
>>
>> * Mostly idle Squid: Every disk I/O may require a few IPC messages.
>> * Busy Squid: Bugs notwithstanding, disk I/Os require no IPC messages.
>>
>>
>> In your single-request test, you are observing the expected effects
>> described in the first bullet. That does not imply those effects are
>> "good" or "desirable" in your use case, of course. It only means that
>> SMP Squid was not optimized for that use case; SMP rock design was
>> explicitly targeting the opposite use case (i.e. a busy Squid).
>
> The reproducer uses a single request; the very same thing can be
> observed on a very busy squid
If a busy Squid sends lots of IPC messages between worker and disker,
then either there is a Squid bug we do not know about OR that disker is
just not as busy as one might expect it to be.
In Squid v6+, you can observe disker queues using the mgr:store_queues cache
manager report. In your environment, do those queues always have lots of
requests when Squid is busy? Feel free to share (a pointer to) a
representative sample of those reports from your busy Squid.
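For example, assuming squidclient is installed and the proxy is listening
on its default port, something like the following should fetch that report:

    squidclient mgr:store_queues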
N.B. Besides worker-disker IPC messages, there are also worker-worker
cache synchronization IPC messages. They also have the same "do not send
IPC messages if the queue has some pending items already" optimization.
> and the workaround improves both the single
> request case and the actual heavily loaded production squid in the same way.
FWIW, I do not think that observation contradicts anything I have said.
> The hardware involved has a 10G card, not SSDs but lots of RAM, so it has
> a very high page cache hit rate, and the squid is very busy, so much so
> that it is overloaded by system CPU usage in the default configuration
> with the rock cache. The network or disk bandwidth is barely ever
> utilized more than 10%, with all 8 CPUs busy on system load.
The above facts suggest that the disk is just not used much OR there is
a bug somewhere. Slower (for any reason, including CPU overload) IPC
messages should lead to longer queues and the disappearance of "your
queue is no longer empty!" IPC messages.
> The only way to get the squid to utilize the machine is to increase the
> IO size via the request buffer change or not use the rock cache. UFS
> cache works ok in comparison, but requires multiple independent squid
> instances as it does not support SMP.
>
> Increasing the IO size to 32KiB as I mentioned does allow the squid
> workers to utilize a good 60% of the hardware network and disk
> capabilities.
Please note that I am not disputing this observation. Unfortunately, it
does not help me guess where the actual/core problem or bottleneck is.
Hopefully, the mgr:store_queues cache manager report will shed some light.
>> Roughly speaking, here, "busy" means "there are always some messages
>> in the disk I/O queue [maintained by Squid in shared memory]".
>>
>> You may wonder how it is possible that an increase in I/O work results
>> in a decrease (and, hopefully, elimination) of related IPC messages.
>> Roughly speaking, a worker must send an IPC "you have a new I/O
>> request" message only when its worker->disker queue is empty. If the
>> queue is not empty, then there is no reason to send an IPC message to
>> wake up disker because disker will see the new message when dequeuing
>> the previous one. Same for the opposite direction: disker->worker...
> This is probably true if you have slow disks and are actually I/O bound,
> but with fast disks or a high page cache hit rate you essentially see this
> IPC ping pong and very little actual work being done.
AFAICT, "too slow" IPC messages should result in non-empty queues and,
hence, no IPC messages at all. For this logic to work, it does not
matter whether the system is I/O bound or not, whether disks are "slow"
or not.
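To illustrate the intended pattern, here is a simplified sketch (invented
names, not Squid's actual Ipc queue code, and it ignores the sleep/wakeup
race that a real implementation must close): the producer asks for an IPC
wakeup only when its push turns an empty queue into a non-empty one.

    // Simplified sketch of the "notify only on empty->non-empty" pattern.
    // Names are invented; capacity checks and the sleep/notify race are
    // deliberately ignored here.
    #include <atomic>
    #include <cstddef>

    struct IoRequest { /* offset, length, buffer handle, ... */ };

    class SharedQueue { // single producer, single consumer, in shared memory
    public:
        // Returns true iff the consumer may be idle and needs an IPC wakeup,
        // i.e. the queue was empty before this push.
        bool push(const IoRequest &req) {
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            const bool wasEmpty = (tail == head_.load(std::memory_order_acquire));
            slots_[tail % Capacity] = req;
            tail_.store(tail + 1, std::memory_order_release);
            return wasEmpty; // otherwise the consumer sees req while draining
        }

        bool pop(IoRequest &req) {
            const std::size_t head = head_.load(std::memory_order_relaxed);
            if (head == tail_.load(std::memory_order_acquire))
                return false; // drained; go back to waiting for a wakeup
            req = slots_[head % Capacity];
            head_.store(head + 1, std::memory_order_release);
            return true;
        }

    private:
        static constexpr std::size_t Capacity = 1024;
        IoRequest slots_[Capacity];
        std::atomic<std::size_t> head_{0};
        std::atomic<std::size_t> tail_{0};
    };

Under sustained load the queue rarely drains, push() keeps returning false,
and no IPC messages are sent at all; the per-request IPC cost shows up only
when the queue keeps emptying, i.e. when Squid is mostly idle.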
>> > Is it necessary to have these read chunks so small
>>
>> It is not. Disk I/O size should be at least the system I/O page size,
>> but it can be larger. The optimal I/O size is probably very dependent
>> on traffic patterns. IIRC, Squid I/O size is at most one Squid page
>> (SM_PAGE_SIZE or 4KB).
>>
>> FWIW, I suspect there are significant inefficiencies in disk I/O
>> related request alignment: The code does not attempt to read from and
>> write to disk page boundaries, probably resulting in multiple
>> low-level disk I/Os per one Squid 4KB I/O in some (many?) cases. With
>> modern non-rotational storage these effects are probably less
>> pronounced, but they probably still exist.
> The kernel drivers will mostly handle this for you if multiple requests
> are available, but this is also almost irrelevant with current hardware;
> typically it will be so fast that software overhead will make it hard to
> utilize modern large disk arrays properly
I doubt that doing twice as many low-level disk I/Os (due to wrong
alignment) is irrelevant, but we do not need to agree on that to make
progress: Clearly, excessive low-level disk I/O is not the bottleneck
in your current environment.
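As a side note, here is a tiny sketch (names and constants are mine, not
Squid's) of why alignment can double the low-level I/Os: a 4 KiB read that
starts on a device page boundary touches one page, while the same read at
an unaligned offset straddles two.

    // Sketch only: count the device pages a single read touches, assuming
    // 4 KiB device pages. Names and constants are illustrative.
    #include <cstdint>
    #include <cstdio>

    static constexpr std::uint64_t DiskPageSize = 4096;

    std::uint64_t pagesTouched(std::uint64_t offset, std::uint64_t length) {
        const std::uint64_t firstPage = offset / DiskPageSize;
        const std::uint64_t lastPage = (offset + length - 1) / DiskPageSize;
        return lastPage - firstPage + 1;
    }

    int main() {
        // A page-aligned 4 KiB read stays within a single device page...
        std::printf("%llu\n", (unsigned long long)pagesTouched(8192, 4096)); // 1
        // ...while the same read at an unaligned offset spans two pages.
        std::printf("%llu\n", (unsigned long long)pagesTouched(8272, 4096)); // 2
        return 0;
    }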
> you probably need to look at
> other approaches like io_uring to get rid of the classical read/write
> system call overhead dominating your performance.
Yes, but those things are complementary (i.e. not mutually exclusive).
Cheers,
Alex.