[squid-users] very poor performance of rock cache ipc

Julian Taylor jtaylor.debian at googlemail.com
Fri Oct 13 20:01:02 UTC 2023


Hello,
When using squid for caching with the rock cache_dir setting, the 
performance is pretty poor with multiple workers.
The reason is the very high number of system calls involved in the IPC 
between the disker and the workers.

You can reproduce this very easily with a simple setup using the 
following configuration, in current git HEAD as well as older versions:

maximum_object_size 8 GB
cache_dir rock /cachedir/cache 1024
cache_peer some.host parent 80 3130 default no-query no-digest
http_port 3128

Now download a larger file from some.host through the cache so it gets 
cached, then repeat the download:

curl --proxy localhost:3128  http://some.host/file >  /dev/null
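
If you want curl to print the transfer rate directly, this variant 
works too (the -w format variable is plain curl, nothing 
squid-specific):

curl --proxy localhost:3128 -o /dev/null -w '%{speed_download}\n' http://some.host/file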

The download of the cached file from the local machine runs at a very 
low rate: about 35 MB/s on my not-ancient machine, even though 
everything is cached in memory.

If you check what is happening in the disker, you see that for each 
chunk it reads a 4112 byte IPC message from the worker, performs a read 
of 4 KiB, opens a new socket to notify the worker, does 4 fcntl calls 
on that socket, sends a 4112 byte (two x86 pages) IPC message, and then 
closes the socket. This repeats for every 4 KiB read, and the same 
thing happens on the receiving worker side.

Here is an strace of one chunk of the request in the disker:

21:49:28 epoll_wait(7, [{events=EPOLLIN, data={u32=26, u64=26}}], 65536, 
827) = 1 <0.000013>
21:49:28 recvmsg(26, {msg_name=0x557d7c4f06b8, msg_namelen=110 => 0, 
msg_iov=[{iov_base="\7\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
iov_len=4112}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 
MSG_DONTWAIT) = 4112 <0.000027>
21:49:28 pread64(19, 
"\266E\337\37\374\201b\215\240\310`\216\366\242\350\210\215\22\377zu\302\244Tb\317\255K\10\"p\327"..., 
4096, 10747944) = 4096 <0.000015>
21:49:28 socket(AF_UNIX, SOCK_DGRAM, 0) = 11 <0.000021>
21:49:28 fcntl(11, F_GETFD)             = 0 <0.000011>
21:49:28 fcntl(11, F_SETFD, FD_CLOEXEC) = 0 <0.000011>
21:49:28 fcntl(11, F_GETFL)             = 0x2 (flags O_RDWR) <0.000011>
21:49:28 fcntl(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000012>
21:49:28 epoll_ctl(7, EPOLL_CTL_ADD, 11, 
{events=EPOLLOUT|EPOLLERR|EPOLLHUP, data={u32=11, u64=11}}) = 0 <0.000023>
21:49:28 epoll_wait(7, [{events=EPOLLOUT, data={u32=11, u64=11}}], 
65536, 826) = 1 <0.000015>
21:49:28 sendmsg(11, {msg_name={sa_family=AF_UNIX, 
sun_path="/tmp/local/var/run/squid/squid-kid-2.ipc"}, msg_namelen=42, 
msg_iov=[{iov_base="\7\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
iov_len=4112}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 
MSG_NOSIGNAL) = 4112 <0.000022>
21:49:28 epoll_ctl(7, EPOLL_CTL_DEL, 11, 0x7ffef63da174) = 0 <0.000014>
21:49:28 close(11)                      = 0 <0.000018>
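
To put the trace in numbers (rough arithmetic from my measurements): 
the disker alone makes about 13 system calls per 4 KiB chunk. At 
35 MB/s that is roughly 9000 chunks per second, so over 100,000 system 
calls per second, before counting the matching calls on the worker 
side.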


Poking around a bit in the code, I found that increasing 
HTTP_REQBUF_SZ in src/http/forward.h to 32 KiB also affects the read 
size on the disker, making it 8 times more efficient, which is ok (but 
not great).
(This no longer works the same way with 
https://github.com/squid-cache/squid/pull/1335, recently added to the 
6.x backports, but the 4 KiB issue remains in current master.)
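
For reference, the change I tested is just this one constant (4096 is 
the stock value matching the 4 KiB reads above; 32768 is my experiment, 
not a recommended setting):

/* src/http/forward.h, stock definition: */
#define HTTP_REQBUF_SZ 4096
/* changed for my test to: */
#define HTTP_REQBUF_SZ 32768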

This problem is very noticeable on large objects, but the extreme 
overhead per disk cache request should affect most disk cached objects.

Is it necessary to have these read chunks so small, and for the 
processes to open and close a socket for every single request instead 
of reusing an open socket?
At least the 4 fcntl calls could be removed or reduced to 1, though 
that only gains 10-30% compared to the 800% from increasing the read 
size.
Reducing the 4112 byte IPC message, which carries only 4 bytes of 
data, to a lower value also yields a measurable improvement (though 
this is dangerous, as squid crashes if the size is too low and it 
receives cache manager requests, which seem to be around 600 bytes in 
length).
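
As a sketch of what I mean (plain Linux socket API, not Squid code, and 
the function name is mine; whether Squid's IPC layer can be 
restructured this way I don't know):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Create the notification socket once: SOCK_CLOEXEC|SOCK_NONBLOCK at
 * creation replaces the 4 fcntl() calls, and connect() lets every
 * later notification be a plain send() on a long-lived descriptor
 * instead of socket()/sendmsg()/close() per message. The path is the
 * one from the strace above. */
static int open_notify_socket(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_un peer;
    memset(&peer, 0, sizeof(peer));
    peer.sun_family = AF_UNIX;
    strncpy(peer.sun_path, path, sizeof(peer.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* reuse: send(fd, buf, len, MSG_NOSIGNAL) per message */
}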

If the small chunk sizes are needed for certain use cases, I would love 
a configuration flag to set it to higher values (even higher than the 
current maximum of mem::pagessize, 32 KiB) when that fits the use case. 
In the case where I noticed this, the average object size in the cache 
was in the megabyte range.

Currently, without recompiling squid, the rock cache (the only 
cache_dir type supported for SMP) does not seem able to utilize modern 
hardware with 10G or faster networking and SSD disks, unless I have 
missed some configuration option that could help here.

Cheers,
Julian

