[squid-users] very poor performance of rock cache ipc

Alex Rousskov rousskov at measurement-factory.com
Sat Oct 14 15:40:51 UTC 2023


On 2023-10-13 16:01, Julian Taylor wrote:

> When using squid for caching using the rock cache_dir setting the 
> performance is pretty poor with multiple workers.
> The reason for this is due to the very high number of systemcalls 
> involved in the IPC between the disker and workers.

Please allow me to rephrase your conclusion to better match (expected) 
reality and avoid misunderstanding:

By design, a mostly idle SMP Squid should use a lot more system calls 
per disk cache hit than a busy SMP Squid would:

* Mostly idle Squid: Every disk I/O may require a few IPC messages.
* Busy Squid: Bugs notwithstanding, disk I/Os require no IPC messages.


In your single-request test, you are observing the expected effects 
described in the first bullet. That does not imply those effects are 
"good" or "desirable" in your use case, of course. It only means that 
SMP Squid was not optimized for that use case; SMP rock design was 
explicitly targeting the opposite use case (i.e. a busy Squid).

Roughly speaking, here, "busy" means "there are always some messages in 
the disk I/O queue [maintained by Squid in shared memory]".


You may wonder how it is possible that an increase in I/O work results 
in a decrease (and, hopefully, elimination) of related IPC messages. 
Roughly speaking, a worker must send an IPC "you have a new I/O request" 
message only when its worker->disker queue is empty. If the queue is not 
empty, then there is no reason to send an IPC message to wake up disker 
because disker will see the new message when dequeuing the previous one. 
Same for the opposite direction: disker->worker...
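
The "notify only when the queue was empty" rule described above can be illustrated with a toy model. This is only a sketch, not Squid code: the OneWayQueue class, its notifications counter (standing in for IPC wake-up messages), and the two traffic patterns are hypothetical names invented for the illustration.

```python
from collections import deque


class OneWayQueue:
    """Toy model of a worker->disker queue (hypothetical, not Squid's class)."""

    def __init__(self):
        self.items = deque()
        self.notifications = 0  # stands in for IPC "you have work" messages

    def push(self, request):
        was_empty = not self.items
        self.items.append(request)
        if was_empty:
            # The reader may be idle, so it needs an explicit wake-up.
            self.notifications += 1

    def pop(self):
        return self.items.popleft()


def idle_pattern(n):
    """Each request arrives only after the previous one was dequeued."""
    q = OneWayQueue()
    for i in range(n):
        q.push(i)  # queue was empty: costs one notification
        q.pop()    # reader drains the queue before the next request
    return q.notifications


def busy_pattern(n):
    """There is always at least one request pending in the queue."""
    q = OneWayQueue()
    q.push(-1)     # one request is already queued (one initial wake-up)
    for i in range(n):
        q.push(i)  # queue is not empty: no notification needed
        q.pop()    # reader dequeues the older request, then sees the new one
    return q.notifications


if __name__ == "__main__":
    print("idle:", idle_pattern(1000), "notifications")
    print("busy:", busy_pattern(1000), "notifications")
```

In this model the idle pattern pays one notification per request, while the busy pattern pays only for the very first request, which matches the two bullets above.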


 > Is it necessary to have these read chunks so small

It is not. Disk I/O size should be at least the system I/O page size, 
but it can be larger. The optimal I/O size is probably very dependent on 
traffic patterns. IIRC, Squid I/O size is at most one Squid page 
(SM_PAGE_SIZE or 4KB).
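
As a rough illustration of why I/O size matters in the mostly idle case, here is a back-of-the-envelope estimate. It assumes that every disk read costs the 13 disker system calls visible in the strace quoted below, and that the I/O size could be raised freely; both are simplifying assumptions, and disker_syscalls() is a hypothetical helper, not a Squid function.

```python
# Counted from the quoted disker strace for one 4KiB chunk: epoll_wait,
# recvmsg, pread64, socket, 4 x fcntl, epoll_ctl, epoll_wait, sendmsg,
# epoll_ctl, close. Assumes every chunk looks like that one.
SYSCALLS_PER_CHUNK = 13


def disker_syscalls(object_size, io_size):
    """Estimate disker system calls needed to read one cached object."""
    chunks = -(-object_size // io_size)  # ceiling division
    return chunks * SYSCALLS_PER_CHUNK


if __name__ == "__main__":
    size = 100 * 1024 * 1024  # a 100 MiB object, chosen for illustration
    for io_size in (4096, 32768):
        print(io_size, disker_syscalls(size, io_size))
```

Under these assumptions, going from 4KiB to 32KiB reads cuts the estimate by a factor of 8. A busy Squid should need far fewer IPC-related calls per read, so this estimate only applies to the idle case.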

FWIW, I suspect there are significant inefficiencies in disk I/O related 
request alignment: The code does not attempt to read from and write to 
disk page boundaries, probably resulting in multiple low-level disk I/Os 
per one Squid 4KB I/O in some (many?) cases. With modern non-rotational 
storage these effects are probably less pronounced, but they probably 
still exist.
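
To make the alignment concern concrete, the sketch below counts how many pages a single read spans. The 4096-byte page size is an assumption, and pages_touched() is a hypothetical helper. The sample offset is taken from the pread64() call in the strace quoted below.

```python
PAGE_SIZE = 4096  # assumed disk/system page size


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return how many page_size-aligned pages the byte range spans."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


if __name__ == "__main__":
    # Offset from the quoted strace: pread64(19, ..., 4096, 10747944)
    print(10747944 % PAGE_SIZE)             # distance from a page boundary
    print(pages_touched(10747944, 4096))    # unaligned 4KiB read
    print(pages_touched(10747904, 4096))    # same read, page-aligned
```

That read starts 40 bytes past a page boundary and therefore spans two pages. Whether this becomes two low-level disk I/Os depends on the kernel, the file system, and the device, so treat the page count as an upper-level hint rather than a measurement.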

BTW, please note that, IIRC, workers and diskers do not send HTTP bytes 
using IPC messages. Those IPC messages only carry small metainformation 
about I/O. HTTP bytes are stored in shared memory pages. I do not recall 
why the corresponding disk I/O IPC messages are so big, but it is 
probably just a code simplification (because larger IPC messages are 
needed for cache manager queries).


HTH,

Alex.


> You can reproduce this very easily with a simple setup using the 
> following configuration in the current git HEAD and older versions:
> 
> maximum_object_size 8 GB
> cache_dir rock /cachedir/cache 1024
> cache_peer some.host parent 80 3130 default no-query no-digest
> http_port 3128
> 
> Now download a larger file from some.host through the cache so it is 
> cached, and repeat.
> 
> curl --proxy localhost:3128  http://some.host/file >  /dev/null
> 
> The download of the cached file from the local machine will be performed 
> at a very low rate: 35mb/s on my not ancient machine, even with 
> everything being cached in memory.
> 
> If you check what is happening in the disker, you see that it reads a 
> 4112 byte ipc message from the worker, performs a read of 4KiB size, then 
> opens a new socket to notify the worker, does 4 fcntl calls on the 
> socket, sends a 4112 byte (2 x86 pages) ipc message, and then closes 
> the socket. This repeats for every 4KiB read, and you have the same 
> thing on the receiving worker side.
> 
> Here is an strace of one chunk of the request in the disker:
> 
> 21:49:28 epoll_wait(7, [{events=EPOLLIN, data={u32=26, u64=26}}], 65536, 827) = 1 <0.000013>
> 21:49:28 recvmsg(26, {msg_name=0x557d7c4f06b8, msg_namelen=110 => 0, msg_iov=[{iov_base="\7\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4112}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 4112 <0.000027>
> 21:49:28 pread64(19, "\266E\337\37\374\201b\215\240\310`\216\366\242\350\210\215\22\377zu\302\244Tb\317\255K\10\"p\327"..., 4096, 10747944) = 4096 <0.000015>
> 21:49:28 socket(AF_UNIX, SOCK_DGRAM, 0) = 11 <0.000021>
> 21:49:28 fcntl(11, F_GETFD)             = 0 <0.000011>
> 21:49:28 fcntl(11, F_SETFD, FD_CLOEXEC) = 0 <0.000011>
> 21:49:28 fcntl(11, F_GETFL)             = 0x2 (flags O_RDWR) <0.000011>
> 21:49:28 fcntl(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000012>
> 21:49:28 epoll_ctl(7, EPOLL_CTL_ADD, 11, {events=EPOLLOUT|EPOLLERR|EPOLLHUP, data={u32=11, u64=11}}) = 0 <0.000023>
> 21:49:28 epoll_wait(7, [{events=EPOLLOUT, data={u32=11, u64=11}}], 65536, 826) = 1 <0.000015>
> 21:49:28 sendmsg(11, {msg_name={sa_family=AF_UNIX, sun_path="/tmp/local/var/run/squid/squid-kid-2.ipc"}, msg_namelen=42, msg_iov=[{iov_base="\7\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4112}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4112 <0.000022>
> 21:49:28 epoll_ctl(7, EPOLL_CTL_DEL, 11, 0x7ffef63da174) = 0 <0.000014>
> 21:49:28 close(11)                      = 0 <0.000018>
> 
> 
> Poking around a bit in the code, I found that increasing 
> HTTP_REQBUF_SZ in src/http/forward.h to 32KiB also increases the read 
> size on the disker, making it 8 times more efficient, which is ok (but 
> not great).
> (This does not work the same anymore with 
> https://github.com/squid-cache/squid/pull/1335 recently added to 6.x 
> backports, but the 4KiB issue remains in current master)
> 
> This problem is very noticeable on large objects, but the extreme overhead 
> per disk cache request should affect most disk cached objects.
> 
> Is it necessary to have these read chunks so small and the processes 
> opening and closing sockets for every single request instead of reusing 
> an open socket?
> At least the 4 fcntl calls could be removed or reduced to 1, though that 
> only gains 10-30% compared to the 800% from increasing the read size.
> Reducing the size of the 4112 byte ipc message, which only has 4 bytes 
> of data, also results in measurable improvements (though this is 
> dangerous, as squid crashes if the size is too low and it receives 
> cachemanager requests, which seem to be around 600 bytes in length).
> 
> If the small chunk sizes are needed for certain use cases, I would love a 
> configuration flag to set it to higher values (higher even than the 
> current maximum of mem::pagessize 32KiB) if that fits the use case. In 
> the case where I noticed this, the average object size in the cache was 
> in the megabyte range.
> 
> Currently, without recompiling squid, using the rock cache (the only one 
> supported for SMP) to utilize modern hardware with 10G or more network 
> and SSD disks does not seem feasible, unless I missed some configuration 
> option which may help here.
> 
> Cheers,
> Julian
