[squid-dev] [PATCH] Increase request buffer size to 64kb

Amos Jeffries squid3 at treenet.co.nz
Thu Mar 31 10:45:07 UTC 2016


Thank you for some excellent testing (and results :-))


On 31/03/2016 6:50 p.m., Nathan Hoad wrote:
> Responding to both emails, including my findings, so apologies in
> advance for the extremely long email.
> 
> I've gone through the places that use HTTP_REQBUF_SZ, and it seems to
> be Http::Stream::pullData() that's benefiting from this change. To
> simplify all Http::Stream-related uses of HTTP_REQBUF_SZ, I've
> attached a work-in-progress patch that unifies them all into a method
> on Http::Stream and increases only its buffer size, so people are
> welcome to try and replicate my findings.
> 
> Alex, I've tried 8, 16, 32, 128 and 512 KB values - throughput scaled
> up appropriately at each size up to 64 KB. 128 and 512 were the same
> or slightly worse than 64, so I think 64 KB is the "best value".
> 
> My page size and kernel buffer sizes are both stock - I have not
> tweaked anything on this machine.
> 
> $ uname -a
> Linux nhoad-laptop 4.4.3-1-ARCH #1 SMP PREEMPT Fri Feb 26 15:09:29 CET
> 2016 x86_64 GNU/Linux
> $ getconf PAGESIZE
> 4096
> $ cat /proc/sys/net/ipv4/tcp_wmem /proc/sys/net/ipv4/tcp_rmem
> 4096    16384   4194304
> 4096    87380   6291456
> 
> The buffer size on Http::Stream does not grow dynamically; it is a
> simple char[HTTP_REQBUF_SZ]. I could look into making it grow
> dynamically if we're interested in that, but it would be a lot of work
> (to me - feel free to suggest somewhere else this is done, and I can
> try to learn from that). I can't definitively say that increasing this
> constant has no impact on smaller objects; however, ApacheBench
> indicated no impact on performance, maintaining ~6k requests a second
> pre- and post-patch for a small uncached object.
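> 
> For illustration, the fixed buffer and the kind of unified accessor I
> have in mind look roughly like this (a sketch only - the names and
> details in the attached patch may differ):
> 
>     // in Http::Stream (src/http/Stream.h), sketch only
>     char reqbuf[HTTP_REQBUF_SZ];   // fixed at compile time today
> 
>     /// one access point for every Stream use of the buffer, so a
>     /// future size change (or dynamic growth) happens in one place
>     StoreIOBuffer tailBuffer() {
>         return StoreIOBuffer(sizeof(reqbuf), 0, reqbuf);
>     }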
> 
> Amos, replies inline.
> 
> On 30 March 2016 at 21:29, Amos Jeffries wrote:
>> On 30/03/2016 6:53 p.m., Alex Rousskov wrote:
>>
>> One thing you need to keep in mind with all this is that the above
>> macro *does not* configure the network I/O buffers.
> 
> I don't think this is quite true - I don't think it's intentional, but
> I am led to believe that HTTP_REQBUF_SZ does influence network I/O
> buffers in some way. See below.

Nod. It does seem to be affecting them by acting as a bottleneck. What
I meant was that the macro does not set the size of those buffers. As
your testing shows, the network I/O buffers can take more data when the
bottleneck is widened.

> 
>> The network HTTP request buffer is controlled by request_header_max_size
>> - default 64KB.
>>
>> The network HTTP reply buffer is controlled by reply_header_max_size -
>> default 64KB.
>>
>> The HTTP_REQBUF_SZ macro configures the StoreIOBuffer object size,
>> which is mostly used for StoreIOBuffer (client-streams or disk I/O)
>> and local stack-allocated variables, and is tuned to match the
>> filesystem page size - default 4KB.
>>
>> If your system uses non-4KB pages for disk I/O then you should of
>> course tune that alignment. If you are memory-only caching, or not
>> caching that object at all, then the memory page size will be the
>> more important metric to tune against.
> 
> As shown above, I have 4 KB pages for my memory page size. There is no
> disk cache configured, so disk block size should be irrelevant I think
> - see the end of this mail for the squid.conf I've been using for this
> testing. I also don't have a memory cache configured, so the default
> of 256 MB is being used. Seeing as the object I'm testing is a 5 GB
> file, I don't think the memory cache should be coming into play. To be
> sure, I did also run with `cache_mem none`.
> 
>>
>> How important I'm not sure. I had thought the relative difference in
>> memory and network I/O speeds made the smaller size irrelevant (since we
>> are data-copying from the main network SBuf buffers anyway). But
>> perhaps not. You may have just found that it needs to be tuned to match
>> the network I/O buffer default max-size (64KB).
>>
>> NP: perhaps the real difference is how fast Squid can walk the list of
>> in-memory buffers that span the object in memory cache. Since it walks
>> the linked-list from head to position N with each write(2), having
>> larger steps would be relevant.
> 
> Where in the code is this walking done? Investigating this would be
> helpful I think.

IIRC a function somewhere in mem_node. Might even be called 'walker'.
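
Back-of-envelope, if that walk really is linear from the list head on
every write(2), total node visits grow quadratically with the number of
writes, which would explain why bigger steps help so much. A toy cost
model (not Squid code, just the arithmetic):

    // toy model: visits = nodes-per-step * (0 + 1 + ... + writes-1)
    #include <cstdio>

    int main() {
        const long long object = 5LL * 1024 * 1024 * 1024; // 5 GB test file
        const long long node = 4096;                       // one mem_node page
        for (long long step : {4096LL, 16384LL, 65536LL}) {
            const long long writes = object / step;
            const long long visits = (step / node) * (writes * (writes - 1) / 2);
            std::printf("%6lld B steps: %lld writes, %lld node visits\n",
                        step, writes, visits);
        }
    }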

> 
>> Make sure you have plenty of per-process stack space available before
>> going large. Squid allocates several buffers using this size directly on
>> the stack. Usually at least 2, maybe a half dozen.
> 
> Ensuring I'm being explicit here, in all my testing I haven't messed
> with stack sizes, again using the default on my system, which is:
> 
> Max stack size            8388608              unlimited            bytes
> 
> Which seems to have been enough, I think? What would I see if I had
> run out of stack space? A crash?
> 

Yes, a SEGFAULT crash, judging by the recent stack explosions I was
playing with yesterday.

That 8MB seems to be okay for 512 KB buffers, and might even cope with
1MB. But not much more.
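
For anyone following along: `ulimit -s` shows the limit (in KB), and
the rough arithmetic is that a half dozen on-stack buffers at 512 KB
each is already 3 MB of the default 8 MB before any ordinary call
frames; at 1 MB each that becomes 6 MB, which is why much beyond that
risks the SEGFAULT above.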


>>
>>
>> It would be page size (memory pages or disk controller I/O pages),
>> since the network is tuned already, defaulting to 64KB.
>>
>>
>>
>> It is used primarily for the disk I/O and Squid internal client-streams
>> buffers.
>>
>> In the long-term plan those internal uses will be replaced by SBuf,
>> which is controlled more dynamically by the existing squid.conf
>> options and actual message sizes.
>>
>> A new option for tuning disk I/O buffer size might be useful in both
>> long- and short- terms though.
>>
>> Amos
>>
> 
> Alright, so my findings so far:
> 
> Looking purely at system calls, it shows the reads from the upstream
> server are being done in 16 KB chunks, whereas writes to the client
> are done in 4 KB chunks. With the patch, the writes to the client
> increase to 16 KB, so it appears that HTTP_REQBUF_SZ does influence
> network I/O in this way.
> 
> Without patch:
> 
> read(14, "...", 16384) = 16384
> write(11, "...", 4096) = 4096
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  86.69    0.050493           0   1310723           write
>  13.31    0.007753           0    327683           read
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.058246               1638406           total
> 
> 
> With patch:
> 
> read(14, "...", 16384) = 16384
> write(11, "...", 16384) = 16384
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  70.55    0.015824           0    327681           write
>  29.45    0.006604           0    327683           read
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.022428                655364           total
> 
> 
> Given that the patch seems to increase the write buffer to 64 KB, the
> 16 KB buffer sizes interested me. So I looked at configuration options
> that default to 16 KB, and found read_ahead_gap. The strace output
> showed that increasing this number increased the size of the buffer
> given to the read(2) calls, and with the patch, the write(2) calls as
> well, so I decided to compare read_ahead_gap 16 KB and read_ahead_gap
> 64 KB, with and without the patch.
> 
> Without patch, 16 KB:
> 100 5120M  100 5120M    0     0   104M      0  0:00:48  0:00:48 --:--:-- 96.0M
> 
> Without patch, 64 KB:
> 100 5120M  100 5120M    0     0   102M      0  0:00:50  0:00:50 --:--:-- 91.8M
> 
> With patch, 16 KB:
> 100 5120M  100 5120M    0     0   347M      0  0:00:14  0:00:14 --:--:--  352M
> 
> With patch, 64 KB:*
> 100 5120M  100 5120M    0     0   553M      0  0:00:09  0:00:09 --:--:--  517M
> 

Nice. :-)

Okay, that convinces me we should do this, and change the default
read-ahead gap as well.

> As above shows, this directive does not have much of a performance
> impact pre-patch for this test, as the number and size of write(2)
> calls are still fixed. However, post-patch the improvement is quite
> substantial, as the write(2) calls are now using the full 64 KB
> buffer. The strace output above suggests (to me) that the improvement
> in throughput comes from the reduction in syscalls, and possibly less
> work on Squid's part. If people are interested in the exact syscall
> numbers, I can show the strace summaries of the above comparisons.
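> 
> (As a sanity check on those counts: the 5 GB body is 5,368,709,120
> bytes, which is 1,310,720 writes at 4 KB and 327,680 at 16 KB -
> matching the strace totals above to within a few calls.)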
> 
> *: There is quite a lot of variance between runs at this point - the
> averages go from 465MB/s up to 588MB/s.
> 
> Also, the squid.conf I've been using (yes it is 5 lines):
> 
> cache_log /var/log/squid/cache.log
> http_access allow all
> http_port 8080
> cache_effective_user squid
> read_ahead_gap 64 KB
> 
> At this stage, I'm not entirely sure what the best course of action
> is. I'm happy to investigate things further if people have
> suggestions. read_ahead_gap appears to influence downstream write
> buffer sizes, at least up to the maximum of HTTP_REQBUF_SZ. It would
> be nice if that buffer size were independently configurable at run
> time instead of compile time, but I don't have a real feel for how
> much work that would be. I'm interested in other people's thoughts
> here.

Making it configurable is fairly trivial. Might be useful to do so for
further testing.
Just an entry in SquidConfig.h and src/cf.data.pre, and using the
Config.X member. Perhaps as a MemBuf init() parameter instead of a
straight char[] buffer; those can grow to quite large sizes if needed.
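
Roughly like this, with a hypothetical option name (an untested sketch;
the TYPE/LOC plumbing just follows existing size options like
client_request_buffer_max_size):

    # src/cf.data.pre (hypothetical new entry)
    NAME: client_stream_buffer_size
    TYPE: b_size_t
    LOC: Config.clientStreamBufferSize
    DEFAULT: 4 KB
    DOC_START
        Size of the internal client-streams buffer, replacing the
        compile-time HTTP_REQBUF_SZ constant.
    DOC_END

    // src/SquidConfig.h, inside class SquidConfig (sketch)
    size_t clientStreamBufferSize;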

The goal is to have all the I/O and buffer handling using SBuf and/or
other MemBlob children for data storage, though. That should completely
remove the buffer itself as a bottleneck.

That is quite a lot more work though since the operational design of
StoreIOBuffer is so different to SBuf.
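
The mismatch in a nutshell - simplified shapes from my reading of the
headers, not the real class definitions:

    #include <cstddef>
    #include <cstdint>

    // StoreIOBuffer-like: a non-owning window over caller storage;
    // the caller allocates the memory and fixes its size up front.
    struct Window {
        size_t length;
        int64_t offset;
        char *data;     // not owned here, cannot grow
    };
    // An SBuf, by contrast, is a ref-counted handle onto a shared,
    // growable MemBlob, so nothing at the call site pins the size.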

PS. If you like, I will take your current patch and merge it tomorrow.

Amos


