[squid-users] squid centos and osq_lock

Sat Aug 1 11:45:39 UTC 2015

On 07/31/2015 03:56 PM, Amos Jeffries wrote:
> On 1/08/2015 4:06 a.m., Josip Makarevic wrote:
>> Marcus, tnx for your info.
>> OS is centos 6 w kernel  2.6.32-504.30.3.el6.x86_64
>> Yes, cpu_affinity_map is good and with 6 instances there is load only on
>> first 6 cores and the server is 12 core, 24 HT
>
> Then I suspect that mutex and locking will be the kernel scheduling work
> on the HT cores.
>   In high performance Squid will max out a physical core's worth of
> cycles. HT essentially tries to over-clock physical cores. But trying to
> reach 200% capacity into a physical core with Squid workloads only leads
> to trouble.
>   It is far better to tie Squid with affinity to one instance per
> physical core and let the extra HT capacity be available to the OS and
> other supporting things the Squid instance needs to have happen externally.
>
>
>> each instance is bound to 1 core. Instance 1 = core1, instance 2 = core 2
>> and so on so that should not be the problem.
>> I've tried with 12 workers but that's even worse.
>
> You do need to be very careful about which core numbers are the HT core
> vs the physical core ID. Last time I saw anyone doing it, every second
> number was a real physical core ID. YMMV.

There are 2 mappings; I have seen both but do not recall which I saw where.
You can do the following to find out which CPU numbers are siblings (HT cores).
CPUs that report the same list, such as "0,12", are hyperthreads of the same physical core:
cd /sys/devices/system/cpu
for cpu in cpu[0-9]* ; do
    echo "$cpu: $(cat "$cpu/topology/thread_siblings_list")"
done
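
As a rough sketch (untested on your hardware), the sibling lists can be reduced to one CPU number per physical core, which is the set you would want to pin Squid instances to. The function name first_siblings and its optional directory argument are my own, added only for illustration:

```shell
#!/bin/sh
# Print the lowest-numbered logical CPU of every physical core, one per line.
# The optional argument (default /sys/devices/system/cpu) is an assumption of
# this sketch; it only exists so the function can be tried on a copied tree.
first_siblings() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        # A list looks like "0,12" or "0-1"; keep only the first number.
        sed 's/[,-].*//' "$f"
    done | sort -n -u
}
```

Calling first_siblings without arguments prints the sysfs CPU numbers, which start at 0. As far as I know cpu_affinity_map counts cores from 1, so please check the offset before copying the numbers into squid.conf.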

>> Let me try to explain:
>> on non-smp with traffic at ~300mbits we have load of ~4 (on 6 workers).
>> in that case, actual user time is about 10-20% and 70-80% is sys time
>> (osq_lock) and there are no connection timeouts.

The CPU time in osq_lock is not easy to explain, but it is unlikely to be caused by Squid itself.
Googling osq_lock led me to a kernel patch discussion where 500 dd processes on ext4/multipath, or a file system repair with 125 threads, caused the system to spend 70+% CPU in osq_lock.
The general belief was that a large amount of outstanding I/O caused it.
This brings me to these questions:
- what is your testing method?
- are there simply too many concurrent connections per instance of Squid?
- are the bonded 10G interfaces supported by CentOS 6?
- can you test with unbonded Ethernet? (the bonding driver code uses 2 locks)

You may or may not get better results with CentOS 7 or with the custom kernel (try either the latest release or one older than 3.12, since some issues started with 3.12).

>> If I switch to SMP 6 workers user time goes up but sys time goes up too and
>> there are connection timeouts and the load jumps to ~12.
>> If I give it more workers only load jumps and more connections are being
>> dropped to the point that load goes to 23/24 and the entire server is slow
>> as hell.
>>
>> So, best performance so far is with 6 non-smp workers.

'workers' is a term used by Squid SMP.
To avoid confusion, I suggest using the term 'instance' for a non-SMP Squid config.

Marcus

>> For now I have 2 options:
>> 1. Install older squid (3.1.10 centos repo) and try it then
>> 2. build custom 64bit kernel with RCU and specific cpu family support (in
>> progress).
>>
>> The end idea is to be able to sustain 1gig of traffic on this server :)
>> Any advice is welcome
>
> I agree with Marcus then. The non-SMP then is the way to go at present.
> The main benefit of SMP support in current Squid is for caching
> de-duplication (ie rock store).
>
>
> Also some things to note:
>
> * a good percentage of the speed of Squid is the 20-40% caching HIT rate
> normal HTTP traffic has. Albeit memory-only caching on highest
> performance boxen. Memory hits are 4-6 orders of magnitude faster than
> network fetches. This has little to do with anything you can control
> (normally). The (relatively) slow speed of origin servers creating the
> content is the bottleneck. Even "static" content may be encoded to the
> clients requested desire on each fetch, which takes time.
>
>
> * Going by our lab tests and real-world results so far I rate Squid
> per-worker at ~50Mbps on a 3.1GHz core, and ~70Mbps on 3.7GHz. Your 12
> cores will only get you up around 800 Mbits IMHO (that's after tuning). I
> would gladly be proven wrong though :-)
>
>
> * Squid effectively *polls* all the listening ports every 10ms or once
> every 10 I/O events (whichever is faster). So running with 1024
> listening ports is a bit counter-productive, more time could be spent
> checking those ports than doing work.
>   That said going from one to multiple listening ports does make a speed
> improvement. Finding the sweet spot between those trends is something
> else to tune for.
>   <http://wiki.squid-cache.org/MultipleInstances#Tips>
>
>
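
To put numbers on the per-worker estimate quoted above, here is a small back-of-the-envelope sketch. It assumes throughput scales linearly with the number of instances, which real traffic will not quite give you, and the function names are mine:

```shell
#!/bin/sh
# aggregate_mbps INSTANCES PER_INSTANCE_MBPS -> estimated total in Mbit/s
aggregate_mbps() {
    echo $(( $1 * $2 ))
}

# instances_needed TARGET_MBPS PER_INSTANCE_MBPS -> instances, rounded up
instances_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

aggregate_mbps 12 50       # prints 600
aggregate_mbps 12 70       # prints 840
instances_needed 1000 70   # prints 15
```

So even at the optimistic 70 Mbit/s per instance, 12 instances land around 840 Mbit/s, and 1 Gbit/s would need roughly 15 instances under this simple model.
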
>> 2015-07-31 14:53 GMT+02:00 Marcus Kool:
>>
>>> osq_lock is used in the kernel for the implementation of a mutex.
>>> It is not clear which mutex so we can only guess.
>>>
>>> Which version of the kernel and distro do you use?
>>>
>>> Since mutexes are used by Squid SMP, I suggest to switch for now to Squid
>>> non-SMP.
>>>
>>> What is the value of cpu_affinity_map in all config files?
>>> You say they are static. But do you allocate each instance on a different
>>> core?
>>> Does 'top' show that all CPUs are used?
>>>
>>> Do you have 24 cores or 12 hyperthreaded cores?
>>> In case you have 12 real cores, you might want to experiment with 12
>>> instances of Squid and then try to upscale.
>>>
>>> Make maximum_object_size large, a max size of 16K will prohibit the
>>> retrieval of objects larger than 16K.
>>> I am not sure about 'maximum_object_size_in_memory 16 KB' but let it be
>>> infinite and do not worry since
>>> cache_mem is zero.
>>>
>>> Marcus
>>>
>>>
>>>
>>> On 07/31/2015 03:52 AM, Josip Makarevic wrote:
>>>
>>>> Hi Amos,
>>>>
>>>>    cache_mem 0
>>>>    cache deny all
>>>>
>>>> already there.
>>>> Regarding number of nic ports we have 4 10G eth cards 2 in each bonding
>>>> interface.
>>>>
>>>> Well, entire config would be way too long but here is the static part:
>>>> via off
>>>> cpu_affinity_map process_numbers=1 cores=2
>>>> forwarded_for delete
>>>> visible_hostname squid1
>>>> pid_filename /var/run/squid1.pid
>
> Remove these...
>
>>>> icp_port 0
>>>> htcp_port 0
>>>> icp_access deny all
>>>> htcp_access deny all
>>>> snmp_port 0
>>>> snmp_access deny all
>
> ... to here. They do nothing but slow Squid-3 down.
>
>>>> dns_nameservers x.x.x.x
>>>> cache_mem 0
>>>> cache deny all
>>>> pipeline_prefetch on
>
> In Squid-3.4 and later this is set to the length of pipeline you want to
> accept.
>
> NP: 'on' traditionally has meant pipeline length of 1 (two parallel
> requests). Longer lengths are not yet well tested but generally it seems
> to work okay.
>
>
>>>> memory_pools off
>>>> maximum_object_size 16 KB
>>>> maximum_object_size_in_memory 16 KB
>
> Like Marcus said. Without even memory caching these two have no useful
> effects.
>
> There is one related setting "read_ahead_gap" which affects performance
> by tuning the amount of undelivered object data Squid will buffer in
> transient memory. A higher value means faster servers can finish
> sending earlier and have their resources released for other uses.
>   Tuning this is a fine art since it modulates how much Squid's internal
> buffers (and pipeline prefetching) read off TCP buffers. And all of
> those buffers have limits of their own and may contain multiple requests
> data.
>
>
>>>> ipcache_size 0
>
> Remove this. Without IP cache Squid will be forced to do about 4x remote
> DNS lookup for every single HTTP request - *minimum*. Maybe more if you
> apply any access controls to the traffic.
>   If anything increase the ipcache size to store more results.
>
>
>>>> cache_store_log none
>
> Not needed in Squid-3. You can remove.
>
>>>> half_closed_clients off
>>>> include /etc/squid/rules
>>>> access_log /var/log/squid/squid1-access.log
>
> Logging I/O slows Squid down. I suggest making that a daemon, TCP or UDP
> log output.
>
>
>>>> cache_log /var/log/squid/squid1-cache.log
>>>> coredump_dir /var/spool/squid/squid1
>>>> refresh_pattern ^ftp:           1440    20%     10080
>>>> refresh_pattern ^gopher:        1440    0%      1440
>>>> refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
>>>> refresh_pattern .               0       20%     4320
>
> Without caching you can remove these *entirely*.
>
>>>>
>>>> acl port0 myport 30000
>
> Mumble. Less reliable than myportname, but it is infinitesimally faster
> when it does work at all.
>
>>>> http_access allow testhost
>>>> tcp_outgoing_address x.x.x.x port0
>>>>
>>>> include is there for basic ACL - safe ports and so on - to minimize
>>>> config file footprint since it's static and same for every worker.
>>>>
>>>> and so on 44 more times in this config file
>
> Only put allow testhost once. Every time you test ACLs Squid slows down.
>
> Some ACLs are worse drag than others. You can probably optimize even the
> default recommended security settings you shuffled into "rules" file to
> operate better.
>
>
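
Since the per-instance files differ only in the instance number and the core, they could be generated instead of maintained by hand. The sketch below is only an illustration: the function name instance_conf is hypothetical, the directives and paths are copied from the config quoted above, and the daemon: prefix on access_log is one of the log modules suggested earlier in this thread:

```shell
#!/bin/sh
# Emit the per-instance part of a non-SMP Squid config.
#   $1 = instance number (used in host name and file paths)
#   $2 = core number for cpu_affinity_map
instance_conf() {
    n="$1"
    core="$2"
    cat <<EOF
visible_hostname squid${n}
pid_filename /var/run/squid${n}.pid
cpu_affinity_map process_numbers=1 cores=${core}
access_log daemon:/var/log/squid/squid${n}-access.log squid
cache_log /var/log/squid/squid${n}-cache.log
coredump_dir /var/spool/squid/squid${n}
include /etc/squid/rules
EOF
}
```

For example, "instance_conf 1 2" prints the block for squid1 pinned to core 2, matching the quoted config. Where the output is written to is up to you; I have not assumed a file name.
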
>>>>
>>>> Do you know of any good article on how to tune kernel locking or have any
>>>> idea why it is happening?
>>>> I cannot find any good info on it and all I've found are bits and pieces
>>>> of kernel source code.
>
> Sorry no. All I found was the same.
>
> Though I do know that one of the big differences between Linux 2.6 and
> 3.0 was the removal of the "Big Kernel Lock" system that allowed Linux
> to run on multi-core systems properly. It could be CentOS 6 itself biting
> you with its ancient kernel version.
>
>
> Amos