[squid-users] Issues with SSLBumped high traffic forward caching

Wed Jun 9 14:04:09 UTC 2021

Hi,

Our use case is fairly simple (in theory): a bunch of headless Chromium
browsers connect to websites on the Internet through various
geo-specific proxies. To speed things up, we'd like to add a caching
layer, since it's perfectly acceptable for us to honor all
max-age/expires/etc. headers for all of the accessed content.

Nearly all accesses use https, so we've had to implement SSLBump, and
we went with squid 5. That part seems to work well enough.

We initially went with multiple servers configured as cache peers, but
since we've been seeing a lot of different problems, we're now focusing
on a single squid 5.0.6 server.

It has 128GB RAM, a 16 core EPYC CPU, 3TB+ of NVMe storage and 1Gbps
Internet bandwidth, which we'd obviously like to use as much as
possible.

What we have configured is:
* Multiple http_port directives, each with its own cache_peer, to reach
  remote geo-specific proxies. We had to rebuild with
  -DMAXTCPLISTENPORTS=512 to raise the default limit of 128. Example
  (sorry for the line breaks):

acl port_usa1 localport 21083
http_port 21083 ssl-bump cert=/etc/squid/ssl_cert/myCA.pem \
  generate-host-certificates=on dynamic_cert_mem_cache_size=32MB
cache_peer 198.51.100.66 parent 443 3130 no-query no-digest no-delay \
  name=usa1
cache_peer_access usa1 allow port_usa1
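
With this many ports, stanzas like the one above lend themselves to being
generated rather than written by hand. Here is a minimal Python sketch of
that idea, not what we actually run. The helper names (peer_stanza,
render) and the second peer (deu1, port 21084, 203.0.113.10) are
hypothetical; only the usa1 values come from the example above.

```python
def peer_stanza(name, port, peer_host,
                ca_path="/etc/squid/ssl_cert/myCA.pem"):
    """Return the squid.conf lines for one geo-specific listening port."""
    return "\n".join([
        # ACL matching the local port the client connected to
        f"acl port_{name} localport {port}",
        # SSL-bumping listener, same options as the usa1 example
        f"http_port {port} ssl-bump cert={ca_path} "
        "generate-host-certificates=on dynamic_cert_mem_cache_size=32MB",
        # Remote geo-specific proxy used as the parent for this port
        f"cache_peer {peer_host} parent 443 3130 "
        f"no-query no-digest no-delay name={name}",
        # Only traffic that arrived on this port may use this peer
        f"cache_peer_access {name} allow port_{name}",
    ])


def render(peers):
    """Render stanzas for (name, port, peer_host) tuples, rejecting duplicates."""
    names = [name for name, _, _ in peers]
    ports = [port for _, port, _ in peers]
    if len(set(names)) != len(names):
        raise ValueError("duplicate peer name")
    if len(set(ports)) != len(ports):
        raise ValueError("duplicate listening port")
    return "\n\n".join(peer_stanza(*peer) for peer in peers) + "\n"


if __name__ == "__main__":
    # The second entry is a made-up peer, used only for illustration.
    print(render([
        ("usa1", 21083, "198.51.100.66"),
        ("deu1", 21084, "203.0.113.10"),
    ]), end="")
```

Each directive is emitted on a single line, which avoids the line-break
problem in the example above.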

* Simple SSLBump setup, where we don't check origins for cached objects
  to avoid the added latency, use a local CA and have Chromium
  configured to ignore all SSL/TLS mismatches:

sslcrtd_program /usr/lib64/squid/security_file_certgen -s \
  /var/cache/squid/ssl_db -M 32MB
acl step1 at_step SslBump1
acl step2 at_step SslBump2
acl step3 at_step SslBump3
ssl_bump client-first

* Memory and disk cache to try and use resources as much as possible:

workers 4
cache_mem 81920 MB
memory_cache_shared on
shared_transient_entries_limit 65536
minimum_object_size 0 KB
maximum_object_size 20 MB
maximum_object_size_in_memory 2048 KB
#cache_dir rock /var/spool/squid 3453640 
max_filedescriptors 16384

We've tried many other configuration options and read a lot of
documentation, but we're still getting many errors in the logs.
Here are the most worrying:

assertion failed: Transients.cc:221: "old == e"

When that "assertion failed" happens, the kid dies and a new one gets
forked in its place. This happens multiple times per minute.

ERROR: Collapsed forwarding queue overflow for kid1 at 1024 items

We haven't been able to track this one down. It doesn't show up
immediately but always comes back, and can occur multiple times per
second during usage peaks. We've tried:
 * Enabling and disabling "collapsed_forwarding": nothing changes. It
   should be off by default, but the message appears regardless.
 * Recompiling squid with the queue limit raised to 4096: same message
   with the new value.
 * Disabling the "cache_dir rock": the message then takes longer to
   appear, but ultimately does appear again.
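
To put numbers on "multiple times per minute" and "multiple times per
second", the two messages can be counted per minute and per kid. This is
a minimal Python sketch, assuming cache.log lines start with a
"YYYY/MM/DD HH:MM:SS kidN| " prefix. That prefix and the names used here
(count_errors, PATTERNS) are assumptions for illustration, not taken from
the squid documentation.

```python
import re
import sys
from collections import Counter

# Assumed cache.log prefix: "2021/06/09 14:04:09 kid1| message".
# Group 1 keeps the timestamp truncated to the minute.
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}(?:\.\d+)? (kid\d+)\| (.*)$"
)

# Substrings of the two messages quoted in this post.
PATTERNS = {
    "assertion": "assertion failed: Transients.cc",
    "cf_overflow": "Collapsed forwarding queue overflow",
}


def count_errors(lines):
    """Count matching messages per (minute, logging kid, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without the assumed prefix
        minute, kid, message = match.groups()
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[(minute, kid, label)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python3 count_errors.py <path to cache.log>
    with open(sys.argv[1], errors="replace") as log:
        for (minute, kid, label), n in sorted(count_errors(log).items()):
            print(f"{minute} {kid} {label} {n}")
```

The kid in each output row is the one that logged the message, which for
the overflow error is not necessarily the kid named in the message text.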

Could anyone provide pointers on how to track down what could be
causing these two errors? We can provide configuration, logs, traces
and dumps as needed.

Cheers,
Matthias

-- 
Matthias Saou
Web: http://matthias.saou.eu/
Mail/XMPP: matthias at saou.eu
GPG: 4096R/E755CC63
     8D91 7E2E F048 9C9C 46AF
     21A9 7A51 7B82 E755 CC63