[squid-users] 100% cpu after about an hour

Fri Dec 2 01:27:23 UTC 2016

On 2/12/2016 1:18 p.m., Michael Gibson wrote:
> Hello,
> 
> We are seeing about 100% CPU usage after roughly an hour of running. We
> operate Squid v3.5 on multiple nodes, serving from 10 up to 200 users on
> various nodes. We recently updated from 3.3 to 3.5 and I've been unable to
> contain the CPU usage of Squid. An attempt to run multiple Squid workers
> resulted in both cores getting pegged on our small servers.

Which 3.5 release exactly?

Any hint about what it's doing with all those cycles?
 cache.log might have something.

> 
> Dual core Intel(R) Celeron(R) M processor         1.50GHz
> 4GB RAM
> 
> Here's the config we're currently running:
> 
> # Managed by Chef
> # Changes will be overwritten
> 
> # Crazy debug
> #debug_options ALL,0 11,5 20,5 17,5 23,5 26,5 28,5 44,5 55,5 61,5 78,5 83,5
> 
> # Debug ACL issues
> #debug_options ALL,1 28,4
> 
> # Debug ACL issues full access
> #debug_options ALL,1 28,2 28,9
> 
> # Setup our local networks ACLs
> acl VPN_Net src 10.1.2.0/24
> acl No_Auth_Net src 10.10.1.0/24
> acl Android_Server src 10.10.1.1/32
> 
> acl SSL_ports port 443
> acl Safe_ports port 80    # http
> acl Safe_ports port 443   # https
> acl CONNECT method CONNECT
> 
> # Custom acl for Telmate-controlled sites
> acl telmate_domains dstdomain .telmate.com .telmate.cc
> request_header_add ****************************
> request_header_add **********************
> 
> #
> # Content filtering
> #
> icap_enable on
> 
> # unlimited icap failure
> icap_service_failure_limit -1
> icap_retry allow all
> icap_send_client_ip on
> icap_retry_limit -1
> 
> icap_service service_req reqmod_precache bypass=0 icap://127.0.0.1:1344/request
> adaptation_access service_req allow VPN_Net !CONNECT
> 
> icap_service service_resp respmod_precache bypass=0 icap://127.0.0.1:1344/response
> adaptation_access service_resp allow VPN_Net
> 
> # Only allow cachemgr access from Android Server
> http_access allow Android_Server manager
> http_access deny manager
> 

In 3.5 move the manager stuff down below the CONNECT line...

> # Deny requests to ports we don't allow
> http_access deny !Safe_ports
> 
> # Deny CONNECT to other than secure SSL ports
> http_access deny CONNECT !SSL_ports
> 

... here. It is slightly faster that way.
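
For reference, a sketch of the resulting order. It uses only the rules
already quoted above and introduces nothing new; the remaining
http_access lines would follow unchanged.

```
# Cheap port and method checks first
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports

# Cache manager rules moved below the CONNECT line
http_access allow Android_Server manager
http_access deny manager
```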

> # URL Filtering
> 
> # acl No_Auth_Whitelist dstdomain "/etc/squid3/approved-sites.squid"
> acl No_Auth_Whitelist dstdomain "/etc/squid3/no-auth-approved-sites.squid"
> 
> # dedicated, no exception URL blacklist managed by chef only
> acl blacklist-urls url_regex "/etc/squid3/blacklist-urls.squid"
> http_access deny blacklist-urls
> 
> # Allow localhost access in case of misconfigured application
> http_access allow localhost
> 
> # Allow No_Auth_Net access to only the No auth whitelist
> http_access allow No_Auth_Whitelist No_Auth_Net
> 
> # CONNECT method requests only have an IP address, allow all SSL CONNECT handshakes
> http_access allow No_Auth_Net CONNECT
> 
> # Allow VPN_Net to anything as ICAP will be consulted for approval
> http_access allow VPN_Net
> 
> # Default catch all to deny access not specifically granted
> http_access deny all
> 
> 
> # Squid proxy interception config
> http_port 10.10.1.1:3128
> http_port 10.10.1.1:3126 intercept
> https_port 10.10.1.1:3127 intercept ssl-bump generate-host-certificates=on dynamic_cert_mem_cache_size=4MB cert=/etc/squid3/telmate-gk-CA.pem
> 
> # Proxy public hiding, don't tell site we are using a proxy
> via off
> forwarded_for off
> 
> # ssl-bump goodies
> always_direct allow all

Obsolete 3.1 hack for client-first SSL-Bump. Remove.

> ssl_bump server-first all

Deprecated bumping rules. Please update this to the 3.5 feature syntax.

IIRC, the 3.5 equivalent of the above line is:
 ssl_bump stare all
 ssl_bump bump all

> 
> # the following two options are unsafe and not always necessary:
> sslproxy_cert_error allow all
> sslproxy_flags DONT_VERIFY_PEER
> 
> # Prepare ssl_db: Done in Chef
> # /usr/lib/squid3/ssl_crtd3 -c -s /var/spool/squid3/ssl_db -M 4MB
> # chown -R proxy:proxy /var/spool/squid3/ssl_db
> sslcrtd_program /usr/lib/squid3/ssl_crtd3 -s /var/spool/squid3/ssl_db -M 4MB
> sslcrtd_children 32 startup=5 idle=1
> 
> # Shutdown Squid after 2 seconds to flush current connections, default is 10
> shutdown_lifetime 2 seconds
> 
> # Leave coredumps in the cache dir
> coredump_dir /var/spool/squid3
> 
> # Object size and lifetime settings
> cache_mem 256 MB
> maximum_object_size 1024 MB
> range_offset_limit 200 MB

Since the rock cache also breaks objects down into chunks of ~32KB, it is
inefficient for large (MB or GB sized) objects. That is not going to do
much good for CPU performance.

> quick_abort_min -1
> read_ahead_gap 50 MB
> 

So 50+ MB of buffer per client connection.

 How many connections per second does your proxy service?
 How many of those are concurrent after this "1 hour" ?

  200 clients x 50 MB buffer => up to 10 GB of buffered data.
 Easily more than the ~3 GB RAM this machine might have spare.
 And that's assuming just one connection per client; browser
clients usually have 8-100 connections open at a time.
 There is a good reason the default 'gap' is set to 16 KB :-)

If Squid starts to use swap memory, performance goes down the drain,
really, really badly.
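
To make the arithmetic concrete, here is the worst case written out as
squid.conf comments, followed by one possible change. The 200-client and
8-connection figures are the ones mentioned in this thread; real traffic
may differ, so treat this as a rough sketch rather than a measurement.

```
# Worst-case read-ahead buffering with "read_ahead_gap 50 MB":
#   200 clients x 1 connection  x 50 MB = 10,000 MB (~10 GB)
#   200 clients x 8 connections x 50 MB = 80,000 MB (~80 GB)
# Either figure is far beyond the 4 GB of RAM in the machine.
#
# Commenting the directive out falls back to Squid's built-in default:
#read_ahead_gap 50 MB
```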

> # cache the health_check to give poor snap a break :(
> refresh_pattern ^https://*************/health_check 10 80% 30 override-expire override-lastmod ignore-reload ignore-no-store ignore-must-revalidate ignore-private ignore-auth store-stale

Ah, you know that 30 means *minutes*, right?

So the health checker will have a 30min delay before identifying
problems. Or conversely, a 30min delay before identifying that problems
are resolved.
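
As a reminder of the field layout, a sketch with the two time fields
reduced to one minute. The value 1 is purely illustrative (an
assumption, not a recommendation); the right figure depends on how often
the check is polled. The redacted hostname and the options are kept
exactly as posted.

```
# refresh_pattern <regex> <min, minutes> <percent> <max, minutes> [options]
refresh_pattern ^https://*************/health_check 1 80% 1 override-expire override-lastmod ignore-reload ignore-no-store ignore-must-revalidate ignore-private ignore-auth store-stale
```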

> 
> refresh_pattern -i (/cgi-bin/|\?) 0 0%  0
> refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv|mpg|wma|ogg|wmv|asx|asf)$ 260000 90% 260009 override-expire
> refresh_pattern -i zip$ 432000 100% 864000 override-expire

You seem to be trying to cache video and multimedia objects in a
default/small RAM cache and a rock cache.

The rock cache type is optimized for caching very small objects, on the
order of KB at most. While it can store larger ones now, that is quite an
inefficient use of the rock cache design.

UFS/AUFS/diskd are the cache types to use for large MB/GB sized objects.
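
One possible layout, as a sketch only: a small rock dir for small
objects plus an AUFS dir for the large ones. The directory names, the
sizes, and the 32 KB split point are all hypothetical values chosen for
illustration. Note that UFS-based cache_dirs are not shared between SMP
workers, so this assumes a single-worker setup.

```
# maximum_object_size must still appear before the cache_dir lines
maximum_object_size 1024 MB

# Small objects (up to 32 KB) go to rock; hypothetical path and size
cache_dir rock /var/spool/squid3/rock 10240 max-size=32768

# Larger objects go to AUFS; hypothetical path, size, and L1/L2 values
cache_dir aufs /var/spool/squid3/aufs 40960 16 256 min-size=32769
```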

> refresh_pattern .   0 20% 4320
> 
> # Logging
> # Following logformat WITH request headers, VERY chatty, debug only
> # logformat squid %tl %5trms %>a %Ss/%03>Hs %<st %rm %>ru %mt %>ha
> logformat squid-full %tl %5trms %>a %Ss/%03>Hs %<st %rm %>ru %mt
> logformat squid %tl %5trms %>a %Ss/%03>Hs %<st %rm %ru %mt

Re-defining the built-in "squid" format does not do what you expect, and
is yet another drain on the CPU.
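
A sketch of one way around it: give the custom format its own name
instead of reusing "squid". The name "telmate-short" is hypothetical;
any name that does not collide with a built-in format would do.

```
# Custom format under a non-reserved (hypothetical) name
logformat telmate-short %tl %5trms %>a %Ss/%03>Hs %<st %rm %ru %mt
access_log daemon:/var/log/squid3/access.log telmate-short
```

If the built-in format is actually good enough, the logformat line can
be dropped and the access_log line can name "squid" directly.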

> # Log query params for telmate traffic only
> access_log daemon:/var/log/squid3/telmate.log squid-full telmate_domains
> access_log daemon:/var/log/squid3/access.log squid
> 
> # Cache_dir must be after maximum_object_size
> cache_dir rock /var/spool/squid3 51200
> 

HTH
Amos