[squid-users] Inconsistent accessing of the cache, craigslist.org images, wacky stuff.

Amos Jeffries squid3 at treenet.co.nz
Wed Oct 28 04:06:45 UTC 2015


On 28/10/2015 2:05 p.m., Jester Purtteman wrote:
> So, here is the problem:  I want to cache the images on craigslist.  The
> headers all look thoroughly cacheable, some browsers (I'm glaring at you
> Chrome) send with this thing that requests that they not be cacheable,

"this thing" being what exactly?

I am aware of several nasty things Chrome sends that interfere with
optimal HTTP use. But nothing that directly prohibits caching like you
describe.


> but
> craigslist replies anyway and says "sure thing! Cache that sucker!" and
> firefox doesn't even do that.  An example of URL:
> http://images.craigslist.org/00o0o_3fcu92TR5jB_600x450.jpg
> 
>  
> 
> The request headers look like:
> 
> Host: images.craigslist.org
> 
> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:41.0) Gecko/20100101
> Firefox/41.0
> 
> Accept: image/png,image/*;q=0.8,*/*;q=0.5
> 
> Accept-Language: en-US,en;q=0.5
> 
> Accept-Encoding: gzip, deflate
> 
> Referer: http://seattle.craigslist.org/oly/hvo/5288435732.html
> 
> Cookie: cl_tocmode=sss%3Agrid; cl_b=hlJExhZ55RGzNupTXAYJOAIcZ80;
> cl_def_lang=en; cl_def_hp=seattle
> 
> Connection: keep-alive
> 
>  
> 
> The response headers are:
> 
> Cache-Control: public, max-age=2592000  <-- doesn't that say "keep that a
> very long time"?
> 

Not exactly. It says only that you are *allowed* to store it for 30
days. Does not say you have to.

Your refresh_pattern rules will use that as the 'max' limit along with
the Date and Last-Modified header values below when determining whether
the response can be cached, and for how long.
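
For reference, the two limits in play here happen to line up exactly (a
quick worked calculation, purely for illustration):

  max-age = 2592000 seconds / 86400 = 30 days
  refresh_pattern 'max' (your first rule) = 43200 minutes / 1440 = 30 days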


> Content-Length: 49811
> 
> Content-Type: image/jpeg
> 
> Date: Tue, 27 Oct 2015 23:04:14 GMT
> 
> Last-Modified: Tue, 27 Oct 2015 23:04:14 GMT
> 
> Server: craigslist/0
> 
>  
> 
> Access log says:
> 1445989120.714    265 192.168.2.56 TCP_MISS/200 50162 GET
> http://images.craigslist.org/00Y0Y_kMkjOhL1Lim_600x450.jpg -
> ORIGINAL_DST/208.82.236.227 image/jpeg
> 

This is intercepted traffic.

I've run some tests on that domain and it is another one presenting only
a single IP address in DNS results, but rotating through a whole set in
the background depending on where the query comes from. As a result,
different machines get different answers.


What we found just the other day was that domains doing this have big
problems when queried through Google DNS servers. Due to the way the
Google DNS servers are spread around the world and load-balance their
traffic, these sites can return different IPs on each and every lookup.

The final outcome of all that is that when Squid tries to verify that the
intercepted traffic was actually going where the client intended, it
cannot confirm that the ORIGINAL_DST server IP belongs to the Host
header domain.


The solution is to set up a DNS resolver in your own network and use that
instead of Google DNS. You may have to divert clients' DNS queries to it
if they still try to go to Google DNS. The result will be much more
cacheable traffic, and probably faster DNS as well.
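
As a rough sketch only, assuming a local resolver (for example unbound or
BIND) is already listening on the Squid box at 127.0.0.1, and that eth1
is the LAN-facing interface (both names are assumptions for illustration,
not taken from your setup):

 # squid.conf: query the local resolver instead of Google DNS
 dns_nameservers 127.0.0.1

 # shell: bounce stray client DNS queries into the local resolver
 iptables -t nat -A PREROUTING -i eth1 -p udp --dport 53 -j REDIRECT --to-ports 53
 iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 53 -j REDIRECT --to-ports 53

That way clients hard-coded to 8.8.8.8 end up getting the same answers
Squid sees, which is what the Host verification needs.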


> 
> And Store Log says:
> 1445989120.714 RELEASE -1 FFFFFFFF 27C2B2CEC9ACCA05A31E80479E5F0E9C   ?
> ?         ?         ? ?/? ?/? ? ?
> 
>  
> 
> I started out with a configuration from here:
> http://wiki.sebeka.k12.mn.us/web_services:squid_update_cache but have made a
> lot of tweaks to it.  In fact, I've dropped all the updates, all the
> rewrite, store id, and a lot of other stuff.  I've set cache allow all
> (which, I suspect, I can simply leave blank, but I don't know).  I've cut it
> down quite a bit, the one I am testing right now for example looks like
> this:
> 
>  
> 
> My squid.conf (which has been hacked mercilessly trying stuff, admittedly)
> looks like this:
> 
>  
> 
> <BEGIN SQUID.CONF >
> 
> acl localnet src 10.0.0.0/8     # RFC1918 possible internal network
> 
> acl localnet src 172.16.0.0/12  # RFC1918 possible internal network
> 
> acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
> 
> acl localnet src fc00::/7       # RFC 4193 local private network range
> 
> acl localnet src fe80::/10      # RFC 4291 link-local (directly plugged)
> machines
> 
>  
> 
> acl SSL_ports port 443
> 
> acl Safe_ports port 80          # http
> 
> acl Safe_ports port 21          # ftp
> 
> acl Safe_ports port 443         # https
> 
> acl Safe_ports port 70          # gopher
> 
> acl Safe_ports port 210         # wais
> 
> acl Safe_ports port 1025-65535  # unregistered ports
> 
> acl Safe_ports port 280         # http-mgmt
> 
> acl Safe_ports port 488         # gss-http
> 
> acl Safe_ports port 591         # filemaker
> 
> acl Safe_ports port 777         # multiling http
> 
> acl CONNECT method CONNECT
> 
>  

You are missing the default security http_access lines. They should be
re-instated even on intercepted traffic.

 acl SSL_ports port 443

 http_access deny !Safe_ports
 http_access deny CONNECT !SSL_ports
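
Between them those two lines refuse requests aimed at ports outside the
Safe_ports list, and CONNECT tunnels to anything other than port 443.
That is the same first line of defence shipped in the default squid.conf.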



> 
> http_access allow localnet
> 
> http_access allow localhost
> 
>  
> 
> # And finally deny all other access to this proxy
> 
> http_access deny all
> 
>  
> 
> http_port 3128
> 
> http_port 3129 tproxy
> 

Okay, assuming you have the proper iptables/ip6tables TPROXY rules set up
to accompany it.
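
For reference, the usual IPv4 rule set for TPROXY looks roughly like this
(matching your tproxy port 3129; equivalent ip6tables and 'ip -6' rules
are needed for the IPv6 side):

 iptables -t mangle -N DIVERT
 iptables -t mangle -A DIVERT -j MARK --set-mark 1
 iptables -t mangle -A DIVERT -j ACCEPT
 iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
 iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY --tproxy-mark 0x1/0x1 --on-port 3129
 ip rule add fwmark 1 lookup 100
 ip route add local 0.0.0.0/0 dev lo table 100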


> 
> cache_dir aufs /var/spool/squid/ 40000 32 256
> 
>  
> 
> cache_swap_low 90
> 
> cache_swap_high 95
> 
>  
> 
> dns_nameservers 8.8.8.8 8.8.4.4
> 

See above.

>  
> 
> cache allow all

Not useful. That is the default action when the "cache" directive is
omitted entirely.

> 
> maximum_object_size 8000 MB
> 
> range_offset_limit 8000 MB
> 
> quick_abort_min 512 KB
> 
>  
> 
> cache_store_log /var/log/squid/store.log
> 
> access_log daemon:/var/log/squid/access.log squid
> 
> cache_log /var/log/squid/cache.log
> 
> coredump_dir /var/spool/squid
> 
>  
> 
> max_open_disk_fds 8000
> 
>  
> 
> vary_ignore_expire on
> 

The above should not be doing anything in current Squid releases, which
are HTTP/1.1 compliant. It is just a directive we have forgotten to
remove.

> request_entities on
> 
>  
> 
> refresh_pattern -i .*\.(gif|png|jpg|jpeg|ico|webp)$ 10080 100% 43200
> ignore-no-store ignore-private ignore-reload store-stale
> 
> refresh_pattern ^ftp: 1440 20% 10080
> 
> refresh_pattern ^gopher: 1440 0% 1440
> 
> refresh_pattern -i .*\.index.(html|htm)$ 2880 40% 10080
> 
> refresh_pattern -i .*\.(html|htm|css|js)$ 120 40% 1440
> 
> refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
> 
> refresh_pattern . 0 40% 40320
> 
>  
> 
> cache_mgr <my address>
> 
> cache_effective_user proxy
> 
> cache_effective_group proxy
> 
>  
> 
> <END SQUID.CONF>
> 
>  
> 
> There is a good deal of hacking that has gone into this configuration, and I
> accept that this will eventually be gutted and replaced with something less
> broken.

It is surprisingly good for all that :-)


>  Where I am pulling my hair out is trying to figure out why things
> are cached and then not cached.  That top refresh line (the one looking for
> jpg, gifs etc) has taken many forms, and I am getting inconsistent results.
> The above image will cache just fine, a couple times, but if I go back,
> clear the cache on the browser, close out, restart and reload, it releases
> the link and never again shall it cache.  What is worse, it appears to be
> getting worse over time until it isn't really picking up much of anything.
> What starts out as a few missed entries piles up into a huge list of cache
> misses over time.
> 

What Squid version is this? 0.1% seems extremely low, even for a proxy
having those Google DNS problems.

>  
> 
> Right now, I am running somewhere around a 0.1% hit rate, and I can only
> assume I have buckled something in all the compile and re-compiles, and
> reconfigurations.  What started out as "gee, I wonder if I can cache
> updates" has turned into quite the rabbit hole!
> 
>  
> 
> So, big question, what debug level do I use to see this thing making
> decisions on whether to cache, and any tips anyone has about this would be
> appreciated.  Thank you!

debug_options 85,3 22,3
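
The output goes to cache.log. Section 85 covers client-side request
processing and section 22 the refresh (freshness) calculation. If you
want the rest of the log kept at the normal level, something like this
should do:

 debug_options ALL,1 85,3 22,3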


Amos

