[squid-users] Inconsistent accessing of the cache, craigslist.org images, wacky stuff.
Jester Purtteman
jester at optimera.us
Wed Oct 28 15:06:47 UTC 2015
> -----Original Message-----
> From: squid-users [mailto:squid-users-bounces at lists.squid-cache.org] On
> Behalf Of Amos Jeffries
> Sent: Tuesday, October 27, 2015 9:07 PM
> To: squid-users at lists.squid-cache.org
> Subject: Re: [squid-users] Inconsistent accessing of the cache, craigslist.org
> images, wacky stuff.
>
> On 28/10/2015 2:05 p.m., Jester Purtteman wrote:
> > So, here is the problem: I want to cache the images on craigslist.
> > The headers all look thoroughly cacheable; some browsers (I'm glaring
> > at you, Chrome) send with this thing that requests that they not be cacheable,
>
> "this thing" being what exactly?
>
Thing -> the rest of the request. (You'd think someone who has spoken a language their entire life could use it properly, but clearly I still need practice :)
> I am aware of several nasty things Chrome sends that interfere with optimal
> HTTP use. But nothing that directly prohibits caching like you describe.
>
The Chrome version of the headers has two lines that make my eye twitch:
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
Which (unless I don't understand what's going on, which is quite possible) means: "I don't want the response cached, and if possible, could we securely transfer this picture of an old overpriced tractor? It's military-grade intelligence that bad guys are trying to steal." Am I interpreting that wrong?
>
> > but
> > craigslist replies anyway and says "sure thing! Cache that sucker!"
> > and Firefox doesn't even do that. An example URL:
> > http://images.craigslist.org/00o0o_3fcu92TR5jB_600x450.jpg
> >
> >
> >
> > The request headers look like:
> >
> > Host: images.craigslist.org
> >
> > User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:41.0)
> > Gecko/20100101
> > Firefox/41.0
> >
> > Accept: image/png,image/*;q=0.8,*/*;q=0.5
> >
> > Accept-Language: en-US,en;q=0.5
> >
> > Accept-Encoding: gzip, deflate
> >
> > Referer: http://seattle.craigslist.org/oly/hvo/5288435732.html
> >
> > Cookie: cl_tocmode=sss%3Agrid; cl_b=hlJExhZ55RGzNupTXAYJOAIcZ80;
> > cl_def_lang=en; cl_def_hp=seattle
> >
> > Connection: keep-alive
> >
> >
> >
> > The response headers are:
> >
> > Cache-Control: public, max-age=2592000 <-- doesn't that say "keep
> > that a very long time"?
> >
>
> Not exactly. It says only that you are *allowed* to store it for 30 days. Does
> not say you have to.
>
> Your refresh_pattern rules will use that as the 'max' limit along with the
> below Date+Last-Modified header values when determining whether the
> response can be cached, and for how long.
>
Gotcha.
>
> > Content-Length: 49811
> >
> > Content-Type: image/jpeg
> >
> > Date: Tue, 27 Oct 2015 23:04:14 GMT
> >
> > Last-Modified: Tue, 27 Oct 2015 23:04:14 GMT
> >
> > Server: craigslist/0
> >
> >
> >
> > Access log says:
> > 1445989120.714 265 192.168.2.56 TCP_MISS/200 50162 GET
> > http://images.craigslist.org/00Y0Y_kMkjOhL1Lim_600x450.jpg -
> > ORIGINAL_DST/208.82.236.227 image/jpeg
> >
>
> This is intercepted traffic.
>
> I've run some tests on that domain and it is another one presenting only
> a single IP address in DNS results, but rotating through a whole set in the
> background depending on from where it gets queried. As a result different
> machines get different results.
>
>
> What we found just the other day was that domains doing this have big
> problems when queried through Google DNS servers. Due to the way Google
> DNS servers are spread around the world and load-balance their traffic,
> these sites can return different IPs on each and every lookup.
>
> The final outcome of all that is when Squid tries to verify the intercepted
> traffic was actually going where the client intended, it cannot confirm the
> ORIGINAL_DST server IP is one belonging to the Host header domain.
>
>
> The solution is to set up a DNS resolver in your network and use that instead
> of Google DNS. You may have to divert clients' DNS queries to it if they still
> try to go to Google DNS. The result will be much more cacheable traffic and
> probably faster DNS as well.
>
>
So, getting lazy and using 8.8.8.8 because I didn't want to remember which server I installed bind or dnsmasq on has finally come back to haunt me... I actually had a nightmare of a time getting another system working because of this same problem, so I'm rating this explanation as highly plausible. I'll rework the DNS setup, and if that fixes the issue, I'll let you know.
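(Sketching the plan here in case it helps anyone else; I haven't applied it yet, so treat it as a rough outline. 192.168.2.1 is just a stand-in for whatever box ends up running bind or dnsmasq, and eth1 for the LAN-facing interface on the router.)

  # squid.conf: point Squid at the local resolver instead of Google
  dns_nameservers 192.168.2.1

  # on the router: divert clients that are still hard-coded to 8.8.8.8
  iptables -t nat -A PREROUTING -i eth1 -p udp --dport 53 -j DNAT --to-destination 192.168.2.1
  iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 53 -j DNAT --to-destination 192.168.2.1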
> >
> > And Store Log says:
> > 1445989120.714 RELEASE -1 FFFFFFFF 27C2B2CEC9ACCA05A31E80479E5F0E9C ? ? ? ? ?/? ?/? ? ?
> >
> >
> >
> > I started out with a configuration from here:
> > http://wiki.sebeka.k12.mn.us/web_services:squid_update_cache but have
> > made a lot of tweaks to it. In fact, I've dropped all the updates, all
> > the rewrites, store id, and a lot of other stuff. I've set cache allow
> > all (which I suspect I can simply leave blank, but I don't know). I've
> > cut it down quite a bit; the one I am testing right now, for example,
> > looks like this:
> >
> >
> >
> > My squid.conf (which has been hacked mercilessly trying stuff,
> > admittedly) looks like this:
> >
> >
> >
> > <BEGIN SQUID.CONF >
> >
> > acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
> >
> > acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
> >
> > acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
> >
> > acl localnet src fc00::/7 # RFC 4193 local private network range
> >
> > acl localnet src fe80::/10 # RFC 4291 link-local (directly plugged)
> > machines
> >
> >
> >
> > acl SSL_ports port 443
> >
> > acl Safe_ports port 80 # http
> >
> > acl Safe_ports port 21 # ftp
> >
> > acl Safe_ports port 443 # https
> >
> > acl Safe_ports port 70 # gopher
> >
> > acl Safe_ports port 210 # wais
> >
> > acl Safe_ports port 1025-65535 # unregistered ports
> >
> > acl Safe_ports port 280 # http-mgmt
> >
> > acl Safe_ports port 488 # gss-http
> >
> > acl Safe_ports port 591 # filemaker
> >
> > acl Safe_ports port 777 # multiling http
> >
> > acl CONNECT method CONNECT
> >
> >
>
> You are missing the default security http_access lines. They should be
> reinstated even on intercepted traffic.
>
> acl SSL_Ports port 443
>
> http_access deny !Safe_ports
> http_access deny CONNECT !SSL_Ports
>
The problem with an intermittent issue is that sometimes unrelated changes make it "work", and then later don't, and you end up with some quirky stuff.
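I'll put the stock block back in above the allow rules. If I remember the shipped squid.conf right, it is roughly:

  http_access deny !Safe_ports
  http_access deny CONNECT !SSL_ports
  http_access allow localhost manager
  http_access deny manager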
>
>
> >
> > http_access allow localnet
> >
> > http_access allow localhost
> >
> >
> >
> > # And finally deny all other access to this proxy
> >
> > http_access deny all
> >
> >
> >
> > http_port 3128
> >
> > http_port 3129 tproxy
> >
>
> Okay, assuming you have the proper iptables/ip6tables TPROXY rules setup
> to accompany it.
>
>
That part, at least, is working: it gets and caches *some* traffic, just an oddly small amount of it.
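For anyone who finds this thread later: the TPROXY rules in question are basically the stock recipe from the Squid wiki's Tproxy4 page, something along the lines of:

  iptables -t mangle -N DIVERT
  iptables -t mangle -A DIVERT -j MARK --set-mark 1
  iptables -t mangle -A DIVERT -j ACCEPT
  iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
  iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY --tproxy-mark 0x1/0x1 --on-port 3129
  ip rule add fwmark 1 lookup 100
  ip route add local 0.0.0.0/0 dev lo table 100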
> >
> > cache_dir aufs /var/spool/squid/ 40000 32 256
> >
> >
> >
> > cache_swap_low 90
> >
> > cache_swap_high 95
> >
> >
> >
> > dns_nameservers 8.8.8.8 8.8.4.4
> >
>
> See above.
>
> >
> >
> > cache allow all
>
> Not useful. That is the default action when the "cache" directive is omitted
> entirely.
>
See above comments about desperation and errors :)
> >
> > maximum_object_size 8000 MB
> >
> > range_offset_limit 8000 MB
> >
> > quick_abort_min 512 KB
> >
> >
> >
> > cache_store_log /var/log/squid/store.log
> >
> > access_log daemon:/var/log/squid/access.log squid
> >
> > cache_log /var/log/squid/cache.log
> >
> > coredump_dir /var/spool/squid
> >
> >
> >
> > max_open_disk_fds 8000
> >
> >
> >
> > vary_ignore_expire on
> >
>
> The above should not be doing anything in current Squid versions, which are
> HTTP/1.1 compliant. It is just a directive we have forgotten to remove.
>
> > request_entities on
> >
> >
> >
> > refresh_pattern -i .*\.(gif|png|jpg|jpeg|ico|webp)$ 10080 100% 43200
> > ignore-no-store ignore-private ignore-reload store-stale
> >
> > refresh_pattern ^ftp: 1440 20% 10080
> >
> > refresh_pattern ^gopher: 1440 0% 1440
> >
> > refresh_pattern -i .*\.index.(html|htm)$ 2880 40% 10080
> >
> > refresh_pattern -i .*\.(html|htm|css|js)$ 120 40% 1440
> >
> > refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
> >
> > refresh_pattern . 0 40% 40320
> >
> >
> >
> > cache_mgr <my address>
> >
> > cache_effective_user proxy
> >
> > cache_effective_group proxy
> >
> >
> >
> > <END SQUID.CONF>
> >
> >
> >
> > There is a good deal of hacking that has gone into this configuration,
> > and I accept that this will eventually be gutted and replaced with
> > something less broken.
>
> It is surprisingly good for all that :-)
>
>
> > Where I am pulling my hair out is trying to figure out why things are
> > cached and then not cached. That top refresh line (the one looking
> > for jpg, gif, etc.) has taken many forms, and I am getting inconsistent
> > results. The above image will cache just fine, a couple of times, but
> > if I go back, clear the cache on the browser, close out, restart and
> > reload, it releases the link and never again shall it cache. What is
> > worse, it appears to be getting worse over time until it isn't really
> > picking up much of anything.
> > What starts out as a few missed entries piles up into a huge list of
> > cache misses over time.
> >
>
> What Squid version is this? 0.1% seems to be extremely low. Even for a proxy
> having those Google DNS problems.
I've compiled bleeding edge (squid-3.5.10-20151001-r13933). I had been using the Ubuntu 14.04 packaged build for a while, but there were enough things released in 3.4 and 3.5 that I decided to just suck it up and compile it. Since this is running on a virtual machine anyway, I reinstalled the VM to keep from having any contamination between the packaged squid and the compiled one.
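(Worth double-checking which binary is actually running after that kind of shuffle; something like:

  which squid
  squid -v

shows the path, the version string, and the configure options the build was made with.)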
Update on statistics: my access log ran for a couple of hours (it wasn't catching much, but it wasn't breaking anything either, so I let it run). In that time there was about 4.2 GB of traffic and the cache grew to 500 MB (which actually seems about right), but there were only 3,050 HIT/REFRESH entries combined out of 173,915 lines in the access log. So I'd like to revise that number to 1.7% (the 0.1% was with a nearly empty cache, not a really fair example!). That still seems pretty low, but it is a LOT BETTER than 0.1%!
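(For anyone wanting to reproduce that count: a quick one-liner along these lines gives the per-result-code breakdown, assuming the default native access log format where field 4 is the RESULT_CODE/STATUS pair:

  awk '{split($4, a, "/"); n[a[1]]++} END {for (c in n) print c, n[c]}' /var/log/squid/access.log

and the HIT-ish codes can then be summed against the total.)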
So, bottom line, I may have cried wolf too soon in some respects (it's doing something), but the fact that pretty cacheable-looking stuff from craigslist is being dropped makes me think the DNS issue is doing bad things to me. I'll fix that!
>
> >
> >
> > Right now, I am running somewhere around a 0.1% hit rate, and I can
> > only assume I have buckled something in all the compiles,
> > re-compiles, and reconfigurations. What started out as "gee, I wonder
> > if I can cache updates" has turned into quite the rabbit hole!
> >
> >
> >
> > So, big question, what debug level do I use to see this thing making
> > decisions on whether to cache, and any tips anyone has about this
> > would be appreciated. Thank you!
>
> debug_options 85,3 22,3
>
I have used 22,3, which I gleaned from another post on this list, and I find a lot of this in my cache.log:
2015/10/27 18:23:18.402| ctx: enter level 0: 'http://images.craigslist.org/00707_cL1v48AjUBR_300x300.jpg'
2015/10/27 18:23:18.402| 22,3| http.cc(328) cacheableReply: NO because e:=p2XDIV/0x24afa00*3 has been released.
2015/10/27 18:23:18.409| ctx: exit level 0
I'll let you know if fixing DNS takes that out.
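I'll also add section 85 as suggested; for reference, that is just a squid.conf line plus a reconfigure, something like:

  debug_options ALL,1 22,3 85,3
  squid -k reconfigure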
>
> Amos
> _______________________________________________
> squid-users mailing list
> squid-users at lists.squid-cache.org
> http://lists.squid-cache.org/listinfo/squid-users