[squid-users] Inconsistent accessing of the cache, craigslist.org images, wacky stuff.
Jester Purtteman
jester at optimera.us
Wed Oct 28 16:30:27 UTC 2015
> -----Original Message-----
> From: squid-users [mailto:squid-users-bounces at lists.squid-cache.org] On
> Behalf Of Amos Jeffries
> Sent: Tuesday, October 27, 2015 9:07 PM
> To: squid-users at lists.squid-cache.org
> Subject: Re: [squid-users] Inconsistent accessing of the cache, craigslist.org
> images, wacky stuff.
>
> On 28/10/2015 2:05 p.m., Jester Purtteman wrote:
> > So, here is the problem: I want to cache the images on craigslist.
> > The headers all look thoroughly cacheable, but some browsers (I'm glaring
> > at you,
> > Chrome) send with this thing that requests that they not be cacheable,
>
> "this thing" being what exactly?
>
> I am aware of several nasty things Chrome sends that interfere with optimal
> HTTP use. But nothing that directly prohibits caching like you describe.
>
>
> > but
> > craigslist replies anyway and says "sure thing! Cache that sucker!"
> > and Firefox doesn't even do that. An example URL:
> > http://images.craigslist.org/00o0o_3fcu92TR5jB_600x450.jpg
> >
> >
> >
> > The request headers look like:
> >
> > Host: images.craigslist.org
> >
> > User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:41.0)
> > Gecko/20100101
> > Firefox/41.0
> >
> > Accept: image/png,image/*;q=0.8,*/*;q=0.5
> >
> > Accept-Language: en-US,en;q=0.5
> >
> > Accept-Encoding: gzip, deflate
> >
> > Referer: http://seattle.craigslist.org/oly/hvo/5288435732.html
> >
> > Cookie: cl_tocmode=sss%3Agrid; cl_b=hlJExhZ55RGzNupTXAYJOAIcZ80;
> > cl_def_lang=en; cl_def_hp=seattle
> >
> > Connection: keep-alive
> >
> >
> >
> > The response headers are:
> >
> > Cache-Control: public, max-age=2592000 <-- doesn't that say "keep
> > that a very long time"?
> >
>
> Not exactly. It says only that you are *allowed* to store it for 30 days. Does
> not say you have to.
>
> Your refresh_pattern rules will use that as the 'max' limit along with the
> below Date+Last-Modified header values when determining whether the
> response can be cached, and for how long.
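
(For the arithmetic behind "30 days": max-age is given in seconds, and
2592000 / 86400 = 30, so the server is permitting storage for up to 30 days.)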
>
>
> > Content-Length: 49811
> >
> > Content-Type: image/jpeg
> >
> > Date: Tue, 27 Oct 2015 23:04:14 GMT
> >
> > Last-Modified: Tue, 27 Oct 2015 23:04:14 GMT
> >
> > Server: craigslist/0
> >
> >
> >
> > Access log says:
> > 1445989120.714 265 192.168.2.56 TCP_MISS/200 50162 GET
> > http://images.craigslist.org/00Y0Y_kMkjOhL1Lim_600x450.jpg -
> > ORIGINAL_DST/208.82.236.227 image/jpeg
> >
>
> This is intercepted traffic.
>
> I've run some tests on that domain and it is another one presenting only a
> single IP address in each DNS result, but rotating through a whole set in the
> background depending on where the query comes from. As a result, different
> machines get different answers.
>
>
> What we found just the other day was that domains doing this have big
> problems when queried through the Google DNS servers. Because the Google
> DNS servers are spread around the world and load-balance their traffic,
> these sites can return different IPs on each and every lookup.
>
> The final outcome of all that is that when Squid tries to verify that the
> intercepted traffic was actually going where the client intended, it cannot
> confirm the ORIGINAL_DST server IP is one belonging to the Host header domain.
>
>
> The solution is to set up a DNS resolver in your network and use that instead
> of the Google DNS. You may have to divert clients' DNS queries to it if they
> still try to go to Google DNS. The result will be much more cacheable traffic
> and probably faster DNS as well.
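
(A rough sketch of what that can look like in practice, assuming a caching
resolver is installed on the proxy box itself and that the interface name
and addresses below are placeholders rather than values from this setup:

  # squid.conf: ask the local resolver instead of Google
  dns_nameservers 127.0.0.1

  # on the router, divert client DNS queries that still head to outside resolvers
  iptables -t nat -A PREROUTING -i eth1 -p udp --dport 53 -j REDIRECT --to-ports 53
  iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 53 -j REDIRECT --to-ports 53

The point is that Squid and the clients end up asking the same resolver, so the
host-verification check described above sees consistent IPs.)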
>
>
> >
> > And Store Log says:
> > 1445989120.714 RELEASE -1 FFFFFFFF 27C2B2CEC9ACCA05A31E80479E5F0E9C ?
> > ? ? ? ?/? ?/? ? ?
> >
> >
> >
> > I started out with a configuration from here:
> > http://wiki.sebeka.k12.mn.us/web_services:squid_update_cache but have
> > made a lot of tweaks to it. In fact, I've dropped all the updates,
> > all the rewrite, store id, and a lot of other stuff. I've set cache
> > allow all (which, I suspect, I can simply leave blank, but I don't
> > know). I've cut it down quite a bit; the one I am testing right now,
> > for example, looks like this:
> >
> >
> >
> > My squid.conf (which has been hacked mercilessly trying stuff,
> > admittedly) looks like this:
> >
> >
> >
> > <BEGIN SQUID.CONF >
> >
> > acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
> >
> > acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
> >
> > acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
> >
> > acl localnet src fc00::/7 # RFC 4193 local private network range
> >
> > acl localnet src fe80::/10 # RFC 4291 link-local (directly plugged)
> > machines
> >
> >
> >
> > acl SSL_ports port 443
> >
> > acl Safe_ports port 80 # http
> >
> > acl Safe_ports port 21 # ftp
> >
> > acl Safe_ports port 443 # https
> >
> > acl Safe_ports port 70 # gopher
> >
> > acl Safe_ports port 210 # wais
> >
> > acl Safe_ports port 1025-65535 # unregistered ports
> >
> > acl Safe_ports port 280 # http-mgmt
> >
> > acl Safe_ports port 488 # gss-http
> >
> > acl Safe_ports port 591 # filemaker
> >
> > acl Safe_ports port 777 # multiling http
> >
> > acl CONNECT method CONNECT
> >
> >
>
> You are missing the default security http_access lines. They should be
> reinstated even on intercepted traffic.
>
> acl SSL_Ports port 443
>
> http_access deny !Safe_ports
> http_access deny CONNECT !SSL_Ports
>
>
>
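
(For reference, a sketch of how those default lines slot in ahead of the allow
rules, using the Safe_ports and SSL_ports acls the config above already
defines:

  http_access deny !Safe_ports
  http_access deny CONNECT !SSL_ports
  http_access allow localnet
  http_access allow localhost
  http_access deny all

Order matters: http_access is first-match, so the deny rules need to come
before the allow rules, otherwise local clients are waved through before the
port checks ever run.)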
> >
> > http_access allow localnet
> >
> > http_access allow localhost
> >
> >
> >
> > # And finally deny all other access to this proxy
> >
> > http_access deny all
> >
> >
> >
> > http_port 3128
> >
> > http_port 3129 tproxy
> >
>
> Okay, assuming you have the proper iptables/ip6tables TPROXY rules set up
> to accompany it.
>
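
(For anyone following along, a minimal sketch of the usual Linux-side TPROXY
plumbing that has to accompany an "http_port ... tproxy" line, assuming port 80
is being intercepted on the box running Squid; the mark and routing-table
numbers are just the conventional placeholders:

  ip rule add fwmark 1 lookup 100
  ip route add local 0.0.0.0/0 dev lo table 100

  iptables -t mangle -N DIVERT
  iptables -t mangle -A DIVERT -j MARK --set-mark 1
  iptables -t mangle -A DIVERT -j ACCEPT
  iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
  iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
      --tproxy-mark 0x1/0x1 --on-port 3129

The same rules are repeated with ip6tables if IPv6 clients are intercepted.)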
>
> >
> > cache_dir aufs /var/spool/squid/ 40000 32 256
> >
> >
> >
> > cache_swap_low 90
> >
> > cache_swap_high 95
> >
> >
> >
> > dns_nameservers 8.8.8.8 8.8.4.4
> >
>
> See above.
>
> >
> >
> > cache allow all
>
> Not useful. That is the default action when the "cache" directive is omitted
> entirely.
>
> >
> > maximum_object_size 8000 MB
> >
> > range_offset_limit 8000 MB
> >
> > quick_abort_min 512 KB
> >
> >
> >
> > cache_store_log /var/log/squid/store.log
> >
> > access_log daemon:/var/log/squid/access.log squid
> >
> > cache_log /var/log/squid/cache.log
> >
> > coredump_dir /var/spool/squid
> >
> >
> >
> > max_open_disk_fds 8000
> >
> >
> >
> > vary_ignore_expire on
> >
>
> The above should not be doing anything in current Squid releases, which are
> HTTP/1.1 compliant. It is just a directive we have forgotten to remove.
>
> > request_entities on
> >
> >
> >
> > refresh_pattern -i .*\.(gif|png|jpg|jpeg|ico|webp)$ 10080 100% 43200
> > ignore-no-store ignore-private ignore-reload store-stale
> >
> > refresh_pattern ^ftp: 1440 20% 10080
> >
> > refresh_pattern ^gopher: 1440 0% 1440
> >
> > refresh_pattern -i .*\.index.(html|htm)$ 2880 40% 10080
> >
> > refresh_pattern -i .*\.(html|htm|css|js)$ 120 40% 1440
> >
> > refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
> >
> > refresh_pattern . 0 40% 40320
> >
> >
> >
> > cache_mgr <my address>
> >
> > cache_effective_user proxy
> >
> > cache_effective_group proxy
> >
> >
> >
> > <END SQUID.CONF>
> >
> >
> >
> > There is a good deal of hacking that has gone into this configuration,
> > and I accept that this will eventually be gutted and replaced with
> > something less broken.
>
> It is surprisingly good for all that :-)
>
>
> > Where I am pulling my hair out is trying to figure out why things are
> > cached and then not cached. That top refresh line (the one looking
> > for jpg, gif, etc.) has taken many forms, and I am getting inconsistent
> > results.
> > The above image will cache just fine, a couple of times, but if I go
> > back, clear the cache on the browser, close out, restart and reload,
> > it releases the link and never again shall it cache. What is worse,
> > it appears to be getting worse over time until it isn't really picking
> > up much of anything.
> > What starts out as a few missed entries piles up into a huge list of
> > cache misses over time.
> >
>
> What Squid version is this? 0.1% seems to be extremely low. Even for a proxy
> having those Google DNS problems.
>
> >
> >
> > Right now, I am running somewhere around a 0.1% hit rate, and I can
> > only assume I have buckled something in all the compile and
> > re-compiles, and reconfigurations. What started out as "gee, I wonder
> > if I can cache updates" has turned into quite the rabbit hole!
> >
> >
> >
> > So, big question, what debug level do I use to see this thing making
> > decisions on whether to cache, and any tips anyone has about this
> > would be appreciated. Thank you!
>
> debug_options 85,3 22,3
>
>
> Amos
> _______________________________________________
> squid-users mailing list
> squid-users at lists.squid-cache.org
> http://lists.squid-cache.org/listinfo/squid-users
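
(For anyone repeating this: the directive goes into squid.conf and takes
effect on a reload; a quick sketch, assuming the default log path from the
config above:

  debug_options 85,3 22,3

  squid -k reconfigure
  tail -f /var/log/squid/cache.log

Section 85 is client-side request handling, which is what produces the
host-verification lines quoted below; section 22 covers the refresh
calculations that decide cacheability.)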
Well that (debug_options 85,3 22,3) worked like a charm! I had the info I needed in about two seconds flat!
I am getting:
2015/10/28 09:16:54.075| 85,3| client_side_request.cc(532) hostHeaderIpVerify: FAIL: validate IP 208.82.238.226:80 possible from Host:
2015/10/28 09:16:54.075| 85,3| client_side_request.cc(543) hostHeaderVerifyFailed: SECURITY ALERT: Host header forgery detected on local=208.82.238.226:80 remote=192.168.2.56 FD 20 flags=17 (local IP does not match any domain IP) on URL: http://seattle.craigslist.org/favicon.ico
Based on http://wiki.squid-cache.org/KnowledgeBase/HostHeaderForgery, I believe this is saying that the IP address the client connected to and the one Squid resolved for the Host header are not the same. Bottom line: I think it is time for me to host a DNS server, so that at least the client's lookup and Squid's lookup will be more consistent. It sounds like that won't completely fix the issue, though; it is an inherent problem with transparent proxies. Time to read up on proxy autoconfiguration, it appears.
Thank you again!
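
(A minimal sketch of that "host a DNS server" step, assuming unbound is the
resolver chosen and that the listen address and LAN range below are
placeholders for this network rather than values taken from the thread:

  # /etc/unbound/unbound.conf
  server:
      interface: 127.0.0.1
      interface: 192.168.2.1
      access-control: 192.168.0.0/16 allow

Unbound does its own recursion, so both Squid (e.g. the dns_nameservers line
sketched earlier) and the LAN clients can be pointed at it and will get the
same answers, which is what the host-header verification needs.)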