[squid-users] Inconsistent accessing of the cache, craigslist.org images, wacky stuff.

Fri Oct 30 03:09:48 UTC 2015

> -----Original Message-----
> From: squid-users [mailto:squid-users-bounces at lists.squid-cache.org] On
> Behalf Of Amos Jeffries
> Sent: Thursday, October 29, 2015 1:28 AM
> To: squid-users at lists.squid-cache.org
> Subject: Re: [squid-users] Inconsistent accessing of the cache, craigslist.org
> images, wacky stuff.
> 
> On 29/10/2015 3:02 p.m., Jester Purtteman wrote:
> > but my bigger question is:  if I set up a parent proxy that ONLY grabs
> > the big updates down on my big-fast-cheap connection, then set my
> > little-slow-expensive-connection up to pull from that connection,
> > would that have a higher chance of success?
> > Since the proxy on the slow system is requesting the same object, I'm
> > wondering if that may work out better.  Not sure that will have the
> > desired effect, but I'm going to try it out, I'll let you know how
> > that works out.
> 
> I don't quite grok that, sorry. Can you diagram what you are thinking?
> 
> With a front-end proxy you would start to see revalidation requests
> happening between the proxies. Due to many origin servers still sending out
> new content even if it has not changed, this setup can result in small
> bandwidth savings just by existing. The main gain is helping to optimize the
> traffic bandwidth and reducing TCP connection count over long connections
> like satellite links, where the target is optimizing the bandwidth behaviour
> (reducing it is just part of that).
> 
> If the frontend has a bigger cache than the backend you will see churn and
> extra bandwidth consumption as repeats get served from the frontend
> cache. But the origin traffic upstream of it will stay low. This is good if the
> internal links are fast and upstream is slow. Like most LAN situations. It is
> usually best to have a cache on the client side of a choke point (slow
> connection).
> 
> Amos
> 

We've got a couple of thoughts going at once here, so let me condense them a bit.  First, yes, this is coming in over a satellite link, and that is part of the bugger.  Nothing like 560 ms of latency to bring a connection to a halt.  Part of my plan is exactly as you say: optimize the links by setting huge TCP windows and all the rest so that I can get full bandwidth.  The other part of the story (and I could just be misunderstanding this too) is what happens when, say, 3 or 4 clients request the same file while it is still downloading.  If any one of them (or maybe just the last one; I haven't tested enough to know the exact course of events) ends up requesting an IP different from the one that was looked up, Squid appeared to drop the file.

>I think a worse problem is if the DNS TTL is shorter than a client connection's TCP connected time.
>Then requests arriving after the DNS TTL expired would no longer match the initial dst-IP.

That is what I think I was seeing, if by that you mean the following:  clients A, B, and C all request a large file (a few hundred MB), and the download takes more than 300 seconds (which has become a pretty common TTL; when did that happen?).  Then D requests it too, but the DNS record updates while the file is still coming in, so the request suddenly gets flagged as a host forgery and the object is no longer cacheable.  I could be wrong, so I need to experiment, but I think that's what I am seeing.

My crazy solution:  I have a server on a fast connection, and I set up a cache there with a pretty big minimum and maximum object size (say, 10 MB minimum, 8,000 MB maximum).  That cache acts as a parent to the cache out at the slow end of the universe, which is a transparent proxy.  The transparent proxy then uses the parent proxy to request the files.  When the files happen to be very big, I set up the fast side to pre-cache them (because a 100 MB file is a piece of cake for a 100 Mbps connection) and store them, since the download time is trivial compared to the DNS TTL.  I set up the cache on the slow end to cache more aggressively, but the point is that once the cache down south has the file, the cache up north is requesting it from a system much better optimized for pulling big files over, and that improves the odds that the DNS has not updated before the transfer completes.
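
To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  It ignores latency, TCP ramp-up and protocol overhead, and the 2 Mbps figure for the slow link is purely an assumption for illustration, not a measurement of my satellite connection:

```python
# Rough estimate of how long a download takes versus a DNS TTL.
# Assumes the link is fully utilised, which a real satellite link will not be.

DNS_TTL_SECONDS = 300  # the common TTL mentioned above


def transfer_seconds(size_megabytes, link_megabits_per_second):
    """Seconds needed to move size_megabytes over the given link."""
    return size_megabytes * 8 / link_megabits_per_second


def outlives_ttl(size_megabytes, link_megabits_per_second, ttl_seconds=DNS_TTL_SECONDS):
    """True if the download is still running when the DNS record expires."""
    return transfer_seconds(size_megabytes, link_megabits_per_second) > ttl_seconds


# 100 MB over the fast 100 Mbps link: finished long before the TTL runs out.
fast_link = transfer_seconds(100, 100)

# Same file over a hypothetical 2 Mbps slow link: still running after 300 s.
slow_link = transfer_seconds(100, 2)
```

With those assumed numbers the fast side needs about 8 seconds and the slow side about 400, which is the whole argument for letting the parent fetch the big objects first.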

I'm not convinced my idea is valid, so I'll have to ponder it a bit, but I'm going to give it a shot and let you know if it makes a difference.  Bottom line:  it is a pretty nasty workaround, and there is probably a better solution if someone out there who knows C worth beans is into it.  I don't think there are ANY answers that don't involve setting up your own DNS, but after configuring BIND in about 7 minutes last night, I am thinking that's not a big issue.  The obvious answers I can think of are (1) to maintain a short table of IPs associated with a specific domain request until all transfers referring back to it have finished, and rewrite the DNS resolution calls to refer to that table, or (2) to tag the requested IP and the resolved IP.
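
To make option (1) a bit more concrete, here is a toy model in Python.  This is only a sketch of the idea, not how Squid works internally; the class and method names are made up for illustration, and a real version would have to live in Squid's C++ DNS code:

```python
from collections import defaultdict


class PinnedAddressTable:
    """Hypothetical table that remembers every IP seen for a domain
    for as long as at least one transfer for that domain is in flight."""

    def __init__(self):
        self._addresses = defaultdict(set)  # domain -> IPs seen so far
        self._active = defaultdict(int)     # domain -> transfers in flight

    def start_transfer(self, domain, resolved_ips):
        """Record the IPs resolved for this request and count the transfer."""
        self._addresses[domain].update(resolved_ips)
        self._active[domain] += 1

    def finish_transfer(self, domain):
        """Forget the domain once its last transfer has completed."""
        self._active[domain] -= 1
        if self._active[domain] <= 0:
            self._addresses.pop(domain, None)
            self._active.pop(domain, None)

    def is_known(self, domain, ip):
        """True if ip was a valid answer for domain during an open transfer."""
        return ip in self._addresses.get(domain, set())


table = PinnedAddressTable()
table.start_transfer("example.org", {"192.0.2.10"})  # clients A, B, C
table.start_transfer("example.org", {"192.0.2.20"})  # client D, after DNS rotated
still_valid = table.is_known("example.org", "192.0.2.10")
```

The point of the reference count is that the older address stays acceptable only while something is still downloading; once the last transfer finishes, the table entry goes away and normal DNS resolution takes over again.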

The last line of C I wrote was in the 90s, but I'll dig in and see if I can find the right place to start making a mess :).

In any event, you and Eliezer have helped me get farther since Tuesday night than I had gotten since August.  Thank you both!