[squid-dev] [PATCH] Retry cache peer DNS failures more frequently

Fri Jun 24 03:44:09 UTC 2016

On 24/06/2016 10:50 a.m., Nathan Hoad wrote:
> Hello,
> 
> Did anyone have any thoughts on the issues I had with this? I don't want
> this to slip through the cracks :)
> 

Sorry, half-wrote this reply a while back and didn't get time to finish
it ...

> 
> On 17 May 2016 at 15:57, Nathan Hoad wrote:
> 
>> Hello,
>>
>> Attached is a patch which makes the changes recommended by Amos - each
>> peer now gets its own event for retrying resolution, dependent on the DNS
>> TTL. This should also fix up the concerns raised by Alex. A few caveats though:
>>
>>  - the cache manager shows generic "peerRefreshDNS" names for each event.
>> I can't find any examples that give it a dynamic name, e.g. I'd like
>> something like "peerRefreshDNS(example.com)", but I can't think of how
>> I'd do that without leaking memory or making some significant changes to
>> the event handler system.
>>
>> - I can't figure out how to reproduce the second failure case, where a
>> result comes back but it has no IP addresses. I _think_ using the TTL
>> instead of negative_dns_ttl would be valid in that situation, but
>> I can't be sure. I figured this was the safest option.

The positive vs negative distinction is about whether the response code
is NXDOMAIN. NXDOMAIN uses the negative TTL, and other failures default
to that timer as well since it is required to be the shorter one.

Outside the DNS code itself an empty set is usually a failure. Inside
DNS it might be a partial response with A or AAAA etc. still pending.
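
A minimal sketch of that rule from the caller's side, in case it helps.
LookupResult and retryDelaySeconds() are hypothetical names made up for
illustration, not Squid's real DNS types, and the one-second floor on
the positive TTL is my own assumption to avoid a tight retry loop:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of a finished DNS lookup.
// This is NOT Squid's real result type.
struct LookupResult {
    bool nxdomain = false;               // response code was NXDOMAIN
    bool failed = false;                 // timeout, SERVFAIL, etc.
    std::vector<std::string> addresses;  // usable A/AAAA results
    int ttlSeconds = 0;                  // TTL taken from the answer
};

// Returns how long to wait before resolving the peer again.
// negativeTtl stands in for the configured negative_dns_ttl value.
int retryDelaySeconds(const LookupResult &r, int negativeTtl)
{
    assert(negativeTtl > 0);

    // NXDOMAIN, hard failures and an empty address set are all treated
    // as "negative" outside the DNS code: retry on the shorter timer.
    if (r.nxdomain || r.failed || r.addresses.empty())
        return negativeTtl;

    // Positive answer: follow the record TTL (floor of 1s is an assumption).
    return std::max(r.ttlSeconds, 1);
}
```

With this shape, the "result but no IP addresses" case simply falls into
the negative branch, which matches what the patch does now.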

>>
>>  - eventDelete does not appear to be clearing out events as I expect it
>> to, so if you reconfigure Squid you end up with some dead events, like so:

IIRC eventDelete matches on the function pointer value, a holdover from
the days when events were unique and serial things.

If you have a Call object stored somewhere handy (the CachePeer?) then
using call->cancel() is the way to go, letting the events queue do its
own cleaning.
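
Roughly the pattern I mean, as a standalone sketch. Call, EventQueue and
Peer here are toy stand-ins written for this email, not the real Squid
classes, so treat the names and members as assumptions:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a cancelable call object (not Squid's AsyncCall).
struct Call {
    std::function<void()> fn;
    bool canceled = false;
    void cancel() { canceled = true; }
};

// Toy event queue: it skips canceled calls when they come due, so the
// owner never has to search the queue to delete anything.
struct EventQueue {
    std::vector<std::shared_ptr<Call>> pending;

    void schedule(const std::shared_ptr<Call> &c) {
        assert(c);
        pending.push_back(c);
    }

    // Fires every non-canceled call, drops all entries, returns count fired.
    int runAll() {
        std::vector<std::shared_ptr<Call>> due;
        due.swap(pending); // a fired call may safely schedule a new one
        int fired = 0;
        for (auto &c : due) {
            if (c->canceled)
                continue; // dead event is cleaned up here
            c->fn();
            ++fired;
        }
        return fired;
    }
};

// Toy peer: keeps a handle to its own pending refresh so it can cancel it.
struct Peer {
    std::string host;
    std::shared_ptr<Call> refreshCall;
    int refreshes = 0;

    void scheduleRefresh(EventQueue &q) {
        if (refreshCall)
            refreshCall->cancel(); // at most one live refresh per peer
        refreshCall = std::make_shared<Call>();
        refreshCall->fn = [this] { ++refreshes; };
        q.schedule(refreshCall);
    }

    // On reconfigure the old peer goes away and takes its event with it.
    ~Peer() {
        if (refreshCall)
            refreshCall->cancel();
    }
};
```

The point is that each peer cancels exactly its own call, instead of
relying on a lookup keyed on the shared peerRefreshDNS function pointer.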

>>
>> [root@xxx ~]# squidmgr events | grep peerRefresh
>> Last event to run: peerRefreshDNS
>> peerRefreshDNS                  0.331 sec           1    yes
>> peerRefreshDNS                  0.679 sec           1    yes
>> peerRefreshDNS                  47.649 sec          1    yes
>> peerRefreshDNS                  61.619 sec          1    yes
>> peerRefreshDNS                  207.682 sec         1    yes
>> peerRefreshDNS                  207.682 sec         1    yes
>> peerRefreshDNS                  207.682 sec         1    yes
>> peerRefreshDNS                  207.682 sec         1    yes
>> peerRefreshDNS                  207.682 sec         1    yes
>> [root@xxx ~]# squid -k reconfigure
>> [root@xxx ~]# squidmgr events | grep peerRefresh
>> Last event to run: peerRefreshDNS
>> peerRefreshDNS                  0.763 sec           1    yes
>> peerRefreshDNS                  0.763 sec           1    yes
>> peerRefreshDNS                  41.755 sec          1    yes
>> peerRefreshDNS                  55.755 sec          1    yes
>> peerRefreshDNS                  56.187 sec          1    no
>> peerRefreshDNS                  202.250 sec         1    no
>> peerRefreshDNS                  202.250 sec         1    no
>> peerRefreshDNS                  3599.758 sec        1    yes
>> peerRefreshDNS                  3599.758 sec        1    yes
>> peerRefreshDNS                  3599.758 sec        1    yes
>> peerRefreshDNS                  3599.758 sec        1    yes
>> peerRefreshDNS                  3599.758 sec        1    yes
>>
>> If I run squid -k reconfigure again, then the events with invalid callback
>> data are cleared out, so it doesn't grow indefinitely at least. I'm not
>> sure how or if I should fix this.

The above post-reconfigure list tells me that at least one of the old
3600-second eventAdd calls is still being scheduled somewhere. That
needs to be removed now that peers schedule their own.

Amos