[squid-dev] [PATCH] Retry cache peer DNS failures more frequently

Tue May 10 08:03:04 UTC 2016

On 05/09/2016 08:41 PM, Nathan Hoad wrote:
> +NAME: cache_peer_negative_dns_ttl
> +COMMENT: time-units
> +TYPE: time_t
> +LOC: Config.cachePeerNegativeDnsTtl
> +DEFAULT: 1 minutes
> +DOC_START
> +	How often to retry failed DNS lookups for cache peers.

It is not actually clear what this means because a DNS lookup failure is
defined for a single DNS query but the text above appears to be talking
about a group of cache peers (multiple failed queries). Does "How often"
refer to the frequency of retries for any given failing peer? I do not
think so.

AFAICT, the code actually matches this description:

"A delay before retrying a failed DNS lookup for a cache peer. The
option support is currently unreliable if multiple peers experience DNS
lookup failures: Some of those failed lookups may be retried much sooner
or much later than the configured value".

> +	It is not recommended to set this lower than negative_dns_ttl, as the
> +	cached negative responses will prevent prevent the DNS lookups from
> +	succeeding.

s/prevent prevent/prevent/

A subtle problem here is that even if I set cache_peer_negative_dns_ttl
to exceed negative_dns_ttl, Squid may retry the failed lookup much
sooner (if there were lookup failures for another peer earlier), hitting
that negative DNS cache for the just-failed peer lookup.

> + eventAdd("peerRefreshFailedDNS", peerRefreshFailedDNS, NULL, Config.cachePeerNegativeDnsTtl, 1);

Nitpick: I know you are copying this code, but this is not a
cbdata-protected event -- the last/missing parameter should be "false"
instead of the default "true", right?

> +    for (p = Config.peers; p; p = p->next) {
> +        if (p->in_addr.isAnyAddr()) {
> +            ipcache_nbgethostbyname(p->host, peerDNSConfigure, p);
> +        }
> +    }

This creates two semi-competing event-based loops: The new
peerRefreshFailedDNS() and the old peerRefreshDNS() that might fire in
close proximity of each other, creating multiple peerDNSConfigure()
calls with the same DNS lookup result. I hope Squid can handle this, but
I see peerDNSConfigure scheduling more events and calling other global
functions so the situation might become iffy in some cases...

> +    if (eventFind(peerRefreshFailedDNS, NULL))
> +        eventDelete(peerRefreshFailedDNS, NULL);

IIRC, this code still allows an already scheduled/queued
peerRefreshFailedDNS async call to fire, also causing multiple
peerDNSConfigure() calls with the same DNS lookup result.

Also, if the timing of various events is just right, I suspect the above
eventDelete() may delete the peerRefreshFailedDNS event meant for a
failed lookup that has been waiting for a while but not enough for the
ipcache_nbgethostbyname() to succeed right away. AFAICT, this
combination will lead to that peerRefreshFailedDNS event rescheduled,
effectively increasing that peer wait time.

Overall, I think the code suffers from applying the near-copy of the
peerRefreshDNS() meant for all peers at once to _individual_ cache peer
lookup failures. This misapplication leads to messy or uncertain
situations. I would not be surprised if some of these problems exist in
the current code as well, but they seem to be significantly more likely
in the patched code.

Is there a better way?

Please note that this code quality matches that of a lot of other Squid
code -- it works when "things are OK" and requires a growing set of
hacks to avoid problems when external events do not go exactly as
planned. If you insist, it can be accepted, I guess.

Thank you,

Alex.