[squid-dev] [RFC] Do not use idle dead peers

Wed Mar 22 16:01:20 UTC 2017

On 03/21/2017 10:27 PM, Amos Jeffries wrote:
> On 22/03/2017 10:27 a.m., Alex Rousskov wrote:
>>     This Request For Comments proposes to remove a subtle Squid
>> (mis)feature. If you happen to use the corresponding feature, please
>> speak up to protect it! If nobody defends this feature, we may remove it
>> (in order to get rid of its bad side effects).

> FYI: fixing the probe
> behaviour would be useful, complete removal is not an option (yet).

I suspect your "not an option" opinion is based on incorrect assumption
about what I propose to remove (it is not the probing!) and/or what
probes remain after the proposed changes (all of them remain unchanged!).

> From squid-users traffic is appears still quite common for
> administrators to disable all of ICP, HTCP, netDB, digests, and ICMP.

Agreed.

> the probes in question are the only way of the peer to be
> detected as alive again.

This is incorrect. TCP probes (which cannot be disabled) are the primary
mechanism for dead peer revival. I do not propose to remove them (or any
other probe). This RFC does not even propose to change them. TCP probes
are initiated in peerProbeConnect(). That function does not go anywhere;
at the end of this email, I have re-quoted the summary diff showing that
function call for your reference.

> IFAIK touching this feature will result in changes to what status code
> and/or error page Squid produces.

Yes, an individual transaction outcome might change in some cases
(usually for the better). No, the set of possible status codes/errors
across all cases remains exactly the same as before the change. That set
will continue to contain "cannot find a way to forward" and various "I
tried to forward but failed" errors.

> It currently should (modulo bugs) result in connection-failed or timeout
> types of final error going to the client as the victum requests actively
> fail to be forwarded to the truely-dead peer.

Yes, for the first failure (or a few first concurrent failures) after
the dead peer becomes "idle". After the first failure (with an idle dead
peer), the dead peer is no longer considered idle, and Squid stops using
it for a while (until the peer becomes idle again). To paraphrase the
current code:

  const time_t ctimeout = peerConnectTimeout(p);
  const bool thisDeadPeerIsIdle =
    (squid_curtime - p->stats.last_connect_failure) > 10*ctimeout;

> After the change I would expect to see more being successful trnsactions
> - but only in the presence of other live peers or DIRECT being
> permitted.

Yes. The exact preconditions for "more successful transactions" are more
complex than summarized above, but I do not think those details are
important for this discussion.

> The ones which remain unsuccessful
> no-forwarding-path errors since now no paths would be detected as usable.
>  That remaining error case needs to be checked for exact status code and
> how it impacts on caching hierarchies and retry logic.

The set of possible errors for unsuccessful transactions does not
change: Some will try to connect and fail (because the peer died after
it was selected), and some will fail because there is still no
forwarding path left (because all peers are dead). Only the number of
each error in specific use cases may change, and even that change in
numbers should be minor in most use cases because only the first failure
(or the first few concurrent failures) when talking to an idle dead peer
are affected.

> Since the proposed change is so small has any of that behaviour been
> tested already?

The change has not been implemented or tested (I would have posted a
PATCH instead of an RFC if it were), but the expected behavior already
happens everyday in unpatched Squid.

> Other possibilities (if we have to) would be;
>  a) to flag the situation in the FwdServer object used to construct the
> destination list and ensure these peers are put as absolute last-resort
> destinations instead of where they would normally be placed.

IMO, it is conceptually wrong to use dead peers just because they are
"idle". While we can indeed add dead peers as "last resort", doing so
should not depend on the dead peer being "idle". Thus, this kind of
feature is outside this RFC scope (but it will become easier to
implement if the proposed change is committed).

> Or maybe simpler;
>  b) a new peer select algorithm to find only this type of peer and add
> them only if all other algorithms fail to produce a list.

Yes, this is how the "use dead peers as last resort" idea in (a) should
be implemented if it is implemented. It is outside this RFC scope though.

>> Does anybody need this "use idle dead peers" feature?

> I expect everyone who has explicitly disabled all other types of
> peer-ALIVE detection and where never_direct forbids bypassing the peer.

Does the fact that TCP checks cannot be disabled and remain the primary
way of reviving dead peers change your answer?

Thank you,

Alex.

>  /// OK to add this peer to the list of possible request destinations?
>  bool
>  neighborUp(const CachePeer *p)
>  {
>      if (!p->tcp_up) { // the peer is DEAD
> -        if (!peerProbeConnect(p))
> +        peerProbeConnect(p);
>              return false;
>      }
>  
>      if (p->options.no_query)
>          return true; // yes, add this peer