[squid-users] [RFC] Do not use idle dead peers

Tue Mar 21 21:07:40 UTC 2017

Hello,

    This Request For Comments proposes to remove a subtle Squid
(mis)feature. If you happen to use the feature detailed below or know
somebody who does, please speak up to protect it! If nobody defends this
feature, we may remove it (to get rid of its bad side effects).

If you use cache_peers, you know that when a peer cannot be reached,
Squid tries a few times (see cache_peer connect-fail-limit; default is
10 times) and then declares the peer "dead". For example:

> 2017/03/21 10:11:46.380| TCP connection to 127.0.0.4/80 failed
> 2017/03/21 10:11:46.380| TCP connection to 127.0.0.4/80 failed
...
> 2017/03/21 10:11:46.394| TCP connection to 127.0.0.4/80 failed
> 2017/03/21 10:11:46.394| Detected DEAD Parent: peer4

Normally, Squid does not forward HTTP transactions to dead peers because
doing so is likely to cause timeouts and other problems. Squid has
mechanisms that detect revived (i.e., no longer dead) peers without
sending regular HTTP requests to peers considered dead. One such
mechanism is the TCP probe, which checks whether a TCP connection to
the dead peer can be opened again.
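
As a rough illustration of the bookkeeping described above, here is a
minimal Python sketch. It is a toy model, not Squid's actual code: the
PeerHealth class, its fields, and its methods are hypothetical names,
and the sketch assumes that only consecutive connection failures count
toward connect-fail-limit.

```python
from dataclasses import dataclass


@dataclass
class PeerHealth:
    """Toy model of per-peer liveness tracking (hypothetical, not Squid code)."""

    name: str
    connect_fail_limit: int = 10  # default quoted in the text above
    consecutive_failures: int = 0
    dead: bool = False

    def record_connect_failure(self) -> None:
        """Count a failed TCP connection; declare the peer dead at the limit."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.connect_fail_limit:
            self.dead = True

    def record_connect_success(self) -> None:
        """A successful connection (e.g., a TCP probe) revives the peer."""
        self.consecutive_failures = 0
        self.dead = False


peer = PeerHealth("peer4")
for _ in range(peer.connect_fail_limit):
    peer.record_connect_failure()
print(peer.name, "dead" if peer.dead else "alive")
```

In this model, a revived peer starts again with a clean failure count;
whether real Squid does exactly that is not something this sketch claims.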

There are several problems with dead peer handling, and we are working
on fixing some of them, but this RFC focuses on one specific feature:

* Squid may forward an HTTP request to an otherwise eligible but dead
peer that was idle[1] for some time[2].

This "use idle dead peer" feature was introduced as a small part of a
much bigger fix for bug #14. AFAICT, the stated goal of the feature was
speeding up failure recovery:

> revno: 6631
> timestamp: Sat 2004-04-03 21:07:38 +0000
> message:
>   Bug #14: connection setup may look like syn flood attack if server is
>   refusing connection
>   
>   If the contacted server refuses connection then the repeated attempts to
>   connect to the server may look like a syn flood attack. This patch makes
>   Squid behave a little friendler in such case and:
> ...
>    * Cleanup of peer TCP probing to correct timeout management etc and to
>   more promptly recover after a failure.

The "more promptly recover after a failure" phrase probably refers to
eliminating the delay of one TCP connect(2) before the peer is used
again or, to be more precise, the delay between the following two events:

* Start:  An HTTP transaction initiates a background TCP connect probe
          (but is not sent to the dead idle peer).

* Finish: A successful result of a TCP probe initiated above
          (allowing future transactions to use the revived peer).

AFAICT, the feature justification/logic goes something like this: If
there were no failures for a while then perhaps the peer is not dead
anymore. Let's try using it for the current HTTP transaction and see
what happens. If we are lucky, we will start using the peer sooner!

Since the lack of failures does not imply success, the feature may lead
to regular HTTP client transactions being sent to a truly dead peer.
Such transactions may experience delays (at best) or client
disconnects/errors (at worst), depending on Squid and client
configurations/state.

IMO, Squid should not risk regular HTTP transactions this way, and the
actual benefits of such risks are slim in most environments. Thus, we
should remove this feature and simply let existing TCP probes revive
dead peers. Removing the feature does not increase the number of TCP
probes, and it does not delay HTTP transactions as such (it only delays
the moment when Squid can resume using the peer).

Does anybody need this "use idle dead peers" feature?

[1] Here, an "idle" peer is essentially a peer that Squid did not probe
or otherwise contact for a while[2]. Peers become idle if they are not
selected by peering algorithms as potential forwarding destinations
(e.g., a dead round-robin parent with very low weight is likely to
become idle even if its "heavy" cousins remain very busy).

[2] The inactivity time associated with becoming idle is calculated as
ten times the peer_connect_timeout (or ten times cache_peer
connect-timeout when set). It defaults to 10*30 seconds or 5 minutes.
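
To make the arithmetic in [2] concrete, here is a small Python sketch of
the idle threshold and of the "use idle dead peer" check this RFC
proposes to remove. The function names, the parameter names, and the
strict ">" comparison are assumptions made for illustration; this is not
a copy of Squid's source code.

```python
from typing import Optional


def idle_threshold(peer_connect_timeout: float = 30.0,
                   cache_peer_connect_timeout: Optional[float] = None) -> float:
    """Return the inactivity time (seconds) after which a dead peer is "idle".

    The per-peer connect-timeout, when set, takes precedence over the
    global peer_connect_timeout, as described in footnote [2].
    """
    if cache_peer_connect_timeout is not None:
        timeout = cache_peer_connect_timeout
    else:
        timeout = peer_connect_timeout
    return 10 * timeout


def may_forward_to_dead_peer(now: float, last_contact: float,
                             threshold: float) -> bool:
    """Model of the feature: a dead peer that was idle long enough may be used."""
    return now - last_contact > threshold


print(idle_threshold())  # 300.0 seconds, i.e., 5 minutes with the defaults
```

With the feature removed, the second function would effectively always
return false for dead peers, leaving revival to TCP probes alone.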

Thank you,

Alex.
P.S. Please resist the temptation to discuss other peering problems on
this thread, including other problems associated with detection and
revival of dead peers. Let's focus on this specific feature proposed for
removal.