[squid-users] 6.x gives frequent connection to peer failed - spurious?

Tue Nov 21 15:55:13 UTC 2023

On 2023-11-21 08:38, Stephen Borrill wrote:
> On 15/11/2023 21:55, Alex Rousskov wrote:
>> On 2023-11-10 05:46, Stephen Borrill wrote:
>>
>>> With 6.x (currently 6.5) there are very frequent (every 10 seconds or 
>>> so) messages like:
>>> 2023/11/10 10:25:43 kid1| ERROR: Connection to 127.0.0.1:8123 failed
>>
>>
>>> why is this logged as a connection failure
>>
>> The current error wording is too assuming and, in your case, evidently 
>> misleading. The phrase "Connection to X failed" should be changed to 
>> something more general like "Cannot contact cache_peer X" or "Cannot 
>> communicate with cache_peer X".
>>
>> CachePeer::countFailure() patches welcome.

> But the point is that it _can_ communicate with the peer, but the peer 
> itself can't service the request. The peer returning 503 shouldn't be 
> logged as a connection failure

My bad. I missed the fact that the described DNS error happens at a 
_peer_ Squid. Sorry.

Currently, Squid v6 treats most CONNECT-to-peer errors as a sign of a 
broken peer. In 2022, 4xx errors were excluded from that set[1]. At that 
time, we also proposed to make that decision configurable using a new 
cache_peer_fault directive[2], but the new directive was blocked as an 
"overkill"[3], so we hard-coded 4xx exclusion instead.

Going forward, you have several options, including these two:

1. Convince others that Squid should treat all 503 CONNECT errors from 
peers as it already treats all 4xx errors. Hard-code that new logic.

2. Convince others that cache_peer_fault or a similar directive is a 
good idea rather than an overkill. Resurrect its implementation[2].

[1] 
https://github.com/squid-cache/squid/commit/022dbabd89249f839d1861aa87c1ab9e1a008a47

[2] 
https://github.com/squid-cache/squid/commit/25431f18f2f5e796b8704c85fc51f93b6cc2a73d

[3] https://github.com/squid-cache/squid/pull/1166#issuecomment-1295806530

HTH,

Alex.

>>  > do I need to worry about it beyond it filing up the logs needlessly?
>>
>> In short, "yes".
>>
>> I cannot accurately assess your specific needs, but, in most 
>> environments, one should indeed worry that their cache_peer server 
>> names cannot be reliably resolved because failed resolution attempts 
>> waste Squid resources and increase transaction response time. 
>> Moreover, if these failures are frequent enough (relative to peer 
>> usage attempts), the affected cache_peer will be marked as DEAD (as 
>> you have mentioned):
>>
>>  > 2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123
> 
> Problem seems to be easily reproducible:
> 
> 1# env https_proxy=http://127.0.0.1:8084 curl https://www.invalid.domain/
> curl: (56) CONNECT tunnel failed, response 503
> 2# grep invalid /usr/local/squid/logs/access.log|tail -1
> 1700573429.015      4 127.0.0.1:8084 TCP_TUNNEL/503 0 CONNECT 
> www.invalid.domain:443 - FIRSTUP_PARENT/127.0.0.1:8123 -
> 3# date -r 1700573429 '+%Y/%m/%d %H:%M:%S'
> 2023/11/21 13:30:29
> 4# grep '2023/11/21 13:30:29' /usr/local/squid/logs/cache.log
> 2023/11/21 13:30:29 kid1| ERROR: Connection to 127.0.0.1:8123 failed
> 
>>> With 4.x there were no such messages.
>>>
>>> By comparing to the peer squid logs, these seems to tally with DNS 
>>> failures:
>>> peer_select.cc(479) resolveSelected: PeerSelector1688 found all 0 
>>> destinations for bugzilla.tucasi.com:443
>>>
>>> Full ALL,2 log at the time of the reported connection failure:
>>>
>>> 2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(214) doAccept: New 
>>> connection on FD 17
>>> 2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(316) acceptNext: 
>>> connection on conn3 local=127.0.0.1:8123 remote=[::] FD 17 flags=9
>>> 2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1332) 
>>> parseHttpRequest: HTTP Client conn13206 local=127.0.0.1:8123 
>>> remote=127.0.0.1:57843 FD 147 flags=1
>>> 2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1336) 
>>> parseHttpRequest: HTTP Client REQUEST:
>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707) 
>>> clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443 is 
>>> ALLOWED; last ACL checked: localhost
>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(683) 
>>> clientAccessCheck2: No adapted_http_access configuration. default: ALLOW
>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707) 
>>> clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443 is 
>>> ALLOWED; last ACL checked: localhost
>>> 2023/11/10 10:25:43.162 kid1| 44,2| peer_select.cc(460) 
>>> resolveSelected: Find IP destination for: bugzilla.tucasi.com:443' 
>>> via bugzilla.tucasi.com
>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(479) 
>>> resolveSelected: PeerSelector1526 found all 0 destinations for 
>>> bugzilla.tucasi.com:443
>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(480) 
>>> resolveSelected:    always_direct = ALLOWED
>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(481) 
>>> resolveSelected:     never_direct = DENIED
>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(482) 
>>> resolveSelected:         timedout = 0
>>> 2023/11/10 10:25:43.163 kid1| 4,2| errorpage.cc(1397) buildBody: No 
>>> existing error page language negotiated for ERR_DNS_FAIL. Using 
>>> default error file.
>>> 2023/11/10 10:25:43.163 kid1| 33,2| client_side.cc(617) swanSong: 
>>> conn13206 local=127.0.0.1:8123 remote=127.0.0.1:57843 flags=1
>>>
>>> If my analysis is correct why is this logged as a connection failure 
>>> and do I need to worry about it beyond it filing up the logs needlessly?
>>>
>>> My concern is that this could lead to the parent being incorrectly 
>>> declared DEAD thus impacting other traffic:
>>>
>>> 2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123
>>>      current master transaction: master4581234
>>>
>>