[squid-dev] [PATCH] Reuse reserved Negotiate and NTLM helpers after an idle timeout.

Mon Jul 31 15:24:54 UTC 2017

On 31/07/17 22:24, Christos Tsantilas wrote:
> Στις 30/07/2017 06:48 πμ, ο Amos Jeffries έγραψε:
>> On 27/07/17 18:52, Christos Tsantilas wrote:
>>> The patch.
>>>
>>> Στις 26/07/2017 12:37 μμ, ο Christos Tsantilas έγραψε:
>>>> Squid can be killed or maimed by enough clients that start 
>>>> multi-step connection authentication but never follow up with the 
>>>> second HTTP request while keeping their HTTP connection open. 
>>>> Affected helpers remain in the "reserved" state and cannot be reused 
>>>> for other clients. Observed helper exhaustion has happened without 
>>>> any malicious intent.
>>>>
>>>> To address the problem, we add a helper reservation timeout. Timed 
>>>> out reserved helpers may be reused by new clients/connections. To 
>>>> minimize problems with slow-to-resume-authentication clients, timed 
>>>> out reserved helpers are not reused until there are no unreserved 
>>>> running helpers left. The reservations are tracked using unique 
>>>> integer IDs.
>>>>
>>
>> Since NTLM and Negotiate are both very stateful security protocols 
>> this re-use is only possible if the helper is using concurrency in its 
>> communications with Squid. To do so otherwise would randomly allow 
>> replay attacks to succeed - in a way that would be extremely nasty to 
>> troubleshoot. Attackers are fully able to flood an auth backend with 
>> traffic outside of Squid and slow it down sufficiently for this attack 
>> to become a problem.
>>   Note that the type-1 tokens where the TCP connection can be closed 
>> also should not reserve a helper - if that is happening it is a bug 
>> regardless of this patch.
>>
>>
>> To reach the desired behaviour it would be better to actually 
>> implement concurrency support for the stateful helpers interfaces. So 
>> we can ensure the helper is fully aware of the separation between 
>> clients auth sessions regardless of whether any given token gets 
>> replayed.
>>
>> It should not be much of a step to use the concurrency channel-ID as 
>> part of the reservationId. Helpers that support concurrency should not 
>> have trouble servicing auth on other channel-ID until the total number 
>> of reserved channels gets extremely high. At that point restarting the 
>> helper is sane and the existing on-persistent-overload logics can be 
>> applied to whether Squid dies or simply kills the blocked helper (ERR).
>>
>>
>> This patch is going most of the way towards that already by using 
>> reservation-ID to replace hardcoded class dependencies, and moving a 
>> lot of code to the Base class. But it does not go far enough to make 
>> the change safe to add to Squid IMO.
>>
>>
>> NP: I have only briefly checked the code to be sure this was not 
>> actually doing the concurrency change yet without mentioning it. 
>> However, one other major issue stands out;
>>
>> StatefulGetFirstAvailable() is a C-style accessor method for the 
>> stateful_helper, we simply have not managed to make it one properly 
>> yet. The value it is given is the context within/for which the caller 
>> needs it to find a usable *_server. Changing that context in order to 
>> find a different server result (for some other context) is not an 
>> appropriate thing to be doing.
>>
>> The only thing in the new code for StatefulGetFirstAvailable() which 
>> might lead to it needing non-const is the HelperServerBase::reserved() 
>> method being indirectly called. That method should be "virtual bool 
>> reserved() const = 0". Fixing that should make the 
> 
> Yes this method can be const.
> 
>> StatefulGetFirstAvailable() API const'ness change unnecessary.
> 
> This change was not required in any case. Sorry.
> 
>>
>>
>> -1 for now, and please do not apply this until the replay attack 
>> problem is fully resolved.
> 
> My sense is that the attack problem can never be "fully" resolved.
> But the patch does not try to solve any attack problem. It try to solve 
> problems on normal squid operation.

The problem you described of helpers being hung in reserved state cannot 
happen during *normal* NTLM/Negotiate operations. Normal clients a) 
complete the auth sequence, or b) terminate the connection quickly and 
thus trigger cleanups to happen in Squid, or c) are actually still going 
to produce the token this helper is reserved and waiting for. None of 
the _normal_ traffic cases are served well by this change.

So either there is a bug in Squid client-connection cleanup (case 'b') 
which you are not addressing at all. Or, the use-case you are solving 
for here are ones where clients are performing a DoS attack against the 
proxy, whether you understood that or not.

What your patch does is changing a basic DoS vulnerability to an account 
hijacking / replay vulnerability. Even though the replay may be much 
more difficult to achieve than the original DoS the damage and severity 
when it happens is far, far worse. Sufficiently so that the original DoS 
is preferable to have happening.

At the very least the DoS situation is easy to see and remedy. Probably 
that visibility is why it gets so much complaining and the other 
(several!) common problems with these helpers get hardly a mention.

> 
> This patch plus the client_ip_max_connections option, and a small 
> reservation-timeout value should be enough for most cases.
> 
> Using concurrency for stateful helpers require changes in existing 
> stateful helpers distributed with squid and other custom helpers used by 
> many users. The changes investigated by this patch required even if we 
> are going to support concurrency to allow users using their existing 
> helpers.

I'm not proposing that we change the interface to concurrency always-on. 
It should be consistent with the concurrency on the other helper 
interfaces, which is default-off but available for those who need and 
can use it. For all those same reasons you mention.

For NTLM the common (only?) helper is the Samba one. That apparently 
already supports concurrency. So if that works with our latest type of 
concurrency numbering, then the change should not be a huge problem for 
NTLM users.

For Kerberos the common helpers are Marcus ones which we ship with 
Squid. Some coordination will be needed there to keep his separately 
distributed copy in sync, but relatively easy.

> 
> As you said "this patch is going most of the way towards".
> I do not think it is good idea to reject this patch only because it 
> "does not go far enough". In a second step and if required, we can 
> implement concurrency on stateful helpers.

In this case not going far enough leaves open a far worse vulnerability 
than the one being resolved. I'm rejecting it primarily because of the 
replay/hijack addition being worse than the DoS.

If you can come up with some other way to avoid the DoS pains without 
getting into worse problems I am open to temporary measures. But given 
the direction this code has gone it seems to me that going for the 
concurrency support is less additional work than the analysis required 
to find+code good alternatives would take.

FWIW; I am tempted by the old idea of just letting Squid abort the 
helpers that are hung and starting new ones afresh. But experiences 
coming back from people using the dynamic-helpers feature show that even 
that does not help much. Instead of Squid aborting with the FATAL 
message it continues until the kernel oom-killer aborts it instead 
anyway, or for some very high-performance proxies the TCP stack can 
still run out of sockets in the half second or so before the new helpers 
have started producing responses.
  Up to you whether you think implementing that as a quick-fix is worth it.

Amos