[squid-users] squid, SMP and authentication and service regression over time

Fri May 20 10:40:00 UTC 2016

On 17/05/2016 6:27 a.m., Eugene M. Zheganin wrote:
> Hi.
> 

I don't see any mention of the Squid version. Which one are you seeing
this issue in?

> I've been using squid for a long time, I'm using it to authenticate/authorize
> users accessing the Internet with LDAP in a Windows corporate
> environment (Basic/NTLM/GSS-SPNEGO) and recently (several months
> ago) I had to switch to the SMP scheme, because one process started to
> eat the whole core sometimes, thus bottlenecking users on it.

This might be a version-specific problem. We've had a few bugs solved
that could match that description.

> Situation
> with CPU effectiveness improved, however I discovered several issues.
> The first I was aware of, it's the non-functional SNMP (since there's no
> solution, I just had to sacrifice it).

Do you mean it's fully non-functional?

Or that you are just getting randomly different responses from different
workers when they share an SNMP receiving port?

The latter can be worked around by configuring per-worker SNMP ports and
querying each worker individually for its details.
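
A minimal sketch of that per-worker workaround, assuming a base port of
3400 and a community name of "public" (both chosen here purely for
illustration). It relies on the ${process_number} macro that squid.conf
expands separately in each SMP process:

```
# Assumption: 3400 is an arbitrary base port. ${process_number} expands
# to the number of the process reading the config, so process 1 listens
# on 3401, process 2 on 3402, and so on.
snmp_port 340${process_number}

# Hypothetical community name, with queries restricted to the local host.
acl snmppublic snmp_community public
snmp_access allow snmppublic localhost
snmp_access deny all
```

The monitoring side then has to poll each port and sum or compare the
per-worker counters itself.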

> But the second one is more
> disturbing. I discovered that after some uptime (usually a couple of
> weeks, a month at best) squid somehow degrades and stops
> authorizing users.

Which auth scheme are those users using?

> I have about 600 active users on my biggest site
> (without SNMP I'm not sure how many simultaneous users I have) but

The mgr:client_db report can help give a good ballpark number there if
you have it enabled.
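
As a rough sketch, assuming a Squid version where the built-in
"localhost" and "manager" ACLs exist, the settings involved look
something like this; the report itself can then be fetched with a tool
such as "squidclient mgr:client_db" run on the proxy host:

```
# Per-client statistics collection. This is normally on by default;
# it is shown only to make the dependency explicit.
client_db on

# Allow cache manager reports from the proxy host only.
http_access allow localhost manager
http_access deny manager
```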

> usually this starts like this: someone (this starts with one person)
> complains that he lost his access to the internet - not entirely, no. At
> first the access is very slow, and the victim has to wait several
> minutes for the page to load. Others are unaffected at this time. From
> time to time the victim is able to load one or two tabs in the browser,
> eventually, but at the end of the day this becomes unusable, and my
> support has to come in. Then this gets escalated to me. First I was
> debugging various kerberos stuff, NTLM, victim's machine domain
> membership and so on. But today I managed to figure out that all I have
> to do is just restart squid, yeah (sounds silly, but I don't like to
> restart things, like in the "IT Crowd" TV Series, this is kinda last
> resort measure, when I'm desperate).

That could be any one of four bugs I'm aware of:

1) NTLM connection limit to AD.
 Winbind access to AD cannot make more than 256 concurrent connections
to any given AD. That's an aggregate across all the NTLM + Negotiate helpers
and any other processes also running on the Squid machine.
 This can result in an ever-growing queue of pending auth requests until
the proxy is treading water just trying to catch up on which clients
have not yet disconnected.

2) NTLM helper limits exceeded.
 NTLM handshake duration is not limited. If for any reason it pauses for
a long time between the multiple HTTP requests involved, that helper is
blocked from use by any other users.
 This can result in both an ever-growing queue and ever fewer helpers
available to service that queue.

Don't you just love NTLM?

3) NTLM and Negotiate involve the helper passing Squid a unique token
with every HTTP request made on a new connection. The annotations
feature in Squid was, for quite a few releases, adding these to each
username's auth state.
 The number of these unique token Notes could build up over a few hours
to a day or two, depending on the client's activity rate, to a number big
enough to cause noticeable delays on every request they made, and on
requests from others.

4) Recent versions of Firefox are known to begin NTLM handshakes badly.
They work fine for Kerberos handshakes, and sometimes for NTLM. But for
certain requests they advertise keep-alive on the type-1 message then
just hang.

Fortunately this behaviour was also seen with MSIE 5.x back in the day, so
the auth_param "keep_alive off" setting is already available to resolve
that. Though it does mean the NTLM handshakes require a TCP teardown and
reconnect, which can make issue (2) above hurt more.
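
For reference, a minimal sketch of that setting for the NTLM scheme. It
assumes the rest of the auth_param ntlm configuration (helper program,
children, and so on) is already in place:

```
# Do not keep the connection open after the initial challenge, so a
# client that advertises keep-alive on the type-1 message and then
# hangs cannot hold the connection. The cost is a TCP teardown and
# reconnect during each NTLM handshake.
auth_param ntlm keep_alive off
```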

> If I'm stubborn enough to continue
> the investigation, soon I got 2 users complaining, then 3, then more.
> During previous outages eventually I used to restart squid (to change
> the domain controller in kerberos config, if I blame one; to disable the
> external Kerberos/LDAP helper connection pooling, if I blame one) - so
> each time there was a candidate to blame. But this time I just decided
> to restart squid, since I started to think it's the main reason, et
> voila. I should also mention that I run this AAA scheme in squid for
> years, and I didn't have this issue previously.

Keep in mind that if you have been keeping up with important
patches/updates to Squid, AD and/or Samba, or just client OS updates,
then a lot of things have been changing on all sides of the process
across those years.

> I also have like dozen
> of other squids running same (very similar) config, - same AAA stuff -
> Basic/NTLM/GSS-SpNego, same AD group checking, but only for the
> different groups membership - and none of it has this issue. I'm
> thinking there's SMP involved, really.

Maybe. Each worker does its own auth, with no sharing. So they should be
operating the same as if they were different instances which happened to
have identical config.
 That itself can make problem (1) happen, as the Winbind connection count
multiplies by the number of workers.
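
To illustrate the multiplication with made-up numbers (the worker and
helper counts below are hypothetical, not taken from the reported
setup): with 4 workers each allowed 20 NTLM and 20 Negotiate helpers,
the worst case is 4 x (20 + 20) = 160 helpers, which stays under the
256 figure mentioned in (1) and leaves some room for other processes on
the machine. The same children values with 8 workers would allow 320.

```
# Hypothetical sizing; every worker starts its own set of helpers.
workers 4

# Upper bound per worker: 20 NTLM + 20 Negotiate helpers.
auth_param ntlm children 20 startup=5 idle=1
auth_param negotiate children 20 startup=5 idle=1
```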

Other than that, each TCP connection might end up going to a different
worker. BUT, re-auth is always needed on new TCP connections anyway. So
if the client is using HTTP properly that should not be causing any
issue. Might be a big "IF" there though.
 I have to keep reassuring myself that NTLM can handle the TCP
re-connect going to a different worker. The bits prior to the type-1
handshake don't need a helper, so it should not have issues, but I'm
not completely confident about it.

> 
> I realize this is a poor problem report. "Something degrades, I restart
> squid, please help, I think it's SMP-related". But the thing is - I
> don't know where to start to narrow this stuff. If anyone's having a
> good idea please let me know.

The above might give you ideas. Otherwise I can only suggest turning on
debug for the authentication section and seeing if anything odd shows up.
 debug_options ALL,1 29,4

Amos