[squid-dev] [PATCH] url_rewrite_timeout directive

Tue Nov 18 09:44:48 UTC 2014

On 11/18/2014 01:27 AM, Amos Jeffries wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 18/11/2014 6:10 a.m., Tsantilas Christos wrote:
>> On 11/16/2014 01:05 PM, Amos Jeffries wrote:
>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>>
>>> On 16/11/2014 7:38 a.m., Tsantilas Christos wrote:

 > ....

>>>
>>> For the record I am still extremely skeptical about the use-case
>>> behind this feature. Timeout i a clear sign that the helper or
>>> system behind it is broken (capacity overload / network
>>> congestion). Continuing to use such systems in a way which
>>> further increases network load is very often a bad choice. Hiding
>>> the issue an even worse choice.
>>
>>
>> Maybe the administrator does not care about unanswered queries.
>
> I posit that admin who truely do not care about unanswered questions
> will not care enough to bother configuring this feature.

So he will not do it.
The default behaviour is to not use timeout.
The default behaviour is not changed.

>
> The administrator should always care about these unanswered questions.
> Each one means worse proxy performance. Offering to hide them silently
> and automatically without anybody having to do any work about fixing
> the underlying cause implies they are not actually problems. Which is
> false implication.
>

This option can be used to increase proxy performance. This is the 
purpose of this option.

>
>> Imagine a huge squid proxy, which may have to process thousands of
>> URLs per minute and the administrator decided that a 10% of helpers
>> request can fail. In this case we are giving him a way to configure
>> the squid behaviour in this case: may ignore the probelm, or use a
>> predefined answer etc
>
> Now that same Squid proxy suddenly starts randomly wrongly rejecting
> just one request per minute out of all those thousands.
>   Which one, why, and WTF can anybody do about it?

Just only requests which timed out and only if configured to do it...

>
> Previous Squid would log a client request with _TIMEOUT, leave a
> helper in Busy or Pending state with full trace of the pending lookup
> in mgr reports, possibly even cache.log warnings about helpers queue
> length.

The patch does not change squid behaviour, if the timeout is not 
configured (default).

>
> Now all that is reduced to an overly simple aggregated "Requests timed
> out:" hiding in a rarely viewed manager report and a hidden level-3
> debug message that lists an anonymous requestId from N identical
> requestIds spread over N helpers.
>

IF configured you will see in mgr report a "Requests timed out:"  for 
each running server.
If you see that a server has many timedout requests, more than the other 
servers then you can kiil it if you consider it as a problem.

>>
>> The reason the helper is not answered enough fast, maybe is a
>> database or an external lookup failure (for example categorized
>> urls list as a DNS service). In these cases the system admin or
>> service provider, may prefer a none answer from the helper, or a
>> preconfigured answer, instead of waiting too long for the answer.
>
> What to do if the internal dependencies are going badly is something
> that should be handled *inside the helper*.
>   After all Squid has absolutely no way to know if its generic lookup
> helper has the DNS lookup on the DB server name or the DB query itself
> broken. Each of which might have a different "best" way to respond.
>
> The intention of the BrokenHelper code was to have the helpers
> explicitly inform Squid that they were in trouble and needed help
> shifting the load elsewhere.
>
> Squid silently "Forgetting" that requests have been sent and then
> sending *more* is a truely terrible way to fix all the cases of helper
> overload.

Again, the timeout is optional. It is an extra option.
If the customer/squid-user, wants to use BH code, still can do it.

>>
>> The helper can be implemented to not answer at all after a timeout
>> period. A policy of "if you do not answer in a day, please do not
>> answer at all, I am not interested any mode" is common in human
>> world, and in business.
>>
>
> Yes, it is also being fingered as a major reason for businesses dying
> off during the recent recession....

:-)
Well, we can do a huge discussion about the recent recession, but I am 
sure I have more examples than you on failing businesses under an 
recessionary environment!

>..... Late respondants lost out on work and
> thus cashflow.

Yes but you can not avoid such cases. A stop-loss, after a reasonable 
configured timeout, is not a bad tactic in such cases.

>>>
>>> in src/cache_cf.cc
>>>
>>> * please move the *_url_rewrite_timeout() functions to
>>> redirect.cc - that will help reduce dependency issues when we get
>>> around to making the redirector config a standalone FooConfig
>>> object.
>>
>> Not so easy, because of dependencies on time-related parsing
>> functions.... I let it for now. If required we must move time parse
>> functions to a library file, for example Config.cc
>>
>
> Pity. Oh well.
>
>>>
>>> * it would be simpler and easier on users to omit the
>>> "on_timeout=use_configured_response response=" and just have the
>>> existence of a response= kv-pair imply the on_timeout action. -
>>> that also allows the above suggested removal of bypass and retry
>>> special actions without loosing any functionality. - that removes
>>> the FATAL errors if admin gets the options ordering different
>>> from what you expect. - and allows for forward-compatibility with
>>> unknown options. A while loop fetching kv-pair and processing
>>> them as found regardess of order seems best for a field
>>> documented a " [options] ".
>>>
>>
>> True. This is fixed in this patch
>
> Added a memory leak when setting Config.onUrlRewriteTimeout.response
> multiple times.
>
> Also in free_url_rewrite_timeout() use safe_free() to release and NULL
> the pointer in one call.
>
> Also in dump_url_rewrite_timeout()please do not assert(). This will do
> nasty things in SMP mode manager reports. Handle the error instead.

ok for these.

>
>
>>>
>>> * please define the new handleAdaptationFailure() *above* the
>>> new calloutsError(). - that will reduce the diff and make it
>>> clearer what actually is changing and what not.
>>
>
>
> Still not doing this. It adds 40 lines to your diff when there is only
> 5-6 lines of change.

sorry. Looks that I  forgot to do save before make the diff...

>
>>
>>>
>>> * its now possible for adaptation error to silently continue
>>> processing. Before it would hang. I am not sure that is a good or
>>> bad change. But it may be worth noting the difference in
>>> handleAdaptationFailure() where it not unconditionally calls
>>> doCallouts() at the end.
>>
>> There is not any change on handling adaptation errors, or should
>> not be. Am I loosing something?
>
> The new handleAdaptationFailure() runs doCallout unconditionally:
>    ...
>    calloutsError(ERR_ICAP_FAILURE, errDetail);
>    doCallouts();
>
> The old code was:
>   ...
>   if (calloutContext) {
>     ... <body of new calloutsError> ...
>     doCallouts();
>   }
>
>
> The new code for calloutsError() uses the if-condition internally now.
> But this leaves the terminal doCallouts() from
> handleAdaptationFailure() unprotected.

I think I had check and I found that it is not possible to have 
calloutContext==NULL.
I need to put an "if (calloutContext)" check before call doCallouts to 
be safe.

>
>>> in src/helper.cc:
>>>
>>> * adding multiple double-empty lines in the chunk @195
>>>
>>> * helper_server::checkForTimedOutRequests() is deducting from
>>> stats.pending pending. This is incorrect, the helper request is
>>> still pending.
>>
>> Amos, I think it is better as is.... Maybe in a point of view the
>> request is still pending, but we are not waiting an answer for this
>> request any more.
>
> Any helper which is obeying the helper protocol *will* still be
> consuming resources trying to satisfy it.
> This is what I mean by forgetting and sending more requests to the
> helper "making the overload problem worse".

Maybe we should document this behaviour in option documentation. The 
administrators who want to use timeout they have to take care of it.

>>
>>> - your act of removing the requests from the queue invalidates
>>> the logic in helperReturnBuffer() recording and managing late
>>> helper responses.
>>
>> why?
>
> Non-concurrent helpers:
>   * #1 query times out and gets discarded.
>   * Lookup logics finding a helper to use see a helper without pending
> requests and deliver a second lookup (#2).
>   * #1 response arrives late and is used as response to the current
> pending (#2) lookup.

This is will be discarded.
This is because the request will get new requestId before retried.  The 
old requestId is not valid any more.

>   * #2 response arrives and is discarded.

This is will be accepted.

> - --> helper responses are hereafter off-by-1 until another timeout
> makes it off-by-2 or lucky random timing makes it off-by-0 again.
>
> Please notice that the helpers with most common long duration lookups
> are NTLM and Kerberos crypto validation. Which do not support
> concurrency channels.

Yes. They can not use timeout feature.

>>
>>> - both of the above also corrupts the mgr report content for
>>> expected pending responses. By all means call (and clear) the
>>> callback on a timeout but please do not remove the queue entries
>>> for them until the helper actually responds to that query.
>>
>> Some of the queries will never  answered. This is may result to
>> memory leaks.
>
> see the point I made next:
>
>>> - if a helper gets too many un-responded queries the traffic
>>> being served by it is f'ckd and we should shutdown the helper. If
>>> an auto-restart does not work Manual intervention is probably
>>> required from the admin to resolve why it is timing out *at
>>> all*.

Yes this is true. But the admin can find out if many requests are timed 
out for a server using mgr. And then he can kill problematic server.

>
>>
>>
>>> NP: when doing a re-try we still may get a response to this
>>> lookup from *either* of the helpers which have been asked for it.
>>> The frst helper is in fact more likely to respond first. IMO we
>>> should be taking advantage of that and leaving both helpers with
>>> shared pointer which either can call+unset the callback for.
>>
>> Maybe this is good optimization. However requires some work.... We
>> can add it later if required as optimization...
>>
>
> Ok. Please add an TODO about it at least.
ok

>
>>>
>>> Please also document clearly in the release notes that:
>>>
>>> A) helper-mux.pl we have been distributing for the past few years
>>> to encourage use of concurrency is no longer compatible with
>>> Squid. If used it will spawn up to 2^64 helpers and DoS the Squid
>>> server.
>>
>> What is this helper? I can not find it inside squid sources....
>
> Under tools/
> http://wiki.squid-cache.org/Features/HelperMultiplexer

We need to fix this helper to work with new changes too....
OK.

>>> B) helpers utilizing arrays to handle fixed amounts of
>>> concurrency channels MUST be re-written to use queues and capable
>>> of handling a 64-bit int as index or they will be vulnerable to
>>> buffer overrun and arbitrary memory accesses.
>>
>>
>> OK for this
>>
>>>
>>> C) 32-bit helpers need re-writing to handle the concurrency
>>> channel ID as a 64-bit integer value. If not updated they will
>>> cause proxies to return unexpected results or timeout once
>>> crossing the 32-bit wrap boundary. Leading to undefined behaviour
>>> in the client HTTP traffic.
>>>
>>
>> OK.
>>
>>>
>>>
>>> My vote on this is -0. Too many problems and still no clear
>>> use-case has been described[1] for this to be a reasonable
>>> addition.
>>
>>
>> OK. I tried to do most  of the fixes you are suggested. I hope it
>> is OK.
>>

Thank you for your comments.
Regards,
    Christos

>
> Amos