[squid-dev] [PATCH] url_rewrite_timeout directive

Mon Nov 17 23:27:55 UTC 2014

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 18/11/2014 6:10 a.m., Tsantilas Christos wrote:
> On 11/16/2014 01:05 PM, Amos Jeffries wrote:
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>> 
>> On 16/11/2014 7:38 a.m., Tsantilas Christos wrote:
>>> Hi all,
>>> 
>>> This patch adds the url_rewrite_timeout directive.
>>> 
>>> When configured, Squid keeps track of active requests and
>>> treats timed out requests to redirector as failed requests.
>>> 
>>> url_rewrite_timeout format: url_rewrite_timeout timeout
>>> time-units 
>>> on_timeout=<fail|bypass|retry|use_configured_response> 
>>> [response=<quoted-string>]
>>> 
>>> The url_rewrite_timeout directive can accept the on_timeout 
>>> argument to allow user configure the action when the helper
>>> request times out. The available actions are: - fail: squid
>>> return a ERR_GATEWAY_FAILURE error page
>> 
>> Why? It seems to me this should be doing one of the below:
>> 
>> on_timeout=use_configured_response response=ERR
>> 
>> on_timeout=use_configured_response response=BH
>> 
>> although being able to redirect to a Squid default error page
>> template has its attractions regardless of this. Perhapse BH with
>> an error=ERR_GATEWAY_FAILURE kv-pair ?
> 
> The on_timeout= option is easier to understand and configure.
> 
> Also this patch add the retries operation inside helpers.cc code
> and make it easier to implement similar features for other helpers
> too...
> 
> Also the BH retry can be implemented now using helpers retry
> operation.
> 
>> 
>> 
>>> - bypass: the url is not rewritten.
>> 
>> Identical to: on_timeout=use_configured_response response=OK
>> 
>>> - retry: retry the request to helper
>> 
>> Equivalent to what is supposed to happen for: 
>> on_timeout=use_configured_response response=BH
>> 
>> NP: if retry with different helper process is not already being
>> done on BH then it needs to be added.
> 
> It is not implemented for BH redirects. Also although it is make
> sense to not use the same server for a BH response, it is not clear
> why it is needed for a timedout server?

We dont know why the timeout happened. There are a few cases where it
may have happened due to internal helper state. Moving to a different
helper guarantees that those cases will have changed in some ways -
reducing the overall probability that it will repeat.

> 
> I believe this is should have a different form. If there are many 
> timedout responses from a server, then do not sent more requests
> for a while, or maybe shut it down. We can add a todo for this
> one.
> 

Both are worth doing.

>> 
>>> - use_configured_response: use a response which can be
>>> configured using the the response= option
>>> 
>> 
>> So as you can see, with a change to make the URL-rewriter capable
>> of redirecting to one of the built-in error page templates we
>> could completely drop the on_timeout setting.
> 
> I still believe that the "on-timeout" is better options and easier
> to understand and configure.
> 

Whatever it gets called the main point was there is no need for 2
different options.

>>> 
>>> Technical details =====================
>>> 
>>> This patch: - adds mechanism inside helpers.cc code to handle 
>>> timeouts: - return a pre-configured response on timeout - or 
>>> retries on timeouts. - or timedout (Helper::TimedOut code)
>>> response to the  caller. The caller can select to ignore the
>>> timedout request, or produce an error.
>>> 
>>> - modify the client_side_request.cc to return
>>> ERR_GATEWAY_FAILURE error page. Also the error detail
>>> ERR_DETAIL_REDIRECTOR_TIMEDOUT is set to identify the type of
>>> error.
>>> 
>> 
>> For the record I am still extremely skeptical about the use-case 
>> behind this feature. Timeout i a clear sign that the helper or
>> system behind it is broken (capacity overload / network
>> congestion). Continuing to use such systems in a way which
>> further increases network load is very often a bad choice. Hiding
>> the issue an even worse choice.
> 
> 
> Maybe the administrator does not care about unanswered queries.

I posit that admin who truely do not care about unanswered questions
will not care enough to bother configuring this feature.

The administrator should always care about these unanswered questions.
Each one means worse proxy performance. Offering to hide them silently
and automatically without anybody having to do any work about fixing
the underlying cause implies they are not actually problems. Which is
false implication.

> Imagine a huge squid proxy, which may have to process thousands of
> URLs per minute and the administrator decided that a 10% of helpers
> request can fail. In this case we are giving him a way to configure
> the squid behaviour in this case: may ignore the probelm, or use a
> predefined answer etc

Now that same Squid proxy suddenly starts randomly wrongly rejecting
just one request per minute out of all those thousands.
 Which one, why, and WTF can anybody do about it?

Previous Squid would log a client request with _TIMEOUT, leave a
helper in Busy or Pending state with full trace of the pending lookup
in mgr reports, possibly even cache.log warnings about helpers queue
length.

Now all that is reduced to an overly simple aggregated "Requests timed
out:" hiding in a rarely viewed manager report and a hidden level-3
debug message that lists an anonymous requestId from N identical
requestIds spread over N helpers.

> 
> The reason the helper is not answered enough fast, maybe is a
> database or an external lookup failure (for example categorized
> urls list as a DNS service). In these cases the system admin or
> service provider, may prefer a none answer from the helper, or a
> preconfigured answer, instead of waiting too long for the answer.

What to do if the internal dependencies are going badly is something
that should be handled *inside the helper*.
 After all Squid has absolutely no way to know if its generic lookup
helper has the DNS lookup on the DB server name or the DB query itself
broken. Each of which might have a different "best" way to respond.

The intention of the BrokenHelper code was to have the helpers
explicitly inform Squid that they were in trouble and needed help
shifting the load elsewhere.

Squid silently "Forgetting" that requests have been sent and then
sending *more* is a truely terrible way to fix all the cases of helper
overload.

> 
> The helper can be implemented to not answer at all after a timeout
> period. A policy of "if you do not answer in a day, please do not
> answer at all, I am not interested any mode" is common in human
> world, and in business.
> 

Yes, it is also being fingered as a major reason for businesses dying
off during the recent recession. Late respondants lost out on work and
thus cashflow.

>> 
>> in src/cache_cf.cc
>> 
>> * please move the *_url_rewrite_timeout() functions to
>> redirect.cc - that will help reduce dependency issues when we get
>> around to making the redirector config a standalone FooConfig
>> object.
> 
> Not so easy, because of dependencies on time-related parsing
> functions.... I let it for now. If required we must move time parse
> functions to a library file, for example Config.cc
> 

Pity. Oh well.

>> 
>> * it would be simpler and easier on users to omit the 
>> "on_timeout=use_configured_response response=" and just have the 
>> existence of a response= kv-pair imply the on_timeout action. -
>> that also allows the above suggested removal of bypass and retry 
>> special actions without loosing any functionality. - that removes
>> the FATAL errors if admin gets the options ordering different
>> from what you expect. - and allows for forward-compatibility with
>> unknown options. A while loop fetching kv-pair and processing
>> them as found regardess of order seems best for a field
>> documented a " [options] ".
>> 
> 
> True. This is fixed in this patch

Added a memory leak when setting Config.onUrlRewriteTimeout.response
multiple times.

Also in free_url_rewrite_timeout() use safe_free() to release and NULL
the pointer in one call.

Also in dump_url_rewrite_timeout()please do not assert(). This will do
nasty things in SMP mode manager reports. Handle the error instead.

>> 
>> * please define the new handleAdaptationFailure() *above* the
>> new calloutsError(). - that will reduce the diff and make it
>> clearer what actually is changing and what not.
> 

Still not doing this. It adds 40 lines to your diff when there is only
5-6 lines of change.

> 
>> 
>> * its now possible for adaptation error to silently continue 
>> processing. Before it would hang. I am not sure that is a good or
>> bad change. But it may be worth noting the difference in 
>> handleAdaptationFailure() where it not unconditionally calls 
>> doCallouts() at the end.
> 
> There is not any change on handling adaptation errors, or should
> not be. Am I loosing something?

The new handleAdaptationFailure() runs doCallout unconditionally:
  ...
  calloutsError(ERR_ICAP_FAILURE, errDetail);
  doCallouts();

The old code was:
 ...
 if (calloutContext) {
   ... <body of new calloutsError> ...
   doCallouts();
 }

The new code for calloutsError() uses the if-condition internally now.
But this leaves the terminal doCallouts() from
handleAdaptationFailure() unprotected.

>> in src/helper.cc:
>> 
>> * adding multiple double-empty lines in the chunk @195
>> 
>> * helper_server::checkForTimedOutRequests() is deducting from 
>> stats.pending pending. This is incorrect, the helper request is
>> still pending.
> 
> Amos, I think it is better as is.... Maybe in a point of view the
> request is still pending, but we are not waiting an answer for this
> request any more.

Any helper which is obeying the helper protocol *will* still be
consuming resources trying to satisfy it.
This is what I mean by forgetting and sending more requests to the
helper "making the overload problem worse".

> 
>> - your act of removing the requests from the queue invalidates
>> the logic in helperReturnBuffer() recording and managing late
>> helper responses.
> 
> why?

Non-concurrent helpers:
 * #1 query times out and gets discarded.
 * Lookup logics finding a helper to use see a helper without pending
requests and deliver a second lookup (#2).
 * #1 response arrives late and is used as response to the current
pending (#2) lookup.
 * #2 response arrives and is discarded.
- --> helper responses are hereafter off-by-1 until another timeout
makes it off-by-2 or lucky random timing makes it off-by-0 again.

Please notice that the helpers with most common long duration lookups
are NTLM and Kerberos crypto validation. Which do not support
concurrency channels.

> 
>> - both of the above also corrupts the mgr report content for
>> expected pending responses. By all means call (and clear) the
>> callback on a timeout but please do not remove the queue entries
>> for them until the helper actually responds to that query.
> 
> Some of the queries will never  answered. This is may result to
> memory leaks.

see the point I made next:

>> - if a helper gets too many un-responded queries the traffic
>> being served by it is f'ckd and we should shutdown the helper. If
>> an auto-restart does not work Manual intervention is probably
>> required from the admin to resolve why it is timing out *at
>> all*.
> 
> 

> 
> 
>> NP: when doing a re-try we still may get a response to this
>> lookup from *either* of the helpers which have been asked for it.
>> The frst helper is in fact more likely to respond first. IMO we
>> should be taking advantage of that and leaving both helpers with
>> shared pointer which either can call+unset the callback for.
> 
> Maybe this is good optimization. However requires some work.... We
> can add it later if required as optimization...
> 

Ok. Please add an TODO about it at least.

>> 
>> Please also document clearly in the release notes that:
>> 
>> A) helper-mux.pl we have been distributing for the past few years
>> to encourage use of concurrency is no longer compatible with
>> Squid. If used it will spawn up to 2^64 helpers and DoS the Squid
>> server.
> 
> What is this helper? I can not find it inside squid sources....

Under tools/
http://wiki.squid-cache.org/Features/HelperMultiplexer

> 
>> 
>> B) helpers utilizing arrays to handle fixed amounts of
>> concurrency channels MUST be re-written to use queues and capable
>> of handling a 64-bit int as index or they will be vulnerable to
>> buffer overrun and arbitrary memory accesses.
> 
> 
> OK for this
> 
>> 
>> C) 32-bit helpers need re-writing to handle the concurrency
>> channel ID as a 64-bit integer value. If not updated they will
>> cause proxies to return unexpected results or timeout once
>> crossing the 32-bit wrap boundary. Leading to undefined behaviour
>> in the client HTTP traffic.
>> 
> 
> OK.
> 
>> 
>> 
>> My vote on this is -0. Too many problems and still no clear
>> use-case has been described[1] for this to be a reasonable
>> addition.
> 
> 
> OK. I tried to do most  of the fixes you are suggested. I hope it
> is OK.
> 

Amos

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

iQEcBAEBAgAGBQJUaoR7AAoJELJo5wb/XPRjpQoIANqegwrcMVhGcAEoVBJ67yBo
7m7YmoY6vd1gI24bHTG89HkoBfiwNUW4lOA6kLlhuE+TKV2ART/jWkuEgwf1aqmh
xvg5aan/Ku0YfqAHCgn65n70dSQM3ucX2BkQ7udOHaKpLY3sAeQmrKB7I0kolber
KW4cxD4Xk6p+mxEeOkPyNysTbYYic8HcFd34zM/LbJXWn82Xu2nD5u6mYKU5mrYO
OhBYZG6y+7a9M2JdAytGbRLWhb0ocycmeL1+rB6TWrJUnYdKvKIgzqFzdMAXkc1d
Fa3wNwo6KjMK8JgYZCKbM6UBev9CaaTZgC6JMKOJXa958LDnL8gWvGuvW18iEXM=
=wPtr
-----END PGP SIGNATURE-----