[squid-dev] [PATCH] Temporary fix to restore compatibility with Amazon

Alex Rousskov rousskov at measurement-factory.com
Fri Jun 26 15:40:48 UTC 2015


On 06/26/2015 06:44 AM, Amos Jeffries wrote:
> On 26/06/2015 7:55 a.m., Alex Rousskov wrote:
>> Tokenizer cannot handle URIs with whitespace directly, like your patch
>> attempts to do: Tokenizer alone cannot handle ambiguous grammars. To
>> handle such URIs well, you have two options IMO:
>>
>> A. The old "do it by hand" way:
>>
>>   0. Trim request line to remove trailing whitespace and CR*LF.
>>   1. Find the *last* whitespace on the trimmed request line.
>>   2. To get the request method and URI, apply the tokenizer to the
>>      request line prefix before that last whitespace.
>>   3. To get the protocol parts, apply the tokenizer to the
>>      request line suffix after that last whitespace.
>>   4. Make the code even messier to handle HTTP/0.9 cases if needed.
>>
>>
>> B. The new optimized and simplified "optimistic Tokenizer" way:
>>
>> Here is a sketch:
>>
>>     SBuf uri; // accumulates request URI characters
>>     if (!tok.prefix(uri, StrictUriChars))
>>         return false; // URI does not start with a valid character
>>
>>     // in the order of correctness (and popularity!):
>>     const bool parsedSuffix =
>>         // canonical HTTP/1 format
>>         parseCanonicalReqLineTail(uri, tok) ||
>>         // HTTP/1 format but with bad characters in URI
>>         parseBadUriReqLineTail(uri, tok) ||
>>         // HTTP/1 format but with bad characters and spaces in URI
>>         parseSpacedReqLineTail(uri, tok) ||
>>         // HTTP/0 format (without the protocol part at all)
>>         parseHttpZeroReqLineTail(uri, tok);
>>
>>     Assert(parsedSuffix); // parseHttpZeroReqLineTail() cannot fail
>>     ...


> C. Locate the LF character and parse in reverse: first the HTTP-Version
> label, then WSP; the remainder is the URI.

Yes, this is a better variant of option A, using Tokenizer::suffix(); I
missed that method. It is still manual labor to accommodate rare special
cases while complicating and slowing down parsing of the common case.
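To make the A/C "parse from the end" idea concrete, here is a standalone sketch using std::string rather than Squid's SBuf/Tokenizer; splitAtLastWhitespace() and RequestLineParts are hypothetical names for illustration, not Squid code:

```cpp
#include <string>

struct RequestLineParts {
    std::string methodAndUri; // prefix before the last whitespace
    std::string protocol;     // suffix after the last whitespace
};

// Splits a request line at its last whitespace (steps 0-3 of option A).
// Returns false when no whitespace remains after trimming, which is the
// HTTP/0.9 special case (step 4).
static bool splitAtLastWhitespace(std::string line, RequestLineParts &out)
{
    // step 0: trim trailing whitespace and CR*LF
    const auto end = line.find_last_not_of(" \t\r\n");
    if (end == std::string::npos)
        return false; // line was all whitespace
    line.erase(end + 1);

    // step 1: find the last whitespace on the trimmed line
    const auto ws = line.find_last_of(" \t");
    if (ws == std::string::npos)
        return false; // no protocol part; treat as HTTP/0.9

    // steps 2-3: method+URI before the split point, protocol after it
    out.methodAndUri = line.substr(0, ws);
    out.protocol = line.substr(ws + 1);
    return true;
}
```

Note how a URI containing spaces (e.g. "GET /a b.html HTTP/1.1") still splits correctly, because only the *last* whitespace matters.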



> (B) is an interesting alternative to what we have now (C). Although you
> are missing the relaxed vs. strict character sets which the RFC mandates
> (permuted separately for both URI and whitespace characters). That
> complicates things a bit.

I do not see where design B is missing something. Obviously(?), if there
are more special cases than the sketch shows, they should be added. I
was just illustrating the approach. The actual implementation should be
different.
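For a concrete (if simplified) picture of the approach, here is a self-contained toy with the same "try parsers in order of correctness" shape; all names are hypothetical stand-ins for the sketch's parse*ReqLineTail() calls, and the tail-matching rules are deliberately simplified rather than Squid's actual grammar:

```cpp
#include <cassert>
#include <string>
#include <string_view>

struct ParsedTail { std::string uri; std::string proto; };

// canonical: exactly " HTTP/x.y" after the URI prefix, no extra spaces
static bool parseCanonicalTail(std::string_view tail, ParsedTail &out)
{
    if (tail.size() >= 6 && tail.substr(0, 6) == " HTTP/" &&
        tail.find(' ', 1) == std::string_view::npos) {
        out.proto = std::string(tail.substr(1));
        return true;
    }
    return false;
}

// spaced: URI may contain spaces; protocol follows the last space
static bool parseSpacedTail(std::string_view tail, ParsedTail &out)
{
    const auto ws = tail.rfind(' ');
    if (ws == std::string_view::npos || tail.substr(ws + 1, 5) != "HTTP/")
        return false;
    out.uri += std::string(tail.substr(0, ws)); // extra URI characters
    out.proto = std::string(tail.substr(ws + 1));
    return true;
}

// HTTP/0 format: no protocol part at all; everything is URI; cannot fail
static bool parseHttpZeroTail(std::string_view tail, ParsedTail &out)
{
    out.uri += std::string(tail);
    out.proto = "HTTP/0.9";
    return true;
}

static ParsedTail parseTail(std::string_view uriPrefix, std::string_view tail)
{
    ParsedTail out;
    out.uri = std::string(uriPrefix);
    const bool parsed = parseCanonicalTail(tail, out) ||
                        parseSpacedTail(tail, out) ||
                        parseHttpZeroTail(tail, out);
    assert(parsed); // the HTTP/0 fallback cannot fail
    return out;
}
```

The common (canonical) case succeeds on the first attempt, which is where the performance win over always scanning from the end comes from.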


> If I can find some time I'll give this (B) style a bit of testing and
> see how it matches up against the current code. Could prove faster on
> very long URIs.

Given correct implementations of A and B, B should be a tiny bit faster
for an average URI because it does less work in common cases.


In your patch, please add support for all URI characters that we can
support (or at least all the "unwise" ones from RFC 2396), not just the
characters that recent deployments have already confirmed "as necessary
to accommodate". We do not want to come back to this every time some app
starts sending slightly malformed URIs.
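For illustration, accepting the whole RFC 2396 "unwise" set at once might look like the following; isStrictUriChar()/isRelaxedUriChar() are hypothetical helpers with a simplified strict set, whereas Squid's parser would extend a CharacterSet instead:

```cpp
#include <cctype>
#include <string>

static bool isStrictUriChar(char c)
{
    // simplified RFC 3986 set: unreserved, reserved, and '%'
    static const std::string punct = "-._~:/?#[]@!$&'()*+,;=%";
    return std::isalnum(static_cast<unsigned char>(c)) ||
           punct.find(c) != std::string::npos;
}

static bool isRelaxedUriChar(char c)
{
    // strict set plus every RFC 2396 "unwise" character at once,
    // instead of whitelisting characters one broken app at a time
    static const std::string unwise = "{}|\\^[]`";
    return isStrictUriChar(c) || unwise.find(c) != std::string::npos;
}
```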

I would also recommend accommodating whitespace after HTTP-version.

Ideally, parseRequestFirstLine(), or at least the new code, should be
split into several methods. It is too long and convoluted, and it is
getting worse.


I remain opposed to downgrading HTTP/1 requests with slightly malformed
URIs to HTTP/0.


HTH,

Alex.
