[squid-dev] [PATCH] Simpler and more robust request line parser

Tue Nov 3 19:10:15 UTC 2015

On 4/11/2015 5:27 a.m., Alex Rousskov wrote:
> On 11/03/2015 02:12 AM, Amos Jeffries wrote:
>>> On 24/07/2015 11:13 a.m., Alex Rousskov wrote:
>>>> "I am happy to restrict URIs to what older Squids accepted or more.
>>>> Recent parser changes broke Squid. There is no need to go beyond what we
>>>> were doing before those changes. I will gladly make a simple patch
>>>> change to restrict URIs to either that old subset or to the current
>>>> trunk subset, whichever is larger. Exactly what subset do you want me to
>>>> use?"
> 
> 
>>> Current trunk already accepts all ASCII printable characters, and a
>>> bunch of whitespace / control codes.
>>>
>>> "What I would like" is:
>>>
>>> // RFC 3986 URI component character set(s)
>>> // RFC 7230 request-line delimiter
>>> // RFC 7230 HTTP-version character set
>>> auto permit = uriValidCharacters() + CharacterSet::SP;
>>>
>>> if (relaxed_header_parser) {
>>>
>>>  // RFC7230 section 3.5 alternative control-code whitespace
>>>  permit += CharacterSet::HTAB
>>>    + CharacterSet("VT,FF","\x0B\x0C")
>>>    + CharacterSet::CR;
>>>
>>> #if USE_HTTP_VIOLATIONS
>>>   // RFC 2396 unwise character set
>>>   // the "transport agents are known to sometimes modify or use as
>>> delimiter" set
>>>   // which must never be transmitted in un-escaped form
>>>   permit.add('\0');
>>>   permit += CharacterSet("RFC2396-unwise","\"\\|^<>`{}");
>>> #endif
>>> }
>>>
>>>
>>> Which leaves some 27 CTL characters (0x01-0x1F) still forbidden. Things
>>> like delete, backslash, bell, escape, end-of-transmission. I hope you
>>> are not going to argue for re-enabling those inside this parser.
>>>
>>> Of course the bare UTF-8 extended unicode characters are omitted too.
>>> Those is prohibited anywhere in HTTP, and a sure sign that its non-HTTP
>>> being processed. So definitely exit this parser.
>>>
>>> IFF you have no objections to the above charsets then the only thing
>>> this patch is waiting on in the technical sense is polygraph test results.
> 
> 
>> Ping. Do we have agreement on those character sets ?
> 
> Maybe. I asked for the exact "relaxed parser" "HTTP violating" set (the
> only important case in the real world). You gave me code constructing
> various sets under various conditions. Until I start porting the
> proposed patch to trunk, I would not know whether your code is
> compatible with what I think we should do, and what sets it constructs.
> 

Strict parsing:
 0x20-0x21, 0x23-0x3B, 0x3D, 0x3F-0x5B, 0x5D, 0x5F, 0x61-0x7A, 0x7E

Relaxed parsing:
 0x09-0x0d, 0x20-0x21, 0x23-0x3B, 0x3D, 0x3F-0x5B, 0x5D, 0x5F,
0x61-0x7A, 0x7E

Relaxed parsing with USE_HTTP_VIOLATIONS enabled:
 0x09-0x0d, 0x20-0x7E

NOTE 1: violatison is enabled by default. To get non-violations
behaviour people have to have explicitly configured strict parsing, or
built the proxy without violations (reducing their relaxed parser to
just the compliant tolerances).

NOTE 2: HTTP is ASCII-only unless specifically documented. The
request-line is not one of those special cases. The 0x80-0xFF octet
range is not part of ASCII.

I cannot do a proper comparison to the old-old parser because there was
not just one of them. But 3 and bits of partial-parsing in a few places.
They mostly appeared to accept 0x00-0x09, 0x0B-0xFF, with various bits
of the buffer having specific exact-string matches applied or passed
directly to syscalls depending on what code path was being followed.

The biggest problem is the main syscall, isspace() which is locale
dependent. So there are some really weird attacks possible by sending an
HTTP message with locale-specific "whitspace" delimiters around the URL
field, then the first actual SP (0x20) character on another line
somewhere after the first \n character. Sometimes that second line gets
parsed and used as the URL, not the one on the first line of the message.

I can see where your temptation to accept 0x00-0xFF comes from. But I
really think we need to do better with the validation side of parse now,
since this parser is replacing whole code paths of both parse and
validation.

Rejecting the 0x01-0x08, 0x0E-0x1F, 0x7F set even in violations mode is
a good start on that. Those characters are non-printable and can affect
display in logs, terminals, and when debugging.

> You also expressed hope that we will not re-enable support for CTL
> characters. I do not know whether the old parser allowed [some of] them,
> which is one of the primary acceptance conditions for me (by default, we
> want to accept everything old Squid accepted and possibly more; any new
> restrictions should be discussed).
> 
> You also talked about Unicode which sounds completely irrelevant to me
> because we are (or should be) dealing with octets and their 0-255
> integer values, not any smart encoding on top of that.

Unicode: octets 0x80-0xFF.

> 
> In summary, I could not (and still cannot) respond with a quick
> "agreed/disagreed". You did not give me enough information in an
> digestible enough form to do that. You are certainly _not_ obligated to
> provide the info I ask in a form I can quickly grok! I am just
> explaining why I could not respond promptly.
> 
> 
> Moreover, there is still some merging work required because you have
> changed the parser between my review of your parser and my suggested
> replacement patch. I could no longer simply adjust the sets and apply my
> patch back then. That merging work is still required. If you have the
> time to do that, please do it.

The current parser is accepting the same sets above that I am requesting
from yours, in those same above modes. It seems to be working okay so
far as the charactersets go.

Trying to update the patch now.

Thank you.

Amos