[squid-dev] [PATCH] Simpler and more robust request line parser

Amos Jeffries squid3 at treenet.co.nz
Wed Aug 5 07:37:59 UTC 2015


On 24/07/2015 11:13 a.m., Alex Rousskov wrote:
> On 07/23/2015 08:16 AM, Amos Jeffries wrote:
>> Please note: most of my comments in this long email are informational.
>> Things which need to be taken into consideration when touching the parser.
> 
> Noted. I will do my best to treat them as such.
> 
> 
>> I request performance testing of the final version of this change vs
>> lastest trunk at the time.
> 
> Noted.
> 
> The next patch will address comments not discussed in this email. Thank
> you for those!
> 
> The rest of this email is the usual waste of time discussing
> non-existent bugs, arguing about mostly irrelevant issues, burning straw
> men, and talking past each other. The only important bit of information
> is probably the following paragraph that I will quote here for those who
> do not want to waste their time reading everything:
> 
> "I am happy to restrict URIs to what older Squids accepted or more.
> Recent parser changes broke Squid. There is no need to go beyond what we
> were doing before those changes. I will gladly make a simple patch
> change to restrict URIs to either that old subset or to the current
> trunk subset, whichever is larger. Exactly what subset do you want me to
> use?"


Current trunk already accepts all ASCII printable characters, and a
bunch of whitespace / control codes.

"What I would like" is:

// RFC 3986 URI component character set(s)
// RFC 7230 request-line delimiter
// RFC 7230 HTTP-version character set
auto permit = uriValidCharacters() + CharacterSet::SP;

if (relaxed_header_parser) {

    // RFC 7230 section 3.5 alternative control-code whitespace
    permit += CharacterSet::HTAB
            + CharacterSet("VT,FF", "\x0B\x0C")
            + CharacterSet::CR;

#if USE_HTTP_VIOLATIONS
    // RFC 2396 "unwise" character set: the characters which transport
    // agents are known to sometimes modify or use as delimiters, and
    // which must never be transmitted in un-escaped form
    permit.add('\0');
    permit += CharacterSet("RFC2396-unwise", "\"\\|^<>`{}");
#endif
}


Which leaves some 27 CTL characters (0x01-0x08, 0x0E-0x1F, and 0x7F) still
forbidden. Things like delete, backspace, bell, escape, end-of-transmission.
I hope you are not going to argue for re-enabling those inside this parser.

Of course the bare UTF-8 extended Unicode characters are omitted too. Those
are prohibited anywhere in HTTP, and a sure sign that it is not HTTP being
processed. So definitely exit this parser.
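
For illustration only, here is a rough standalone sketch of that permitted
set as a plain lookup table. This is not Squid's CharacterSet / Tokenizer
code; the RFC 3986 approximation of uriValidCharacters() and all names
below are my own assumptions:

// Standalone sketch (plain C++11) of the proposed "permit" set.
// The uriChars string is only an approximation of uriValidCharacters().
#include <array>
#include <string>
#include <iostream>

using ByteSet = std::array<bool, 256>;

static ByteSet makePermitSet(bool relaxed, bool violations)
{
    ByteSet permit{};                // all octets forbidden by default
    // rough stand-in for uriValidCharacters(): RFC 3986 unreserved,
    // gen-delims, sub-delims plus '%' for pct-encoding
    const std::string uriChars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        "-._~:/?#[]@!$&'()*+,;=%";
    for (unsigned char c : uriChars)
        permit[c] = true;
    permit[' '] = true;              // RFC 7230 request-line delimiter

    if (relaxed) {
        // RFC 7230 section 3.5 alternative whitespace
        permit['\t'] = permit['\v'] = permit['\f'] = permit['\r'] = true;
        if (violations) {
            // RFC 2396 "unwise" characters plus NUL
            for (unsigned char c : std::string("\"\\|^<>`{}"))
                permit[c] = true;
            permit[0] = true;
        }
    }
    return permit;
}

int main()
{
    const auto permit = makePermitSet(true, false);
    const std::string uri = "/path?q=hello" "\x07" "world";  // contains BEL
    for (unsigned char c : uri) {
        if (!permit[c]) {
            std::cout << "forbidden octet 0x" << std::hex << int(c) << "\n";
            return 1;
        }
    }
    std::cout << "URI field acceptable\n";
    return 0;
}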

IFF you have no objections to the above charsets, then the only thing
this patch is waiting on, in the technical sense, is Polygraph test results.



>> On 23/07/2015 4:05 p.m., Alex Rousskov wrote:
>>> Hello,
>>>
>>>     The attached patch implements a simpler and more robust HTTP request
>>
>> It is less robust as many potential attacks are now accepted by the
>> Parser as valid HTTP messages. 
> 
> The parser is more robust because it correctly accepts a larger variety
> of benign messages (see Robustness Principle). That necessarily includes
> a larger variety of "potential attack messages", of course, because
> "potential attack" is such an open-ended term.
> 
> FWIW, it would be perfectly reasonable to enhance relaxed_header_parser
> (or add a new relaxed_uri_parser directive) to make URI parsing
> restrictions more flexible than virtually-all-or-virtually-nothing
> approach used today. I would not be surprised if we do see use cases
> encouraging such changes. The patched code makes them easier.


The current situation is that loose input accepted by
Http1::RequestParser causes parsing to return true, which triggers
allocating a bunch of transaction state.
 This either did not happen before, or did not get so far into the
resource allocations, because several repeat-parsing layers of sanity
checks used to exist after the old firstLine() parsing. Those are now
gone / melded into the new first-line parse checking.

 Then the final stage of allocations is to call urlParse() to allocate
the HttpRequest object itself. That still rejects invalid URI input and
triggers de-allocating all the state again.

 Lots of wasted work and resource activity in the name of tolerance for
the nine very rarely used characters which are outside the permitted set
for un-escaped transmission *in any protocol*.


FYI: the next stage of refactoring class URL is going to move urlParse()
into the class, and I was hoping to have Http1::RequestParser use it
instead of a blind CharacterSet scan of the URI field.

 That way the URI is parsed + validated fully before
Http1::RequestParser accepts the first line and starts allocating more
resources beyond the buffer + parser that already exist.

 We can then also do selective leniency for particular parts, eg. {}|
only in the query-string, space in the query-string or path, etc.
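
As a rough illustration of what that selective leniency could look like
once the URI components are split out (plain C++, hypothetical helper
names, not an actual class URL interface):

// Hypothetical sketch: apply a looser octet set to the query string than
// to the path, as an example of per-component leniency. Not Squid code.
#include <cctype>
#include <string>

static bool octetOkForPath(unsigned char c)
{
    // printable or SP tolerated, but never the {}| trio
    return (std::isgraph(c) || c == ' ') && c != '{' && c != '}' && c != '|';
}

static bool octetOkForQuery(unsigned char c)
{
    // query-string additionally tolerates the {}| trio
    return std::isgraph(c) || c == ' ';
}

static bool uriFieldLooksSane(const std::string &uri)
{
    const auto qs = uri.find('?');   // start of query string, if any
    for (std::string::size_type i = 0; i < uri.size(); ++i) {
        const unsigned char c = uri[i];
        const bool inQuery = (qs != std::string::npos && i > qs);
        if (!(inQuery ? octetOkForQuery(c) : octetOkForPath(c)))
            return false;
    }
    return true;
}

With that split, a URI like "/a b/c?x={1}|{2}" would pass while "/a{b}"
would be rejected, since the looser set only applies after the '?'.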


I am okay with making it a tolerant parse. But we need to be as strict
as possible about what is accepted at the first-line parse, because in
the end there will not be a second re-parse checking validity, and what
gets past the front line already means resource allocations. If we screw
up entirely and it gets out the other end, it is the remote server that
pays the price more than Squid's internal store etc. Though shell scripts
parsing Squid logs, or SQL / JSON logging, are also popular and get
screwed up by these nine chars.


> 
>> Squid may "work" as a consequence only because this code was the primary
>> security protection layer. Internal parsers later on (if any) are
>> sometimes less robust and picky.
> 
> I agree that inner layers may receive URIs they have never seen before
> and that those URIs may crash those layers. However, they did receive
> quite a variety before recent parser restrictions were added, and we
> cannot fix bugs [in those other layers] that we do not know about.

Not always. There are 2-3 layers of sanity checks removed now.

For one example: there are stretches of ConnStateData,
ClientHttpRequest, and ClientSocketContext code that have never before
seen 0xC in a domain name. Now they will, because anything can be found
in the query-string with generic whitespace tolerance (as opposed to
just-SP tolerance).
 Yes, there are characters previously forbidden which are now allowed.
Going back is not always more tolerant.


>> In these threads we have had around parsing I am getting the impression
>> you think of the HTTPbis WG documents as restrictive arbitrary committee
>> designs.
> 
> If we are talking about RFCs defining HTTP/1, then I think of them as
> protocol specifications. Characterizing a protocol specification written
> by a committee as "restrictive arbitrary committee design" does not
> really help us do something useful AFAICT.
> 
> As an intermediary operating in an imperfect world, Squid has to achieve
> the right balance of what to accept and what to send. IETF community has
> struggled with finding that balance for a very long time, and I do not
> think that struggle is over. Just because there is a real need to accept
> some protocol-violating messages does not mean that HTTP RFCs should be
> ignored or belittled, of course.
> 

catch-22 at the top of a slippery slope we are halfway down.


> 
>> In the contrary they are explicity tested documents based on what the
>> real-world traffic has been actively observed doing over the past 10
>> years.
> 
> Very true! However, there are things that a protocol specification does
> not and cannot cover. Looking at the specs alone is a doomed approach,
> as illustrated by the restricted parser disaster.

It was no disaster. We have now documented some previously unknown
broken clients. When those are fixed and/or obsolete we can try again.
Eventually the protocol will work as designed and we can further
optimize without having to special-case them on both input and output
translation to other protocols.


> 
> I have no problems with Squid identifying protocol-violating messages if
> there is some practical need in that. However, my focus is on what to do
> with the received message. Blocking *all* protocol-violating messages is
> not practical in most environments. Thus, we must consider things that
> are not documented in the RFC into the equation and occasionally even
> violate the RFC.

Any URI containing \`() { :; }; echo "hi" | cat >> /tmp/hello_world \`
(with back-quotes included) relays the Shellshock vulnerability to backend
servers, log processing scripts, you name it.

... and is accepted by Squid's RFC-violating parser but not the
RFC-compliant parse. That's using 7 of the 9 "unwise" chars BTW: " \ ` { } > and |.


> 
>>  Where the real-world traffic was tracked down to malicious behaviour or
>> causing indirect vulnerability exposure it was explicitly forbidden.
>> Intro things like removal of URI user-info segment, explicit
>> single-digit HTTP version, forbidding of whitespace after "HTTP/1.1"
>> token, single SP delimiter in 'strict' parse, etc.
> 
> And I think Squid follows (or should follow) those protocol changes [at
> least up to a point they start causing interoperability problems].
> 

By following the prescribed behaviour Squid cannot be the cause of an
interoperability problem. It is the victim.

By doing expressly forbidden things Squid *does* become the perpetrator
of any unwanted side effects, acceptance of abuse being the cause.


> 
>>> line parser. After getting a series of complaints and seeing admins
>>> adding more and more exceptional URL characters to the existing trunk
> 
>> I hope you can see the real problem is the admins approach to that
>> traffic.
> 
> We disagree on who to blame as the cause of the problem and who should
> be responsible for fixing it. I doubt you have any new arguments that
> can convince me that Squid should block benign traffic just because it
> violates an RFC.

problem != blame.

The blame lies with the client software.

The problem is the admin's choice to let the client get away with it
while punishing the victim (Squid) and forcing the rest of us Squid
users to put up with Squid silently accepting raw JSON or shell code
in URLs.


> 
>> AFAIK, the only non-permitted URI characters in current trunk are the
>> *binary* CTL characters. 
> 
> Sounds good. Could you define "binary CTL characters" so that I can use
> the same set in the patched code, please?

It's no longer a defined set in its own right. RFC 2396 defined it as
0x00-0x1F + 0x7F-0xFF, but nowadays it is more the disallowed hole left
in that area between the overlapping currently-permitted sets for
whitespace and whatever the transport or encoding protocols use as
delimiters.

For HTTP that hole is:  0x01-0x08 + 0x0E-0x1F + 0x7F-0xFF

Note that 0x80-0xFF are not disallowed, nor permitted either; they are
"reserved for future use" by the punycode people. UTF-8 URI support is
gradually being worked out, but is not there yet, so we protect the
space from squatters until it matures.
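
In code terms, a minimal check against that hole might look something
like this (plain C++ sketch using the ranges quoted above; not an
existing Squid function):

// Sketch: octets falling in the forbidden "hole" between the permitted
// whitespace and URI character sets. Ranges are those listed above;
// 0x80-0xFF treated as reserved and therefore rejected for now.
static bool inForbiddenHole(unsigned char c)
{
    return (c >= 0x01 && c <= 0x08)   // below HTAB
        || (c >= 0x0E && c <= 0x1F)   // above CR, below SP
        || (c >= 0x7F);               // DEL plus the reserved 0x80-0xFF range
}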


> 
>> Squid does not even check for correct
>> positioning of delimiters in tollerant parse at this stage of the
>> processing.
> 
> Bugs notwithstanding, the new parser does require delimiters between
> method, URI, and HTTP version.
> 

The entire set of excluded characters are delimiters in one protocol or
another, which may or may not be arriving on port 80/443 channels.

HTTP uses WspDelims, which varies between tolerant and strict parsing
and is also "required" by these same broken clients *within* the URI
field. So identifying which delimiter is a delimiter and which is an
un-escaped data character is the core of the problem. As you should
know well.

> 
>> Which makes me wonder exactly what these "more and more" extra
>> characters the admin are adding are? I hope you are referring to the
>> r14157 set, as opposed to people blindly trying to let those
>> request-line injection attacks Section 9.2 talks about work through
>> their proxies.
> 
> Section 9.2 does not talk about injection attacks; its scope is quite
> different from the subject of this thread.

Its scope is an intermediary (ie. Squid) interpreting the protocol in
non-compliant ways (ie. accepting an invalid URI) and as a result
performing unexpected and undefined behaviours (ie. converting an
invalid request into a valid request for something else). Particularly
when used in the form of MITM functionality (common for Squid).

Attacks are along the lines of accepting the Shellshock URL form and
%-escaping or mangling it in a way that gets it delivered through an
unsuspecting protection layer at the server end so it can harm the
server backend. That would not be possible if a) the client had sent
the same message to the server directly, or b) Squid rejected the
request as invalid HTTP.


> You may be thinking of
> Sections 9.4 and 9.5. I do not think the relaxed parser makes us more
> susceptible to CRLF injection attacks (9.4). As for request smuggling
> (9.5), I think that was possible before and is possible after the
> changes because it requires two agents to interpret the same request
> differently, something that will always be possible as long as there are
> malformed requests and buggy agents.

Them too. But as you say, those are more message framing issues. Your
point on LF detection being first avoids those.

>>> parser, I took the plunge and rewrote the existing code using the
>>> approach recently discussed on this list.
>>
>> This Http1::RequestParser has one purpose, and one only. To identify a
>> standards compliant (RFC 1945 or 7230) HTTP/1 request-line.
>>
>> All other inputs are undefined behaviour intended to be handled by other
>> Parsers. Some in future, some already existing (ie FTP).
> 
> I disagree that Http1::RequestParser purpose is to identify standards
> compliant HTTP/1 request-lines. Moreover, I doubt that a parser with
> that purpose is really needed in Squid (beyond academic exercises and
> experiments).

Then why did I bother calling it an HTTP/1 parser and documenting all
the protocol operations that it does?

Surely it's a whiz-bang-anything-goes-actuator. No?
 Perhaps I mistook it for an IRC protocol parser?
 Maybe "class RequestParser" is an HTTP *response* parser?

No. Its purpose is to identify and parse the HTTP request-line and to
identify the start and end of an (optional) mime-header that may
follow. We have other FooParser classes for other message types and
protocols. If this parser rejects something, it means it is not an
HTTP message. Try a different Parser.

>> Non-incremental will also re-introduce admin complaints about high CPU
>> consumption on traffic where they receive long URLs over a number of
>> very small packets. ie the traffic which is normal, valid but looks like
>> Slow Loris (or is actually).
> 
> AFAICT, the incremental parser in trunk does not parse URI fields
> incrementally so the proposed parser is not making the above situation
> worse. A quote from the trunk code:
> 
>   tok.reset(buf_);
>   if (tok.prefix(line, LfDelim) && tok.skip('\n')) { ... }
>   if (!tok.atEnd()) { ... }
>   debugs(74, 5, "Parser needs more data");
>   return 0;
> 
> The "..." code does not run while we accumulate the input buffer to find
> the end of the URI. Thus, we exit with what we came with until the whole
> URI is available. No incremental URI parsing.

The bit you quote is what you are seeking to make the *entire* parser
into: pull in the whole line before validating.

With a quick switch of relaxed_header_parser on, one goes to the other
parsing path, which (could) validate each character as it goes and just
glob the resulting buffer section into storage as the URL.
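
To make the incremental idea concrete, here is a rough standalone
sketch (plain C++, hypothetical names, not the Squid Tokenizer API) of
a scanner that remembers how far it has already validated, so repeated
parse calls over a slowly growing buffer only pay for the new bytes:

// Illustrative incremental URI-field scanner. validatedLen_ records how
// much of the growing buffer has already been checked, so a slow trickle
// of bytes costs O(new bytes) per call rather than O(total length).
#include <array>
#include <cstddef>
#include <string>

class UriFieldScanner
{
public:
    explicit UriFieldScanner(const std::array<bool, 256> &permit) :
        permit_(permit) {}

    // Returns false as soon as a forbidden octet is seen; otherwise keeps
    // going until the caller finds the terminating delimiter elsewhere.
    bool scanMore(const std::string &buf)
    {
        for (; validatedLen_ < buf.size(); ++validatedLen_) {
            const unsigned char c = buf[validatedLen_];
            if (!permit_[c])
                return false;
        }
        return true;
    }

    std::size_t validatedLen() const { return validatedLen_; }

private:
    const std::array<bool, 256> &permit_;
    std::size_t validatedLen_ = 0;
};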


> 
>> Note that real-world URLs have a *minimum* accepted limit of 4KB now,
>> and sightings have been measured with URL as high as 200KB in length
>> from some domains. Most origin servers wont accept >32KB though.
> 
> Yes, we must handle long URIs. I already have non-triaged bug reports
> that trunk no longer accepts them. That is a separate issue though.
> 

One that I would like to get to the bottom of. Because old Squid
accepted 0-65534 byte URLs. Current trunk accepts 0-65524 byte URLs.
Hardly a big change in length, and neither actually includes the magic
65535 / 64KB number.


BTW, that length reduction is directly due to the 10-byte size of
" HTTP/1.1\r", which has to be pulled into a 64KB buffer then
reverse-parsed out again by the relaxed parsing mode. Strict parse
accepts the full 65535 bytes in just the URI field.

If we add a 32-byte method in the line buffer too ...

Maybe it's the long-polling people, sending "GET" followed by KB of SP
characters one per second and then trying to fit a long URL in as well.

Old Squid used to not count the OSP against the request message size.


>> * I believe we can retain incremental parsing despite your changes here.
>> As noted above its use was a fairly major protection against some very
>> trivial attack vectors.
>>  - I agree the current incremental implementation is not optimal, but
>> rather fix than removing completely IMO.
> 
> The old incremental parsing was not offering us anything worth
> preserving it for. I agree that it is possible to add *useful*
> incremental parsing. That work should be driven by specific scenarios we
> want to address though.

It was offering the full 65535-byte URI field limit, as mentioned above.



Amos

