[squid-dev] [PATCH] Simpler and more robust request line parser

Alex Rousskov rousskov at measurement-factory.com
Thu Jul 23 23:13:55 UTC 2015


On 07/23/2015 08:16 AM, Amos Jeffries wrote:
> Please note: most of my comments in this long email are informational.
> Things which need to be taken into consideration when touching the parser.

Noted. I will do my best to treat them as such.


> I request performance testing of the final version of this change vs
> latest trunk at the time.

Noted.

The next patch will address comments not discussed in this email. Thank
you for those!

The rest of this email is the usual waste of time discussing
non-existent bugs, arguing about mostly irrelevant issues, burning straw
men, and talking past each other. The only important bit of information
is probably the following paragraph that I will quote here for those who
do not want to waste their time reading everything:

"I am happy to restrict URIs to what older Squids accepted or more.
Recent parser changes broke Squid. There is no need to go beyond what we
were doing before those changes. I will gladly make a simple patch
change to restrict URIs to either that old subset or to the current
trunk subset, whichever is larger. Exactly what subset do you want me to
use?"



> If possible an emulated Slow-Loris DoS attack of up to 1MB of random
> printable ASCII and whitespace characters drip-fed in 1-5 bytes each 15
> sec (followed by 1 LF) would be nice too. Avg socket close rate from
> Squid end is what matters. Though that's a wishlist since I've not done
> such on Squid since ~2010.

In such a test, Squid's connection closing time is determined by the
minimum of

  * the mean time until the first LF appears on the connection, and
  * the configured request waiting timeout.

Both parameters are outside Squid's control. By varying the number of
concurrent connections, I can demonstrate virtually any "socket close
rate from Squid end". What would such a test tell us?

Needless to say, an implementation that validates partially received
request lines will close random garbage connections earlier, on average.
There is no need to test that.


> On 23/07/2015 4:05 p.m., Alex Rousskov wrote:
>> Hello,
>>
>>     The attached patch implements a simpler and more robust HTTP request
> 
> It is less robust as many potential attacks are now accepted by the
> Parser as valid HTTP messages. 

The parser is more robust because it correctly accepts a larger variety
of benign messages (see Robustness Principle). That necessarily includes
a larger variety of "potential attack messages", of course, because
"potential attack" is such an open-ended term.

FWIW, it would be perfectly reasonable to enhance relaxed_header_parser
(or add a new relaxed_uri_parser directive) to make URI parsing
restrictions more flexible than the virtually-all-or-virtually-nothing
approach used today. I would not be surprised if we do see use cases
encouraging such changes. The patched code makes them easier.
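For example, a rough sketch of what such flexibility could look like,
reusing the UriCharacters() idea from the patch; StrictUriChars() and
RelaxedUriChars() are illustrative names only, and relaxed_uri_parser
remains a purely hypothetical directive:

  // sketch only: pick the accepted URI character set from configuration;
  // a future relaxed_uri_parser directive could replace this check
  static const CharacterSet &
  UriCharacters()
  {
      if (Config.onoff.relaxed_header_parser)
          return RelaxedUriChars(); // broad set: printable ASCII plus whitespace
      return StrictUriChars();      // RFC 3986 characters only
  }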


> Squid may "work" as a consequence only because this code was the primary
> security protection layer. Internal parsers later on (if any) are
> sometimes less robust and picky.

I agree that inner layers may receive URIs they have never seen before
and that those URIs may crash those layers. However, they did receive
quite a variety before recent parser restrictions were added, and we
cannot fix bugs [in those other layers] that we do not know about.


> If this Parser is made extremely generic the latter parsing logic is
> going to have to change to find these same problems, which will only
> make Squid reject the same non-compliant traffic - after much wasted
> processing of the message.

This is too general/vague to argue about: Sure, if some inner code
always rejects URI X, it might make sense to reject X in the parser, but
doing so is not _always_ better from both code quality and performance
points of view. Divide and conquer goes a long way in both directions.


> In these threads we have had around parsing I am getting the impression
> you think of the HTTPbis WG documents as restrictive arbitrary committee
> designs.

If we are talking about RFCs defining HTTP/1, then I think of them as
protocol specifications. Characterizing a protocol specification written
by a committee as "restrictive arbitrary committee design" does not
really help us do something useful AFAICT.

As an intermediary operating in an imperfect world, Squid has to achieve
the right balance of what to accept and what to send. The IETF community has
struggled with finding that balance for a very long time, and I do not
think that struggle is over. Just because there is a real need to accept
some protocol-violating messages does not mean that HTTP RFCs should be
ignored or belittled, of course.


> On the contrary, they are explicitly tested documents based on what the
> real-world traffic has been actively observed doing over the past 10
> years.

Very true! However, there are things that a protocol specification does
not and cannot cover. Looking at the specs alone is a doomed approach,
as illustrated by the restricted parser disaster.

I have no problems with Squid identifying protocol-violating messages if
there is some practical need for that. However, my focus is on what to do
with the received message. Blocking *all* protocol-violating messages is
not practical in most environments. Thus, we must factor things that are
not documented in the RFCs into the equation and occasionally even violate
those RFCs.


>  Where the real-world traffic was tracked down to malicious behaviour or
> causing indirect vulnerability exposure it was explicitly forbidden.
> Introducing things like removal of the URI user-info segment, explicit
> single-digit HTTP version, forbidding of whitespace after "HTTP/1.1"
> token, single SP delimiter in 'strict' parse, etc.

And I think Squid follows (or should follow) those protocol changes [at
least up to a point they start causing interoperability problems].


>> line parser. After getting a series of complaints and seeing admins
>> adding more and more exceptional URL characters to the existing trunk

> I hope you can see the real problem is the admins approach to that
> traffic.

We disagree on who to blame as the cause of the problem and who should
be responsible for fixing it. I doubt you have any new arguments that
can convince me that Squid should block benign traffic just because it
violates an RFC.


> AFAIK, the only non-permitted URI characters in current trunk are the
> *binary* CTL characters. 

Sounds good. Could you define "binary CTL characters" so that I can use
the same set in the patched code, please?


> Squid does not even check for correct
> positioning of delimiters in tolerant parse at this stage of the
> processing.

Bugs notwithstanding, the new parser does require delimiters between
method, URI, and HTTP version.


> Which makes me wonder exactly what these "more and more" extra
> characters the admin are adding are? I hope you are referring to the
> r14157 set, as opposed to people blindly trying to let those
> request-line injection attacks Section 9.2 talks about work through
> their proxies.

Section 9.2 does not talk about injection attacks; its scope is quite
different from the subject of this thread. You may be thinking of
Sections 9.4 and 9.5. I do not think the relaxed parser makes us more
susceptible to CRLF injection attacks (9.4). As for request smuggling
(9.5), I think that was possible before and is possible after the
changes because it requires two agents to interpret the same request
differently, something that will always be possible as long as there are
malformed requests and buggy agents.

That said, if there is a known attack vector that we can block without
making things overall worse, then we should block it. Most likely, that
work should be done outside the parser though.


>> parser, I took the plunge and rewrote the existing code using the
>> approach recently discussed on this list.
> 
> This Http1::RequestParser has one purpose, and one only. To identify a
> standards compliant (RFC 1945 or 7230) HTTP/1 request-line.
> 
> All other inputs are undefined behaviour intended to be handled by other
> Parsers. Some in future, some already existing (ie FTP).

I disagree that Http1::RequestParser's purpose is to identify
standards-compliant HTTP/1 request-lines. Moreover, I doubt that a parser
with that purpose is really needed in Squid (beyond academic exercises and
experiments).


>> The primary changes are: Removed incremental parsing and revised parsing
>> sequence to accept virtually any URI (by default and also configurable
>> as before).

> Firstly, I note that you still have incremental parsing in the code
> itself. Just not for fields internal to the request-line.

The patch is about the request-line parser, so the description seems
accurate to me. *That* parser is no longer incremental -- it does not preserve
any state while returning the "need more data" result. Are you talking
about some other parsers that this patch does not touch?


> The result of this change is that Squid will become vulnerable to even
> the most trivial of "Slow Loris" attacks.

No, not the most trivial (random garbage). As for the sophisticated
"Slow Loris" attacks, they are available for any parser that accepts
non-trivial inputs. The defenses against them do not belong in the
request-line parser.


> Of course Squid
> without input validation will work faster on most/compliant traffic than
> one that spends time validating inputs.

With the exception of HTTP method _length_ limit (which will be
restored), the proposed code does not remove validation. For the URI,
the validation changes what characters are acceptable. We can easily
adjust that set as discussed below.


> Squid which re-parse the message from scratch on every byte received are
> vulnerable to Slow Loris even in common compliant-HTTP message parsing.

So is virtually any parser, including your incremental parser. Virtually
all parsers have to re-parse elements from scratch. It is possible to
avoid that re-parsing completely, but the complexity is usually not
worth the gains. If you want to defend against real Slow Loris attacks,
the defense has to lie elsewhere.


> .. and much more vulnerable Squid. Whole categories of attack against
> both Squid and its backends are now tolerated when they were rejected or
> greatly restricted earlier.


I am happy to restrict URIs to what older Squids accepted or more.
Recent parser changes broke Squid. There is no need to go beyond what we
were doing before those changes. I will gladly make a simple patch
change to restrict URIs to either that old subset or to the current
trunk subset, whichever is larger. Exactly what subset do you want me to
use?

To minimize iterations, please use character ranges or similar precise
definitions instead of "binary characters" and such. Thank you.
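To illustrate the kind of precision I mean, here is a sketch assuming
Squid's CharacterSet API (add()/addRange()); the ranges below are only an
example to react to, not a proposal:

  // sketch of a precise "relaxed URI characters" definition
  static CharacterSet
  BuildRelaxedUriChars()
  {
      CharacterSet chars("relaxed-URI", "");
      chars.addRange(0x21, 0x7e); // printable ASCII, excluding SP
      chars.add(0x09);            // HTAB, if whitespace inside URIs is tolerated
      // deliberately excludes CTLs 0x00-0x08, 0x0A-0x1F, and DEL (0x7f)
      return chars;
  }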


>> * Some unit test case adjustments.
>> * The new parser no longer treats some request lines ending with
>>   "HTTP/1.1" as HTTP/0.9 requests for URIs that end with "HTTP/1.1".
> 
> Following 'random garbage' with "HTTP/1.1" does not make it HTTP compliant.

Agreed.


> This adds direct and explicit violations of RFCs 1945, 3986 and 7230.

As you know, some of those violations existed before. I have no desire
to add any new violations compared to the old code that worked OK.


>> Also removed hard-coded 16-character method length limit because I did
>> not know why that limit was needed _and_ there was a TODO to make it
>> configurable. Please let me know if there was a good reason for that
>> limit, and I will restore it.

> It is there to restrict damage from active malware in HTTP/1. Namely
> Slow Loris, random method cache busting and use of method names as an
> open botnet C&C channel.

I do not know what the last one means, but the other two you listed do
not seem to be relevant in this context.

I will add a simple 32-character method acceptance limit (per your other
comment) to avoid prolonging this discussion. If you know of a method name
that already needs more than 16 characters, please post it so that I can
mention it in a code comment.
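Something along these lines (a sketch only; methodBytes is an illustrative
name, and the rejection status is whatever we settle on):

  // sketch of the early method-length check
  static const size_t MaxMethodLength = 32;
  if (methodBytes.length() > MaxMethodLength) {
      parseStatusCode = Http::scNotImplemented; // or another agreed-upon status
      return false; // reject now instead of waiting for more input
  }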


> Non-incremental will also re-introduce admin complaints about high CPU
> consumption on traffic where they receive long URLs over a number of
> very small packets, i.e., traffic which is normal and valid but looks like
> Slow Loris (or actually is).

AFAICT, the incremental parser in trunk does not parse URI fields
incrementally, so the proposed parser is not making the above situation
worse. A quote from the trunk code:

  tok.reset(buf_);
  if (tok.prefix(line, LfDelim) && tok.skip('\n')) { ... }
  if (!tok.atEnd()) { ... }
  debugs(74, 5, "Parser needs more data");
  return 0;

The "..." code does not run while we accumulate the input buffer to find
the end of the URI. Thus, we exit with what we came with until the whole
URI is available. No incremental URI parsing.


> Note that real-world URLs have a *minimum* accepted limit of 4KB now,
> and sightings have been measured with URLs as high as 200KB in length
> from some domains. Most origin servers won't accept >32KB though.

Yes, we must handle long URIs. I already have non-triaged bug reports
that trunk no longer accepts them. That is a separate issue though.



> Consider Squid parsing the pipeline of requests which are delivered in a
> single packet, and thus all exists in buffer simultaneously:
> "
> GET http://example.com/ HTTP/1.1\n
> \n
> GET http://attacker.invalid/ HTTP/1.1\n
> X-Custom: HTTP/1.1\n
> "
> 
> All Squid default installations use relaxed parse with violations
> enabled. The new tok.prefix(uriFound, UriCharacters()) will treat this
> all as a single request-line:
>   "GET http://example.com/...attacker.invalid/ HTTP/1.1" label.


No, the code does not work like that. The URI is not parsed until Squid
isolates: the first request line (required), the request method inside
that request line (required), the protocol version inside that request
line (absent in case of HTTP/0), and related delimiters.

In your particular example, the URI parsing code you are worried about
will be called with "http://example.com/" characters.
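Roughly, the order of operations is the following (a sketch with
hypothetical helper names, not the literal patch code):

  SBuf line;
  if (!findRequestLine(line))        // stops at the first bare LF in the buffer
      return 0;                      // need more data; no URI parsing yet
  // line is now "GET http://example.com/ HTTP/1.1"
  SBuf method, version;
  if (!removeMethod(line, method))   // "GET" plus its trailing delimiter
      return -1;
  removeVersion(line, version);      // "HTTP/1.1" plus its leading delimiter
                                     // (absent for HTTP/0.9)
  const SBuf uri = line;             // what remains: "http://example.com/"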


>> * Accept virtually any URI (when allowed).
>>
>> 1. USE_HTTP_VIOLATIONS now allow any URI characters except spaces.
> 
> Except? Ironic.

No, just controlled by a different setting -- relaxed_header_parser. The
code is trying to preserve the existing split between
USE_HTTP_VIOLATIONS and relaxed_header_parser scopes, nothing more. I
think you interpret that [undocumented?] split differently, and I will
try to adjust the patch to match _your_ interpretation.


> In any case we can't accept bare LF within URL for practical issues
> around finding the end of line.

True. The LF character will never get to the URI parser in the proposed
implementation. While the URI parsing code accepts LF, nothing actually
feeds LF to that code. That will be further restricted in the next patch
version as we tighten the relaxed character set as discussed elsewhere.


>> 2. relaxed_header_parser allows spaces in URIs.

> By "spaces" you mean the extended RFC 7230 compliant whitespace. Good.


>> URIs in what Squid considers an HTTP/0.9 request obey the same rules.
>> Whether the rules should differ for HTTP/0 is debatable, but the current
>> implementation is the simplest possible one, and the code makes it easy
>> to add complex rules.

> Which means we can no longer claim RFC 1945 compliance. 

We can unless USE_HTTP_VIOLATIONS or Config.onoff.relaxed_header_parser
is enabled AFAICT. If somebody really wants RFC 1945 compliance, they
should just configure Squid accordingly. Not that there is any practical
value in doing that, but the patch attempts to preserve that possibility
AFAICT.

This disagreement may be related to how we interpret the split between
USE_HTTP_VIOLATIONS and Config.onoff.relaxed_header_parser
responsibilities. Again, I will do my best to adjust the code to match
your interpretation, which might resolve this issue as well.


> The old RFC is
> remarkably restrictive. MUST level requirements on individual characters
> in the URL (via RFC 1738), their positioning, and method / prefix "GET "
> (with exactly one explicit SP character delimiter required).

Squid might accept some invalid HTTP/0 requests even when configured to
be strict, but I do not know of any specific cases (other than those
that are HTTP/1 requests). Since the whole strict HTTP/0 compliance
concern is irrelevant for real HTTP traffic (including real HTTP/0
traffic!), I really doubt we should waste time digging any deeper here.


> NP: RFC 7230 formally dropped the requirement to also support RFC 1945
> HTTP/0.9 simple-request syntax. It has not been seen in any of the
> traffic measured by HTTPbis WG in the past 5 years. So it is now optional
> for on-wire traffic parsing.
> 
> I left RFC 1945 compliance to reduce the unnecessary restrictions on
> what Squid accepts. Removing it alone will allow us to dramatically
> simplify the logic in the existing parser (~15% ?) without any of the
> other changes you are trying to get through in this patch.

Removing HTTP/0 support can simplify any parser. It would be especially
easy in the proposed parser, but let's not discuss that out-of-scope
change here.


> The strict parse not having to bother with 2-pass reverse parsing offers
> faster REST transaction latency.
> 
> As I mentioned when the topic of this re-write came up earlier, I
> request performance testing for this change. We lost a lot of
> performance optimizations in the last re-write and any further polishing
> needs to counter that one way or another.

Yes, of course. I realize that you will not let my rewrites cause
known performance regressions. The patched code's performance will be tested.


>> * Unit test cases adjustments.
>>
>> Removal of incremental request line parsing means that we should not
>> check parsed fields when parsing fails or has not completed yet.
>>
>> Some test cases made arguably weird decisions apparently to accommodate
>> the old parser. The expectations of those test cases are [more] natural now.
> 
> Which ones?

The examples I remember are the following two requests that test cases
said should be treated as valid HTTP/0 requests [with "HTTP/1.1" as a URI!]:

* GET HTTP/1.1
* GET  HTTP/1.1

They are now treated as invalid requests.


> * I believe we can retain incremental parsing despite your changes here.
> As noted above its use was a fairly major protection against some very
> trivial attack vectors.
>  - I agree the current incremental implementation is not optimal, but
> I would rather fix it than remove it completely IMO.

The old incremental parsing was not offering us anything worth
preserving. I agree that it is possible to add *useful*
incremental parsing. That work should be driven by specific scenarios we
want to address though.

I will add "early" method validation to address your other, valid
concerns. This validation is not tied to incremental parsing though.


> * RelaxedCharsPlusWhite includes LF character.
>  - this will result in some traffic parsing the entire Mime header
> portion of messages as part of the URL. see above for details.

No, it will not, as discussed above.


> in Http::One::RequestParser::skipDelimiter():
> 
> * please note how HTTP Parser uses HttpTokenizer for specific skip*()
> things like what this function does.

So does the patched code: skipDelimiter() is called with specific
skip*() results. This helps avoid code complications related to checking
how many delimiters are allowed depending on the relaxed_header_parser
setting...
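
In other words, the delimiter counting boils down to roughly this (a sketch
of the intent; the actual patch code may also record a parse status on
failure):

  bool
  Http::One::RequestParser::skipDelimiter(const size_t count)
  {
      if (count < 1)
          return false; // the required delimiter is missing
      if (count > 1 && !Config.onoff.relaxed_header_parser)
          return false; // strict parsing allows exactly one delimiter
      return true;
  }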


> * the "in method" -> "after method" change is incorrect IMO:
>  - the HTTP syntax has SP-delimiter as the end of method.

HTTP defines method syntax as

     method         = token

Everything else, including what may or may not come AFTER the method,
does not define the method.
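
For reference, RFC 7230 (Section 3.2.6) defines that token alphabet as:

     token  = 1*tchar
     tchar  = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
              "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA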

Note that this is just a wording correction to minimize confusion if the
test case fails (after detecting a perfectly fine method token).

These test cases have not changed except that I had to replace LF (0x0A)
with 0x16 so that the test case tests what it claims to test and not a
truncated request.

We already have another test case that tests request truncation, but I
can also restore the 0x0A case in that truncation group if you think
that truncation test is useful.


> * now that you have made both relaxed and strict parsers produce
> identical output for the "GET  HTTP/1.1\r\n" test case.
>  - please add a second test for the remaining case of "GET
> HTTP/9.9\r\n" 

OK.

>  - note the parser violates RFC 1945 by accepting two sequential SP
> characters as method delimiter. 

Only when allowed to do so. In strict mode, only one SP is allowed AFAICT.


> I'm happy to see this new code passes all the units with so few changes
> necessary to them. Though that does suggest we need a looped test to
> validate all CTL characters individually instead of just a single
> "GET\x16" method test.

Yes, the existing test cases are not concerned with URIs that were
broken in trunk (naturally!). The loop to test prohibited characters can
be added later, after we settle on the final relaxed character set.
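
When we get there, the loop could look something like this (a sketch; how
the request is fed to the parser and asserted on depends on the final test
harness):

  // iterate over the CTL range, skipping line terminators that have their
  // own truncation-oriented test cases
  for (unsigned char ctl = 0; ctl < 0x20; ++ctl) {
      if (ctl == '\r' || ctl == '\n')
          continue;
      std::string input = "GET /fo";
      input += static_cast<char>(ctl);
      input += "o HTTP/1.1\r\n\r\n";
      // feed input to Http1::RequestParser and expect rejection once the
      // final prohibited character set is agreed upon
  }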


Again, the specific change requests not discussed above will be
addressed with the next patch version.


Thank you,

Alex.


