[squid-dev] [PATCH] Simpler and more robust request line parser

Amos Jeffries squid3 at treenet.co.nz
Thu Jul 23 14:16:20 UTC 2015


Please note: most of my comments in this long email are informational -
things which need to be taken into consideration when touching the parser.

Audit proper follows at the end.

I request performance testing of the final version of this change vs the
latest trunk at the time.


If possible an emulated Slow-Loris DoS attack of up to 1MB of random
printable ASCII and whitespace characters, drip-fed 1-5 bytes every 15
seconds (followed by 1 LF), would be nice too. The average socket close
rate from the Squid end is what matters. Though that's a wishlist item
since I've not done such a test on Squid since ~2010.


On 23/07/2015 4:05 p.m., Alex Rousskov wrote:
> Hello,
> 
>     The attached patch implements a simpler and more robust HTTP request

It is less robust, as many potential attacks are now accepted by the
Parser as valid HTTP messages. These are implied in RFC 7230 section
9.2, though only the common attacks are called out there since there are
so many possibilities.

Squid may "work" as a consequence only because this code was the primary
security protection layer. The internal parsers later on (if any) are
sometimes less robust and less picky.

If this Parser is made extremely generic, the later parsing logic is
going to have to change to find these same problems, which will only
make Squid reject the same non-compliant traffic - after much wasted
processing of the message.


In the threads we have had around parsing I am getting the impression
you think of the HTTPbis WG documents as restrictive, arbitrary
committee designs.

On the contrary, they are explicitly tested documents based on what
real-world traffic has been actively observed doing over the past 10
years. HTTPbis has been luckier than most WGs in that regard: more
actual implementers are participating than academics.

 Where the real-world traffic was desirable the behaviour was extended
to be allowed. That introduced things like CR, VT, FF as 'tolerant'
request-line delimiters, use of '#' for #fragment in URLs, an explicit
minimum URL length of 4KB, a request method length limit, CR*CR
tolerance, etc.

 Where the real-world traffic was tracked down to malicious behaviour or
caused indirect vulnerability exposure it was explicitly forbidden. That
introduced things like removal of the URI user-info segment, an explicit
single-digit HTTP version, forbidding whitespace after the "HTTP/1.1"
token, a single SP delimiter in 'strict' parse, etc.

With that in mind while reading your next point ...

> line parser. After getting a series of complaints and seeing admins
> adding more and more exceptional URL characters to the existing trunk

... I hope you can see the real problem is the admins' approach to that
traffic.

They are not just allowing it through Squid (with whatever
interpretation problems we have). They are also dangerously exposing all
those non-Squid implementations with known vulnerabilities to how this
traffic gets *output* after the syntax interpretation by Squid.

I have caved with r14157 on the URI-internal CharacterSet we accept,
since I don't have as high confidence that the RFC 3986 changes match
the real world as I do in the HTTPbis documents. That may prove to have
been a mistake, since the key HTTPbis people had a lot of input to that
RFC publication too.


AFAIK, the only non-permitted URI characters in current trunk are the
*binary* CTL characters. Squid does not even check for correct
positioning of delimiters in the tolerant parse at this stage of the
processing.

Which makes me wonder exactly what these "more and more" extra
characters the admins are adding actually are. I hope you are referring
to the r14157 set, as opposed to people blindly trying to let the
request-line injection attacks Section 9.2 talks about work through
their proxies.


> parser, I took the plunge and rewrote the existing code using the
> approach recently discussed on this list.

This Http1::RequestParser has one purpose, and one only. To identify a
standards compliant (RFC 1945 or 7230) HTTP/1 request-line.

All other inputs are undefined behaviour intended to be handled by other
Parsers. Some in future, some already existing (ie FTP).


I have no problem with adding a Parser for handling non-HTTP garbage.
But pretending this traffic is HTTP is plain wrong. It makes more sense
to just TUNNEL the traffic instead of trying to apply a Parser to it.
Anyhow, we have been over that already.


> 
> The primary changes are: Removed incremental parsing and revised parsing
> sequence to accept virtually any URI (by default and also configurable
> as before).

Firstly, I note that you still have incremental parsing in the code
itself. Just not for fields internal to the request-line.


The result of this change is that Squid will become vulnerable to even
the most trivial of "Slow Loris" attacks.

The very old Squid used to pass the request-line buffer through this
first-stage parser, then apply a second-stage parser for sanity checks,
and a third-stage parser for character-set checking. One of those would
(relatively) quickly reject a whole attack message based on request-line
contents alone.

That was all merged down into the incremental Parser in the form of the
syntax and character-set validation you complain about. Of course a
Squid without input validation will work faster on most/compliant
traffic than one that spends time validating inputs.

A Squid which re-parses the message from scratch on every byte received
is vulnerable to Slow Loris even in common compliant-HTTP message
parsing. This was a minor problem for large 64KB-ish URLs, but not
significant enough to fix and issue advisories etc. Now that SBuf raises
Squid's input limit to 2MB just for the URL we would have to do
something about it. I prefer an incremental parse, which fixes both that
and the CPU load issues.
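To illustrate what I mean, a minimal standalone sketch (not Squid's
actual classes): an incremental parser keeps its scan position between
network reads, so a drip-fed request costs O(n) in total, instead of a
full re-scan of the accumulated buffer on every small read.

#include <cstddef>
#include <string>

class IncrementalLineFinder
{
public:
    // Append freshly received bytes; return true once a full line (LF) is present.
    bool feed(const std::string &bytes) {
        buf_ += bytes;
        // resume scanning where the previous call stopped, never from offset 0
        while (scanned_ < buf_.size()) {
            if (buf_[scanned_] == '\n')
                return true;
            ++scanned_;
        }
        return false; // need more data; already-scanned bytes are not re-examined
    }

private:
    std::string buf_;
    std::size_t scanned_ = 0; // parse position retained between network reads
};

A non-incremental parser effectively resets scanned_ to 0 on every
feed(), which is exactly the behaviour a Slow Loris drip exploits.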


> 
> Known side effects:
> 
> * Drastically simpler code.

... and a much more vulnerable Squid. Whole categories of attack against
both Squid and its backends are now tolerated where previously they were
rejected or greatly restricted.

+ side channel attacks on backends or cache via the method name, raised
from a 16-byte limit to arbitrary-length attack code.

+ cache poisoning via request pipeline. Great!

+ XSS attacks in both method name and URI. (Also adds shell-code
injection where previously it was just SQL injection.)

+ Shell-shock vulnerability exposure to previously protected backends.

+ increased SQL-injection exposure to backends via URL.

+ increased JSON-injection exposure to backends.

(Though r14157 allows these last few too, it was restricted so they
could not occur via POSTs or arbitrary methods.)



> * Some unit test case adjustments.
> * The new parser no longer treats some request lines ending with
>   "HTTP/1.1" as HTTP/0.9 requests for URIs that end with "HTTP/1.1".

Following 'random garbage' with "HTTP/1.1" does not make it HTTP compliant.

This adds direct and explicit violations of RFC 1945, 3986 and 7230.


> * The new parser no longer re-allocates character sets while parsing
>   each request.

Okay. I like that bit a lot. :-)


> 
> Also removed hard-coded 16-character method length limit because I did
> not know why that limit was needed _and_ there was a TODO to make it
> configurable. Please let me know if there was a good reason for that
> limit, and I will restore it.

It is there to restrict damage from active malware in HTTP/1. Namely
Slow Loris, random method cache busting and use of method names as an
open botnet C&C channel.

For backward compatibility it is an implementation-decided limit, but
some small-ish size does need to be imposed. The smaller it can be made
the better protected Squid is, so making it configurable makes sense.

FWIW: all HTTP and extension methods are now registered. At the time I
wrote the parser none exceeded 10 characters and it was looking like
none would be longer. I see someone has found an RFC that now pushes it
up to 17.



> 
> No changes to parsing HTTP header fields (a.k.a. the MIME block) were
> intended.
> 
> This patch requires "suffix parsing and skipping" patch from my previous
> email to the list. Like that patch, this one is based on trunk r14083. I
> will update to the current trunk if the patches are approved. Please let
> me know if you want to review the updated versions as well.

It needs to be compared against r14157 or later. Particularly since one
of your claims is that it works better for the traffic using the extra
characters where support was only added in r14157 ;-)

FWIW: all my statements regarding "current trunk" are referring to trunk
r14172.

That does not affect the code audit details, just the applicability of
some of your statements and claims vs my counter-points. I think I've
used "current trunk" where r14157 was relevant but may have missed some.

> 
> The patch changes are further detailed below.
> 
> 
> * Removal of incremental request line parsing.
> 
> Squid parsed the request line incrementally. That optimization was
> unnecessary:
>   - most request lines are short enough to fit into one network I/O,
>   - the long lines contain only a single long field (the URI), and
>   - the user code must not use incomplete parsing results anyway.
> 
> Incremental parsing made code much more complex and possibly slower than
> necessary.
> 
> The only place where incremental parsing of request lines potentially
> makes sense is the URI field itself, and only if we want to accept URIs
> exceeding request buffer capacity. Neither the old code, nor the
> simplified one do that right now.

With this patch as-is, method length will begin to affect that criterion
as well.

Non-incremental parsing will also re-introduce admin complaints about
high CPU consumption on traffic where they receive long URLs over a
number of very small packets - i.e. traffic which is normal and valid
but looks like Slow Loris (or actually is).

Note that real-world URLs have a *minimum* accepted limit of 4KB now,
and sightings have been measured with URLs as high as 200KB in length
from some domains. Most origin servers won't accept >32KB though.

Note also that with SBuf processing the natural limit of Squid buffering
is raised from 64KB to 2MB. We know of Gigabit installations where the
configured value is significantly higher than 64KB (old Squid forced it
down). With current trunk those installations would have been using
their actually-configured buffer sizes.


Also, a major bug:

Consider Squid parsing a pipeline of requests which are delivered in a
single packet, and thus all exist in the buffer simultaneously:
"
GET http://example.com/ HTTP/1.1\n
\n
GET http://attacker.invalid/ HTTP/1.1\n
X-Custom: HTTP/1.1\n
"

All Squid default installations use the relaxed parse with violations
enabled. The new tok.prefix(uriFound, UriCharacters()) will treat all of
this as a single request-line:
  "GET http://example.com/...attacker.invalid/ HTTP/1.1"

This has four nasty side effects:

1) request-injection vulnerability. Where the server named in the first
request now gets to reply with whatever it likes to the second request
and Squid will treat it as the response to whatever third request
arrived in the later packet(s) on the connection.

2) information leak vulnerability #1. All the client headers,
credentials, Cookies, or otherwise which would have been sent in the
first request's mime headers are passed in a "URL" which may be logged,
published or whatever (be it Squid's own log, or the example.com origin).

3) information leak vulnerability #2. All the client headers,
credentials, Cookies, or otherwise which would have been sent in the
second request's mime headers are passed to the server handling the
first request.

4) information leak vulnerability #3. The combined size of the two
requests may trigger Squid's URL-too-big error (or, if we are lucky,
invalid-request). The error page contains a copy of the "URL", resulting
in the mime headers for the first request being presented to components
inside the browser sandbox which otherwise may not have been able to
access them.

 - we already have recent complaints about this (4) being used by
advertising malware to scrape user credentials and session cookies.
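A rough standalone illustration of the mechanism (plain std::string and
find_first_of() standing in for the SBuf/Tokenizer/CharacterSet
machinery): once LF is admitted into the accepted URI character set, the
uri-prefix match runs straight through the end of the first request and
swallows the pipelined one.

#include <iostream>
#include <string>

// longest prefix of 'in' made only of bytes NOT listed in 'stopChars'
static std::string uriPrefix(const std::string &in, const std::string &stopChars)
{
    return in.substr(0, in.find_first_of(stopChars));
}

int main()
{
    // two pipelined requests arriving in one packet; URI scanning starts
    // after the "GET " method prefix has already been consumed
    const std::string afterMethod =
        "http://example.com/ HTTP/1.1\n"
        "\n"
        "GET http://attacker.invalid/ HTTP/1.1\n"
        "X-Custom: HTTP/1.1\n";

    // a character set that still excludes SP and LF: the URI stops where it should
    std::cout << "strict-ish: [" << uriPrefix(afterMethod, " \n") << "]\n";

    // a character set that admits SP and LF as URI bytes: the "URI" swallows
    // the blank line, the pipelined second request-line and its mime headers
    std::cout << "with LF:    [" << uriPrefix(afterMethod, "") << "]\n";
}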



> 
> * Accept virtually any URI (when allowed).
> 
> 1. USE_HTTP_VIOLATIONS now allow any URI characters except spaces.

Except? Ironic. SP (0x20) in a URL is the most common malformed-URI
character. Whenever violations are disabled or the strict parser is
tested, broken services using un-encoded SP in URLs are always the first
to be identified.

The only 0x00-0xFF character not actively seen in "URLs" in real-world
traffic is LF. And I'm not entirely sure about that either; it may have
been an artifact of LF use in the measurement tools HTTPbis used to
collect the dataset.
 In any case we can't accept bare LF within a URL, for practical issues
around finding the end of line.


NP: yes, it may sound like I'm agreeing with you on accepting any
characters. However I distinguish between characters which are seen
mainly in real-world *attack* traffic versus characters seen mainly in
real-world *user* traffic. That matters a lot, and admins complain about
blocking either in my experience.


> 2. relaxed_header_parser allows spaces in URIs.
> 

By "spaces" you mean the extended RFC 7230 compliant whitespace. Good.

> The combination of #1 and #2 allows virtually any URI that Squid can
> isolate by stripping method (prefix) and HTTP/version (suffix) off the
> request line. This approach allows Squid to forward slightly malformed
> (in numerous ways) URIs instead of misplacing on the Squid admin the
> burden of explaining why something does not work going through Squid but
> works fine when going directly or through another popular proxy (or
> through an older version of Squid!).

The current trunk accepts more real-world traffic malformations than
older Squid did:
 * CR and NUL characters are accepted within URLs.
 * CR, VT, FF are accepted as delimiters between request-line fields.
 * whole empty lines are accepted prior to any request.
 * a series of multiple CRs (0-N) is accepted in the line terminator
 * '#' is now accepted in URL

... and that is with --disable-http-violations.


After you convinced me to open the URI field to RFC 3986 violations
(trunk r14157), the only new restrictions in current trunk AFAIK are
method length and binary CTL codes.


> 
> URIs in what Squid considers an HTTP/0.9 request obey the same rules.
> Whether the rules should differ for HTTP/0 is debatable, but the current
> implementation is the simplest possible one, and the code makes it easy
> to add complex rules.

Which means we can no longer claim RFC 1945 compliance. The old RFC is
remarkably restrictive: MUST-level requirements on the individual
characters in the URL (via RFC 1738), their positioning, and the
method / "GET " prefix (with exactly one explicit SP character delimiter
required).


NP: RFC 7230 formally dropped the requirement to also support the RFC
1945 HTTP/0.9 simple-request syntax. It has not been seen in any of the
traffic measured by the HTTPbis WG in the past 5 years, so it is now
optional for on-wire traffic parsing.

I left RFC 1945 compliance in place to reduce unnecessary restrictions
on what Squid accepts. Removing it alone would allow us to dramatically
simplify the logic in the existing parser (~15%?) without any of the
other changes you are trying to get through in this patch.


> 
> * Code simplification.
> 
> RequestParser::parseRequestFirstLine() is now a simple sequence of
> sequential if statements. There is no longer a path dedicated for the
> [primarily academic?] strict parser. The decisions about parsing
> individual fields and delimiters are mostly isolated to the
> corresponding methods.

Strict parsing is popular in some circles outside academia - namely
anyone interested in reporting and fixing the issues promoted under the
description "watermelon metrics" a few years back.

CDN and Cloud providers with Squid frontends also have an interest,
since their RESTful services are explicitly bounded in regard to URI
needs and earlier input validation means less load on the backend
service. They are also in a prime position to fix internal logic
resulting in malformed URLs.

The strict parse not having to bother with the 2-pass reverse parsing
also offers lower REST transaction latency.

As I mentioned when the topic of this re-write came up earlier, I
request performance testing for this change. We lost a lot of
performance optimizations in the last re-write and any further polishing
needs to counter that one way or another.


> 
> * Unit test cases adjustments.
> 
> Removal of incremental request line parsing means that we should not
> check parsed fields when parsing fails or has not completed yet.
> 
> Some test cases made arguably weird decisions apparently to accommodate
> the old parser. The expectations of those test cases are [more] natural now.

Which ones?

The units were *all* testing actual real-world input cases or an
equivalent of real-world cases. Most of them are actually testing groups
of situations as spot-checks. I only marked those which seemed unusually
weird to me as "real world" to prevent others also feeling like removing
them.

Your definition of weird may vary and the set may be incomplete
regarding true weirdness. Case in point: the way Bing sends raw JSON
code in URLs is not tested for, and the URI variant of the shell-shock
malware PoC is not tested for.


> 
> I did not attempt to fix rampant test code duplication, but I did add
> optional (and disabled by default) debugging, to help pin-point failures
> to test sub-cases that CPPUNIT cannot see.
> 
> Changing request methods to "none" in test sub-cases with invalid input
> was not technically necessary because the new code ignores the method
> when parsing fails, but it may help whoever would decide to reduce test
> code duplication (by replacing hand-written expected outcomes for failed
> test cases with a constant assignment or function call).
> 




Now the audit...

* I believe we can retain incremental parsing despite your changes here.
As noted above its use was a fairly major protection against some very
trivial attack vectors.
 - I agree the current incremental implementation is not optimal, but I
would rather fix it than remove it completely IMO.


* please use doxygen '///' comments to document.

* the new CharacterSet methods seem like they should be moved to the
parent Http::Parser since they will also be useful for generic URI
parsing within both Request and Reply mime headers.
 - although note that I was hoping to follow up the bug 1961 changes with
a class URI parser that gets called by RequestParser to handle the URI
field properly, instead of the multi-pass parse we have for URI now. So
they may not exist long-term anyway.


* please also note that current trunk does not use ::Parser::Tokenizer
since r14113.


in Http::One::RequestParser::parseMethodField():

* assert() within the Parser is not acceptable.
 - If there is any possibility of that case happening it is a DoS
vulnerability in Squid - which MUST be handled cleanly with a parser error.

* maxMethodLength is a Security Considerations requirement of RFC 7230.
 - Part of the protections against the injection attacks implied by
section 9.2 (the category of method field injection vulnerabilities).
 - Also more specifically a Slow Loris attack protection.
 - we/admin get to decide what it is, but a limit should be kept.
Probably doubling it to 32 is better than 16 now?
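Something along these lines is all that is needed (a sketch with assumed
names, not the actual parser code): enforce the cap with a clean parse
error, never an assert().

#include <cctype>
#include <cstddef>
#include <string>

enum class ParseResult { NeedMore, Ok, Error };

// e.g. 32 octets; RFC 7230 leaves the exact cap to the implementation/admin
static const std::size_t maxMethodLength = 32;

static ParseResult parseMethod(const std::string &buf, std::string &method)
{
    std::size_t i = 0;
    while (i < buf.size() && buf[i] != ' ') {
        // token check kept trivial here: printable non-space US-ASCII only
        if (!std::isgraph(static_cast<unsigned char>(buf[i])))
            return ParseResult::Error;   // clean rejection, never assert()
        if (++i > maxMethodLength)
            return ParseResult::Error;   // Slow Loris / C&C-channel length cap
    }
    if (i >= buf.size())
        return ParseResult::NeedMore;    // SP delimiter not seen yet
    if (i == 0)
        return ParseResult::Error;       // empty method is never valid
    method = buf.substr(0, i);
    return ParseResult::Ok;
}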


in Http::One::RequestParser::parseUriField():

* changes to maxUriLength allow SquidString assertions to be
encountered. Sorry if the comment was not clear enough. The entire
request-line (method, URI, version) has to be stored in a 64KB String
object later for error page display, storage and/or ICP/HTCP query use, etc.
 - Also, now that the method has no length limit, the calculation for
methods approaching 64KB in length may produce a negative value for the
available/accepted URL size. This is another reason to restrict methods.
 - I know CoAdvisor does not like it. But the RFC it is testing for does
not actually say we have to support 64KB, just a minimum of 4KB.
Apparently the real-world server limit is 32KB for software other than
Squid anyway. So a 64KB - 36 bytes URI length should not be an issue.
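Roughly this kind of budget calculation (numbers and names assumed for
illustration) is what the comment was trying to describe, including the
guard against the subtraction going negative:

#include <cstddef>

static const std::size_t maxStringSize   = 64 * 1024;              // SquidString hard limit
static const std::size_t versionFieldMax = sizeof("HTTP/1.1") - 1;  // 8 octets
static const std::size_t delimiterCount  = 2;                       // the two SP separators

// how many octets of URI can still fit alongside a method of the given length
static std::size_t uriBudget(std::size_t methodLength)
{
    const std::size_t overhead = methodLength + versionFieldMax + delimiterCount;
    if (overhead >= maxStringSize)
        return 0;   // no room left: reject, rather than underflow to a huge size_t
    return maxStringSize - overhead;
}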


in Http::One::RequestParser::UriCharacters():

* strict parse should always output StrictChars regardless of the
violations build.
 - consider that config directive being OFF to mean the admin wants
strict compliant parsing, not partially-relaxed parsing.
 - violations are for affecting the relaxed parse mode, since it's just
extending the relaxation beyond what the standards permit.
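In other words, the selection logic I am asking for looks roughly like
this (hypothetical names, not the actual method):

enum class UriCharset { Strict, Relaxed, RelaxedPlusViolations };

// strict parse always gets the RFC-compliant set; the violations build only
// ever widens the *relaxed* set, never the strict one
static UriCharset uriCharacters(bool relaxedParser, bool violationsBuilt)
{
    if (!relaxedParser)
        return UriCharset::Strict;   // admin asked for strict compliant parsing
    return violationsBuilt ? UriCharset::RelaxedPlusViolations : UriCharset::Relaxed;
}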


* RelaxedCharsPlusWhite includes the LF character.
 - this will result in some traffic parsing the entire Mime header
portion of messages as part of the URL. See above for details.


in Http::One::RequestParser::parseHttpVersionField():

* "Rewrite if we ever _need_ to return 505"
 - this has already come to pass. HTTP/2 prefixes are already used on
port 443 and port 80 traffic by all software listed as supporting
"direct" HTTP/2 in
<https://github.com/http2/http2-spec/wiki/Implementations>. Probably
others as well, since that set includes frameworks like Perl
Protocol::HTTP2, not just specific applications.
 - the restriction of HTTP-version from N.N numbers to DIGIT.DIGIT
characters was another explicitly tested HTTPbis WG change. Accepting
anything looser would undo that.
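A standalone sketch of the kind of version check that restriction
implies, which also gives us the hook for a 505 on unsupported major
versions (not the actual parser code):

#include <cctype>
#include <string>

enum class VersionCheck { Ok, UnsupportedMajor, Malformed };

// exactly "HTTP/" DIGIT "." DIGIT -- no whitespace, no multi-digit numbers
static VersionCheck checkHttpVersion(const std::string &field)
{
    if (field.size() != 8 || field.compare(0, 5, "HTTP/") != 0)
        return VersionCheck::Malformed;
    if (!std::isdigit(static_cast<unsigned char>(field[5])) || field[6] != '.' ||
        !std::isdigit(static_cast<unsigned char>(field[7])))
        return VersionCheck::Malformed;
    // e.g. the HTTP/2 "PRI * HTTP/2.0" preface lands here: a 505 candidate
    return (field[5] == '1') ? VersionCheck::Ok : VersionCheck::UnsupportedMajor;
}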


in Http::One::RequestParser::skipDelimiter():

* please note how the HTTP Parser uses HttpTokenizer for specific
skip*() things like what this function does.


in Http::One::RequestParser::parseRequestFirstLine():

* the search for LF is done before method validation.
 - This has a major impact on real-world traffic handling:
  + Squid hangs when *any* protocol lacking a LF character in its client
request handshake uses port 80 or 443.
  + oh yeah, and Christos' "unknown protocol" handling feature of Squid-4
is now useless for those.
 - The existing code was VERY carefully crafted to validate the prefix
method+delimiter before trying to find the LF and reverse-parse the
remaining line.
 - validating the method first is also significantly faster for rejecting
binary protocols.
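A sketch of the ordering I mean (assumed names): check that the buffered
prefix even looks like "token SP" before spending any effort hunting for
the LF terminator, so clients that will never send an LF are rejected
from the bytes already in hand instead of being waited on indefinitely.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

static bool plausibleMethodPrefix(const std::string &buf, std::size_t maxMethodLength)
{
    const std::size_t limit = std::min(buf.size(), maxMethodLength + 1);
    for (std::size_t i = 0; i < limit; ++i) {
        if (buf[i] == ' ')
            return i > 0;               // saw "token SP": worth looking for LF now
        if (!std::isgraph(static_cast<unsigned char>(buf[i])))
            return false;               // binary byte: reject without needing any LF
    }
    // all printable so far; only keep waiting while still under the length cap
    return buf.size() <= maxMethodLength;
}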



in src/tests/testHttp1Parser.cc file:

* please do not use #ifdef.
 - use #if defined() for things we do not control, where mere existence
is sufficient.
 - use plain #if for things where the 0 vs 1 value matters.
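For example (HAVE_SOME_SYSTEM_HEADER and its header are placeholders):

// a macro we do not control, where mere existence is the signal
#if defined(HAVE_SOME_SYSTEM_HEADER)
#include <some_system_header.h>
#endif

// our own feature macros are always defined to 0 or 1, so test the value
#if USE_HTTP_VIOLATIONS
static const bool relaxedExtensionsBuilt = true;
#else
static const bool relaxedExtensionsBuilt = false;
#endif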

* please use std::endl instead of "\n" for debug line terminators
 - it ensures the proper portable newline syntax is used for the OS
being tested on.

* in testResults(), one FYI:
 - CPPUNIT macros catch asserts and exceptions for code run inside them.
This used to be useful for catching unexpected throws and asserts during
CPPUNIT_ASSERT_EQUAL(expect.parsed, output.parse(input))


* the "in method" -> "after method" change is incorrect IMO:
 - the HTTP syntax has the SP delimiter as the end of the method. An SP
delimiter is present in the test input. So the \0 byte and \x16 are
*within* the syntax pre-whitespace delimited area. Bytes in that area
are *all* called "method" in HTTP.

 In the real-world traffic the \0 case was derived from, the sender was
intending to send 3 bytes of method but incorrectly sent 4 and included
the method c-string terminator *inside* its emitted method.
 The real-world traffic the \x16 case is derived from includes both a
binary framing protocol with valid characters mistaken for a method, and
a DoS with the sender just delivering random characters. Both embedded
binary in random prefix/suffix positions in the method string. Again the
binary is inside the method field when interpreted as HTTP.

 - these tests are specifically doing triple duty: testing the case where
binary prefixed by "GET" may be confused with the GET method, the case of
binary in methods, and the case where binary is treated as whitespace.
The \x16 is the generic test, the \0 a special case for injection attack
detection and NUL handling correctness.

 - in summary: if you feel the test case needs another alphanumeric
letter after the binary code to suit your ideas of "in method" you can
add one. Otherwise this is a needless change.


* now that you have made both relaxed and strict parsers produce
identical output for the "GET  HTTP/1.1\r\n" test case:
 - please add a second test for the remaining case of "GET  HTTP/9.9\r\n",
which should now produce an HTTP/0.9 request for URI "HTTP/9.9" (relaxed)
or an invalid-request error (strict). A sketch follows below.
 - note the parser violates RFC 1945 by accepting two sequential SP
characters as the method delimiter. The standard defines exactly one,
with the second being part of the URI (yuck). I won't ask for fixing that.
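A rough sketch of that extra test (the member and config names here are
my guesses at the existing test file conventions; adjust to match):

{
    // relaxed parse: expect an HTTP/0.9 request whose URI is "HTTP/9.9"
    Config.onoff.relaxed_header_parser = 1;
    Http1::RequestParser relaxed;
    SBuf input("GET  HTTP/9.9\r\n");
    CPPUNIT_ASSERT_EQUAL(true, relaxed.parse(input));
    CPPUNIT_ASSERT_EQUAL(SBuf("HTTP/9.9"), relaxed.requestUri());

    // strict parse: the double SP delimiter makes this an invalid request
    Config.onoff.relaxed_header_parser = 0;
    Http1::RequestParser strict;
    CPPUNIT_ASSERT_EQUAL(false, strict.parse(input));
}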


* the too-long method case will need re-instating in an appropriate form
along with the method length limit.


I'm happy to see this new code passes all the units with so few changes
necessary to them. Though that does suggest we need a looped test to
validate all CTL characters individually instead of just a single
"GET\x16" method test.

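Something like this loop would cover them (again, the parser/assertion
names are guesses at the test conventions):

// every CTL byte in the method position must be rejected, not treated as
// method content or whitespace (NUL and the CR/LF terminators are covered
// by their own dedicated cases)
for (unsigned char ctl = 0x01; ctl < 0x20; ++ctl) {
    if (ctl == '\r' || ctl == '\n')
        continue;
    std::string raw("GET");
    raw += static_cast<char>(ctl);
    raw += " /path HTTP/1.1\r\n";
    Http1::RequestParser parser;
    CPPUNIT_ASSERT_EQUAL(false, parser.parse(SBuf(raw.c_str())));
}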

I think that is all.

Amos

