[squid-dev] [PATCH] HTTP Parser upgrade

Wed Oct 15 09:34:25 UTC 2014

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/10/2014 10:19 a.m., Alex Rousskov wrote:
> On 10/11/2014 02:17 PM, Amos Jeffries wrote:
>> On 7/06/2014 4:55 a.m., Amos Jeffries wrote:
>>> Latest version of the Parser-NG project branch. Since last 
>>> message I have dropped the copy-constructor related pieces.
>> 
>>> Polygraph testing on the branch showed almost no performance 
>>> impact (+0.04ms/req, -0.03% request rate). Nevertheless a half
>>>  dozen points of added memory reallocate or data copy are
>>> marked for future removal as the Parser-NG and SBuf
>>> improvements continue.
> 
> Just to double check: Is that really 0.04ms or 40ms? The former is 
> beyond Polygraph precision (i.e. it is the same as "zero"). The
> latter is a significant overhead. I hope it is the former!

Yes the former (insignificant).

ParserNG (Oct 11th):
Mean response number:  1925.64
Mean response time:  103.36

trunk (Oct 11th):
Mean response number:  1926.03
Mean response time:  103.34

trunk (Oct )
Mean response number:  1926.19
Mean response time:  103.33

> 
>> + * \deprecated use SBuf constructor instead */ 
>> -HttpRequestMethod::HttpRequestMethod(char const *begin, char
>> const *end) : theMethod(Http::METHOD_NONE) 
>> +HttpRequestMethod::HttpRequestMethod(char const *begin) : 
>> theMethod(Http::METHOD_NONE)
> 
> Please rework this constructor into a HttpRequestMethodXXX()
> function call. Otherwise, it is very difficult to spot places where
> it is still being used, including in future patches. Since you are
> also changing the constructor profile, it should be possible to do
> that replacement directly in your patch, avoiding re-editing many
> files by hand.
> 
> This is especially important if this old HttpRequestMethod(char*)
> has a somewhat different semantics/compliance than the new 
> HttpRequestMethod(SBuf).
> 

Done. This is not quite as simple as suggested, the renaming causes it
to become a method instead of constructor, so code needs adjusting to
allocate HttpRequestMethod objects separately before calling it.

> 
>> -    method = HttpRequestMethod(start, t); +    SBuf m(start, 
>> start-t); // XXX: SBuf ctor allocates and data-copies.
>> performance regression. +    method = HttpRequestMethod(m);
> 
> 
> What is the purpose of introducing this SBuf performance
> regression?
> 

To avoid repeated strcspn(X, w_space) over the input buffer.

> * If the deprecated HttpRequestMethod constructor works fine, use
> that and avoid this regression while waiting for callers to pass in
> an SBuf.
> 
> * If the deprecated HttpRequestMethod constructor is not as good
> as the new HttpRequestMethod constructor for some reason (other
> than SBuf parameter itself), then should we re-implement the
> deprecated HttpRequestMethod constructor to create an SBuf and then
> call the new HttpRequestMethod constructor with that SBuf instead?
> 

It works fine. Done.

> 
>> + * or from a range of chars such as "GET" from "GETFOOBARBAZ" +
>> * + * Assumes the s parameter contains only the method string
> 
> This description is not clear, especially because GETFOOBARBAZ
> does not "contain only the method string" GET. The
> HttpRequestMethod() code does not seem to match the documented
> "GETFOOBARBAZ" example because it uses caseCmp(s) without any
> prefix size limits.
> 

s is expected to contain an SBuf whose contents is *only* the chars
forming the method (buf="GET", length=3) not the "FOOBARBAZ" chars
which follow in the I/O buffer.

This is assumed/needs to be true whether or not the chars which
preceed or follow are the expected whitespace delimiters or some
"FOOBARBAZ" fluff.

How else would you write that?

> 
>> +    // XXX: still check for missing method name?
> 
> Please rephrase. It is not clear what you are trying to ask here. 
> FWIW, the new code does handle the case of an empty method name as 
> well as a method name not matching any known header.
> 

That was it. Dropped.

> 
>> const char *t = start + strcspn(start, w_space);
> ...
>> -    start = t + strspn(t, w_space); +    start = t + strspn(t, 
>> w_space); // XXX: breaks if whitespace exists in URL
> 
> Wrong XXX? The code appears to skip the space found at t
> declaration time. That is, after the declaration, "t" should point
> to either a space character or, if there was not one in a URL, to a
> terminating zero character. The XXX seems to be wrong in either
> case.

Er yes. mistakes strspn() for strcspn(). corrected.

> 
>> +    SBuf url = hp->requestUri(); // use full provided URI if we 
>> abort +    do { // use a loop so we can break out of it + 
>> ::Parser::Tokenizer tok(url); +        if (tok.remaining()[0] == 
>> '/') +            break;
> 
> Are we guaranteed to have a non-empty URL at this point? If not, it
> is unsafe to dereference its first character.

Right. Using skip('/') instead with current Tokenizer guarantees.

> 
>> +    // reject URI which are not well-formed even after the 
>> processing above +    if (url[0] != '/') {
> 
> Similar concern here.
> 

Thats an old bug in dead code, but adding isEmpty() check just for
completeness when someone gets around to makeing it non-dead.

> 
>> -    char ipbuf[MAX_IPSTRLEN]; +    static char 
>> ipbuf[MAX_IPSTRLEN];
> 
> If you are changing this, please move the declaration down, where
> it is actually needed.
> 

Done.

> 
>> +        static const CharacterSet authority = 
>> CharacterSet("authority","-._~%:@[]!$&'()*+,;=") + + 
>> CharacterSet::HEXDIG + CharacterSet::ALPHA +
>> CharacterSet::DIGIT;
> 
> Syntax error (missing "+" before CharacterSet::HEXDIG)? If yes,
> does that imply that the above code has not been tested?

It is on the end of the preceeding line.

> 
>> +    const char *url = SBuf(hp->requestUri()).c_str();
> 
> Semantically, this returns a pointer inside the now-gone object
> (the temporary anonymous SBuf). Do not do that even if it usually
> "works" with the current SBuf implementation.

Okay, using a pair of local-scope variables instead.

> 
>> -    const bool isFtp = !hp; +    const bool isFtp = (hp ==
>> NULL);
> 
> Why this change? The existing code was correct and arguably more 
> "future proof" if we start using some kind of smart pointers for
> "hp".
> 

Just consistency between X==NULL and X!=NULL checks.

hp is already a smart pointer in the branch and being non-C++11 it
does not provide operator true(), just operator !(). So we end up with
a mix of if(!X) and if(X != NULL).

Reverting since it was the other way to start with.

> 
>> +    const HttpRequestMethod &method = !isFtp ? hp->method() : 
>> Http::METHOD_NONE; // XXX: or should this be GET ?
> 
> The answer is in the caller you modified, I think:
> 
>> -        clientProcessRequest(this, NULL /*parser*/, 
>> context.getRaw(), -                             request->method, 
>> request->http_ver); +        clientProcessRequest(this, 
>> Http1::RequestParserPointer(), context.getRaw());
> 
> 
> So it should be something like http->request->method or
> equivalent. For FTP, we already have a parsed request at this
> point... However, please double check that
> 
> a) you are actually using that new "method" variable -- it is 
> difficult to say by looking from the patch alone, but I see many
> cases where the old "method" is getting replaced with
> "hp->method()". Perhaps all of the uses are gone now?
> 
> b) the new "method" declaration is as close to the uses in (a) as 
> possible.
> 

Confirmed. It looks like the HTTP-only paths should all be using
hp->method(), the shared and FTP-only paths using request->method.

Removing method local.

> 
>> Http::ProtocolVersion http_ver; -        if (ClientSocketContext 
>> *context = parseOneRequest(http_ver)) { +        if 
>> (ClientSocketContext *context = parseOneRequest()) {
> 
> Do you still need the http_ver variable after removing the 
> corresponding parameter?
> 

No. Dropped.

> 
> * clientProcessRequest() no longer consumes parsed input, right?
> Where does that consumption happen now?
> 

Inside parseOneRequest().

 * HttpServer syncs the I/O buffer to the Tokenizer in
parseHttpRequest() after each parse attempt.

 * FtpServer uses byte counting by the Tokenizer to manage them
separately with consumeInput().

> 
>> +     * BUG: limits output to just 1KB when Squid accepts up to 
>> 64KB line length. * -     * \retval -1  an error occurred. 
>> request_parse_status indicates HTTP status result. -     *
>> \retval 1  successful parse. member fields contain the
>> request-line items - * \retval  0  more data is needed to
>> complete the parse +     * \return A pointer to a field-value of
>> the first matching field-name, or NULL.
> 
> In what sense does the Parser limit the output? It is documented
> to return a pointer to the field-value. Are we truncating that
> field-value?
> 

Copying it into a static array of size 1KB.

...
// arbitrary maximum-length for headers which can be found by
Http1Parser::getHeaderField()
#define GET_HDR_SZ      1024

...
    LOCAL_ARRAY(char, header, GET_HDR_SZ);

For this patching I have left that old limit in place. It appears to
be only used for a few things like Host: header value fetching in the
prepare*URL() functions, so not a problem for most users.

> 
> * Http1::Parser class has virtual methods but is missing a virtual 
> destructor.
> 

Gah. C++11'ism. Default d'tor can be assumed virtual and the children
provide the virtuals.

> 
>> +    /// raw copy of the origina client reqeust-line URI field
> 
> "origina" typo
> 

Fixed.

> 
>> +    Parser() : parsingStage_(HTTP_PARSE_NONE) {}
> 
>> +    RequestParser() : Parser() {clear();} +    virtual void 
>> clear();
> 
> 
>> +Http::One::Parser::clear() { -    clear(); // empty the state.
>> - state = HTTP_PARSE_NEW; -    buf = aBuf; -    bufsiz = len; - 
>> debugs(74, 5, HERE << "Request buffer is " << buf); + 
>> parsingStage_ = HTTP_PARSE_NONE; +    buf_ = NULL; + msgProtocol_
>> = AnyP::ProtocolVersion(); + mimeHeaderBlock_.clear(); }
> 
> 
>> +void +Http::One::RequestParser::clear() +{ + 
>> Http1::Parser::clear(); + +    request_parse_status = 
>> Http::scNone; +    req.start = req.end = -1; +    req.m_start = 
>> req.m_end = -1; +    req.u_start = req.u_end = -1; +
>> req.v_start = req.v_end = -1; +    method_ = HttpRequestMethod();
>> +}
> 
> 
> The above is OK. If you want to polish it further to avoid calling 
> virtual methods from constructors (which is bad style), make these 
> changes:
> 
> 1. Initialize all members in each constructor as needed, without
> calling clear().
> 
> 2. Make Parser::clear() pure, without a body.
> 
> 3. In RequestParser::clear(), just do:
> 
> *this = RequestParser();
> 

I'm never sure during this sequence if complex types like SBuf need an
explicit call to their constructor at step (1) to free old
smart-pointer values or not.

Trying it this way.

> 
> For the proposed commit message, please add an brief explanation
> for "why we are doing this", in addition to the already provided
> "what we are doing" summary.
> 
> And please make sure the commit message explicitly says that we
> are adding a few potential performance regressions with this
> change.

Is that a +1 considering these are mostly cosmetic?

Amos

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

iQEcBAEBAgAGBQJUPj+hAAoJELJo5wb/XPRjG+8H/06I959LgxkJtMCq08npL/p6
+W8fsz7SWh5C6NwFzTCyonF6CCwQF7RBWxbXMaDq/RknMSyYGy7pwn+7JhNQz9TM
16I3HnP8cSheL0TDub8DlVHVIqiKW0gE12Z4F86j21YO/EqkGJuK7sZoXoePPu7w
2eMW4xg1NjsS0dbD1HJp1BRw3QUkZg30oenWhWRQbpWn3lTlEXPKOXZJxf/eILEN
q8yX9BYmTVM7Zt9ZNlU0lIfzhXTwZkaibZ4+MNpqfjyjLhVYZ77leY5Sn4xCVlr1
ERW+u4ap8HeB4oDMoP/GRchuF9GqTlSjx6hvxmWMtbxjaISUO0W7Twz5XVTfJLs=
=uro/
-----END PGP SIGNATURE-----