[squid-dev] [PATCH] HTTP Parser upgrade

Tue Oct 14 21:19:43 UTC 2014

On 10/11/2014 02:17 PM, Amos Jeffries wrote:
> On 7/06/2014 4:55 a.m., Amos Jeffries wrote:
>> Latest version of the Parser-NG project branch. Since last
>> message I have dropped the copy-constructor related pieces.
> 
>> Polygraph testing on the branch showed almost no performance 
>> impact (+0.04ms/req, -0.03% request rate). Nevertheless a half 
>> dozen points of added memory reallocate or data copy are marked
>> for future removal as the Parser-NG and SBuf improvements
>> continue.

Just to double check: Is that really 0.04ms or 40ms? The former is
beyond Polygraph precision (i.e. it is the same as "zero"). The latter
is a significant overhead. I hope it is the former!

> + * \deprecated use SBuf constructor instead */ 
> -HttpRequestMethod::HttpRequestMethod(char const *begin, char const
> *end) : theMethod(Http::METHOD_NONE) 
> +HttpRequestMethod::HttpRequestMethod(char const *begin) :
> theMethod(Http::METHOD_NONE)

Please rework this constructor into a HttpRequestMethodXXX() function
call. Otherwise, it is very difficult to spot places where it is still
being used, including in future patches. Since you are also changing
the constructor profile, it should be possible to do that replacement
directly in your patch, avoiding re-editing many files by hand.

This is especially important if this old HttpRequestMethod(char*) has
a somewhat different semantics/compliance than the new
HttpRequestMethod(SBuf).

> -    method = HttpRequestMethod(start, t); +    SBuf m(start,
> start-t); // XXX: SBuf ctor allocates and data-copies. performance
> regression. +    method = HttpRequestMethod(m);

What is the purpose of introducing this SBuf performance regression?

* If the deprecated HttpRequestMethod constructor works fine, use that
and avoid this regression while waiting for callers to pass in an SBuf.

* If the deprecated HttpRequestMethod constructor is not as good as
the new HttpRequestMethod constructor for some reason (other than SBuf
parameter itself), then should we re-implement the deprecated
HttpRequestMethod constructor to create an SBuf and then call the new
HttpRequestMethod constructor with that SBuf instead?

> + * or from a range of chars such as "GET" from "GETFOOBARBAZ" + * 
> + * Assumes the s parameter contains only the method string

This description is not clear, especially because GETFOOBARBAZ does
not "contain only the method string" GET. The HttpRequestMethod() code
does not seem to match the documented "GETFOOBARBAZ" example because
it uses caseCmp(s) without any prefix size limits.

> +    // XXX: still check for missing method name?

Please rephrase. It is not clear what you are trying to ask here.
FWIW, the new code does handle the case of an empty method name as
well as a method name not matching any known header.

> const char *t = start + strcspn(start, w_space);
...
> -    start = t + strspn(t, w_space); +    start = t + strspn(t,
> w_space); // XXX: breaks if whitespace exists in URL

Wrong XXX? The code appears to skip the space found at t declaration
time. That is, after the declaration, "t" should point to either a
space character or, if there was not one in a URL, to a terminating
zero character. The XXX seems to be wrong in either case.

> +    SBuf url = hp->requestUri(); // use full provided URI if we
> abort +    do { // use a loop so we can break out of it +
> ::Parser::Tokenizer tok(url); +        if (tok.remaining()[0] ==
> '/') +            break;

Are we guaranteed to have a non-empty URL at this point? If not, it is
unsafe to dereference its first character.

> +    // reject URI which are not well-formed even after the
> processing above +    if (url[0] != '/') {

Similar concern here.

> -    char ipbuf[MAX_IPSTRLEN]; +    static char
> ipbuf[MAX_IPSTRLEN];

If you are changing this, please move the declaration down, where it
is actually needed.

> +        static const CharacterSet authority =
> CharacterSet("authority","-._~%:@[]!$&'()*+,;=") + +
> CharacterSet::HEXDIG + CharacterSet::ALPHA + CharacterSet::DIGIT;

Syntax error (missing "+" before CharacterSet::HEXDIG)? If yes, does
that imply that the above code has not been tested?

> +    const char *url = SBuf(hp->requestUri()).c_str();

Semantically, this returns a pointer inside the now-gone object (the
temporary anonymous SBuf). Do not do that even if it usually "works"
with the current SBuf implementation.

> -    const bool isFtp = !hp; +    const bool isFtp = (hp == NULL);

Why this change? The existing code was correct and arguably more
"future proof" if we start using some kind of smart pointers for "hp".

> +    const HttpRequestMethod &method = !isFtp ? hp->method() :
> Http::METHOD_NONE; // XXX: or should this be GET ?

The answer is in the caller you modified, I think:

> -        clientProcessRequest(this, NULL /*parser*/,
> context.getRaw(), -                             request->method,
> request->http_ver); +        clientProcessRequest(this,
> Http1::RequestParserPointer(), context.getRaw());

So it should be something like http->request->method or equivalent.
For FTP, we already have a parsed request at this point... However,
please double check that

a) you are actually using that new "method" variable -- it is
difficult to say by looking from the patch alone, but I see many cases
where the old "method" is getting replaced with "hp->method()".
Perhaps all of the uses are gone now?

b) the new "method" declaration is as close to the uses in (a) as
possible.

> Http::ProtocolVersion http_ver; -        if (ClientSocketContext
> *context = parseOneRequest(http_ver)) { +        if
> (ClientSocketContext *context = parseOneRequest()) {

Do you still need the http_ver variable after removing the
corresponding parameter?

* clientProcessRequest() no longer consumes parsed input, right? Where
does that consumption happen now?

> +     * BUG: limits output to just 1KB when Squid accepts up to
> 64KB line length. * -     * \retval -1  an error occurred.
> request_parse_status indicates HTTP status result. -     * \retval
> 1  successful parse. member fields contain the request-line items -
> * \retval  0  more data is needed to complete the parse +     *
> \return A pointer to a field-value of the first matching
> field-name, or NULL.

In what sense does the Parser limit the output? It is documented to
return a pointer to the field-value. Are we truncating that field-value?

* Http1::Parser class has virtual methods but is missing a virtual
destructor.

> +    /// raw copy of the origina client reqeust-line URI field

"origina" typo

> +    Parser() : parsingStage_(HTTP_PARSE_NONE) {}

> +    RequestParser() : Parser() {clear();} +    virtual void
> clear();

> +Http::One::Parser::clear() { -    clear(); // empty the state. -
> state = HTTP_PARSE_NEW; -    buf = aBuf; -    bufsiz = len; -
> debugs(74, 5, HERE << "Request buffer is " << buf); +
> parsingStage_ = HTTP_PARSE_NONE; +    buf_ = NULL; +
> msgProtocol_ = AnyP::ProtocolVersion(); +
> mimeHeaderBlock_.clear(); }

> +void +Http::One::RequestParser::clear() +{ +
> Http1::Parser::clear(); + +    request_parse_status =
> Http::scNone; +    req.start = req.end = -1; +    req.m_start =
> req.m_end = -1; +    req.u_start = req.u_end = -1; +    req.v_start
> = req.v_end = -1; +    method_ = HttpRequestMethod(); +}

The above is OK. If you want to polish it further to avoid calling
virtual methods from constructors (which is bad style), make these
changes:

1. Initialize all members in each constructor as needed,
   without calling clear().

2. Make Parser::clear() pure, without a body.

3. In RequestParser::clear(), just do:

   *this = RequestParser();

For the proposed commit message, please add an brief explanation for
"why we are doing this", in addition to the already provided "what we
are doing" summary.

And please make sure the commit message explicitly says that we are
adding a few potential performance regressions with this change.

Thank you,

Alex.

>> To recap, this project so far:
> 
> 
>> Rename the HttpParser class as Http1::RequestParser and move it 
>> into the Http::One:: namespace. A base Http1::Parser class is
>> also added from which Http1::ReplyParser will also be created in
>> future work.
> 
>> The parser API - is updated to process both the request-line and 
>> HTTP mime headers, returning a incomplete parse result until the 
>> entire header portion of the message has been received. - now 
>> contains accessor methods for retrieving the method, URI,
>> protocol, mime headers block as separate SBuf and some metrics
>> about those. - the old request_offsets structure and similar
>> offset details internal to the parsing are no longer exposed.
> 
>> The parser is now incremental. A parser object must be created
>> (or clean() method called) for each new message. Code using the
>> parser should pass their I/O buffer to the parse(SBuf&) method
>> and retrieve an updated copy of it afterwards via the
>> remaining() method.
> 
>> Much of the code from client_side.cc parseHttpRequest() and also 
>> the header-field code from mime_headers.cc has been moved into
>> the parser classes. The client-side code now simply runs the
>> main parse() method then uses its accessors to retrieve and
>> process the parse results if it returns success.
> 
>> The getHeaderField() method imported from mime_headers.cc has
>> been converted to simpler logics based on Tokenizer class and
>> resolving a number of bugs in its matching output.
> 
>> A unit test for incremental parsing has been added to 
>> testHttp1Parser.
> 
>> The HttpRequestMethod class is moved into the Http namespace and 
>> library to reduce dependencies on the parser class outside the 
>> library.
> 
>> The HttpHeader API has been tweaked slighty to accept
>> start+length pair instead of start/end pointers.
> 
>> Amos
> 
> 
> _______________________________________________ squid-dev mailing
> list squid-dev at lists.squid-cache.org 
> http://lists.squid-cache.org/listinfo/squid-dev
>