[squid-dev] Incremental parsing of chunked quoted extensions

Fri Oct 5 15:34:52 UTC 2018

On 10/04/2018 12:30 PM, Eduard Bagdasaryan wrote:
> Hello all,
> 
> There is a bug in Squid during incremental parsing of quoted chunked
> extensions, resulting in unexpected throwing in
> One::Parser::skipLineTerminator(). The underlying problem comes from
> the fact that Http::One::Tokenizer::qdText() cannot parse
> incrementally because it cannot distinguish "an invalid input" from
> "need more data" case.
> 
> I see two approaches for fixing this problem:
> 
> * Low-level.
>   Supply Parser::Tokenizer and its children with incremental parsing
>   functionality. It looks like this would require much non-trivial work.
> 
> * High-level.
>   Avoid incremental parsing of chunked extensions and parse them all
>   at once. 

> Is the incremental parsing of chunked extensions all-important here?

IMO, it is not important: The vast majority of chunks have no extensions
at all, and the vast majority of chunks with extensions have small
extensions. Thus, the performance improvement from incremental parsing
would be an extremely rare event. Such improvements are never "all
important" (and are often not worth their complexity).

Parsing of chunked extensions does not need to be incremental. However,
parsing of chunked encoding as a whole has to be incremental --
reparsing data chunks _is_ expensive, and we should not wait for the end
of the message body (or even for the end of the chunk) even if reparsing
were cheap. Thus, TeChunkedParser has to be incremental.

Whether the incremental pieces of TeChunkedParser should be used when
parsing chunk extensions is a separate question. I will partially answer
it further below.

>   In other words, we need to adjust the Tokenizer caller
>   to buffer the whole chunk extension line before passing it to the
>   Tokenizer in order to avoid "need more data" case, causing that bug.

I doubt that writing code to explicitly buffer the whole line before
parsing extensions is necessary. It is definitely not desirable because
it slows things down in _common_ cases. I would adjust the code to
follow this sketch instead:

    while (!atEof()) {
        if (skipLineTerminator())
            return true; // reached the end of extensions (if any)
        if (parseChunkExtension())
            continue; // got one extension; there may be more
        if (tok.skipAll(NotLF) && skip(LF))
            throw "cannot parse chunk extension"; // <garbage> CR*LF
        return false; // need more data to finish parsing extension
    }
    return false; // need data to start parsing extension or CRLF

where skipLineTerminator() is, essentially,

    return tok.skip(CRLF)) || (relaxed_parser && tok.skipOne(LF));

As you can see, the above sketch is optimized for the common cases and
blindly seeks to the end of the line only in rare cases of partially
received or malformed extensions.

Please note that whether the above sketch is used incrementally is an
_orthogonal_ question:

* In incremental parsing, a successful parseChunkExtension() would
update object state to remember the parsed extension and advance the
parsing position to never come back to the same spot. The loop caller
would advance the parsing position/state if (and only if) the loop
returns true.

* In non-incremental parsing, a successful parseChunkExtension() would
update its parameter (not shown) to remember the parsed extension. The
loop caller would update object state ("committing" that not shown
parameter info) and advance the parsing position if (and only if) the
loop returns true. [ Alternatively, a successful parseChunkExtension()
would update object state directly, sometimes overwriting the previous
(identical) value many times. The current useOriginBody code uses that
simpler scheme. However, even with this simpler alternative, the
advancement is controlled by the loop caller. ]

Should we fix and continue to use incremental parsing here or abandon
it? The answer probably depends on overall code simplicity. Since we
have to use incremental parsing for parsing chunked encoding as a whole
(as discussed in the beginning of this email), we may continue to use it
for parsing extensions if (and only if) doing so does not complicate things.

HTH,

Alex.