[squid-dev] [PATCH] Parser-NG conversion of ICAP pt2

Wed Oct 21 22:25:24 UTC 2015

On 08/23/2015 02:40 PM, Amos Jeffries wrote:
> It turns out that ICAP implements has three distinct protocol parsers.
> 
> I begin the ICAP parser conversion to the Parser-NG model with
> ModXact::parseHeaders() - which was conflating both ICAP and HTTP, and
> the HTTP directional parsers.
> 
> 
> * splits the exiting parse method into 3 distinct stages; ICAP-reply,
> HTTP-request, HTTP-reply. Each stage is sequential and controlled by the
> Encapsulated header contents.
> 
> I'm not sure yet if we need to be tolerant of out-of-order segments in
> the payload. The spec is pretty clear that order is explicit and
> specific. 

AFAIK, we do not need to be tolerant to out-of-order segments -- there
are no known cases of those. We might want to be tolerant to unexpected
segments (bug 2480).

> But the old parser actually ignored the Encapsulated header
> byte offsets (!!).

The new parsers should continue to ignore them IMO. Those offsets are an
ICAP mis-feature -- a duplication of information that, as most
duplicates, leads to bugs. Ignoring them makes Squid ICAP code more
robust and probably decreases compatibility problems without [known] bad
side effects. We should parse the headers declared by Encapsulated.
Ignore numerical offset values.

> +     * ICAP header parse does not share the HTTP segment code, and
> +     * Encapsulated: tells us how many bytes and where each payload segment is.
> +     * We can pull N bytes into a child SBuf for parsing.
> +     *
> +     * 1) if there are not enough bytes we need more before even attempting the parse
> +     *
> +     * 2) after parse we can verify that it consumed all of the child buf.
> +     *    if there are leftovers ... smuggling attack from the ICAP server?
> +     */

This part of the TODO comment should be removed IMO. See my comments
above for the rationale.

> +    /* TODO: we do not need to rely on readBuf anymore for the parser logic.

The comment seems to be misplaced which makes it unclear. The
surrounding code does not rely on readBuf parser logic, whatever that
is. Please reword it in a context-specific way or move it to a more
appropriate place.

> +    /* Attempt to parse the ICAP message */

s/parse the ICAP message/isolate the ICAP response header/

(if that is what you are doing here; note that there is a "parse
headers" comment further below so you are not parsing headers here)

> +        bool parsedOk = hp->parse(readBuf);

If you can make it constant, make it const.

> +        // sync the buffers after parsing.
> +        readBuf = hp->remaining();

Too early? The "unrecoverable parsing error" code below should show what
we failed to parse, not what remained after we failed to parse.

> +    /* We know the whole response is in parser now */

s/response/response header/

> +    // XXX: performance regression. SBuf::c_str() reallocates
> +    SBuf tmpPhrase(hp->reasonPhrase());
> +    icapReply->sline.set(hp->messageProtocol(), hp->messageStatus(), tmpPhrase.c_str());

Do we need to introduce a regression here? How about using a few
hard-coded reasons for supported codes and a catch-all for all others?

Many of the above comments apply to the other parsing methods you have
created: parseHttpRequestHead and parseHttpRequestHead. Please check
those as well.

> +        // const_cast is okay, the buffer area behind the c_str will not be used again by this xaction
> +        // and that will only change when urlParse() starts taking the requestUri() SBuf directly

This violates, and I quote, "DO NOT EVER USE THE RETURNED POINTER FOR
WRITING" c_str() API. Explaining why that API violation may be safe
today is better than nothing, but still deserves a big fat XXX.

Even if the above assumption is accurate today, a code change in a
seemingly unrelated place may make it false, leading to hard-to-find
bugs. All it takes is for some code to copy the underlying SBuf
somewhere... If you want to improve this, please contact Christos: IIRC,
we have seen a similar problem elsewhere in the code and ended up adding
a new SBuf method to combat it. Christos should have the details.

> +    state.parsing = State::psHttpRequestHeader; // 'reqhdr' segment maybe first

To clarify, I would replace that comment with something like

// some buggy ICAP servers send HTTP request headers even when
// we expect only HTTP response headers (e.g., during RESPMOD)
state.parsing = State::psHttpRequestHeader

Alternatively, you can set state.parsing based on the Encapsulated
header. This will be more complex but a bit faster because you will
avoid calling parseHttpRequestHead() for 99.99999999% of RESPMOD
responses. That method has a non-trivial initial overhead.

> +        // parse headers
> +        adapted.header->pstate = psReadyToParseHeaders;
> +        Must(adapted.header->httpMsgParseStep(httpReqParser->mimeHeader().rawContent(), httpReqParser->mimeHeader().length(), true) >= 0);
> +
> +        setOutcome(xoModified);

If I am reading the patched code correctly, this is wrong for HTTP
request headers in a RESPMOD response. Those headers must be ignored,
not used as adapted message headers.

> +        // XXX: replyHttpHeaderSize used to account how many ICAP body bytes are processed
> +        // but ICAP can present two header segments and only the second gets counted.
> +        // Worse; the second in that case is HTTP reply message and smaller of the two.

Similarly, there appears to be some confusion regarding the number of
HTTP headers in the ICAP response. A valid response has either zero or
one HTTP header. If we decide to accept buggy ICAP responses that
include both HTTP request and HTTP response headers, then we must
*discard* the HTTP request header.

Two HTTP headers in an ICAP response is not an XXX (our bug), but an
ICAP server bug. If we decide to be tolerant, we need to process it
correctly, but if it is too complex to do that correctly, we should not
be tolerant in this case -- AFAICT, there are no popular (or even
non-obscure) ICAP servers that send two HTTP message headers in ICAP
responses.

Referring to bug 2480 in some of the source code comments might be
helpful: http://bugs.squid-cache.org/show_bug.cgi?id=2480

> +    Icap::ResponseParserPointer hp;  ///< parser for current ICAP protocol message (if any)

AFAICT, this is an attempt at incremental parsing for ICAP response
headers. I recommend avoiding that optimization because ICAP response
headers are small and we have spent enough time fixing incremental
parsing bugs already. Make that hp parser local to parseIcapHead().

> * adds "ICAP" / "icap" to the registered protocol types and scheme
> names, and associated Icap::ProtocolVersion() infrastructure.
> 
> * adds Adaptation::Icap::ResponseParser class extending
> Http1::ResponseParser with ICAP related details and first-line parser.

I do not think we should do that. The HTTP and ICAP syntax are
essentially the same. There is no good reason to duplicate this code. If
you need to parametrize "HTTP" vs. "ICAP" tokens, either add a data
member or, better, allow for both in the *parser* and let the caller
check the parsed protocol. IIRC, we already allow ICY for HTTP and this
approach may help remove that exception from the parser.

> +    // NOTE: RFC 3507 does not define extended whitespace characters to be tolerated
> +    //       like RFC 7230. Being strict leads to fewer problems in the long term.

RFC 3507 lets HTTP/1 define general status line and header syntax. Thus,
it essentially allows everything that HTTP/1 allows.

The lack of an explicit reference to some HTTP syntax concept in RFC
3507 should not be misinterpreted as an implication that HTTP-allowed
things are prohibited in ICAP. We have wasted weeks on this during HTTP
parser "upgrade" which is still unfinished. Let's not repeat that here,
please. Reuse the HTTP parsing code for ICAP as much as you can.

> There is some weird race behaviour I still want to verify if trunk has
> too. But have gone with PATCH instead of PREVIEW since this seems like a
> good place to pause. Leaving most polish and some major bug fixes to
> followups. That includes the other two parsers in adaptation/icap/.

Similarly, my initial review focuses on just a few big issues. More
review rounds will be necessary.

Please note that you are not simply replacing old hand-written parsing
code with code that uses currently available parsers. You are also
changing the overall ICAP logic surrounding that code. I doubt that was
a good idea, but it is probably too late to go back to a simpler
replacement plan.

Thank you,

Alex.