[squid-dev] [PATCH] HTTP Response Parser upgrade

Amos Jeffries squid3 at treenet.co.nz
Fri Jan 23 08:00:20 UTC 2015



On 23/01/2015 6:47 a.m., Alex Rousskov wrote:
> On 01/20/2015 06:58 AM, Amos Jeffries wrote:
> 
>> Updated patch attached resolving the issues found by coadvisor in
>> the original.
> 
> 
>> +size_t
>> +Client::needBufferSpace(const SBuf &readBuf, const size_t minSpace) const
>> +{
>> +    size_t space = readBuf.spaceSize(); // available space w/o heroic measures
> 
> If the above line is the only readBuf usage in this method, pass 
> readBuf.spaceSize() as a parameter instead of the whole buffer.
> 

Done.

> 
>> +    if (space < minSpace) {
>> +        const size_t maxSpace = SBuf::maxSize; // absolute best
>> +        space = min(minSpace, maxSpace); // do not promise more than asked
>> +    }
> 
> Here and elsewhere in Client::needBufferSpace(), the returned space
> size may be lower than minSpace. s/minSpace/wantedSpace/ or
> something like that.
> 
>> +         * If the pipe is totally full, don't register the read handler.
>> +         * The BodyPipe will call our noteMoreBodySpaceAvailable() method
>> +         * when it has free space again.
> 
> The above comment does not match the code. The code does not deal
> with read handler registration, only size calculations.
> 

Removed that paragraph.

> 
>> +        size_t adaptation_space =
>> +            virginBodyDestination->buf().potentialSpaceSize();
> 
> Make const if possible.

Done.

> 
> 
>> +    /// determine how much space the buffer needs to reserve
>> +    size_t needBufferSpace(const SBuf &readBuf, const size_t minSpace) const;
> 
> need+something implies a notification or state changing method,
> which is not what this method is about. Based on the caller, I
> suggest to
> 
> s/needBufferSpace/calcReadSize/ or s/needBufferSpace/readSize/
> 

While the current caller is only in http.cc, this is intended for use
by all Client child classes in their own ways.

I'm going with calcBufferSpaceToReserve() for now.
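For illustration, a minimal sketch of the helper with both review points applied: it takes the available space as a plain size rather than the whole buffer, and uses the new name. The body and the kMaxSpace constant are simplified stand-ins, not Squid's real SBuf::maxSize logic:

```cpp
#include <algorithm>
#include <cstddef>

static const size_t kMaxSpace = 64 * 1024; // illustrative stand-in for SBuf::maxSize

// Rough sketch only: grow toward the wanted size, but never promise
// more than the caller asked for or more than a buffer can hold.
size_t calcBufferSpaceToReserve(size_t availSpace, const size_t wantedSpace)
{
    size_t space = availSpace; // available space w/o heroic measures
    if (space < wantedSpace)
        space = std::min(wantedSpace, kMaxSpace); // do not promise more than asked
    return space;
}
```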

> 
>> void HttpStateData::readReply(const CommIoCbParams &io) {
> ...
>> +    assert(!flags.do_next_read); // XXX: should have been set false by mayReadVirginBody()
> ...
>> +    assert(Comm::IsConnOpen(serverConnection));
>> +    assert(io.conn->fd == serverConnection->fd);
> 
> These are all new asserts in an async job callback. Should we use 
> Must()s instead to avoid crashing the whole Squid process when
> dealing with what seems to be a transaction-specific problem?
> 

Right. Fixed.
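As a rough illustration of the difference (the real Must() belongs to Squid's async-job framework; this hypothetical stand-in only shows the throw-instead-of-abort behaviour that lets one transaction fail without killing the process):

```cpp
#include <stdexcept>

// Hypothetical stand-in for Squid's Must(): a failed check throws an
// exception, which the async-job machinery can catch and turn into an
// aborted transaction, instead of assert()'s process-wide abort().
inline void MustBe(bool condition, const char *description)
{
    if (!condition)
        throw std::runtime_error(description);
}
```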

> 
>> +    // how much we want to read
>> +    const int read_size = needBufferSpace(inBuf, (limitBuffer - inBuf.length()));
> 
> Unsigned->signed conversion. Can you use size_t for read_size?
> 
>> +    if (read_size < 1) {
> 
> ... and then convert this to !read_size or (read_size == 0) for
> clarity?
> 
> 
>> +    /// amount of message payload/body received so far.
>> +    int64_t payloadSeen;
>> +    /// positive when we read more than we wanted
>> +    int64_t payloadTruncated;
> 
> Have you considered making these unsigned?
> 

I did. We end up with a great deal of casting back to signed values
for use in arithmetic and if-conditions when these members interact
with adapted, store, and memory object members. Ideally all those
other parts should be fixed, but that is outside the scope of this
patch.
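A small illustration of the hazard (hypothetical names, not Squid code): subtraction involving an unsigned counter wraps around instead of going negative, so range checks silently misbehave unless everything is cast back to signed first:

```cpp
#include <cstdint>

// With an unsigned counter, 5 - 10 wraps to a huge positive value,
// so this wrongly reports "we read too much".
inline bool looksTruncatedUnsigned(uint64_t payloadSeen, int64_t contentLength)
{
    return (payloadSeen - contentLength) > 0;
}

// With signed members, ordinary arithmetic applies and can go negative.
inline int64_t remainingSigned(int64_t payloadSeen, int64_t contentLength)
{
    return contentLength - payloadSeen;
}
```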


> 
>> +    /// parse scan to find the mime headers block for current message
>> +    bool findMimeBlock(const char *which, size_t limit);
> 
> Find is usually a const method. This method does a lot more than
> just finding MIME headers. To fix that and to supply information
> that the method name does not already reveal, I would do:
> 
> /// returns whether the header was successfully parsed
> bool parseMimeHeader(const char *headerKind, const size_t sizeLimit);
> 

I've adjusted the documentation to say what state changes happen on true.


I do object to the use of "header" other than as a modifier to the
named mime block type like I had it. This function does nothing
involving the "header" level of syntax/tokens in its current incarnation.

It finds a MIME block/segment terminator up to a limited distance into
the I/O buffer, and extracts the preceding octets into an SBuf for
*later* code to parse.

One day, when the headers actually are parsed here, I do intend to
rename it. But until then it would be false and slightly confusing
documentation to imply a full parse like the other methods in this class.

NP: I would be using Tokenizer for this instead of headersEnd(),
except that Tokenizer has no way to search for a multi-byte
delimiter/magic yet, and merging it with headersEnd() is just as
premature as making this method actually parse the header lines.

Would you accept "identify" instead of "find"? Or maybe "grab" is better?


NP: just realized the new form was also dropping ICY mime header
blocks. Fixed that too.
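To make the find-versus-parse distinction concrete, here is a simplified, hypothetical stand-in for the role headersEnd() plays here: locate the blank-line terminator and report the block length, leaving all header-level parsing for later (the real function also copes with other terminator variants):

```cpp
#include <string>
#include <cstddef>

// Hypothetical stand-in for headersEnd(): find the CRLFCRLF that terminates
// a MIME header block and return the number of bytes up to and including it,
// or 0 when the terminator has not arrived yet. No header parsing happens.
inline size_t mimeBlockEnd(const std::string &buf)
{
    const std::string::size_type pos = buf.find("\r\n\r\n");
    return pos == std::string::npos ? 0 : pos + 4;
}
```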


> Note that the limit parameter is const.
> 

Fixed.

> 
>> +        int64_t mimeHeaderBytes = 0;
>> +        // XXX: c_str() reallocates. performance regression.
>> +        if ((mimeHeaderBytes = headersEnd(buf_.c_str(), buf_.length())) == 0) {
>> +            if (buf_.length()+firstLineSize() >= limit) {
>> +                debugs(33, 5, "Too large " << which);
>> +                parseStatusCode = Http::scHeaderTooLarge;
>> +                parsingStage_ = HTTP_PARSE_DONE;
>> +            } else
>> +                debugs(33, 5, "Incomplete " << which << ", waiting for end of headers");
>> +            return false;
>> +        }
>> +        mimeHeaderBlock_ = buf_.consume(mimeHeaderBytes);
>> +        debugs(74, 5, "mime header (0-" << mimeHeaderBytes << ") {" << mimeHeaderBlock_ << "}");
> 
> 
> The above is better written as
> 
> // XXX: c_str() reallocates. performance regression.
> if (const int64_t mimeHeaderBytes = ...) {
>     mimeHeaderBlock_ = buf_.consume(mimeHeaderBytes);
>     debugs(...);
> } else {
>     ...
>     return false;
> }
> 
> but there is more that needs to be done in the above code:
> 
>> +    // Squid could handle these headers, but admin does not want to
>> +    if (messageHeaderSize() >= limit) {
>> +        debugs(33, 5, "Too large " << which);
>> +        parseStatusCode = Http::scHeaderTooLarge;
>> +        return false;
>> +    }
> 
> This should be rewritten to go before we initialize
> mimeHeaderBlock_ IMO. Otherwise, the damage of keeping the
> admin-prohibited header is already done.
> 

Done.

NP: This will prevent the ERR_TOO_BIG page and the cache.log level
11,2 trace from displaying the mime block as part of the message when
it is present but outside the desired size range.
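Sketched with illustrative stand-ins (not Squid's real types), the reordered logic rejects an over-limit block before anything is consumed, so the admin-prohibited bytes are never kept:

```cpp
#include <string>
#include <cstddef>

// Hypothetical mini-parser: check the configured limit first, and only
// consume the bytes into mimeHeaderBlock_ once the block is acceptable.
struct MiniParser {
    std::string buf_;
    std::string mimeHeaderBlock_;
    bool tooLarge = false;

    bool grabMimeBlock(size_t mimeHeaderBytes, size_t limit) {
        if (mimeHeaderBytes >= limit) {   // reject first ...
            tooLarge = true;
            return false;                 // ... nothing was kept
        }
        mimeHeaderBlock_ = buf_.substr(0, mimeHeaderBytes); // ... consume later
        buf_.erase(0, mimeHeaderBytes);
        return true;
    }
};
```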



>> +const int
>> +Http::One::ResponseParser::parseResponseStatusAndReason()
>> +{
>> +    if (buf_.isEmpty())
>> +        return 0;
>> +
> 
> Remove this check (and fix the tokenizer if it cannot handle empty
> buffers).
> 

Done.

>> +        // NP: search space is >3 to get terminator character)
>> +        if(!tok.prefix(status, CharacterSet::DIGIT, 4))
>> +            return -1; // invalid status
>> +        // NOTE: multiple SP or non-SP bytes between version and status code are invalid.
>> +        if (tok.atEnd())
>> +            return 0; // need more to be sure we have it all
>> +        if(!tok.skip(' '))
>> +            return -1; // invalid status, a single SP terminator required
>> +
>> +        debugs(74, 6, "found string status-code=" << status);
>> +
>> +        // get the actual numeric value of the 0-3 digits we found
>> +        ::Parser::Tokenizer t2(status);
>> +        int64_t statusValue;
>> +        if (!t2.int64(statusValue))
>> +            return -1; // ouch. digits not forming a valid number?
>> +        debugs(74, 6, "found int64 status-code=" << statusValue);
>> +        if (statusValue < 0 || statusValue > 999)
>> +            return -1; // ouch. digits not within valid status code range.
> 
> This sequence does not make sense to me: Incomplete status should
> be treated the same as early atEnd(), but the former is treated as
> an invalid status. 4-digit status should be considered as invalid
> but it is not. Numeric value of 0 digits?! Etc., etc. It feels like
> you are still parsing using low-level C function calls, not a
> Tokenizer that we worked so hard to build!
> 

Yesterday, while working on the request-line parser, I found a bug in
Tokenizer::prefix where it was returning the entire buffer to the
caller if the match happened to be the same size as the limit.

It seems I was working around that with the limit of 4, though I
didn't realize it at the time (no status-line unit tests :-( ).


> The correct sequence should be similar to these lines:
> 
> if (tok.prefix(status, CharacterSet::DIGIT, 3) && tok.skip(' '))
>     debugs(74, 6, "found string status-code=" << status);
> else if (tok.atEnd())
>     return 0; // good so far but need more to finish
> else
>     return -1; // invalid status
> 

Done, and ...


> However, if Tokenizer integer parsing interface were written
> correctly, the above would have actually looked close to this:
> 

<snip>
> 
> uint64_t status;
> if (tok.uint64(status, 3, base10) && tok.skip(' '))
>     debugs(74, 6, "found status-code=" << status);
> else if (tok.atEnd())
>     return 0; // good so far but need more to finish
> else
>     return -1; // invalid status
> 

... using this style.


Added a limit as rightmost parameter to tok.int64 method.

I also added a boolean to make it reject +/- symbols.

> 
> The last form (with a size limit for uint64) allows the Tokenizer
> caller to terminate parsing earlier, without waiting for the end of
> all-digit input, but I am not sure that is a common need.
> 
> I urge you to fix Tokenizer integer parsing and rewrite the parsing
> code.

Done.
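The converged behaviour can be sketched without the real Tokenizer like this (a hand-rolled, hypothetical stand-in for the bounded digit-prefix plus SP-terminator sequence; return values follow the thread's 0 / -1 / found convention):

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Parse a status-code from the start of `in`: at most 3 digits followed by
// a single SP. Returns 1 on success (status filled in), 0 when more input
// is needed to decide, -1 on invalid input.
inline int parseStatusCode(const std::string &in, int64_t &status)
{
    size_t i = 0;
    status = 0;
    while (i < in.size() && i < 3 && isdigit(static_cast<unsigned char>(in[i]))) {
        status = status * 10 + (in[i] - '0');
        ++i;
    }
    if (i == in.size())
        return 0;  // good so far, but need more bytes to be sure
    if (i == 0 || in[i] != ' ')
        return -1; // invalid: 4+ digits, no digits, or wrong terminator
    return 1;      // found a complete 1-3 digit status code
}
```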

> 
> That is all I had time for today. I would have to come back to
> this later. Sorry.
> 

Thank you very much.

> 
>> * This has now had several hours of regular web browsing by
>> myself (YouTube, Facebook, some news sites, and the Squid sites)
>> with nothing noticeably going wrong.
> 
> Any ICAP tests? The HTTP-ICAP interaction has been a source of many
> bugs before and your patch changes the relevant code. If you have
> not tested it, please do (or ask others to do it).

No, the ICAP-related code is not touched. It still uses the HttpMsg
parser functions and its own original buffering. These changes only
touch the server socket (virgin reply) buffer I/O and parse function.

I was very careful about checking for that separation before going
beyond the I/O buffer conversion stage.

> 
>> * Polygraph shows a 7 RPS performance loss (0.36%), with trunk
>> having +/-2 RPS (0.1%) normal fluctuation. I believe this is from
>> the transitional regression to data-copy into MemBuf for chunked
>> coding.
> 
> Did Polygraph actually use chunked responses in these tests?

Unsure, but I think so.

Amos

