[squid-dev] [PATCH] mime unfolding

Thu May 12 11:55:49 UTC 2016

On 12/05/2016 5:34 a.m., Alex Rousskov wrote:
> On 05/11/2016 05:32 AM, Amos Jeffries wrote:
> 
>> +    if (mimeHeaderBlock_.length() < 2) {
>> +        if (mimeHeaderBlock_[0] == '\n')
>> +            mimeHeaderBlock_ = Http1::CrLf();
>> +        else
>> +            mimeHeaderBlock_.append(Http1::CrLf());
>> +    }
>> +    // now mimeHeaderBlock_ has 0+ fields followed by the CRLF terminator
> 
> 
> mimeHeaderBlock_[0] does not exist when mimeHeaderBlock_ is empty (the
> append statement implies that the block may be empty). The comment lies
> about CRLF terminator because CR may be still missing in longer blocks.
> Moreover, I suspect you do not need this code at all if the loop below
> is fixed!
> 

Oaky. That can be fixed with a simple isEmpty() check.

> 
>> +    do {
>> +        SBuf line;
>> +        if (tok.prefix(line, nonCRLF))
>> +            mimeHeaderBlock_.append(line);
>> +
>> +        // the skipAll() might erase bare-CR not part of a CR*LF
>> +        SBuf savePoint(tok.remaining());
>> +        auto crLen = tok.skipAll(CharacterSet::CR);
>> +
>> +        if (tok.skipOne(CharacterSet::LF)) {
>> +            if (tok.skipAll(CharacterSet::WSP))
>> +                mimeHeaderBlock_.append(' ');
>> +            else
>> +                mimeHeaderBlock_.append(Http1::CrLf());
>> +
>> +        } else if (crLen) {
>> +            // TODO: it might be better to replace the CR with one SP, but for now preserve
>> +            // preserve the bare-CR
>> +            tok.reset(savePoint);
>> +            if (tok.prefix(line, CharacterSet::CR))
>> +                mimeHeaderBlock_.append(line);
>> +        }
>> +
>> +    } while (!tok.atEnd());
> 
> 
> The above loop requires a genius to fully comprehend.

Why thank you :-) .

Though it does seem very simple, perhapse the minor crLen optimization
is confusing.

> I certainly cannot
> validate it. I think the whole method ought to be much simpler/clearer.
> Something along these lines:
> 
>     Tokenizer tok(mimeHeaderBlock_);
>     mimeHeaderBlock_.clear();
>     SBuf prefix;
>     while (tokenBeforeObsfolds(tk, prefix)) {
>         mimeHeaderBlock_.append(prefix);
>         mimeHeaderBlock_.append(' ');
>     }
>     mimeHeaderBlock_.append(tk.remaining());
> 
> We already have a Tokenizer::token(prefix, delimiters) method. The
> tokenBeforeObsfolds() call above is exactly the same except the

If you think their behaviour is exactly the same, then I think you are
misunderstanding the purposes of at least one. But I wont draw out this
thread with that.

> delimiters pattern is more complex. See attachment for a simple
> implementation and review/adjust as needed, of course. I have not tested
> my sketch.

Besides the 8 compile errors it has several behavioural bugs though I
see you did workaround one of them by adding an extra Tokenizer pass
across the data, which adds another bug.

1) not unfolding.

Consider the trivial case of:

 Foo: bar\r\n
 Folded: hello \r\n
  world\r\n
 \r\n

tokenBeforeObsfolds() will detect a prefix of "Foo: bar\r" and a
non-fold LF.
 causing it to return false.
 which exits the while loop in unfoldMime().
 which appends tok.remaining() to the mime block.

The common traffic case with folding will be to have some header like
Date or Host first with the fold in a later custom header.

Result: no unfolding.

Assuming you can fix that without a lot more complexity..

2) erases characters not part of any folding.
 ... worse: it erases delimiters.

the use of nonLF has consumed into prefix some characters that might
have been part of the fold sequence.
 skipAllTrailing() is used to workaround that bug.
 which consumes again, more characters than the ones that might be part
of the fold.
 resulting in a header value of "helloworld" which should have been
"hello world".

Whether that was semantically or lexically meant to be one or two tokens
in the header we don't know. But either way it is one token now.

Lets hope that header was not using any of the common delimiter
characters ... um .. as a delimiter.

Result: HTTP semantics are potentially changed for headers both known
and unknown.

The purpose of this unfolder is to unfold. End of task. Not to cleanup
the headers. Nor to impose any "garbage" meaning on the non-fold
characters. That job belongs to the parser logic which does lexical
interpretation of the header field-value.

This logic *cannot* tell the meaning of non-fold characters. This was
why my single loop went out of its way to preserve discovered sequences
of bare-CR, despite how damaging we know that character can be at times.

As a side note. I'm not sure we should treat a whole sequence of folds
as a single SP. We are allowed to replace 1 fold with 1+ SP characters.
But nothing was discussed in the WG about N folds being replaced with
just 1 SP character. So we can't really assume some other
implementations wont interpret each fold as a SP with meaning. It would
be odd, but then anything playing around with sequences of folds is
already odd.

Also note there are whitespace characters which are not part of the fold
sequence. So even though we can remove CR*LF 1*(SP/HTAB) - we cant
safely remove the other 'garbage' whitespace characters as part of the
fold whitespace suffix any more than my demo above can safely remove it
before the CR.

>> +const SBuf &CrLf();
> 
> I suspect this should be defined inline for performance reasons.

If we do that we have again the issue you brought up earlier about
multiple globals init order, which is why it is not just a member of
Http1::Parser to begin with.

Linkers these days with -flto are smart enough to optimize it away if
the user wants to.

> Inlining may result in duplication of a few CRLF SBufs, but that is
> probably better than the expense of calling a function every time we
> need CRLF. I expect the number of use cases to grow as we polish the code.
> 

Then we can look at that later if/when the usage grows. Right now it
doesn't really have a proper need to be outside Parser.

> 
>> But the savings is not enough to be bothered with AFAICS. 
> 
> There are no performance savings or losses here to worry about. The code
> in question is "fast enough" for the non-performance-sensitive path it
> is on because containsObsFold ought to be false in the vast majority of
> cases. This code should be optimized for quality, not speed.

For regular traffic that is right.

But DoS attack conditions are also something we have to consider as a
high priority in these early parsing stages. So we need to consider
speed even if we technically call it a "slow path".
My earlier option #2 loop was about 1% faster than the Tokenizer
do-while loop.

I was going to run your proposal through my ab tests before replying to
see what its speed was like. But bug #1 above means it short-circuits
the test cases.

Amos