[squid-dev] [PATCH] mime unfolding

Thu May 12 15:16:46 UTC 2016

On 05/12/2016 05:55 AM, Amos Jeffries wrote:
> On 12/05/2016 5:34 a.m., Alex Rousskov wrote:
>> On 05/11/2016 05:32 AM, Amos Jeffries wrote:
>>> +    do {
>>> +        SBuf line;
>>> +        if (tok.prefix(line, nonCRLF))
>>> +            mimeHeaderBlock_.append(line);
>>> +
>>> +        // the skipAll() might erase bare-CR not part of a CR*LF
>>> +        SBuf savePoint(tok.remaining());
>>> +        auto crLen = tok.skipAll(CharacterSet::CR);
>>> +
>>> +        if (tok.skipOne(CharacterSet::LF)) {
>>> +            if (tok.skipAll(CharacterSet::WSP))
>>> +                mimeHeaderBlock_.append(' ');
>>> +            else
>>> +                mimeHeaderBlock_.append(Http1::CrLf());
>>> +
>>> +        } else if (crLen) {
>>> +            // TODO: it might be better to replace the CR with one SP, but for now preserve
>>> +            // preserve the bare-CR
>>> +            tok.reset(savePoint);
>>> +            if (tok.prefix(line, CharacterSet::CR))
>>> +                mimeHeaderBlock_.append(line);
>>> +        }
>>> +
>>> +    } while (!tok.atEnd());

>> I certainly cannot
>> validate it. I think the whole method ought to be much simpler/clearer.
>> Something along these lines:
>>
>>     Tokenizer tok(mimeHeaderBlock_);
>>     mimeHeaderBlock_.clear();
>>     SBuf prefix;
>>     while (tokenBeforeObsfolds(tk, prefix)) {
>>         mimeHeaderBlock_.append(prefix);
>>         mimeHeaderBlock_.append(' ');
>>     }
>>     mimeHeaderBlock_.append(tk.remaining());
>>
>> We already have a Tokenizer::token(prefix, delimiters) method. The
>> tokenBeforeObsfolds() call above is exactly the same except the

> If you think their behaviour is exactly the same, 

I think the concept is exactly the same.

> then I think you are
> misunderstanding the purposes of at least one. But I wont draw out this
> thread with that.

Ditto.

> Besides the 8 compile errors

As I said, it is a sketch. I cannot sketch non-trivial code without
compile errors. I am surprised you are wasting your time trying to rub
this in.

> it has several behavioural bugs though I
> see you did workaround one of them by adding an extra Tokenizer pass
> across the data, which adds another bug.
> 
> 1) not unfolding.
> 
> Consider the trivial case of:
> 
>  Foo: bar\r\n
>  Folded: hello \r\n
>   world\r\n
>  \r\n
> 
> tokenBeforeObsfolds() will detect a prefix of "Foo: bar\r" 

tokenBeforeObsfolds() should detect a prefix of "Foo: bar\r\nFolded:
hello". The delimiter is obs-fold, not CRLF. That is the whole idea
behind this approach! If the sketched tokenBeforeObsfolds() does not do
it, then it needs to be adjusted so that it does. I am trying to steer
your towards a reasonable solution, not write the code for you.

> 2) erases characters not part of any folding.
>  ... worse: it erases delimiters.
> 
> the use of nonLF has consumed into prefix some characters that might
> have been part of the fold sequence.

The folding sequence essentially starts with LF. Thus, there cannot be
"characters that might have been part of the fold sequence" in nonLF.
The only exception is CR before LF, which the sketch takes care of.

>  skipAllTrailing() is used to workaround that bug.

It is not a workaround. skipAllTrailing() removes CR before LF (and any
other trailing whitespace after the prefix). Whether we should trim that
other trailing whitespace is debatable, I guess, but I think it makes
sense to do that (if not, RelaxedDelimiterCharacters should be replaced
with just CR).

>  which consumes again, more characters than the ones that might be part
> of the fold.
>  resulting in a header value of "helloworld" which should have been
> "hello world".

The space between "hello" and "world" is added in the unfoldMime() loop,
resulting in "hello world". If we replace RelaxedDelimiterCharacters
with just CR, then the result will be
"hello  world" (two spaces), which, in the majority of folding cases is
probably not what the sender meant, but I am sure there are exceptions
either way. I do not object to limiting trimming to CRs or CR.

> The purpose of this unfolder is to unfold. End of task. Not to cleanup
> the headers. Nor to impose any "garbage" meaning on the non-fold
> characters. 

If you are writing a strict message validator, yes. If you are writing
an HTTP proxy that is supposed to handle real traffic, then the answer
is far less clear. At any rate, whether whitespace before obs-fold is
garbage is a separate and minor question. The sketched code accommodates
each answer trivially: all RelaxedDelimiterCharacters, all CRs, or just
one CR.

> That job belongs to the parser logic which does lexical
> interpretation of the header field-value.

In general, the header field parser would not be able to do anything
about this garbage. The garbage spaces may be inside a quoted string
and/or extension header that we do not parse. We have to make a decision
here. The header field parser will make its own decisions.

> As a side note. I'm not sure we should treat a whole sequence of folds
> as a single SP. We are allowed to replace 1 fold with 1+ SP characters.
> But nothing was discussed in the WG about N folds being replaced with
> just 1 SP character. So we can't really assume some other
> implementations wont interpret each fold as a SP with meaning. It would
> be odd, but then anything playing around with sequences of folds is
> already odd.

Agreed. Another secondary decision: skipAllObsFolds() or just skipObsfold().

> Also note there are whitespace characters which are not part of the fold
> sequence. So even though we can remove CR*LF 1*(SP/HTAB) - we cant
> safely remove the other 'garbage' whitespace characters as part of the
> fold whitespace suffix

Whether we should remove that garbage or not is debatable. My sketch
removes it and makes it pretty clear what is being removed. Your email
questioning that removal is a good proof of that -- I have no idea what
your loop removes or does not remove, and would be unable to safely
adjust it to remove less (or more).

>>> +const SBuf &CrLf();
>>
>> I suspect this should be defined inline for performance reasons.
> 
> If we do that we have again the issue you brought up earlier about
> multiple globals init order

Defining this function as inline (i.e. in the header) does not create
any initialization issues.

> Then we can look at that later if/when the usage grows. Right now it
> doesn't really have a proper need to be outside Parser.

Oh, it has plenty of "need". We just have [correctly] chosen to ignore
that need during this project.

  $ bzr grep -F '"\r\n"'  | wc -l
  58

>>> But the savings is not enough to be bothered with AFAICS. 
>>
>> There are no performance savings or losses here to worry about. The code
>> in question is "fast enough" for the non-performance-sensitive path it
>> is on because containsObsFold ought to be false in the vast majority of
>> cases. This code should be optimized for quality, not speed.
> 
> For regular traffic that is right.
> 
> But DoS attack conditions are also something we have to consider as a
> high priority in these early parsing stages. So we need to consider
> speed even if we technically call it a "slow path".

We do not need to optimize exceptional path speed as long as it is
reasonable. Parsing speed does _not_ protect from DoS attacks, and we
have far bigger problems to solve before optimizing obs-fold handling
may become important.

> My earlier option #2 loop was about 1% faster than the Tokenizer
> do-while loop.

In other words, there was no meaningful performance difference that you
could measure. I do not know why you keep bringing this "1%" up.

Alex.