[squid-dev] [PATCH] mime unfolding

Amos Jeffries squid3 at treenet.co.nz
Sat May 14 12:42:37 UTC 2016


Attached patch implements three slightly differing parsers for mime
unfold. Their property differences are described in the documentation
for the unfoldMime() member.


On 13/05/2016 3:16 a.m., Alex Rousskov wrote:
> On 05/12/2016 05:55 AM, Amos Jeffries wrote:
>> On 12/05/2016 5:34 a.m., Alex Rousskov wrote:
>>> On 05/11/2016 05:32 AM, Amos Jeffries wrote:
>>>> +    do {
>>>> +        SBuf line;
>>>> +        if (tok.prefix(line, nonCRLF))
>>>> +            mimeHeaderBlock_.append(line);
>>>> +
>>>> +        // the skipAll() might erase bare-CR not part of a CR*LF
>>>> +        SBuf savePoint(tok.remaining());
>>>> +        auto crLen = tok.skipAll(CharacterSet::CR);
>>>> +
>>>> +        if (tok.skipOne(CharacterSet::LF)) {
>>>> +            if (tok.skipAll(CharacterSet::WSP))
>>>> +                mimeHeaderBlock_.append(' ');
>>>> +            else
>>>> +                mimeHeaderBlock_.append(Http1::CrLf());
>>>> +
>>>> +        } else if (crLen) {
>>>> +            // TODO: it might be better to replace the CR with one SP, but for now preserve
>>>> +            // preserve the bare-CR
>>>> +            tok.reset(savePoint);
>>>> +            if (tok.prefix(line, CharacterSet::CR))
>>>> +                mimeHeaderBlock_.append(line);
>>>> +        }
>>>> +
>>>> +    } while (!tok.atEnd());
> 
> 
>>> I certainly cannot
>>> validate it. I think the whole method ought to be much simpler/clearer.
>>> Something along these lines:
>>>
>>>     Tokenizer tok(mimeHeaderBlock_);
>>>     mimeHeaderBlock_.clear();
>>>     SBuf prefix;
>>>     while (tokenBeforeObsfolds(tk, prefix)) {
>>>         mimeHeaderBlock_.append(prefix);
>>>         mimeHeaderBlock_.append(' ');
>>>     }
>>>     mimeHeaderBlock_.append(tk.remaining());
>>>
>>> We already have a Tokenizer::token(prefix, delimiters) method. The
>>> tokenBeforeObsfolds() call above is exactly the same except the
> 
>> If you think their behaviour is exactly the same, 
> 
> I think the concept is exactly the same.
> 
> 
>> then I think you are
>> misunderstanding the purposes of at least one. But I wont draw out this
>> thread with that.
> 
> Ditto.
> 

I fixed the non-unfolding bug and finally clicked. You are using token()
to find a prefix. Not prefix(). So of course it will act like token().
 And I see it doing so. Discarding all the visible-text bytes form the
mime block, and keeping only the LF characters. Exactly the conceptual
operation we want to not do.

> 
>> Besides the 8 compile errors
> 
> As I said, it is a sketch. I cannot sketch non-trivial code without
> compile errors. I am surprised you are wasting your time trying to rub
> this in.
> 

Apologies. Your comment about needing to be a genius to read three if
statements rubbed me the wrong way yesterday. I know we are both
stressed out by other things as well as this and should have let it go.

One of the parsers you will find attached (MIME_UNFOLD_SLOW) is that
loop expanded into functions so each if statement gets its own named
function and so non-genius people can avoid reading the description of
unfoldMime.


> 
>> it has several behavioural bugs though I
>> see you did workaround one of them by adding an extra Tokenizer pass
>> across the data, which adds another bug.
>>
>> 1) not unfolding.
>>
>> Consider the trivial case of:
>>
>>  Foo: bar\r\n
>>  Folded: hello \r\n
>>   world\r\n
>>  \r\n
>>
>> tokenBeforeObsfolds() will detect a prefix of "Foo: bar\r" 
> 
> tokenBeforeObsfolds() should detect a prefix of "Foo: bar\r\nFolded:
> hello".

It doesn't. That was my point.


Lets get to the base of the design decision.

In the grammar we are dealing with for unfold; the CRLF, bare-LF,
bare-CR, and obs-fold (CR* LF WSP) are all slightly overlapping
delimiters with different meanings.

What you are asking for is:
<https://en.wikipedia.org/wiki/Recursive_descent_parser>
Note how the description make a point of: "Even when they terminate,
parsers that use recursive descent with backtracking may require
exponential time."

My testing has shown that to be true even with only a handful of folds.
Getting worse the more folds exist. With just 175 byte headers and 7
folds it halves the speed of the proxy response time.


What I wrote was an LL(0) parser with if-statements. The least complex
implementation of the least complex parser.


Just to be clear, I'm not putting this on you.  The sketch you came up
with looks like a formal logic language implementation of parser. Those
formal languages cannot implement certain LL or LR parsers involving
ambiguous tokens like we are dealing with here in simple ways - so LALR
is their favoured approach.
  But this is C/C++ not one of those formal proofing languages, and
*can* implement the faster parser without compromising readability.


> The delimiter is obs-fold, not CRLF.

It is neither. That is the core of the difficulty.

CR is optional and WSP is any one of a set of characters.

The only reliably fixed character in the delimiter is LF. It will always
be present in a fold, but like the others it may also be present in a
non-fold. The CR and WSP details are look-ahead or backtrack states for
the LR and LALR parsers. Which make them double in complexity with each
ambiguity.

The LL parser is just "remembering" the past state sequence and when it
gets to the WSP or non-WSP at the end can dump out the single
appropriate value for fold vs non-fold sequence to append() to the
output block.


> That is the whole idea
> behind this approach! If the sketched tokenBeforeObsfolds() does not do
> it, then it needs to be adjusted so that it does. I am trying to steer
> your towards a reasonable solution, not write the code for you.

You steered towards a recursive-descent LALR(5) parser.

That type is one that is particularly prone to infinite loops from
unterminated recursion which are not easily visible at the
per-unit/function level. They are also prone to causing stack size
explosions on large inputs even when not infinite. And as mentioned
above exponentially slow O(k^2). The larger the k-value (5 for this
unfold), the worse these problems become and less visible.

FYI: I've marked the lines that look slightly broken in terms of that
one unit at first reading, but are there to prevent unbounded recursion.
Note that the functional / recursive descent implementation of my LL
even has that problem at one point, its just a natural side effect of
the descent.


The LL(0) parse loop without recursion is always moving forward. So any
bugs are only about whether or not it tokenizes correctly. It will
always terminate in linear time.


> 
>> 2) erases characters not part of any folding.
>>  ... worse: it erases delimiters.
>>
>> the use of nonLF has consumed into prefix some characters that might
>> have been part of the fold sequence.

Sorry that was not right. The use of non-LF and token() has discarded
into the nether regions all non-LF characters. Then the above happens
for the LF that got consumed into prefix.

> 
> The folding sequence essentially starts with LF. Thus, there cannot be
> "characters that might have been part of the fold sequence" in nonLF.
> The only exception is CR before LF, which the sketch takes care of.

I understand that much was the intention.

I'm trying to convey that fold and atEnd() are not the only occurance of
halting points when one scans for LF in a blob. The if-statement you
sketched assumes they are.


That was previously embodied by:

>>>> +        if (tok.skipOne(CharacterSet::LF)) {
>>>> +            if (tok.skipAll(CharacterSet::WSP))
>>>> +                mimeHeaderBlock_.append(' ');
>>>> +            else
>>>> +                mimeHeaderBlock_.append(Http1::CrLf());
>>>> +
>>>> +        } else

 LF <char> (aka line terminator) becomes CRLF
 LF WSP (aka fold) becomes SP
 then loop continues for the next blob-before-LF


> 
>>  skipAllTrailing() is used to workaround that bug.
> 
> It is not a workaround. skipAllTrailing() removes CR before LF (and any
> other trailing whitespace after the prefix). Whether we should trim that
> other trailing whitespace is debatable, I guess, but I think it makes
> sense to do that (if not, RelaxedDelimiterCharacters should be replaced
> with just CR).

Thats what I mean by it being the workaround for CR having been consumed
in the first place. This is one instance of what is called
"backtracking" in parser terminology. Each backtracking is another +1 to
the k-value of the parser branch. k-value being a measure of the
complexity in formal analysis.


>> The purpose of this unfolder is to unfold. End of task. Not to cleanup
>> the headers. Nor to impose any "garbage" meaning on the non-fold
>> characters. 
> 
> If you are writing a strict message validator, yes. If you are writing
> an HTTP proxy that is supposed to handle real traffic, then the answer
> is far less clear. At any rate, whether whitespace before obs-fold is
> garbage is a separate and minor question. The sketched code accommodates
> each answer trivially: all RelaxedDelimiterCharacters, all CRs, or just
> one CR.
> 
> 
>> That job belongs to the parser logic which does lexical
>> interpretation of the header field-value.
> 
> In general, the header field parser would not be able to do anything
> about this garbage. The garbage spaces may be inside a quoted string
> and/or extension header that we do not parse. We have to make a decision
> here. The header field parser will make its own decisions.
> 

As the coder I'm deciding to do exactly what is documented in the RFC,
in the same way it describes (well, one of the three options). So that
when problems get reported about the specific header bytes produced we
can point at it and say "you are required to expect this when the client
sends obs-fold". If we make the decision on our own assumptions about
usage we are the ones causing the problems. The phrase about hoists and
petards comes to mind here.

I know you are generally against my choices to make the parser strict.
This is ironically one place where being strict means letting more
through, and being relaxed means erasing things people might have wanted
left alone.


The later layers get a lot more context to hopefully make a better
decision. So I'd rather defer what we can to there.

And yes I agree this is probably a case of when, not if, complaints arrive.


> 
>> As a side note. I'm not sure we should treat a whole sequence of folds
>> as a single SP. We are allowed to replace 1 fold with 1+ SP characters.
>> But nothing was discussed in the WG about N folds being replaced with
>> just 1 SP character. So we can't really assume some other
>> implementations wont interpret each fold as a SP with meaning. It would
>> be odd, but then anything playing around with sequences of folds is
>> already odd.
> 
> Agreed. Another secondary decision: skipAllObsFolds() or just skipObsfold().
> 

Going with single obs-fold removal at a time so as to remove an infinite
recursion loop in the parse functions.

> 
>> Also note there are whitespace characters which are not part of the fold
>> sequence. So even though we can remove CR*LF 1*(SP/HTAB) - we cant
>> safely remove the other 'garbage' whitespace characters as part of the
>> fold whitespace suffix
> 
> Whether we should remove that garbage or not is debatable. My sketch
> removes it and makes it pretty clear what is being removed. Your email
> questioning that removal is a good proof of that -- I have no idea what
> your loop removes or does not remove, and would be unable to safely
> adjust it to remove less (or more).
> 

It does not remove anything except the CR*LF WSP sequence of the
fold itself.


Amos
-------------- next part --------------
=== modified file 'src/http/one/Parser.cc'
--- src/http/one/Parser.cc	2016-04-12 18:12:15 +0000
+++ src/http/one/Parser.cc	2016-05-14 11:57:26 +0000
@@ -16,6 +16,12 @@
 /// RFC 7230 section 2.6 - 7 magic octets
 const SBuf Http::One::Parser::Http1magic("HTTP/1.");
 
+const SBuf &Http::One::CrLf()
+{
+    static const SBuf crlf("\r\n");
+    return crlf;
+}
+
 void
 Http::One::Parser::clear()
 {
@@ -25,11 +31,34 @@
     mimeHeaderBlock_.clear();
 }
 
+/// characters HTTP permits tolerant parsers to accept as delimiters
+static const CharacterSet &
+RelaxedDelimiterCharacters()
+{
+    // RFC 7230 section 3.5
+    // tolerant parser MAY accept any of SP, HTAB, VT (%x0B), FF (%x0C),
+    // or bare CR as whitespace between request-line fields
+    static const CharacterSet RelaxedDels =
+        (CharacterSet::SP +
+         CharacterSet::HTAB +
+         CharacterSet("VT,FF","\x0B\x0C") +
+         CharacterSet::CR).rename("relaxed-WSP");
+
+    return RelaxedDels;
+}
+
+/// characters used to separate HTTP fields
+const CharacterSet &
+Http::One::Parser::DelimiterCharacters()
+{
+    return Config.onoff.relaxed_header_parser ?
+           RelaxedDelimiterCharacters() : CharacterSet::SP;
+}
+
 bool
 Http::One::Parser::skipLineTerminator(Http1::Tokenizer &tok) const
 {
-    static const SBuf crlf("\r\n");
-    if (tok.skip(crlf))
+    if (tok.skip(Http1::CrLf()))
         return true;
 
     if (Config.onoff.relaxed_header_parser && tok.skipOne(CharacterSet::LF))
@@ -38,6 +67,330 @@
     return false;
 }
 
+/// all characters except the LF line terminator
+static const CharacterSet &
+LineCharacters()
+{
+    static const CharacterSet line = CharacterSet::LF.complement("non-LF");
+    return line;
+}
+
+#if !MIME_UNFOLD_FAST && !MIME_UNFOLD_SLOW
+
+/// RFC 7230 section 3.2
+/// obs-fold = CRLF 1*( SP / HTAB )
+static bool
+skipObsfold(Http1::Tokenizer &tk)
+{
+    Http1::Tokenizer testing(tk);
+    // RFC 7230 section 3.5:
+    // MAY accept a single LF as line terminator instead of CRLF
+    (void)testing.skipOne(CharacterSet::CR);
+    if (testing.skipOne(CharacterSet::LF) && testing.skipAll(CharacterSet::WSP)) {
+        debugs(25, 3, MYNAME);
+        tk = testing;
+        return true;
+    }
+    return false;
+}
+
+/// removes any trailing CR sequence from a buffer
+static SBuf
+trimTrailingCr(SBuf inBuf)
+{
+    Http1::Tokenizer trimming(inBuf);
+    trimming.skipAllTrailing(CharacterSet::CR);
+    SBuf outBuf = trimming.remaining();
+    debugs(25, 3, Raw("outBuf", outBuf.rawContent(), outBuf.length()));
+    return outBuf;
+}
+
+/// delimiter = CR*LF
+/// Upon success, sets pfx to characters before the first found delimiter
+/// \retval true when characters are terminated by CR*LF, not obs-fold
+/// \retval false when the character set is terminated by obs-fold, or not terminated
+static bool
+oneMimeLine(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    Http1::Tokenizer testing(tk);
+    SBuf line;
+    // dont remove !WSP or no unfold happens
+    if (testing.prefix(line, LineCharacters()) && testing.skipOne(CharacterSet::LF) && !testing.skipOne(CharacterSet::WSP)) {
+        pfx = trimTrailingCr(line);
+        debugs(25, 3, Raw("line", line.rawContent(), line.length()));
+        tk = testing;
+        return true;
+    }
+    return false;
+}
+
+/// delimiter = CR*LF
+/// Upon success, sets pfx to characters before the last found delimiter
+/// \retval true when characters are terminated by CR*LF, not obs-fold
+/// \retval false when the character set is terminated by obs-fold, or not terminated
+static bool
+someMimeLines(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    Http1::Tokenizer testing(tk);
+    SBuf line;
+    while (oneMimeLine(testing, line)) {
+        debugs(25, 3, Raw("line", line.rawContent(), line.length()));
+        pfx.append(line);
+        pfx.append(Http1::CrLf());
+    }
+    tk = testing;
+    return true;
+}
+
+/// delimiter = obs-fold
+/// Upon success, sets pfx to characters before the first found delimiter
+/// \retval true when a line terminated by LF WSP
+/// \retval false when the character set is terminated by CRLF or LF, not obs-fold
+static bool
+aFoldedLine(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    Http1::Tokenizer testing(tk);
+    SBuf line;
+    if (testing.prefix(line, LineCharacters()) && skipObsfold(testing)) {
+        pfx = trimTrailingCr(line);
+        debugs(25, 3, Raw("line", line.rawContent(), line.length()));
+        tk = testing;
+        return true;
+    }
+    return false;
+}
+
+/// delimiter = obs-fold
+/// Upon success, sets prefix to characters before the first found delimiter
+/// Handles recieving a series of lines terminated by CR*LF but not obs-fold
+static bool
+tokenBeforeObsfold(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    Http1::Tokenizer testing(tk);
+    SBuf linesBeforeFolding;
+    SBuf foldedLinePrefix;
+    if (someMimeLines(testing, linesBeforeFolding) && aFoldedLine(testing, foldedLinePrefix)) {
+        pfx.append(linesBeforeFolding);
+        pfx.append(foldedLinePrefix);
+        debugs(25, 3, Raw("token", pfx.rawContent(), pfx.length()));
+        tk = testing;
+        return true;
+    }
+
+    return false;
+}
+
+#endif
+
+#if MIME_UNFOLD_SLOW
+
+/// state = LF WSP
+/// delimiter = <char>
+static bool
+obsFolded(Http1::Tokenizer &tk)
+{
+    return tk.skipAll(CharacterSet::WSP);
+}
+
+/// state = CR
+/// delimiter = <char>
+static bool
+crToken(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    SBuf blob = tk.remaining();
+    auto crLen = tk.skipAll(CharacterSet::CR);
+    if (crLen) {
+        pfx = blob.chop(0, crLen);
+        debugs(25, 3, Raw("CR-token", pfx.rawContent(), pfx.length()));
+        return true;
+    }
+    return false;
+}
+
+/// state = LF
+/// delimiter = WSP | <char>
+/// Upon success, sets eol to the characters to be used as the terminator
+static bool
+lfToken(Http1::Tokenizer &tk, SBuf &eol)
+{
+    if (tk.skipOne(CharacterSet::LF)) {
+        debugs(25, 3, "LF-token");
+        static const SBuf space(" ");
+        if (obsFolded(tk))
+            eol = space;
+        else
+            eol = Http1::CrLf();
+        debugs(25, 3, "logical line end " << Raw("eol", eol.rawContent(), eol.length()));
+        return true;
+    }
+
+    return false;
+}
+
+/// state = LF | CR* LF
+/// delimiter = <char>
+/// Upon success, sets eol to the characters to be used as the terminator
+static bool
+eolToken(Http1::Tokenizer &tk, SBuf &eol)
+{
+    // check for LF first, even though CR comes first. otherwise infinite recursion happens
+    SBuf lfBlob;
+    if (lfToken(tk, lfBlob)) {
+        eol = lfBlob;
+        return true;
+    }
+
+    SBuf crBlob;
+    SBuf trailer;
+    if (crToken(tk, crBlob) && eolToken(tk, trailer)) {
+        eol = trailer;
+        return true;
+    }
+
+    if (!crBlob.isEmpty()) {
+        eol = crBlob;
+        debugs(25, 3, "undo CR-token " << Raw("eol", eol.rawContent(), eol.length()));
+        return true;
+    }
+
+    return false;
+}
+
+/// state = <char>
+/// delimiter = CR | LF
+static bool
+tokenBeforeEol(Http1::Tokenizer &tk, SBuf &pfx)
+{
+    static const CharacterSet nonCRLF = (CharacterSet::CR + CharacterSet::LF).complement().rename("non-CRLF");
+    SBuf line;
+    if (tk.prefix(line, nonCRLF)) {
+        pfx = line;
+        debugs(25, 3, Raw("blob", pfx.rawContent(), pfx.length()));
+        return true;
+    }
+    return false;
+}
+
+#endif // MIME_UNFOLD_SLOW
+
+/**
+ * Replace obs-fold with a single SP,
+ * normalize bare-LF into CRLF, and preserve bare-CR.
+ *
+ * RFC 7230 section 3.2.4
+ * "A server that receives an obs-fold in a request message that is not
+ *  within a message/http container MUST ... replace
+ *  each received obs-fold with one or more SP octets prior to
+ *  interpreting the field value or forwarding the message downstream."
+ *
+ * "A proxy or gateway that receives an obs-fold in a response message
+ *  that is not within a message/http container MUST ... replace each
+ *  received obs-fold with one or more SP octets prior to interpreting
+ *  the field value or forwarding the message downstream."
+ *
+ *
+ * We have three implementations. Fast, slow, and slowest.
+ * The default is to build and use the slowest logic.
+ * Use -DMIME_UNFOLD_FAST=1 compiler parameter to enable the fast.
+ * Use -DMIME_UNFOLD_SLOW=1 compiler parameter to enable the slow.
+ *
+ * The differences are outlined below;
+ *
+ * Complexity
+ *  - Fast is a single-pass LL(0) parse with 1 backtrack for bare-CR handling.
+ *  - Slow is a recursive-descent LL(0) parse with no backtracking.
+ *  - Slowest is a recursive-descent LALR(5) parse with 16 (2^4) backtracking cases.
+ *
+ * Performance
+ *  - Slowest test results vary between x5 and x8 slower than Fast.
+ *  - Slow test results vary between x2 and x6 slower than Fast.
+ *
+ * Code size
+ *  - Fast is 49 KB smaller in compiled form than Slowest.
+ *  - Slow is 18 KB smaller in compiled form than Slowest.
+ */
+void
+Http::One::Parser::unfoldMime()
+{
+    // syntax =  1*( [ blob ] eol )
+    // eol    = CR* [ LF [ WSP ] ]
+    // blob   = <any character except CR or LF >
+
+    Http1::Tokenizer tok(mimeHeaderBlock_);
+    mimeHeaderBlock_.clear();
+
+#if MIME_UNFOLD_FAST
+
+    static const CharacterSet nonCRLF = (CharacterSet::CR + CharacterSet::LF).complement().rename("non-CRLF");
+
+    while (!tok.atEnd()) {
+        SBuf line;
+        // preserve all characters not part of a mime line terminator
+        if (tok.prefix(line, nonCRLF)) {
+            mimeHeaderBlock_.append(line);
+        } // else, false just means no characters were found before a terminator
+
+        // skip any CR,
+        // checkpoint so we can restore bare-CR sequences if we need to
+        SBuf savePoint(tok.remaining());
+        auto crLen = tok.skipAll(CharacterSet::CR);
+
+        // found a line terminator (LF)
+        if (tok.skipOne(CharacterSet::LF)) {
+            // if it folds, replace with SP, otherwise with CRLF
+            if (tok.skipOne(CharacterSet::WSP))
+                mimeHeaderBlock_.append(' ');
+            else
+                mimeHeaderBlock_.append(Http1::CrLf());
+
+        } else if (crLen) {
+            // the CR found was not a line terminator.
+            // undo/backtrack and preserve the bare-CR.
+            tok.reset(savePoint);
+            if (tok.prefix(line, CharacterSet::CR))
+                mimeHeaderBlock_.append(line);
+
+        } else {
+            // no CR or LF remain. done
+            break;
+        }
+    }
+
+#elif MIME_UNFOLD_SLOW
+
+    while (!tok.atEnd()) {
+
+        SBuf blob;
+        if (tokenBeforeEol(tok, blob))
+            mimeHeaderBlock_.append(blob);
+
+        SBuf eolBlob;
+        if (eolToken(tok, eolBlob))
+            mimeHeaderBlock_.append(eolBlob);
+
+        else // no EOL remain. done
+            break;
+    };
+
+#else // slowest
+
+    SBuf prefix;
+    int counter=0;
+    while (tokenBeforeObsfold(tok, prefix)) {
+        debugs(25, 3, Raw("fold prefix", prefix.rawContent(), prefix.length()));
+        mimeHeaderBlock_.append(prefix);
+        mimeHeaderBlock_.append(' ');
+        prefix.clear();
+        ++counter;
+    }
+    debugs(25, 3, "erased " << counter << " obs-fold");
+
+#endif
+
+    debugs(25, 3, "mime remainder " << Raw("remainder", tok.remaining().rawContent(), tok.remaining().length()));
+    mimeHeaderBlock_.append(tok.remaining());
+}
+
 bool
 Http::One::Parser::grabMimeBlock(const char *which, const size_t limit)
 {
@@ -51,8 +404,8 @@
          *       So the rest of the code will need to deal with '0'-byte headers
          *       (ie, none, so don't try parsing em)
          */
-        // XXX: c_str() reallocates. performance regression.
-        if (SBuf::size_type mimeHeaderBytes = headersEnd(buf_.c_str(), buf_.length())) {
+        bool containsObsFold = false;
+        if (SBuf::size_type mimeHeaderBytes = headersEnd(buf_, containsObsFold)) {
 
             // Squid could handle these headers, but admin does not want to
             if (firstLineSize() + mimeHeaderBytes >= limit) {
@@ -64,6 +417,36 @@
             }
 
             mimeHeaderBlock_ = buf_.consume(mimeHeaderBytes);
+
+            /* RFC 7230 section 3:
+             * "A recipient that receives whitespace between the start-line and
+             * the first header field MUST ... consume each whitespace-preceded
+             * line without further processing of it."
+             *
+             * We need to always use the relaxed delimiters here to prevent
+             * line smuggling through strict parsers.
+             * Note that 'whitespace' in RFC 7230 includes CR. So that means
+             * sequences of CRLF will be pruned, but not sequences of bare-LF.
+             */
+            Http1::Tokenizer tok(mimeHeaderBlock_);
+            while (tok.skipOne(RelaxedDelimiterCharacters())) {
+                (void)tok.skipAll(LineCharacters()); // optional line content
+                // LF terminator is required.
+                // trust headersEnd() to ensure that we have at least one LF
+                (void)tok.skipOne(CharacterSet::LF);
+            }
+            // If mimeHeaderBlock_ had just whitespace line(s) followed by CRLF,
+            // then we skipped everything, including that terminating LF.
+            // Restore the terminating CRLF if needed.
+            if (tok.atEnd())
+                mimeHeaderBlock_ = Http1::CrLf();
+            else
+                mimeHeaderBlock_ = tok.remaining();
+            // now mimeHeaderBlock_ has 0+ fields followed by the LF terminator
+
+            if (containsObsFold)
+                unfoldMime();
+
             debugs(74, 5, "mime header (0-" << mimeHeaderBytes << ") {" << mimeHeaderBlock_ << "}");
 
         } else { // headersEnd() == 0
@@ -102,12 +485,10 @@
     debugs(25, 5, "looking for " << name);
 
     // while we can find more LF in the SBuf
-    static CharacterSet iso8859Line = CharacterSet("non-LF",'\0','\n'-1) + CharacterSet(NULL, '\n'+1, (unsigned char)0xFF);
     Http1::Tokenizer tok(mimeHeaderBlock_);
     SBuf p;
-    static const SBuf crlf("\r\n");
 
-    while (tok.prefix(p, iso8859Line)) {
+    while (tok.prefix(p, LineCharacters())) {
         if (!tok.skipOne(CharacterSet::LF)) // move tokenizer past the LF
             break; // error. reached invalid octet or end of buffer insted of an LF ??
 
@@ -120,7 +501,7 @@
             continue;
 
         // drop any trailing *CR sequence
-        p.trim(crlf, false, true);
+        p.trim(Http1::CrLf(), false, true);
 
         debugs(25, 5, "checking " << p);
         p.consume(namelen + 1);

=== modified file 'src/http/one/Parser.h'
--- src/http/one/Parser.h	2016-04-12 15:07:13 +0000
+++ src/http/one/Parser.h	2016-05-06 13:45:26 +0000
@@ -111,6 +111,10 @@
     /// consume from the tokenizer and return true only if found
     bool skipLineTerminator(Http1::Tokenizer &tok) const;
 
+    /// the characters which are to be considered valid whitespace
+    /// (WSP / BSP / OWS)
+    static const CharacterSet &DelimiterCharacters();
+
     /**
      * Scan to find the mime headers block for current message.
      *
@@ -139,6 +143,9 @@
 
     /// Whether the invalid HTTP as HTTP/0.9 hack expects a mime header block
     bool hackExpectsMime_;
+
+private:
+    void unfoldMime();
 };
 
 } // namespace One

=== modified file 'src/http/one/RequestParser.cc'
--- src/http/one/RequestParser.cc	2016-01-01 00:12:18 +0000
+++ src/http/one/RequestParser.cc	2016-05-06 13:45:26 +0000
@@ -114,30 +114,6 @@
     return UriChars;
 }
 
-/// characters HTTP permits tolerant parsers to accept as delimiters
-static const CharacterSet &
-RelaxedDelimiterCharacters()
-{
-    // RFC 7230 section 3.5
-    // tolerant parser MAY accept any of SP, HTAB, VT (%x0B), FF (%x0C),
-    // or bare CR as whitespace between request-line fields
-    static const CharacterSet RelaxedDels =
-        CharacterSet::SP +
-        CharacterSet::HTAB +
-        CharacterSet("VT,FF","\x0B\x0C") +
-        CharacterSet::CR;
-
-    return RelaxedDels;
-}
-
-/// characters used to separate HTTP fields
-const CharacterSet &
-Http::One::RequestParser::DelimiterCharacters()
-{
-    return Config.onoff.relaxed_header_parser ?
-           RelaxedDelimiterCharacters() : CharacterSet::SP;
-}
-
 /// characters which Squid will accept in the HTTP request-target (URI)
 const CharacterSet &
 Http::One::RequestParser::RequestTargetCharacters()

=== modified file 'src/http/one/RequestParser.h'
--- src/http/one/RequestParser.h	2016-01-01 00:12:18 +0000
+++ src/http/one/RequestParser.h	2016-05-06 13:45:26 +0000
@@ -56,7 +56,6 @@
     bool skipTrailingCrs(Http1::Tokenizer &tok);
 
     bool http0() const {return !msgProtocol_.major;}
-    static const CharacterSet &DelimiterCharacters();
     static const CharacterSet &RequestTargetCharacters();
 
     /// what request method has been found on the first line

=== modified file 'src/http/one/forward.h'
--- src/http/one/forward.h	2016-01-01 00:12:18 +0000
+++ src/http/one/forward.h	2016-05-06 13:45:26 +0000
@@ -10,6 +10,7 @@
 #define SQUID_SRC_HTTP_ONE_FORWARD_H
 
 #include "base/RefCount.h"
+#include "sbuf/forward.h"
 
 namespace Http {
 namespace One {
@@ -27,6 +28,9 @@
 class ResponseParser;
 typedef RefCount<Http::One::ResponseParser> ResponseParserPointer;
 
+/// CRLF textual representation
+const SBuf &CrLf();
+
 } // namespace One
 } // namespace Http
 

=== modified file 'src/mime_header.cc'
--- src/mime_header.cc	2016-01-01 00:12:18 +0000
+++ src/mime_header.cc	2016-05-08 10:00:42 +0000
@@ -13,7 +13,7 @@
 #include "profiler/Profiler.h"
 
 size_t
-headersEnd(const char *mime, size_t l)
+headersEnd(const char *mime, size_t l, bool &containsObsFold)
 {
     size_t e = 0;
     int state = 1;
@@ -35,7 +35,10 @@
                 state = 2;
             else if ('\n' == mime[e])
                 state = 3;
-            else
+            else if (' ' == mime[e] || '\t' == mime[e]) {
+                containsObsFold = true;
+                state = 0;
+            } else
                 state = 0;
 
             break;

=== modified file 'src/mime_header.h'
--- src/mime_header.h	2016-01-01 00:12:18 +0000
+++ src/mime_header.h	2016-05-10 12:20:08 +0000
@@ -11,7 +11,35 @@
 #ifndef SQUID_MIME_HEADER_H_
 #define SQUID_MIME_HEADER_H_
 
-size_t headersEnd(const char *, size_t);
+/**
+ * Scan for the end of mime header block.
+ *
+ * Which is one of the following octet patterns:
+ * - CRLF CRLF, or
+ * - CRLF LF, or
+ * - LF CRLF, or
+ * - LF LF
+ *
+ * Also detects whether a obf-fold pattern exists within the mime block
+ * - CR*LF (SP / HTAB)
+ *
+ * \param containsObsFold will be set to true if obs-fold pattern is found. Otherwise not changed.
+ */
+size_t headersEnd(const char *, size_t, bool &containsObsFold);
+
+inline size_t
+headersEnd(const SBuf &buf, bool &containsObsFold)
+{
+    return headersEnd(buf.rawContent(), buf.length(), containsObsFold);
+}
+
+/// \deprecated caller needs to be fixed to handle obs-fold
+inline size_t
+headersEnd(const char *buf, size_t sz)
+{
+    bool ignored;
+    return headersEnd(buf, sz, ignored);
+}
 
 #endif /* SQUID_MIME_HEADER_H_ */
 

=== modified file 'src/tests/stub_mime.cc'
--- src/tests/stub_mime.cc	2016-01-01 00:12:18 +0000
+++ src/tests/stub_mime.cc	2016-05-06 13:45:26 +0000
@@ -11,5 +11,5 @@
 #define STUB_API "mime.cc"
 #include "tests/STUB.h"
 
-size_t headersEnd(const char *mime, size_t l) STUB_RETVAL(0)
+size_t headersEnd(const char *, size_t, bool &) STUB_RETVAL(0)
 



More information about the squid-dev mailing list