[squid-dev] [PATCH] mime unfolding

Alex Rousskov rousskov at measurement-factory.com
Thu May 19 01:59:09 UTC 2016

On 05/14/2016 06:42 AM, Amos Jeffries wrote:

> One of the parsers you will find attached (MIME_UNFOLD_SLOW) is that
> loop expanded into functions so each if statement gets its own named
> function and so non-genius people can avoid reading the description of
> unfoldMime.

I am afraid you misunderstand the problem I am trying to solve. IMO, the
parsing code, as it was written before, was impossible for a human with
average abilities to fully comprehend. The number of if statements is
not the problem on its own. Moving an if statement into a function does
not solve the problem. There is just too much persistent state and state
change/interaction complexity. We have seen many examples of similar
code leading to bugs because nobody could fully comprehend the complex
interactions happening in loops like that.

With very few exceptions that neither of us, sadly, qualifies for, humans
are incapable of writing low-level non-trivial parsing code correctly. I
have seen so many bugs in such code (in Squid and elsewhere) that I am
quite convinced that this is a nearly "universal truth".

Until we have better parsing tools at our disposal, we should use
tokenizers and similar safety helpers when writing parsing code, and we
should keep it simple. This will not fully protect us from bugs (nothing
will!), but it will reduce the number of bugs and will protect us from
most CVEs.

During this particular project, as you know, it took several private
review iterations to replace unsafe (and immediately buggy!) low-level
code with tokenizers. Reducing that loop complexity is the final step. I
was hoping that providing a sketch of how to do that would allow for a
quick and painless finale, but I was obviously wrong.

I tried again below, and I hope you will like the final result more. I
am leaving intermediate steps in case you would want to stop sooner than
I did. They are short.

The simplest implementation I can think of would look like this:

    while (!tk.atEnd()) {
        if (skipObsFolds(tk))
            result += ' '; // replace obs-folds with SP
        result += tk.chars(1); // advance one character
    }
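For readers who want to poke at that simplest loop outside Squid, here is a
hypothetical stand-alone rendering on std::string_view. skipObsFolds() and
its lenient CR?LF-plus-whitespace reading of obs-fold are assumptions of
mine for illustration, not Squid code:

```cpp
#include <cassert>
#include <string>
#include <string_view>

/// hypothetical helper: consumes one or more consecutive obs-folds
/// (an optional CR, then LF, then at least one SP or HTAB) at `pos`
/// and returns whether any were consumed
static bool skipObsFolds(std::string_view in, size_t &pos)
{
    bool found = false;
    for (;;) {
        size_t p = pos;
        if (p < in.size() && in[p] == '\r')
            ++p; // optional CR
        if (p >= in.size() || in[p] != '\n')
            break; // no LF: not an obs-fold
        ++p;
        if (p >= in.size() || (in[p] != ' ' && in[p] != '\t'))
            break; // LF without folding whitespace: not an obs-fold
        while (p < in.size() && (in[p] == ' ' || in[p] == '\t'))
            ++p; // swallow the folding whitespace
        pos = p;
        found = true;
    }
    return found;
}

/// the simplest loop above, one character at a time
std::string unfoldMime(std::string_view in)
{
    std::string result;
    size_t pos = 0;
    while (pos < in.size()) {
        if (skipObsFolds(in, pos))
            result += ' '; // replace obs-folds with SP
        if (pos < in.size())
            result += in[pos++]; // advance one character
    }
    return result;
}
```

Note that skipObsFolds() here collapses a run of consecutive folds into a
single SP, and a CRLF that is not followed by whitespace passes through
untouched.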

but we do not want to append one character at a time, so we optimize to
append more characters (while watching for the obs-fold start). That
"more characters" optimization is where the complexity and bugs come in.

I would start with appending all characters until CR or LF because an
obs-fold has to start with CR or LF. This algorithm will not always
extract the longest possible sequence, but it will be reasonably
efficient while remaining simple:

    while (!tk.atEnd()) {
        if (skipObsFolds(tk))
            result += ' '; // replaced all obs-folds with one SP
        result += tokenUntilCrLf(tk); // advance one or more chars
    }

    /// always extracts the first character;
    /// extracts more characters if they are not CR or LF
    SBuf tokenUntilCrLf(Tokenizer &tk) {
        const SBuf buf = tk.remaining();
        auto skipped = tk.skipOne(ANY); // sequenced before the next skip
        skipped += tk.skipAll(notCRLF);
        return buf.substr(0, skipped);
    }
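In case it helps to experiment outside Squid, a stand-alone analogue of
tokenUntilCrLf() could look like the sketch below; std::string_view plus an
explicit `pos` cursor stand in for the Tokenizer, and the names are
illustrative rather than Squid API:

```cpp
#include <cassert>
#include <string_view>

/// stand-alone analogue of the tokenUntilCrLf() sketch above:
/// always extracts the first character;
/// extracts more characters if they are not CR or LF
std::string_view tokenUntilCrLf(std::string_view in, size_t &pos)
{
    const size_t start = pos;
    if (pos < in.size())
        ++pos; // always advance past the first character
    while (pos < in.size() && in[pos] != '\r' && in[pos] != '\n')
        ++pos; // and then past any non-CRLF run
    return in.substr(start, pos - start);
}
```

When called while positioned at a CR or LF, it returns just that one
control character; that "always extracts the first character" guarantee is
what keeps the calling loop making progress.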

Furthermore, we can observe that skipping the leading CR?LF? is always
safe _after_ our skipObsFolds() check. Thus, we can replace that
"skipped = ..." line above with a version that may skip even more (kept
as separate statements because the order of the side-effecting skip
calls matters):

  // this will skip at least one character (CR, LF, and/or notCRLF)
  auto skipped = tk.skipOne(CR);
  skipped += tk.skipOne(LF);
  skipped += tk.skipAll(notCRLF);

Now you may notice that skipObsFolds() has to _start_ with a similar
CR?LF skipping sequence, so we can extract that common code from both
helper functions to arrive at something like this:

    while (!tk.atEnd()) {
        const SBuf all = tk.remaining();
        const auto crLen = tk.skipOne(CR); // may not be there
        const auto lfLen = tk.skipOne(LF); // may not be there
        if (lfLen && tk.skipAll(WSP)) // obs-fold!
            result += ' '; // replace one obs-fold with one SP
        else
            result += all.substr(0, crLen + lfLen + tk.skipAll(notCRLF));
    }

The above is a little too complex for my taste but is probably close to
optimal performance, given the tools that we have to use. "A little too
complex" because it is not _obvious_ that the loop is always making
progress. The somewhat hidden fact here is that if crLen+lfLen is zero
(i.e., no progress yet), then the next character can be neither CR nor
LF, so tk.skipAll(notCRLF) has to succeed (progress!).

The usual "this is a sketch" disclaimer applies.
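In the same spirit, here is a hypothetical stand-alone rendering of that
final loop, with direct std::string_view scanning standing in for Squid's
Parser::Tokenizer; treat it as a sketch under those assumptions, not as
Squid code:

```cpp
#include <cassert>
#include <string>
#include <string_view>

/// stand-alone rendering of the final sketch above
std::string unfoldMime(std::string_view in)
{
    std::string result;
    size_t pos = 0;
    while (pos < in.size()) {
        const size_t start = pos; // like tk.remaining()
        const size_t crLen = (in[pos] == '\r') ? 1 : 0; // may not be there
        pos += crLen;
        const size_t lfLen = (pos < in.size() && in[pos] == '\n') ? 1 : 0; // may not be there
        pos += lfLen;
        size_t wspLen = 0;
        if (lfLen) { // mirrors the short-circuit in (lfLen && tk.skipAll(WSP))
            while (pos < in.size() && (in[pos] == ' ' || in[pos] == '\t')) {
                ++pos;
                ++wspLen;
            }
        }
        if (lfLen && wspLen) { // obs-fold!
            result += ' '; // replace one obs-fold with one SP
        } else {
            // keep the CR?LF? bytes skipped so far plus the non-CRLF
            // run that follows (like tk.skipAll(notCRLF))
            while (pos < in.size() && in[pos] != '\r' && in[pos] != '\n')
                ++pos;
            result += in.substr(start, pos - start);
        }
    }
    return result;
}
```

One observable difference from the first sketch, matching the comments in
the respective sketches: here each obs-fold becomes its own SP, whereas
skipObsFolds() replaces a run of consecutive obs-folds with a single SP.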

>  this is C/C++ not one of those formal proofing languages, and
> *can* implement the faster parser without compromising readability.

Agreed, but you did not accomplish that goal IMHO.

>> The delimiter is obs-fold, not CRLF.

> It is neither.

Here I obviously disagree. The delimiter is obs-fold. No grammar
complexities and ambiguities change that fact.

I am skipping the rest of the discussion about the types of parsers we
are writing. To reach any consensus there would require the level of
accuracy in terminology and methodology that I have never seen on this
mailing list. If you think that I sketched a recursive descent parser,
we both can survive that misunderstanding. At the end of the day, a
parser category consensus would not solve any important problems.

