[squid-dev] RFC: Adding a new line to a regex

Wed Jan 19 21:32:03 UTC 2022

Hello,

    We have a use case where a regex in squid.conf should contain/match
a new line (i.e. ASCII LF). I do not know whether there are similar use
cases with the existing squid.conf regex directives, but that is not
important because we are adding a _new_ directive that will need such
support. This email discusses the problem and proposes how to add a new
line (and other special characters) to regexes found in squid.conf and such.

Programming languages usually have standard mechanisms for adding
special characters to strings from which regexes are compiled. We all
know that the C++ string literal "a\nb" contains an LF byte. Other bytes
can be added as well: https://en.cppreference.com/w/cpp/language/escape

Unfortunately, squid.conf syntax lacks a similar general mechanism. In
most cases, it is possible to work around that limitation by entering
special symbols directly. However, that trick causes various headaches
and does not work for new lines at all because the squid.conf
preprocessor and parameter parser consume or strip all new lines; the
code compiling the regular expression will simply not see any.
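
To illustrate why a literal LF cannot survive, here is a toy Python model
of a line-oriented configuration reader. It is only a sketch and not
Squid's actual preprocessor or parser; the new_directive name and the
read_directives() helper are made up for illustration.

```python
def read_directives(conf_text):
    """Toy model: one directive per line, whitespace-separated tokens."""
    directives = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directives.append(line.split())
    return directives


# An admin tries to put a literal LF between "a" and "b" in the regex.
conf_text = "new_directive a\nb\n"
directives = read_directives(conf_text)
print(directives)  # [['new_directive', 'a'], ['b']]
```

In this model, the LF ends the directive instead of reaching the regex
compiler: the regex token is just "a", and "b" is read as a separate
(bogus) directive.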

In POSIX regex(7), the two-character \n escape sequence refers to
the ASCII character 'n', not the new line/LF character, so entering \n
(two characters) into a squid.conf regex value will not work if one
wants to match ASCII LF.

There are many options for adding this functionality to regexes used in
_new_ squid.conf contexts (i.e. contexts aware of this enhancement).
Here is a fairly representative sample:

1a. Recognize just the \n escape sequence in squid.conf regexes
   Pros: Simple.
   Cons: Converting old regexes[1] requires careful checking[2].
   Cons: Cannot detect typos in escape sequences: \r is silently accepted.
   Cons: Cannot address other, similar use cases (e.g., ASCII CR).

1b. Recognize all C escape sequences in squid.conf regexes
   Pros: Can detect typos -- unsupported escape sequences.
   Cons: Poor readability: every backslash meant for the regex must be doubled!
   Cons: Converting old regexes requires non-trivial automation.

2a. Recognize the %byte{n} logformat-like sequence in squid.conf regexes
   Pros: Simple.
   Cons: Converting old regexes[1] requires careful checking[3].
   Cons: Cannot detect typos in logformat-like sequences.
   Cons: Does not support other advanced use cases (e.g., %tr).

2b. Recognize %byte{n} and logformat sequences in squid.conf regexes
   Pros: Can detect typos -- unsupported logformat sequences.
   Cons: The need to escape % in regexes will surprise admins.
   Cons: Converting old regexes requires (simple) automation.

3. Use composition to combine regexes and some special strings:
   regex1 + "\n" + regex2
   or
   regex1 + %byte{10} + regex2
   Pros: Old regexes can be safely used without any conversions.
   Cons: Requires new, complex composition expressions/syntax.
   Cons: A bit difficult to read.
   Cons: Requires a lot of development.

4. Use 2b, but only when the regex is given to a special function:
   substitute_logformat_codes(regex)
   Pros: Old regexes can be safely used without any conversions.
   Pros: New regexes do not need to escape % (by default).
   Pros: Extendable to old regex configuration contexts.
   Pros: Extendable to non-regex configuration contexts.
   Pros: Reuses the existing parameters(...)-like call syntax.
   Cons: A bit more difficult to read than 1a or 2a.
   Cons: Duplicates "quoted string" approach in some directives[4].
   Cons: Requires arguing about the new function name :-).
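
The double-escaping cost of option 1b can be seen in a small Python
sketch. The decode_c_escapes() helper is hypothetical and handles only a
handful of C escape sequences; it is not Squid code.

```python
# Hypothetical subset of C escape sequences; a real list would be longer.
C_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}


def decode_c_escapes(raw):
    """Decode C-style escapes, rejecting anything not in C_ESCAPES."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(raw) or raw[i + 1] not in C_ESCAPES:
            raise ValueError(f"unsupported escape at offset {i} in {raw!r}")
        out.append(C_ESCAPES[raw[i + 1]])
        i += 2
    return "".join(out)


print(repr(decode_c_escapes(r"a\nb")))     # an LF byte between a and b
print(decode_c_escapes(r"example\\.com"))  # regex compiler sees example\.com
```

Under these assumptions, an old regex such as example\.com is rejected
(which is how typos get detected) and has to be rewritten as
example\\.com before it can be used.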

Given all the pros and cons, I think we should use option 4 above.
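
A minimal Python sketch of how option 4 could behave, for discussion
only. The function name comes from the option text above; the way the
call is detected, the use of %% for a literal percent sign, and the
restriction to %byte{n} (real logformat codes such as %tr are left out)
are all assumptions of this sketch, not a proposed implementation.

```python
import re

# Function name taken from option 4; the exact call syntax is assumed.
CALL_PREFIX = "substitute_logformat_codes("
# Only two codes are handled in this sketch: %% and %byte{n}.
_CODE = re.compile(r"%(?:(%)|byte\{(\d+)\})")


def expand_codes(raw):
    """Expand %% and %byte{n}; reject any other %code as a likely typo."""
    out = []
    pos = 0
    while pos < len(raw):
        if raw[pos] != "%":
            out.append(raw[pos])
            pos += 1
            continue
        match = _CODE.match(raw, pos)
        if match is None:
            raise ValueError(f"unsupported %code at offset {pos} in {raw!r}")
        if match.group(1):
            out.append("%")
        else:
            value = int(match.group(2))
            if value > 255:
                raise ValueError(f"%byte value {value} does not fit in one byte")
            out.append(chr(value))
        pos = match.end()
    return "".join(out)


def regex_source(conf_value):
    """Return the string that would be handed to the regex compiler."""
    if conf_value.startswith(CALL_PREFIX) and conf_value.endswith(")"):
        return expand_codes(conf_value[len(CALL_PREFIX):-1])
    return conf_value  # plain regexes pass through untouched


print(repr(regex_source("a%byte{10}b")))
print(repr(regex_source("substitute_logformat_codes(a%byte{10}b)")))
```

The first call leaves the old-style value alone, so existing regexes
containing % need no conversion; only the explicitly wrapped value gets
its %byte{10} replaced with an LF byte.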

Do you see any better options?

Thank you,

Alex.

[1] We are still talking about new configuration contexts here, but we
should also be thinking about bringing old regexes into new contexts
and, eventually, possibly even about upgrading old contexts to support
the new feature. Neither is required or urgent, but it would be nice to
eventually end up with a uniform regex approach, of course.

[2] In most cases, no old regexes will contain the \n sequence because
it means nothing to the regex compiler. A few exceptional regexes can be
edited manually. Automated conversion will be required only in some rare
cases.
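
The check described in [2] could look roughly like the Python sketch
below. Both helpers are hypothetical: decode_lf_only() models option 1a
(only \n is translated; every other backslash pair is passed to the regex
compiler unchanged), and changes_meaning() flags old regexes that would
be interpreted differently under that rule.

```python
def decode_lf_only(raw):
    """Option 1a model: turn the two characters \\n into one LF byte."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Keep any other backslash pair intact for the regex compiler.
            out.append("\n" if nxt == "n" else "\\" + nxt)
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


def changes_meaning(old_regex):
    """True if an old regex needs a manual look before reuse."""
    return decode_lf_only(old_regex) != old_regex


print(changes_meaning(r"\d+\.\d+"))  # False
print(changes_meaning(r"a\nb"))      # True
print(changes_meaning(r"a\\nb"))     # False: escaped backslash, then n
```

Note that a typo such as \r passes through decode_lf_only() unchanged,
which is the "cannot detect typos" drawback listed for option 1a.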

[3] Essentially the same as [2] above: Old regexes are unlikely to
contain the 5-character %byte sequence (or whatever we end up calling
that special sequence -- we are polishing a PR that adds %byte support
to logformat).
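
For comparison, option 2a can be sketched in a few lines of Python. The
%byte{n} spelling follows the option text but is still tentative, as
noted above; the expand_byte_codes() helper, the decimal-only value, and
the 0-255 range check are assumptions of this sketch.

```python
import re

# Assumed syntax: %byte{n} with a decimal byte value n.
_BYTE_CODE = re.compile(r"%byte\{(\d+)\}")


def expand_byte_codes(raw):
    """Option 2a model: replace each %byte{n} with the byte value n."""
    def to_char(match):
        value = int(match.group(1))
        if value > 255:
            raise ValueError(f"%byte value {value} does not fit in one byte")
        return chr(value)

    return _BYTE_CODE.sub(to_char, raw)


print(repr(expand_byte_codes("a%byte{10}b")))  # 'a\nb'
print(repr(expand_byte_codes("a%byt{10}b")))   # typo is left as-is
```

Any other use of % is left alone, so old regexes convert trivially, but
a misspelled sequence such as %byt{10} is not reported either.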

[4] Several existing squid.conf directives interpret "quoted values"
specially, substituting embedded logformat %codes. Arguably, the
explicit function call mechanism is better because there is less
confusion regarding which context supports it and which does not. And we
probably should not "quote regexes" because many old regexes contain
double quotes already.