[squid-dev] RFC: Adding a new line to a regex

Amos Jeffries squid3 at treenet.co.nz
Fri Jan 21 17:42:08 UTC 2022


On 20/01/22 10:32, Alex Rousskov wrote:
> Hello,
> 
>      We have a use case where a regex in squid.conf should contain/match
> a new line (i.e. ASCII LF). I do not know whether there are similar use
> cases with the existing squid.conf regex directives, but that is not
> important because we are adding a _new_ directive that will need such
> support. This email discusses the problem and proposes how to add a new
> line (and other special characters) to regexes found in squid.conf and such.


With the current mix of squid.conf parsers this RFC seems irrelevant to me.

The developer designing a new directive also writes the parse_*() 
function that processes the config file line. All they have to do is 
avoid using the parser functions that implicitly perform the problematic 
behaviour.
  The fact that there is logic imposing this problem at all is a bug to 
be resolved. But that is something for a different RFC.


> 
> Programming languages usually have standard mechanisms for adding
> special characters to strings from which regexes are compiled. We all
> know that "a\nb" uses LF byte in the C++ string literal. Other bytes can
> be added as well: https://en.cppreference.com/w/cpp/language/escape
> 

There was a plan from 2014 (re-attempted by Christos in 2016) to migrate 
Squid from the GNURegex dependency to the more flexible C++11 regex 
library, which supports several regex languages. With that plan the UI 
would only need an option flag or pattern prefix to specify which 
language a pattern uses.
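
To make the "pattern prefix" idea concrete, here is a rough sketch using 
std::regex. The "ecmascript:" and "extended:" prefixes and both function 
names are invented for illustration only; nothing like them exists in 
Squid today, and the real UI could just as well be an option flag.

```cpp
#include <regex>
#include <string>
#include <utility>

using Grammar = std::regex_constants::syntax_option_type;

// Hypothetical prefix syntax (an assumption, not an existing Squid feature):
//   "ecmascript:PATTERN" or "extended:PATTERN".
// A pattern without a known prefix is treated as POSIX extended.
static std::pair<Grammar, std::string>
splitGrammarPrefix(const std::string &raw)
{
    static const std::string ecma = "ecmascript:";
    static const std::string ext = "extended:";
    if (raw.compare(0, ecma.size(), ecma) == 0)
        return {std::regex_constants::ECMAScript, raw.substr(ecma.size())};
    if (raw.compare(0, ext.size(), ext) == 0)
        return {std::regex_constants::extended, raw.substr(ext.size())};
    return {std::regex_constants::extended, raw};
}

// Compiles a configured pattern with the grammar named by its prefix.
// Throws std::regex_error if the pattern is invalid in that grammar.
static std::regex
compileConfiguredPattern(const std::string &raw)
{
    const auto parts = splitGrammarPrefix(raw);
    return std::regex(parts.second, parts.first);
}
```

With the ECMAScript grammar selected, the two-character sequence \n in 
the pattern text matches an LF byte without any squid.conf-specific 
escaping rules, e.g. compileConfiguredPattern("ecmascript:a\\nb") 
matches "a", LF, "b".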

That plan was put on hold because CentOS 7 and RHEL, which need to be 
able to build Squid, distribute feature-incomplete GCC 4.8 versions.

One Core Developer (you, Alex) has repeatedly expressed a strong opinion 
vetoing the addition/removal of features to Squid-6 while they are 
still officially supported by a small set of "officially supported" 
Vendors, RHEL and CentOS being in that set.


When combined, those two design limitations mean the C++11 regex library 
cannot be implemented in a Squid released prior to June 2024.



IMO that plan is still a good one for the long term. However you design 
your new directive UI, please make it compatible with that.


> Unfortunately, squid.conf syntax lacks a similar general mechanism.

This is not a property of squid.conf design choices. It is an artifact 
of the GNURegex language.

Until Squid gets a major upgrade to support other regex languages, we 
are stuck with these pattern limitations.

> In most cases, it is possible to work around that limitation by entering
> special symbols directly. However, that trick causes various headaches
> and does not work for new lines at all because squid.conf preprocessor
> and parameter parser use/strip all new lines; the code compiling the
> regular expression will simply not see any.
> 
> In POSIX regex(7), the two-character \n escape sequence is referring to
> the ASCII character 'n', not the new line/LF character, so entering \n
> (two characters) into a squid.conf regex value will not work if one
> wants to match ASCII LF.
> 
> There are many options for adding this functionality to regexes used in
> _new_ squid.conf contexts (i.e. contexts aware of this enhancement).
> Here is a fairly representative sample:
> 
> 1a. Recognize just \n escape sequence in squid.conf regexes
>     Pros: Simple.
>     Cons: Converting old regexes[1] requires careful checking[2].
>     Cons: Cannot detect typos in escape sequences. \r is accepted.
>     Cons: Cannot address other, similar use cases (e.g., ASCII CR).
> 
> 1b. Recognize all C escape sequences in squid.conf regexes
>     Pros: Can detect typos -- unsupported escape sequences.
>     Cons: Poor readability: Double-escaping of all for-regex backslashes!
>     Cons: Converting old regexes requires non-trivial automation.
> 

As you mention, these \-escapes are a feature of the POSIX Regular 
Expression language.

Taking this step, we will no longer be able to honestly say that Squid 
is only supporting GNU "regex" patterns. Open the floodgate and you will 
find a mountain of admins wanting the other POSIX features for one 
reason or another.

We would be better off accepting the long-ago planned migration to C++11 
regex than taking more half-measures like implementing \-escape patterns 
ourselves.
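
For concreteness, a minimal sketch of the kind of pre-pass that (1a) 
would have us write and maintain before handing the pattern to the regex 
compiler. The function name is hypothetical and this is not existing 
Squid code; it only illustrates the idea under the assumption that \n is 
the sole sequence rewritten.

```cpp
#include <string>

// Hypothetical helper, not existing Squid code: replaces the two-character
// sequence \n with an LF byte and copies every other backslash pair
// verbatim, so "\\n" (escaped backslash, then 'n') is left alone.
static std::string
expandNewlineEscapes(const std::string &pattern)
{
    std::string out;
    out.reserve(pattern.size());
    for (std::string::size_type i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c != '\\' || i + 1 == pattern.size()) {
            out += c; // ordinary byte, or a trailing lone backslash
            continue;
        }
        const char next = pattern[++i];
        if (next == 'n') {
            out += '\n'; // the only sequence this sketch rewrites
        } else {
            out += c;    // keep pairs like "\." and "\\" for the regex compiler
            out += next;
        }
    }
    return out;
}
```

Note that such a pre-pass silently changes the meaning of any existing 
pattern that relied on \n matching the letter 'n', and it accepts \r and 
every other sequence unchanged, which are the drawbacks already listed 
under (1a).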


> 
> 2a. Recognize %byte{n} logformat-like sequence in squid.conf regexes
>     Pros: Simple.
>     Cons: Converting old regexes[1] requires careful checking[3].
>     Cons: Cannot detect typos in logformat-like sequences.
>     Cons: Does not support other advanced use cases (e.g., %tr).
> 
> 2b. Recognize %byte{n} and logformat sequences in squid.conf regexes
>     Pros: Can detect typos -- unsupported logformat sequences.
>     Cons: The need to escape % in regexes will surprise admins.
>     Cons: Converting old regexes requires (simple) automation.
> 
> 
> 3. Use composition to combine regexes and some special strings:
>     regex1 + "\n" + regex2
>     or
>     regex1 + %byte{10} + regex2
>     Pros: Old regexes can be safely used without any conversions.
>     Cons: Requires new, complex composition expressions/syntax.
>     Cons: A bit difficult to read.
>     Cons: Requires a lot of development.
> 

Please no. There are enough regex languages confusing people. Let's not 
be responsible for creating yet another.

That is my clear "no" vote on all (2) and (3) idea variants.

> 
> 4. Use 2b but only when regex is given to a special function:
>     substitute_logformat_codes(regex)
>     Pros: Old regexes can be safely used without any conversions.
>     Pros: New regexes do not need to escape % (by default).
>     Pros: Extendable to old regex configuration contexts.
>     Pros: Extendable to non-regex configuration contexts.
>     Pros: Reusing the existing parameters(...)-like call syntax.
>     Cons: A bit more difficult to read than 1a or 2a.
>     Cons: Duplicates "quoted string" approach in some directives[4].
>     Cons: Requires arguing about the new function name :-).
> 

Or (5) Alex puts aside his objection blocking the plan to convert Squid 
to the C++11 regex library.


Cheers
Amos

