[squid-dev] RFC: Adding a new line to a regex

Alex Rousskov rousskov at measurement-factory.com
Sat Jan 22 04:17:02 UTC 2022


TLDR: Solutions #4 (substitute_logformat_codes) and #6 (STL regex) are
still leading. The next key question is: Do we want this feature to be
the last drop that brings STL regex support to Squid? If yes, #6 wins
IMO. Two followup questions at the end of the email, below details...


On 1/21/22 8:59 PM, Amos Jeffries wrote:
> On 22/01/22 08:36, Alex Rousskov wrote:
>> TLDR: I am adding solution #6 into the mix based on Amos email (#5 was
>> taken by Eduard). Amos needs to clarify why he thinks that Squid master
>> branch cannot accept STL-based regexes "now". After that, we can decide
>> whether #6 remains a viable candidate. Details below.
>>
>>
>> On 1/21/22 12:42 PM, Amos Jeffries wrote:
>>> On 20/01/22 10:32, Alex Rousskov wrote:
>>>> We have a use case where a regex in squid.conf should contain/match
>>>> a new line [...] This email discusses the problem and proposes how
>>>> to add a new line (and other special characters) to regexes found
>>>> in squid.conf and such.
>>
>>
>>> With the current mix of squid.conf parsers this RFC seems irrelevant
>>> to me.
>>
>> I do not understand the relationship between "the current mix of
>> squid.conf parsers" and this RFC relevance. This RFC is relevant because
>> it is about a practical solution to a real problem facing real Squid
>> admins.
>>
> 
> Sentence #2 of the RFC explicitly states that admin needs are not
> relevant "I do not know whether there are similar use
> cases with the existing squid.conf regex directives"

There is no connection between that sentence and admin needs:

* Admin needs are not limited to the existing squid.conf regex directives.

* It is possible that similar needs do exist for the existing squid.conf
regex directives. I just do not know it. This existence is not required
for the RFC to be relevant (because of the first bullet).


> The same sentence delimits RFC scope as: "adding a _new_ directive that
> will need such support."

Correct.


> That means the syntax defining how the regex pattern is configured does
> not yet exist. It is not necessary for the developer to design their
> _new_ UI syntax in a way that exposes admin to this problem in the first
> place. Simply design the

The above paragraph was cut short, but I agree with its beginning: It is
not necessary. In fact, I would say that hitting the problem we are
trying to solve would be a very bad idea! However, our definition of
"way that exposes" may differ.


>> Whether Squid has one parser or ten, good ones or bad ones, is relevant
>> to how the solution is implemented/integrated with Squid, of course, but
>> that is already a part of the analysis on this thread.

> Very relevant.

Glad we agree.


> RFC cites "squid.conf preprocessor and parameter parser
> use/strip all new lines" as a problem.
>
> I point out that this behaviour depends on *which* config parser is
> chosen to be used by the (again _new_) directive. It should be an
> implementation detail for the dev, not design consideration for this RFC.

The behavior I described is rooted in the _preprocessor_ behavior, so
one cannot work around it by inventing a new directive parser. By the
time we get to the directive parser, it is already too late -- the new
lines are already stripped.

Finally, even if we modify the preprocessor to preserve new lines in
some special cases, preserving real new lines is actually not a good
solution at all! That solution results in very awkward-looking
configuration statements. The code would not care, of course, but humans
writing those statements will. All of the proposed solutions are better IMO.


> I must state that I do not see much in the say of squid.conf syntax
> discussion in the RFC text. It seems to focus a lot on syntax inside the
> regex pattern.

... which is obviously a part of the encompassing squid.conf syntax, but
it does not really matter -- if you prefer to think of this RFC as
discussing regex pattern syntax, you can do that and still make correct
decisions AFAICT.


> IMO regex is such a complicated situation that we should avoid having
> special things inside or on top of its syntax. That is a recipe for
> admin pain.

I agree, we should, but we also should solve the problem this RFC is
addressing. The problem obviously has acceptable solutions. There is no
reason to say "regex are too complicated" and walk away. We can make
things better here by finding the best solution we can find.


>> 6. Use STL regex features that support \n and similar escape sequences
>> Pros: Supports much more than just advanced escape sequences!
>> Pros: The new syntax is easy to document by referencing library docs.
> 
> Pro: we do not have to write any part of pattern matching ourselves.
> Simpler config parser.
> 
> Pro: we do not have to maintain custom code supporting special
> behaviours in regex pattern configuration.
> 
> Pro: we do not have to provide additional user support for non-standard
> squid.conf patterns.
> 
> Pro: we do not have to waste brain cycles designing how to integrate
> syntax into regex patterns cleanly.

Some of the new Pros above are serious exaggerations, but I do not think
we need to fight over these details at this point.


>> Cons: Requires serious changes to the internal regex support in Squid.

> IIRC, the changes are not as serious as it may seem. The largest part is
> squid.conf parser alteration to accept the proposals flag/prefix and
> patterns cleanly. Beyond that is just a switch of container which is
> easy (not trivial, just easy).

Given the poor state of some underlying C++ classes, and the code using
them, it will not be easy to do this correctly IMO. I have seen easier
changes take a lot of iterations to get right and to merge, for many
reasons that are still here.

However, I do not think we have to argue about the exact difficulty
level of this task for now. You say "easy". I say "serious". I doubt
this vague difference in our estimates will decide which solution is the
best. If it will, we can come back to this item.


>> Cons: Miserable STL regex performance in some environments[1,2]?
> 
> IMO this is balanced by Squid existing regex being well known to have
> similar performance issues.

We can always make things much worse by adding STL regex library-induced
delays to the existing Squid delays. AFAICT, many folks believe that the
GNU regex library we are using is much faster than some popular STL
library implementations. Whether most of that difference is in memory
allocations and whether most of those memory allocations are equally bad
for existing GNU regex code in Squid is currently unknown.

I suspect we will not know for sure until we implement at least some
rudimentary form of STL regex support so that we can do a meaningful
test (for a given STL implementation).


>> Cons: Converting old regexes requires (complex) automation.
> 
> Disagree this is problem.
> 
> GNU regex is predecessor syntax behind all modern regex variants. We can
> retain GNUregex as the default pattern and require language flag/prefix
> for patterns needing modern features.

We can, but this is not what this Cons item is talking about. This
specific Cons item is talking about the cost of a decision by an admin
and/or (eventually) the Squid Project to switch to _one_ (new) regex
style everywhere. At that point, complex conversion will be required.

You may believe that that moment will never come. I am not so sure
because there is obviously value in using (new) regexes and using one
regex style everywhere. I do not think we need to agree on this.


>> Cons: Requires dropping GCC v4.8 support.
>> Cons: Amos thinks Squid cannot support STL regex until 2024.
> 
> the technical part of my earlier statement is already
> covered by the GCC 4.8 line.

Since I did not know what your assertions regard 2024 are based on, I
could not safely merge the two entries. If this is just about GCC v4.8,
I will disregard the second one.


>> [2] STL does not allow us to define a custom allocator for its regexes.
>> Various STL implementations have various hidden workarounds, but we will
>> be at their (varying) mercy.

> probably should be a Con in its own right.

Sure, let's treat it as such.



>>> One Core Developer (you Alex) has repeatedly expressed a strong opinion
>>> veto'ing the addition/removal of features to Squid-6 while they are
>>> still officially supported by a small set of "officially supported"
>>> Vendors. RHEL and CentOS being in that set.

>> Sorry, I have no idea what you are talking about.

> Your latest voicing of it was in
> http://lists.squid-cache.org/pipermail/squid-dev/2021-December/009743.html

Nothing in that email precludes stopping GCC v4.8 support in master.


>> "
>> Any
>> known Squid regression affecting the "main" environment should block the
>> PR introducing that regression IMO.
>> "

I still believe the above assertion is correct, but it is irrelevant
here. Regression tests do not drive what Project changes are possible.
*We* decide which environments we support, not our regression tests!


> The definition of "main" under discussion in that thread never reached
> consensus to change away from the existing OS represented by the Jenkins
> 5-pr-test nodes. So (for now) it still includes LTS versions of RHEL /
> CentOS 7 shipping the broken GCC 4.8.x std::regex.

Yes, but so what?! We have the power to decide:

1. Whether master must continue to support GCC v4.8. If we decide that
we should drop that support (for various good reasons beyond this RFC),
then we will obviously stop testing Squid for GCC v4.8 regressions or
move those tests into the "optional" or non-blocking categories.

2. Whether the new feature must be available in GCC v4.8 builds. If
there are good reasons to base this feature on STL regexes, and good
reasons to keep GCC v4.8 support, then we _can_ make the new feature
conditional on newer compilers. Doing so increases implementation
complexity and support headaches, but it is technically possible. We
already have a Cons item to remind us about this problem. I am not
saying this is something we should do, only that this is an option to
consider.

I knew I was not repeatedly vetoing dropping GCC v4.8. Glad this was
clarified, putting one more wild misinterpretation behind us.


Do you think we should bite the bullet and add STL regex support to
Squid master branch?

If yes, I suggest _postponing_ the decision regarding GCC v4.8 support
until it becomes clear how difficult it would be to make the new code
_compile_ on GCC v4.8? I know it is not going to _work_, but Squid can
easily detect that and reject configurations that are using new regexes
with GCC v4.8 builds. "Sorry, you cannot use this regex with this
binary" errors are not ideal, but this approach saves a lot of work, and
adds support for the vast majority of Squid deployments that are willing
to run master/v6.


Thank you,

Alex.


More information about the squid-dev mailing list