[squid-dev] [PATCH] Support for regex with \-escaped characters

Wed Dec 3 23:04:20 UTC 2014

On 12/03/2014 07:46 AM, Amos Jeffries wrote:
> On 21/11/2014 5:38 p.m., Alex Rousskov wrote:
>> On 11/17/2014 05:56 PM, Amos Jeffries wrote:

>>> +    while (ConfigParser::RecognizeQuotedPair_ && *(nextToken-1)
>>> == '\\') {
> 
>> Are you sure that nextToken never points at the start of a buffer,
>> especially when parsing parameters loaded from external files? If
>> you are absolutely sure, this guarantee should be documented where
>> ConfigParser::TokenParse() is defined. Otherwise, the code has to
>> be revised (but still support regex parameter files that start with
>> a backslash).
> 
> I dont believe that is possible. nextToken always points at the
> current buffer, 

Yes, but I am worried that nextToken can point to the very beginning of
the current buffer, rendering *(nextToken-1) invalid.

> There is an un-conditional strcspn() call prior to the while-loop
> which will move the nextToken pointer down into the buffer at least
> one character in distance.

Why do you expect that strcspn() to always return a positive value?

> If the buffer starts with a '\' character the nextToken should be
> pointing at its escaped value character. 

Where would nextToken be increased in that case, prior to reaching the
loop? AFAICT, strcspn() would return zero when the buffer starts with a
backslash and, hence, nextToken will remain pointing at the beginning of
the buffer. It is possible that I have overlooked some other code that
makes that impossible, but even if I did, the change looks very fragile
at best.

>> The second problem with this code is that it incorrectly handles
>> "\\X" AFAICT: X should not be escaped, even though it is preceded
>> by a backslash (because that backslash is itself escaped).

> This code does not un-escape anything. All it does is notice '\'
> escapes and ensures that both the '\' and character following are
> included in the token

Unfortunately, the same code also includes characters that were _not_
protected/escaped/whatever by '\'. Details below.

> The regex pattern parser is responsible for any
> un-escaping and processing of the token itself.

Agreed. I am only talking about tokenizing, which is this code
responsibility.

>     while (... *(nextToken-1) == '\\') {
>         ++nextToken; // skip the quoted-pair (\-escaped) character

I am talking about the "skip" code above. It is incorrectly skipping X
in "\\X". For example, it will "skip" (i.e., include in token) a space
character in

  foo\\ bar

The above text has two space-separated tokens: "foo\\" and "bar", but
the loop will treat that text as one token, because it sees the space
character as "escaped". That is wrong AFAICT.

>> Except for the nextToken-1 concerns, the above corrections are
>> cosmetic, of course.

> Fixed all, and applied to trunk as rev.13730

The two non-cosmetic concerns remain for the now-committed code IMO.

Thank you,

Alex.