[squid-dev] [PATCH] Support for regex with \-escaped characters

Thu Dec 4 15:32:04 UTC 2014

On 12/04/2014 07:33 AM, Amos Jeffries wrote:
> On 4/12/2014 12:04 p.m., Alex Rousskov wrote:
>> On 12/03/2014 07:46 AM, Amos Jeffries wrote:
>>> On 21/11/2014 5:38 p.m., Alex Rousskov wrote:
>>>> On 11/17/2014 05:56 PM, Amos Jeffries wrote: 
>>>>> +    while (ConfigParser::RecognizeQuotedPair_ &&
>>>>> *(nextToken-1) == '\\') {

>>>> Are you sure that nextToken never points at the start of a
>>>> buffer, especially when parsing parameters loaded from external
>>>> files? If you are absolutely sure, this guarantee should be
>>>> documented where ConfigParser::TokenParse() is defined.
>>>> Otherwise, the code has to be revised (but still support regex
>>>> parameter files that start with a backslash).

>>> I dont believe that is possible. nextToken always points at the 
>>> current buffer,

>> Yes, but I am worried that nextToken can point to the very
>> beginning of the current buffer, rendering *(nextToken-1) invalid.

>>> There is an un-conditional strcspn() call prior to the
>>> while-loop which will move the nextToken pointer down into the
>>> buffer at least one character in distance.

>> Why do you expect that strcspn() to always return a positive
>> value?

> Return value is a size_t (unsigned).

That fact does not answer my question. Why do you expect strcspn() to
always return a value greater than zero? A positive returned value is
required to satisfy your assertion that a "strcspn() call prior to the
while-loop will move the nextToken pointer down into the buffer at least
one character in distance".

>>> If the buffer starts with a '\' character the nextToken should
>>> be pointing at its escaped value character.
> 
>> Where would nextToken be increased in that case, prior to reaching
>> the loop? AFAICT, strcspn() would return zero when the buffer
>> starts with a backslash and, hence, nextToken will remain pointing
>> at the beginning of the buffer. It is possible that I have
>> overlooked some other code that makes that impossible, but even if
>> I did, the change looks very fragile at best.

The above paragraph details the earlier concern with a specific example.

>>>> The second problem with this code is that it incorrectly
>>>> handles "\\X" AFAICT: X should not be escaped, even though it
>>>> is preceded by a backslash (because that backslash is itself
>>>> escaped).
> 
>>> This code does not un-escape anything. All it does is notice '\' 
>>> escapes and ensures that both the '\' and character following
>>> are included in the token
> 
>> Unfortunately, the same code also includes characters that were
>> _not_ protected/escaped/whatever by '\'. Details below.
> 
> 
>>> The regex pattern parser is responsible for any un-escaping and
>>> processing of the token itself.
> 
>> Agreed. I am only talking about tokenizing, which is this code 
>> responsibility.
> 
> 
>>> while (... *(nextToken-1) == '\\') { ++nextToken; // skip the
>>> quoted-pair (\-escaped) character
> 
> 
>> I am talking about the "skip" code above. It is incorrectly
>> skipping X in "\\X". For example, it will "skip" (i.e., include in
>> token) a space character in
> 
>> foo\\ bar
> 
>> The above text has two space-separated tokens: "foo\\" and "bar",
>> but the loop will treat that text as one token, because it sees the
>> space character as "escaped". That is wrong AFAICT.
> 
> 
> Ah, I see. The !ConfigParser::RecognizeQuotedValues test was
> preventing the correct delimiter set being used until it had already
> reached the loop.
> 
> Fixed now.

FWIW, I do not see the fix yet (trunk r13732).

Alex.