[squid-dev] [PATCH] Support for regex with \-escaped characters

Thu Dec 4 16:18:13 UTC 2014

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 5/12/2014 4:32 a.m., Alex Rousskov wrote:
> On 12/04/2014 07:33 AM, Amos Jeffries wrote:
>> On 4/12/2014 12:04 p.m., Alex Rousskov wrote:
>>> On 12/03/2014 07:46 AM, Amos Jeffries wrote:
>>>> On 21/11/2014 5:38 p.m., Alex Rousskov wrote:
>>>>> On 11/17/2014 05:56 PM, Amos Jeffries wrote:
>>>>>> +    while (ConfigParser::RecognizeQuotedPair_ && 
>>>>>> *(nextToken-1) == '\\') {
> 
>>>>> Are you sure that nextToken never points at the start of a 
>>>>> buffer, especially when parsing parameters loaded from
>>>>> external files? If you are absolutely sure, this guarantee
>>>>> should be documented where ConfigParser::TokenParse() is
>>>>> defined. Otherwise, the code has to be revised (but still
>>>>> support regex parameter files that start with a
>>>>> backslash).
> 
> 
>>>> I dont believe that is possible. nextToken always points at
>>>> the current buffer,
> 
>>> Yes, but I am worried that nextToken can point to the very 
>>> beginning of the current buffer, rendering *(nextToken-1)
>>> invalid.
> 
> 
>>>> There is an un-conditional strcspn() call prior to the 
>>>> while-loop which will move the nextToken pointer down into
>>>> the buffer at least one character in distance.
> 
> 
>>> Why do you expect that strcspn() to always return a positive 
>>> value?
> 
> 
>> Return value is a size_t (unsigned).
> 
> That fact does not answer my question. Why do you expect strcspn()
> to always return a value greater than zero? A positive returned
> value is required to satisfy your assertion that a "strcspn() call
> prior to the while-loop will move the nextToken pointer down into
> the buffer at least one character in distance".
> 

I think we were getting misdirected by the behaviour caused by the
sep= logics. My tests were showing a guaranteed increment, but when I
added the dump of which sep value was being used it showed the '\'
were not being found at all. The !RecognizeQuotedValues check was
causing "sep=w_space;" and skiping past them.

The new commit, both finds the '\', checks it without a -1 on
nextToken and increments over exactly 2 chars for the "\x" sequence.

> 
>>>> If the buffer starts with a '\' character the nextToken
>>>> should be pointing at its escaped value character.
>> 
>>> Where would nextToken be increased in that case, prior to
>>> reaching the loop? AFAICT, strcspn() would return zero when the
>>> buffer starts with a backslash and, hence, nextToken will
>>> remain pointing at the beginning of the buffer. It is possible
>>> that I have overlooked some other code that makes that
>>> impossible, but even if I did, the change looks very fragile at
>>> best.
> 
> The above paragraph details the earlier concern with a specific
> example.
> 
> 
>>>>> The second problem with this code is that it incorrectly 
>>>>> handles "\\X" AFAICT: X should not be escaped, even though
>>>>> it is preceded by a backslash (because that backslash is
>>>>> itself escaped).
>> 
>>>> This code does not un-escape anything. All it does is notice
>>>> '\' escapes and ensures that both the '\' and character
>>>> following are included in the token
>> 
>>> Unfortunately, the same code also includes characters that
>>> were _not_ protected/escaped/whatever by '\'. Details below.
>> 
>> 
>>>> The regex pattern parser is responsible for any un-escaping
>>>> and processing of the token itself.
>> 
>>> Agreed. I am only talking about tokenizing, which is this code
>>>  responsibility.
>> 
>> 
>>>> while (... *(nextToken-1) == '\\') { ++nextToken; // skip
>>>> the quoted-pair (\-escaped) character
>> 
>> 
>>> I am talking about the "skip" code above. It is incorrectly 
>>> skipping X in "\\X". For example, it will "skip" (i.e., include
>>> in token) a space character in
>> 
>>> foo\\ bar
>> 
>>> The above text has two space-separated tokens: "foo\\" and
>>> "bar", but the loop will treat that text as one token, because
>>> it sees the space character as "escaped". That is wrong
>>> AFAICT.
>> 
>> 
>> Ah, I see. The !ConfigParser::RecognizeQuotedValues test was 
>> preventing the correct delimiter set being used until it had
>> already reached the loop.
>> 
>> Fixed now.
> 
> 
> FWIW, I do not see the fix yet (trunk r13732).

I collided with Christos commits, its at rev.13734.

Amos
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

iQEcBAEBAgAGBQJUgIlEAAoJELJo5wb/XPRjLQYH/jEP+RyAvAG6UNZwPgzeAB9x
qaQr8vL1167b9rAkUONQCRO3vfhKVZjOqVrHEQ2h0cXWCSEunGDVEREvBTuiDr4E
3pG+gd/dscgna2GamcDr/pdbNF9DSA9yAdkLgdRTbCP9XBqYPJjzaNobTGVnk7FZ
D23ALoUWV5JSgvlpEQwtXFyBuxfd/OS2T8kusyNWU9TbiOc6puHX/F/V6cOCO9rP
bOmqBVihF9wowvNRIFt13FYMRHcpQ9gsTQF75olxpdJTOf4avqBRPRlXYOeVxS5G
o9VF8R5s0hFFXZP8M8N0uVscwj+ISXTy+pv6gyVNA78NiECOpr3zjeZdnGepLr8=
=WcGj
-----END PGP SIGNATURE-----