[squid-dev] [PATCH] Support for regex with \-escaped characters

Amos Jeffries squid3 at treenet.co.nz
Wed Dec 3 14:46:46 UTC 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 21/11/2014 5:38 p.m., Alex Rousskov wrote:
> On 11/17/2014 05:56 PM, Amos Jeffries wrote:
> 
>> For now the detection is only added during parsing of regex
>> tokens or files.
> 
> And it should probably stay that way: We cannot easily add support
> for \-based escaping in bare strings because in some of those
> strings \ is meaningful on its own (unfortunately).
> 
> 
>> +    while (ConfigParser::RecognizeQuotedPair_ && *(nextToken-1)
>> == '\\') {
> 
> Are you sure that nextToken never points at the start of a buffer, 
> especially when parsing parameters loaded from external files? If
> you are absolutely sure, this guarantee should be documented where 
> ConfigParser::TokenParse() is defined. Otherwise, the code has to
> be revised (but still support regex parameter files that start with
> a backslash).

I dont believe that is possible. nextToken always points at the
current buffer, although it may point at the \0 c-string terminator.
There is an un-conditional strcspn() call prior to the while-loop
which will move the nextToken pointer down into the buffer at least
one character in distance.

If the buffer starts with a '\' character the nextToken should be
pointing at its escaped value character. If it does not start with
that then it will be somewhere down the buffer after the next delimiter.

I have moved the \0 terminator check up into the while() condition so
that it will not do a [-1] de-reference if the buffer is ever of
0-length. Although that should have been found and protected against
already by the if-condition at the start of the method.


> 
> The second problem with this code is that it incorrectly handles
> "\\X" AFAICT: X should not be escaped, even though it is preceded
> by a backslash (because that backslash is itself escaped).

This code does not un-escape anything. All it does is notice '\'
escapes and ensures that both the '\' and character following are
included in the token (unless that following character is the \0
terminator). The regex pattern parser is responsible for any
un-escaping and processing of the token itself.

It accounts for tokens starting with a '\', for multiple sequential
'\', and for "loaded file lines wrongly ending with a '\'. In that
latter case by treating them as an un-escaped '\' character in the
pattern token.


> 
>> +	<tag>configuration_includes_quoted_values</tag> +	<p>Regex
>> expression values can not be parsed when this directive is +
>> configured to <em>ON</em>.
> 
> 
> This is a little misleading because, IIRC, the 
> configuration_includes_quoted_values hack can be turned ON for just
> a section of the configuration file.\ I suggest s/when this
> directive is configured to ON/in parts of configuration where this
> directive is set to ON/.
> 

Done.

> 
> Please consider s/can not/cannot/ for improved consistency:
> 
> $ fgrep -i 'can not' src/*pre | wc -l 7 $ fgrep -i 'cannot'
> src/*pre | wc -l 19
> 

Done for the release notes text that made you point this out.
The others are unrelated to this patch. Will do as a separate cleanup
change.

> 
>> +	<tag>configuration_includes_quoted_values</tag> +	<p>Regex
>> expression values can not be parsed when this directive is +
>> configured to <em>ON</em>. Instead Squid now accepts regex
>> \-escaped +	   characters including escaped whitespace.
> 
> 
> We should clarify that \-escaped characters are supported when
> parsing regular expressions, not just any token. I also suggest
> moving the last sentence to its own paragraph as it has no relation
> to configuration_includes_quoted_values:

It is directly related to the previous sentence. eg. We do not support
A. Instead we support B.

Also, all <p> entries following <tag> are documented changes to that
same <tag>.

I opted for documenting this under
configuration_includes_quoted_values since the alternative would be
listing each and every directive in Squid which has a regex option and
repeating the same two sentences about how it operates in regards to
when a previous configuration_includes_quoted_values was set to ON.


I'm in half a mind to temporarily disable the
configuration_includes_quoted_values when calling RegexToken(). But
since we already have a hard error when the two collide I left it
as-is and try to document what to do instead of placing quotes around
the pattern.

> 
> <p>When parsing regular expressions, Squid now accepts \-escaped 
> characters, including escaped whitespace.</p>
> 
> If "escaped whitespace" means \s, I would say so explicitly. If it 
> means "\ ", I would say "including escaped space characters".

Escaped whitespace means \ followed by whitespace. As per the bug
report use case.

  acl foo browser Windows\ XP Windows.NT

has two patterns in it equivalent to the single pattern:
  acl foo browser Windows(\ XP|.NT)

Fixed.

>> +    static bool RecognizeQuotedPair_; ///< The next tokens may
>> containing quoted-pair (\-escaped) characters
> 
> s/containing/contain/

Fixed.

> 
> 
> Except for the nextToken-1 concerns, the above corrections are 
> cosmetic, of course.
> 

Fixed all, and applied to trunk as rev.13730

Amos
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

iQEcBAEBAgAGBQJUfyJVAAoJELJo5wb/XPRjRmsIAIv+hgXFiQd0St4U3wFYneSR
RGlFpnAGs8NyYiihxf7nnGBgYzVvbBbzLEIPSAzemGGfhfIsmu36ruSCj74qsA77
zTLt6bBwmSVUwrjrPdWmua64rswB4WOibtdaBr2+qCiy82G6CrbS8aP0CSC7h7p3
Uyvl1JzJHDFxz8qj31KJv+gvFo5RKYjD7p1Q+OJYkwhFweVk5G4C9Pr3olmyY7zn
cV4y07L+8NFt/+r2n0rceVBKjiITvAGXr3X9cKDhdJwUE102TqAY1XaPaSl3UQJE
GKrz0srs3A1Vn0JVyq7xf4GP4xqxrqNhLP0Uoh64oCn+l7ZXCuPyhVnghYg48Ck=
=O2Dd
-----END PGP SIGNATURE-----


More information about the squid-dev mailing list