[squid-dev] [PATCH] implement RFC3986

Tue Jan 5 12:11:32 UTC 2016

I've been playing around with these changes today and the more I thnk
about it the more it seems it should be in src/anyp/ using the Uri::
namespace - which will encompass both RFCs 1738 and 3896 CharacterSet
definitions along with other src/URL.h src/url.cc code implementing
those RFCs logics.

On 29/12/2015 10:53 a.m., Alex Rousskov wrote:
> On 12/28/2015 12:01 PM, Kinkie wrote:
>> On Mon, Dec 28, 2015 at 5:46 PM, Alex Rousskov wrote:
>>> On 12/28/2015 07:01 AM, Kinkie wrote:
>>>> +static int
>>>> +fromhex(char ch)
>>>> +{
>>>> +    if (ch >= '0' && ch <= '9')
>>>> +        return ch - '0';
>>>> +    if (ch >= 'a' && ch <= 'f')
>>>> +        return ch - 'a' + 10;
>>>> +    if (ch >= 'A' && ch <= 'F')
>>>> +        return ch - 'A' + 10;
>>>> +    return -1;
>>>> +}
>>>
>>> If you are after performance, an unconditional lookup in a 256-member
>>> int vector would be faster.
>>
>> Performance is good, but it's not the primary objective. For now I'd
>> leave as-is.
> 
> Consider adding a TODO comment then. Someday, somebody might actually
> look through TODO comments and implement simple and useful optimizations
> like this one [ instead of complaining that there are only hard problems
> that they cannot solve well or, worse, attempting to solve those problems ].
> 

Today I re-implemented both tohex() and fromhex() as array lookups. It
is a little time consuming to manually generate the fromhex array. But a
one-off action and seems to work fine.

> 
>>>> +    for (auto in = s.begin(); in != e; ++in) {
>>>> +        if (*in != '%') { // normal case, copy and continue
>>>> +            rv.push_back(*in);
>>>> +            continue;
>>>> +        }
>>> ...
> 
> 
>>> I cannot validate that low-level C code. Can you use a Tokenizer
>>> instead? Is it critical to provide copy-less performance for helpers
>>> using std::string instead of SBuf?
> 
> 
>> Performance for helpers is not really critical IMO.
>> SBuf is a lot of code and it really depends on infrastructure provided
>> by squid (e.g. mempools).
>> Outside squid it really makes not much sense to bring in that much
>> scaffolding - which would also have include stubs etc.
>> It's much better to be API-compatible and use std::string there.
> 
> I doubt that writing low-level code that is difficult to review and that
> we often get wrong is worth std::string API compatibility. I have
> already speculated about that in the previous squid-dev message so I
> will not repeat those thoughts here.
> 
> 
> 
> 
>>>> +const CharacterSet
>>>> +    RFC3986::Unsafe("rfc1738:unsafe", "<>\"# %{}|\\^~[]`'"),

RFC 1738 does not specify the ' character in that unsafe set.

>>>> +    RFC3986::Ctrls("rfc1738:ctrls", {{0x00, 0x1f}, {0x7f,0xff}}),
>>>> +    RFC3986::Reserved1738("rfc1738:reserved", ";/?:@=&"),
>>> ...
>>>> +    RFC3986::Unescaped = (UnsafeAndCtrls - CharacterSet(nullptr,"%") ).rename("rfc1738:unescaped"),
>>>
>>>
>>> If these are RFC 1738 sequences, why place them in the RFC 3986 namespace?
>>
>> Providing callers backwards compatibility is the main reason for me.
> 
> This answer does not compute for me: AFAIK, you are adding a _new_ class
> or namespace called RFC3986. Thus, backwards compatibility is not an
> issue here.
> 
> Please note that I am not saying that "rfc1738:ctrls" and
> "rfc1738:unsafe" sets should not be provided. I am saying that if they
> are provided, they should be provided inside an Rfc1738 namespace rather
> than inside RFC3986.

The set called Ctrls is actually CharacterSet::CTL, which we already
have defined in CharacterSet.h

Amos