[squid-dev] [WTF] HttpHeader strangenesses

Alex Rousskov rousskov at measurement-factory.com
Tue Aug 11 16:36:11 UTC 2015


On 08/10/2015 08:47 PM, Amos Jeffries wrote:
> On 11/08/2015 11:10 a.m., Alex Rousskov wrote:
>> On 08/10/2015 02:13 PM, Kinkie wrote:
>>> I'm going over HttpHeader to see if there are any possible improvements
>>> to be obtained by dragging it (kicking and screaming, from the look of
>>> it) into the 10's.
>>>
>>> Here's a list of weirdnesses I'm coming to find; hopefully someone has
>>> some idea of WHY they are like this, or whether they can safely be removed:
>>>
>>> - HttpHeader::insertTime makes no sense. It's O(N) and it's always used
>>> in two patterns: delById(HDR); insertTime(HDR) and
>>> if (!has(HDR)) insertTime(HDR).
>>
>>
>> Henrik may know why he went the extra mile to insert the Date header at
>> the beginning of the message as opposed to just appending it at the end
>> (trunk r7037).


> IIRC it's an HTTP/1.0 performance optimization.
> 
> Locating Host quickly avoids work in the inefficient getHost() pre-parse
> function, which is now Http::One::Parser::getHeaderField().
> 
> Locating the Date header at the top avoids a similar scan to determine
> cacheability and validity of received replies. AFAIK that is no longer
> useful now that HTTP/1.1 involves other headers than Date.


If you are talking about optimizing Squid code, then I think we should
work on removing any "preparse" scans such as getHeaderField() calls in
prepareAcceleratedURL() instead of discussing their optimization:

1. Quickly parse into an object supporting fast lookups/manipulations.
2. Perform semantic analysis and manipulations using the object from #1.
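
Something like this shape, as a minimal illustrative sketch (all names
here are invented; this is not Squid code):

    #include <map>
    #include <optional>
    #include <string>

    // Phase 1: one pass over the raw header block, building an object
    // that supports fast lookups. Case folding and OWS trimming omitted.
    class ParsedHeader {
    public:
        explicit ParsedHeader(const std::string &raw) {
            size_t pos = 0;
            while (pos < raw.size()) {
                const size_t eol = raw.find("\r\n", pos);
                const std::string line = raw.substr(pos,
                    eol == std::string::npos ? std::string::npos : eol - pos);
                if (line.empty())
                    break; // end of the header block
                const size_t colon = line.find(':');
                if (colon != std::string::npos)
                    fields_.emplace(line.substr(0, colon),
                                    line.substr(colon + 1));
                if (eol == std::string::npos)
                    break;
                pos = eol + 2;
            }
        }

        // Phase 2 building block: semantic code asks the object and
        // never re-scans the raw buffer the way getHeaderField() does.
        std::optional<std::string> get(const std::string &name) const {
            const auto it = fields_.find(name);
            if (it == fields_.end())
                return std::nullopt;
            return it->second;
        }

    private:
        std::multimap<std::string, std::string> fields_;
    };

With that in place, code like prepareAcceleratedURL() would work on the
parsed object instead of re-scanning raw bytes.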


> So if we want to optimize, it is probably best to optimize support for
> this type of insertion rather than to remove it completely.

We should not remove what we need, of course. Just be careful about
optimizing something we do not use often or should not be using often.

For example:

* Probably useful optimization: Replace most remove/insert pairs with an
update operation that does not remove or insert anything unless the
removal is actually needed (see the sketch after this list).

* Probably a waste of time: Optimizing the insertion-at-the-top that
either does not happen often now or should not continue to happen often
[after other, useful optimizations].
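
For the first bullet, the update operation could look roughly like this
(illustrative types and names only; Entry and updateTime are not real
Squid identifiers):

    #include <ctime>
    #include <vector>

    struct Entry { int id; time_t value; };

    // Replaces the delById(HDR); insertTime(HDR) pair: overwrite the
    // existing entry in place and append only when the field is absent.
    void updateTime(std::vector<Entry> &entries, int id, time_t now)
    {
        for (auto &e : entries) {
            if (e.id == id) {
                e.value = now; // in-place update; no removal, no insertion
                return;
            }
        }
        entries.push_back(Entry{id, now}); // absent: a plain append suffices
    }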


> We probably want to be making the headers list into multiple lists and
> just add/remove/append to the appropriate one. Squid will mostly not
> have to deal with the entity or custom headers at all and it wastes time
> searching through them for control headers that were already found
> during parse.

Avoiding scans for something already detected is important, but I do not
think it is (or should be) related to splitting headers into multiple
lists. I recommend keeping a single container at least until we catch
all the bigger fish. Moreover, the way we optimize lookups may influence
how we split the storage container in the future (if such a split is
needed at all).
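
For example, a small per-message index can make known-header lookups O(1)
without touching the container layout at all. A sketch, with assumed names
and a fixed id range (not Squid code):

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t KnownHeaderIds = 64;
    constexpr std::size_t nopos = static_cast<std::size_t>(-1);

    struct IndexedHeader {
        std::vector<int> entries; // stand-in for the real entry list
        std::array<std::size_t, KnownHeaderIds> firstPos;

        IndexedHeader() { firstPos.fill(nopos); }

        void append(int id) {
            if (firstPos[id] == nopos)
                firstPos[id] = entries.size(); // remember the first hit
            entries.push_back(id);
        }

        bool has(int id) const { return firstPos[id] != nopos; } // no scan
    };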


> We may also want to think about having the control headers as
> first-class unique members of HttpHeader instead of using the list
> accesses. For example, HdrDate, HdrCc, HdrSc, ... members that each
> store the parsed value along with a "not present" flag.

Yes, and we already store some parsed field values (e.g.,
content_length), but clumsily.
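
A less clumsy shape is easy to sketch (assumed names; std::optional
already provides the "not present" flag that our -1 sentinels emulate):

    #include <cstdint>
    #include <ctime>
    #include <optional>

    struct ControlHeaders {
        std::optional<time_t> date;            // Date: absent vs. parsed
        std::optional<int64_t> contentLength;  // Content-Length, likewise
    };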


> Leaving the headers list only containing the entity and unknown headers.

Not storing original header field images is a separate optimization
decision IMO (compared to storing their parsed values). We may actually
want to preserve images while caching parsed values.
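
That is, each cached field could carry both representations, along the
lines of this sketch (invented names; not a proposal for the exact layout):

    #include <ctime>
    #include <optional>
    #include <string>

    struct DateField {
        std::string image;            // original bytes: "Date: Tue, ..."
        std::optional<time_t> parsed; // cached semantic value
        bool dirty = false;           // set when Squid modifies the value
    };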


>  Assembling the message buffer would then be a hard-coded sequence of {
> member Host, member Date, member Cc, member ..., walk generic headers list }

If we are talking about optimization, it may be best to _avoid_
re-assembly of header fields that do not require modifications. If that
is too difficult, then we may want to avoid generating string images
for header fields for which we have already seen a reusable string
image. All of this will probably save a lot more cycles than splitting
the header field container.
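
The image-reuse idea, sketched with the same invented field shape as
above (none of this is existing Squid code):

    #include <string>
    #include <vector>

    struct Field {
        std::string image;        // preserved on-wire bytes (with CRLF)
        std::string name, value;
        bool dirty = false;
    };

    // Assemble the outgoing header block, regenerating text only for
    // fields that Squid actually modified.
    std::string pack(const std::vector<Field> &fields)
    {
        std::string out;
        for (const auto &f : fields) {
            if (!f.dirty && !f.image.empty())
                out += f.image; // cheap: reuse the image we have seen
            else
                out += f.name + ": " + f.value + "\r\n"; // re-assemble
        }
        return out;
    }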


>>> - statistics accounting: we count A LOT (see HttpHeader::clean). Is it
>>> really worth it in performance-critical code such as this? Maybe we could
>>> shuffle things around to HttpHeader's and HttpHeaderEntry's ctors and dtors ...

>> The statistics code is ugly and, IIRC, collects inaccurate stats in many
>> cases. I am sure it can be significantly improved.
>>
>> However, I would recommend going in the _opposite_ direction. Instead
>> of trying to find the right low-level code to collect header stats (and
>> then dealing with problems such as cloned headers screwing up all
>> stats), I would add dedicated statistics-collection calls in a few
>> specific places in the high-level code, where new headers are "accepted
>> into Squid" and should be counted.
>>
>> Neither approach will cover 100% of cases, but I think the high-level
>> approach will have a lot fewer problems overall. As an added bonus, it
>> would be easier to disable all or some header statistics collection that
>> way.


> Agreed. The parser is the only thing that should be doing stats
> accounting for "seen" (aka received) headers. And it can do that at the
> same point it adds the freshly parsed value to the HttpHeader object state.


The parser is an example of low-level code that I would not associate
with statistics collection _unless_ your goal is to actually collect
low-level statistics about parsing. AFAIK, the original goal of header
statistics collection was to learn "what internet traffic looks like".
What we parse is different (e.g., we may parse the same HTTP message
header more than once).
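
To illustrate the high-level approach (noteAcceptedHeader(), HeaderStats,
and the id range are all assumed names, not existing Squid code):

    #include <array>

    constexpr int KnownIds = 64;
    struct HeaderStats { std::array<unsigned long, KnownIds> seen{}; };

    // Called once, from the few places where a freshly received message
    // is "accepted into Squid"; cloning or re-parsing a header elsewhere
    // cannot inflate these counts.
    void noteAcceptedHeader(HeaderStats &stats, int headerId)
    {
        ++stats.seen.at(headerId);
    }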


HTH,

Alex.


