[squid-dev] Planning for experiments with Digest, Link and metalink files.

Amos Jeffries squid3 at treenet.co.nz
Thu Jan 28 11:56:28 UTC 2016


On 28/01/2016 9:18 p.m., Eliezer Croitoru wrote:
> I have been working for some time on experiments related to metalink
> files, and I have a couple of things in mind about them.
> I have now been running my local proxy service for more than 4 months
> with live SHA256 digesting of all traffic, using an eCAP module.
> It's a small service and cannot take much load, but with digesting
> enabled or disabled Squid does not lose enough speed for me to care.
> 

The other one to test is AES-based hashing. IIRC, that should be faster
than SHA256 (but slower than MD5).
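As a point of reference, the incremental per-chunk digesting Eliezer
describes can be sketched in Python with hashlib (the eCAP module itself
would be C++; `digest_stream` and the chunk values here are illustrative
only):

```python
import hashlib

def digest_stream(chunks, algorithm="sha256"):
    """Incrementally hash body chunks as they pass through the proxy."""
    h = hashlib.new(algorithm)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# Hashing a body delivered in pieces gives the same result as hashing
# the whole body at once.
body_chunks = [b"hello ", b"world"]
print(digest_stream(body_chunks))
```

The same update-as-you-go pattern works for any hashlib algorithm, so
swapping SHA256 for something else is a one-string change.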


> Now I want to move on to another side of my experiment and implement a
> web service that uses the metalink file data.
> Since metalinks are for files or static objects, the first thing is a
> simple file server.
> 
> I do not know which header to use for the digest hashes. I have seen
> that the Fedora mirror server uses "Digest: TYPE=hash", but from what I
> remember Henrik was talking about Content-MD5 and similar.
> I can implement both the Content-MD5 style headers and Digest, but
> which standards are up to date?
> 

AFAIK, the header should be defined in the metalink RFC document
(whichever that is). Content-MD5 is related, but when I looked at its
specification it was locked to the MD5 algorithm and not extensible to
better hashes.
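The "Digest: TYPE=hash" form Eliezer saw on the Fedora mirrors matches
the RFC 3230 instance-digest syntax (base64 of the raw digest), which is
also what RFC 6249 points at. A minimal sketch assuming that format;
`digest_header` is a hypothetical helper, not an existing API:

```python
import base64
import hashlib

def digest_header(body, algorithm="sha-256"):
    """Build an RFC 3230-style instance-digest value: 'SHA-256=<base64>'."""
    h = hashlib.new(algorithm.replace("-", ""))  # "sha-256" -> "sha256"
    h.update(body)
    return "%s=%s" % (algorithm.upper(),
                      base64.b64encode(h.digest()).decode("ascii"))

print(digest_header(b""))  # SHA-256=47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
```

Note the value is base64 of the raw digest bytes, not the hex string
most command-line tools print.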


> Also the other question: how should If-None-Match or another
> conditional header be used (since it was meant for ETag)?

Metalink does not use conditional headers at all.
<https://tools.ietf.org/html/rfc6249#section-6> defines either a GET
request with Range headers or a HEAD request as the way to fetch the
Link and Digest header details.

The idea is to fetch the headers from each mirror and compare the Digest
value and/or ETag to ensure it is providing the right object,
accumulating any new Link values into the possible set as things
progress. If the Digest does not match, those headers are probably
"broken" and another mirror needs to be tried.

The Metalink use-case is resuming a broken fetch, or splitting the
download of a large object across multiple mirrors in an attempt to
boost D/L speed with parallelisation - but we don't like that second one
because Squid still lacks Range/partial caching of the resulting parts.
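The mirror-checking loop described above might be sketched like this;
the URLs, header values, and the `select_mirrors` helper are all
hypothetical:

```python
def select_mirrors(expected_digest, mirror_headers):
    """Keep mirrors whose HEAD response Digest matches; collect new Links.

    mirror_headers maps mirror URL -> headers from a HEAD (or ranged GET)
    response per RFC 6249 section 6. A mismatching Digest marks the
    mirror as probably broken, so it is skipped.
    """
    good, links = [], set()
    for url, headers in mirror_headers.items():
        if headers.get("Digest") == expected_digest:
            good.append(url)
            links.update(headers.get("Link", []))
    return good, links

# Made-up example: mirror-b advertises the wrong digest.
mirrors = {
    "http://mirror-a.example/f.iso": {
        "Digest": "SHA-256=abc...",
        "Link": ["<http://mirror-c.example/f.iso>; rel=duplicate"],
    },
    "http://mirror-b.example/f.iso": {"Digest": "SHA-256=xyz...", "Link": []},
}
good, links = select_mirrors("SHA-256=abc...", mirrors)
print(good)  # only mirror-a survives; its Link gives us a new candidate
```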


There is an "If:" header in WebDAV that seems to provide the necessary
semantics for passing the hashes in conditional requests. But the text
is not very clear on what field-value format to use, or how to interpret
the lists it contains.
<http://tools.ietf.org/html/rfc4918#section-10.4>


> 
> For now I am only working on a full-hash match, leaving aside any
> partial-content piece matching.
> 
> The options to validate a file hash against a static web service are:
> - via some POST request (which can return a 304)
> - via a special If-None-Match header
> 
> Also I have in mind a situation where the client has a couple of
> hashes and wants to verify a full match against the whole set, or
> against at least one from the set. What do you think should be done?
> 
> The current implementations rely on the fact that everybody uses the
> same default hash, but I am almost sure that somewhere in the future
> people will want to run a Digest match with some kind of salted
> algorithm, so I am considering what would be the right way to
> allow/implement support for such cases.
> Am I dreaming too much?
> 
> Another issue is PUT requests: does it make sense to attach some
> Digest or other headers from the user, to be stored with the file
> metadata or the metalink file - compared to uploading two files, one
> for the object and another for the metalink file?

We want to eventually be storing the payload/body of the PUT, and
updating its URL / Store-ID entry based on the Content-Location header
that comes back in the server's response.

In light of that, hashing that body makes sense. The problem is just
that Squid does not currently implement that storage idea at all, so
nothing is stored until a later GET pulls the object back into the
proxy.
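For what it's worth, that storage idea could be sketched as follows.
This is not Squid code - Squid does not implement it - and every name
here is hypothetical:

```python
import hashlib

class PutCache:
    """Sketch: cache a PUT body, then re-key the entry by the
    Content-Location header in the server's response."""

    def __init__(self):
        self.store = {}  # Store-ID (URL) -> (body, sha256 hex)

    def on_put(self, request_url, body):
        self.store[request_url] = (body, hashlib.sha256(body).hexdigest())

    def on_response(self, request_url, response_headers):
        loc = response_headers.get("Content-Location")
        if loc and request_url in self.store:
            # Re-key the cached entry under the server-provided location.
            self.store[loc] = self.store.pop(request_url)

cache = PutCache()
cache.on_put("http://origin.example/upload", b"payload")
cache.on_response("http://origin.example/upload",
                  {"Content-Location": "http://origin.example/files/42"})
print(sorted(cache.store))  # ['http://origin.example/files/42']
```

Hashing at on_put time would give us the Digest value for free when a
later GET arrives.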

HTH
Amos


