[squid-dev] The next step towards: StoreID and metalink.

Mon Sep 21 21:57:24 UTC 2015

On 22/09/2015 9:40 a.m., Eliezer Croitoru wrote:
> Alex, Amos,
> 
> As a first step, I am trying to grasp what could be done in the
> current state of squid and ECAP without any changes.
> Currently squid provides the methods to modify a request and/or response.
> 
> Basic StoreID support in a REQMOD should be possible in any case, to
> allow an ECAP module to do those things; also, since it has more
> information about the request details, it can decide better than the
> current StoreID helpers.
> Leaving StoreID aside and going back to hashes:
> Currently the ECAP module can calculate hashes on the fly and then at
> the end of the transaction write the result to either a log file or a DB.
> For now the benefit I see from this is the option to find duplicate
> content based on the hash.
> For example: running a db lookup for similar hashes or something like
> sort by hash.
> pseudo:
> iterate on the hashes and urls
> if the hash exists, add the url to its array
> else create a new hash to url array mapping.
> then list hashes with more than one url and get statistics for that.
> * the statistics can be based on an access.log object/download size lookup
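
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: it assumes the eCAP module's log has already been parsed into (hash, url) pairs, and the function names group_urls_by_hash and duplicates are made up for the example, not part of Squid or eCAP.

```python
from collections import defaultdict


def group_urls_by_hash(records):
    """Build a hash -> [urls] mapping from (content_hash, url) pairs.

    The pair format is an assumption; a real tool would first parse
    whatever the eCAP module wrote to its log file or DB.
    """
    urls_by_hash = defaultdict(list)
    for content_hash, url in records:
        # Existing hash: add the url to its array (once).
        # New hash: defaultdict creates the array on first access.
        if url not in urls_by_hash[content_hash]:
            urls_by_hash[content_hash].append(url)
    return dict(urls_by_hash)


def duplicates(urls_by_hash):
    """Keep only the hashes that were seen under more than one url."""
    return {h: urls for h, urls in urls_by_hash.items() if len(urls) > 1}


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("aaa111", "http://mirror1.example.org/file.iso"),
        ("aaa111", "http://mirror2.example.org/file.iso"),
        ("bbb222", "http://mirror1.example.org/other.iso"),
    ]
    for content_hash, urls in duplicates(group_urls_by_hash(sample)).items():
        print(content_hash, len(urls), urls)
```

The per-hash url lists could then be joined against access.log sizes to estimate how much traffic the duplicates account for.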
> 
> It would require some relational DB or some other way to store it in a
> K\V DB.
> 
> I think that in the current state of ECAP I can only build a statistics
> tool based on the ECAP module.
> 
> The actual cases which might benefit from a cache lookup would be
> metalinks.
> And a "If Modified Digest" might also benefit from it.
> 
> There is another way to de-duplicate content for metalinks using cache
> objects planting/redirection.
> The procedure would involve a setup which will allow for example a 302
> redirection planting.
> I will describe it more in depth:
> A response from a trusted source is found with metalink sources.
> Once the hash is validated, an object "insertion" or
> "implanting" stage will start.
> In this stage each and every one of the link urls will be planted in the
> DB with a 302 redirect url.
> (it can be inserted into squid cached objects or using an external DB)
> The result will be:
> If someone tries to contact a specific URL which is in the DB, a 302
> redirection will be issued towards the already cached and hashed url.
> It's not 100% foolproof unless there is knowledge about the cache
> internals but as Amos suggested in the past, the store.log might be
> enough to make it possible to track cache removals and insertions.
> 
> Which of the ideas is a more realistic one compared to changing squid
> and/or ECAP?

Using 302 kind of defeats the purpose of metalinks: that the content can
be fetched from alternative URLs if one breaks in any way (by the
client, or by Squid on revalidation retries).

Using a StoreID helper to re-write the IDs to the same cached content is
more in line with the metalinks model, in that Squid and/or the client can
revalidate the cached object from whatever URI the client is fetching,
and that updates all the other URIs which StoreID maps to that sfileno ID.
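
A minimal sketch of such a helper, for PoC purposes only. It follows the StoreID helper line protocol (one request per line on stdin, answered with "OK store-id=..." or "ERR", with an optional leading channel-ID when concurrency is enabled). The mirror URLs, the store ID and the names answer() and serve() are hypothetical; a real PoC would fill the mapping from validated metalink data.

```python
import sys

# Hypothetical mapping from mirror URLs to one shared store ID.
# In a metalink PoC this would come from validated metalink data.
MIRROR_TO_STORE_ID = {
    "http://mirror1.example.org/file.iso": "http://file.iso.metalink.squid.internal/",
    "http://mirror2.example.org/file.iso": "http://file.iso.metalink.squid.internal/",
}


def answer(line, mapping=MIRROR_TO_STORE_ID):
    """Turn one helper request line into one reply line (without newline)."""
    parts = line.split()
    if not parts:
        return "ERR"
    prefix = ""
    # With helper concurrency enabled, the line starts with a numeric
    # channel-ID that has to be echoed back at the start of the reply.
    if parts[0].isdigit() and len(parts) > 1:
        prefix = parts[0] + " "
        parts = parts[1:]
    store_id = mapping.get(parts[0])
    if store_id is None:
        # ERR leaves the URL's normal store ID untouched.
        return prefix + "ERR"
    return prefix + "OK store-id=" + store_id


def serve(stdin, stdout):
    """Answer requests line by line, flushing after each reply."""
    for line in stdin:
        stdout.write(answer(line) + "\n")
        stdout.flush()
```

Calling serve(sys.stdin, sys.stdout) at the bottom of the script and pointing store_id_program at it should be enough to try this out; both mirror URLs then resolve to the same cached object.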

Note that all this use of eCAP and StoreID is for PoC testing of how
metalinks works. Long-term these actions should all be an internal
feature of Squid. If we find that Squid needs to be altered to make the
PoC work then we are probably better off just patching the related part of
metalinks operation in directly, or re-designing the PoC.

Amos