[squid-users] Question on developing customized Cache Selection algorithm from Round Robin, Least Load
squid3 at treenet.co.nz
Tue Aug 18 10:24:45 UTC 2015
On 18/08/2015 5:42 a.m., Du, Hongfei wrote:
> We are attempting to extend the Squid cache selection algorithm to
> something more sophisticated, say by adding WRR or WFQ, and have a few
> questions.
Like Eliezer said this is really a question for squid-dev mailing list
where the developers hang out.
WRR (weighted round-robin) is already implemented; it is exactly how
Squid cache_dir selection currently operates. The weighting is based on
each storage area's available size and I/O load.
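For a rough feel of what that weighting means in practice, here is a
minimal C++ sketch. The names and scoring formula are illustrative
assumptions, not Squid's real API; it just picks the area with the best
free-space to I/O-load ratio:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a cache_dir: free space and an I/O load figure.
struct CacheDir {
    std::string path;
    long freeKB;   // available storage, in KB
    int  ioLoad;   // higher = busier (illustrative scale)
};

// Pick the dir with the best free-space/load ratio (a stand-in for the
// real least-load weighting, which considers more factors).
const CacheDir *selectDir(const std::vector<CacheDir> &dirs) {
    const CacheDir *best = nullptr;
    double bestScore = -1.0;
    for (const auto &d : dirs) {
        double score = static_cast<double>(d.freeKB) / (d.ioLoad + 1);
        if (score > bestScore) {
            bestScore = score;
            best = &d;
        }
    }
    return best;
}
```

With two equally loaded dirs, the one with more free space wins; with
equal free space, the less loaded one wins.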
WFQ (weighted fair queueing) is a queueing algorithm, as the 'Q' says.
Caching != queueing. In fact, a cache is so different from a queue that
WFQ would badly hurt performance if it were used to decide which storage
area an object went into.
In essence, the problem is that we cannot dictate what objects will be
requested by clients. They want what they ask for. Squid's duty is 1) to
answer reliably and 2) as fast as possible, regardless of an object's
location.
> - As we probably have to write a new algorithm and recompile, does
> anyone know where (or in which file) the existing Round Robin or
> Least Load algorithm is defined in the source code?
That depends on whether you mean the algorithm applied to choose between
local storage vs network sources, or the one(s) applied within
individual caches.
> - Is there a straightforward method to tell/instruct Squid to store
> content from the network (e.g. a URL) in a predefined specific disk
> folder rather than using the selection algorithm itself?
The URL and all other relevant details from the transaction are hashed
to look up an index entry yielding the 32-bit 'sfileno' value, a UID
encoding the location of the indexed object in Squid's local storage.
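As a hedged illustration of that lookup, here is a toy sketch. Every
name here is hypothetical, and real Squid hashes many more transaction
details (e.g. Vary-related headers) than just method and URL:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative stand-in for the 32-bit sfileno location UID.
using sfileno_t = int32_t;

// Toy index: hash of request details -> sfileno.
struct CacheIndex {
    std::unordered_map<std::size_t, sfileno_t> byKey;

    static std::size_t key(const std::string &method, const std::string &url) {
        // Real Squid mixes in more "relevant details" than this.
        return std::hash<std::string>{}(method + " " + url);
    }

    // Returns the sfileno, or -1 for a cache miss.
    sfileno_t lookup(const std::string &method, const std::string &url) const {
        auto it = byKey.find(key(method, url));
        return it == byKey.end() ? -1 : it->second;
    }
};
```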
It _sounds_ simple enough, but those "other relevant details" are a
massive complication. One single URL can potentially map to all possible
objects that ever have or ever will exist on the Internet. Even the idea
of storing things one file per URL dies a horrible death when it
encounters popular modern websites.
Within Squid we refer to "the HTTP cache" as a single thing, but it is
constructed of many storage areas: the individual cache_dirs and other
places where HTTP objects might be found. Remote network sources are
also accounted for.
Algorithms are applied in layers to decide which type of storage area
to use, then which area within the selected type is most appropriate,
based on object availability, cacheability, size, storage area speed,
object popularity, and temporal relationships to other objects. Then an
sfileno is assigned if it is local storage.
Then objects get moved between storage areas anyway based on need and
popularity, and objects get removed from individual storage areas based
on lack of popularity. Both of these affect future requests for them.
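A toy C++ sketch of that layered decision, under assumed thresholds and
illustrative names (real Squid's criteria are far richer): layer 1 picks
the storage type, layer 2 the least-loaded area of that type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical storage types; real Squid also knows remote sources.
enum class StoreType { Memory, Disk };

struct StoreArea {
    std::string name;
    StoreType type;
    int load;       // lower = better
    long maxObjKB;  // largest object this area accepts
};

// Layer 1: not cacheable -> stored nowhere; small objects -> memory,
// otherwise disk (64 KB threshold is an arbitrary assumption).
// Layer 2: least-loaded area of the chosen type that fits the object.
const StoreArea *selectArea(const std::vector<StoreArea> &areas,
                            long objKB, bool cacheable) {
    if (!cacheable)
        return nullptr;
    StoreType want = (objKB <= 64) ? StoreType::Memory : StoreType::Disk;
    const StoreArea *best = nullptr;
    for (const auto &a : areas) {
        if (a.type != want || objKB > a.maxObjKB)
            continue;  // availability/size filter
        if (!best || a.load < best->load)
            best = &a;
    }
    return best;
}
```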
So the particulars of what you want to do matter, a lot.
FWIW, we have known outstanding needs for:
* updated cache_peer selection algorithms. Current Squid outgoing TCP
connection failover works with a list of IPs that get tried until one
succeeds. The old selection algorithms produce only a single IP rather
than a preference-ordered set of peers to try.
- also, none of the algorithms provide byte-based loading.
* ETag-based cache index, for better-performing If-Match/If-None-Match
handling.
* 206 partial-object caching. Rock can store them, but no algorithms yet
exist to properly manage the pieces of incomplete objects or aggregate
them across different transactions.
* per-area storage indexes, instead of a Big Global Index. Working
towards 64-bit sfileno values, which are needed for some TB-sized
caches. The Rock and Transients storage areas are done, but other caches
are still TODO.
* better HDD load detection, to inform the weighting of cache_dir
selection algorithms. This is a hardware-driver-related project.
* Support for ZFS and XFS dynamic inode sizing. The lack of it causes
lots of issues with "wrong" disk storage under/over-usage. Another
hardware-driver-related project.
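To illustrate the first item above (a preference-ordered set of peers
rather than a single "best" IP), here is a hypothetical sketch; it ranks
live peers by RTT purely as an example criterion, so a caller can try
each candidate in turn until one connection succeeds:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative peer record; real cache_peer state tracks much more.
struct Peer {
    std::string host;
    int rtt;     // lower is preferred (example metric only)
    bool alive;
};

// Build the preference-ordered candidate list: drop dead peers, then
// sort the remainder by preference.
std::vector<std::string> orderedCandidates(std::vector<Peer> peers) {
    peers.erase(std::remove_if(peers.begin(), peers.end(),
                               [](const Peer &p) { return !p.alive; }),
                peers.end());
    std::sort(peers.begin(), peers.end(),
              [](const Peer &a, const Peer &b) { return a.rtt < b.rtt; });
    std::vector<std::string> out;
    for (const auto &p : peers)
        out.push_back(p.host);
    return out;
}
```

The caller would then attempt connections in list order, giving failover
across peers instead of only across the IPs of one chosen peer.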