[squid-dev] [RFC] Do we want paranoid_hit_validation?

Alex Rousskov rousskov at measurement-factory.com
Tue Jan 8 03:58:22 UTC 2019


Hello,

    Squid has a few bugs that may result in rock cache corruption.
Factory is working on fixing those bugs. During that work, we have added
support for validating rock disk cache entry metadata at the time of a
cache hit.

This particular validation does not require checksums or other expensive
computations. It does not require disk I/O. The code simply traverses
the chain of disk slot metadata for the entry and compares the sum of
individual slot sizes with the expected total cache entry size. The
validation is able to detect many (but not all) cases of cache index
corruption.

A sketch of the configuration directive to control this feature is quoted
further below. The initial (incomplete/unpolished but "working")
implementation is at
https://github.com/measurement-factory/squid/commit/c884c6d775f316e0c2962472fd9f9e7a7f86ff32


Should we add a polished version of this feature to Squid?

* Pros: Can detect and _bypass_ many cache index corruption cases.

* Cons: Requires quite a few extra CPU cycles for larger objects (they
have longer slot lists) while being useless for correctly working code
because such code never corrupts its index. It is essentially a
triage/troubleshooting feature.

The validation cost can be reduced by limiting the slot-scanning loop to
a few iterations by default (as opposed to just turning the feature off
by default) -- in most environments, most hits are small. However, the
sooner we interrupt the loop, the fewer corruption cases we can detect
(the probability of corruption may increase with the chain length).

It can be argued that since this kind of validation can be implemented
as a stand-alone tool (that scans shared memory indexes of a running
Squid), it should not be accepted into Squid [runtime code]. The counter
argument here is that such "external" metadata scans can be very lengthy
for large caches and are, hence, likely to miss fresh corruption cases
(that may have a higher probability of being accessed again!). The
built-in code has the advantage of being executed for disk cache hits
only and being able to convert corrupted hits into benign cache misses.

What do you think?


Thank you,

Alex.

------------- cf.data.pre ----------------
NAME: paranoid_hit_validation
DEFAULT: off
LOC: Config.onoff.paranoid_hit_validation
DOC_START
       Controls whether Squid should perform paranoid validation of
       cache entry metadata integrity every time a cache entry is hit.
       Squid bugs notwithstanding, this low-level validation should
       always succeed. Each failed validation results in a cache miss,
       a BUG line reported to cache.log, and the invalid entry being
       marked as unusable (and eventually purged from the cache).
DOC_END

