[squid-users] What would be the maximum ufs\aufs cache_dir objects?
Amos Jeffries
squid3 at treenet.co.nz
Wed Jul 19 13:38:27 UTC 2017
On 18/07/17 05:34, Eliezer Croitoru wrote:
> So basically, from what I understand, the limit of the AUFS\UFS cache_dir is:
> 16,777,215 Objects.
> So for a very loaded system it might be pretty "small".
>
> I have asked because:
> I have seen the mongodb ecap adapter that stores chunks and I didn't like it.
> On the other hand, I wrote a cache_dir in GoLang which I am using for the Windows Updates caching proxy, and for now it's surpassing the AUFS\UFS limits.
>
> Based on the success of the Windows Updates Cache proxy, which strives to cache only public objects, I was thinking about writing something similar for more general usage.
> The basic constraint on what would be cached is: only if the object has Cache-Control "public".
You would end up with only a small subset of HTTP ever being cached.
CC:public's main reason for existence is to re-enable cacheability of
responses to requests that carry security credentials - which is
prevented by default as a security fail-safe.
I know a fair number of servers still send it when they should not. But
that is declining as content gets absorbed by CDNs, which take more care
with their bandwidth expenditure.
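For example, a shared cache normally refuses to store the response to a
request that carried an Authorization header; response headers like
these (values purely illustrative) explicitly opt it back in:

    HTTP/1.1 200 OK
    Cache-Control: public, max-age=300
    Content-Type: application/octet-stream

Without "public" (or s-maxage / must-revalidate) a shared cache must not
store that response at all.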
> The first step would be an ICAP service (respmod) which will log requests and responses and will decide what GET results are worthy of a later fetch.
> Squid currently does things on-the-fly, while the transaction is being fetched by the client.
What things are you speaking about here?
How do you define "later"? Is that 1 nanosecond or 64 years?
And what makes a 1 nanosecond difference in request timing for a 6GB
object any less costly than 1 second?
Most of what Squid does, and the timing of it, has good reasons behind
it. I'm not saying change is bad, but to make real improvements, instead
of re-inventing some long-lost wheel design, one has to know those
reasons to avoid them becoming problems.
e.g. the often-laughed-at square wheel is a real and useful design for
some circumstances. And its lesser brethren, cogwheels and the like, are
an age-proven design in rail history for places where roundness
actively inhibits movement.
> For an effective cache I believe we can compromise on another approach, which relies on statistics.
> The first rule is: Not everything is worth caching!!!
> Then after understanding and configuring this we can move on to fetch *Public*-only objects when they see a high number of repeated downloads.
> This is actually how Google's cache and other similar cache systems work.
> They first let traffic reach the "DB" or "DATASTORE" if it's the first time it is seen.
FYI: that is the model Squid is trying to move away from - because it
slows down traffic processing. As far as I'm aware G has a farm of
servers to throw at any task - unlike most sysadmins trying to stand up
a cache.
> Then, after more than a specific threshold, the object is fetched by the cache system without any connection to the transaction which the clients consume.
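If I understand that proposal correctly, it boils down to a hit-count
admission policy along these lines (a rough Go sketch of my reading of
it; the names, threshold and plain http.Get are mine for illustration,
not anything Squid or your adapter actually does):

  // cachepolicy: count GET requests per URL and, once a URL crosses a
  // threshold, fetch it in the background, completely decoupled from
  // any client transaction.
  package cachepolicy

  import (
      "net/http"
      "sync"
  )

  type Admission struct {
      mu        sync.Mutex
      hits      map[string]int
      Threshold int
  }

  func NewAdmission(threshold int) *Admission {
      return &Admission{hits: make(map[string]int), Threshold: threshold}
  }

  // Record is called (e.g. from the ICAP respmod logging side) for every
  // cacheable GET. It returns true exactly once per URL - the moment the
  // hit count reaches the threshold - and then starts a background fetch.
  func (a *Admission) Record(url string) bool {
      a.mu.Lock()
      a.hits[url]++
      trigger := a.hits[url] == a.Threshold
      a.mu.Unlock()

      if trigger {
          go func() {
              resp, err := http.Get(url)
              if err != nil {
                  return
              }
              defer resp.Body.Close()
              // ... stream resp.Body into the store here ...
          }()
      }
      return trigger
  }

Note that the background fetch at the end has no client attached to pace
or cancel it. Keep that in mind for what follows.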
Introducing the slow-loris attack.
It has several variants:
1) client sends a request, very, very, ... very slowly. Many thousands
of bots all do this at once, or build up over time.
-> an unwary server gets crushed under the weight of open TCP
sockets, and its normal clients get pushed out into DoS.
2) client sends a request, then ACKs delivery very, very, ... very slowly.
-> an unwary server gets crushed under the weight of open TCP
sockets, and its normal clients get pushed out into DoS. AND suffers for
each byte of bandwidth it spent fetching content for that client.
3) both of the above.
The slower a server is at detecting this attack the more damage can be
done. This is magnified by whatever amount of resource expenditure the
server goes to before detection can kick in - RAM, disk I/O, CPU time,
TCP sockets, and, most relevant here, upstream bandwidth.
Also, a Loris attacker and clients on old tech like 6K modems or worse
are indistinguishable.
To help resolve this problem Squid does the _opposite_ of what you
propose above. It makes the client delivery and the server fetch align,
to avoid mistakes in detecting these attacks that would disconnect
legitimate clients.
The read_ahead_gap directive configures the threshold amount of server
fetch which can be done at full server-connection speed before slowing
down to the client's speed. The various I/O timeouts can be tuned to
what a sysadmin knows about their clients' expected I/O capabilities.
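For example (the values here are purely illustrative, not
recommendations for any particular setup):

    # let Squid read at most 64 KB of server data ahead of the client
    read_ahead_gap 64 KB
    # give up on a server that stops sending data for 5 minutes
    read_timeout 5 minutes
    # give up on a client that takes over 2 minutes to send its request
    request_timeout 2 minutes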
> It might not be the most effective caching "method" for specific very loaded systems or specific big files and *very* high-cost upstream connections, but for many it will be fine.
> And the actual logic and implementation can be any of a couple of algorithms, with LRU as the default and a couple of others as options.
>
> I believe that this logic will be good for specific systems and will remove all sorts of weird store\cache_dir limitations.
Which weird limitations are you referring to?
The limits you started this thread about are caused directly by the size
of a specific integer representation and the mathematical properties
inherent in a hashing algorithm.
Those types of limit can be eliminated or changed in the relevant code
without redesigning how HTTP protocol caching behaves.
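As a sanity check on the number itself: 16,777,215 is exactly 2^24 - 1,
i.e. the ceiling a 24-bit file-number field would give you. Make that
field wider and the ceiling moves; HTTP caching behaviour is untouched.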
Amos