[squid-users] What would be the maximum ufs\aufs cache_dir objects?

Eliezer Croitoru eliezer at ngtech.co.il
Sat Jul 22 21:17:27 UTC 2017


OK so time for a response!

I want to first describe the different "cache" models, which are similar to real-world scenarios.
The first and most basic example is the small local "store", which supplies food or basic things you need, like work tools such as a screwdriver and many other small items for basic house maintenance.
In this type of store you have "on-demand" ordering, which is split into "fast" or "slow" supply times.
Compared to these there is the "warehouse", or some big place which in some cases just supplies what it "has" or "sells", like IKEA or similar brands.
And of course, in the world of workshops everything is "on-demand" and almost nothing is on the shelf; while they supply their services to almost anyone, they also keep basic standard materials in stock that can be used for any order.

In the world of proxies we have a "storage" system, but it's not 100% similar to any of these real-world "storage" scenarios.
For this reason it's hard to just pick a specific model like "LRU", which local food stores use most of the time, but with a "pre-fetch" flavour.

For a cache admin there are a couple of things to think about before implementing any cache:
- purpose
- available resources (bandwidth, storage space, etc.)
- funding for the system\project

So, for example, some admins blindly try to force caching on their clients while harming both themselves and their clients (which is what Squid 3.3+ fixed).
They just don't get that the only time you should cache is when you need it, not when you want it.
If you have a limited amount of bandwidth and your clients know it but still blindly "steal" the whole line from others, the real way to enforce a bandwidth policy is not a cache but a QoS system.
There are solutions which can help admins give their clients the best Internet experience.
I know that on some ships, for example, which have expensive satellite Internet links, you pay per MB; Windows 10 update downloads are out of the question, and Microsoft sites and updates should be blocked by default and only allowed for specific major bug-fix cases.
For places which have lots of users but limited bandwidth, a cache might not be the right solution for every scenario, and you (the admin) need a bandwidth policy rather than a cache.
A cache is something I would call a "luxury"; it is only an enhancement of a network.
On today's Internet there is so much content out there that we actually need to "limit" our usage and consumption to something reasonable compared to our environment.
With all my desire to watch some YouTube video in 720p or 1080p HD, it's not the right choice if someone else on my network needs the Internet link for medical-related things.

With all the above in mind, I believe that the Squid way of doing things is good and fits most of the harshest environments, and 3.5 does a good job of restricting the admin from caching what is dangerous to cache.
This is one of the things I consider Stable about Squid!

And now, regarding the statement "cache only public": it divides into two types:
- Everything that is worth caching, is permitted by law and will not do harm
- Caching only what is required to be cached

For example: why should I cache a 13KB object if I have a 10Gbit line to the WAN? (At 10Gbit/s a 13KB object takes on the order of ten microseconds on the wire, quite possibly less than the disk I/O needed to serve it from cache.)

From my experience with caching there is no "general" solution, and the job of a cache admin is a task that can take time: tuning the system for the right purpose.
For example, if Squid caches every new object, there is a chance that the clients' way of using the Internet will fill up the disk and then start a never-ending cycle of cache cleanup and population, a chicken-and-egg situation where the cache never serves even one hit because the admin tried to "catch all".

So "cache public objects" might mean something more than "CC =~ public": the successor of that tiny technical test should become a "smarter" one, which defines "public" as what really should be cached, i.e. objects that have a chance of being re-fetched and re-downloaded more than once and that will not cause the cache to just cycle through write+cleanup over and over again.
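
To make this concrete, here is a minimal sketch in Go (the language I use for my other tools) of the kind of "smarter public" test I have in mind. The function name and the minRepeats threshold are made up for illustration; a real service would feed the repeat count from log analysis:

package main

// worthCaching is a sketch of a "smarter public" test: an object is stored
// only if the origin allows it (Cache-Control permits sharing) AND it has
// already been requested more than once, so the disk write has a real
// chance of paying off with a future HIT.

import (
	"fmt"
	"net/http"
	"strings"
)

func worthCaching(resp *http.Response, seenCount, minRepeats int) bool {
	cc := strings.ToLower(resp.Header.Get("Cache-Control"))

	// Hard "do not store" signals always win.
	if strings.Contains(cc, "no-store") || strings.Contains(cc, "private") {
		return false
	}

	// "public" alone is not enough: only store objects that statistics
	// show are actually requested repeatedly.
	return strings.Contains(cc, "public") && seenCount >= minRepeats
}

func main() {
	resp := &http.Response{Header: http.Header{}}
	resp.Header.Set("Cache-Control", "public, max-age=3600")
	fmt.Println(worthCaching(resp, 1, 3)) // false: seen only once so far
	fmt.Println(worthCaching(resp, 5, 3)) // true: repeated often enough
}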

Squid currently does "things" (analyses requests and responses and populates the cache) on-the-fly and gives the users a very "active" part in the population of the cache.
So the admin has the means to control the cache server and how much influence the users have, but most admins I know tend to just spin up a cache instance, maybe google for a couple of useless "refresh_pattern" rules and then use them, causing this endless store...cleanup loop.

Squid is a great product, but the combination of:
- Greed
- Lust
- Pressure 
- Budget
- Lack of experience
- Laziness

leads cache systems around the world to be less effective than they would have been with a bit more effort to understand the subject practically.

You asked about "later", and the definition is admin-dependent.
Of course, for costly links such as satellite, "later" might not be cheaper... but it can be effective.
Depending on the scenario, "later" may be the right way of doing things, while in many other cases it may not be, and the cache admin needs to do some digging to understand what he is doing.
A cache is actually a shelf product, and if you need something that works "in-line", i.e. both transparent, "on-the-fly" and "user-driven", it might be a good idea to pay somebody who can deliver results.
As was mentioned here (on the list) in the past, when you weigh the hours of a sysadmin against a ready-to-use product, there are scenarios in which a working product is the better choice from all the aspects mentioned above in this email.

I am pretty sure I understand why Squid's timing for the download works as it does... (I work at a big ISP after all, one of the top 10 in the whole area.)
But I want to make clear that I don't want to reinvent the wheel, only to address some specific cases which I already handle.
For example, the MS updates caching proxy will not work for domains other than those of MS updates.
Also, I just want to mention that MS updates have a very remarkable way of making sure the client will receive the file and of ensuring its integrity; MS deserve big respect for their way of implementing the CIA principles!!!
(Despite the fact that many describe why and how much they dislike MS.)

Indeed, G has more than one server farm helping with "harvesting, rendering, analyzing, categorizing etc.", which many don't have, and I claim that for specific targeted things I can offer free solutions that might help many ISPs which are already using Squid.
Also, I believe that video tutorials from a Squid developer might help cache admins understand how to tune their systems without creating the "cycle" I mentioned earlier.
(I do intend to release a couple of tutorials, and I would like recommendations for key points that need to be covered.)

About the mentioned cons\attacks that the server would be vulnerable to...
The service I am offering works alongside Squid as an ICAP service, and it will not download just any request over and over again.
Also, it's good you mentioned these specific attack patterns, because the solution should eventually be integrated with Squid log analysis to find out how many unique requests have been made for a specific url\object, and that will help mitigate some of these attacks.
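
As a rough illustration of that log analysis, here is a small Go sketch that reads a Squid access.log in the default native format (third field is the client address, seventh is the URL) and counts how many distinct clients requested each URL. The log path and the threshold of 3 clients are just example values:

package main

// Counts, per URL, how many distinct client addresses requested it.
// Assumes the default native access.log format: field 3 is the client
// address and field 7 is the URL (adjust if you use a custom logformat).

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/var/log/squid/access.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	uniq := make(map[string]map[string]struct{}) // URL -> set of clients

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // some URLs are very long
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 7 {
			continue
		}
		client, url := fields[2], fields[6]
		if uniq[url] == nil {
			uniq[url] = make(map[string]struct{})
		}
		uniq[url][client] = struct{}{}
	}

	// An object requested by several distinct clients is a fetch candidate;
	// one client hammering a single URL is not (and may even be an attack).
	for url, clients := range uniq {
		if len(clients) >= 3 {
			fmt.Printf("%d\t%s\n", len(clients), url)
		}
	}
}

An object that only ever shows up from one source address never gets queued for fetching, which is exactly the property that blunts the request-flood variants of the attacks you described.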

I do like the read_ahead_gap directive and I liked the concept, but eventually we are talking about a couple of things:
- Understand the scenario and the requirements from the cache
- Limit the cache "Scope"
- Allow an object to be fetched only once, based on statistics.
- Allow Squid to cache what it can, while the ICAP service acts as an "addon" for Squid, helping the admin with specific scenarios like MS updates, YouTube, Vimeo and a couple of other sites of interest.

Currently Squid cannot use an AUFS\UFS cache_dir with SMP, and the cache store system I wrote utilizes the FS and has the option to choose a hashing algorithm other than MD5, such as SHA-256\512.
I believe it's time we started to think about more than MD5, i.e. SHA-256, and maybe make it configurable, as we talked about a year or more ago. (I cannot do this myself and I do not have a funder for it.)
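
Just to show the idea (this is not Squid code, the example URL is made up, and it ignores the real work of changing the on-disk index format), here is a small Go sketch of a store-key hasher where the algorithm is selectable at run time:

package main

// Sketch of a configurable cache-key hasher: the same store key derived
// with MD5 (the current choice) or SHA-256, selected at run time.

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
)

func storeKey(algo, url string) string {
	var h hash.Hash
	switch algo {
	case "sha256":
		h = sha256.New()
	default: // keep MD5 as the compatible default
		h = md5.New()
	}
	h.Write([]byte(url))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	url := "http://download.windowsupdate.com/example/object.cab"
	fmt.Println("md5:   ", storeKey("md5", url))
	fmt.Println("sha256:", storeKey("sha256", url))
}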

And just to grasp the differences: the caching service I am running for MS updates utilizes less CPU, balances the CPU load, and gets a very high number of cache HITs and high throughput.

My solutions act as addons to Squid and do not replace it.
So if Squid is vulnerable to something, it will be hit before my service.

Currently I am just finishing a solution for a local YouTube store, for public videos only.
It consists of a couple of modules:
- Queue and fetch system
- Storage system (NFS)
- Web Server(Apache, PHP)
- Object storage server
- Squid traffic analysis utilities
- External acl helper that will help redirect traffic from the YouTube page to the locally cached version (with an option to bypass the cached version); a minimal helper sketch follows this list
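
For reference, the external ACL helper is the simplest of these modules: Squid writes one lookup per line on stdin and expects an OK or ERR answer per line on stdout. Here is a minimal Go sketch of that loop, assuming the helper is passed just the request URL (%URI) and with hasLocalCopy() as a placeholder for the real lookup against the local store:

package main

// Minimal external_acl_type helper loop: read one lookup per line from
// stdin (assumed to be just the request URL), answer OK if we hold a
// local copy of the video, ERR otherwise.

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasLocalCopy is a placeholder: a real helper would query the local
// object storage or its index here.
func hasLocalCopy(url string) bool {
	return strings.Contains(url, "youtube.com/watch")
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	for in.Scan() {
		url := strings.TrimSpace(in.Text())
		if hasLocalCopy(url) {
			fmt.Fprintln(out, "OK")
		} else {
			fmt.Fprintln(out, "ERR")
		}
		out.Flush() // Squid expects one answer per request line, promptly
	}
}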

I did write some things from scratch, but the concept has been around for a very long time and was built up over time.
From my testing, MS updates have been a pain in the neck over the last few years, but I have seen improvement on their side.
I noticed that Akamai services are sometimes broken, and MS systems tend to start fetching the object from Akamai and then switch to streaming it directly from an MS farm, so...
CDNs are nice, but if you implement them in the wrong way they can "block" the traffic.
This specific issue I have seen with MS updates spanned a couple of countries, and I haven't managed to contact any Akamai personnel using the public email contacts for a while...
So they just don't get paid by MS, due to their lack of effort to take their service one level up.

I hope that a couple of things have been cleared up.
If you have any more comments, I'm here for them.

Eliezer

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: eliezer at ngtech.co.il



-----Original Message-----
From: squid-users [mailto:squid-users-bounces at lists.squid-cache.org] On Behalf Of Amos Jeffries
Sent: Wednesday, July 19, 2017 16:38
To: squid-users at lists.squid-cache.org
Subject: Re: [squid-users] What would be the maximum ufs\aufs cache_dir objects?

On 18/07/17 05:34, Eliezer Croitoru wrote:
> So basically, from what I understand, the limit of the AUFS\UFS cache_dir is at:
> 16,777,215 Objects.
> So for a very loaded system it might be pretty "small".
> 
> I have asked since:
> I have seen the mongodb ecap adapter that stores chunks and I didn't like it.
> On the other hand, I wrote a cache_dir in GoLang which I am using for the windows updates caching proxy, and for now it's surpassing the AUFS\UFS limits.
> 
> Based on the success of the Windows Updates Cache proxy which strives to cache only public objects, I was thinking about writing something similar for a more global usage.
> The basic constraint on what would be cached is: only if the object has Cache-Control "public".

You would end up with only a small sub-set of HTTP ever being cached.

CC:public's main reason for existence is to re-enable cacheability of 
responses that contain security credentials - which is prevented by 
default as a security fail-safe.

I know a fair number of servers still send it when they should not. But 
that is declining as content gets absorbed by CDN who take more care 
with their bandwidth expenditure.



> The first step would be an ICAP service (respmod) which will log requests and responses and will decide what GET results are worthy of a later fetch.
> Squid currently does things on-the-fly while the client transaction is fetched by the client.

What things are you speaking about here?

How do you define "later"? is that 1 nanosecond or 64 years?
  and what makes 1 nanosecond difference in request timing for a 6GB 
object any less costly than 1 second?

Most of what Squid does and the timing of it have good reasons behind 
them. Not saying change is bad, but to make real improvements instead of 
re-inventing some long lost wheel design one has to know those reasons 
to avoid them becoming problems.
  eg. the often laughed at square wheel is a real and useful design for 
some circumstances. And their lesser brethren, cogwheels and the like,
are an age proven design in rail history for places where roundness 
actively inhibits movement.


> For an effective cache I believe we can compromise on another approach which relies on statistics.
> The first rule is: Not everything worth caching!!!
> Then after understanding and configuring this we can move on to fetch *Public* only objects when they get a high number of repeated downloads.
> This is actually how google cache and other similar cache systems work.
> They first let traffic reach the "DB" or "DATASTORE" if it's the first time seen.

FYI: that is the model Squid is trying to move away from - because it 
slows down traffic processing. As far as I'm aware G has a farm of 
servers to throw at any task - unlike most sysadmins trying to stand up a
cache.


> Then after passing a specific threshold the object is fetched by the cache system without any connection to the transaction which the clients consume.

Introducing the slow-loris attack.

It has several variants:
1) client sends a request, very , very, ... very slowly. many thousands 
of bots all do this at once, or building up over time.
   -> an unwary server gets crushed under the weight of open TCP 
sockets, and its normal clients get pushed out into DoS.

2) client sends a request. then ACK's delivery, very, very, ... very slowly.
   -> an unwary server gets crushed under the weight of open TCP 
sockets, and its normal clients get pushed out into DoS. AND suffers for 
each byte of bandwidth it spent fetching content for that client.

3) both of the above.

The slower a server is at detecting this attack the more damage can be 
done. This is magnified by whatever amount of resource expenditure the 
server goes to before detection can kick in - RAM, disk I/O, CPU time, 
TCP sockets, and of most relevant here: upstream bandwidth.

Also, Loris and clients on old tech like 6K modems or worse are 
indistinguishable.

To help resolve this problem Squid does the _opposite_ to what you 
propose above. It makes the client delivery and the server fetch align 
to avoid mistakes detecting these attacks and disconnecting legitimate 
clients.
  The read_ahead_gap directive configures the threshold amount of server 
fetch which can be done at full server-connection speed before slowing 
down to client speed. The various I/O timeouts can be tuned to what a 
sysadmin knows about their clients expected I/O capabilities.


> It might not be the most effective caching "method" for specific very loaded systems or specific big files and *very* high cost up-stream connections but for many it will be fine.
> And the actual logic and implementation can be each of couple algorithms like LRU as the default and couple others as an option.
> 
> I believe that this logic will be good for specific systems and will remove all sorts of weird store\cache_dir limitations.

Which weird limitations are you referring to?

The limits you started this thread about are caused directly by the size 
of a specific integer representation and the mathematical properties 
inherent in a hashing algorithm.

Those types of limit can be eliminated or changed in the relevant code 
without redesigning how HTTP protocol caching behaves.


Amos
_______________________________________________
squid-users mailing list
squid-users at lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-users


