[squid-users] Regex optimization

Wed Apr 27 14:01:31 UTC 2016

On 27/04/2016 11:32 p.m., Alfredo Rezinovsky wrote:
> I saw in debug log that when an ACL has many regexes each one is compared
> sequentially.
> 
> If I have
> 
> www.facebook.com
> facebook.com
> www.google.com
> google.com
> 
> Will it be faster to check just ONE optimized regex like
> (www\.)?(facebook|google)\.com than the previous four?
> 
> I'm really talking about optimizing about 3000 url regexes in one huge
> regex because comparing each and every url to 3000 regexes is too slow.

As Yuri was trying to point out (I think), simply using one bigger regex
pattern does not always mean faster matching.

> 
> I know using
> (www\.facebook\.com)|(facebook\.com)|(www\.google\.com)|(google\.com) with
> PCRE will produce the same optimized result as
> (www\.)?(facebook|google)\.com. Squid uses GnuRegex. Does the GNURegex lib
> optimize this as well?

If you actually pass GNURegex that *single* pattern, then yes, it will do
some optimization, though I'm not sure exactly how much in comparison to
PCRE.
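
To make the equivalence in the question concrete, here is a minimal sketch
that checks the four separate patterns, their plain alternation, and the
hand-factored pattern against the same URLs. It uses Python's `re` module
purely for illustration; that is an assumption of this example, not the
engine Squid uses, and it says nothing about relative speed. The sample
URLs and the helper names are hypothetical.

```python
import re

# The four patterns from the question, with "." escaped so it is literal.
separate = [
    r"www\.facebook\.com",
    r"facebook\.com",
    r"www\.google\.com",
    r"google\.com",
]
compiled_separate = [re.compile(p) for p in separate]

# One alternation of the four patterns, and the hand-factored form.
alternation = re.compile("|".join(f"({p})" for p in separate))
factored = re.compile(r"(www\.)?(facebook|google)\.com")


def verdicts(url):
    """Return the match result of each approach for one URL."""
    one_by_one = any(rx.search(url) for rx in compiled_separate)
    as_alternation = alternation.search(url) is not None
    as_factored = factored.search(url) is not None
    return one_by_one, as_alternation, as_factored


# Hypothetical sample URLs, chosen only to exercise both outcomes.
samples = [
    "http://www.facebook.com/",
    "http://google.com/search",
    "http://example.org/",
    "http://facebookxcom/",
]

for url in samples:
    print(url, verdicts(url))
```

All three approaches give the same verdict for every sample; whether the
single pattern is actually faster depends on the regex library in use.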

 * Also, while GNURegex is the regex engine bundled with Squid, it really
is only a backup engine for systems like Windows which don't provide a
regex engine. The stdlib regex library is always used if available. On
some OSes that stdlib engine is GNU, on others PCRE or something even
better.

What you see in the log is the fact that Squid is actually *not*
configured with a single compound "optimized" pattern. You are actually
using a file with ~3000 patterns in it ... so 3000 regex patterns to be
checked against the URL.

Whether Squid checks 3000 tests or some smaller number depends on what
Squid version you are using. The recent versions do some trivial pattern
aggregation and strip away prefix/suffix ".*" garbage to help the
library optimize better. But as Yuri showed, a bigger pattern does not
necessarily take fewer *steps* per test. The gains are mostly in
reduced Squid code CPU time and RAM overheads.
Regex is still the slowest of the ACLs in terms of raw CPU consumed.
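
The kind of aggregation described above can be sketched roughly as
follows. This is not Squid's actual code, only an illustration in Python
under stated assumptions: patterns are used for unanchored searches (so a
leading or trailing ".*" changes nothing about whether a match exists),
and the function names and the `chunk_size` parameter are hypothetical.

```python
import re


def strip_dot_star(pattern):
    """Drop leading/trailing '.*', which is redundant for unanchored search."""
    # Leave the prefix alone if '.*' is followed by another quantifier
    # character, since removing it would produce an invalid pattern.
    while pattern.startswith(".*") and pattern[2:3] not in ("?", "+", "*", "{"):
        pattern = pattern[2:]
    # A trailing '\.*' means "zero or more literal dots", so keep that one.
    while pattern.endswith(".*") and not pattern.endswith("\\.*"):
        pattern = pattern[:-2]
    return pattern


def aggregate(patterns, chunk_size=50):
    """Join cleaned patterns into a few larger alternations."""
    cleaned = [strip_dot_star(p) for p in patterns]
    combined = []
    for start in range(0, len(cleaned), chunk_size):
        chunk = cleaned[start:start + chunk_size]
        combined.append(re.compile("|".join(f"(?:{p})" for p in chunk)))
    return combined


def matches(combined, url):
    """True if any of the aggregated patterns is found in the URL."""
    return any(rx.search(url) for rx in combined)


patterns = [r".*facebook\.com.*", r"google\.com", r".*example\.(net|org)"]
combined = aggregate(patterns, chunk_size=2)
print(len(combined), matches(combined, "http://www.facebook.com/x"))
```

Three input patterns become two compiled patterns here, so fewer library
calls are made per URL, but each call still has to do the work of every
alternative inside it.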

The biggest problem with using regex for domain name lists is that regex
is optimized for left-to-right comparisons. Domain name labels are built
right-to-left. dstdomain is optimized for right-to-left comparison with
an early-abort on mismatch and sub-domain wildcards - which gives it a
huge advantage in CPU cycles over regex.
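
A minimal sketch of why right-to-left matching is cheap: store the domain
list as a tree keyed on labels in reverse order, walk the host name from
its last label, and stop at the first label that is not in the tree. This
is an illustration in Python, not Squid's dstdomain implementation. The
`Node` class and function names are hypothetical, and the list syntax is
an assumption modeled on the dstdomain convention (a leading "." covers
the domain and all of its sub-domains).

```python
class Node:
    """One label in the reversed-label tree."""

    def __init__(self):
        self.children = {}
        self.exact = False       # entry like "google.com"
        self.subdomains = False  # entry like ".facebook.com"


def build_tree(entries):
    root = Node()
    for entry in entries:
        wildcard = entry.startswith(".")
        labels = entry.strip(".").lower().split(".")
        node = root
        for label in reversed(labels):  # "com" first, then "facebook", ...
            node = node.children.setdefault(label, Node())
        if wildcard:
            node.subdomains = True
        else:
            node.exact = True
    return root


def domain_matches(root, host):
    node = root
    for label in reversed(host.lower().rstrip(".").split(".")):
        node = node.children.get(label)
        if node is None:
            return False  # early abort: no entry shares this suffix
        if node.subdomains:
            return True   # wildcard entry covers this name and anything below
    return node.exact


tree = build_tree([".facebook.com", "google.com"])
for host in ("www.facebook.com", "google.com", "www.google.com",
             "facebook.com.example.org"):
    print(host, domain_matches(tree, host))
```

The last host is rejected after looking at a single label ("org"), no
matter how many entries the list holds, whereas a list of regex patterns
would each be tried against the whole string.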

Amos