[squid-users] ACL matches when it shouldn't

Fri Oct 2 09:08:14 UTC 2020

Regarding the use of an external ACL: I quickly implemented a Perl script that "does the job", but it seems to be somewhat sluggish.

This is how it's configured in squid.conf:
external_acl_type bllookup ttl=86400 negative_ttl=86400 children-max=80 children-startup=10 children-idle=3 concurrency=8 %PROTO %DST %PORT %PATH /opt/custom/scripts/squid/ext_txt_blwl_acl.pl --categories=adv,aggressive,alcohol,anonvpn,automobile_bikes,automobile_boats,automobile_cars,automobile_planes,chat,costtraps,dating,drugs,dynamic,finance_insurance,finance_moneylending,finance_other,finance_realestate,finance_trading,fortunetelling,forum,gamble,hacking,hobby_cooking,hobby_games-misc,hobby_games-online,hobby_gardening,hobby_pets,homestyle,ibs,imagehosting,isp,jobsearch,military,models,movies,music,podcasts,politics,porn,radiotv,recreation_humor,recreation_martialarts,recreation_restaurants,recreation_sports,recreation_travel,recreation_wellness,redirector,religion,remotecontrol,ringtones,science_astronomy,science_chemistry,sex_education,sex_lingerie,shopping,socialnet,spyware,tracker,updatesites,urlshortener,violence,warez,weapons,webphone,webradio,webtv
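
For readers unfamiliar with the helper side of this line: because concurrency is greater than zero, each request Squid sends starts with a channel ID, followed by the configured format tokens (%PROTO %DST %PORT %PATH here), and the reply has to start with the same ID. A minimal sketch of that exchange follows; it is not the actual ext_txt_blwl_acl.pl. The names parse_request, format_reply and serve, and the $is_blocked callback, are hypothetical.

```perl
use strict;
use warnings;

# Split one request line into its fields. The first token is the channel
# ID; the rest follow the format tokens from squid.conf.
sub parse_request {
    my ($line) = @_;
    chomp $line;
    my ($cid, $proto, $dst, $port, $path) = split / /, $line, 5;
    return {
        cid   => $cid,
        proto => $proto // '',
        dst   => lc($dst // ''),
        port  => $port // '',
        path  => $path // '',
    };
}

# Build a reply line: channel ID, then OK or ERR, then an optional
# message="..." pair. Assumes $message contains no double quotes.
sub format_reply {
    my ($cid, $blocked, $message) = @_;
    my $reply = $cid . ' ' . ($blocked ? 'ERR' : 'OK');
    $reply .= qq{ message="$message"} if defined $message;
    return $reply;
}

# Read requests from $in and answer on $out. $is_blocked is a hypothetical
# callback that returns (blocked?, message) for a parsed request.
sub serve {
    my ($in, $out, $is_blocked) = @_;
    while (my $line = <$in>) {
        my $req = parse_request($line);
        my ($blocked, $msg) = $is_blocked->($req);
        print {$out} format_reply($req->{cid}, $blocked, $msg), "\n";
    }
}
```

A real helper would call something like serve(\*STDIN, \*STDOUT, \&check) with output buffering turned off ($| = 1), since Squid waits for each reply line.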

I'd like to avoid the use of a DB if possible, but maybe someone here has an idea to share on flat file text searches.

Currently the dir structure of my blacklists is:

topdir
    category1 ... categoryN
        domains urls

So basically one example file to search in is topdir/category8/urls, etc.

The helper perl script contains this code to decide whether to block access or not:

foreach( @categories )
{
        # -F treats the pattern as a fixed string (a "." no longer matches any character),
        # -x requires a whole-line match, -m 1 stops at the first hit.
        # Caveat: a single quote in the request still breaks the shell quoting.
        chomp($s_urls = qx{grep -nFx -m 1 -- '$uri_dst$uri_path' $cats_where/$_/urls | cut -f1 -d:});

        if (length($s_urls) > 0) {
            if ($whitelist == 0) {
                $status = $cid." ERR message=\"URL ".$uri_dst." in BL ".$_." (line ".$s_urls.")\"";
            } else {
                $status = $cid." ERR message=\"URL ".$uri_dst." not in WL ".$_." (line ".$s_urls.")\"";
            }
            last;   # first match decides; skip the remaining categories
        }

        chomp($s_urls = qx{grep -nFx -m 1 -- '$uri_dst' $cats_where/$_/domains | cut -f1 -d:});

        if (length($s_urls) > 0) {
            if ($whitelist == 0) {
                $status = $cid." ERR message=\"Domain ".$uri_dst." in BL ".$_." (line ".$s_urls.")\"";
            } else {
                $status = $cid." ERR message=\"Domain ".$uri_dst." not in WL ".$_." (line ".$s_urls.")\"";
            }
            last;   # first match decides; skip the remaining categories
        }
}
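
For comparison, here is a minimal sketch of the same exact-match decision made without spawning grep: the lists are read once at helper startup into Perl hashes, and each request becomes two hash lookups. This is an illustration under assumptions, not a tested replacement. The names load_lists and lookup are hypothetical, the line number reported by grep -n is dropped, and all URL lists are checked before all domain lists rather than category by category.

```perl
use strict;
use warnings;

# Load every category's "domains" and "urls" file into one hash each,
# mapping entry -> category name. Intended to run once at helper startup.
sub load_lists {
    my ($cats_where, @categories) = @_;
    my (%domains, %urls);
    for my $cat (@categories) {
        for my $pair ([domains => \%domains], [urls => \%urls]) {
            my ($name, $table) = @$pair;
            open my $fh, '<', "$cats_where/$cat/$name" or next;  # skip missing files
            while (my $entry = <$fh>) {
                chomp $entry;
                next if $entry eq '';
                $table->{$entry} //= $cat;  # first category listing the entry wins
            }
            close $fh;
        }
    }
    return (\%domains, \%urls);
}

# Exact-match lookup: URL (host + path) first, then the bare host.
# Returns (kind, category) on a hit, or an empty list on a miss.
sub lookup {
    my ($domains, $urls, $uri_dst, $uri_path) = @_;
    my $cat = $urls->{ $uri_dst . $uri_path };
    return ('URL', $cat) if defined $cat;
    $cat = $domains->{$uri_dst};
    return ('Domain', $cat) if defined $cat;
    return;
}
```

The trade-off is memory: every helper child would hold its own copy of the tables, so the cost multiplies with the number of children. I have not measured how large 50MB of text becomes once loaded this way.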

There are currently 66 "categories" with around 50MB of text data in all.
So that's a lot to go through each time there's an HTTP request.
Apart from placing these blacklists on a ramdisk (they are currently on an M.2 SSD, so I'm not sure I'd notice any difference), what else can I try?
Should I reindex the lists and group them all alphabetically?
For instance should I process the lists in order to generate a dir structure as follows?

topdir
    a b c d e f ... x y z 0 1 2 3 ... 7 8 9
        domains urls

An example for a client requesting https://www.google.com/ would lead to searching only 2 files:
topdir/w/domains
topdir/w/urls

An example for a client requesting https://01.whatever.com/x would also lead to searching only 2 files:
topdir/0/domains
topdir/0/urls

An example for a client requesting https://8.8.8.8/xyz would also lead to searching only 2 files:
topdir/8/domains
topdir/8/urls
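
The mapping in the three examples above can be sketched as a small function that picks the bucket from the first character of the host. The name bucket_files is hypothetical, and so is the "other" bucket for hosts that do not start with a letter or digit, since the layout above does not say where those would go.

```perl
use strict;
use warnings;

# Map a destination host to the two bucket files of the proposed layout.
# Hosts starting with anything other than a-z or 0-9 fall into a
# hypothetical "other" bucket.
sub bucket_files {
    my ($topdir, $uri_dst) = @_;
    my $first = lc substr($uri_dst, 0, 1);
    $first = 'other' unless $first =~ /\A[a-z0-9]\z/;
    return ("$topdir/$first/domains", "$topdir/$first/urls");
}

# Prints topdir/w/domains and topdir/w/urls, one per line.
print join("\n", bucket_files('topdir', 'www.google.com')), "\n";
```

One caveat, offered as a guess rather than a measurement: many hosts begin with "www", so the w bucket may end up much larger than the others.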

Any ideas or links to scripts that already prepare lists for this?

Thanks,

Vieri