[squid-dev] RFC: Categorize level-0/1 messages

Wed Oct 20 15:22:53 UTC 2021

Hello,

    Nobody likes to be awaken at night by an urgent call from NOC about
some boring Squid cache.log message the NOC folks have not seen before
(or miss a critical message that was ignored by the monitoring system).
To facilitate automatic monitoring of Squid cache.logs, I suggest to
adjust Squid code to divide all level-0/1 messages into two major
categories -- "problem messages" and "status messages"[0]:

* Problem messages are defined as those that start with one of the three
well-known prefixes: FATAL:, ERROR:, and WARNING:. These are the
messages that most admins may want to be notified about (by default[1])
and these standardized prefixes make setting up reliable automated
notifications straightforward.

* Status messages are all other messages. Most admins do not want to be
notified about normal Squid state changes and progress reports (by
default[2]). These status messages are still valuable in triage so they
are _not_ going away[3].

Today, Squid does not support the problem/status message classification
well. To reach the above state, we will need to adjust many messages so
that they fall into the right category. However, my analysis of the
existing level-0/1 messages shows that it is doable to correctly
classify most of them without a lot of tedious work (all numbers and
prefix strings below are approximate and abridged for clarity of the
presentation):

* About 40% of messages (~700) already have "obvious" prefixes: BUG:,
BUG [0-9]*:, ERROR:, WARNING:, and FATAL:. We will just need to adjust
~20 existing BUG messages to move them into one of the three proposed
major categories (while still being clearly identified as Squid bugs, of
course).

* About 15% of messages (~300) can be easily found and adjusted using
their prefixes (after excluding the "obvious" messages above). Here is a
representative sample of those prefixes: SECURITY NOTICE, Whoops!, XXX:,
UPGRADE:, CRITICAL, Error, ERROR, Error, error, ALERT, NOTICE, WARNING!,
WARNING OVERRIDE, Warning:, Bug, Failed, Stop, Startup:, Shutdown:,
FATAL Shutdown, avoiding, suspending, DNS error, bug, cannot, can't,
could not, couldn't, bad, unable, malformed, unsupported, not found,
missing, broken, unexpected, invalid, corrupt, obsolete, unrecognised,
and unknown. Again, there is valuable information in many of these
existing prefixes, and all valuable information will be preserved (after
the standardized prefix). Some of these messages may be demoted to
debugging level 2.

* The remaining 45% of messages (~800) may remain as is during the
initial conversion. Many of them are genuine status/progress messages
with prefixes like these: Creating, Processing, Adding, Accepting,
Configuring, Sending, Making, Rebuilding, Skipping, Beginning, Starting,
Initializing, Installing, Indexing, Loading, Preparing, Killing,
Stopping, Completed, Indexing, Loading, Killing, Stopping, Finished,
Removing, Closing, Shutting. There are also "squid -k parse" messages
that are easy to find automatically if somebody wants to classify them
properly. Most other messages can be adjusted as/if they get modified or
if we discover that they are frequent/important enough to warrant a
dedicated adjustment.

If there are no objections or better ideas, Factory will work on a few
PRs that adjust the existing level-0/1 messages according to the above
classification, in the rough order of existing message categories/kinds
discussed in the three bullets above.

Thank you,

Alex.

[0] The spelling of these two category names is unimportant. If you can
suggest better category names, great, but let's focus on the category
definitions.

[1] No default will satisfy everybody, and we already have the
cache_log_message directive that can control the visibility and volume
of individual messages. However, manually setting those parameters for
every level-0/1 message is impractical -- we have more than 1600 such
messages! This RFC makes a reasonable _default_ treatment possible.

[2] Admins can, of course, configure their log monitoring scripts to
alert them of certain status messages if they consider those messages
important. Again, this RFC is about facilitating reasonable _default_
treatment.

[3] We could give status messages a unique prefix as well (e.g., INFO:)
but such a prefix is not necessary to easily distinguish them _and_
adding a prefix would create a lot more painful code changes, so I think
we should stay away from that idea.