[squid-dev] Strategy about build farm nodes

Amos Jeffries squid3 at treenet.co.nz
Sun May 16 07:31:45 UTC 2021


On 4/05/21 2:29 am, Alex Rousskov wrote:
> On 5/3/21 12:41 AM, Francesco Chemolli wrote:
>> - we want our QA environment to match what users will use. For this
>> reason, it is not sensible that we just stop upgrading our QA nodes,
> 
> I see flaws in reasoning, but I do agree with the conclusion -- yes, we
> should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!
> 
> The principles I have proposed allow upgrades that do not violate key
> invariants. For example, if a proposed upgrade would break master, then
> master has to be changed _before_ that upgrade actually happens, not
> after. Upgrades must not break master.

So ... a node is added/upgraded. It runs and builds master fine. Then,
once it is added to the matrices, some of the PRs start failing.

*THAT* is the situation I see happening recently. Master itself is
working fine, yet "huge amounts of pain, the sky is falling" complaints
come from a couple of people.

The sky is not falling. Master is no more and no less broken and buggy
than it was before the sysadmin touched Jenkins.

The PR itself is no more, and no less, "broken" than it would be if,
for example, it were only tested on Linux nodes and failed to compile
on Windows. Which happens to be the case for master *right now*.


> 
> What this means in terms of sysadmin steps for doing upgrades is up to
> you. You are doing the hard work here, so you can optimize it the way
> that works best for _you_. If really necessary, I would not even object
> to trial upgrades (that may break master for an hour or two) as long as
> you monitor the results and undo the breaking changes quickly and
> proactively (without relying on my pleas to fix Jenkins to detect
> breakages). I do not know what is feasible and what the best options
> are, but, again, it is up to _you_ how to optimize this (while observing
> the invariants).
> 

Uhm. Respectfully, from my perspective the above paragraph conflicts 
directly with actions taken.

From what I can tell, kinkie (as sysadmin) *has* been building each new
node and testing it first, not just against master but also against the
main branches and the most active PRs, before adding it to the
*post-merge* matrix testing used for snapshot production.

   But still, threads like this one full of complaints keep appearing.



I understand there is some specific pain you have encountered that
triggered this complaint. Can we get down to documenting, as exactly as
possible, what that particular pain was?

  Much of the process we are discussing is scripted automation, not
human processing mistakes. Handling such pain points as bugs filed
under the Bugzilla "Project" section would be best. Re-designing the
entire system policy just moves us all to another set of unknown bugs
when the scripts are re-coded to meet that policy.


> 
>> - I believe we should define four tiers of runtime environments, and
>> reflect these in our test setup:
> 
>>   1. current and stable (e.g. ubuntu-latest-lts).
>>   2. current (e.g. fedora 34)
>>   3. bleeding edge
>>   4. everything else - this includes freebsd and openbsd
> 
> I doubt this classification is important to anybody _outside_ this
> discussion, so I am OK with whatever classification you propose to
> satisfy your internal needs.
> 

IIRC this is the 5th iteration of a ground-up redesign of this wheel.

Test designs that do not fit into our merge and release process
sequence have proven, time and again, to be broken and painful for Alex
even when they operate as designed. For the rest of us, the painful
part is this constant re-building of the automation.


A. dev pre-PR testing
    - random individual OS.
    - matrix of everything (anybranch-*-matrix)

B. PR submission testing
    - which OS for master (5-pr-test) ?
    - which OS for beta (5-pr-test) ?
    - which OS for stable (5-pr-test) ?

Are all of those sets the identical OS+compiler combinations? No.
Why are they forced to be the same matrix test?
   IIRC, a policy forced on the sysadmin after previous pain complaints.

Are we getting painful experiences from this?
   Yes. The lack of branch-specific testing before stage D causes the
beta and stable branches to break at the last minute before releases a
lot more often than master does, adding random days/weeks to each
scheduled release.


C. merge testing
    - which OS for master (5-pr-auto) ?
    - which OS for beta (5-pr-auto) ?
    - which OS for stable (5-pr-auto) ?
      NP: maintainer does manual override on beta/stable merges.

Are all of those sets the identical OS+compiler combinations? No.
   Why are they forced to be the same matrix test? Anubis.

Are we getting painful experiences from this? Yes, see (B).


D. pre-release testing (snapshots + formal)
    - which OS for master (trunk-matrix) ?
    - which OS for beta (5-matrix) ?
    - which OS for stable (4-matrix) ?

Are all of those sets the identical OS+compiler combinations? No.
Are we forcing them to use the same matrix test? No.
Are we getting painful experiences from this? Maybe.
   Most of the loud complaints have been about "breaking master", which
is the most volatile branch being tested on the most volatile OSes.



FTR: the reason all those matrices have a '5-' prefix is that several
redesigns ago the system was that master/trunk had a matrix to which
the sysadmin added nodes as OSes were upgraded. When branching vN, the
maintainer would clone/freeze that matrix into an N-foo matrix, which
would then be used to test the code against the OS+compiler
combinations the vN branch was designed to build on.
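
(Purely as an illustration of that clone/freeze step, and not the
actual branching procedure, here is a rough sketch assuming the stock
Jenkins CLI jar; the server URL and authentication details below are
placeholders.)

  #!/usr/bin/env python3
  """Sketch only: freeze the rolling trunk matrix into a per-release
  matrix at branch time. Assumes the stock Jenkins CLI jar; the server
  URL below is a placeholder, not our real configuration."""
  import subprocess

  JENKINS_URL = "https://jenkins.example.org/"   # placeholder

  def freeze_matrix(release: int) -> None:
      # "copy-job" is a standard Jenkins CLI command; it duplicates the
      # current trunk-matrix job definition under a frozen N-matrix name.
      subprocess.run(
          ["java", "-jar", "jenkins-cli.jar", "-s", JENKINS_URL,
           "copy-job", "trunk-matrix", f"{release}-matrix"],
          check=True,
      )

  if __name__ == "__main__":
      freeze_matrix(5)   # e.g. at the time v5 was branched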


Can we have the people claiming pain specify exactly what the pain is 
coming from, and let the sysadmin/developer(s) with specialized 
knowledge of the automation in that area decide how best to fix it?


> 
>> I believe we should focus on the first two tiers for our merge workflow,
>> but then expect devs to fix any breakages in the third and fourth tiers
>> if caused by their PR,
> 
> FWIW, I do not understand what "focus" implies in this statement, and
> why developers should _not_ "fix any breakages" revealed by the tests in
> the first two tiers.
> 
> The rules I have in mind use two natural tiers:
> 
> * If a PR cannot pass a required CI test, that PR has to change before
> it can be merged.
> 
> * If a PR cannot pass an optional CI test, it is up to PR author and
> reviewers to decide what to do next.

That is already the case. Already well documented and understood.

I see no need to change anything based on those criteria. Ergo, there
is some undeclared criterion leading to whatever pain triggered this
discussion. Maybe the pain is some specific bug that does not need a
whole discussion and a re-design by committee?
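
(For the record: "required" in GitHub terms just means the check is
listed in the branch protection settings. Below is a quick sketch of
reading that list via the public REST API; the token handling is my own
assumption, not Project policy.)

  #!/usr/bin/env python3
  """Sketch: list the CI checks currently marked *required* for a
  protected branch, via GitHub's required_status_checks endpoint.
  The GITHUB_TOKEN handling is an assumption for illustration."""
  import os
  import requests

  REPO = "squid-cache/squid"
  BRANCH = "master"

  resp = requests.get(
      f"https://api.github.com/repos/{REPO}/branches/{BRANCH}"
      "/protection/required_status_checks",
      headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
      timeout=30,
  )
  resp.raise_for_status()
  print(resp.json().get("contexts", []))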


> 
> These are very simple rules that do not require developer knowledge of
> any complex test node tiers that we might define/use internally.
> 

This is the first I have heard of devs needing to have such knowledge.
Maybe that is because those rules are already *how we do things*. A
red-herring argument?


> Needless to say, the rules assume that the tests themselves are correct.
> If not, the broken tests need to be fixed (by the Squid Project) before
> the first bullet/rule above can be meaningfully applied (the second one
> is flexible enough to allow PR author and reviewers to ignore optional
> test failures).
> 

There is a hidden assumption here too: that the test is being applied
correctly.

I posit that this is the real bug we need to sort out. We could keep
"correcting" the node sets (aka tests) back and forth between being
suitable for master and being suitable for release branches. That just
shuffles the pain from one end of the system to the other.

Make Anubis and Jenkins use a different matrix for each branch at the B
and C process stages above. Only then will discussion of which nodes to
add to which test/matrix actually make progress.
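
To be concrete about what "a different matrix for each branch" could
look like, here is a toy sketch of the mapping. It is not actual Anubis
or Jenkins configuration, and the trunk-* job names in it are made up.

  # Sketch only: map each target branch to the matrix jobs used at
  # stage B (PR submission) and stage C (merge). The trunk-* job names
  # are hypothetical placeholders, not existing Jenkins jobs.
  BRANCH_MATRICES = {
      "master": {"pr_test": "trunk-pr-test", "pr_auto": "trunk-pr-auto"},
      "v5":     {"pr_test": "5-pr-test",     "pr_auto": "5-pr-auto"},
      "v4":     {"pr_test": "4-pr-test",     "pr_auto": "4-pr-auto"},
  }

  def required_job(target_branch: str, stage: str) -> str:
      """Return the matrix job a PR against target_branch must pass.
      stage is "pr_test" (stage B) or "pr_auto" (stage C)."""
      return BRANCH_MATRICES[target_branch][stage]

  assert required_job("v5", "pr_test") == "5-pr-test"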



> 
>> Breakages due to changes in nodes (e.g. introducing a new distro
>> version) would be on me and would not stop the merge workflow.
> 
> What you do internally to _avoid_ breakage is up to you, but the primary
> goal is to _prevent_ CI breakage (rather than to keep CI nodes "up to
> date"!).

The principle ("invariant" in Alex terminology?) with nodes is that they 
represent the OS environment a typical developer can be assumed to be 
running on that OS version+compiler combination.

Distros release security updates to their "stable" versions. Therefore,
to stay true to that goal, we require constant small upgrades as an
ongoing part of sysadmin maintenance.

Adding new nodes for the next distro release versions is a manual
process, not related to keeping existing nodes up to date (which is
automated?).
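
(For illustration only, a minimal sketch of what that automated part
could look like, assuming a Debian/Ubuntu node driven from cron; the
real node maintenance scripts may look nothing like this.)

  #!/usr/bin/env python3
  """Sketch: apply pending distro package updates on a build node.
  Assumes a Debian/Ubuntu node and unattended invocation (e.g. cron)."""
  import os
  import subprocess

  def apply_updates() -> None:
      env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
      subprocess.run(["apt-get", "update"], check=True, env=env)
      # Upgrade already-installed packages without prompting.
      subprocess.run(["apt-get", "-y", "upgrade"], check=True, env=env)

  if __name__ == "__main__":
      apply_updates()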

From time to time, distros break their own ability to compile things.
This is to be expected on rolling-release distros and, ironically, on
LTS releases (whose updates get *less* testing than normal releases).
It does not indicate a "broken master" nor a "broken CI" in any way.


> 
> There are many ways to break CI and detect those breakages, of course,
> but if master cannot pass required tests after a CI change, then the
> change broke CI.

I have yet to see the code in master be corrupted by CI changes in such
a way that it could not build on people's development machines.

What we do have going on are network timeouts, DNS resolution failures,
CPU wait timeouts, and (rarely) _automated_ CI upgrades, all causing
short-term failures to pass a test.

A PR fixing the newly highlighted bugs gets around the latter. Any pain
(e.g. master blocked for 2 days waiting on the fix PR to merge) is a
normal problem with that QA process and should not be attributed to the
CI change.
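
(An aside: that transient class of failure can be absorbed by a simple
retry around the test step. A generic sketch follows; it is not
something our Jenkins jobs currently do, and the test-builds.sh
invocation at the end is only a hypothetical example.)

  import subprocess
  import time

  def run_with_retries(cmd, attempts=3, delay=30.0):
      """Run a build/test step, retrying so that transient errors
      (network timeouts, DNS hiccups) are not reported as breakage."""
      for attempt in range(1, attempts + 1):
          if subprocess.run(cmd).returncode == 0:
              return
          if attempt < attempts:
              time.sleep(delay * attempt)   # simple linear backoff
      raise RuntimeError(f"{cmd!r} still failing after {attempts} attempts")

  # Example (hypothetical invocation):
  # run_with_retries(["./test-builds.sh"])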


> 
>> What I would place on each individual dev is the case where a PR breaks
>> something in the trunk-matrix,trunk-arm32-matrix, trunk-arm64-matrix,
>> trunk-openbsd-matrix, trunk-freebsd-matrix builds, even if the 5-pr-test
>> and 5-pr-auto builds fail to detect the breakage because it happens
>> on an unstable or old platform.
>
> This feels a bit out of topic for me, but I think you are saying that
> some CI tests called trunk-matrix, trunk-arm32-matrix,
> trunk-arm64-matrix, trunk-openbsd-matrix, trunk-freebsd-matrix should be
> classified as _required_.

That is how I read the statement too.

> In other words, a PR must pass those CI tests
> before it can be merged. Is that the situation today? Or are you
> proposing some changes to the list of required CI tests? What are those
> changes?
> 

No, the situation today is that those matrices are new ones only
recently created by the sysadmin, and they are not used as criteria in
any of the merge or release processes. The BSDs, though, were once
checked as part of the general 5-pr-test required for PR testing.


IMO, it is a good point. We do need to stop the practice of simply
dropping support for any OS where attempting to build reveals existing
bugs in master (aka "breaks master, sky falling"). More focus on fixing
those bugs would increase portability and grow the Squid community
beyond the subset of RHEL and Ubuntu users.


Amos

