[squid-dev] Strategy about build farm nodes

Alex Rousskov rousskov at measurement-factory.com
Sun May 16 23:56:01 UTC 2021


On 5/16/21 3:31 AM, Amos Jeffries wrote:
> On 4/05/21 2:29 am, Alex Rousskov wrote:
>> On 5/3/21 12:41 AM, Francesco Chemolli wrote:
>>> - we want our QA environment to match what users will use. For this
>>> reason, it is not sensible that we just stop upgrading our QA nodes,
>>
>> I see flaws in reasoning, but I do agree with the conclusion -- yes, we
>> should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!
>>
>> The principles I have proposed allow upgrades that do not violate key
>> invariants. For example, if a proposed upgrade would break master, then
>> master has to be changed _before_ that upgrade actually happens, not
>> after. Upgrades must not break master.
> 
> So ... a node is added/upgraded. It runs and builds master fine. Then,
> once it is added to the matrices, some of the PRs start failing.

It is easy to misunderstand what is going on because there is no good
visualization of complex PR-master-Jenkins_nodes-Jenkins_failures
relationships. Several kinds of PR test failures are possible. I will
describe the two most relevant to your email:

* PR test failures due to problems introduced by PRs should be welcomed
at any time. CI improvements are allowed to find new bugs in open PRs.
Such findings, even when discovered at the "last minute", should be seen
as an overall positive event, a sign of progress -- our CI was able to
identify a problem before it was officially accepted! I do not recall
anybody complaining about such failures recently.

* PR test failures due to the existing master code are not welcome.
They represent a CI failure. In these cases, if the latest master code
is tested with the same test after the problematic CI change, then that
master test will fail. Nothing a PR can do in this situation can fix
this kind of failure because it is not the PR changes that are causing
the failure -- CI changes broke the master branch, not just the PR!
Failures of this kind are the responsibility of CI administrators, and
PR authors should complain about them, especially when there are no
signs that CI administrators are aware of the problem and working on
addressing it.

A good example of a failure of the second kind is a
-Wrange-loop-construct error in a PR that does not touch any range
loops (Jenkins conveniently deleted the actual failed test, but my
GitHub comment and the PR contents may be enough to reconstruct what
happened):
https://github.com/squid-cache/squid/pull/806#issuecomment-827924821
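
For clarity, here is a minimal sketch (in Python, purely illustrative
and not part of any Squid tooling) of the triage rule implied above:
rerun the same required test against the tip of master; if master fails
too, the failure is of the second kind and no PR change can fix it.

    def classify_failure(pr_test_passed: bool, master_test_passed: bool) -> str:
        """Triage a required-test failure for a PR targeting master."""
        if pr_test_passed:
            return "no failure"
        if master_test_passed:
            # Only the PR branch fails: the PR (or a CI improvement
            # exposing a bug in the PR) is responsible -- first kind.
            return "first kind: PR author should address"
        # Master itself fails the same test: CI changes broke master,
        # and no PR change can fix this -- second kind.
        return "second kind: CI administrators should address"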


[I am skipping the exaggerations (about other people's exaggerations)
that were, AFAICT, driven by mis-classifying failures of the second
kind as regular PR failures of the first kind.]


>> What this means in terms of sysadmin steps for doing upgrades is up to
>> you. You are doing the hard work here, so you can optimize it the way
>> that works best for _you_. If really necessary, I would not even object
>> to trial upgrades (that may break master for an hour or two) as long as
>> you monitor the results and undo the breaking changes quickly and
>> proactively (without relying on my pleas to fix Jenkins to detect
>> breakages). I do not know what is feasible and what the best options
>> are, but, again, it is up to _you_ how to optimize this (while observing
>> the invariants).


> Uhm. Respectfully, from my perspective the above paragraph conflicts
> directly with actions taken.

My paragraph is not really about any actions already taken, so there
cannot be any such conflict AFAICT.


> From what I can tell kinkie (as sysadmin) *has* been making a new node
> and testing it first. Not just against master but the main branches and
> most active PRs before adding it for the *post-merge* matrix testing
> snapshot production.

The above summary does not match the (second kind of) PR test failures
that I have observed and asked Francesco to address. Those failures were
triggered by the latest master code, not PR changes. No PR changes would
have fixed those test failures. In fact, an "empty" PR that introduces
no code changes at all would have failed as well. See above for one
recent example.


>   But still threads like this one with complaints appear.

This squid-dev thread did not contain any complaints AFAICT. Francesco
is trying to get consensus on how to handle CI upgrades/changes. He is
doing the right thing and nobody is complaining about it.


> I understand there is some specific pain you have encountered to trigger
> the complaint. Can we get down to documenting as exactly as possible
> what the particular pain was?

Well, I apparently do not know what you call "complaint", so I cannot
answer this exact question, but if you are looking for a recent PR test
failure that triggered this squid-dev thread, then please see the GitHub
link above.



> Test designs that do not fit into our merge and release process sequence
> have proven time and again to be broken and painful to Alex when they
> operate as-designed. For the rest of us it is this constant re-build of
> automation which is the painful part.

While I am not sure what you are talking about specifically, I think
that any design that causes pain should be changed. Your phrasing
suggests some dissatisfaction with that natural (IMO) expectation, so I
am probably missing something. Please rephrase if this is important.


> B. PR submission testing
>    - which OS for master (5-pr-test) ?
>    - which OS for beta (5-pr-test) ?
>    - which OS for stable (5-pr-test) ?
> 
> Are all of those sets the same identical OS+compilers? no.
> Why are they forced to be the same matrix test?

I do not understand the question. Are you asking why Jenkins uses the
same 5-pr-test configuration for all three branches (master, beta, _and_
stable)? I do not know the answer.


> IIRC, policy forced on sysadmin with previous pain complaints.

Complaints, even legitimate ones, should not shape a policy. Goals and
principles should do that.

I remember one possibly related discussion where we were trying to
reduce Jenkins/PR wait times by changing which tests are run at what PR
merging stages, but that is probably a different issue because your
question appears to be about a single merging stage.


> C. merge testing
>    - which OS for master (5-pr-auto) ?
>    - which OS for beta (5-pr-auto) ?
>    - which OS for stable (5-pr-auto) ?
>      NP: maintainer does manual override on beta/stable merges.
> 
> Are all of those sets the same identical OS+compilers? no.
>   Why are they forced to be the same matrix test? Anubis

This is too cryptic for me to understand, but Anubis does not force any
tests on anybody -- it simply checks that the required tests have
passed. I am not aware of any Anubis bugs in this area, but please
correct me if I am wrong.


> D. pre-release testing (snapshots + formal)
>    - which OS for master (trunk-matrix) ?
>    - which OS for beta (5-matrix) ?
>    - which OS for stable (4-matrix) ?
> 
> Are all of those sets the same identical OS+compilers? no.
> Are we forcing them to use the same matrix test? no.
> Are we getting painful experiences from this? maybe.
>   Most loud complaints have been about "breaking master" which is the
> most volatile branch testing on the most volatile OS.

FWIW, I think you misunderstood what those "complaints" were about. I
do not know how that relates to the above questions/answers though.


> FTR: the reason all those matrices have '5-' prefix is because several
> redesigns ago the system was that master/trunk had a matrix which the
> sysadmin added nodes to as OS upgraded. During branching vN the
> maintainer would clone/freeze that matrix into an N-foo which would be
> used to test the code against OS+compilers which the code in the vN
> branch was designed to build on.

I think the above description implies that some time ago we were (more)
careful about (not) adding new nodes when testing stable branches. We
did not want a CI change to break a stable branch. That sounds like the
right principle to me (and it should apply to beta and master as well).
How that specific principle is accomplished is not important (to me) so
CI admins should propose whatever technique they think is best.


> Can we have the people claiming pain specify exactly what the pain is
> coming from, and let the sysadmin/developer(s) with specialized
> knowledge of the automation in that area decide how best to fix it?

We can, and that is exactly what is going on in this thread AFAICT. This
particular thread was caused by CI changes breaking master, and
Francesco was discussing how to avoid such breakages in the future.

There are other goals/principles to observe, of course, and it is
possible that Francesco is proposing more changes to optimize something
else as well, but that is something only he can clarify (if needed).

AFAICT, Francesco and I are on the same page regarding not breaking
master anymore -- he graciously agreed to prevent such breakages in the
future, and I am very thankful that he did. Based on your comments,
which discuss several cases where such master breakage is, in your
opinion, OK, you currently disagree with that principle. I do not know
why.


>> I believe we should focus on the first two tiers for our merge workflow,
>> but then expect devs to fix any breakages in the third and fourth tiers
>> if caused by their PR, 

>> The rules I have in mind use two natural tiers:
>>
>> * If a PR cannot pass a required CI test, that PR has to change before
>> it can be merged.
>>
>> * If a PR cannot pass an optional CI test, it is up to PR author and
>> reviewers to decide what to do next.


> That is already the case. Already well documented and understood.

> I see no need to change anything based on those criteria. 

I hope so! I stated those rules in direct response to what seemed to me
like a far more complex 4-tier proposal by Francesco. I did not state
them to change some Project policy. I stated them in the hope of
understanding (and/or simplifying) his proposal.


>> These are very simple rules that do not require developer knowledge of
>> any complex test node tiers that we might define/use internally.

> This is the first I've heard about dev having to have such knowledge.

Perhaps you interpreted the email I was replying to differently than I
did. I hope the above paragraph clarifies the situation.


>> Needless to say, the rules assume that the tests themselves are correct.
>> If not, the broken tests need to be fixed (by the Squid Project) before
>> the first bullet/rule above can be meaningfully applied (the second one
>> is flexible enough to allow PR author and reviewers to ignore optional
>> test failures).

> There is a hidden assumption here too. About the test being applied
> correctly.

For me, test application correctness (i.e. whether the right tests are
applied to the right software) is a part of the overall test
correctness (which naturally has many components). From that point of
view, there is no hidden assumption here.


> I posit that is the real bug we need to sort out. We could keep on
> "correcting" the node sets (aka tests) back and forward between being
> suitable for master or suitable for release branches. That just shuffles
> the pain from one end of the system to the other.

> Make Anubis and Jenkins use different matrix for each branch at the B
> and C process stages above. Only then will discussion of what nodes to
> add to what test/matrix actually make progress.

Anubis is pretty much a red herring here AFAICT -- IIRC, Anubis just
checks that enough required tests have passed. The test results are
supplied by Jenkins and Semaphore CI.

If you think that each target branch should have its own set of Jenkins
tests for a staged commit, then I see no problem with that per se. You
or Francesco should configure GitHub and Jenkins accordingly.

I recommend grouping all those tests under one name (that single name
can be unique to each target branch), so that Anubis can be told how
many tests to expect using a single number (rather than many
branch-specific settings). If that grouping becomes a serious problem,
we can change Anubis to support a different number of tests for
different target branches.
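
As a purely illustrative sketch (in Python; this is not Anubis code,
and the "jenkins/5-pr-test" context name below is hypothetical): if
Jenkins reports all branch-specific tests under a single GitHub status
context, a bot only needs to check that one context via GitHub's
combined-status API.

    import json
    import urllib.request

    def grouped_status_ok(owner, repo, ref, context):
        """True if the named grouped status context reports success."""
        url = f"https://api.github.com/repos/{owner}/{repo}/commits/{ref}/status"
        with urllib.request.urlopen(url) as response:
            combined = json.load(response)
        for status in combined.get("statuses", []):
            if status["context"] == context:
                return status["state"] == "success"
        return False  # the grouped check has not reported yet

    # e.g., grouped_status_ok("squid-cache", "squid", "master", "jenkins/5-pr-test")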


> The principle ("invariant" in Alex terminology?) with nodes is that they
> represent the OS environment a typical developer can be assumed to be
> running on that OS version+compiler combination.

FWIW, that principle or invariant (I often use those two words
interchangeably) feels a bit unfortunate to me: Given scarce resources,
our CI tests should not target Squid developers. They should target
typical Squid _deployments_ instead. Fortunately, the difference is
often minor.


> Distros release security updates to their "stable" versions. Therefore
> to stay true to the goal we require constant small upgrades as an
> ongoing part of sysadmin maintenance.

Yes, I think we are all in agreement that nodes should be upgraded.
Deployments often do not upgrade their OSes as often as distros release
updates, even for "security" reasons, but we have to pick the lesser of
two evils and upgrade as often as we can (correctly).

The only possible disagreement that I know of is about _prerequisites_
for an upgrade, as revisited below.


> Adding new nodes with next distro release versions is a manual process
> not related to keeping existing nodes up to date (which is automated?).

Sure. And new nodes should not be added if that addition breaks master.


> From time to time distros break their own ability to compile things.
> It does not indicate "broken master" nor "broken CI" in any way.

Distro changes cannot indicate "broken master" or "broken CI" _until_ we
actually apply them to our CI nodes. If upgrading a node would break (or
has broken) master (as already defined many times), then that upgrade
should not be done or should be undone (until master is changed to allow
the upgrade to proceed). Keeping master tests successful is a key
upgrade precondition IMO.
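
To illustrate that precondition with a minimal sketch (Python; the
helper and the build command are hypothetical, not existing build-farm
tooling): a candidate node joins the pool only if the current master
builds cleanly on it.

    import subprocess

    def master_builds_on_candidate(build_cmd):
        """Return True only if current master builds cleanly on the candidate node.

        A False result means the node must not be added (or the upgrade
        must be reverted) until master itself is changed to cope with
        the new node.
        """
        result = subprocess.run(build_cmd)
        return result.returncode == 0

    # e.g., master_builds_on_candidate(
    #     ["ssh", "candidate-node",
    #      "cd squid && git pull && ./bootstrap.sh && ./configure && make"])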


>> There are many ways to break CI and detect those breakages, of course,
>> but if master cannot pass required tests after a CI change, then the
>> change broke CI.

> I have yet to see the code in master be corrupted by CI changes in such
> a way that it could not build on peoples development machines.

FWIW, I do not see how the above assertion (or anything about people's
development machines) is relevant to this discussion about Project CI
and the official merge process.


> What we do have going on is network timeouts, DNS resolution, CPU wait
> timeouts, and rarely _automated_ CI upgrades all causing short-term
> failure to pass a test.

Intermittent failures (DNS and such) are out of scope here.

CI upgrades, automated or not, should not break master. If they
accidentally do, they should be reverted. If reversal is not possible,
they should not be attempted in the first place.


> A PR fixing newly highlighted bugs gets around the latter. Any pain (eg
> master blocked for 2 days waiting on the fix PR to merge) is a normal
> problem with that QA process and should not be attributed to the CI change.

I do not understand why a problem caused by the CI change should not be
attributed to that CI change.



> We do need to stop the practice of just dropping
> support for any OS where attempting to build finds existing bugs in
> master (aka "breaks master, sky falling"). More focus on fixing those
> bugs to increase portability and grow the Squid community beyond the
> subset of RHEL and Ubuntu users.

If somebody can add support for more environments (and adding that
support does not make things worse for Squid), then they should
certainly add that support! I see no relationship with this discussion
-- nobody here wants to prohibit such useful activities, and the "do not
break master" invariant can be upheld while adding such support.


Alex.

