[squid-dev] Strategy about build farm nodes

Alex Rousskov rousskov at measurement-factory.com
Mon May 17 15:03:23 UTC 2021


On 5/16/21 10:19 PM, squid3 at treenet.co.nz wrote:
> On 2021-05-17 11:56, Alex Rousskov wrote:
>> On 5/16/21 3:31 AM, Amos Jeffries wrote:
>>> On 4/05/21 2:29 am, Alex Rousskov wrote:
>>>> The principles I have proposed allow upgrades that do not violate key
>>>> invariants. For example, if a proposed upgrade would break master, then
>>>> master has to be changed _before_ that upgrade actually happens, not
>>>> after. Upgrades must not break master.
>>>
>>> So ... a node is added/upgraded. It runs and builds master fine. Then,
>>> once it is added to the matrices, some of the PRs start failing.
>>
>> It is easy to misunderstand what is going on because there is no good
>> visualization of complex PR-master-Jenkins_nodes-Jenkins_failures
>> relationships. Several kinds of PR test failures are possible. I will
>> describe the two most relevant to your email:
>>
>> * PR test failures due to problems introduced by PRs should be welcomed
>> at any time.
> 
> Strawman here. This is both a general statement and not relevant to the
> CI changes or design(s) we are discussing.

The first bullet (out of the two bullets that are meant to be
interpreted together, in their context) is not even close to being a
strawman argument by any definition I can find. Neither is the
combination of those two bullets.


>> CI improvements are allowed to find new bugs in open PRs.

> IMO the crux is that word "new". CI improvements very rarely find new
> bugs. What they actually find, and intentionally so, are *existing bugs*
> that the old CI config wrongly ignored.

Here, I used "new" to mean bugs that had not been found by the previous
CI version (i.e. "newly discovered" or "new to us"). These bugs existed
before and after CI improvements, of course -- neither master nor the PR
has changed -- but we did not know about them.


>> * PR test failures due to the existing master code are not welcomed.

> That is not as black/white as the statement above implies. There are
> some master branch bugs that we do not want to block PR merging, and
> there are some (rarely) for which we absolutely do not want any PRs to
> change master until they are fixed.

Yes, master bugs should not affect PR merging in the vast majority of
cases, but that is not what this bullet is about at all!

This bullet is about (a certain kind of) PR test failures. Hopefully, we
do not need to revisit the debate about whether PRs with failed tests
should be merged. They should not be merged, which is exactly why PR test
failures caused by the combination of CI changes and the existing master
code are not welcomed -- they block the progress of an innocent PR.


>> They represent a CI failure.
> 
> IMO this is absolutely false. The whole point of improving CI is to find
> those "existing" bugs which the previous CI config wrongly missed.

Your second sentence is correct, but it does not make my statement
false. CI should achieve several goals. Finding bugs (i.e. blocking
buggy PRs) is one of them. Merging (i.e. not blocking) PRs that should
be merged is another. And there are a few more goals, of course. The
problem described in my second bullet represents a CI failure to reach
one of the key CI goals or a failure to maintain a critical CI
invariant. It is a CI failure rather than a PR failure (the latter is
covered by the first bullet).


> e.g. v4+ currently do not build on Windows. We know this, but the
> current CI testing does not show it. Upgrading the CI to include a test
> for Windows is not a "CI failure".

If such an upgrade would result in blocking PRs that do not touch
Windows code, then that upgrade would be a CI failure. Or a failure to
properly upgrade CI.


>> In these cases, if the latest master code
>> is tested with the same test after the problematic CI change, then that
>> master test will fail. Nothing a PR can do in this situation can fix
>> this kind of failure because it is not PR changes that are causing the
>> failure -- CI changes broke the master branch,
> 
> Ah. "broke the master branch" is a bit excessive. master is not broken
> any more or less than it already was.

If you can suggest a better short phrase to describe the problem, please
do so! Until then, I will have to continue to use "breaking master" (in
this context) simply for lack of a better phrase. I tried to explain what
that phrase means to avoid misunderstanding. I cannot find a better way
to compress that explanation into a single phrase.


> What is *actually* broken is the CI test results.

One can easily argue that CI test results are actually "correct" in this
case -- they correctly discover a bug in the code that wants to become
the next master. The problem is in the assignment of responsibility: The
PR did not introduce that bug, so the PR should not be punished for that
bug. The only way to avoid such punishment (given the natural limitations
of automated CI) is to avoid breaking master tests, as previously defined.


> Otherwise, short periods between the sysadmin thinking a change was safe
> and reverting it as breakage appears are to be expected.

Well, it is debatable whether frequent breakages should be _expected_ --
there are certainly ways to avoid the vast majority of them, but I have
already agreed that we can survive breaking upgrade attempts, even
relatively frequent ones, provided the admin doing the upgrade monitors
CI and can quickly undo the attempts that break master, as previously
defined.


> I see here two "rolling release" distros being updated by the sysadmin
> from producing outdated and wrong test results to producing correct test
> results. This is a correct change, in line with the goal of our nodes
> representing what a user running that OS would see when building Squid
> master or PRs.

In general, the CI change is incorrect if it results in breaking master,
as previously defined. This particular correctness aspect does not
depend on what tests the CI change was meant to fix.


> IMO we can expect [this] to occur on a regular basis

I hope that Francesco will find a way to avoid creating a persistent
expectation of such CI-caused problems!


> We can resolve it by having those OSes build only in the N-matrix
> applied before releases, instead of in the matrix blocking PR tests or
> merging.

AFAICT, that would not really _resolve_ the problem, only delay its
manifestation until release time (when the stress related to fixing it
will be extreme, and the maintainer will be tempted to abuse the
pressure to release to push low-quality changes through the review process).


> If we are all agreed, kinkie or I can implement ASAP.

I would not strongly object to delaying N-matrix tests (whatever they
will be) until release time, but delaying the pain until then sounds like
a poor solution to me.

I think the best solution here would heavily depend on who is
responsible for adjusting the official code (to allow the anticipated CI
upgrade). Their resources/preferences/capabilities/etc. may determine
the best solution, the set of required tests, and the corresponding
rules. Today, nobody specific is responsible, and many volunteer
developers are often unaware of the failures or cannot quickly address
them. With some luck, the changes that Francesco has started to propose
will improve this situation.

One solution that, IMO, would be better than delaying N-matrix tests
would be to make tests on "rolling" OSes optional instead of required
(but still run them on all PRs). Splitting merge tests into required and
optional would attract developer attention without blocking innocent PRs.
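
For illustration only, here is a minimal sketch of what such a split
could look like if GitHub branch protection were used to define the
required set. The repository, branch, status-check names, and token
handling below are assumptions for the sake of the example, not our
actual Jenkins/Anubis configuration:

    # Hypothetical sketch: keep the stable-OS matrix as the only
    # *required* status check; a rolling-OS matrix still runs and
    # reports on every PR but no longer blocks merging.
    import os
    import requests

    REPO = "squid-cache/squid"
    BRANCH = "master"
    REQUIRED_CONTEXTS = [            # made-up status-check names
        "5-pr-test",                 # stable-OS matrix: must pass to merge
    ]
    # A "5-pr-test-rolling" check would still be reported by Jenkins,
    # but omitting it here makes it advisory (visible, non-blocking).

    resp = requests.patch(
        f"https://api.github.com/repos/{REPO}/branches/{BRANCH}"
        "/protection/required_status_checks",
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github.v3+json",
        },
        json={"strict": True, "contexts": REQUIRED_CONTEXTS},
    )
    resp.raise_for_status()
    print("Required contexts now:", resp.json().get("contexts"))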


>>> B. PR submission testing
>>>    - which OS for master (5-pr-test) ?
>>>    - which OS for beta (5-pr-test) ?
>>>    - which OS for stable (5-pr-test) ?
>>>
>>> Are all of those sets the same identical OS+compilers? no.
>>> Why are they forced to be the same matrix test?
>>
>> I do not understand the question. Are you asking why Jenkins uses the
>> same 5-pr-test configuration for all three branches (master, beta, _and_
>> stable)? I do not know the answer.

> So can we agree that they should be different tests?

I, personally, cannot answer that specific question in a meaningful way,
but I doubt I would strongly object to virtually any Jenkins changes
related to the stable and beta test sets. I continue to view those
branches (but not their branching points!) as primarily the maintainer's
responsibility. If others do not object, you should feel free to make
them different, IMO. Just keep the GitHub/Jenkins/Anubis integration in
mind when you do so (e.g., by keeping the _number_ of GitHub-visible
tests or "status checks" the same across all branches).


>>> C. merge testing
>>>    - which OS for master (5-pr-auto) ?
>>>    - which OS for beta (5-pr-auto) ?
>>>    - which OS for stable (5-pr-auto) ?
>>>      NP: maintainer does manual override on beta/stable merges.
>>>
>>> Are all of those sets the same identical OS+compilers? no.
>>>   Why are they forced to be the same matrix test? Anubis
>>
>> This is too cryptic for me to understand, but Anubis does not force any
>> tests on anybody -- it simply checks that the required tests have
>> passed. I am not aware of any Anubis bugs in this area, but please
>> correct me if I am wrong.


> My understanding was that Anubis only has the ability to check PRs
> against its auto branch, which tracks master.

Anubis documentation[1] defines staging_branch (i.e. our "auto") as a
branch for testing PR changes as if they were merged into the PR target
branch. The target branch is defined by each GitHub PR. AFAICT, Anubis
does not really have a high-level concept of a "master" branch. The auto
branch should contain whatever target branch the PR is using.

[1] https://github.com/measurement-factory/anubis/blob/master/README.md
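
To make that concrete, here is a conceptual sketch (my illustration,
not Anubis's actual code) of what staging a PR on such a branch amounts
to: the PR head is merged on top of whatever target branch the PR
selects, and the resulting merge commit is what CI tests.

    # Conceptual sketch only -- NOT Anubis's implementation. It shows
    # the idea of testing a PR "as if merged" into its GitHub-selected
    # target branch.
    import subprocess

    def run(*cmd: str) -> None:
        subprocess.run(cmd, check=True)

    def stage_pr(target_branch: str, pr_ref: str,
                 staging_branch: str = "auto") -> None:
        """Recreate the staging branch as <target branch> + <PR changes>."""
        # Start the staging branch from the PR's target (master, v4, ...).
        run("git", "fetch", "origin", target_branch)
        run("git", "checkout", "-B", staging_branch,
            f"origin/{target_branch}")
        # Merge the PR head; the merge commit is what CI should test.
        run("git", "fetch", "origin", pr_ref)  # e.g. "pull/<number>/head"
        run("git", "merge", "--no-ff", "--no-edit", "FETCH_HEAD")

    # A PR targeting v4 would thus be staged as v4 + PR, not master + PR:
    # stage_pr("v4", "pull/<number>/head")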


> The ability to have it track other non-master branches and merge there
> is not available for use.

AFAICT, Anubis is tracking v4 PRs but you are ignoring it. For example,
merged PR #815 has a "waiting for more votes" status check added by
Anubis to the last commit (f6828ed):
https://github.com/squid-cache/squid/pull/815

There may be some integration problems that I am not aware of, but I
think everything should work in principle. AFAICT, you are forcing
manual v4/v5 commits instead of following the procedure we use for
master, but that is your decision. Please note that I am not saying that
your decision is right or wrong, just documenting my understanding and
observations.


> IMO we should look into this. But it is a technical project for sysadmin
> + Eduard to coordinate. Not a policy thing.

What tests are blocking and what are optional is a policy thing. Whether
it is OK to break master, as previously defined, is a policy thing. How
to implement those and other policies is usually a technicality indeed.


>> AFAICT, Francesco and I are on the same page regarding not breaking
>> master anymore -- he graciously agreed to prevent such breakages in the
>> future, and I am very thankful that he did. Based on your comments
>> discussing several cases where such master breakage is, in your opinion,
>> OK, you currently disagree with that principle. I do not know why.

> I think we differ in our definitions of "breaking master". You seem to
> be including breakage of things in the CI system itself, which I
> consider outside of "master", or expected results of normal sysadmin
> activity.

I hope my definition of "breaking master" in this CI change context is
clear by now. I still do not know why you object to prohibiting that,
under any name.


> I hope my response to the two use-cases you present at the top of this
> email clarify.

Unfortunately, those attacks on my choice of words did not help me
understand why you are opposed to the principles those words describe.
You are welcome to propose better phrasing, of course, but that would
not change what actually happened or my desire to prohibit those kinds
of events in the future.

Alex.

