[squid-dev] Strategy about build farm nodes

squid3 at treenet.co.nz
Mon May 17 02:19:41 UTC 2021


On 2021-05-17 11:56, Alex Rousskov wrote:
> On 5/16/21 3:31 AM, Amos Jeffries wrote:
>> On 4/05/21 2:29 am, Alex Rousskov wrote:
>>> On 5/3/21 12:41 AM, Francesco Chemolli wrote:
>>>> - we want our QA environment to match what users will use. For this
>>>> reason, it is not sensible that we just stop upgrading our QA nodes,
>>> 
>>> I see flaws in reasoning, but I do agree with the conclusion -- yes, 
>>> we
>>> should upgrade QA nodes. Nobody has proposed a ban on upgrades 
>>> AFAICT!
>>> 
>>> The principles I have proposed allow upgrades that do not violate key
>>> invariants. For example, if a proposed upgrade would break master, 
>>> then
>>> master has to be changed _before_ that upgrade actually happens, not
>>> after. Upgrades must not break master.
>> 
>> So ... a node is added/upgraded. It runs and builds master fine. Then
>> added to the matrices some of the PRs start failing.
> 
> It is easy to misunderstand what is going on because there is no good
> visualization of complex PR-master-Jenkins_nodes-Jenkins_failures
> relationships. Several kinds of PR test failures are possible. I will
> describe the two most relevant to your email:
> 
> * PR test failures due to problems introduced by PRs should be welcomed
> at any time.

Strawman here. That is both a general statement and not relevant to the 
CI changes or design(s) we are discussing.

> CI improvements are allowed to find new bugs in open PRs.

IMO the crux is that word "new". CI improvements very rarely find new 
bugs. What they actually find, and intentionally so, are *existing bugs* 
that the old CI config wrongly ignored.

> Such findings, even when discovered at the "last minute", should be 
> seen
> as an overall positive event or progress -- our CI was able to identify
> a problem before it got officially accepted! I do not recall anybody
> complaining about such failures recently.
> 

The conclusion being that, because such "new bugs" are rare, CI 
improvements very rarely attract complaints on that account.


> * PR test failures due to the existing master code are not welcomed.

That is not as black/white as the statement above implies. There are 
some master branch bugs we do not want blocking PR merges, and there are 
some (rarely) where we absolutely do not want any PR to change master 
until they are fixed.

> They represent a CI failure.

IMO this is absolutely false. The whole point of improving CI is to find 
those "existing" bugs which the previous CI config wrongly missed.

e.g. v4+ currently do not build on Windows. We know this, but the 
current CI testing does not show it. Upgrading the CI to include a test 
for Windows is not a "CI failure".


> In these cases, if the latest master code
> is tested with the same test after the problematic CI change, then that
> master test will fail. Nothing a PR can do in this situation can fix
> this kind of failure because it is not PR changes that are causing the
> failure -- CI changes broke the master branch,

Ah. "broke the master branch" is a bit excessive. master is not broken 
any more or less than it already was.

What *actually* broke are the CI test results.


> not just the PR! This
> kind of failures are the responsibility of CI administrators, and PR
> authors should complain about them, especially when there are no signs
> of CI administrators aware of and working on addressing the problem.
> 

*IF* all the conditions and assumptions contained in that final sentence 
are true, I would agree. Such a case points to incompetence or neglect on 
the part of the sysadmin who broke *the CI test* and then abandoned 
fixing it - complaints are reasonable there.

  [ Is kinkie acting incompetently on a regular basis? I think not. ]

Otherwise, a short period between the sysadmin thinking a change was 
safe and reverting it once breakage appears is to be expected. That is 
why we have the sysadmin posting advance notices, so that we are all 
aware of planned CI changes. Complaints still happen, but they are not 
much reason to redesign the sysadmin practices and automation (which 
would be yet more CI change, ...).


> A good example of a failure of the second kind is a -Wrange-loop-construct
> error in a PR that does not touch any range loops (Jenkins conveniently
> deleted the actual failed test, but my GitHub comment and PR contents
> may be enough to restore what happened):
> https://github.com/squid-cache/squid/pull/806#issuecomment-827924821
> 

Thank you.

I see here two "rolling release" distros being updated by the sysadmin 
from producing outdated and wrong test results to producing correct test 
results. That is a correct change, in line with the goal of our nodes 
representing what a user running that OS would see when building Squid 
master or PRs.
   One distro changed compiler, and both turned on a new warning by 
default, which exposed existing Squid bugs. Exactly as intended.
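
For illustration only - this is not the code from that PR, just the 
general shape of the problem. Code like the following has always 
compiled cleanly, until a compiler upgrade puts -Wrange-loop-construct 
into -Wall and a strict (warnings-as-errors) test build turns the new 
warning into a hard failure:

    #include <cstddef>
    #include <string>
    #include <vector>

    static std::size_t
    totalLength(const std::vector<std::string> &names)
    {
        std::size_t total = 0;
        // Newer compilers (e.g. GCC 11 with -Wall) warn here: the loop
        // variable copies every element [-Wrange-loop-construct].
        for (const std::string name : names) // missing '&'
            total += name.size();
        return total;
    }

The fix is trivial (use "const auto &name"), but it belongs in the Squid 
code, not in the CI configuration.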

IMO we can expect this to occur on a regular basis, and it is specific 
to "rolling release" distros. We can resolve it by having those OSes 
build only in the N-matrix applied before releases, instead of in the 
matrices blocking PR tests or merging (rough sketch below).

  If we are all agreed, kinkie or I can implement ASAP.
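
  As a rough sketch (the OS names are only illustrative examples of 
  fixed- vs rolling-release distros, not the farm's actual node list):

    5-pr-test / 5-pr-auto  : fixed-release OSes only
                             (e.g. Debian stable, Ubuntu LTS, FreeBSD)
    trunk-matrix / N-matrix: all of the above plus rolling-release OSes
                             (e.g. Tumbleweed, Rawhide, Gentoo)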


<skip>
> 
>> B. PR submission testing
>>    - which OS for master (5-pr-test) ?
>>    - which OS for beta (5-pr-test) ?
>>    - which OS for stable (5-pr-test) ?
>> 
>> Are all of those sets the same identical OS+compilers? no.
>> Why are they forced to be the same matrix test?
> 
> I do not understand the question. Are you asking why Jenkins uses the
> same 5-pr-test configuration for all three branches (master, beta, 
> _and_
> stable)? I do not know the answer.
> 

So can we agree that they should be different tests?

  If we are all agreed, that can be implemented.

Once the tests are separated, we can choose which OSes answer the 
questions I posed above.

My idea is to go through distrowatch (see the attached file) and sync 
each test with the OSes that provide that vN (or lower) of Squid as part 
of their release. Of course, any additions wanted would follow the 
sysadmin testing process.


> 
>> IIRC, policy forced on sysadmin with previous pain complaints.
> 
> Complaints, even legitimate ones, should not shape a policy. Goals and
> principles should do that.
> 
> I remember one possibly related discussion where we were trying to
> reduce Jenkins/PR wait times by changing which tests are run at what PR
> merging stages, but that is probably a different issue because your
> question appears to be about a single merging stage.
> 

I think it was the discussion that re-invented the policy, prior to the 
performance one.

> 
>> C. merge testing
>>    - which OS for master (5-pr-auto) ?
>>    - which OS for beta (5-pr-auto) ?
>>    - which OS for stable (5-pr-auto) ?
>>      NP: maintainer does manual override on beta/stable merges.
>> 
>> Are all of those sets the same identical OS+compilers? no.
>>   Why are they forced to be the same matrix test? Anubis
> 
> This is too cryptic for me to understand, but Anubis does not force any
> tests on anybody -- it simply checks that the required tests have
> passed. I am not aware of any Anubis bugs in this area, but please
> correct me if I am wrong.
> 

My understanding was that Anubis only has the ability to check PRs 
against its auto branch, which tracks master. The ability to have it 
track other non-master branches and merge there is not available for use.

If that ability were available, we would need to implement different 
matrices (as with N-pr-test) to use it without guaranteed pain points.

IMO we should look into this. But it is a technical project for the 
sysadmin and Eduard to coordinate, not a policy thing.


> 
>> D. pre-release testing (snapshots + formal)
>>    - which OS for master (trunk-matrix) ?
>>    - which OS for beta (5-matrix) ?
>>    - which OS for stable (4-matrix) ?
>> 
>> Are all of those sets the same identical OS+compilers? no.
>> Are we forcing them to use the same matrix test? no.
>> Are we getting painful experiences from this? maybe.
>>   Most loud complaints have been about "breaking master" which is the
>> most volatile branch testing on the most volatile OS.
> 
> FWIW, I think you misunderstood what those "complaints" were about. I
> do not know how that relates to the above questions/answers though.
> 

Maybe. Our differing views on what comprises "breaking master" certainly 
confuse interpretations when that phrase is used as the 
problem/complaint/report description.


> 
>> FTR: the reason all those matrices have '5-' prefix is because several
>> redesigns ago the system was that master/trunk had a matrix which the
>> sysadmin added nodes to as OS upgraded. During branching vN the
>> maintainer would clone/freeze that matrix into an N-foo which would be
>> used to test the code against OS+compilers which the code in the vN
>> branch was designed to build on.
> 
> I think the above description implies that some time ago we were (more)
> careful about (not) adding new nodes when testing stable branches. We
> did not want a CI change to break a stable branch. That sounds like the
> right principle to me (and it should apply to beta and master as well).
> How that specific principle is accomplished is not important (to me) so
> CI admins should propose whatever technique they think is best.
> 
> 
>> Can we have the people claiming pain specify exactly what the pain is
>> coming from, and let the sysadmin/developer(s) with specialized
>> knowledge of the automation in that area decide how best to fix it?
> 
> We can, and that is exactly what is going on in this thread AFAICT. 
> This
> particular thread was caused by CI changes breaking master, and
> Francesco was discussing how to avoid such breakages in the future.
> 
> There are other goals/principles to observe, of course, and it is
> possible that Francesco is proposing more changes to optimize something
> else as well, but that is something only he can clarify (if needed).
> 
> AFAICT, Francesco and I are on the same page regarding not breaking
> master anymore -- he graciously agreed to prevent such breakages in the
> future, and I am very thankful that he did. Based on your comments
> discussing several cases where such master breakage is, in your 
> opinion,
> OK, you currently disagree with that principle. I do not know why.
> 

I think we differ in our definitions of "breaking master". You seem to 
be including breakage of things within the CI system itself, which I 
consider outside of "master", or expected results of normal sysadmin 
activity. I hope my responses to the two use-cases you present at the top 
of this email clarify that.


Amos
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: squid_distros_2021.txt
URL: <http://lists.squid-cache.org/pipermail/squid-dev/attachments/20210517/1ca3ab6e/attachment-0001.txt>

