Several leak failures have slipped past continuous integration

Andrew Halberstadt ahalberstadt at mozilla.com
Thu Dec 29 16:37:15 UTC 2016


Over the holidays, we noticed that leaks in mochitest and reftest were
not turning jobs orange, and that the test harnesses had been running in
that state for quite some time. During this time several leak-related
test failures landed, which can be tracked with this dependency tree:
https://bugzilla.mozilla.org/showdependencytree.cgi?id=1325148&hide_resolved=0

The issue causing jobs to remain green has been fixed. However, the known
leak regressions had to be whitelisted to allow this fix to land. So
while future leak regressions will properly fail, the existing ones (in 
the dependency tree) still need to be fixed. For mochitest, the 
whitelist can be found here:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#2218

Other than that, leak checking is disabled only on Linux crashtests.

Please take a quick look to see if there is a leak in a component for 
which you could help out. I will continue to help with triage and 
bisection for the remaining issues until they are all fixed. Also, a big
thanks to all the people who are currently working on a fix or have
already landed one.

Read on only if you are interested in the details.


_Why wasn't this caught earlier?_

The short answer to this question is that we do not have adequate
testing of our CI.

The problem happened at the intersection between mozharness and the
test harnesses. Basically a change in mozharness exposed a latent bug in 
the test harnesses, and was able to land because it appeared as if 
nothing went wrong. Catching errors like this is tricky because regular 
unit tests would not have detected it either. It requires integration 
tests of the CI system as a whole (spanning test harnesses, mozharness 
and buildbot/taskcluster).
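
To illustrate the kind of check that was missing, here is a minimal
sketch of an integration-style test that runs a stand-in harness and a
stand-in CI layer together and asserts that a leak changes the job
colour. Everything in it is hypothetical: run_harness, job_status and
the log line format are assumptions made for illustration, not the
actual mochitest or mozharness code.

```python
import re

# Hypothetical log format, assumed for illustration only.
LEAK_PATTERN = re.compile(r"TEST-UNEXPECTED-FAIL \| leakcheck \|")


def run_harness(leaked_bytes):
    """Stand-in for a test harness: return the log lines it would print."""
    lines = ["TEST-START | test_example.html", "TEST-OK | test_example.html"]
    if leaked_bytes > 0:
        lines.append(
            "TEST-UNEXPECTED-FAIL | leakcheck | %d bytes leaked" % leaked_bytes
        )
    return lines


def job_status(log_lines):
    """Stand-in for the CI layer: turn a harness log into a job colour."""
    for line in log_lines:
        if LEAK_PATTERN.search(line):
            return "orange"
    return "green"


def test_leak_turns_job_orange():
    # Exercise both layers together, not each one in isolation.
    assert job_status(run_harness(0)) == "green"
    assert job_status(run_harness(1024)) == "orange"


test_leak_turns_job_orange()
```

A unit test of either function alone would keep passing if the two
sides stopped agreeing on the log format. Only a test that spans both
would go red.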


_How will we prevent this in the future?_

Historically, integration testing our test harnesses has been a hard 
problem. However, with recent work in taskcluster, python tests and some
refactoring on the build frontend, I believe there is a path forward 
that will allow us to stand up this kind of test. I will commit some of 
my time to fix this and hope to have /something/ running that would have 
caught this by the end of Q1.

I would also like to stand up a test harness designed to test command 
line applications in CI, which would provide another avenue for writing 
test harness unit and integration tests. Bug 1311991 
<https://bugzilla.mozilla.org/show_bug.cgi?id=1311991> will track this work.

It is important that developers are able to trust our tests, and when 
bugs like this happen, that trust is eroded. For that I'd like to
apologize, and express my hope that this will be the last time a major
test result bug like this happens. At the very least, we need to
have the capability of adding a regression test when a bug like this 
happens in the future.

Thanks for your help and understanding.
- Andrew