Intent to Implement System Add-on: SHIELD/Normandy
mkelly at mozilla.com
Fri Sep 30 17:25:52 UTC 2016
CC+ the rest of the SHIELD dev team, because I'm silly and forgot to CC
On 9/30/16 6:41 AM, Gijs Kruitbosch wrote:
> Security review is separate from Firefox's "regular" code review.
> One-line patches won't normally need a separate security review, but
> will still (almost) always need code review. I'm not really speaking
> about the security review aspect of it, just about Firefox code review.
> If the practices and steps for security review are unclear, we should
> get that sorted.
Aha, got it. I don't think the process of getting a review itself is
unclear, rather the policy of things being the patch author's
responsibility wasn't clear. In our case we knew security was an
important concern for SHIELD and reached out early.
>> What I'm getting from your answer is that it is the Firefox peer's
>> responsibility to judge whether what they're merging needs review from
>> someone else besides the peer
> No, I would say that the responsibility for "who needs to look at this
> patch" is on the patch author. If the patch author is not sure, they
> should ask for help (and probably ask prospective reviewers!), but it
> shouldn't be assumed that setting r?<peer> on bugzilla is equivalent to
> delegating all the due diligence to said peer.
Yep, that's the bit I was missing. Thanks!
>>> OK, so for this one the answer is "no". This concerns me because that
>>> code ultimately affects how Firefox-the-product behaves (outside of what
>>> webpages can do) and normally such changes are subject to Firefox's
>>> review policy. This is the case for other go-faster add-ons (e.g. I have
>>> reviewed a number of changes to the pocket "go faster" add-on in the
>>> last few weeks). When this was not the case for self-repair code, that
>>> burned us.
>> Changes to the sandbox require a Firefox review since they're part of
>> the system add-on code. The intent is that reviewers will use that
>> opportunity to ensure only new behaviors that they're comfortable with
>> are added. For example, instead of adding a "change any pref" function,
>> we'd add a function that can change prefs from a whitelist of approved
>> prefs. This helps reduce the impact of bugs in the JS code.
> Yes, but I wasn't talking about bugs in the sandbox JS code for the
> atomic functionality.
Oh, I assumed you meant the JS specifically based on the lines you
quoted. My bad!
> SHIELD will likely need things like "show a thing to the user". If "show
> a thing to the user" happens every startup, or not at all, or ... - that
> is bad. If the recipe controls when we call the "show a thing to the
> user" method, and the recipe has a bug so we end up not showing the
> thing at the right times, that's not a bug you can catch by reviewing
> all the sandbox code that just handles "if this method is called, show
> Obviously we will want to have logic in the recipes, because the "whole
> point" of the model in use here is so that there's flexibility and we
> won't have to hand-code a new add-on for every study people want to run.
> That's fine, I'm just saying that the logic in the recipes is important,
> can be buggy and cause problems, and therefore needs appropriate review.
>> Also, all the behavior changes S&I put out via SHIELD involve approvals
>> via bugs and emails with external stakeholders and product owners.
> Sure, but that is not the same as code review. Approval for "it should
> do X" is not the same as "I've reviewed the code and am confident it
> does X rather than Y".
Ah, here I thought you had some concern around behavior changes
(abstract to the code implementing those changes) that didn't go through
the review process. If code review is the specific thing than I think
we're mostly on the same page.
We enforce code review currently on recipe/action code (which happens on
Github) and restrict write access on the repo so only approved reviewers
can merge the code. We also have protected branches and other nice
conveniences enabled, but those aren't _particularly_ strong technical
Is there more we should be doing here? Is it a question of who the
reviewers are, or maybe how we keep track of approved reviewers?
>> Experiments have a process for data and implementation review, and the
>> sampling we use on these recipes makes the impact of issues pretty small.
> Can you elaborate on this? In the past these recipes have gone out to
> the entire (en-US) user population, and any sampling had to happen
> *within* the equivalent of the 'recipe' (so you generate a random number
> in the recipe code and check if it's in some bucket before doing the
> thing the recipe says to do). If that code was/is buggy, the impact
> could still be severe. Is the current setup architectured differently so
> this happens within the server software (and some clients will ask for a
> recipe and get "nope, nothing here") ? Or something else?
With the system add-on, the main point where sampling occurs happens
before executing the JS in the sandbox. The filters that are sent along
with the recipe (including sampling) are evaluated client-side by the
It's conceivable that an action might do some extra filtering when it's
run. For example, the survey action we have checks to see if the user
has already seen the survey for that specific recipe and does nothing if
the user has. But for the most part, sampling is implemented in one place.
>> If a recipe is broken, we can disable it in one step with the
>> admin interface, whereas self-repair required writing and shipping a
>> patch to the self-repair repo.
> Maybe, but the upside there was that when things went pearshapes I could
> find someone to give me access to said github repo so I could revert
> something and it got autodeployed, even when it wasn't really normal
> daytime in the US.
> Can we ensure IT/ops/moc have access to this thing so they can turn
> things off (day or night) if they go pearshapes (which is what we do for
> other web things, AIUI) ?
After the last email, I set up a meeting with the S&I team next week to
talk about exactly this. :D
>>> Followup question: does this code in any way follow a train model (ie
>>> get tested against Firefox Nightly users first) ? The self-repair code
>>> didn't and just ran against 10% of release users to test it, which,
>>> again, burned us in the past.
>> It doesn't follow a train model, but we often run new recipes / actions
>> against small percentages (0.001%, for example) to test if it performs
>> as we expect before bumping up the sample rate. We are also capable of
>> running recipes on specific channels if the risk is deemed high enough
>> to warrant that.
> Great! Still interested in how the sampling is happening.
>>> Related, why can't the experiments people "want" to run be run against
>>> Nightly users first to test the viability of the system add-on and the
>>> rest of the platforms involved with running these
>>> recipes/experiments/snippets? Why the need to immediately uplift to
>> We have ways of doing basic "Does it work, does it break everything"
>> testing on Nightly that we plan to perform, but that won't take terribly
>> long to run. The interesting tests we want to do that require real
>> surveys we want to test are useful because they involve testing the
>> entire pipeline from downloading and displaying a Heartbeat prompt to
>> analyzing the results.
>> We also sample these prompts (to avoid spamming users with repeated
> I'm really sorry, but this makes it sound like the only way you have of
> avoiding a user gets shown a prompt twice is by using low sampling
> percentages. That... that wouldn't be good (also because birthday
> paradox). Please tell me I'm reading too much into this and you have
> better ways of avoiding "repeat" showings? (FWIW, as I raised this exact
> issue earlier in the email, this was one of the issues the self-repair
> stuff had and I hope we've learned from that...)
Oh, no, it's not. More accurately, we sample the prompts to avoid
spamming users with several _different_ prompts to avoid prompt fatigue.
I mentioned this above, but actions that prompt the user (including the
survey one) generally store a flag that they've been shown and never
show themselves to an individual user more than once. I know self-repair
ran into some issues with how it stored this kind of flag, and we've got
an issue open for dealing with that kind of problem:
>> and rate of engagement for the prompt is low enough that
>> running a survey on just the Nightly population won't result in much
>> actual testing of the full flow through the prompt.
> I assume we could run things on a higher sample rate against Nightly to
> avoid this problem and still get testing of the full flow, also by QA,
> on "production" Nightlies + servers?
Well, we could bump the rate up high (how many Nightly users do we have
right now?), but I dunno how the product team would feel about hitting
[many/most/all?] Nightly users with a bunch of test prompts.
For QA we'd probably just have them set a pref we can filter on to a
test value and filter on that to target them specifically. I think the
add-on may even have something like this already? Mythmon would know
better than I would.
>> Also also, there is some level of "omg the data from this system is
>> useful enough that if we can get it 6 weeks sooner that'd be awesome"
>> behind wanting to uplift. The backlog of requested studies is
>> looooooooong, and landing the system add-on is key to increasing the
>> rate of studies. These studies help us improve Firefox in ways we know
>> make a difference, since we've measured them.
> Right. We're relatively early in the cycle - I'm not opposed to uplift
> per se, I would just like to see this tested on Nightly (like most other
> things) before that happens.
Yep. The timing of getting out earlier coincides with the greater
opportunity for testing on Aurora vs Nightly, but I don't want to skip
Nightly completely or anything. We still intend to do some testing on
Nightly before requesting uplift.
- Mike Kelly
More information about the firefox-dev