Logging of "quality events"

Benjamin Smedberg benjamin at smedbergs.us
Wed Sep 18 20:52:39 UTC 2013

Right now our stability efforts are primarily focused on crashes. 
However, as we have been very successful at reducing our crash rate, 
some other stability issues which are not crashes have come to be more 
prominent. Some examples:

* very slow startup
* very slow/hung shutdown
* hangs while running
* JS errors which cause parts or all of the UI to stop functioning properly
* localization errors, especially entity/DTD errors which cause parts of 
the UI to be ugly or missing

Prompted by several discussions during the stability work week, we need 
to broaden our focus within stability and deal with many more of these 
kinds of events.

The first technical step in this effort needs to be a unified log of 
failure events which includes all types of failure events. This will 
enable two new features right away:

* When a failure event happens and then there is a crash, all the 
failure events leading up to the crash should be contained within the 
crash report.
* Support-facing mechanisms (about:support or perhaps the web frontend 
for FHR) will be able to display recent error events to the user and 
allow the log to be copied into SUMO issues or bug reports.
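
As a rough illustration of the first point, here is a minimal Python sketch
of an in-memory buffer that keeps only the most recent failure events, so
they could be attached to a crash report or shown in about:support. The
class name, field names, and default capacity are all hypothetical; nothing
here reflects an existing Firefox API.

```python
import time
from collections import deque


class RecentFailures:
    """Keep the last `capacity` failure events in memory (hypothetical helper)."""

    def __init__(self, capacity=50):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, kind, message, timestamp=None):
        """Store one event; `kind` is a free-form label such as "hang"."""
        self._events.append({
            "time": time.time() if timestamp is None else timestamp,
            "kind": kind,
            "message": message,
        })

    def snapshot(self):
        """Return the retained events, oldest first, e.g. for a crash report."""
        return list(self._events)

    def as_text(self):
        """Render the events as plain text that can be pasted into a bug."""
        return "\n".join(
            "{time:.0f} [{kind}] {message}".format(**e) for e in self._events
        )
```

Bounding the buffer keeps the crash report small; the right capacity is an
open question.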

After we've tested logging features in the wild, we will likely build 
this out into a more complete support mechanism:

* include counts/histograms of error events within the FHR payload 
itself, to correlate errors across user populations and identify common 
* combine about:support and FHR user interfaces into a unified 
troubleshooting UI, allow users to submit error reports for non-crash 
events (including comments about their issues), and hopefully provide 
users with automated solutions to common problems (on B2G, this will be 
a support/troubleshooting app built into the system?)
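
To make the counts idea concrete, here is a small Python sketch that rolls
logged events up into per-day counts by event type. The "day" and "kind"
fields and the shape of the result are assumptions made for illustration;
this is not the actual FHR payload format.

```python
from collections import Counter


def count_events_by_day(events):
    """Aggregate events into {day: {kind: count}} (hypothetical layout)."""
    counts = Counter()
    for event in events:
        counts[(event["day"], event["kind"])] += 1

    payload = {}
    for (day, kind), n in sorted(counts.items()):
        payload.setdefault(day, {})[kind] = n
    return payload
```

Only the counts leave the machine in this sketch, not the event text.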

Technically, I'm not exactly sure how to accomplish this kind of 
logging, but whatever system we have should be fairly robust:

* the log must be writable from multiple processes, for B2G, 
multiprocess Firefox, and even Firefox webapp support (note that 
hopefully soon we'll be collecting crash reports from every process on 
B2G devices using debuggerd, not just the B2G/app/content processes)
* the log must be writable from multiple threads (even if the main 
thread is deadlocked) so that we can monitor and write hang-detector 
information to the log
* individual log entries such as hang reports may need to contain data 
(such as SPS profiles, or invalid responses from Mozilla services)
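
One possible shape for this, sketched in Python purely for illustration: an
append-only file with one JSON record per line, where each record is written
with a single write() call on a descriptor opened with O_APPEND. The function
names and record fields are hypothetical. The sketch assumes a local POSIX
filesystem and records small enough to go out in one write; it does not
handle partial writes, log rotation, or platforms where appends from several
processes can interleave.

```python
import json
import os
import threading
import time


def append_event(path, kind, message, data=None):
    """Append one failure event as a single JSON line (hypothetical format)."""
    record = {
        "time": time.time(),
        "pid": os.getpid(),
        "thread": threading.current_thread().name,
        "kind": kind,
        "message": message,
        "data": data,  # optional attached data; must be JSON-serializable
    }
    line = (json.dumps(record) + "\n").encode("utf-8")

    # O_APPEND makes the OS position every write at the current end of file,
    # so writers in other threads or processes do not need a shared lock.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, line)  # the whole record goes out in one call
    finally:
        os.close(fd)


def read_events(path):
    """Read back all events, oldest first."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the writer takes no lock, a hang-detector thread could call
append_event() even while the main thread is deadlocked.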

Does anyone know of prior art that we could apply to this problem, or 
suggestions for how to implement this kind of logging safely, correctly, 
and efficiently? It's possible that the system will need to be different 
across platforms, using a logging service on B2G, some kind of native 
logging system on Android, and a custom-built system on desktop.

If people have suggestions for other types of error log events that 
should be included in this system, please let me know.

