tag:blogger.com,1999:blog-52325776213849625172024-03-14T12:45:46.943+01:00Garf's blogGian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.comBlogger14125tag:blogger.com,1999:blog-5232577621384962517.post-43242579247069566102021-02-25T14:44:00.002+01:002021-02-25T15:27:12.942+01:00Understanding how Leela Chess Zero works<p>I wrote the following article in May 2018, now almost 3 years ago. During an interview with Albert Silver (an author affiliated with ChessBase, for whom the article was supposedly intended) about <a href="https://github.com/LeelaChessZero">Leela Chess Zero</a>, there were questions about where to find more information, in an easy-to-understand manner, on how the AlphaZero ideas worked in chess and how they were implemented in Leela. At that point, most of the easily accessible information still referred to the Go engine, so I made an attempt to clear up some confusion.<br /></p><p>The article and interview were never published (<a href="https://en.chessbase.com/post/leela-chess-zero-alphazero-for-the-pc">despite being announced</a>), but ChessBase would go on to re-publish Leela Chess Zero as "Fat Fritz 1" a year later. </p><p>As I had already spent the time writing it, and it might still be interesting to my audience, I'll put it here on my blog, in the hope that someone finds it useful.</p><h3 style="text-align: left;">Understanding how Leela Chess Zero works (2018)</h3><h4 style="text-align: left;">How the neural network works, and what it calculates.<br /></h4><p>The neural network takes as input a stack of 8 x 8 bitmaps, enough to fully represent the state of a chess game. 
Most obviously, this will be the position of the pieces ("is there a pawn belonging to the player to move on this square"), but also some non-visible things, such as "is the player to move still allowed to castle" and "is there an en-passant move possible?"<br /><br />These inputs are passed through a deep stack of image filters. These stacks are typically 20 to 80 layers deep, and typically have 128 to 256 outputs per layer, which then also form the input for the next layer. In every layer, every output square is calculated by taking the corresponding square and the surrounding ones (in total a 3x3 square) from all outputs from the previous layer, and applying an image filter to them, to compute a new output. These filters allow the network to gradually compute concepts such as "is this pawn isolated" or "is there an opposing pawn one rank up" and combine them. For example, the next layer can now compute "is there an opposing pawn one or two ranks up", until it arrives at higher-level concepts such as "is this pawn on an open file", which it can then combine with previously discovered ones, to form features such as "is this pawn isolated on an open file". Note that no-one is explaining these features to the program or programming them: it has to discover them all by itself by analyzing millions of chess games. Because some processing in the stack (which we won't go into detail about) is only done every 2 layers, they are typically thought of as a unit and referred to as a single "residual block".<br /><br />There are two final layers: in the first one, all outputs from the above "stack" are combined and reworked so they map to the possible moves in the position. The network produces a probability for each move that it is the best. This is called the "policy prior".<br /><br />In the other output, all outputs from the stack are combined to calculate a single value: how likely is it that the player to move wins the game. 
This is, in essence, the evaluation of the position.<br /></p><h4 style="text-align: left;">Now, we move on to the tree search.<br /></h4><p>The search algorithm used by Leela Zero is Monte Carlo Tree Search (referred to as MCTS). It works as follows: first, we evaluate the neural network on the root position. This gives us both an initial evaluation of the position and a list of possible moves with an associated (policy) prior. Now, we investigate the possible moves in this root node. We will try to find the move that still has odds of being the best (the one with the highest upper confidence bound). If we have not looked at the position before, this will be the one with the highest prior. If we have, this will most of the time be the move that has been scoring best (leading to the position that seems best for the player to move), but sometimes also a move that does not have a terrible score but that we have not looked at very much yet and has a reasonable policy prior (which could still turn out to be good). Perhaps surprisingly, the mathematics underlying this procedure come from the optimal strategy for playing slot machines! Once we have found the most interesting move, we will play it and evaluate the network on the new position, adding it to our search tree. This procedure, finding the most promising sequence of moves along the search tree, and expanding the final leaf node with the neural network, is repeated until time runs out. Every time we evaluate the network to expand the search tree, we also back up the new evaluation for the final position, and add it to the nodes one level higher, i.e. closer to the root of the tree. This way, at every node in the tree, we keep an average score of the most promising lines of play following that position. 
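The selection and averaging just described can be written down compactly. Below is an illustrative Python sketch of the AlphaZero-style rule (known as "PUCT"), not Leela's actual implementation; the class layout and the exploration constant are made up for the example:

```python
import math

C_PUCT = 1.5  # exploration constant; illustrative value only

class Node:
    def __init__(self, prior):
        self.prior = prior      # P: policy prior from the network
        self.visits = 0         # N: how often this move was explored
        self.value_sum = 0.0    # sum of evaluations backed up through here

    @property
    def q(self):
        # Average score of the lines of play going through this node.
        return self.value_sum / self.visits if self.visits else 0.0

def select(children):
    """Pick the move with the highest upper confidence bound."""
    n_parent = sum(c.visits for c in children.values())
    def ucb(c):
        # Average score plus an exploration bonus: large for moves with
        # a high prior that have not been visited much yet.
        return c.q + C_PUCT * c.prior * math.sqrt(n_parent) / (1 + c.visits)
    return max(children, key=lambda move: ucb(children[move]))

def backup(path, value):
    """Add a new leaf evaluation to every node back towards the root."""
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        value = -value  # the side to move alternates every ply
```

With no statistics yet, the bonus term dominates and the highest-prior move is chosen; as visits accumulate, the average score Q takes over, matching the behaviour described above.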
As the program homes in on the best moves, this average will gradually shift to reflect the score along the best line of play for both players, converging towards the so-called minimax value that classical programs compute.<br /><br />Once time expires, we play the move we investigated most of all. This is almost always the move with the highest average score, but it could happen that a new promising move appears with a high score, yet we run out of time before we are able to investigate it more deeply.<br /><br />Note that despite the name, no Monte Carlo playouts are done at all, i.e. the program does not play out the position to the end during the search. The evaluation from the neural network serves as a high-quality approximation for the score the playouts would have produced, based on studying millions of completed training games.<br /></p><h4 style="text-align: left;">The training.<br /></h4><p>Leela Zero plays a lot of games against itself, running on the computers of the volunteers who have made their machines available for this. (And you can do so too!) On every move, Leela records what the best moves were after some amount of searching ahead, and at the end of the game, she notes who won. Note that this data corresponds to what the neural network produces! This then gets sent off to a central server. A training machine with fast GPUs collects the data, and adjusts the neural network to more accurately match it. 
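Conceptually, each recorded position becomes one training example: the visit-count distribution of the search is the target for the policy output, and the game result is the target for the value output. A hypothetical sketch (names and format invented for illustration, not Leela's actual training format):

```python
def training_example(visit_counts, side_to_move, winner):
    """Turn one searched position into (policy target, value target)."""
    total = sum(visit_counts.values())
    # The network should learn to predict where the search spent its time.
    policy_target = {move: n / total for move, n in visit_counts.items()}
    # ...and whether the side to move went on to win (None means a draw).
    if winner is None:
        value_target = 0.0
    else:
        value_target = 1.0 if winner == side_to_move else -1.0
    return policy_target, value_target
```

Training then nudges the network's two outputs towards these targets, which is how the strength gained from searching ahead flows back into the network itself.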
The new network is then sent back out to the volunteers.<br /><br />Because searching ahead improves the strength of the moves selected, a given network can produce training data that is stronger than its own output, and we get a cycle of gradual, continuous improvement.<br /><br />To increase the diversity of the games and to give the network a better chance to discover new things, the engine is very occasionally forced to play a sub-optimal move during the training games, and some noise is added to the policy predictions.<br /></p><h4 style="text-align: left;">The differences with a classical engine.<br /></h4><p>The evaluation of Leela Zero represents how often the side to move has been able to win from similar positions. This value effectively represents the program's own experience in actually playing those positions out, and it is natively expressed as a percentage. In a classical engine, the evaluation is expressed in pawn-equivalents, and the programmers spend their time painstakingly adding rules and making adjustments such as "is an isolated pawn on an open file worth -0.10 pawns or -0.15 pawns?" and "should it be -0.20 pawns if the opponent still has both rooks?". If we are talking about things such as how much a compromised king's position is worth after an exchange sacrifice, it is clear that trying to come up with good "pawn equivalents" or rules is a soul-crushing experience for the programmers.<br /><br />For compatibility with existing chess GUIs such as Fritz or ChessBase that expect the program to be such a bean counter, Leela Zero will convert her percentage to a pawn-equivalent score, based on database statistics on how well players score with various material advantages.<br /><br />This difference in evaluation, i.e. 
the reflection of game playing experience versus hard-coded rules, is one of the main reasons why Leela Zero has a fundamentally different understanding of compensation and dynamic advantages in chess.<br /><br />In a classical engine, the engine starts the search with a nominal search depth, which it gradually increases 1 half-move at a time, and investigates every sequence of moves up to this depth. Of course, modern engines like Stockfish are so strong because they do not exactly do this. They will not investigate unpromising moves deep in the tree (pruning) or decide to investigate them much less deeply (reductions), based on heuristic rules or statistics gathered during the search. Nevertheless, every move sequence will tend to have gotten some minimal amount of search, a brute force safeguard, if you will. This is much less so in Leela Zero. If the network decides that a particular move is very unlikely to be good, it can essentially ignore it even if the main line has been investigated for 25 half moves or more. This is why the engine can occasionally still make tactical mistakes from lack of experience. Note that unlike the heuristics for the classical engines, Leela Zero uses the full neural network, i.e. all of its chess knowledge, for every decision where to search. This is also why, despite searching literally thousands of times slower than a classical engine, she can still be competitive.<br /><br />If you read the description of Leela Zero's search, you will see that there is no such thing as a nominal depth with which to start investigating each move sequence. Again, for compatibility with existing software, Leela will report a search depth in half-moves, though this value is essentially made up.<br /><br />In a classical engine, the search and evaluation heuristics are a sequence of rules. IF this AND this AND this THEN IF this ELSE... and so on. 
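As a toy illustration of how one such rule can be asked for the whole board at once, here is a simplified check for isolated pawns using a 64-bit integer with one bit per square (deliberately minimal, not real engine code):

```python
# One bit per square, a1 = bit 0, h8 = bit 63.
FILES = [sum(1 << (f + 8 * r) for r in range(8)) for f in range(8)]

def isolated_pawns(pawns):
    """Return a bitmap of pawns with no friendly pawn on an adjacent file."""
    result = 0
    for f in range(8):
        # Mask covering the files to the left and right of file f.
        neighbours = (FILES[f - 1] if f > 0 else 0) | (FILES[f + 1] if f < 7 else 0)
        if not pawns & neighbours:
            result |= pawns & FILES[f]
    return result
```

A lone pawn on c2 is reported as isolated; pawns on a2 and b2 cover each other's adjacent files and are not.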
Modern engines such as Stockfish use 64-bit integer bitmaps to ask such questions about the whole board at once at high speed. Nevertheless, the execution flow inside the program jumps around a lot as every rule and exception is evaluated. These kinds of "branching" flows are the forte of the classical CPU, as found in desktop computers and mobile phones. In contrast, the processing of Leela Zero's neural network, after some mathematical trickery, essentially boils down to doing a lot of straightforward matrix multiplications. This is a highly parallel task with no branching or divergence, the forte of the Graphics Processing Unit (GPU), found in a video card. The raw computing power of the GPU is much higher for these simpler tasks, and this is why Leela runs comparatively faster on a GPU. Because the neural network is so large, even a fast GPU will run at far fewer nodes per second than you might be used to from a classical engine. She has to make up the difference by having a superior understanding of chess, something which she is getting better at every day.<br /></p>Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.comtag:blogger.com,1999:blog-5232577621384962517.post-72692108981259916842018-05-10T23:01:00.001+02:002018-05-14T15:59:17.310+02:00Linux sandboxing improvements in Firefox 60Continuing our past work, Firefox 60 brings further important improvements to security sandboxing on Linux, making it harder for attackers that find security bugs in the browser to escalate those into attacks against the rest of the system.<br />
<br />
The most important change is that content processes — which render Web pages and execute JavaScript — are no longer allowed to directly connect to the Internet, or connect to most local services accessed with <a href="https://linux.die.net/man/7/unix">Unix-domain sockets</a> (for example, PulseAudio).<br />
<br />
This means that content processes have to follow any network access restrictions Firefox imposes — for example, if the browser has been set up to use a proxy server, connecting directly to the internet is no longer possible. But more important are the restrictions on connections to local services: they often assume that anything connecting to them has the full authority of the user running it, and either allow it to ask for arbitrary code to run, or aren't careful about preventing that. Normally that's not a security problem because the client could just run that code itself, but if it's a sandboxed Firefox process, that could have meant a sandbox escape.<br />
<br />
In case you encounter problems that turn out to be associated with this feature, the `security.sandbox.content.level` setting <a href="https://www.morbo.org/2017/11/linux-sandboxing-improvements-in.html">described previously</a> can be used for troubleshooting; the network/socket isolation is controlled by level 4. Obviously we'd love to hear from you on <a href="https://bugzilla.mozilla.org/enter_bug.cgi">Bugzilla</a> too.<br />
<br />
Additionally, System V IPC is now blocked. This is a relatively old way of communicating between processes on Unix systems, which modern software mostly avoids. It's not needed in content processes, and it can grant access to resources they shouldn't have, so we now block it. (There are exceptions: ATI's legacy `fglrx` graphics drivers and VirtualGL require it, so if either of those is in use, we grant an exception. This does not affect newer AMD drivers, which can run with these restrictions in place.)<br />
<br />
These changes also allow something else useful: because content processes no longer need any direct access to the network or filesystem (they can still ask the main process for some <a href="https://www.morbo.org/2017/11/linux-sandboxing-improvements-in.html">whitelisted subset they are allowed to read</a>), we can ask the kernel to take it away.<br />
<br />
Specifically, we use Linux <a href="http://man7.org/linux/man-pages/man7/user_namespaces.7.html">namespaces</a> and <a href="https://linux.die.net/man/2/chroot"><tt>chroot</tt></a>. This is similar to how containers like Docker work, and complements the existing sandbox: in addition to blocking specific system calls like `open` and `connect`, we can prevent filesystem/network access no matter how it is requested. This is part of our defense in depth: if a clever attacker manages to bypass one layer of protection, that still won't be enough to compromise the system.<br />
<br />
The kernel feature we use for this is called “unprivileged user namespaces”. Although it has been in Linux for a while, some distributions don't support it because they rely on older kernels, or they disable it because of security concerns. Although we don't want to take a position on the latter, obviously <i>if</i> the feature is available, Firefox might as well make use of it to make itself more secure. The Sandbox section of about:support will tell you whether it is supported on your distribution or not. If it isn't, you are still secure, just missing one of the extra hardening layers.<br />
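A rough way to check this outside Firefox is to simply try creating a user namespace and see whether the kernel lets you. A small Python sketch (this is not how Firefox or about:support detects it, just an independent probe):

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>
libc = ctypes.CDLL(None, use_errno=True)

def userns_available():
    """Probe unprivileged user namespace support in a throwaway child."""
    pid = os.fork()
    if pid == 0:
        # unshare(2) either moves us into a fresh user namespace, or fails
        # (e.g. EPERM) when the feature is missing or disabled.
        os._exit(0 if libc.unshare(CLONE_NEWUSER) == 0 else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

print("unprivileged user namespaces:", userns_available())
```

Running the probe in a forked child means the main process is unaffected whatever the outcome.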
<br />
The one exception to the network policy, for now, is the <a href="https://en.wikipedia.org/wiki/X_Window_System">X11 protocol</a> which is used to display graphics and receive keyboard/mouse input. Content processes don't need most of its functionality, but they still have some dependencies on it. Hardening work on X11 is ongoing and will be addressed in future releases.<br />
<br />
<h4>
Developer information about Unprivileged User Namespaces:</h4>
<br />
User namespaces allow a process to behave as if it were root in a restricted subset of the system by granting it extra capabilities. Firefox uses the CAP_SYS_CHROOT capability gained this way to chroot() the content processes into an empty directory, thereby blocking their view of the real filesystem. Using user namespaces avoids the need to install a setuid root binary to achieve this. <br />
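The sequence is roughly: create the user namespace, map your own uid/gid to root inside it, then chroot into an empty directory. A simplified Python sketch of that idea (Firefox's actual implementation is in C++ and differs in detail; the exit codes here are invented for the demo, and the function reports failure gracefully where user namespaces are unavailable):

```python
import ctypes
import os
import tempfile

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>
CLONE_NEWNS = 0x00020000
libc = ctypes.CDLL(None, use_errno=True)

def chroot_to_empty():
    """Return 0 on success, 2 if user namespaces are unavailable here."""
    empty = tempfile.mkdtemp()  # the process will see no files at all
    uid, gid = os.getuid(), os.getgid()
    pid = os.fork()
    if pid == 0:
        try:
            # A new user namespace (which grants capabilities inside it)
            # plus a new mount namespace, owned by that user namespace.
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0:
                os._exit(2)
            # Map ourselves to root inside the namespace. setgroups must
            # be denied before an unprivileged process may write gid_map.
            with open("/proc/self/setgroups", "w") as f:
                f.write("deny")
            with open("/proc/self/uid_map", "w") as f:
                f.write(f"0 {uid} 1")
            with open("/proc/self/gid_map", "w") as f:
                f.write(f"0 {gid} 1")
            os.chroot(empty)  # allowed via CAP_SYS_CHROOT in the namespace
            os.chdir("/")
            os._exit(0 if os.listdir("/") == [] else 1)
        except OSError:
            os._exit(2)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

print("chroot demo exit code:", chroot_to_empty())
```

After the chroot, the child's view of the filesystem is an empty directory, regardless of what other sandbox layers allow.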
<br />
Unprivileged user namespaces had some security bugs in their initial Linux implementation, and some Linux distribution maintainers are still wary of keeping them enabled. The problem is that although the process behaves as if it were root in its own namespace, at many levels in the kernel checks must be done to ensure it cannot do things that it couldn't do in its original namespace, where it is not root. There have been some calls to rework the feature to reduce the possibility of errors in this regard.<br />
<br />
In this context, we'd like to remark that an application like Firefox only needs CAP_SYS_ADMIN, CAP_SYS_CHROOT, and CAP_SET(UG)ID inside the namespace to achieve most of the effect, and specifically does not require the CAP_NET_ADMIN permission that has been proven rather <a href="https://nvd.nist.gov/vuln/detail/CVE-2018-1065">bug</a> <a href="https://security-tracker.debian.org/tracker/CVE-2016-9793">prone</a> in the past.<br />
<br />
<br />
<br />Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com15tag:blogger.com,1999:blog-5232577621384962517.post-35302436566514816382017-11-09T18:06:00.000+01:002017-11-09T18:19:45.155+01:00Linux sandboxing improvements in Firefox 57Firefox 57 not only ships a large number of performance improvements and a UI refresh, it also contains a number of technological improvements under the hood. One of these is that the security sandbox was tightened, making it harder for attackers - should they find a security hole in Firefox in the first place - to escalate that attack against the rest of the system, or your data.<br />
<br />
The content process - that is the one that renders the web pages from the internet and executes JavaScript - is now blocked from reading large parts of the filesystem, with some exceptions for libraries, configuration information, themes and fonts. Notably, it is no longer possible to read private information in the home directory or the Firefox user profile, even if Firefox were to be compromised.<br />
<br />
We could not entirely block web content rendering from reading the filesystem, because Firefox still uses GTK directly - we draw webpage HTML widgets using the same theming as the native desktop. Rather than postpone the security improvements until this is reworked, we've elected to work around it by allowing a few very specific locations through. Similar exceptions apply to the use of PulseAudio (to be fixed around Firefox 59 with a new Rust-based audio backend), ffmpeg (media decoding must stay sandboxed) and WebGL.<br />
<br />
We've made sure this works on all common, and many not-so-common, configurations. So most of you can stop reading here, but for those who like to know more details or are tinkerers, the following might be of interest. Due to the infinite configurability of Linux systems, it's always possible there will be cases where a non-standard setup can break things, and we've kept Firefox configurable, so you can at least help yourself, if you're so inclined.<br />
<br />
For example, we know that in Firefox 57, allowing your system font configuration to search for fonts in the directory where your downloads are stored (a rather insecure configuration, for that matter!) can cause these fonts to appear blank in web pages.<br />
<br />
The following settings are available in about:config:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br />
</span> <span style="font-family: "courier new" , "courier" , monospace;">security.sandbox.content.level</span><br />
<br />
This determines the strictness of the sandbox. 0 disables everything, 1 filters dangerous system calls, 2 additionally blocks writing to the filesystem, and 3 adds blocking of (most) reading from the filesystem. This is a high-level knob; use it only to quickly check whether an issue is caused by sandboxing. After changing this, you'll have to restart Firefox for it to take effect.<br />
<br />
If lowering security.sandbox.content.level fixes your problems, turn it back to the default value (3 in Firefox 57) and restart Firefox with the <span style="font-family: "courier new" , "courier" , monospace;">MOZ_SANDBOX_LOGGING=1</span> environment variable set, which will log any accesses the sandbox allows or blocks. "Denied" messages will give you a clue what is being blocked. Don't forget to file a bug in Bugzilla, so we can track the problem and, if possible, make things work by default.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br />
</span> <span style="font-family: "courier new" , "courier" , monospace;">security.sandbox.content.read_path_whitelist</span> <br />
<br />
List of paths (directories and files) that Firefox is additionally allowed to read from, separated by commas. You can add things here if Firefox can't reach some libraries, config files or fonts that are in a non-standard location, but avoid pointing it to directories that contain personal information. <br />
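For example, to let the content process read from two hypothetical non-standard locations, the value would be a comma-separated list like (paths invented for illustration):

```
/usr/local/share/fonts/,/opt/extra-libs/
```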
<span style="font-family: "courier new" , "courier" , monospace;"><br />
</span> <span style="font-family: "courier new" , "courier" , monospace;">security.sandbox.content.write_path_whitelist</span><br />
<br />
List of paths that Firefox is additionally allowed to write to, separated by commas. It should almost never be necessary to change this.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">security.sandbox.content.syscall_whitelist</span><br />
<br />
List of system call numbers that Firefox will additionally allow, separated by commas. A disallowed system call will crash Firefox with a message mentioning "seccomp violation". It should almost never be necessary to change this. We'd particularly like to hear from you in Bugzilla if you require this.<br />
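For the curious, you can provoke something similar to such a crash yourself. This sketch uses seccomp's old "strict" mode rather than the BPF filter mode Firefox actually uses, and it reports instead of crashing when strict mode can't be installed (for example, when the process is already running under another seccomp filter):

```python
import ctypes
import os

PR_SET_SECCOMP = 22      # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1  # only read(), write(), _exit() and sigreturn()
libc = ctypes.CDLL(None, use_errno=True)

pid = os.fork()
if pid == 0:
    if libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0:
        os._exit(42)     # strict mode rejected in this environment
    libc.getpid()        # any other system call is now fatal
    os._exit(0)          # not reached
_, status = os.waitpid(pid, 0)
if os.WIFSIGNALED(status):
    result = "seccomp violation: child killed by signal %d" % os.WTERMSIG(status)
elif os.WEXITSTATUS(status) == 42:
    result = "seccomp strict mode unavailable here"
else:
    result = "unexpected: child survived"
print(result)
```

On a typical desktop kernel the child is killed with SIGKILL the moment it makes a disallowed system call, which is the same mechanism behind the "seccomp violation" crashes mentioned above.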
<br />Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com3tag:blogger.com,1999:blog-5232577621384962517.post-73250982623296343422016-10-07T16:42:00.000+02:002016-10-07T16:43:08.268+02:00Firefox sandbox on Linux tightenedAs just <a href="https://groups.google.com/forum/#!msg/mozilla.dev.platform/V-5G9dWJEXo/S9qxkg9EAgAJ">announced on mozilla.dev.platform</a>, we landed a set of changes in today's Nightly that tightens our sandboxing on Linux. The content process, which is the part of Firefox that renders webpages and executes any JavaScript on them, had previously been restricted in the set of system calls it could make. As of today, it no longer has write access to the filesystem, with exceptions for shared memory and /tmp. We plan to also remove the latter, eventually. <br />
<br />
As promised, we're continuing to batten down the hatches gradually, making it harder for an attacker to successfully exploit Firefox. The changes that landed this night are an important step, but far from the end of the road, and we'll be continuing to put out iterative improvements.<br />
<br />
Some of our next steps will be to address the interactions with the X11 windowing system, as well as implementing read restrictions.Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com0tag:blogger.com,1999:blog-5232577621384962517.post-11088571338680561412016-05-20T21:14:00.000+02:002016-05-21T00:30:41.010+02:00Technical Debt, Episode 1One of the projects I'm working on for Mozilla is our Content Sandboxing. We've been using sandboxing for a while to protect some plugins like Flash, as well as media plugins, but now that Firefox can render webpages in a separate process, we can apply restrictions to what those "Web Content" processes can do, too. Those processes are the part of Firefox that is essentially exposed to the internet, and hence to potentially dangerous webpages.<br />
<br />
Although we go to great lengths to make this impossible, there is always a chance that a bug in Firefox would allow an attacker to exploit and take over a Web Content process. But by using features provided by the operating system, we can prevent them from taking over the rest of the computing device by disallowing many ways to interact with it, for example by stopping them from starting new programs or reading or writing specific files.<br />
<br />
This feature has been enabled on Firefox Nightly builds for a while, at least on Windows and Mac OS X. Due to the diversity of the ecosystem, it's taken a bit longer for Linux, but we are now <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=742434">ready to flip that switch</a> too.<br />
<br />
The initial version on Linux will block very, very little. It's our goal to get Firefox working and shipping with this first and foremost, while we iterate rapidly and hammer down the hatches as we go, shipping a gradual stream of improvements to our users.<br />
<br />
One of the first things to hammer down is filesystem access. If an attacker is free to write to any file on the filesystem, he can quickly take over the system. Similarly, if he can read any file, it's easy to leak out confidential information to an attacking webpage. We're currently figuring out the list of files and locations the Web Content process needs to access (e.g. system font directories) and which ones it definitely shouldn't (your passwords database).<br />
<br />
And that's where this story about technical debt really starts.<br />
<br />
While tracing filesystem access, we noticed at some point that the Web Content process accesses /etc/passwd. Although on most modern Unix systems this file doesn't actually contain any (hashed) passwords, it still typically contains the complete real name of the users on the system, so it's definitely not something that we'd want to leave accessible to an attacker.<br />
<br />
My first thought was that something was trying to enumerate valid users on the system, because that would've been a good reason to try to read /etc/passwd.<br />
<br />
Tracing the system call to its origin revealed another caller, though. libfreebl, a part of <a href="https://en.wikipedia.org/wiki/Network_Security_Services">NSS (Network Security Services)</a>, was reading it during its initialization. Specifically, we traced it to <a href="https://hg.mozilla.org/mozilla-central/annotate/c67dc1f9fab86d4f2cf3224307809c44fe3ce820/security/nss/lib/freebl/unix_rand.c#l848">this array in the source</a>. Reading up on what it is used for is, eh, quite eyebrow-raising in the modern security age.<br />
<br />
The NSS random number generator seeds itself by attempting to read /dev/urandom (good), ignoring whether that fails or not (not so good), and then continuing by reading and hashing the password file into the random number generator as additional entropy. The same code then goes on to read in several temporary directories (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=356886">and I do mean directories, not the files inside them</a>) and perform the same procedure.<br />
<br />
Should all of this have failed, it will make a last-ditch effort to fork/exec <code>"netstat -ni"</code> and hash the output of that. Note that the usage of fork here is especially "amusing" from the sandboxing perspective, as it's the one thing you'll absolutely never want to allow.<br />
<br />
Now, almost none of this has ever been a *good* idea, but in its defense NSS is old and caters to many exotic and ancient configurations. The discussion about <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=182758#c2">/dev/urandom reliability was raised in 2002</a>, and I'd wager the relevant Linux code has seen a few changes since. I'm sure that 15 years ago, this might've been a defensible decision to make. Apparently one could even argue that <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=501605#c130">some unnamed Oracle product running on Windows 2000</a> was a defensible use case to keep this code in 2009. <br />
<br />
Nevertheless, it's technical debt. Debt that hurt on the release of Firefox 3.5, when it <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=501605">caused Firefox startup to take over 2 minutes</a> on some people's systems.<br />
<br />
It's not that people didn't notice this idea was problematic:<br />
<blockquote class="tr_bq">
I'm fully tired of this particular trail of tears. There's no good reason to waste users' time at startup pretending to scrape entropy off the filesystem. </blockquote>
<blockquote class="tr_bq">
<i>-- Brendan Eich, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=501605#c113">July 2009</a></i></blockquote>
<blockquote class="tr_bq">
RNG_SystemInfoForRNG - which tries to make entropy appear out of the air. </blockquote>
<blockquote class="tr_bq">
<i>-- Ryan Sleevi, <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=367986#c10">April 2014</a></i> </blockquote>
Though sandboxing was clearly not considered much of a use case in 2006:<br />
<blockquote class="tr_bq">
Only a subset of particularly messed-up applications suffer from the use of fork. </blockquote>
<blockquote class="tr_bq">
<i>-- Well meaning contributor, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=182758#c23">September 2006</a></i></blockquote>
Nevertheless, I'm - still - looking at this code in the year of our Lord 2016 and wondering if it shouldn't all just be replaced by a single getrandom() call.<br />
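For reference, the modern replacement really is that small. In Python (which exposes the Linux getrandom(2) system call as os.getrandom):

```python
import os

# getrandom(2) asks the kernel's CSPRNG directly: no file descriptors,
# no scraping of /etc/passwd or temporary directories, no fork/exec.
# It blocks only until the kernel entropy pool has been initialized
# once after boot, and never again afterwards.
seed = os.getrandom(32)
```

One system call, no filesystem access to sandbox around, and no startup delay.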
<br />
If your system doesn't have getrandom(), well maybe there's a <a href="https://www.jwz.org/blog/2016/04/i-would-like-debian-to-stop-shipping-xscreensaver/">solution for that too</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://i.stack.imgur.com/S5R5B.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.stack.imgur.com/S5R5B.png" height="224" width="320" /></a></div>
<br />
<br />
Don't agree? Can we then at least agree that if your /dev/urandom isn't secure, it's your problem, not ours?<br />
<br />Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com4tag:blogger.com,1999:blog-5232577621384962517.post-59720888455723136652013-04-17T20:52:00.000+02:002013-04-17T21:06:53.767+02:00WebRTC support on AndroidYesterday, WebRTC support landed in Firefox for Android, which means it is available in today's Nightly. Wait a minute, you might say, didn't you <a href="http://www.youtube.com/watch?v=rWPZZeXK6g4">demo this already at MWC a month and a half ago</a>? Well yes. But to get the code used for that demo landed in the main Firefox repository (instead of a branch), some cleanup work was needed, and we needed to make sure that older, pre-ICS phones didn't just blow up when you tried to make a phone call on them.<br />
<br />
Although, in theory, WebRTC should work on any device Firefox for Android supports, for an ideal experience you're going to want to have an Ice Cream Sandwich or Jelly Bean device. We'll do what we can for older devices, but realistically Android was missing some essential APIs before that time (especially before Android 2.3) and device performance will also limit what experience you get. Nevertheless I had no problems getting a video call up on my Galaxy Tab 10.1 with Android 3.2, and many devices will probably work just fine.<br />
<br />
By default WebRTC is included in Firefox for Android, but behind a preference that is off by default. To enable, flip these preferences to "true":<br />
<ul><li>media.navigator.enabled</li>
<li>media.peerconnection.enabled</li>
</ul><br />
Secondly, because the UI is currently very provisional, you might want to disable the dialog that asks for permission to use your camera and microphone (but remember the implications of this, you probably don't want to keep it that way after testing!). Enable this setting:<br />
<br />
<ul><li>media.navigator.permission.disabled</li>
</ul><br />
Some example pages for WebRTC APIs are here:<br />
<br />
<ul><li><a href="http://mozilla.github.io/webrtc-landing/">Mozilla WebRTC landing page</a></li>
</ul><br />
Note that as the code only just landed, there are going to be bugs. Some known ones: the permissions dialog (if enabled) only remembers the first choice you make, giving you audio or video but not both; and when permissions are disabled, we always pick the first camera, which is probably the one you didn't want. On some devices the video might be upside down, too. Expect most of these to be fixed in the next days and weeks. If you find any bugs that I haven't listed here, feel free to file them in <a href="https://bugzilla.mozilla.org/enter_bug.cgi?product=Core">Bugzilla</a> and we'll get to work on them.<br />
<h3>Why Twitter pictures are sideways (2013-02-27)</h3>
We set out to do a demonstration of using WebRTC on an Android device before this year's Mobile World Congress. I'm happy to say we made the deadline and were able to demonstrate some nice desktop Firefox to Android video calls. We only got the demonstration code working shortly before the deadline, though. (Actually, about an hour after we decided that we'd likely not make it, I got to tell everyone to disregard that because we just did, for a great "stop the press" moment.) So here's a story about how we got hung up for almost a week on something that looked quite trivial at first.<br />
<br />
The WebRTC stack in Firefox for Android is based on the upstream <a href="http://www.webrtc.org">webrtc.org</a>. This codebase is shared between Firefox and Chrome, but as Firefox is the only one of those two to use the Android parts so far, we expected a fair amount of work before getting anything working. That didn't end up being so bad: we hit a bunch of problems with the gyp build system; another bunch were related to us running all the code off the main thread. One nice example that bit us is that <a href="http://developer.android.com/training/articles/perf-jni.html#faq_FindClass">running FindClass off of the main Dalvik thread doesn't do what you want</a>. But overall, we found that we were able to get some audio and video running quite rapidly. Not perfect, but certainly good enough for a first demo.<br />
<br />
That is, aside from one small issue: <i>the video was sideways in portrait mode.</i><br />
<br />
Experienced Android developers who are reading this are surely snickering by now.<br />
<br />
The underlying cause of this is that the actual cameras in some devices are rotated compared to the phone shell. You're supposed to probe the Android Camera stack for information about the orientation of the camera, probe the phone's orientation sensors to know how the user is holding it, and rotate appropriately.<br />
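To make that concrete: the rotation you need to apply can be computed from the camera's mounting orientation and the current display rotation. Below is a minimal sketch following the formula given in the Android <code>Camera.setDisplayOrientation</code> documentation; the class and method names are mine for illustration, not Firefox code:

```java
/** Compute the clockwise rotation (in degrees) to apply to camera frames,
 *  following the formula from the Android Camera.setDisplayOrientation docs.
 *  cameraOrientation comes from CameraInfo.orientation; deviceRotation is the
 *  current display rotation (0, 90, 180 or 270). Hypothetical helper. */
class CameraRotation {
    static int neededRotation(int cameraOrientation, int deviceRotation, boolean frontFacing) {
        if (frontFacing) {
            int result = (cameraOrientation + deviceRotation) % 360;
            return (360 - result) % 360; // compensate for the front camera's mirror
        }
        return (cameraOrientation - deviceRotation + 360) % 360;
    }
}
```

For a typical back camera mounted at 90° in a phone held in portrait (display rotation 0), this yields the 90° rotation that, if skipped, produces exactly the sideways video described here.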
<br />
If you do a Google search for "<a href="https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=Android+video+rotated">Android video rotated</a>" or "<a href="https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=Android+video+sideways">Android video sideways</a>" you're going to get a <b>lot</b> of hits, so we knew we weren't the first to see this problem. For our use, we're specifically capturing data from an <a href="http://developer.android.com/reference/android/hardware/Camera.html">Android Camera</a> object through a preview Surface. There are several threads on this issue on StackOverflow, some of them trying to cheat by pretending to be in landscape mode (a hack that didn't seem feasible for Firefox), but most of them conclude the issue must be solved through using the <a href="http://developer.android.com/reference/android/hardware/Camera.html#setDisplayOrientation%28int%29">setDisplayOrientation</a> call. <br />
<br />
There are several downsides to this proposed solution: it is API Level >= 8 only, for starters. Given that we have dropped Froyo support (the final WebRTC code will likely require at least Android 2.3 or later, and almost certainly Android 4.0 or later for the best experience), that's not all that much of a problem for us, even more so because Camera support on Froyo or older devices is quite spotty anyway. But it makes me wonder what Camera apps on Froyo are supposed to do.<br />
<br />
The second problem is that setDisplayOrientation only changes the orientation of the preview Surface. Now, for a normal Camera app that wants to show the user the video he's capturing, this is fine. But that's not what we're doing. We don't want to show the preview picture (in Java) at all! What to do with the data captured through WebRTC getUserMedia is entirely up to the webpage, which means it's to be handled in JavaScript (Gecko), and not the business of the Java layer at all. Maybe the JavaScript doesn't want to show the local video at all, for that matter. So our application doesn't actually want to display the Android preview, and so rotating it isn't going to make any difference. We just want the raw video data to send off to the video encoder.<br />
<br />
Now, to go off on a small tangent, it turns out that not showing the video preview on an Android device isn't actually reliably possible pre-ICS. We currently use a dummy TextureView, <a href="http://stackoverflow.com/questions/10959767/invisible-surfaceview-for-camera-preview">which is known not to work on some devices</a>, and will use a SurfaceTexture (which is guaranteed to work) on Android 4.0 or newer devices.<br />
<br />
The Android Camera API also has a setRotate call, which might rotate the image data, or it might just put an EXIF flag that the image was rotated. This can only work if you're taking in JPEG data or if you're lucky and your phone actually rotates the data. None of the phones we wanted to use for the demo (Galaxy S3 and Nexus 4) do this. Taking in JPEG is out of the question, imagine the overhead going NV21->JPEG compress->JPEG decompress->I420->VP8 compress for every frame at 60fps! For the same reasons, we discarded the idea of doing the rotation with SurfaceTexture, setMatrix and letting the GPU do the work, because that would require a double NV21<=>RGB conversion as well.<br />
<br />
Somewhere around this point, it was starting to dawn on me that this is the reason that <a href="https://twitter.com/gcpascutto/status/300278897684127745">pictures I'm taking with my Android phone come out sideways when posting on Twitter</a>. The camera <i>is</i> actually capturing them sideways, and something along the chain misses that EXIF flag, which causes the picture to be displayed as it was taken. What makes this sad is that you can actually rotate JPEG data the right way in a lossless manner, so this is entirely unneeded. Or maybe they could just have made it easier for the camera apps to rotate the data before saving. Crazy idea, I know.<br />
<br />
Okay, so here we are. We're getting in the raw camera data, and the only format we can rely on being available on every device is NV21 (except for the emulator, which will use RGB565, because this is Android after all). So, we need to rotate the NV21 data before sending it off towards our WebRTC video codec. We create a Bitmap with that image data and then rotate it using the Bitmap functions, right? Well, that would be fine, but unfortunately, even though NV21 is the only format you can rely on being generated by the camera, there is <i>nothing else</i> in Android that can understand it (there's android.graphics.YuvImage, which "helpfully" only allows JPEG output). The only way to deal with it is to <a href="http://stackoverflow.com/questions/5272388/extract-black-and-white-image-from-android-cameras-nv21-format">write your own conversion code in Java</a>.<br />
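For the curious, such a hand-rolled rotation looks roughly like this: NV21 is a full-resolution Y plane followed by a half-resolution interleaved V/U plane, and a 90° clockwise rotation moves each Y byte and each V/U pair to its transposed position. This is an illustrative sketch (assuming even width and height), not the actual Fennec or libyuv code:

```java
/** Rotate an NV21 frame 90 degrees clockwise. NV21 is a w*h Y plane followed
 *  by a (w*h)/2 interleaved V/U plane at half resolution in each dimension.
 *  Assumes even width and height; sketch for illustration only. */
class Nv21 {
    static byte[] rotate90(byte[] in, int w, int h) {
        byte[] out = new byte[in.length];
        // Y plane: source pixel (x, y) lands at column (h - 1 - y), row x
        // of the rotated image, which is h pixels wide and w pixels tall.
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out[x * h + (h - 1 - y)] = in[y * w + x];
        // V/U plane: same mapping on the half-resolution chroma grid,
        // moving each interleaved V and U byte pair together.
        int base = w * h;
        for (int cy = 0; cy < h / 2; cy++)
            for (int cx = 0; cx < w / 2; cx++) {
                int src = base + cy * w + 2 * cx;
                int dst = base + cx * h + 2 * (h / 2 - 1 - cy);
                out[dst] = in[src];         // V
                out[dst + 1] = in[src + 1]; // U
            }
        return out;
    }
}
```

Doing this per frame in Java is exactly the kind of busywork you'd rather hand to optimized native code, which is why we ended up in libyuv.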
<br />
Because we have a C++ backend, and the webrtc.org part of that includes libyuv, we ended up passing the data to that. libyuv has optimized code for YUV-style data handling (including NV21), so that's nice, though the <a href="http://code.google.com/p/webrtc/issues/detail?id=424">implementation in the WebRTC code is buggy as well</a>. At least there were some hooks to request the rotation included, so someone down there must have known about this madness.<br />
<br />
So we only need to work around those known bugs, and voilà, we have rotated our image. Easy, huh? Android is great!<br />
<h3>Phishing and malware protection arrives on mobile devices (2012-10-11)</h3>
I'm happy to announce malware and phishing protection is now enabled in Firefox for Android, and available in the Aurora and Nightly builds. It will be released in Firefox 18. As far as I could tell, Firefox is the first browser to offer this feature to mobile users.<br />
<br />
The phishing and malware protection works by regularly updating a local list of known, bad sites, graciously provided by Google through their SafeBrowsing database. Whenever Firefox detects that you navigate to such a site, or that a page you visit is trying to pull data from it, Firefox will present you with a warning page and allow you to abort the operation. <br />
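In spirit, the local check works on hashes rather than raw URLs: the client stores short hash prefixes of known-bad entries, and a prefix hit only means the site <i>might</i> be bad and needs confirmation against the full hash. Here's a much-simplified sketch of that idea; the real implementation also canonicalizes URLs, checks host/path combinations, and stores the data compactly on disk:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

/** Much-simplified SafeBrowsing-style lookup: store 32-bit SHA-256 prefixes
 *  of known-bad URLs; a prefix match only means "possibly bad" and must be
 *  confirmed against the full hash. Not the actual Firefox implementation. */
class UrlBlocklist {
    private final Set<Integer> prefixes = new HashSet<>();

    static int prefixOf(String url) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                                    .digest(url.getBytes(StandardCharsets.UTF_8));
            // First 4 bytes of the digest, packed into an int.
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                 | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }

    void add(String badUrl) { prefixes.add(prefixOf(badUrl)); }

    boolean maybeBad(String url) { return prefixes.contains(prefixOf(url)); }
}
```

Storing only prefixes is what keeps the local database small enough for a phone, at the cost of occasional false prefix hits that must be double-checked.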
<br />
I blogged previously about our intent to <a href="http://www.morbo.org/2011/07/bringing-safebrowsing-to-mobile-devies.html">rework the database to a size acceptable to mobile devices</a>, and the subsequent <a href="http://www.morbo.org/2012/02/new-safebrowsing-backend.html">rewrite of the SafeBrowsing backend</a>. The remainder of the work had to be postponed a little as we worked vigorously to finish the new <a href="http://arstechnica.com/gadgets/2012/06/hands-on-firefox-for-android-may-become-your-favorite-android-browser/">Native Firefox for Android</a> for phones and <a href="https://blog.mozilla.org/blog/2012/08/28/firefox-for-android-gets-speedy-and-powerful-upgrade-for-tablets/">tablets</a>.<br />
<br />
One of the remaining tasks was verifying that the new backend processes all SafeBrowsing updates correctly. This was done by writing an external <a href="https://github.com/gcp/sbdbdump">Python program that can read both the old and new format files</a>, which turned out to be very useful in debugging the file format documentation as well. <br />
<br />
Another issue that remained was the need for the updates to be very resilient on a mobile platform. For example on Android the application can be killed nearly at will by the Operating System, and care must be taken that the database not only doesn't get corrupted, but that as little data as possible is lost if this happens in the middle of an update. The old backend got this functionality mostly for free through the ACID features of SQLite, but for the new backend <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=727370">some extra work was necessary</a>.<br />
<br />
The final step was then updating the SafeBrowsing warning screens and support code for Firefox for Android, which we finished last week. For the future, our UX team is currently busy with a new, nicer visual design for the warning pages, which will be shared between Firefox for Android and Firefox OS. But in the meantime, enjoy the added protection. I sincerely hope you will never have to encounter either of these pages, anyway!<br />
<br />
<div class="separator" style="clear: both; text-align: center; float: left; margin-right: 1em; margin-bottom: 1em"><p>Phishing warning</p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf7PBUObQ1Z383mt60EcD02UH7tgYXCaGBGXPBofO_4CvKPh6nkdSxLC6fiMOy2DFGKDHmSnLGPCCbBuIpESz_Ge-u6O1Io51L6gYE4bWJpRnQYVuMXQ21z3aHDbMT5hWE_XWNSqWmJD3k/s1600/Screenshot_2012-10-11-09-00-22.png" imageanchor="1"><img border="0" height="320" width="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf7PBUObQ1Z383mt60EcD02UH7tgYXCaGBGXPBofO_4CvKPh6nkdSxLC6fiMOy2DFGKDHmSnLGPCCbBuIpESz_Ge-u6O1Io51L6gYE4bWJpRnQYVuMXQ21z3aHDbMT5hWE_XWNSqWmJD3k/s320/Screenshot_2012-10-11-09-00-22.png" /></a></div><div class="separator" style="text-align: center; clear: right; float: left; margin-left: 1em; margin-bottom: 1em"><p>New UX Design</p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMJiWYl3j_qegDt9qKrevq9t2tLDvrnO3JJmJsulkF8mdFnpn3PYeCrJDyOySGf4mM3iFPCy-7TFvvUt5hIA_dUYhbpHq9CQz4OevDz5c63z5TmbjzUL6sCLl73eTcJg9OHrnN2jgOpeVy/s1600/new_warning_page.jpg" imageanchor="1"><img border="0" height="320" width="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMJiWYl3j_qegDt9qKrevq9t2tLDvrnO3JJmJsulkF8mdFnpn3PYeCrJDyOySGf4mM3iFPCy-7TFvvUt5hIA_dUYhbpHq9CQz4OevDz5c63z5TmbjzUL6sCLl73eTcJg9OHrnN2jgOpeVy/s320/new_warning_page.jpg" /></a></div><br />
<h3>The effectiveness of practice - a small study of StarCraft 2 balance (2012-06-20)</h3>
<p><b>Edit 2012-06-21:</b> Corrected "games" to <b>"won games"</b>. Sorry about that!</p>
<h3>Win rates have little to do with "balance"</h3>
<p>I had some musings about StarCraft 2 balance a while ago and decided to write a program to check them out.</p><p>
Basically, most discussions about balance end up mentioning win rates a lot. However, because the StarCraft 2 matchmaking tries to guarantee that each player has a win rate of around 50% in the medium to long term, win rates are actually a fairly meaningless indicator of player strength, and hence of balance. Regardless of balance, we always expect them to end up around 50%.</p>
<p>
Imagine you take a player of grandmaster level, and now handicap him by disallowing the use of his race's macro mechanic. The hit to his strength is similar in nature to what would happen if one of the races were weaker than the others. After playing for a while (and initially scoring way below 50%), he will likely drop down to somewhere like diamond league, until his opponents are such that he gets a 50% win rate again. So we now know that this player is actually of grandmaster level, but his race handicap holds him back from realizing his potential. Yet his win rate is the same as always.
</p>
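This convergence is easy to see in a toy simulation: give a player an Elo-style rating, always match him against opponents at his current rating, and handicap his true strength. His rating drifts down to his effective strength, and his win rate settles around 50% no matter how strong he "really" is. All the numbers below are made up for illustration and have nothing to do with Battle.Net's actual (hidden) rating system:

```java
import java.util.Random;

/** Toy Elo-style ladder: the player is always matched against an opponent at
 *  his current visible rating, so his chance to win depends only on the gap
 *  between that rating and his effective (handicapped) strength. Illustrative
 *  numbers only; not how Battle.Net's matchmaking actually works internally. */
class LadderSim {
    static double simulate(double effectiveStrength, double startRating,
                           int games, long seed, int[] winsOut) {
        Random rng = new Random(seed);
        double rating = startRating;
        int wins = 0;
        for (int i = 0; i < games; i++) {
            // Standard Elo expected score against an equal-rated opponent.
            double pWin = 1.0 / (1.0 + Math.pow(10.0, (rating - effectiveStrength) / 400.0));
            boolean won = rng.nextDouble() < pWin;
            if (won) wins++;
            rating += 16.0 * ((won ? 1.0 : 0.0) - 0.5); // K = 16, expected 0.5
        }
        winsOut[0] = wins;
        return rating;
    }
}
```

Run a "grandmaster" (2400) whose handicap makes him play like a 2000 player, and after enough games his rating hovers around 2000 while his overall win rate sits right at 50%: exactly the situation described above.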
<p>
Win rates also suffer from temporal fluctuations: if effective strategies for one race are developed and it takes a while to find a counter for the other races, win rates can suddenly shift quite far away from 50%. The actual ability of the players did not change, nor did the game itself suddenly become imbalanced. That would only be the case if it eventually turned out that there is no counter to this strategy.
</p>
<p>
There is one particular case in which win rates <em>can</em> tell us something about balance issues: Random players. For Random players who play all races equally, we'd expect them to on average score the same with all races. If we take the data of all random players and a certain race appears to outscore the others, it's a good sign of a balance issue.
</p>
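If that data were available, the check itself would be simple: a chi-square goodness-of-fit test of the per-race win counts against the expectation that each race wins a third of a Random player's games. A sketch with made-up counts; 5.99 is the 95% critical value for two degrees of freedom:

```java
/** Chi-square goodness-of-fit test for Random players' per-race win counts
 *  against the hypothesis that all three races win equally often.
 *  The counts fed to it here are made up for illustration. */
class BalanceTest {
    static double chiSquare(long[] winsPerRace) {
        long total = 0;
        for (long w : winsPerRace) total += w;
        double expected = total / (double) winsPerRace.length;
        double stat = 0;
        for (long w : winsPerRace) stat += (w - expected) * (w - expected) / expected;
        return stat;
    }

    static boolean looksImbalanced(long[] winsPerRace) {
        return chiSquare(winsPerRace) > 5.99; // 95% level, df = races - 1 = 2
    }
}
```

With perfectly even counts the statistic is 0; a race winning noticeably more than its share of a large sample pushes it well past the threshold.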
<h3>Effort needed for reaching a league</h3>
<p>Unfortunately, gathering statistics of the race performance of Random players is something that seems hard to spider off of Battle.Net. So while Blizzard can and probably does use this information in their balancing discussions, it is not available to us.</p>
<p>
Another way to look at race strength is to consider that if we could pick players of similar "innate" skill, give them a race and let them loose on the SC2 ladder, we could check if they end up in the same leagues.
</p>
<p>
So how do we find these players of similar "real" skill? We can make an assumption: a player improves the more StarCraft 2 games he plays. This isn't a very far-fetched assumption, as it's known that most people improve at skilled tasks through practice. We can then rephrase our balance question: for a given amount of practice, will a player of a certain race end up ranked higher than players of another race?
</p>
<h3>The data</h3>
<p>
I spidered data from 26676 players on the EU StarCraft 2 servers, or about 10% of the active player population in that region. I rejected all players that have no games in Season 8 yet, as well as all Random players. They're not pertinent to our base question, and they would require extra effort to distinguish from each other. I also threw away all players with fewer than 5 games, who hence have no league yet.
</p>
<p>
The table below shows the average number of <b>won</b> games for players in a certain league, both overall and broken down per race. After the number of won games comes the 95% error margin on the mean, which gives some idea of how accurate the numbers are. (They're mostly accurate to about 5-10%, except for the grandmasters, due to the extremely low sample size there.) The next number is the number of won games below which 95% of the players in that league fall. This gives some idea of the number of games at which a player is extremely likely to get promoted to the next league soon, as well as of the actual variance in the number of games players in a league have. Note that the variance is quite big - there are quite a few players who have played twice the average number of games yet are still stuck in a league.
</p>
<p>
The average number of won games of the players in a league is not the same as the number of games you would expect to need, on average, to be promoted into that league. That point is (very approximately) somewhere halfway between the averages of the lower league and the target league. The total number of games is about twice the number of won games, because the matchmaking should keep you around 50%.
</p>
<p>So with this introduction, here is the actual data:
<font size=-1><pre>
EU Server, 26676 players sampled, 16243 valid samples
6010 players in Protoss (37%)
5248 players in Terran (32%)
4985 players in Zerg (31%)
Avg wins = 290 (+- 9), 95% upper ( 1021) for 5754 players in bronze (35%)
Avg wins = 533 (+- 17), 95% upper ( 1547) for 3468 players in silver (21%)
Avg wins = 708 (+- 22), 95% upper ( 1902) for 2715 players in gold (17%)
Avg wins = 939 (+- 33), 95% upper ( 2485) for 2133 players in platinum (13%)
Avg wins = 1239 (+- 48), 95% upper ( 3047) for 1386 players in diamond (9%)
Avg wins = 1680 (+- 86), 95% upper ( 4101) for 783 players in master (5%)
Avg wins = 2880 (+- 1544), 95% upper ( 5968) for 4 players in grandmaster (0%)
Avg wins = 293 (+- 15), 95% upper ( 1022) for 2248 players in bronze_Protoss
Avg wins = 305 (+- 16), 95% upper ( 1079) for 2163 players in bronze_Terran
Avg wins = 262 (+- 17), 95% upper ( 917) for 1343 players in bronze_Zerg
Avg wins = 511 (+- 26), 95% upper ( 1467) for 1269 players in silver_Protoss
Avg wins = 553 (+- 33), 95% upper ( 1681) for 1162 players in silver_Terran
Avg wins = 537 (+- 29), 95% upper ( 1483) for 1037 players in silver_Zerg
Avg wins = 687 (+- 34), 95% upper ( 1767) for 977 players in gold_Protoss
Avg wins = 750 (+- 48), 95% upper ( 2115) for 777 players in gold_Terran
Avg wins = 695 (+- 37), 95% upper ( 1848) for 961 players in gold_Zerg
Avg wins = 949 (+- 55), 95% upper ( 2480) for 762 players in platinum_Protoss
Avg wins = 886 (+- 59), 95% upper ( 2292) for 550 players in platinum_Terran
Avg wins = 966 (+- 57), 95% upper ( 2608) for 821 players in platinum_Zerg
Avg wins = 1217 (+- 77), 95% upper ( 2902) for 476 players in diamond_Protoss
Avg wins = 1188 (+- 97), 95% upper ( 3065) for 371 players in diamond_Terran
Avg wins = 1293 (+- 80), 95% upper ( 3156) for 539 players in diamond_Zerg
Avg wins = 1719 (+- 135), 95% upper ( 3974) for 276 players in master_Protoss
Avg wins = 1687 (+- 177), 95% upper ( 4349) for 225 players in master_Terran
Avg wins = 1635 (+- 141), 95% upper ( 4017) for 282 players in master_Zerg
Avg wins = 2867 (+- 3537), 95% upper ( 7870) for 2 players in grandmaster_Protoss
Avg wins = 2893 (+- 1336), 95% upper ( 4782) for 2 players in grandmaster_Zerg
</pre></font></p>
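The summary numbers in the table above can be computed along these lines; this is a sketch of the statistics involved (mean, a ~95% error band on the mean via 1.96 standard errors, and a nearest-rank 95th percentile), not the original spidering and analysis script:

```java
import java.util.Arrays;

/** Summary statistics of the kind shown in the table: mean won games, a ~95%
 *  error margin on that mean (1.96 standard errors), and the empirical value
 *  below which ~95% of the players fall. Sketch, not the original script. */
class LeagueStats {
    static double[] summarize(double[] wins) {
        int n = wins.length;
        double sum = 0;
        for (double w : wins) sum += w;
        double mean = sum / n;
        double sq = 0;
        for (double w : wins) sq += (w - mean) * (w - mean);
        double sd = Math.sqrt(sq / (n - 1));      // sample standard deviation
        double sem95 = 1.96 * sd / Math.sqrt(n);  // ~95% error margin on the mean
        double[] sorted = wins.clone();
        Arrays.sort(sorted);
        // Nearest-rank 95th percentile (one simple convention among several).
        double upper95 = sorted[(int) Math.round(0.95 * (n - 1))];
        return new double[] { mean, sem95, upper95 };
    }
}
```

The error margin shrinks with the square root of the sample size, which is exactly why the grandmaster rows (2 and 4 players) carry such enormous uncertainties compared to the thousands-strong bronze rows.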
<h3>Conclusions</h3>
<p><b>There is a clear correlation between having played more games, and being in a higher league.</b> This validates our earlier assumption. More practice makes you a better player. (Or if you're pedantic about causation-correlation conclusions, at the very least better players practise more.) I know this sounds extremely obvious but it is nice to validate it. It also indicates smurfing etc isn't prevalent enough to mess with our results.
</p><p>
There are no significant differences between the races per league [1]. <b>This means that you cannot get higher ranked faster by picking a specific race. You still need to put in the same amount of practice.</b> This is the main thing we wanted to investigate and it's good news: StarCraft 2 is *really* well balanced, or at least it has been over the last 8 seasons, which this data aggregates.
</p><p>
If you want to get to masters, expect to put in about 3000 practice games, and maybe as much as 6000. If we assume an average game (including postmortem analysis etc) takes 20 real-life minutes, expect to put in about 1000 hours of practice.</p><p>
If you played 2000 games and are still in bronze league, <a href="http://i.chzbgr.com/completestore/2009/7/25/128930549610788748.jpg">you're doing it wrong.</a>
</p>
<p>
[1] This conclusion assumes that the majority of players play one race most of the time, i.e. that offracing only happens occasionally. Without this assumption we can't put the players in race groups to begin with. Note that if offracing were the norm, and the races were not balanced, the offracing players would be able to detect this quickly, which in turn would make it less likely that they'd continue to offrace instead of sticking with the strongest race.
</p>
<h3>History and bookmarks in Firefox for Android (2012-05-24)</h3>
<p>
The feature I have been working on now for quite some time is one I'd jokingly refer to as "a feature I hope most users will never need". It's the system that translates your browser history from XUL Fennec to Native Fennec. New users obviously don't need this, and I hope we get a lot of new users with the switch to the native UI, so there.
</p><p>
More seriously, for something that I initially described in a meeting as "doing a query in one DB and some inserts in another, how hard can it be?", this turned out to be a project that easily sank over a man-month of engineering time. Of course, putting it like that in the meeting was just asking for problems, and I'm going to use this post to give some background on what happened, and explain some more about our current History/bookmarks architecture and how databases work on Android.
</p>
<h3>The need for a new History & bookmarks database</h3>
<p>
First of all, you can ask yourself why we needed to change the format of browser history in the first place. In XUL Fennec, the history and bookmarks are managed by the Places system, the exact same as is powering desktop Firefox. It's a combination of SQLite with a C++ and JavaScript control layer.
</p><p>
The reason for rewriting the UI parts of Fennec in Java was to make it start up faster on Android, so that the user has something he can interact with instantly. Having bookmarks and history available early is quite important as that's what the user will most likely be interacting with, either through the AwesomeBar or the home screen that Native Fennec has. So we cannot wait for Gecko to load to access its Places system - history and bookmarks must be available immediately.
</p><p>
Secondly, Native Fennec now also includes Sync as a true Android service that works with the Android Accounts panel, background synchronization, etc. Having the history and bookmarks available natively in Java removes the need for Sync to load Gecko, which isn't something that you'd want a background service to do.
</p><p>
Lastly, storage on Android devices is often fairly limited. Places databases can get quite large if you've been using Firefox for a long time and browse a lot; for example, on my current desktop system the places.sqlite file is over 70 MB. This isn't desirable on Android, so we probably want to tune history expiration a bit differently.
</p>
<h3>Evolution of the new database</h3>
<p>
Our initial implementation of an Android history & bookmarks system was to simply use the built-in default database. This has some nice advantages such as automatic integration with the default Browser. If the user is transitioning from that to Firefox he will immediately have his history & bookmarks available to him. Feedback from the community came in that they weren't entirely comfortable with this. There was a feeling that although users trust Firefox with their private data, they don't necessarily trust Android or other applications that may access this data in the same way.
</p><p>
We started adding our own implementation, a local database (LocalDB) using Android's SQLite library, our own ContentProvider, and a wrapper layer (BrowserDB) that allowed most of the code to seamlessly switch between using the local database and Android's database.
</p><p>
As we started adding more features to our bookmarks system, and the Sync implementation was underway, it became clear at some point that we couldn't reasonably support them well with the limited bookmark support of Android itself. For example, by default the Android implementation can only remember the last 250 URLs you visited, which makes the AwesomeBar's search sort-of useless. So eventually, support for Android's history & bookmarks database was phased out.
</p>
<h3>Remapping Places & Database constraints</h3>
<p>
Importing the Places database into our LocalDB format is mostly a straightforward remapping, although there are some minor differences. The Places database is very normalized; for example, each bookmark and history entry refers to an entry in another table which contains the actual URL. LocalDB is a bit simpler in this regard, as the URLs are contained within the entries themselves, saving some database work - which can help on slow devices. We made a small design mistake here in the Favicons table. Favicons are big enough that we do not want them duplicated in the database, but the same favicon can be shared by multiple URLs. We caught this issue a bit too late and as a result it will not be fixed in the first release - see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=717428">bug 717428</a>. You can occasionally see constraint violation warnings in the Android logs if this problem occurs.
</p><p>
Another small complication in transferring bookmarks is that they form a folder (tree) structure. We added some (SQL key) constraints in the new LocalDB to ensure no folders get orphaned even if we have bugs in our first versions, but this means that we must reconstruct the tree structure in the correct order when remapping the bookmarks data. A simple breadth-first algorithm makes multiple passes over the bookmarks data until we reach the leaves. Any bookmarks left over were orphaned and are not imported.
</p><p>
Places "tags" are stored in a fairly complicated manner in Places (using a multitude of bookmark folders) and due to this they are not supported in Native Fennec, and likely won't be for a while, so there is no need to migrate them.
</p>
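The breadth-first, multi-pass bookmark reconstruction described above can be sketched as follows: each pass inserts only the bookmarks whose parent folder is already present, so the parent-key constraints are never violated, and whatever remains at the end is an orphan and gets skipped. The record and method names here are illustrative, not the actual Fennec schema:

```java
import java.util.*;

/** Multi-pass import of a bookmark tree: repeatedly insert bookmarks whose
 *  parent folder has already been inserted, so foreign-key constraints on
 *  the parent always hold. Leftovers are orphans and are not imported.
 *  Names are illustrative; not the actual Fennec code or schema. */
class BookmarkImport {
    /** parentOf maps bookmark id -> parent folder id; returns a valid
     *  insertion order (parents always before their children). */
    static List<Integer> importOrder(Map<Integer, Integer> parentOf, int rootId) {
        List<Integer> inserted = new ArrayList<>();
        Set<Integer> present = new HashSet<>();
        present.add(rootId);
        Set<Integer> remaining = new HashSet<>(parentOf.keySet());
        boolean progress = true;
        while (progress && !remaining.isEmpty()) {
            progress = false;
            for (Iterator<Integer> it = remaining.iterator(); it.hasNext();) {
                int id = it.next();
                if (present.contains(parentOf.get(id))) {
                    inserted.add(id);  // parent already exists: safe to insert
                    present.add(id);
                    it.remove();
                    progress = true;
                }
            }
        }
        return inserted; // anything still in `remaining` is an orphan
    }
}
```

The loop terminates as soon as a full pass makes no progress, which is precisely the point at which everything left over is unreachable from the root.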
<h3>Some fragmentation pains</h3>
<p>
During the development, at some point bug reports started coming in from some users that were migrating to Native Fennec, didn't get any bookmarks & history imported, and got an <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=713228">error "database is corrupted"</a>. After investigation, it turns out this was problem of combined SQLite and Android fragmentation. Firefox XUL (and Desktop) ships with the most recent SQLite revision, which means that it is currently using Write-Ahead-Logging for database operations.
</p><p>
Various Android devices come with various versions of SQLite, and only since (approximately) Honeycomb are the SQLite's new enough to understand the new database format that WAL requires. So older Android devices couldn't correctly read Firefox's SQLite databases with their built-in SQLite support.
</p><p>
We ended up writing a bridge driver that uses JNI to talk to Firefox's SQLite library, and provides an interface compatible with the Android SQLite one. This turned out to be useful for a number of other things too, such as accessing the original Passwords and Form History database from Gecko. In general, this kind of experience made much of the work spent on this feature worthwhile - quite a bit of it was usable in other parts of the project as well.
</p><p>
Another small fragmentation issue we ran into is that foreign key constraints are not supported on some older Android devices. Because we basically only use these for error checking, it was not a large problem.
</p>
<h3>Performance aspects</h3>
<p>
The migration initially ran in the background on the first run, but it turns out this leads to horrible performance as the user tries to interact for the first time while there is heavy disk I/O and processing happening at the same time. Because that kind of nullifies the message we were trying to send that Native Fennec is a lot faster, we opted to show a modal screen instead that informs the user of what is happening. Of course, popping up a loading screen doesn't exactly convey a message of speediness either, but we should be fine as long as the user only sees it once and it doesn't take too long.
</p><p>
The latter was a bit of a problem though. Initial implementations ran at a speed of a few bookmarks or history entries per second. It's not unusual for Firefox users to have hundreds of bookmarks and hundreds of thousands of history entries. Android provides an API to allow ContentProviders to support database transactions, and implementing support for those on both sides sped up our migration speed to about 50-100 records per second. This was another example of the work on this little-seen feature providing benefits elsewhere, as Sync is now also using the faster API.
</p><p>
There are some interesting interactions with the batch API in Android if part of a transaction fails. Basically, for some operations (applyBatch) it is implementation-defined what to do if an operation fails, and the API seems to be designed to really allow both options (continue or abort) to work. For our purposes, doing as much as possible was the better option, but this turns out to have some problems, because the API to tell SQLite what to do in such a case isn't available in Android <= 2.2, which we wanted to support. This led to some <a href="https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/db/BrowserProvider.java.in#2149">"interesting"</a> code that may make your eyes bleed.
</p>
<h3>Sync integration </h3>
<p>
Because that's still not fast enough to migrate a user's entire history on the first run, we also made an API that allows Sync to interact with the migration feature. The first run will migrate about 2000 entries and the entire bookmarks table. For users without Sync, this is all they'll get, but it is enough to make the AwesomeBar quite usable and covers several weeks of browsing even for moderately heavy users.
</p><p>
If the user has Sync set up, the idea was that Sync would continue the migration from the database before starting to use the network. Feedback from the beta made us reconsider that approach: the problem with delaying Sync until the Places database has been entirely migrated is that the user will not get his more recent history updates until his updates from potentially weeks or months ago have been migrated. This leads to a suboptimal experience, and it's something we're going to try to address for the release.
</p><p>
The XUL to Native migration includes a feature that tries to migrate your Sync credentials automatically if you have them set up in XUL Fennec. Unfortunately <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=752514">this had a bug</a> in the very first beta release, so if you were one of the earliest adopters this will not have worked for you. The issue was fixed in the second beta release.
</p>
<h3>Future work </h3>
<p>
Despite the first beta being out, work on this crucial feature is far from finished. Aside from the points already mentioned above, there are a few more notable areas that we plan to get addressed before the release ships:</p>
<ul>
<li>After support for the Android bookmarks & history database was dropped, users who were previously using the built-in browser have no way to import their bookmarks and history into Fennec right now. We're currently working on adding this feature as a first priority. Eventually we will add export functionality as well (though we'd rather make Fennec so awesome that nobody ever wants to use the built-in browser ever again!).</li>
<li>There is currently no expiration in the history database. So although the format is a bit more compact than Places, it could theoretically grow unbounded if the user visits a lot of sites on his phone and keeps on using Firefox. The algorithms Places uses for this are fairly complicated, so we need to study which parts are meaningful for a mobile device, and whether we want to tune them differently for typical phone or tablet use scenarios.</li>
</ul>Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com9tag:blogger.com,1999:blog-5232577621384962517.post-37439504630452508162012-04-14T20:11:00.000+02:002012-04-14T20:11:16.550+02:00Java watI'm sure quite a few of you have seen the <a href="http://www.urbandictionary.com/define.php?term=wat">"wat"</a> talk exposing <a href="https://www.destroyallsoftware.com/talks/wat">some funny behaviors of Ruby and JavaScript</a>. Working in the industrial/enterprise/boring Java language most of the time for Native Fennec's development, we've been missing out on such interesting moments. Nevertheless, I present you this gem, from <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=745384#c1">bug 745384</a>:<br />
<ul><li><pre class="bz_comment_text" id="comment_text_3">(<i>null</i> instanceof <anything>) = <span style="color: red;">false</span></pre><div class="bz_comment_text" id="comment_text_3" style="font-family: inherit;">This is guaranteed by the language spec:</div><blockquote class="tr_bq"><div class="bz_comment_text" id="comment_text_3" style="font-family: inherit;"> "At run time, the result of the instanceof operator is true if the value of the RelationalExpression <b>is not null</b> and the reference could be cast (§15.16) to the ReferenceType without raising a ClassCastException. <b>Otherwise the result is false</b>." </div></blockquote></li>
</ul><ul><li><pre class="bz_comment_text" id="comment_text_3">IsInstanceOf(<i>NULL</i>, <anything>) = <span style="color: #38761d;">JNI_TRUE</span></pre><div class="bz_comment_text" id="comment_text_3" style="font-family: inherit;">This too is guaranteed by the language spec:</div><blockquote class="tr_bq"><div class="bz_comment_text" id="comment_text_3" style="font-family: inherit;"> "Returns JNI_TRUE if obj can be cast to clazz; otherwise, returns JNI_FALSE. <b>A NULL object can be cast to any class</b>."</div></blockquote></li>
</ul>I don't know what the people at Sun were thinking when they designed it this way, but right now I'm all:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtH7SCT3aytXgaEYUQdt6UGzU_TYaPRdhyphenhyphenE0DszIktHKJiZfK8oChwHnzbuNXX0jkhpGDKm8GYnYCVZsbW2sSjd0dlYiX9HjYMrsyex18QfaX35nrBi_EusdSJ3yXhh30bd9XIo5ZtCE9j/s1600/Wat.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtH7SCT3aytXgaEYUQdt6UGzU_TYaPRdhyphenhyphenE0DszIktHKJiZfK8oChwHnzbuNXX0jkhpGDKm8GYnYCVZsbW2sSjd0dlYiX9HjYMrsyex18QfaX35nrBi_EusdSJ3yXhh30bd9XIo5ZtCE9j/s400/Wat.jpg" width="400" /></a></div><br />
For those wondering, these are the relevant Java bugs:<br />
<ul><li><a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4074494">http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4074494</a></li>
<li><a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4081023">http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4081023</a></li>
</ul>Note the "Closed, Not a Defect, Fixed in spec." It's not a bug, it's a feature. :-)<br />
<div class="separator" style="clear: both; text-align: center;"></div>Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com0tag:blogger.com,1999:blog-5232577621384962517.post-14963600967128605232012-02-06T17:42:00.005+01:002012-02-07T08:45:35.506+01:00New SafeBrowsing backendToday, the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=673470">SafeBrowsing rewrite</a> that Dave Camp and I have been working on for several months finally landed in the Mozilla Nightlies, and it should be part of the Firefox 13 release, narrowly having missed Firefox 12. It reduces the disk footprint of our anti-phishing and malware protection from about 40-50Mb to 5-6Mb, changes all related I/O from pure random access to a single serial read, and refactors a single 4000+ line C++ file into a bunch of modules. An earlier part of this work landed in Firefox 9 and reduced the memory footprint from potentially up to 40-100M to 1.2M, as well as removing the need to do some random I/O on every page load.<br />
Aside from the performance gains, the reduced footprint is an essential step to enable us to extend our SafeBrowsing protection to Mobile devices, which is why we undertook this in the first place.<br />
<br />
It was an interesting assignment, and being my first real project for Mozilla, a bit more involved than we thought at first. I blogged in July last year about our plans for this feature. Some of the optimizations we had in mind didn't work out, while others did end up being implemented. <br />
<br />
<b>Eliminating the host keys</b><br />
<br />
One of the things touched upon in the <a href="http://www.morbo.org/2011/07/bringing-safebrowsing-to-mobile-devies.html">previous blog post</a> was that we used to store two 32-bit hash prefixes for every blocked entry: one encoding only the host and the other encoding the full URL. Strictly speaking, we only need the full URL part. The old SQLite based code used the host key to SELECT and load all keys for a certain domain at once, but our new in-memory prefix-trie based approach has no such needs. However, as Justin Lebar already <a href="http://www.morbo.org/2011/07/bringing-safebrowsing-to-mobile-devies.html?showComment=1312385493979#c2560968459163438363">touched upon in the previous blog</a>, this does significantly increase the likelihood that we get a false positive. We now expect a false positive for roughly every 7000 URLs visited. This will not cause us to block out any legitimate sites, as any positive hit in the local database is queried against the full, 256-bit hash at a remote server (hosted by Google, who provides the SafeBrowsing data).
<br />
This does mean we will increase the traffic to this remote server by a large factor. Scary as it may sound, a back-of-the-envelope estimate shows it's not really that bad: say there are about 420M Firefox users, browsing for 8h/day. They load on average 1 URL per second. This means about 140M URL loads per second, causing about 20000 hash requests per second to Google. Google confirmed they can handle this with ease.<br />
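<p>As a sanity check, the arithmetic behind both numbers can be reproduced directly (the ~600K prefix count is the database size quoted in the earlier post on bringing SafeBrowsing to mobile):</p>

```python
# With ~600K 32-bit prefixes stored, a random URL's truncated hash
# matches one of them about 600000 / 2**32 of the time.
entries = 600_000
fp_rate = entries / 2**32
print("one false positive per %.0f URLs" % (1 / fp_rate))  # ~7000 URLs

# 420M users browsing 8h/day means a third of them are active at any
# given moment; at 1 URL load per second each:
active_loads_per_sec = 420e6 * 8 / 24            # ~140M URL loads/second
hash_requests_per_sec = active_loads_per_sec * fp_rate
print("%.0f hash requests/second to the server" % hash_requests_per_sec)  # ~20000
```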
<br />
Now, there is still a problem when doing this: any collision will appear for all users on exactly the same URLs. This means that if you're unlucky enough to be a site owner with a URL that happens to collide, every visitor to your site will have a slightly slower browsing experience. Even worse, should you get linked in a popular forum, or be in the news, there will be a storm of false positives to the server all at once. We thought this to be problematic enough that we implemented a workaround: every user will generate a unique randomization key and re-randomize all his hashes with it. Collisions will happen on a different URL for every user, and consequently also be much better spread through time.<br />
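<p>The re-randomization scheme can be sketched as follows. This is an illustrative Python sketch, not the actual Firefox code; the SHA-256 mixing here is an assumption, as the scheme only requires some deterministic per-user scrambling of the stored values:</p>

```python
import hashlib

def randomize_prefix(prefix: int, user_key: bytes) -> int:
    # Re-hash a stored 32-bit prefix together with a locally generated,
    # per-user random key. The scrambled values are what gets stored, so
    # hash collisions occur on different URLs for different users.
    digest = hashlib.sha256(user_key + prefix.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")

alice_key = b"alice-random-key"   # hypothetical per-profile keys
bob_key = b"bob-random-key"
prefix = 0xDEADBEEF               # a prefix as received from the update server

# The same server-provided prefix ends up stored as different values:
print(hex(randomize_prefix(prefix, alice_key)))
print(hex(randomize_prefix(prefix, bob_key)))
```

<p>A URL that happens to collide with a raw prefix for one user will, with overwhelming probability, not collide for another, which spreads the false-positive traffic out over URLs and time.</p>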
<br />
<b>Eliminating chunk numbers</b><br />
<br />
After some discussion, it turned out eliminating the chunk numbers isn't as easy as hoped. First of all, the observation in the previous blog post that chunk expiries only seem to happen when the chunks are in fact old doesn't hold after observing the protocol for a longer time. It also happens very regularly that a chunk is added, deleted, and added back again, particularly in the malware list. In those cases, it is important to know which add chunk a delete is referring to, so it won't delete the later add. It would still be possible to deal with that if the server recalculates the prefixes to send for every client, but this is less efficient on the server side compared to the bigger, more static downloads that the server can point to now, and which are easily mirrored on Google's network.<br />
<br />
<b>Sub prefixes compress harder</b><br />
<br />
In line with the previous paragraph, it happens that we receive sub prefixes for add prefixes we never received. These must be kept around until we receive an expire for them, as we can't know if the add they belong to is outdated or just not downloaded yet. Note also that we usually receive updates backwards, i.e. the most recent data will be sent to the clients first, as it's the one believed to be most relevant. Because sub prefixes contain both an add and a sub chunk, they are also bigger than add prefixes. This causes the eventual size of the database to be a bit more than the minimum guessed in the previous blog post, which more or less ignored sub prefixes entirely. If you peek in your profile, you can see that the goog-malware-shavar.sbstore will tend to be the biggest file: this is exactly due to the many sub prefixes in the malware list.
<br />
<b>Detection performance</b><br />
<br />
It should be noted that these improvements are purely focused on the footprint of the feature. They will improve the resource usage of the browser, but they do not change the detection performance in any way.<br />
<br />
<b>NSS Labs Report</b><br />
<br />
In what is somewhat of a funny coincidence, the same day I am writing this blog post, NSS Labs published a <a href="http://www.nsslabs.com/assets/noreg-reports/Did%20Google%20pull%20a%20fast%20one%20on%20Firefox%20and%20Safari%20users_.pdf">report "Did Google pull a fast one on Firefox and Safari users?"</a>. The main points of this report shouldn't be much news as I pointed out over half a year ago the discrepancy between Chrome, and Firefox and Safari in my <a href="http://www.morbo.org/2011/08/note-on-malware-detection-performance.html">previous blog post, as well as the reason</a> ("Performance differences between Firefox, Chrome and Safari").
<br />
I have two remarks to the report: one, as I've already pointed out in the past, false positive control is an important part of effective malware detection. Internet Explorer flags many malware sites, but it also<a href="http://forums.dropbox.com/topic.php?id=52107"> flags legitimate sites</a>, undermining the true effectiveness.<br />
<br />
Secondly, the problem isn't so much that the "new" SafeBrowsing protocol is proprietary or non-documented; it's implemented in Chrome and Chromium is open source, so at the very worst we can go study that code to see how to implement it. The problem is that permission is required to use Google's SafeBrowsing servers, and <strike>Firefox does NOT have permission to use the download protection list</strike>. Edit: Please see the statement from Ian Fette below.Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com2tag:blogger.com,1999:blog-5232577621384962517.post-52053267165622348482011-08-17T11:07:00.001+02:002011-08-17T11:09:27.778+02:00A note on malware detection performance<div style="text-align: justify;">Yesterday, most news outlets published stories about a <a href="https://www.nsslabs.com/assets/noreg-reports/2011/nss%20labs_q3_2011_browsersem%20GLOBAL-FINAL.pdf">study done by NSS Labs</a> showing that Internet Explorer 9 provides overwhelmingly better protection against phishing and malware sites than competing browsers, including Firefox. Specifically, the claim is that Internet Explorer detects 99.2% of all malicious internet content, compared to a measly 7.6% for Firefox.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">This isn't the first such study conducted. Over 2 years ago, at the release of Internet Explorer 8, a similar study was done with similar but less extreme results. That study was <a href="http://www.thetechherald.com/article.php/200912/3268/Can-you-trust-the-NSS-Labs-report-touting-the-benefits-of-IE8">harshly criticized</a> for having a number of flaws. There is no indication that any of the criticism was addressed in the new study. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">Although many in the community have yet again pointed out the problematic aspects of the study, and hence properly put the results under scrutiny, I believe some issues are worthy of a more detailed explanation.</div><br />
<b>Accuracy of the reported results</b><br />
<br />
<div style="text-align: justify;">The main issue with the study is that it does not consider the tradeoff between false positives, namely, reporting malware when a download is in fact safe, and false negatives, namely, allowing the user to visit a dangerous site or allowing him or her to download malware. Due to the way the test was conducted, false positives were mostly neglected. In other words, the browser that simply gave most warnings, regardless of accuracy, had a large advantage compared to its competitors.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">The SmartScreen service in Internet Explorer will give the user a warning as soon as he or she tries to download any uncommonly downloaded piece of software. This is done regardless of whether the download is malware or not. It should be obvious that by crying foul in every uncertain situation, you will cry foul in a dangerous situation, too, leading to a very high score in this test.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">Unfortunately, giving a large number of false positives severely diminishes the value of the warnings. If you claim "the sky is falling" at every opportunity, users will start to ignore the warnings, and the effectiveness of the protection drops from 99.2% to 0%, which is <i>infinitely worse </i>than what other browsers offer. The situation should be familiar to Microsoft, who faced <a href="http://www.windowsitpro.com/blog/security-blog-12/windows-7/microsoft-quotmalware-authors-hate-uacquot-140062">similar problems with the initial versions of their User Account Control feature</a> in Windows Vista. It wasn't until the number of false positives dropped that the protection really became effective.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">The report's tackling of false positives is limited to the following paragraph. </div><blockquote><i>In addition, NSS maintains a collection of ‘clean URLs’ which includes such sites as yahoo, Amazon, Microsoft, Google, NSS Labs, major banks, etc. Periodically clean URLs were run through the system to verify browsers were not over-blocking.</i></blockquote><div style="text-align: justify;">Aside from the lack of real data, it's astonishing how biased this sample is. As if any major browser vendor, or provider of malware protection, would put out an update and fail to notice that he is blocking his users from visiting a major website! Of course, extending the sample even a little bit to, for example, homepages of random people, would have immediately shown Internet Explorer to produce false positives.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">The problems of trading false positives for false negatives and evaluating differing tradeoffs are not new - they're a known topic for anti-spam filters, which must deal with similar tradeoffs when being compared against each other. One representation that is often used in anti-spam research papers is to present the results in the form of <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curves</a>. That certainly would have been more informative than a single percentage.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">It's quite possible that Internet Explorer does in fact offer the best malware and phishing protection of the popular browsers. Microsoft has worked hard to improve their security reputation, has clearly focused on this area, and I'm sure they have smart engineers capable of writing good software. However, studies like these do their work no justice, don't help users make informed choices, and don't help us identify the real risks users are exposed to. This is a missed opportunity.</div><br />
<b>User privacy in collaborative filtering</b><br />
<br />
<div style="text-align: justify;">The SmartScreen Filter that Internet Explorer uses is a collaborative filter. By its nature, such a filter is dependent on gathering information from its users. The exact information gathered is described in their <a href="http://windows.microsoft.com/en-US/internet-explorer/products/ie-9/windows-internet-explorer-9-privacy-statement">privacy policy</a>. I've reproduced some relevant paragraphs below.</div><blockquote><i>If you opt in to SmartScreen Filter, it first checks the address of the webpage you are visiting against a list of high-traffic webpage addresses stored on your computer that are believed by Microsoft to be legitimate. <b>Addresses that are not on the local list and the addresses of files you are downloading will be sent to Microsoft</b> and checked against a frequently updated list of webpages and downloads that have been reported to Microsoft as unsafe or suspicious.<br />
<br />
When you use SmartScreen Filter to check websites automatically or manually, <b>the address of the website you are visiting will be sent to Microsoft, together with standard computer information</b> and the SmartScreen Filter version number. To help protect your privacy, the information sent to Microsoft is encrypted. Information that may be associated with the address, such as <b>search terms or data you entered in forms might be included</b>.<br />
<br />
Some information about files that you download from the web, such as<b> name and file path, may also be sent to Microsoft</b>. <b>Some website addresses that are sent to Microsoft may be stored along with additional information, including web browser version, operating system version</b>, SmartScreen Filter version, the <b>browser language, the referring webpage</b>, and information about whether Compatibility View was enabled for the website.</i></blockquote><div style="text-align: justify;">While the amount and nature of the information sent to Microsoft may indeed be necessary to achieve the level of protection SmartScreen is claimed to give, it obviously comes at a severe cost of user privacy. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">I don't believe most people at Mozilla think it's reasonable to collect the information above, nor would I expect most of our user-base to feel comfortable with it. For this reason, it's unlikely we would put up an identical service, let alone enable it by default. </div><div style="text-align: justify;">The malware protection included in Firefox uses a cookie to provide Quality-Of-Service information wrt. updates to the providers of malware lists (currently Google). Although we felt that was a reasonable tradeoff, some users nevertheless object to it, and some effort is underway to remove even this.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">That being said, our browser is extensible and has a wide variety of third-party add-ons, including some that extend it to give similar functionality, if the user feels the privacy versus security tradeoff is acceptable. A popular one seems to be <a href="https://addons.mozilla.org/en-US/firefox/addon/wot-safe-browsing-tool/">Web-Of-Trust</a>. Feel free to check out our <a href="https://addons.mozilla.org/en-US/firefox/extensions/privacy-security/">security & privacy add-ons site</a> to see additional protection available for Firefox.</div><br />
<b>Performance differences between Firefox, Chrome and Safari</b><br />
<br />
<div style="text-align: justify;">One interesting thing that did come out of this study is that Chrome offers slightly better protection than Firefox or Safari. This is interesting, because all three browsers use exactly the same malware and phishing protection: the <a href="http://code.google.com/apis/safebrowsing/">Google SafeBrowsing API</a>. Firefox has some minor tweaks compared to Chrome to improve user privacy, but those should not have worsened the results, so it was expected that they would score similarly. </div><br />
<div style="text-align: justify;">I spent some time looking at this, and what happened is that Google has been enhancing their SafeBrowsing system to detect malware downloads, tested this protection in Chrome and recently included it in the stable releases. Because the public documentation for the API hasn't been updated, Firefox and Safari have so far not implemented this extension. </div><br />
<div style="text-align: justify;">We are currently discussing with Google on rectifying this, so Firefox will probably soon include the improved protection as well.</div><br />
<b>Concluding remarks</b><br />
<br />
<div style="text-align: justify;">Being exposed to malicious content doesn't mean being infected by it. We do what we can to keep the browser secure and updated, and to inform the user when his plugins are out of date, and potentially exposed. We have made improvements that allow a user to more easily spot when he or she is being phished. Detecting malware is just one protection. Making it ineffective is another.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">Regardless of what we think of the Internet Explorer results, we're still left with a claimed 7.6% detection rate for Firefox <i>out of the box</i>. This means that our current default detection is largely ineffective, and users have much better odds to be exposed to malicious content than that they have of being blocked by us. Even if the study numbers are inaccurate, this order-of-magnitude result probably does hold. I doubt we currently detect and block more than 50% of the malware out there.</div><br />
<div style="text-align: justify;">That is not a result to be proud of, and we should improve it if possible.</div>Gian-Carlo Pascuttohttp://www.blogger.com/profile/14452657281217735802noreply@blogger.com2tag:blogger.com,1999:blog-5232577621384962517.post-51670078065155198242011-07-29T14:10:00.003+02:002011-07-30T11:46:31.612+02:00Bringing SafeBrowsing to mobile devices<div style="text-align: justify;"><a href="http://www.mozilla.com/en-US/m/">Firefox Mobile</a> is in most respects the exact same browser as you get with Firefox on the desktop. Nevertheless, to make it usable on mobile devices, some small compromises have been made. One of these is that <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=470876">SafeBrowsing is currently not enabled in Mobile builds</a>. We are now pretty close to fixing this, but there are some interesting issues remaining.</div><br />
<b>What is SafeBrowsing?</b><br />
<div style="text-align: justify;">SafeBrowsing is a <a href="http://code.google.com/p/google-safe-browsing/wiki/Protocolv2Spec">Google-designed protocol</a> that allows Firefox and Chrome to alert the user whenever (s)he visits a webpage that is either phishing for private information or contains malware. It works by incrementally downloading a compressed and encrypted list of "bad" URLs and checking every URL the user wants to visit against the locally stored list. If we find a potential match, we query the SafeBrowsing server directly to verify if it's a true match, and to see if it is still valid. If it is, we pop up a warning message and block the page from loading.</div><div style="text-align: justify;">Mozilla has added some additional features in Firefox to ensure our users' privacy when this feature is enabled. For example, when the Google server is queried to verify a URL, Firefox will send several random requests along with the true one, masking the exact website being visited.</div><br />
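<p>In pseudocode, the lookup path looks roughly like this (an illustrative Python sketch, not the Firefox implementation; URL canonicalization and the privacy-masking decoy requests are omitted, and the names are made up):</p>

```python
import hashlib

blocked_prefixes = set()  # 32-bit prefixes from the incrementally downloaded list

def url_prefix(canonical_url: str) -> int:
    # Truncated SHA-256 of the (already canonicalized) URL.
    return int.from_bytes(hashlib.sha256(canonical_url.encode()).digest()[:4], "big")

def check_url(url, verify_with_server):
    # Local check first: the overwhelming majority of page loads never
    # touch the network. Only on a local prefix hit do we ask the server
    # whether the full hash really matches and is still valid.
    if url_prefix(url) not in blocked_prefixes:
        return "safe"
    return "blocked" if verify_with_server(url) else "safe"

# Example: blocklist one URL, then check it and an innocent one.
blocked_prefixes.add(url_prefix("http://evil.example/"))
print(check_url("http://evil.example/", lambda u: True))   # blocked
print(check_url("http://good.example/", lambda u: True))
```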
<b>Why is/was it not enabled in Firefox Mobile?</b><br />
<br />
<div style="text-align: justify;">SafeBrowsing was initially prototyped in Google Toolbar for Firefox, and eventually contributed by Google engineers into Firefox itself just before version 3.0. The implementation in Firefox, while working, isn't terribly efficient. There are several reasons for that, including the deadline to get it shipped in Firefox 3.0 and the need to make protocol changes to be able to handle the entire Firefox user-base without putting undue stress on the Google-provided servers.</div><div style="text-align: justify;">One of the shortcuts made was to use SQLite to store the URL database. While SQLite is a fine database, it's not a terribly efficient way to store a highly compressed dataset. Furthermore, because it needs to be probed on each and every URL loaded, and the loads are essentially random, it requires caching to deliver good performance. When processing an update, <a href="http://bugs.launchpad.net/firefox/+bug/215728">the behavior can get really bad</a>.</div><div style="text-align: justify;">The net result is that the current SafeBrowsing implementation can use up to <b>50Mb</b> of disk space, and in some circumstances, <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=650649">the same amount of memory.</a> Gobbling up such an amount of resources would not be acceptable to people using a mobile device, so we had to disable it. </div><div style="text-align: justify;">Because downloading and constructing the list happens in the background, the memory usage of Firefox can go up even when you are not visiting any sites. This causes many desktop users to mistake it for a memory leak, and it has hence been flagged as a priority in the MemShrink efforts, too.</div><br />
<b>How is this being solved?</b><br />
<br />
<div style="text-align: justify;">The URL list is distributed as a set of <a href="http://en.wikipedia.org/wiki/SHA-256">SHA-256</a> hashes truncated to 32-bits (hereafter called prefixes), generated by passing the real URLs through a canonicalization algorithm and hashing the result. Each bunch of prefixes has an associated chunk-id, which is used during updates, and a list-id, which tells us whether it is malware or a phishing attempt. The real problem is how to store these 32-bit values efficiently and still be able to quickly search through them. The current SafeBrowsing database contains about 600K entries. Each entry contains both a 32-bit hash of the hostname and the prefix hash. We can drop the hostname hashes, as they are only a small performance optimization (*). The raw storage required, if we further assume the chunk-id and the list-id can be merged into a single 32-bit number, is about 5Mb. Simply keeping the list sorted by prefix allows O(log(n)) look-up and O(n) processing on updates, which is fast enough as long as we keep the list in memory.</div><div style="text-align: justify;"><br />
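<p>A minimal sketch of that in-memory layout (illustrative, not the Firefox code): 32-bit prefixes kept in a sorted array and looked up with a binary search, at roughly 8 bytes per entry once the merged chunk/list id is counted:</p>

```python
import array
import bisect

# Sorted array of 32-bit prefixes (4 bytes each with typecode "I").
prefixes = array.array("I", sorted([0x00000005, 0x1BADB002, 0xCAFED00D]))

def contains(p: int) -> bool:
    # O(log n) binary search in the sorted prefix array.
    i = bisect.bisect_left(prefixes, p)
    return i < len(prefixes) and prefixes[i] == p

print(contains(0xCAFED00D))  # True
print(contains(0x12345678))  # False

# Raw storage for the real database: ~600K entries at 4 bytes of prefix
# plus 4 bytes of merged chunk-id/list-id each.
entries = 600_000
print("%.1f MB" % (entries * 8 / 1e6))  # 4.8 MB, the "about 5Mb" above
```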
</div><div style="text-align: justify;">While 5Mb is an order of magnitude improvement over the 50Mb we have now, it is interesting to try to improve it further. Mobile devices have a limited amount of storage, and particularly for devices that have no option to insert extra memory cards, space comes at a premium. Firefox consumes about 25Mb of flash storage (which already generates a lot of feedback that it is "way bloated"), and increasing that by 20% just for phishing protection seems excessive. </div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><b>Current improvements</b></div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">There are 2 interesting data-structures that allow further compression of this prefix set while still allowing efficient lookup: Bloom filters and Prefix Tries. Note that the prefixes themselves, being truncated SHA-256 hashes, are essentially random numbers and will hence resist traditional compression techniques.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> work by passing each prefix through a series of hashes, and putting a mark in a bit array for each hash. Because the marks might collide, it is a probabilistic algorithm with potentially false positives. On the upside, you can trade the storage requirements of a Bloom filter against the probability of false positives. We considered using a Bloom filter for our SafeBrowsing database, but the mere possibility of false positives requires a secondary storage to correctly handle updates and to prevent leaking privacy information when it's not strictly necessary. Google found this an acceptable compromise for desktop usage and chose this solution in their re-implementation of SafeBrowsing in Chrome, backing up a 1.9Mb Bloom filter in memory with a larger raw store on disk.</div><div style="text-align: justify;"><br />
<a href="http://en.wikipedia.org/wiki/Trie">Prefix Tries</a> work by observing that, when sorted, many prefixes will have the leading part in common. By storing only the differences between those prefixes, good compression can be achieved. Given that the list of prefixes is sorted, and we occasionally code a full value to allow restarting the decompression, fast lookup is possible by doing a binary search in the uncompressed values and then decompressing the deltas. On the SafeBrowsing prefixes, about 50% compression is achievable while still maintaining fast lookup. Prefix Tries do not suffer from false positives, and do not need backup storage.</div><div style="text-align: justify;"><br />
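<p>The delta-coding idea can be sketched in a few lines (illustrative Python, not the code from bug 669410): store a full 32-bit value at regular restart points with small deltas in between, binary-search the restart values, then decompress at most one block. For simplicity this sketch assumes every gap fits in the delta width; a real implementation would force an early restart on overflow:</p>

```python
import bisect

RESTART = 4  # store a full 32-bit value every RESTART entries

def compress(sorted_prefixes):
    # Split a sorted prefix list into full restart values plus, per block,
    # the (small) differences between consecutive prefixes.
    fulls, blocks = [], []
    for i, p in enumerate(sorted_prefixes):
        if i % RESTART == 0:
            fulls.append(p)
            blocks.append([])
        else:
            blocks[-1].append(p - sorted_prefixes[i - 1])  # fits in 16 bits here
    return fulls, blocks

def contains(fulls, blocks, p):
    # Binary search the uncompressed restart values, then decompress the
    # deltas of a single block while scanning for a match.
    i = bisect.bisect_right(fulls, p) - 1
    if i < 0:
        return False
    value = fulls[i]
    if value == p:
        return True
    for d in blocks[i]:
        value += d
        if value == p:
            return True
    return False

data = sorted([10, 15, 22, 30, 1000, 1004, 1010, 1050, 70000])
fulls, blocks = compress(data)
print(all(contains(fulls, blocks, p) for p in data))  # True
print(contains(fulls, blocks, 23))                    # False
```

<p>With a 4-byte restart value followed by 2-byte deltas, each block of 4 entries costs 10 bytes instead of 16 raw, which is in the ballpark of the ~50% compression mentioned above.</p>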
</div><div style="text-align: justify;">A preliminary implementation of Prefix Tries that allows us to reduce the memory requirements for desktop Firefox from a worst case of 50Mb to about 2.1Mb is now ready in <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=669410">bug 669410</a> and is expected to be in the Nightly builds soon. We will probably reduce this to 1.2Mb after verifying the performance impact of dropping the (*) optimization mentioned above. Work is ongoing to add (compressed) storage of chunk-id to this, removing SQLite entirely and reducing the disk space requirement to about 2.4Mb. This should be good enough to enable SafeBrowsing on mobile.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><b>Future optimization</b></div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">While this is a nice improvement, we do want "smaller! faster!" whenever possible, so it's interesting to study whether we can go beyond this. Because the prefixes are random values, I personally don't believe much further gain can be had there (though I'd be thrilled to be proven wrong!). Perhaps if there were some way to obtain the original URLs, yes, but I understood from Google that they cannot do this as they license parts of the URL list from a third party.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">The chunk information is an interesting candidate. The chunk numbers come from a more or less consecutive range, many chunk numbers reappear multiple times, and there are currently only about 16K distinct chunks in use. If we could sort the data by chunk-id and use delta compression on them, they could probably be stored in less than a bit per entry. Unfortunately, sorting the data would break the prefix trie compression and we end up gaining 2 bytes/entry and losing 2 bytes/entry at the same time. Sorting by chunk makes the lookups slower, so this isn't useful.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">Now I have a crazy idea: what if we didn't store the chunk-id information at all? According to my reading of the specification, chunk-id information is only really needed in 2 specific cases:</div><ol><li>When the client connects, it has to tell the server which chunk ranges it currently has. There is no need to actually store the chunk-id for every prefix to know this. We can simply track <b>how many</b> prefixes are in every chunk, and increment or decrement this number on every chunk add or delete.</li>
<li>The protocol allows the server to send an adddel or subdel command, which requires the client to expire an entire chunk. Note that normal add or sub commands don't require us to know the chunk-id, as they include the prefixes on which they act as well.</li>
</ol>So what do we do if we get an adddel or subdel?<br />
<br />
<i>We drop the entire database and start downloading again.</i><br />
<br />
<div style="text-align: justify;">I know you must be thinking I am crazy and wasteful at this point, but observing how Google's server actually behaves, it might very well work perfectly. The key realization is that expiring entire chunks is not a normal occurrence. What is far more typical is that individual entries in a chunk are expired piece by piece in due time, until the chunk is empty. The only way I seem to be able to convince the server to expire entire chunks is to purposely claim I have ancient chunks in my database. In normal usage, meaning a regular connection to the internet, those expiries don't seem to occur. And if we do have a database with ancient chunks, dropping and re-downloading it entirely isn't going to be that wasteful anyhow.</div><br />
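The per-chunk bookkeeping from point 1 above could be sketched like this. The class and method names are invented for illustration; the idea is just a counter per chunk-id plus the ability to report the contiguous chunk ranges the update request needs, without a chunk-id stored per prefix.

```python
from collections import defaultdict

class ChunkCounter:
    """Sketch: count prefixes per chunk instead of storing a
    chunk-id on every prefix. Names here are made up."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, chunk_id, num_prefixes=1):
        self.counts[chunk_id] += num_prefixes

    def remove(self, chunk_id, num_prefixes=1):
        self.counts[chunk_id] -= num_prefixes
        if self.counts[chunk_id] <= 0:
            del self.counts[chunk_id]  # chunk fully expired

    def chunk_ranges(self):
        """Render live chunks as ranges, e.g. "1-3,7", the shape
        the client reports to the server on connect."""
        ids = sorted(self.counts)
        ranges, start = [], None
        for c in ids:
            if start is None:
                start = prev = c
            elif c == prev + 1:
                prev = c
            else:
                ranges.append((start, prev))
                start = prev = c
        if start is not None:
            ranges.append((start, prev))
        return ",".join(f"{a}-{b}" if a != b else f"{a}"
                        for a, b in ranges)
```

When a chunk's count reaches zero it drops out of the reported ranges by itself, which is exactly how chunks normally die: entry by entry, not via adddel/subdel.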
<div style="text-align: justify;">I believe this behavior would be perfectly fine to do and still comply with the current protocol, but obviously something like this shouldn't be implemented until the Google guys can confirm they don't think we're insane and that the server does indeed behave as described (or can be patched to do so).</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">One optimization that can be applied server-side is to estimate whether it would be cheaper to send an adddel/subdel, and hence drop the database, or to send the sub chunks for all entries it wants to remove. This ensures that the absolute worst-case behavior is in no case more than a doubling of bandwidth, with the typical case being no change in required bandwidth at all. Clients who need this optimization can be detected by the client-id and client-ver that the protocol already requires.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">If we're willing to define a new protocol, the chunk-id numbers can go entirely, and be replaced by adding a sequential number to every update. In that case, the client only has to tell the server the first and last update it has ever seen, and neither the server nor the client needs to transmit or store chunk-ids at all.</div><div style="text-align: justify;"><br />
</div>