Aside from the performance gains, the reduced footprint is an essential step to enable us to extend our SafeBrowsing protection to Mobile devices, which is why we undertook this in the first place.
It was an interesting assignment, and being my first real project for Mozilla, a bit more involved than we thought at first. I blogged in July last year about our plans for this feature. Some of the optimizations we had in mind didn't work out, while others did end up being implemented.
Eliminating the host keys
One of the things touched upon in the previous blog post was that we used to store 2 32-bit hash prefixes for every blocked entry: one encoding only the host and the other encoding the full URL. Strictly speaking, we only need the full URL part. The old SQLite based code used the host key to SELECT and load all keys for a certain domain at once, but our new in-memory prefix-trie based approach has no such needs. However, as Justin Lebar alread touched upon in the previous blog, this does significantly increase the likelihood that we get a false positive. We now expect to have a false positive for every 7000 URLs visited. This will not cause us to block out any legitimate sites, as any positive hit in the local database is queried against the full, 256-bit hash at a remote server (hosted by Google, who provides the SafeBrowsing data).
This does mean we will increase the traffic to this remote server by a large factor. Scary as it may sound, some back-of-the-envelope estimates shows its not really that bad: say there are about 420M Firefox users, browsing for 8h/day. They load on average 1 URL per second. This means about 140M URL loads per second, causing about 20000 hash requests per second to Google. Google confirmed they can handle this with ease.
Now, there is still a problem when doing this: any collision will appear for all users on exactly the same URLs. This means that if you're unlucky enough to be a a site owner that has an URL that happens to collide, every visitor to your site will have a slightly slower browsing experience. Even worse, should you get linked in a popular forum, or be in the news, there will be a storm of false positives to the server all at once. We thought this to be problematic enough that we implemented a workaround: every user will generate a unique randomization key and re-randomize all his hashes with it. Collisions will happen on a different URL for every user, and consequently also be much better spread through time.
Eliminating chunk numbers
After some discussion, it turned out eliminating the chunk numbers isn't as easy as hoped. First of all, the observation in the previous blog posts that chunk expires only seem to happen when the chunks are in fact old, doesn't hold after observing the protocol for a longer time. It also happens very regularly that a chunk is added, deleted, and added back again, particularly in the malware list. In those cases, it is important to know which add chunk a delete is referring to, so it won't delete the later add. It would still be possible to deal with that if the server recalculates the prefixes to send for every client, but this is less efficient on the server side compared to the bigger, more static downloads that the server can point to now, and which are easily mirrored on Google's network.
Sub prefixes compress harder
In line with the previous paragraph, it happens that we receive sub prefixes for add prefixes we never received. These must be kept around until we receive an expire for them, as we can't know if the add they belong to is outdated or just not downloaded yet. Note also that we usually receive updates backwards, i.e. the most recent data will be sent to the clients first, as it's the one believed to be most relevant. Because sub prefixes contain both an add and a sub chunk, they are also bigger than add prefixes. This causes the eventual size of the database to be a bit more than the minimum guessed in the previous blog post, which more or less ignored sub prefixes entirely. If you peek in your profile, you can see that the goog-malware-shavar.sbstore will tend to be the biggest file: this is exactly due the many sub prefixes in the malware list.
It should be noted that these improvements are purely focused on the footprint of the feature. They will improve the resource usage of the browser, but they do not change the detection performance in any way.
NSS Labs Report
In what is somewhat of a funny coincidence, the same day I am writing this blog NSS Labs published a report "Did Google pull a fast one on Firefox and Safari users?". The main points of this report shouldn't be much news as I pointed out over half a year ago the discrepancy between Chrome, and Firefox and Safari in my previous blog post, as well the reason ("Performance differences between Firefox, Chrome and Safari").
I have two remarks to the report: one, as I've already pointed out in the past, false positive control is an important part of effective malware detection. Internet Explorer flags many malware sites, but it also flags legitimate sites, undermining the true effectiveness.
Secondly, the problem isn't so much that the "new" SafeBrowsing protocol is proprietary or non-documented; it's implemented in Chrome and Chromium is open source, so at the very worst we can go study that code to see how to implement it. The problem is that permission is required to use Google's SafeBrowsing servers, and