Thursday, May 24, 2012

History and bookmarks in Firefox for Android

The feature I have been working on now for quite some time is one I'd jokingly refer to as "a feature I hope most users will never need". It's the system that translates your browser history from XUL Fennec to Native Fennec. New users obviously don't need this, and I hope we get a lot of new users with the switch to the native UI, so there.

More seriously, for something that I initially described in a meeting as "doing a query in one DB and some inserts in another, how hard can it be?", this turned out to be a project that ended up sinking easily over a man-month of engineering time. Of course putting it like that in the meeting was just asking for problems, and I'm going to use this post to give some background into what happened, and explain some more things about our current History/bookmarks architecture and how databases work on Android.

The need for a new History & bookmarks database

First of all, you can ask yourself why we needed to change the format of browser history in the first place. In XUL Fennec, the history and bookmarks are managed by the Places system, the exact same as is powering desktop Firefox. It's a combination of SQLite with a C++ and JavaScript control layer.

The reason for rewriting the UI parts of Fennec in Java was to make it start up faster on Android, so that the user has something he can interact with instantly. Having bookmarks and history available early is quite important as that's what the user will most likely be interacting with, either through the AwesomeBar or the home screen that Native Fennec has. So we cannot wait for Gecko to load to access its Places system - history and bookmarks must be available immediately.

Secondly, Native Fennec now also includes Sync as a true Android service that works with the Android Accounts panel, background synchronization, etc. Having the history and bookmarks available natively in Java removes the need for Sync to load Gecko, which isn't something that you'd want a background service to do.

Lastly, storage on Android devices is often fairly limited. Places databases can get quite large for large if you've been using Firefox a long time and browse a lot, for example on my current desktop system the places.sqlite file is over 70Mb. This isn't desirable on Android, so we probably want to tune history expiration a bit differently.

Evolution of the new database

Our initial implementation of an Android history & bookmarks system was to simply use the built-in default database. This has some nice advantages such as automatic integration with the default Browser. If the user is transitioning from that to Firefox he will immediately have his history & bookmarks available to him. Feedback from the community came in that they weren't entirely comfortable with this. There was a feeling that although users trust Firefox with their private data, they don't necessarily trust Android or other applications that may access this data in the same way.

We started adding our own implementation, a local database (LocalDB) using Android's SQLite library, our own ContentProvider, and a wrapper layer (BrowserDB) that allowed most of the code to seamlessly switch between using the local database and Android's database.

As we started adding more features to our bookmarks system, and Sync implementation was underway, it became clear at some point that we couldn't reasonably support them well with the limited bookmark support of Android itself. For example, by default the Android implementation can only remember the last 250 URL's you visited, which makes the AwesomeBar's search sort-of useless. So eventually, support for Android's history & bookmarks database was phased out.

Remapping Places & Database constraints

Importing the Places database into our LocalDB format is mostly a straightforward remapping, although there are some minor differences. The Places database is very normalized, and for example each bookmarks and history entry refers to an entry in another table which contains the actual URL. LocalDB is a bit simpler in this regard as the URLs are contained within the entries themselves, saving some database work - which can help on slow devices. We made a small design mistake here in the Favicons table. They're big enough that we do not want them duplicated in the database, but Favicons can be the same for multiple URLs. We caught this issue a bit too late and as a result it will not be fixed in the first release - see bug 717428. You can occasionally see constraint violation warnings in the Android logs if this problem occurs.

Another small complication in transferring bookmarks is that they form a folder (tree) structure. We added some (SQL key) constraints in the new LocalDB to ensure no folders get orphaned even if we have bugs in our first versions, but this means that we must reconstruct the tree structure in the correct order when remapping the bookmarks data. A simple breadth-first algorithm makes multiple passes over the bookmarks data until we reach the leaves. Any bookmarks left over were orphaned and are not imported.

Places "tags" are stored in a fairly complicated manner in Places (using a multitude of bookmark folders) and due to this they are not supported in Native Fennec, and likely won't be for a while, so there is no need to migrate them.

Some fragmentation pains

During the development, at some point bug reports started coming in from some users that were migrating to Native Fennec, didn't get any bookmarks & history imported, and got an error "database is corrupted". After investigation, it turns out this was problem of combined SQLite and Android fragmentation. Firefox XUL (and Desktop) ships with the most recent SQLite revision, which means that it is currently using Write-Ahead-Logging for database operations.

Various Android devices come with various versions of SQLite, and only since (approximately) Honeycomb are the SQLite's new enough to understand the new database format that WAL requires. So older Android devices couldn't correctly read Firefox's SQLite databases with their built-in SQLite support.

We ended up writing a bridge driver that uses JNI to talk to Firefox's SQLite library, and provides an interface compatible with the Android SQLite one. This turned out to be useful for a number of other things too, such as accessing the original Passwords and Form History database from Gecko. In general, this kind of experiences made the many work spent on this feature worthwhile - quite a bit of the work was usable in other parts of the project as well.

Another small fragmentation issue we ran into is that foreign key constraints are not supported on some older Android devices. Because we basically only use these for error checking, it was not a large problem.

Performance aspects

The migration initially ran in the background on the first run, but it turns out this leads to horrible performance as the user tries to interact for the first time while there is heavy disk I/O and processing happening at the same time. Because that kind of nullifies the message we were trying to send that Native Fennec is a lot faster, we opted to show a modal screen instead that informs the user of what is happening. Of  course, popping up a loading screen doesn't exactly convey a message of speediness either, but we should be fine as long as the user only sees it once and it doesn't take too long.

The latter was a bit of a problem though. Initial implementations ran at a speed of a few bookmarks or history entries per second. It's not unusual for Firefox users to have hundreds of bookmarks and hundreds of thousands of history entries. Android provides an API to allow ContentProviders to support database transactions, and implementing support for those on both sides sped up our migration speed to about 50-100 records per second. This was another example of the work on this little-seen feature providing benefits elsewhere, as Sync is now also using the faster API.

There are some interesting interactions with the batch API in Android if part of the transaction fails. Basically, it is implementation-defined what to do if an operation fails for some operations (applyBatch), and the API seems to be designed to really allow both options (continue or abort) to work. For our purposes, doing as much as possible was the better option, but this turns out to have some problems because the API to tell SQLite what to do in such a case isn't available in Android <= 2.2, which we wanted to support. This lead to some "interesting" code that may make your eyes bleed.

Sync integration 

Because that's still not fast enough to migrate a users entire history on the first run, we also made an API that allows Sync to interact with the migration feature. The first run will migrate about 2000 entries and the entire bookmarks table. For users without Sync, this is all that they'll get, but it is enough to make the AwesomeBar quite usable and covers several weeks of browsing even for moderately heavy users.

If the users have Sync set up, the idea was that Sync will continue the migration from the database before starting to use the network. Feedback from the beta made us reconsider the previous approach: the problem of delaying Sync until the Places database has been entirely migrated is that the user will not get his more recent History updates until his updates from potentially weeks or months ago have been migrated. This leads to a suboptimal experience, and it's something we're going to try to address for the release.

The XUL to Native migration includes a feature that tries to migrate your Sync credentials automatically if you have them set up in Native Fennec. Unfortunately this had a bug in the very first beta release, so if you were one of the earliest adopters this will not have worked for you. The issue was fixed in the second beta release.

Future work 

Despite the first beta being out, work on this crucial feature is far from finished. Aside from the points already mentioned above, there are a few more notable areas that we plan to get addressed before the release ships:

  • After support for the Android bookmarks & history database was dropped, users who were previously using the built-in browser have no way to import their bookmarks and history into Fennec right now. We're currently working on adding this feature as a first priority. Eventually we will add export functionality as well (though we'd rather make Fennec so awesome that nobody ever wants to use the built-in browser ever again!).
  • There is currently no expiration in the history database. So although the format is a bit more compact than Places, it could theoretically grow unbounded if the user visits a lot of sites on his phone and keeps on using Firefox. The algorithms for this that Places uses are fairly complicated, so we need to study which parts are meaningful for a mobile device, and if we want to tune it differently for typical phone or tablet use scenarios.

Saturday, April 14, 2012

Java wat

I'm sure quite a few you have seen the "wat" talk exposing some funny behaviors of Ruby and JavaScript. Working in the industrial/enterprise/boring Java language most of the time for Native Fennec's development, we've been missing out on such interesting moments. Nevertheless, I present you this gem, from bug 745384:
  • (null instanceof <anything>)  = false
    This is guaranteed by the language spec:
     "At run time, the result of the instanceof operator is true if the value of the RelationalExpression is not null and the reference could be cast (§15.16) to the ReferenceType without raising a ClassCastException. Otherwise the result is false."
  • IsInstaceOf(NULL, <anything>) = JNI_TRUE
    This too is guaranteed by the language spec:
     "Returns JNI_TRUE if obj can be cast to clazz; otherwise, returns JNI_FALSE. A NULL object can be cast to any class."
I don't know what the people at Sun were thinking when they designed it this way, but right now I'm all:


For those wondering, these are the relevant Java bugs:
Note the "Closed, Not a Defect, Fixed in spec." It's not a bug, it's a feature, :-)

Monday, February 6, 2012

New SafeBrowsing backend

Today, the SafeBrowsing rewrite me and Dave Camp have been working on for several months finally landed in the Mozilla Nightlies, and it should be part of the Firefox 13 release, narrowly having missed Firefox 12. It reduces the disk footprint of our anti-phishing and malware protection from about 40-50Mb to 5-6Mb, changes all related I/O from pure random access to a single serial read, and refactors a single 4000+ line C++ file into a bunch of modules. An earlier part of this work landed in Firefox 9 and reduced the memory footprint from potentially up to 40-100M to 1.2M, as well as removing the need to do some random I/O on every page load.
Aside from the performance gains, the reduced footprint is an essential step to enable us to extend our SafeBrowsing protection to Mobile devices, which is why we undertook this in the first place.

It was an interesting assignment, and being my first real project for Mozilla, a bit more involved than we thought at first. I blogged in July last year about our plans for this feature. Some of the optimizations we had in mind didn't work out, while others did end up being implemented.

Eliminating the host keys

One of the things touched upon in the previous blog post was that we used to store 2 32-bit hash prefixes for every blocked entry: one encoding only the host and the other encoding the full URL. Strictly speaking, we only need the full URL part. The old SQLite based code used the host key to SELECT and load all keys for a certain domain at once, but our new in-memory prefix-trie based approach has no such needs. However, as Justin Lebar alread touched upon in the previous blog, this does significantly increase the likelihood that we get a false positive. We now expect to have a false positive for every 7000 URLs visited. This will not cause us to block out any legitimate sites, as any positive hit in the local database is queried against the full, 256-bit hash at a remote server (hosted by Google, who provides the SafeBrowsing data).

This does mean we will increase the traffic to this remote server by a large factor. Scary as it may sound, some back-of-the-envelope estimates shows its not really that bad: say there are about 420M Firefox users, browsing for 8h/day. They load on average 1 URL per second. This means about 140M URL loads per second, causing about 20000 hash requests per second to Google. Google confirmed they can handle this with ease.

Now, there is still a problem when doing this: any collision will appear for all users on exactly the same URLs. This means that if you're unlucky enough to be a a site owner that has an URL that happens to collide, every visitor to your site will have a slightly slower browsing experience. Even worse, should you get linked in a popular forum, or be in the news, there will be a storm of false positives to the server all at once. We thought this to be problematic enough that we implemented a workaround: every user will generate a unique randomization key and re-randomize all his hashes with it. Collisions will happen on a different URL for every user, and consequently also be much better spread through time.

Eliminating chunk numbers

After some discussion, it turned out eliminating the chunk numbers isn't as easy as hoped. First of all, the observation in the previous blog posts that chunk expires only seem to happen when the chunks are in fact old, doesn't hold after observing the protocol for a longer time. It also happens very regularly that a chunk is added, deleted, and added back again, particularly in the malware list. In those cases, it is important to know which add chunk a delete is referring to, so it won't delete the later add. It would still be possible to deal with that if the server recalculates the prefixes to send for every client, but this is less efficient on the server side compared to the bigger, more static downloads that the server can point to now, and which are easily mirrored on Google's network.

Sub prefixes compress harder

In line with the previous paragraph, it happens that we receive sub prefixes for add prefixes we never received. These must be kept around until we receive an expire for them, as we can't know if the add they belong to is outdated or just not downloaded yet. Note also that we usually receive updates backwards, i.e. the most recent data will be sent to the clients first, as it's the one believed to be most relevant. Because sub prefixes contain both an add and a sub chunk, they are also bigger than add prefixes. This causes the eventual size of the database to be a bit more than the minimum guessed in the previous blog post, which more or less ignored sub prefixes entirely. If you peek in your profile, you can see that the goog-malware-shavar.sbstore will tend to be the biggest file: this is exactly due the many sub prefixes in the malware list.

Detection performance

It should be noted that these improvements are purely focused on the footprint of the feature. They will improve the resource usage of the browser, but they do not change the detection performance in any way.

NSS Labs Report

In what is somewhat of a funny coincidence, the same day I am writing this blog NSS Labs published a report "Did Google pull a fast one on Firefox and Safari users?". The main points of this report shouldn't be much news as I pointed out over half a year ago the discrepancy between Chrome, and Firefox and Safari in my previous blog post, as well the reason ("Performance differences between Firefox, Chrome and Safari").

I have two remarks to the report: one, as I've already pointed out in the past, false positive control is an important part of effective malware detection. Internet Explorer flags many malware sites, but it also flags legitimate sites, undermining the true effectiveness.

Secondly, the problem isn't so much that the "new" SafeBrowsing protocol is proprietary or non-documented; it's implemented in Chrome and Chromium is open source, so at the very worst we can go study that code to see how to implement it. The problem is that permission is required to use Google's SafeBrowsing servers, and Firefox does NOT have permission to use the download protection list. Edit: Please see the statement from Ian Fette below.

Wednesday, August 17, 2011

A note on malware detection performance

Yesterday, most news outlets published stories about a study done by NSS Labs showing that Internet Explorer 9 provides overwhelmingly better protection against phishing and malware sites than competing browsers, including Firefox. Specifically, the claim is that Internet Explorer detects 99.2% of all malicious internet content, compared to a measly 7.6% for Firefox.

This isn't the first such study conducted. Over 2 years ago, at the release of Internet Explorer 8, a similar study was done with similar but less extreme results. That study was harshly criticized for having a number of flaws. There is no indication that any of criticism was addressed in the new study.

Although many in the community have yet again pointed out the problematic aspects of the study, and hence properly put the results under scrunty, I believe some issues are worthy of a more detailed explanation.

Accuracy of the reported results

The main issue with the study is that it does not consider the tradeoff between false positives, namely, reporting malware when a download is in fact safe, and false negatives, namely, allowing the user to visit a dangerous site or allowing him or her to download malware. Due to the way the test was conducted, false positives were mostly neglected. In other words, the browser that simply gave most warnings, regardless of accuracy, had a large advantage compared to its competitors.

The SmartScreen service in Internet Explorer will give the user a warning as soon as he or she tries to download any uncommonly downloaded piece of software. This is done regardless of whether the download is malware or not. It should be obvious that by crying foul in every uncertain situation, you will cry foul in a dangerous situation, too, leading to a very high score in this test.

Unfortunately, giving a large number of false positives severely diminishes the value of the warnings. If you claim "the sky is falling" at every opportunity, users will start to ignore the warnings, and the effectiveness of the protection drops from 99.2% to 0%, which is infinitely worse than what other browsers offer. The situation should be familiar to Microsoft, who faced similar problems with the initial versions of their User Account Control feature in Windows Vista. It wasn't until the number of false positives dropped that the protection really became effective.

The report's tackling of false positives is limited to the following paragraph.
In addition, NSS maintains a collection of ‘clean URLs’ which includes such sites as yahoo, Amazon, Microsoft, Google, NSS Labs, major banks, etc. Periodically clean URLs were run through the system to verify browsers were not over-blocking.
Aside from the lack of real data, it's astonishing how biased this sample is. As if any major browser vendor, or provider of malware protection, would put out an update and fail to notice that he is blocking his users from visiting a major website! Of course, extending the sample even a little bit to, for example, homepages of random people, would have immediately shown Internet Explorer to produce false positives.

The problems of trading false positives for false negatives and evaluating differing tradeoffs are not new - they're a known topic for anti-spam filters, which must deal with similar tradeoffs when being compared against each other. One representation that is often used in anti-spam research papers is to present the results in the form of ROC curves. That certainly would have been more informative than a single percentage.

It's quite possible that Internet Explorer does in fact offer the best malware and phishing protection of the popular browsers. Microsoft has worked hard to improve in their security reputation, clearly focused on this area, and I'm sure they have smart engineers capable of writing good software. However, studies like these do their work no justice, don't help users make informed choices, and don't help us identify the real risks users are exposed to. This is a missed opportunity.

User privacy in collaborative filtering

The SmartScreen Filter that Internet Explorer uses is a collaborative filter. By it's nature, such a filter is dependent on gathering information from its users. The exact information gathered is described in their privacy policy. I've reproduced some relevant paragraphs below.
If you opt in to SmartScreen Filter, it first checks the address of the webpage you are visiting against a list of high-traffic webpage addresses stored on your computer that are believed by Microsoft to be legitimate. Addresses that are not on the local list and the addresses of files you are downloading will be sent to Microsoft and checked against a frequently updated list of webpages and downloads that have been reported to Microsoft as unsafe or suspicious.

When you use SmartScreen Filter to check websites automatically or manually, the address of the website you are visiting will be sent to Microsoft, together with standard computer information and the SmartScreen Filter version number. To help protect your privacy, the information sent to Microsoft is encrypted. Information that may be associated with the address, such as search terms or data you entered in forms might be included.

Some information about files that you download from the web, such as name and file path, may also be sent to Microsoft. Some website addresses that are sent to Microsoft may be stored along with additional information, including web browser version, operating system version, SmartScreen Filter version, the browser language, the referring webpage, and information about whether Compatibility View was enabled for the website.
While the amount and nature of the information sent to Microsoft may indeed be necessary to achieve the level of protection SmartScreen is claimed to give, it obviously comes at a severe cost of user privacy. 

I don't believe most people at Mozilla think it's reasonable to collect the information above, nor would I expect most of our user-base to feel comfortable with it. For this reason, its unlikely we would put up an identical service, let alone enable it by default.
The malware protection included in Firefox uses a cookie to provide Quality-Of-Service information wrt. updates to the providers of malware lists (currently Google). Although we did feel that that was a reasonable tradeoff, some users nevertheless object to it, and some effort is underway to remove even this.

That being said, our browser is extensible and has a wide variety of third-party add-ons, including some that extend it to give similar functionality, if the user feels the privacy versus security tradeoff is acceptable. A popular one seems to be Web-Of-Trust. Feel free to check out our security & privacy add-ons site to see additional protection available for Firefox.

Performance differences between Firefox, Chrome and Safari

One interesting thing that did come out of this study is that Chrome offers slightly better protection than Firefox or Safari. This is interesting, because all three browsers use exactly the same malware and phishing protection: the Google SafeBrowsing API. Firefox has some minor tweaks compared to Chrome to improve user privacy, but those should not have worsened the results, so it was expected that they would score similarly.

I spent some time looking at this, and what happened is that Google has been enhancing their SafeBrowsing system to detect malware downloads, tested this protection in Chrome and recently included it in the stable releases. Because the public documentation for the API hasn't been updated, Firefox and Safari have so far have not implemented this extension. 

We are currently discussing with Google on rectifying this, so Firefox will probably soon include the improved protection as well.

Concluding remarks

Being exposed to malicious content doesn't mean being infected by it. We do what we can to keep the browser secure and updated, and to inform the user when his plugins are out of date, and potentially exposed. We have made improvements that allow a user to more easily spot when he or she is being phished. Detecting malware is just one protection. Making it ineffective is another.

Regardless of what we think of the Internet Explorer results, we're still left with a claimed 7.6% detection rate for Firefox out of the box. This means that our current default detection is largely ineffective, and users have much better odds to be exposed to malicious content than that they have of being blocked by us. Even if the study numbers are inaccurate, this order-of-magnitude result probably does hold. I doubt we currently detect and block more than 50% of the malware out there.

That is not a result to be proud of, and we should improve it if possible.

Friday, July 29, 2011

Bringing SafeBrowsing to mobile devices

Firefox Mobile is in most respects the exact same browser as you get with Firefox on the desktop. Nevertheless, to make it usable on mobile devices, some small compromises have been made. One of these is that SafeBrowsing is currently not enabled in Mobile builds. We are now pretty close to fixing this, but there are some interesting issues remaining.

What is SafeBrowsing?
    SafeBrowsing is a Google-designed protocol that allows Firefox and Chrome to alert the user whenever (s)he visits a webpage that is either phishing for private information or contains malware. It works by incrementally downloading a compressed and encrypted list of "bad" URLs and checking every URL the user wants to visit against the locally stored list. If we find a potential match, we query the SafeBrowsing server directly to verify if it's a true match, and to see if it is still valid. If it is, we pop up a warning message and block the page from loading.
    Mozilla has added some additional features in Firefox to ensure our users' privacy when this feature is enabled. For example, when the Google server is queried to verify an URL, Firefox will send several random requests along with the true one, masking the exact website being visited.

    Why is/was it not enabled in Firefox Mobile?

    SafeBrowsing was initially prototyped in Google Toolbar for Firefox, and eventually contributed by Google engineers into Firefox itself just before version 3.0. The implementation in Firefox, while working, isn't terribly efficient. There are several reasons for that, including the deadline to get it shipped in Firefox 3.0 and the need to make protocol changes to be able to handle the entire Firefox user-base without putting undue stress on the Google-provided servers.
    One of the shortcuts made was to use SQLite to store the URL database. While SQLite is a fine database, it's not the most terribly efficient way to store a highly compressed dataset. Furthermore, because it needs to be probed on each and every URL loaded, and the loads are essentially random, it requires caching to deliver good performance. When processing an update, the behavior can get really bad.
    The net result is that the current SafeBrowsing implementation can use up to 50Mb of disk space, and in some circumstances, the same amount of memory. Gobbling up such an amount of resources would not be acceptable to people using a mobile device, so we had to disable it. 
    Because downloading and constructing the list happens in the background, the memory usage of Firefox can go up even when you are not visiting any sites. This causes many desktop users to mistake it for a memory leak, and it hence has been flagged as a priority in the MemShrink efforts, too.

    How is this being solved?

    The URL list is distributed as a set of SHA-256 hashes truncated to 32-bits (hereafter called prefixes), generated by passing the real URLs through a canonicalization algorithm and hashing the result. Each bunch of prefixes has an associated chunk-id, which is used during updates, and a list-id, which tells us whether it is malware or a phishing attempt. The real problem is how to store these 32-bit values efficiently and still be able to quickly search through them. The current SafeBrowsing database contains about 600K entries. Each entry contains both a 32-bit hash of the hostname and the prefix hash. We can drop the hostname hashes, as they are only a small performance optimization (*). The raw storage required, if we further assume the chunk-id and the list-id can be merged into a single 32-bit number, is about 5Mb. Simply sorting the list sorted by prefix allows O(log(n)) look-up and O(n) processing on updates, which is fast enough as long as we keep the list in memory.

    While 5Mb is an order of magnitude improvement over the 50Mb we have now, it is interesting to try to improve it further. Mobile devices have a limited amount of storage, and particularly for devices that have no option to insert extra memory cards, space comes at a premium. Firefox consumes about 25Mb of flash storage (which already generates a lot of feedback that it is "way bloated"), and increasing that by 20% just for phishing protection seems excessive. 

    Current improvements

    There are 2 interesting data-structures that allow further compression of this prefix set while still allowing efficient lookup: Bloom filters and Prefix Tries. Note that the prefixes themselves, being truncated SHA-256 hashes, are essentially random numbers and will hence resist traditional compression techniques.

    Bloom filters work by passing each prefix through a series of hashes, and putting a mark in a bit array for each hash. Because the marks might collide, it is a probabilistic algorithm with potentially false positives. On the upside, you can trade the storage requirements of a Bloom filter against the probability of false positives. We considered using a Bloom filter for our SafeBrowsing database, but the mere possibility of false positives requires a secondary storage to correctly handle updates and to prevent leaking privacy information when it's not strictly necessary. Google found this an acceptable compromise for desktop usage and chose this solution in their re-implementation of SafeBrowsing in Chrome, backing up a 1.9Mb Bloom filter in memory with a larger raw store on disk.

    Prefix Tries work by observing that, when sorted, many prefixes will have the leading part in common. By storing only the differences between those prefixes, good compression can be achieved. Given that the list of prefixes is sorted, and we occasionally code a full value to allow restarting the decompression, fast lookup is possible by doing a binary search in the uncompressed values and then decompressing the deltas. On the SafeBrowsing prefixes, about 50% compression is achievable while still maintaining fast lookup. Prefix Tries do not suffer from false positives, and do not need  backup storage.

    A preliminary implementation of Prefix Tries that allows us to reduce the memory requirements for desktop Firefox from a worst case of 50Mb to about 2.1Mb is now ready in bug 669410 and is expected to be in the Nightly builds soon. We will probably reduce this to 1.2Mb after verifying the performance impact of dropping the (*) optimization mentioned above. Work is ongoing to add (compressed) storage of chunk-id to this, removing SQLite entirely and reducing the disk space requirement to about 2.4Mb. This should be good enough to enable SafeBrowsing on mobile.

    Future optimization

    While this is a nice improvement, we do want "smaller! faster!" whenever possible, so it's interesting to study if we can go beyond this. Because the prefixes are random values, I personally don't believe much further gain can be had there (though I'd be thrilled to be proven wrong!). Perhaps if there were some way to obtain the original URLs, yes, but I understood from Google that they cannot do this as they license parts of the URL list from a third party.

    The chunk information is an interesting candidate. The chunk numbers come from a more or less consecutive range, many chunk numbers reappear multiple times, and there are currently only about 16K distinct chunks in use. If we could sort the data by chunk-id and use delta compression on them, they could probably be stored in less than a bit per entry. Unfortunately, sorting the data would break the prefix trie compression and we end up gaining 2 bytes/entry and losing 2 bytes/entry at the same time. Sorting by chunk makes the lookups slower, so this isn't useful.

    Now I have a crazy idea: what if we didn't store the chunk-id information at all? According to my reading of the specification, chunk-id information is only really needed in 2 specific cases:
    1. When the client connects, he has to tell the server which chunk ranges he currently has. There is no need to actually store the chunk-id for every prefix to know this. We can simply track how much prefixes are in every chunk, and increment or decrement this number on every chunk add or delete.
    2. The protocol allows the server to send an adddel or subdel command, which requires the client to expire an entire chunk. Note that normal add or sub commands don't require us to know the chunk-id, as they include the prefixes on which they act as well.
    So what do we do if we get an adddel or subdel?

    We drop the entire database and start downloading again.

    I know you must be thinking I am crazy and wasteful at this point, but observing how Google's server actually behaves, it might very well work perfectly. The key to realize is that expiring entire chunks is not a normal occurrence. What is far more typical is that individual entries in a chunk are expired piece by piece in due time, until the chunk is empty. The only way I seem to be able to convince the server to expire entire chunks, is to purposely claim I have ancient chunks in my database. In normal usage, meaning regular connection to the internet, those expires don't seem to occur. And if we do have a database with ancient chunks, dropping and re-downloading it entirely isn't going to be that wasteful anyhow.

    I believe this behavior would be perfectly fine to do and still comply with the current protocol, but obviously something like this shouldn't be implemented until the Google guys can confirm they don't think we're insane and that the server does indeed behave as described (or can be patched to do so).

    One optimization that can be applied server-side is to estimate whether it would be cheaper to send an adddel/subdel, and hence drop the database, or to send the sub chunks for all entries it wants to remove. This ensures that the absolute worst-case behavior is in no case more than a doubling of bandwidth, with the typical case being no change in required bandwidth at all. Clients who need this optimization can be detected by the client-id and client-ver that the protocol already requires.

    If we're willing to define a new protocol, the chunk-id numbers can go entirely, and be replaced by adding a sequential number to every update. In that case, the client only has to tell the server the first and last update it has ever seen, and server nor client need to transmit or store chunk-ids at all.