Friday, July 29, 2011

Bringing SafeBrowsing to mobile devices

Firefox Mobile is in most respects the exact same browser as you get with Firefox on the desktop. Nevertheless, to make it usable on mobile devices, some small compromises have been made. One of these is that SafeBrowsing is currently not enabled in Mobile builds. We are now pretty close to fixing this, but there are some interesting issues remaining.

What is SafeBrowsing?
    SafeBrowsing is a Google-designed protocol that allows Firefox and Chrome to alert the user whenever (s)he visits a webpage that either phishes for private information or contains malware. It works by incrementally downloading a compressed and encrypted list of "bad" URLs and checking every URL the user wants to visit against the locally stored list. If we find a potential match, we query the SafeBrowsing server directly to verify whether it's a true match, and whether it is still valid. If it is, we pop up a warning message and block the page from loading.
    Mozilla has added some additional features in Firefox to ensure our users' privacy when this feature is enabled. For example, when the Google server is queried to verify a URL, Firefox will send several random requests along with the true one, masking the exact website being visited.
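The lookup flow described above can be sketched as follows. This is a minimal illustration; the function names and the decoy count are hypothetical, not Firefox's actual internals, and the real canonicalization step is omitted:

```python
import hashlib
import os

def full_hash(url):
    """SHA-256 of the (canonicalized) URL; canonicalization omitted here."""
    return hashlib.sha256(url.encode()).digest()

def truncated_hash(url):
    """First 4 bytes (32 bits) of the full hash: the 'prefix'."""
    return full_hash(url)[:4]

def random_decoy_prefixes(count):
    """Random prefixes sent alongside the real one to mask the query."""
    return [os.urandom(4) for _ in range(count)]

def check_url(url, local_prefixes, query_server):
    """Return True if the URL matches the blocklist and should be blocked."""
    prefix = truncated_hash(url)
    if prefix not in local_prefixes:     # fast local check, no network traffic
        return False
    # Potential match: ask the server for the full hashes, mixing the real
    # request with random decoys so the visited site is not revealed.
    candidates = [prefix] + random_decoy_prefixes(3)
    return full_hash(url) in query_server(candidates)
```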

    Why is/was it not enabled in Firefox Mobile?

    SafeBrowsing was initially prototyped in Google Toolbar for Firefox, and eventually contributed by Google engineers into Firefox itself just before version 3.0. The implementation in Firefox, while working, isn't terribly efficient. There are several reasons for that, including the deadline to get it shipped in Firefox 3.0 and the need to make protocol changes to be able to handle the entire Firefox user-base without putting undue stress on the Google-provided servers.
    One of the shortcuts taken was to use SQLite to store the URL database. While SQLite is a fine database, it's not a terribly efficient way to store a highly compressed dataset. Furthermore, because it needs to be probed on each and every URL loaded, and the loads are essentially random, it requires caching to deliver good performance. When processing an update, this behavior can get really bad.
    The net result is that the current SafeBrowsing implementation can use up to 50Mb of disk space, and in some circumstances, the same amount of memory. Gobbling up such an amount of resources would not be acceptable to people using a mobile device, so we had to disable it. 
    Because downloading and constructing the list happens in the background, the memory usage of Firefox can go up even when you are not visiting any sites. This causes many desktop users to mistake it for a memory leak, and it hence has been flagged as a priority in the MemShrink efforts, too.

    How is this being solved?

    The URL list is distributed as a set of SHA-256 hashes truncated to 32 bits (hereafter called prefixes), generated by passing the real URLs through a canonicalization algorithm and hashing the result. Each batch of prefixes has an associated chunk-id, which is used during updates, and a list-id, which tells us whether it is malware or a phishing attempt. The real problem is how to store these 32-bit values efficiently and still be able to quickly search through them. The current SafeBrowsing database contains about 600K entries. Each entry contains both a 32-bit hash of the hostname and the prefix hash. We can drop the hostname hashes, as they are only a small performance optimization (*). The raw storage required, if we further assume the chunk-id and the list-id can be merged into a single 32-bit number, is about 5Mb. Simply keeping the list sorted by prefix allows O(log(n)) look-up and O(n) processing on updates, which is fast enough as long as we keep the list in memory.
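The sorted in-memory layout with O(log(n)) lookup can be sketched like this (illustrative Python, not the actual Firefox code; the real store packs the 32-bit values far more tightly):

```python
import bisect

def build_store(prefixes):
    """Keep the 32-bit prefixes as a single sorted list."""
    return sorted(prefixes)

def store_contains(store, prefix):
    """Binary search: O(log n) membership test against the sorted list."""
    i = bisect.bisect_left(store, prefix)
    return i < len(store) and store[i] == prefix
```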

    While 5Mb is an order of magnitude improvement over the 50Mb we have now, it is interesting to try to improve it further. Mobile devices have a limited amount of storage, and particularly for devices that have no option to insert extra memory cards, space comes at a premium. Firefox consumes about 25Mb of flash storage (which already generates a lot of feedback that it is "way bloated"), and increasing that by 20% just for phishing protection seems excessive. 

    Current improvements

    There are two interesting data structures that allow further compression of this prefix set while still allowing efficient lookup: Bloom filters and Prefix Tries. Note that the prefixes themselves, being truncated SHA-256 hashes, are essentially random numbers and will hence resist traditional compression techniques.

    Bloom filters work by passing each prefix through a series of hashes, and putting a mark in a bit array for each hash. Because the marks might collide, it is a probabilistic algorithm with potential false positives. On the upside, you can trade the storage requirements of a Bloom filter against the probability of false positives. We considered using a Bloom filter for our SafeBrowsing database, but the mere possibility of false positives requires secondary storage to correctly handle updates and to prevent leaking private information when it's not strictly necessary. Google found this an acceptable compromise for desktop usage and chose this solution in their re-implementation of SafeBrowsing in Chrome, backing up a 1.9Mb in-memory Bloom filter with a larger raw store on disk.
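For illustration, a minimal Bloom filter looks like the sketch below. This is a generic textbook version, not Chrome's implementation; the hashing scheme and sizes are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash-derived marks per key in a bit array.
    Membership tests can yield false positives, but never false negatives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting the hash with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

Sizing the bit array larger (relative to the number of keys) lowers the false-positive rate, which is exactly the storage-vs-accuracy trade-off described above.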

    Prefix Tries work by observing that, when sorted, many prefixes will have their leading part in common. By storing only the differences between consecutive prefixes, good compression can be achieved. Given that the list of prefixes is sorted, and we occasionally code a full value to allow restarting the decompression, fast lookup is possible by doing a binary search over the uncompressed values and then decompressing the deltas. On the SafeBrowsing prefixes, about 50% compression is achievable while still maintaining fast lookup. Prefix Tries do not suffer from false positives, and do not need backup storage.
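A toy version of this delta-compression scheme, with periodic full restart values and a binary search over them, might look as follows (illustrative only; the actual implementation in bug 669410 packs the deltas much more compactly, and the restart interval here is arbitrary):

```python
import bisect

def compress(sorted_prefixes, restart_every=100):
    """Split a sorted prefix list into full restart values plus delta runs."""
    index, deltas = [], []
    for i, p in enumerate(sorted_prefixes):
        if i % restart_every == 0:
            index.append(p)              # full value: decompression restart
            deltas.append([])
        else:
            deltas[-1].append(p - prev)  # small delta from the predecessor
        prev = p
    return index, deltas

def compressed_contains(index, deltas, prefix):
    """Binary-search the restart values, then decompress one delta run."""
    i = bisect.bisect_right(index, prefix) - 1
    if i < 0:
        return False
    value = index[i]
    if value == prefix:
        return True
    for d in deltas[i]:
        value += d
        if value == prefix:
            return True
        if value > prefix:
            return False
    return False
```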

    A preliminary implementation of Prefix Tries that allows us to reduce the memory requirements for desktop Firefox from a worst case of 50Mb to about 2.1Mb is now ready in bug 669410 and is expected to be in the Nightly builds soon. We will probably reduce this to 1.2Mb after verifying the performance impact of dropping the (*) optimization mentioned above. Work is ongoing to add (compressed) storage of chunk-id to this, removing SQLite entirely and reducing the disk space requirement to about 2.4Mb. This should be good enough to enable SafeBrowsing on mobile.

    Future optimization

    While this is a nice improvement, we do want "smaller! faster!" whenever possible, so it's interesting to study whether we can go beyond this. Because the prefixes are random values, I personally don't believe much further gain can be had there (though I'd be thrilled to be proven wrong!). Perhaps if there were some way to obtain the original URLs, yes, but I understood from Google that they cannot do this, as they license parts of the URL list from a third party.

    The chunk information is an interesting candidate. The chunk numbers come from a more or less consecutive range, many chunk numbers reappear multiple times, and there are currently only about 16K distinct chunks in use. If we could sort the data by chunk-id and use delta compression on them, they could probably be stored in less than a bit per entry. Unfortunately, sorting the data by chunk-id would break the prefix trie compression, so we would gain 2 bytes/entry and lose 2 bytes/entry at the same time. Sorting by chunk also makes the lookups slower, so this isn't useful.

    Now I have a crazy idea: what if we didn't store the chunk-id information at all? According to my reading of the specification, chunk-id information is only really needed in 2 specific cases:
    1. When the client connects, it has to tell the server which chunk ranges it currently has. There is no need to actually store the chunk-id for every prefix to know this: we can simply track how many prefixes are in each chunk, and increment or decrement that number on every chunk add or delete.
    2. The protocol allows the server to send an adddel or subdel command, which requires the client to expire an entire chunk. Note that normal add or sub commands don't require us to know the chunk-id, as they include the prefixes on which they act as well.
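Case 1 above only needs per-chunk counts, which could be tracked along these lines (a hypothetical helper to illustrate the bookkeeping, not actual Firefox code):

```python
class ChunkTracker:
    """Track chunk ranges without storing a chunk-id per prefix:
    keep only a count of live prefixes for each chunk."""

    def __init__(self):
        self.counts = {}                   # chunk-id -> live prefix count

    def add(self, chunk_id, n=1):
        # A chunk 'add' contributes n prefixes to this chunk.
        self.counts[chunk_id] = self.counts.get(chunk_id, 0) + n

    def sub(self, chunk_id, n=1):
        # Individual expiry removes prefixes; drop the chunk once empty.
        self.counts[chunk_id] -= n
        if self.counts[chunk_id] == 0:
            del self.counts[chunk_id]

    def ranges(self):
        """Chunk ranges to report to the server, e.g. [(1, 2), (4, 4)]."""
        out = []
        for c in sorted(self.counts):
            if out and c == out[-1][1] + 1:
                out[-1] = (out[-1][0], c)  # extend the current run
            else:
                out.append((c, c))         # start a new run
        return out
```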
    So what do we do if we get an adddel or subdel?

    We drop the entire database and start downloading again.

    I know you must be thinking I am crazy and wasteful at this point, but observing how Google's server actually behaves, it might very well work perfectly. The key realization is that expiring entire chunks is not a normal occurrence. What is far more typical is that individual entries in a chunk are expired piece by piece in due time, until the chunk is empty. The only way I seem to be able to convince the server to expire entire chunks is to purposely claim I have ancient chunks in my database. In normal usage, meaning a regular connection to the internet, those expirations don't seem to occur. And if we do have a database with ancient chunks, dropping and re-downloading it entirely isn't going to be that wasteful anyhow.

    I believe this behavior would be perfectly fine to do and still comply with the current protocol, but obviously something like this shouldn't be implemented until the Google guys can confirm they don't think we're insane and that the server does indeed behave as described (or can be patched to do so).

    One optimization that can be applied server-side is to estimate whether it would be cheaper to send an adddel/subdel, and hence drop the database, or to send sub chunks for all the entries it wants to remove. This ensures that the absolute worst case is no more than a doubling of bandwidth, with the typical case being no change in required bandwidth at all. Clients that need this optimization can be detected by the client-id and client-ver that the protocol already requires.

    If we're willing to define a new protocol, the chunk-id numbers can go entirely, replaced by a sequential number attached to every update. In that case, the client only has to tell the server the first and last update it has ever seen, and neither server nor client needs to transmit or store chunk-ids at all.

    3 comments:

    1. If you drop the 32-bit hostname hash, then would you be matching only on the 32-bit prefixes? If so, I think you're setting yourself up for false-positives.

      Consider: With 600,000 hashes in the database, the chance of some other website colliding on a 32-bit hash is 600,000 * 2^(-32) = 0.00014. So you'd need to view .5 / 0.00014 = 3,600 pages to have a 50% chance of a false-positive. That's not a lot of pages.

      Using (truncated) hashes is itself probabilistic. So if you tuned the bloom filter so the chance of a false-positive is much lower than the chance of a hash collision, it might be sensible.

    2. Actually, that calculation is wrong. It takes about 5000 webpages to see a collision with probability 0.5. The proof is left as an exercise. :)

    3. Explanations are here:

      https://bugzilla.mozilla.org/show_bug.cgi?id=669410
      https://bugzilla.mozilla.org/show_bug.cgi?id=669407

      Summarizing: the backing store must support deletions, so false positives in it are a show-stopper (and a separate issue from collisions in the 32-bit hashes) for the permanent storage. A Bloom filter does not outperform a Prefix Trie in storage size / false positives for 600k 32-bit hash values, so it's not the best choice for in-memory storage, either.
