Bloom Filter

From DCppWiki

One of the larger bandwidth consumers of a hub is search propagation, due to its frequency and broadcast nature. Bloom filters provide one method of mitigating that issue.


Bloom Filters and DC

Bloom filters are a way of summarizing the content of a set by hashing its elements' values into a specified codomain with a range designed to allow for easy manipulation, and enabling a bit in an array at the resulting index. They provide approximate set membership information. By applying this to a user's share, a hub can refrain from sending a search to that user when he's guaranteed not to respond due to a lack of matching files; Bloom filters may exhibit false positives but not false negatives.

Substring considerations

The description of bloom filters above neglects that DC's searches are of a (multiple, but that's not a problem) substring nature. Accordingly, the standard bloom filter will not function. This may be solved by splitting search terms into n-grams, these being all of the substrings of length n in a search term (the word "palindromes"'s n-grams might be palin, alind, lindr, indro, ndrom, drome, and romes for n = 5, for example). When adding a term to the bloom filter table, add all of those n-grams. When checking for a term's presence, check for all of the n-grams as well.

Optimal n-gram length

Simulations of varying values of n have shown that for diverse workloads, n = 5 is optimal, and n = 4 is quite close.

Interaction with hashes

Hashes behave quite differently than other DC search terms, in that they're not substrings. Thus, one should only insert the hashes of entire hash strings.

File-hashes also use up much more space in a table than hashes from text-strings do: text-string n-grams will have a few, out of all possible n-grams, common occurences, while file-hashes are random, so storing file-hashes (atleast in the same table as for text) is not a good idea.

Hub-side treatment

A hub should store the Bloom filter tables for each connected client, and verify that that client might potentially respond to a search query before forwarding it a search. It should assume an initially completely-filled table, so that clients which don't send it a bloom filter still receive all searches.

Protocol Implementation

The client should specify in its $Supports command that it supports bloom filtering. A hub wishing to use that to its advantage should then send reply with its own Supports noting that it too agrees to use Bloom filters, and the size of the tables it's expecting to receive. From then on, any change in shares by the client should be signalled to the hub via xor-updates from the last one sent.