A set data structure uses a hashing function to store values and to check whether a value exists. Bloom filters are similar, except that they use multiple hashing functions for the same job. Conceptually, a bloom filter stores values by marking indices in an internal array, and checks for values by inspecting those same indices.
As a whole, they are useful for answering two questions:
- Whether a value is definitely not there, guaranteed.
- Whether a value might be there.
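That behavior can be sketched in a few lines of Python. This is a minimal illustration, not a canonical implementation: the array size, the salted-sha256 hashing scheme, and the method names are all arbitrary choices made for this sketch.

```python
import hashlib

class BloomFilter:
    """A minimal bloom filter sketch (illustrative only)."""

    def __init__(self, size=16, num_hashes=3):
        self.num_hashes = num_hashes
        self.size = size
        self.bits = [False] * size  # the internal array

    def _indices(self, value):
        # Derive several indices by salting one base hash; real
        # implementations often use independent hash functions instead.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        # Storing a value marks all of its indices.
        for i in self._indices(value):
            self.bits[i] = True

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all(self.bits[i] for i in self._indices(value))
```

An empty filter answers `False` for everything, and a stored value always answers `True` afterward; the interesting cases in between are what the rest of this article walks through.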
Storing data
Let's say that a user visits article \(D\) on a blog. We'll have a bloom filter that tells us whether a page has definitely never been visited, or might have been visited. If we store article \(D\) into our bloom filter, the filter will run it through three different hash functions, each of which outputs an index of the array to mark.
hash1(D) => 4
hash2(D) => 0
hash3(D) => 6
These hash functions are chosen in a way such that they try to give unique answers as much as possible. In other words, hash1() shouldn't generate the same index as hash2() or hash3(), ideally.
Now, a user visits article \(E\), which produces the following indices after hashing:
hash1(E) => 1
hash2(E) => 3
hash3(E) => 5
This marks indices 1, 3, and 5 in our bloom filter, in addition to the 0, 4, and 6 already marked for article \(D\).
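To see the array fill in, the snippet below hard-codes the indices from this walkthrough; a real bloom filter would compute them from the article itself, as shown earlier.

```python
# Stub "hash outputs" matching this walkthrough, not real hash functions.
HASHES = {
    "D": (4, 0, 6),
    "E": (1, 3, 5),
}

bits = [False] * 11  # indices 0..10

for article in ("D", "E"):
    for i in HASHES[article]:
        bits[i] = True

print([i for i, b in enumerate(bits) if b])  # -> [0, 1, 3, 4, 5, 6]
```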
Is it in the bloom filter?
Up until now, we've had users visit specific articles on our website. We will stop storing visited articles in our bloom filter - from this point on, we will only query our bloom filter.
Let's say that we now have some sort of recommendation system that generates article recommendations for users. It generates one recommended article, \(G\).
We put \(G\) into the same hashing functions and we get the following indices:
hash1(G) => 8
hash2(G) => 7
hash3(G) => 2
Since none of these indices are marked in our array, our bloom filter guarantees that \(G\) has not been visited.
Therefore, we can recommend article \(G\) to the user!
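The lookup for \(G\) can be sketched the same way, again with the walkthrough's indices hard-coded in place of real hash functions:

```python
# Array state after storing articles D and E (indices from the walkthrough).
bits = [False] * 11
for i in (4, 0, 6, 1, 3, 5):
    bits[i] = True

g_indices = (8, 7, 2)  # this walkthrough's hash outputs for article G
# A value is definitely absent unless ALL of its indices are marked.
print(all(bits[i] for i in g_indices))  # -> False: G was definitely never stored
```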
Next up, our recommendation system pumps out \(H\), which gives us the following indices:
hash1(H) => 1
hash2(H) => 10
hash3(H) => 9
Here, we see that 10 and 9 are not in our array, but 1 is. Since some of these indices are unmarked, we can still say without a doubt that \(H\) does NOT exist in the bloom filter. That means we can recommend article \(H\) to the user.
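The partial match for \(H\) is worth spelling out in code: one marked index is not enough, because only a value whose indices are all marked counts as "possibly present".

```python
# Array state after storing articles D and E (indices from the walkthrough).
bits = [False] * 11
for i in (4, 0, 6, 1, 3, 5):
    bits[i] = True

h_indices = (1, 10, 9)  # this walkthrough's hash outputs for article H
print([bits[i] for i in h_indices])  # -> [True, False, False]
# Unless ALL indices are marked, the value is definitely not in the filter.
print(all(bits[i] for i in h_indices))  # -> False: H was definitely never stored
```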
Now our recommendation system pumps out \(X\), which gives us the following indices:
hash1(X) => 5
hash2(X) => 3
hash3(X) => 1
Unfortunately, these indices are all marked in our array due to unlucky hash outputs. Therefore, \(X\) might be in our bloom filter, and we cannot say with 100% certainty that the user has not visited article \(X\).
In actuality, we know that \(X\) was never visited, since the user has only visited articles \(D, E, G, H\). However, hashing \(X\) happened to produce indices that were already marked, so the bloom filter tells us that article \(X\) may have been visited. This is a false positive: reality differs from the bloom filter's prediction. In this situation, based on the guidance of our bloom filter, it doesn't make sense to recommend article \(X\), due to the possibility that it has already been visited.
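The false positive for \(X\) looks like this in code, using the same hard-coded walkthrough indices:

```python
# Only articles D and E were ever stored in the filter.
bits = [False] * 11
for i in (4, 0, 6, 1, 3, 5):
    bits[i] = True

x_indices = (5, 3, 1)  # unlucky hash outputs for X (from the text)
# All of X's indices are marked, even though X itself was never added.
print(all(bits[i] for i in x_indices))  # -> True: a false positive
```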
Conclusion
With the examples above, we saw that the bloom filter can tell us with certainty when an article has not been visited. We also encountered a false positive, where the bloom filter could only say that an article might have been visited, without being 100% certain.
In general, the bloom filter is an attractive option for simple existence checks due to its very fast lookups. It can quickly tell you that something is not in it, which makes it a great choice for queries such as "Does X exist?".
Too many false positives are unhelpful, and there are multiple ways to deal with them. One is to tailor the existence check to your application's benefit: for example, if our bloom filter says article \(X\) might have been visited, simply skip article \(X\) and try a new article recommendation. Another is to increase the number of hashing functions, or change the hashing algorithms, to reduce false positives at the expense of performance.
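The trade-off can also be quantified. A standard approximation for the false-positive rate of a bloom filter with \(m\) bits, \(n\) stored items, and \(k\) hash functions is \((1 - e^{-kn/m})^k\); the quick sketch below shows how adding bits drives the rate down.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive rate for a bloom filter with m bits,
    n stored items, and k hash functions: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# More bits per stored item drives the rate down sharply:
for m in (10, 50, 100):
    print(m, false_positive_rate(m, n=10, k=3))
```

With 10 items and 3 hash functions, going from 10 bits to 100 bits takes the approximate rate from roughly 86% down to under 2%.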