Tuesday, September 01, 2015

A case for Bloom Filter

We provide cache as service to our users. Users can put elements into cache with a TTL. We needed to determine the performance of the cache in terms of the hit rate.

The hit rate can be low because of two factors - LRU in the cache and TTL of the element put by user. We needed to ensure that the hit rate is not low because of LRU - in that case we were ready to allocate more memory for the cache.

One of the ways to determine if the TTL caused removal of the element is to track all the keys in the the cache even if the elements were removed. 

Bloom filter is a good data structure to track the keys. We keep adding keys to the bloom filter. When a cache miss occurs, we check the bloom filter if the key was ever present. If it was, then the cache miss was caused by TTL. If it was not present then it was a real cache miss. 

Bloom filter will give an affirmative answer if an element was not present. Its probabilistic answer if the element was present - in our case this is acceptable since we will count additional cache miss due to TTL. It is acceptable because it is easy to tune TTL than cache.

Also we are adding elements for tracking and no removals - which also matches with the requirement of the bloom filter.

In summary, Bloom filter worked well for our case compared to other alternate set data structures. We were able to determine presence of an element with minimal possible bits and tune our cache performance.