Last time we looked at Bloom filters, which is one type of probabilistic data structure. In this post we will look at another one - Count Min Sketch.
In short, a count min sketch is used to:
It's important to note that due to the risk of collisions and the size of the sketch, it can overestimate the true frequency of the events. It is after all a probabilistic data structure, it doesn't store the raw data, much like the Bloom filter, so this is expected.
So let's see how the sketch works.
We will need:
rows
x rows
(rows
and cols
should be chosed based on the estimated number of unique event types inserted into the sketch)We'll start off with a blank 2D array.
// rows = 3, cols = 5
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Now let's insert some events.
// This will insert an event of type "A".
insert("A")
// For each row, we'll hash the event type with the
// hashing function for that row to get an index k
// h0("A") - 2
// h1("A") - 4
// h2("A") - 0
// And now we index into every row with the computed k
// and increment by one.
0 0 1 0 0
0 0 0 0 1
1 0 0 0 0
Now an event of a different type.
insert("B")
// h0("B") - 1
// h1("B") - 4
// h2("B") - 3
0 1 1 0 0
0 0 0 0 2
1 0 0 1 0
Rememeber we are inserting a stream of events, so we will have many events of the same type. So let's insert another event of type "A";
insert("A")
// h0("A") - 2
// h1("A") - 4
// h2("A") - 0
0 1 2 0 0
0 0 0 0 3
2 0 0 1 0
Great, now how do we query for the frequency of an event?
getFrequency("A")
// This is our state
// 0 1 2 0 0
// 0 0 0 0 3
// 2 0 0 1 0
// We'll hash again, just like when inserting.
// h0("A") - 2
// h1("A") - 4
// h2("A") - 0
// Now for every row we'll get sketch[row, hrow("A")]
// sketch[0, 2] - 2 (the frequency on this row is 2)
// sketch[1, 4] - 3 (the frequency on this row is 3)
// sketch[2, 0] - 2 (the frequency on this row is 2)
// And now we take the min of those
// min(2, 3, 2)
2
Next up, we'll implement this. We already have all the building blocks for double hashing from the Bloom filter so the implementation should be pretty straight-forward.
void CountMinSketch::insert(const std::string & key)
{
for (size_t row = 0; row < m_rows; ++row)
{
// Hash
auto k = m_hash.hashIteration(key, row, m_cols);
// Increment
++(m_vector[row][k]);
}
}
uint32_t CountMinSketch::getFrequency(const std::string & key) const
{
size_t minFrequency = std::numeric_limits<std::size_t>::max();
for (size_t row = 0; row < m_rows; ++row)
{
// Hash
auto k = m_hash.hashIteration(key, row, m_cols);
// Find min
auto currentFrequency = m_vector[row][k];
if (currentFrequency < minFrequency)
{
minFrequency = currentFrequency;
}
}
return minFrequency;
}
You can see the full implementation at Sketch.
Share something about this post.