HyperLogLog in Redis - A Deep Dive with NodeJS and ioredis
Today, we'll be voyaging into the surprisingly exciting world of counting. Yes, you read that correctly - counting! But not just any counting, we'll be delving into the distinct count problem, a challenging area where our usual toolbox falls short.
This is where the wizardry of HyperLogLog in Redis comes into play. HyperLogLog is by far my favorite Redis data structure, and it's a powerful took that I use frequently in my projects. In this blog post, we're going to use this almost mythical algorithm with Node.js and ioredis to do some fun stuff.
Before we begin, fasten your seat belts, put on your favorite nerdy glasses, and brace yourself to dig deep into the entrancing universe of probabilistic data structures. Let's jump in!
An Encounter with HyperLogLog
HyperLogLog, with its fancy name, is an algorithm specially designed for the count-distinct problem, which is, in layman's terms, the problem of counting unique elements in a list. Now you might be thinking - hey Shayan, isn't that a problem easily solved with a Set data structure?
Usually, you'd be right! But here's the twist. Consider a list as large as the total number of Google searches or website visits. Suddenly, counting each unique element becomes a colossal task, consuming astronomical amounts of memory, and as you know, memory is a precious, limited, and expensive resource.
Enter HyperLogLog, our memory-efficient hero. This probabilistic data structure approximates the number of unique elements, allowing you to count billions of distinct elements using very little memory, with a tiny margin of error (~0.81%). This might not be perfect, but when dealing with massive datasets, it's a trade-off we're more than willing to make most of the time.
The Magic Behind HyperLogLog
Let's dive a little deeper. HyperLogLog doesn't store all the elements; instead, it uses a clever hashing technique and then stores some interesting aspects of the hashed output. Specifically, it records the maximum number of leading zeros in the binary representation of these hashes.
These hash functions make each element 'randomly' scatter across a binary space. So, the likelihood of having lots of leading zeros in our hashed output is pretty low. But as our dataset grows and more elements are hashed, the odds of encountering a hash with more leading zeros increases.
HyperLogLog leverages this statistical property to approximate the number of distinct elements. It uses the longest run of leading zeros observed as an estimator of the logarithm of the number of unique elements. Hence, it captures the 'logarithm' of the count in its name. The 'Hyper' part is from the improvement brought about by Philippe Flajolet and his team in the algorithm, hence 'HyperLogLog.'
Redis and HyperLogLog - A Power Duo
Redis, well known for its high-speed and in-memory data storage, houses the HyperLogLog algorithm. This pairing is a match made in heaven, empowering us to perform near real-time analytics on large-scale data, without crippling the system's resources.
Node.js, ioredis, and HyperLogLog - The Trifecta
Enough with the boring theory, let's get our hands dirty! To interact with Redis, we'll use
ioredis, a feature-rich Redis client that offers extensive support for Redis commands, including our main character today - HyperLogLog.
To set up our playground, we need Node.js and Redis installed in our environment. If you haven't done this already, follow the guides on the Node.js and Redis official websites for installation. After you're done, install ioredis in your project directory using
npm install ioredis.
Now, let's get to coding. Create your
index.js file and import ioredis:
This snippet establishes our connection to Redis. If you're running Redis on your local machine, the default host is
127.0.0.1 (localhost), and the default port is
6379. If you're using a different host or port, you can pass them as as an object to the
Now, let's add some elements to our HyperLogLog data structure using the
'pfadd' stands for the command used to add elements to the HyperLogLog in Redis. 'website_visits' is the key identifying our HyperLogLog structure. This function adds three users to our 'website_visits' HyperLogLog.
Let's count our unique users using the
pfcount command fetches us the count of unique elements in our HyperLogLog structure, which should return 3, the number of unique users we added.
Let's add more users and check the count again, reusing
Notice how we've included 'User2' again. When we run this function, it should return 6. Even though we added 'User2' twice, HyperLogLog ensures that it's only counted once, and that's the magic of this algorithm. Think of all the possible applications of this in your projects!
An Ocean of Possibilities
With HyperLogLog, you can count unique IPs visiting a website, distinct search queries, unique users performing some action, and so much more, all while conserving memory. Even though there's an error margin, this trade-off is acceptable considering the resource efficiency HyperLogLog provides, especially for large-scale data.
And that, dear friends, is the end of our deep dive into HyperLogLog. We've navigated the deep waters of probabilistic data structures and emerged victorious, equipped with a potent tool in our developer toolkit.
Remember, the best way to solidify your understanding is through practice. So roll up your sleeves, fire up your text editors, and experiment with HyperLogLog in your projects. Till next time, stay curious, and keep exploring the magical world of algorithms and data structures!
Interested in LogSnag?