Learn how to calculate daily, weekly, and monthly active users on Hacker News data.
Active users is part of almost every business's key metrics suite, especially for online businesses built on subscriptions, e-commerce, or social. It answers one of the most fundamental questions for any business leader: how many people are using my product, and is that number growing?
Typically, it's defined as the number of unique users who have taken some action (visited, clicked, purchased) within a fixed time window (1 day, 7 days, 30 days, etc.). Because of the fixed time window, active-user metrics (whether daily, weekly, or monthly) are typically reported along a time dimension.
A common report may look like this:
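For example, a report with illustrative (made-up) numbers, not real Hacker News figures:

```
date         dau     wau     mau
2021-07-12   1,842   7,310   22,504
2021-07-13   1,911   7,402   22,617
2021-07-14   1,876   7,455   22,730
```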
A user is active in a given time window if they did some action within that time window. For example, if a user places an order on 2021-07-13, they'd be considered a daily active user (DAU) on 2021-07-13, but not on 2021-07-14. On 2021-07-31, they'd be considered a monthly active user (MAU).
In this post, we'll define active users as the number of posters on Hacker News. There is a publicly available, daily updated BigQuery table called bigquery-public-data.hacker_news.full that contains a row for every story and comment posted on Hacker News. Each row contains a by column (the username), a timestamp column (when the action occurred), and either a title value (if the user posted a story) or a text value (if they posted a comment). To calculate active users, we only need the by and timestamp columns.
How to do this in SQL
Since this dataset is publicly available, you should be able to run all queries as is in your BigQuery project.
First, create a date spine table containing every single day in a range. This is the "base" table onto which we'll join all activity data. A date spine matters here because it ensures we don't drop dates on which no user was active.
To calculate a date spine from 1 year ago up until today:
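One way to do this in BigQuery is with generate_date_array (a sketch; the column alias is illustrative):

```sql
-- One row per day, from one year ago through today.
select day as date
from unnest(
  generate_date_array(
    date_sub(current_date(), interval 1 year),
    current_date()
  )
) as day
order by 1
```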
To calculate MAU, join the date spine table above against the hacker_news.full table, matching each spine date to user activity within the preceding thirty days. Then count distinct users by date:
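A sketch of that query (note that by is a reserved word in BigQuery and must be backticked; the 30-day window here is inclusive of the spine date):

```sql
with date_spine as (
  select day as date
  from unnest(
    generate_date_array(date_sub(current_date(), interval 1 year), current_date())
  ) as day
)
select
  s.date,
  count(distinct hn.`by`) as mau
from date_spine s
left join `bigquery-public-data.hacker_news.full` hn
  on date(hn.`timestamp`) between date_sub(s.date, interval 29 day) and s.date
group by 1
order by 1
```

The left join preserves every spine date, so days with no activity show up with a count of zero rather than disappearing.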
If we want to calculate things more generically (e.g. both MAU and DAU from one query), we need an additional step: compute days_since_last_active per user per date, then do a count distinct with a filter on days_since_last_active:
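One way to write this (a sketch; the CTE names are illustrative):

```sql
with date_spine as (
  select day as date
  from unnest(
    generate_date_array(date_sub(current_date(), interval 1 year), current_date())
  ) as day
),
activity as (
  -- One row per user per day of activity.
  select distinct `by` as username, date(`timestamp`) as activity_date
  from `bigquery-public-data.hacker_news.full`
  where `by` is not null
),
user_activity as (
  -- For each date, how many days ago was each user last active (within 30 days)?
  select
    s.date,
    a.username,
    min(date_diff(s.date, a.activity_date, day)) as days_since_last_active
  from date_spine s
  left join activity a
    on a.activity_date between date_sub(s.date, interval 29 day) and s.date
  group by 1, 2
)
select
  date,
  count(distinct if(days_since_last_active < 1, username, null)) as dau,
  count(distinct if(days_since_last_active < 30, username, null)) as mau
from user_activity
group by 1
order by 1
```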
To save end users this work, you might save this query as a view in your data warehouse called active_users_by_date. Now, users only need to run select * from active_users_by_date.
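In BigQuery, that looks like the following (the analytics dataset name is illustrative):

```sql
create or replace view analytics.active_users_by_date as
select ...  -- the full days_since_last_active query from the previous step
```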
The corresponding SGQL queries for these metrics are straightforward.
Extending the SQL approach
The SQL approach here has the advantage of being familiar, but it doesn't allow a lot of flexibility in the analysis. For example, what happens if you actually want weekly active users (WAU)? You have three options:
Option 1: Add the following line to the view and re-publish:
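Assuming the view follows the days_since_last_active pattern above, the added line might look like this (a sketch, not the exact view code):

```sql
count(distinct if(days_since_last_active < 7, username, null)) as wau
```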
However, this assumes you have edit access to the view, which is not always the case: the original author probably doesn't want just anybody going in and adjusting active-user definitions.
Option 2: Copy-paste the original query and adjust the logic yourself. This isn't great either, since you're duplicating a ton of boilerplate (e.g. all the date spine logic) just to tweak one metric.
Option 3: Save an intermediate view and push metric definitions down to the query layer. In other words, only save the part of the query that emits username, date, and days_since_last_active, and have queriers run select date, count(distinct username) from active_users where days_since_last_active < 30. But this gives queriers too much flexibility. Is MAU 30 days or 28 days? Or is it a calendar month? You probably don't want to leave that decision to every query.
Extending the Supergrain approach
In the Supergrain approach, a querier can simply express the 7-day window themselves at query time.
If WAU is an important metric, it's still best for this to be represented at the SML layer like MAU and DAU. However, in the interim, queriers are not blocked. This is the core advantage of Supergrain: it balances flexibility with consistency of definitions.
The bottom line
Defining and calculating active users requires more boilerplate than you might expect (date spines, for example). Using SQL views to lock in definitions works, but forces a choice between consistent metrics and query flexibility.
The Supergrain approach requires more upfront work, but gives you both consistent metrics definitions and query-time flexibility.