Antenna SQL Tutorial: Reach and Frequency

Antenna is a new method of gaining seamless access to your log-level data for querying in a SQL interface without the need for ingesting, cleaning and transforming your log files.

In this tutorial, we’ll explore a few ways you can use the log-level data that Antenna makes available to analyze your data in ways that are unavailable to you in Beeswax’s Query Tool. Throughout this process we’ll work our way up from basic queries to queries that are more advanced.

While we’ll work our way up in difficulty, this tutorial assumes a basic working knowledge of SQL. For a tutorial on SQL basics, one resource is W3Schools’ SQL Tutorial.

Example 1: Calculating Reach and Frequency

Let’s start with a relatively basic analysis: finding the Reach and Frequency of our campaigns. For this analysis we only need data on won impressions, so we’ll use the table that houses this: SHARED_IMPRESSION_DETAILS_VIEW.

Let’s start with the basic framework of the query, selecting the fields we need and getting a basic impression count before we bring in Reach and Frequency. We’ll aggregate over the campaign_id and line_item_id dimensions for the last 30 days of data. We’re pulling a count of all the rows here, as this table represents impressions.

SELECT
	campaign_id,
	line_item_id,
	COUNT(*) AS impressions
FROM shared_impression_details_view
WHERE bid_time >= CURRENT_DATE() - 30
GROUP BY campaign_id, line_item_id
ORDER BY impressions DESC

Now we have all our campaigns broken down by line items sorted by impressions in descending order. This isn’t anything too special though — you can pull this in Query Tool today!

Let’s start getting closer to the analysis we actually want, with Reach and Frequency calculations. To calculate Reach we’ll use Snowflake’s HLL function which runs a HyperLogLog approximate count analysis. This will trade off a small degree of accuracy (+/- 1.62%) for a much more cost-efficient query.

SELECT
	campaign_id,
	line_item_id,
	COUNT(*) AS impressions,
	HLL(user_id) AS reach,
	-- The below could be used instead of HLL with worse performance but an exact count. We've commented it out for now.
	-- COUNT(DISTINCT user_id) AS alt_reach_calculation 
	SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS measurable_reach_impressions,
	HLL(user_id)/SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS frequency
FROM shared_impression_details_view
WHERE bid_time >= CURRENT_DATE() - 30
GROUP BY campaign_id, line_item_id
ORDER BY reach DESC

Great! We’ve now got a report calculating reach and frequency across our line items. However, for good measure let’s swap this out for a dimension not available in Query Tool today — deal_id.

SELECT
	deal_id,
	COUNT(*) AS impressions,
	HLL(user_id) AS reach,
	SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS measurable_reach_impressions,
	HLL(user_id)/SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS frequency
FROM shared_impression_details_view
WHERE bid_time >= CURRENT_DATE() - 30
GROUP BY deal_id
ORDER BY reach DESC

With a few small steps in Antenna, we’ve now gone from a query you could fully run in Query Tool to one that contains both dimensions and metrics in a frequency analysis that were previously unavailable to you.

Limiting to Converters

Let’s do one final alteration that may offer a little bit more insight. Let’s find out the distribution of impression frequency for converters. Let’s start by getting the frequency for each user_id, but limited to only impressions that had a conversion. We’ll use a slightly different methodology than before. This analysis doesn’t handle all of the niche cases, but will pull the frequency of impressions for any user who made at least one conversion and saw at least one impression in the last 30 days.

SELECT
	user_id,
	COUNT(*) AS impressions
FROM shared_impression_details_view
WHERE conversions > 0
  AND bid_time >= CURRENT_DATE() - 30
GROUP BY user_id

Close, but we’re not done yet to get the frequency distribution. Let’s wrap this in a subquery to take these results and get a distribution. Because there may be a long-tail of users with a large number of impressions. Let’s group all the users with 10+ impressions together into a single bucket. To do this we need to also cast impressions_per_user to the string datatype using TO_CHAR.

SELECT
	CASE
		WHEN impressions_per_user >= 10 THEN '10+'
		ELSE TO_CHAR(impressions_per_user)
		END AS impressions_per_user,
	COUNT(*) AS frequency
FROM
	(SELECT
		user_id,
		COUNT(*) AS impressions_per_user
	FROM shared_impression_details_view
	WHERE conversions > 0
  AND bid_time_utc >= CURRENT_DATE() - 30
	GROUP BY user_id)
GROUP BY impressions_per_user
ORDER BY impressions_per_user

With these results, we can take the results into a spreadsheet program and create a nice visualization of the frequency distribution (log scale).

1078

Frequency per user on log scale