Listening to your users: Inferring Affinities and Interests based on actual time spent vs clicks or pageloads

Personalized recommendations rely on the idea that you know the interests of your audience. In the absence of explicit feedback, interests are generally derived from clickstream data: session and event (e.g. click) data. But given that sessions can be short-lived (bounces) and clicks can be unintentional, simply counting them is unlikely to reflect the true interests of your audience.

At Blueshift, we choose to actively follow along with each individual’s storyline and extract intelligence from every event to gather insights into the user’s intent and interests, so we can provide better recommendations.

Let’s look at a real user example

In the table below, we see an actual clickstream of events from a user on blueshiftreads.com.

Timestamp | Session_id | Event | Category | Book title
12:30:24 | session_id1 | view | Biography & Autobiography > Personal Memoirs | Eat Pray Love
12:31:29 | session_id1 | view | Drama > American > General | Death of a Salesman
13:48:49 | session_id2 | view | Science > Physics > General | Physics of the Impossible
13:49:02 | session_id2 | view | Biography & Autobiography > Personal Memoirs | Eat Pray Love
13:49:09 | session_id2 | view | Health & Fitness > Diet & Nutrition > Nutrition | The Omnivore’s Dilemma
13:49:19 | session_id2 | view | Health & Fitness > Diet & Nutrition > Nutrition | The Omnivore’s Dilemma
13:49:35 | session_id2 | view | Poetry > American > General | Leaves of Grass
14:09:47 | session_id2 | view | Poetry > American > General | Leaves of Grass
14:10:02 | session_id2 | add_to_cart | Poetry > American > General | Leaves of Grass

This specific user interacted during two different sessions, browsing books from different categories. If we try to come up with the top categories for this user based on the number of sessions in which each category was viewed, we get:

Rank | Category | Session count
1 | Biography & Autobiography > Personal Memoirs | 2
2 | Health & Fitness > Diet & Nutrition > Nutrition | 1
3 | Poetry > American > General | 1
4 | Science > Physics > General | 1

As you can see in the table above, Personal Memoirs is the top category while the three other categories tie for second place (they are ordered alphabetically here, but other tie-breaking rules can be applied).

Time spent ranking

At Blueshift, we developed algorithms to re-rank these categories according to the time the user actually spent on your products and categories:

Rank | Category | Time spent (seconds)
1 | Poetry > American > General | 1212
2 | Biography & Autobiography > Personal Memoirs | 72
3 | Health & Fitness > Diet & Nutrition > Nutrition | 26
4 | Science > Physics > General | 13

Here, we rank ‘Poetry > American > General’ above the other categories. Note that at the end of the original event stream above, the user actually did add the book from that category to the cart. Even if we had ignored that event, our time-based ranking would still have captured a category of genuine interest to this user.
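For illustration, here is a simplified sketch of how per-category time spent could be derived from such a clickstream, assuming the time spent on an item is approximated by the gap until the next event in the same session. The exact attribution rules Blueshift uses may differ, and the field names below are illustrative:

require 'time'

# Attribute the gap between consecutive events in a session to the
# category of the earlier event (simplified illustration)
def time_spent_per_category(events)
  events
    .group_by { |e| e[:session_id] }
    .each_with_object(Hash.new(0)) do |(_, session_events), totals|
      session_events
        .sort_by { |e| e[:timestamp] }
        .each_cons(2) do |current, following|
          totals[current[:category]] += following[:timestamp] - current[:timestamp]
        end
    end
end

events = [
  { session_id: 'session_id2', timestamp: Time.parse('13:48:49'),
    category: 'Science > Physics > General' },
  { session_id: 'session_id2', timestamp: Time.parse('13:49:02'),
    category: 'Biography & Autobiography > Personal Memoirs' },
  # ... remaining events ...
]
p time_spent_per_category(events).sort_by { |_, seconds| -seconds }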

There’s more: decayed time spent

You should be careful not to rely on detailed information from a single user on a single day: if the user indeed bought the book they added to the cart, that might actually signal that they are no longer interested in that specific category of products. Furthermore, you want to adapt to changing user interests over time.

That’s why we implemented what we call a decayed time spent algorithm, which aggregates the time spent by users over a certain period (say, the last week) and weighs recent time spent more heavily in the ranking than time the user spent further in the past (say, 14 days ago).

Weighting recency this way allows recommendations to adapt quickly to shifting user interests, for example when users shop during the holidays and might be looking for gifts for others as well as for themselves.
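As an illustration only, a decayed score could look something like the sketch below, which applies an exponential decay with a configurable half-life; the actual weighting scheme and parameters Blueshift uses are not described here, and the numbers are made up:

# Exponentially decayed time spent per category (illustrative sketch).
# `observations` is assumed to be an array of
# { category:, seconds:, days_ago: } hashes aggregated per day.
def decayed_time_spent(observations, half_life_days: 7.0)
  decay_rate = Math.log(2) / half_life_days
  observations.each_with_object(Hash.new(0.0)) do |obs, scores|
    # Time spent further in the past contributes exponentially less
    weight = Math.exp(-decay_rate * obs[:days_ago])
    scores[obs[:category]] += obs[:seconds] * weight
  end
end

observations = [
  { category: 'Poetry > American > General', seconds: 1212, days_ago: 0 },
  { category: 'Biography & Autobiography > Personal Memoirs', seconds: 72, days_ago: 10 },
]
p decayed_time_spent(observations).sort_by { |_, score| -score }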

From user-level signal to site-wide signal

Many product recommendations are related to site-wide top categories of products, like ‘top viewed’. Using our time-based algorithms, we can rank these top categories better as well. Let’s look at another example from blueshiftreads.com, where we show part of the list of the 25 most popular categories (positions 20 through 25, to be exact).

Using classical session counting, we obtain the following ranking:

Category | Session count
Juvenile Fiction > People & Places > United States > African American | 5358
Juvenile Fiction > Girls & Women | 5291
Juvenile Fiction > Family > General | 5265
Fiction > Contemporary Women | 5215
Fiction > Thrillers > Suspense | 4971
Fiction > Mystery & Detective > Women Sleuths | 4804

However, when we re-rank these categories based on the actual time spent by users, we see that ‘Juvenile Fiction > Girls & Women’ drops from position 21 (above) to position 23 (below), even though it had 76 more user sessions in the 7 days over which this was calculated. User sessions are no guarantee of actual interest (i.e. time spent).

Category | Time spent
Juvenile Fiction > People & Places > United States > African American | 102164972
Juvenile Fiction > Family > General | 100447985
Fiction > Contemporary Women | 98897169
Juvenile Fiction > Girls & Women | 98340874
Fiction > Thrillers > Suspense | 91140081
Fiction > Mystery & Detective > Women Sleuths | 87372604

Furthermore, if we rank the categories using our decayed time spent, we see that ‘Fiction > Contemporary Women’ is now ranked highest of the reshuffled categories (position 21), while it was the lowest of them (position 23) in the original list. This indicates that this category received the most time spent by users in the most recent past.

Category | Time score
Juvenile Fiction > People & Places > United States > African American | 28461106.29
Fiction > Contemporary Women | 28179308.93
Juvenile Fiction > Girls & Women | 28068989.26
Juvenile Fiction > Family > General | 27608048.02
Fiction > Thrillers > Suspense | 26102829.31
Fiction > Mystery & Detective > Women Sleuths | 24597921.38

OK, why bother?

So why bother re-ranking? Well, most catalogs will exhibit a long tail in the distribution of popularity of their content: very few items will be very popular while lots of items will be very unpopular. No matter how you rank the popularity of the top 10 categories (by sessions, clicks, time, …) out of a 1,000-category catalog, these extremely popular categories will always be on top. Just have a look at the top 20 categories from blueshiftreads.com:

[Figure: time spent for the top 20 categories on blueshiftreads.com]

As you can see, the top 5 categories do a lot better than the rest. For most businesses there is a lot of value in promoting content from categories other than these few favorites. Therefore, if you can avoid down-ranking categories that are interesting to users, and do this consistently over your whole catalog, you will be able to recommend products from the appropriate category to the users who care about it. In other words, you will avoid the pitfall of recommending an overly popular yet generic product to your users.

But doesn’t time spent relate to sessions/clicks anyway?

Yes and no. It is true that more sessions correlate with more time spent on categories, but not to the same extent: session length can range from a second to tens of minutes. Have a look at the graph below.

What we see is the ranking of the 1000+ categories (on the X-axis) for blueshiftreads.com by popularity (on the Y-axis, logarithmic scale) over 7 days, according to three different metrics:

  • The blue line represents ranking by session count. It is very smooth because it really ranks all categories just in descending order of session count. This is the standard ranking.
  • The red line represents ranking by time spent by the users. It is equally smooth in the beginning (left) because it ‘agrees’ with the session ranking: as mentioned above, the top popular categories will always be on top. But quite soon, the line becomes spiky: the ranking disagrees with session count, and the spikes indicate that this ranking would reorder the categories in a different way (promoting different categories to the top).
  • The green line is the decayed time spent ranking: the same observation holds as for the time spent ranking. This algorithm also disagrees with session count and would reorder lots of categories in the long tail to promote categories of interest to the user.

[Figure: popularity rank of the 1000+ categories under session count, time spent, and decayed time spent]

This re-ranking is exactly what you need to stop recommending the same popular categories to users who have indicated (by spending time) an interest in other categories.

Send Time Optimization or Engage Time Optimization?

“Marketers should adapt their send time to each user individually, and send campaigns closer to the times when they are more likely to engage in downstream activity”

As you might have read in our previous blog post “Re-Thinking Send Time Optimization in the age of the Always On Customer“, Blueshift focuses on “Engage Time Optimization” rather than what marketers traditionally call “Send Time Optimization”. Since we posted that article, we have elaborated a bit on the development of that feature on Quora (When is the best time (day) to send out e-mails?). In this post, however, we would like to share more of those insights, and advocate for optimizing downstream user engagement metrics rather than initial open rates.

The idea of “Send Time Optimization” is not new and has been around for quite some time. One of the more recent reports on this was posted by MailChimp in 2014, but articles and discussions on this topic go back as far as 2009 and earlier. The data science team at Blueshift followed the hypothesis that if there is a specific hour of the day, or day of the week, at which an audience is more likely to engage, that should be reflected in increased open (or even click) rates when messages are sent at different times.

Open Rates vs Click Rates

In order to observe this effect (or the absence of it), we analyzed over 2 billion messages that were sent through Blueshift. Some of the results are presented in the graphs below for one of our biggest clients.

Through the Lens of Open Rates

“irrespective of the segment that was targeted, the audience size and the send time, the open rate is the highest in the first two hours after the send”

We looked at the open rate (%, shown on the Y-axis) in the first 24 hours after the send was executed (in hours, shown on the X-axis).

[Figure: open rate (%) by hour after send for 18 campaigns, grouped by day of week]

What you see are 18 email campaigns from one client over the period of one month (totaling over 20 million emails). On the top left, we see campaigns sent out on Monday; next, Tuesday; and so on, through Saturday on the bottom right. There were no campaigns on Sunday for this client during this month. These campaigns were sent to audiences ranging from tens of thousands of users in specialized segments (e.g. highly engaged customers) to large segments of 2–3M users. The send times varied from 5AM to 12PM (shown in parentheses in the legend).

What you can see from this graph is that even though the campaigns were sent out on different days of the week and at different hours, the initial response in terms of open rates is very predictable for the first hours. The conclusion from these plots is that irrespective of the segment that was targeted, the audience size and the send time, the open rate is the highest in the first two hours after the send. Depending on the actual time of the send, you can achieve a slightly higher open rate in the first hour, but you might lose more ‘area’ in the following hours, accumulating to more or less the same overall open rate after a few hours.
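For reference, an hourly open-rate curve like the ones above can be derived from raw logs along these lines. The sketch below is a simplified illustration, not Blueshift’s actual pipeline; sends and opens are assumed to be arrays of { user_id:, timestamp: } hashes for a single campaign, and repeated opens by the same user are not deduplicated:

# Open rate (%) per hour since send, for the first 24 hours
def open_rate_by_hour(sends, opens, hours: 24)
  send_time_by_user = sends.map { |s| [s[:user_id], s[:timestamp]] }.to_h
  opens_per_hour = Hash.new(0)
  opens.each do |open_event|
    sent_at = send_time_by_user[open_event[:user_id]]
    next if sent_at.nil?
    # Bucket each open by the number of whole hours since the send
    hour = ((open_event[:timestamp] - sent_at) / 3600).floor
    opens_per_hour[hour] += 1 if hour >= 0 && hour < hours
  end
  (0...hours).map { |h| [h, 100.0 * opens_per_hour[h] / sends.size] }.to_h
end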

Through the Lens of Click Rates

Naturally, the question comes to mind whether there is any measurable effect when we look at clicks, which can be considered a deeper form of engagement by the users who received the message:

[Figure: click rate (%) by hour after send for the same campaigns]

But as you can see from this second set of graphs, where the Y-axis represents the click rate (%), we observed very similar behavior: the actual response rate in terms of clicks does not significantly change when a campaign is sent at a different time.

We came to the same conclusion when repeating this experiment for opens and clicks for other clients in our dataset. After doing more in-depth analysis on our datasets, we observed that users who were targeted in email campaigns at certain times showed engagement (e.g. visits to the website or app) at other times. Users prefer to engage deeply at certain hours of the day while casually browsing throughout. Marketers should adapt their send time to each user individually, and send campaigns closer to the times when they are more likely to engage in downstream activity. You can find more info about this “Engage Time Optimization” in this post.

 

Passing named arguments to Ruby Rake tasks using docopt for data science pipelines

Introduction

Ever considered using rake for running tasks, but got stuck with the unnatural way rake tasks accept arguments? Or have you seen the fancy argument parsing that docopt and the like can do for you? This article describes how we integrated docopt with the rake command so we can launch our data science pipelines using commands like this:

$ bundle exec rake stats:multi -- --sites=foo.com,bar.com \
--aggregates=pageloads,clicks \
--days-ago=7 --days-ago-end=0

This command would, for instance, launch daily aggregate computations for clicks and pageloads, for each of the sites foo.com and bar.com, for each individual day in the last 7 days.

Not only can you launch your tasks using descriptive arguments, you get automated argument validation on top of it. Suppose we launch the task using the following command:

$ bundle exec rake stats:multi

Then the task would fail with the following help message:

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start_date> --end-date=<end_date>) | \
(--days-ago=<days_ago> [--days-ago-end=<days_ago_end>]) ]
stats:multi -- --help

It displays the mandatory and/or optional arguments and the possible combinations (e.g. mutually exclusive arguments). And the best thing of all is that all you have to do to obtain this advanced validation is specify a string just like the one you see here: indeed, docopt uses your specification of the help message to derive all the parsing and validation you want for your arguments!

The remainder of this post will explain how to set this up yourself and how to use it. This guide assumes you have successfully configured your system for using Ruby, Rails and the bundle command. Here are guides on how to set up RVM and get started with Rails.

Configuring your Rails project to use docopt

docopt is an argument parsing library available for many different languages. For more details on what it does, have a look at the documentation here. We use it as a Ruby gem. You can simply add it to your project by adding the following to the Gemfile in your project root:

gem 'docopt', '0.5.0'

Then run

$ bundle install

in your project directory. This should be sufficient to make your project capable of using the docopt features.

Anatomy of the argument specification string

First, we should elaborate a bit on how docopt knows what to expect and how to parse/validate your input. To make this work, you are expected to present docopt with a string that follows certain rules. As mentioned above, this is also the string that is shown as the help text. More specifically, docopt expects a string that follows this schema:

Usage: #{program_name} -- #{argument_spec}
#{program_name} -- --help

where program_name is the name of the command being run, -- (double dash) is required not by docopt but by rake (more on that in a moment), and argument_spec can be anything you want to put there.

Let’s look at the aforementioned example:

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start_date> --end-date=<end_date>) | \
(--days-ago=<days_ago> [--days-ago-end=<days_ago_end>]) ]

Here, the program_name is stats:multi (the actual namespace and task name of our rake task), followed by the solo --, and the argument_spec is "--sites=<sites> [--aggregates=<aggregates>] [ (--start-date=<start_date> --end-date=<end_date>) | (--days-ago=<days_ago> [--days-ago-end=<days_ago_end>]) ]"

Now, let’s go into details of the argument_spec (split over multiple lines for readability):

--sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start_date> --end-date=<end_date>) | \
(--days-ago=<days_ago> [--days-ago-end=<days_ago_end>]) ]

Basic rules

docopt considers arguments mandatory unless they are enclosed in brackets [] – then they are optional. So in this example, only --sites is required. It also requires a value, given that it is followed by =<sites>. However, <sites> here could be anything; it is simply used to give the user an idea of what kind of value is expected. If you enter --sites on the input without specifying a value, docopt will return an error that the value is missing. No effort needed on your end!
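To make this concrete, here is a small standalone sketch (using a hypothetical demo command rather than our actual task spec) showing how docopt enforces a required option:

require 'docopt'

doc = <<DOCOPT
Usage: demo --sites=<sites>
DOCOPT

# A valid invocation returns a map of the parsed values
Docopt::docopt(doc, {:argv => ['--sites=foo.com,bar.com']})
# => {"--sites"=>"foo.com,bar.com"}

# Omitting the argument raises Docopt::Exit, whose message is the usage text
begin
  Docopt::docopt(doc, {:argv => []})
rescue Docopt::Exit => e
  puts e.message
end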

Optional arguments

The next argument, [--aggregates=<aggregates>], follows the same pattern, except that this one is fully optional. In our code we will handle the case where it is not specified and come up with some default values.

Grouping and mutual exclusion

The last – optional – argument is used to specify the dates we want to run our computation for, and we want to have three ways of doing this:

  • EITHER by explicitly specifying the start-date AND end-date
  • OR by specifying the number of days-ago before the time of the command, taking as an end date the date of the command being run (e.g. 7 days ago until now)
  • OR by specifying the number of days-ago until days-ago-end (e.g. to backfill something between 14 days ago and 7 days ago).

This is where complicated things can be achieved in a simple manner. The format we used for this is in fact:

[ ( a AND b ) | ( c [ d ] ) ]

docopt requires all arguments in a group (...) to be present on the input. If only a or b is given, it will error out and inform us about the missing argument.

Similarly, a logical OR can be added via | (pipe). This will make either of the options a valid input.

Furthermore, you can combine optional arguments within a group, like we did with ( c [ d ] ). This makes the parser aware of the fact that d (in the real example above, [--days-ago-end=<days_ago_end>]) is only valid when c (--days-ago=<days_ago> in the example) has been provided. Trying to use this parameter together with --start-date will result in an error.

Note that this whole complex group is optional and we again will come up with some defaults in our code that handles the parsed arguments.
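A quick standalone sketch (again with a hypothetical demo command and abbreviated placeholder names) shows how docopt enforces these groupings:

require 'docopt'

doc = <<DOCOPT
Usage: demo [ (--start-date=<sd> --end-date=<ed>) | (--days-ago=<da> [--days-ago-end=<dae>]) ]
DOCOPT

# Supplying a complete alternative is accepted
Docopt::docopt(doc, {:argv => ['--days-ago=7']})

# Supplying only half of the (--start-date --end-date) group is rejected
begin
  Docopt::docopt(doc, {:argv => ['--start-date=2016-01-01']})
rescue Docopt::Exit => e
  puts e.message  # the usage text is printed as the error
end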

Flags

Lastly, it’s noteworthy that flags (i.e. arguments that don’t take a value), such as --force, will result in a true/false value after parsing.
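For example (a hypothetical spec, not part of our pipeline):

require 'docopt'

doc = <<DOCOPT
Usage: demo [--force] --sites=<sites>
DOCOPT

args = Docopt::docopt(doc, {:argv => ['--sites=foo.com']})
args['--force']  # => false, since the flag was not passed
args['--sites']  # => "foo.com"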

For more information and examples, consult the docopt documentation here. However, the explanation above should already get you a long way.

RakeTaskArguments

Now that you have an understanding of how the argument string defines how your input will be parsed (or rejected), we can write a class that wraps all of this functionality together, so that all we have to do is specify this string and we get a map with the parsed values in return.

To this end, we wrote the RakeTaskArguments.parse_arguments method:


class RakeTaskArguments
  def self.parse_arguments(task_name, argument_spec, args)
    # Set up the docopt string that the input will be parsed against
    doc = <<DOCOPT
Usage: #{task_name} -- #{argument_spec}
#{task_name} -- --help
DOCOPT
    # Prepare the return value
    arguments = {}
    begin
      # Because the new version of rake passes the -- along in the
      # args variable, we need to filter it out if it's present
      args.delete_at(1) if args.length >= 2 and args.second == '--'
      # Have docopt parse the provided args (via :argv) against the doc spec
      Docopt::docopt(doc, {:argv => args}).each do |key, value|
        # Store key/value, converting '--key' into 'key' for accessibility.
        # Per the docopt spec, the key '--' contains the actual task name
        # as a value, so we label it accordingly
        arguments[key == "--" ? 'task_name' : key.gsub('--', '')] = value
      end
    rescue Docopt::Exit => e
      abort(e.message)
    end
    return arguments
  end
end

The method takes 3 arguments:

  • task_name: the rake task that we want to execute
  • argument_spec: the specification we discussed before
  • args: the actual input that was provided when launching the task and that should be validated

Parsing and validation happen magically via

Docopt::docopt(doc, {:argv => args})

which returns a map with the keys and values of our input arguments. We iterate over the key-value pairs and strip the leading -- (double dash – e.g. --sites) from the keys so we can access them in the resulting map later on by their plain name (e.g. ...['sites'] instead of ...['--sites']), which is just more practical to deal with.

The solo -- (double dash) that keeps coming back

We keep seeing this solo -- floating around in the strings, like stats:multi -- --sites=<sites>. As was pointed out here on StackOverflow, this is needed to make the actual rake command stop parsing arguments. Indeed, without adding this -- immediately after the rake task you want to execute, rake would consider the subsequent arguments to be related to rake itself. Therefore, we also have it in our docopt spec

#{task_name} -- #{argument_spec}

so that the library handles it properly instead of parsing it out. It is inconvenient, but once you get used to it, handling it this way brings far more benefits than hassle.

WARNING: It seems that rake version 10.3.x did not pass this -- along in the ARGV list, but the newer rake version 10.4.x DOES pass it along. Therefore, we added the following code:

args.delete_at(1) if args.length >= 2 and args.second == '--'

which removes this item from the list before we pass it to docopt. Also note that this line of code removes the second element from the list, as the first element is always the program name.

Rake task with named arguments

Once you have the docopt gem installed and the RakeTaskArguments class available in your project, you can specify the following demo rake task:


namespace :stats do
  desc "Compute pageload statistics for a list of sites and a "\
       "given window of time"
  task :pageloads, [:params] => :environment do |t, args|
    # Parse the arguments, either from ARGV in case of direct
    # invocation or from args[:params] in case the task was
    # called from another rake task
    parameters = RakeTaskArguments.parse_arguments(t.name,
      "--sites=<sites> [ (--start-date=<start_date> --end-date=<end_date>) | "\
      "(--days-ago=<days_ago> [--days-ago-end=<days_ago_end>]) ]",
      args[:params].nil? ? ARGV : args[:params])
    # Get the list of sites
    sites = parameters["sites"]
    # Validate and process the start and end date input
    start_date, end_date = RakeTaskArguments.get_dates_start_end(
      parameters["start-date"], parameters["end-date"],
      parameters["days-ago"], 0, parameters["days-ago-end"])
    # For each of the sites
    sites.split(',').each do |site|
      # Pretend to do something meaningful
      puts "Computing pageload stats for site='#{site}' "\
           "for dates between #{start_date} and #{end_date}"
    end # End site loop
  end
end

This basic rake task follows a really simple and straightforward template. However, first we need to understand how this task gets its input arguments. As briefly mentioned before, this task will receive the input in Ruby’s ARGV variable. However, when a rake task calls another rake task, this variable might not contain the correct information. Therefore, we enable parameter passing into the task by defining the following header:

task :pageloads, [:params] => :environment do |t, args|

This way, IF this task was called from another task, args will contain a field called :params containing the arguments that the parent task passed along to this task. A detailed example of that follows later on. This matters because we decide at runtime what input to pass to the argument validation. So, to pass the input for validation, we just call

parameters = RakeTaskArguments.parse_arguments(t.name, "--sites=<sites> "\
"[(--start-date=<start_date> --end-date=<end_date>) | (--days-ago=<days_ago> [--days-ago-end=<days_ago_end>])]",
args[:params].nil? ? ARGV : args[:params])

This command passes the task_name (via t.name), the argument specification and the input (either via ARGV or args[:params]) for validation to docopt. At this point, you are guaranteed that the parameters return value contains everything according to the schema you specified, or your code has already errored out at this point.

If you then want to access some of the variables, you can simply use

sites = parameters["sites"]
start_date, end_date = RakeTaskArguments.get_dates_start_end(
parameters["start-date"], parameters["end-date"],
parameters["days-ago"], 0, parameters["days-ago-end"])

This last line sets up a start_date and end_date based on some validation and/or defaults we specified in a method that is not covered in this article. The code is available on GitHub, though.
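Assuming the task above is defined in your project, you could then invoke it directly from the command line, in the same style as the stats:multi example from the introduction (the values here are made up):

$ bundle exec rake stats:pageloads -- --sites=foo.com,bar.com --days-ago=7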

Rake task calling other rake tasks

Finally, we cover the case where a meta-task is actually invoking other tasks (in case you want to group certain computations). As mentioned above, this has an impact on how the arguments get passed into the task. Let’s consider the following meta-task:


desc "Run multiple aggregate computations for a given "\
list of sites and a given window of time"
task :multi, [:params] => :environment do |t, args|
# Parse the arguments, either from ARGV in case of
# direct invocation or from args[:params] in the case
# it was called from other rake_tasks
parameters = RakeTaskArguments.parse_arguments(t.name,
"--sites= [--aggregates=] [ (--start-date= --end-date=) "\
"| (--days-ago= [--days-ago-end=]) ]",
args[:params].nil? ? ARGV : args[:params])
# Get the list of sites
sites = parameters["sites"]
# Just for demo purposes, you would normally fetch
# this elsewhere
available_aggregates = ["pageloads", "clicks"]
# Fetch the list of
aggregates = parameters["aggregates"].nil? ?
available_aggregates.join(",") :
RakeTaskArguments.validate_values(
parameters["aggregates"], available_aggregates)
# Validate and process the start and end date input
start_date, end_date = RakeTaskArguments.get_dates_start_end(
parameters["start-date"], parameters["end-date"],
parameters["days-ago"], 0, parameters["days-ago-end"])
# For each of the sites
sites.split(',').each do |site|
# For each of the tables
aggregates.split(',').each do |aggregate|
# Prepare an array with values to pass to the sub rake-tasks
parameters = []
parameters.push("stats:#{aggregate}")
parameters.push("--sites=#{site}") # just one single site
parameters.push("--start-date=#{start_date}")
parameters.push("--end-date=#{end_date}")
self.execute_rake("stats", aggregate, parameters)
end
end # End site loop
end

The template used in this task is very similar to that of a simple rake task. The main difference is that we added a list of aggregates that you can specify on the input, which is validated against certain allowed values (again, outside the scope of this article). The :multi task then calls the appropriate tasks with the given parameters.

What’s new here is the way the meta-task calls the other tasks:


# Prepare an array with values to pass to the sub rake-tasks
parameters = []
parameters.push("stats:#{aggregate}")
parameters.push("--sites=#{site}") # just one single site
parameters.push("--start-date=#{start_date}")
parameters.push("--end-date=#{end_date}")
self.execute_rake("stats", aggregate, parameters)

Basically, we construct a list of arguments that emulates the input as if it had been provided on the command line. We then call the other rake task using the following helper function:


# Helper method for invoking and re-enabling rake tasks
def self.execute_rake(namespace, task_name, parameters)
  # Invoke the actual rake task with the given arguments
  Rake::Task["#{namespace}:#{task_name}"].invoke(parameters)
  # Re-enable the rake task in case it is invoked again with
  # different parameters (e.g. in a loop)
  Rake::Task["#{namespace}:#{task_name}"].reenable
end

As a rake task is generally intended to be run only once, invoking it again would normally have no effect. But since we launch the same task with different parameters, we re-enable it after each invocation. This helper function shields us from these technicalities: we just call it with the namespace, the task name and the parameters. When Ruby calls .invoke(parameters) on a rake task, these parameters end up in the args[:params] field we discussed before.

Conclusion

So, that concludes our extensive article on how to add a lot of flexibility to the arguments you provide to rake. In the end, we covered:

  • How you can easily add docopt to a Rails project
  • What docopt argument specification strings look like and how they work
  • How you could write a wrapper class that encapsulates all that functionality
  • How you can plug this into simple rake tasks
  • How you can run meta rake tasks that call other tasks while keeping the flexibility of your input arguments

The full code and working examples of this article are available here on GitHub.

We hope this article helps you to get something like this set up for your own stacks as well, and that it increases your productivity. If you have any comments, questions or suggestions, feel free to let us know!

References

  • The code for this article can be found on GitHub
  • docopt.rb documentation
  • rake documentation
  • Brief mention of the double dash issue with Rake