Behind-the-Scenes: Real-time segments with Blueshift

(Here is a behind the scenes look at the segmentation engine that powers Programmatic CRM.)

Real-time segmentation matters: Customers expect messages based on their most recent activity. Customers do not want reminders for products they may have already purchased or messages based on transient past behaviors that are no longer relevant.

However, real-time segmentation is hard: it requires processing large amounts of behavioral data quickly. This requires a technology stack that can:

  • Process event & user attributes immediately, as they occur on your website or mobile apps
  • Track 360-degree customer profiles and deal with data fragmentation challenges
  • Scale underlying data stores to process billions of customer actions and support high write and read throughput
  • Avoid time-consuming data modeling steps that require human curation and slow down on-boarding

Marketers use Blueshift to reach each customer as a segment-of-one, delivering highly personalized messages across every marketing channel with Blueshift’s Programmatic CRM capabilities. Unlike previous-generation CRM platforms, segments in Blueshift are always fresh and updated in real-time, enabling marketers to respond to the perpetually connected customer in a timely manner. Marketers use the intuitive segmentation builder to define their own custom segments by mixing and matching filters across numerous dimensions, including event behavioral data, demographic attributes, predictive scores, lifetime aggregates, catalog interactions, CRM attributes and channel engagement metrics.

Segments support complex filter conditions across numerous dimensions

Behind the scenes, Blueshift builds a continually changing graph of users and items in the catalog. The edges in the graph come from users’ behavior (or implied behavior); we call this the “interaction graph”. The interaction graph is further enriched by machine-learning models that add predicted edges and scores to the graph (if you liked item X, you may also like item Y) and expand user attributes through third-party data sources (for example, given the first name “John”, we can infer with reasonable confidence that the gender is male).

Blueshift interaction graph

The segment service can run complex queries against the “interaction graph” like: “Female users that viewed ‘Handbags’ over $500 in last 90 days, with lifetime purchases over $1,000 and not using mobile apps recently and having a high churn probability” and return those users within a few seconds to a couple of minutes.

360-degree user profiles

For every user on your site/mobile app, Blueshift creates a user profile that tracks anonymous user behavior and merges it with their logged-in activities across devices. These rich user profiles combine CRM data, aggregate lifetime statistics, catalog-related activity, predictive attributes, campaign & channel activity and website / mobile app activity. The unified user profiles form the basis for segmentation. A segment query matches these 360 degree user profiles against the segment definition to identify the target set of users.
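To make the merging step concrete, here is a minimal, hypothetical sketch of profile unification in Ruby: an anonymous event trail keyed by a cookie is folded into the logged-in user’s profile once the two are linked. The class and method names are illustrative only, not Blueshift’s actual API.

```ruby
# Illustrative sketch of profile unification. Once an anonymous cookie is
# linked to a known user, its event trail is merged into that user's
# 360-degree profile. All names here are hypothetical.
class ProfileStore
  def initialize
    @profiles = Hash.new { |h, k| h[k] = { events: [], attributes: {} } }
  end

  def track(id, event)
    @profiles[id][:events] << event
  end

  def set_attributes(id, attrs)
    @profiles[id][:attributes].merge!(attrs)
  end

  # Merge the anonymous profile into the identified one; on conflicts,
  # the logged-in profile's attributes win.
  def identify(anonymous_id, user_id)
    anon = @profiles.delete(anonymous_id)
    return unless anon
    @profiles[user_id][:events].concat(anon[:events])
    @profiles[user_id][:attributes] =
      anon[:attributes].merge(@profiles[user_id][:attributes])
  end

  def profile(id)
    @profiles[id]
  end
end
```

A segment query would then run against the unified profile, so events tracked under the anonymous cookie count toward the same user.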

360-degree user profiles in Blueshift

Multiple data stores (no one store to rule them all)

The segmentation engine is powered by several different data stores. A given user action or attribute that hits the event API is replicated across these data stores including: timeseries stores for events, relational database for metadata, in-memory stores for aggregated data & counters, key-value stores for user lookups, as well as a reverse index to search across any event or user attributes quickly. The segmentation engine is tuned for fast retrieval of complex segment definitions compared to a general purpose SQL-style database where joins across tables could take hours to return results. The segmentation engine leverages data across all these data stores to pull the right set of target users that match the segment definition.
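As an illustration of this fan-out, the sketch below replicates one incoming event into several toy “stores” (plain Ruby objects standing in for the timeseries, counter and key-value stores; this is purely conceptual and does not reflect Blueshift’s real storage layer).

```ruby
# Conceptual sketch: a single event write fans out to several
# specialized stores, each optimized for a different access pattern.
class EventFanout
  attr_reader :timeseries, :counters, :latest_by_user

  def initialize
    @timeseries     = []            # append-only event log (timeseries store)
    @counters       = Hash.new(0)   # aggregated counts (in-memory counter store)
    @latest_by_user = {}            # last event per user (key-value lookup)
  end

  def write(event)
    @timeseries << event
    @counters[event[:type]] += 1
    @latest_by_user[event[:user]] = event
  end
end
```

Each store then answers the kind of query it is good at: the log feeds behavioral filters, the counters feed lifetime aggregates, and the key-value lookup feeds per-user retrieval.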

Real-time event processing

Website & mobile apps send data to Blueshift’s event APIs via SDKs and tag managers. The events are received by API end-points and written to in-memory queues. The event queues are processed continuously in-order, and updates are made across multiple data stores (as described above). The user profiles and event attributes are updated continuously with respect to the incoming event stream. Campaigns pull the audience data just-in-time for messaging, which results in segments that are continuously updated and always fresh. Marketers do not have to worry about out-of-date segments, and avoid the “list pull hell” of data-warehouse-style segmentation.

Dynamic attribute binding

The segmentation engine further simplifies onboarding user or event attributes by removing the need to model (or declare) attribute types ahead of time. The segmentation engine dynamically assesses the type of each new attribute based on sample usage in real-time. For instance, an attribute called “loyalty_points” with a value of “450” is interpreted as a number (and shows related numeric operators for segmentation), an attribute like “membership_level” with a value of “gold” is interpreted as a string (and shows related string comparison operators), and an attribute like “redemption_at” with a value like “2016-09-23” is interpreted as a timestamp (and shows relative time operators).
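The inference step can be sketched in a few lines of Ruby. The heuristics below are illustrative only, not Blueshift’s actual rules: try number first, then an ISO-8601 date, and fall back to string.

```ruby
require 'date'

# Illustrative sketch of dynamic attribute binding: infer an attribute's
# type from a sample value, so the UI can offer numeric, string or
# relative-time operators accordingly.
def infer_type(value)
  # Numeric if the whole value looks like an integer or decimal
  return :number if value.to_s =~ /\A-?\d+(\.\d+)?\z/
  # Timestamp if it parses as an ISO-8601 date
  begin
    Date.iso8601(value.to_s)
    return :timestamp
  rescue ArgumentError
  end
  # Everything else is treated as a plain string
  :string
end
```

With this in place, `"450"` binds as a number, `"gold"` as a string, and `"2016-09-23"` as a timestamp, matching the examples above.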

Several Blueshift customers have thousands of CRM & event attributes, and are able to use these attributes without any data modeling or declaring their data upfront, saving them numerous days of implementing data schemas in SQL-based implementations.

The combination of 360-degree user profiles, real-time event processing, multiple specialized data stores and dynamic attribute binding, empowers marketers to create always fresh and continuously updated segments.

Passing named arguments to Ruby Rake tasks using docopt for data science pipelines


Ever considered using rake for running tasks, but got stuck with the unnatural way that rake tasks pass in their arguments? Or have you seen the fancy argument parsing that docopt and its kin can do for you? This article describes how we integrated docopt with the rake command so we can launch our data science pipelines using commands like this:

$ bundle exec rake stats:multi -- \
--sites=www.example.com,www.example.org \
--aggregates=pageloads,clicks \
--days-ago=7 --days-ago-end=0

This command would, for instance, launch daily aggregate computations for clicks and pageloads for each of the sites, and this for each individual day in the last 7 days.

Not only can you launch your tasks using descriptive arguments, you get automated argument validation on top of it. Suppose we launch the task using the following command:

$ bundle exec rake stats:multi

Then the task would fail with the following help message

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start-date> --end-date=<end-date>) | \
(--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]
stats:multi -- --help

It will display the mandatory and/or optional arguments and the possible combinations (e.g. mutually exclusive arguments). Best of all, to obtain this advanced validation you merely specify a string just like the one you are seeing here: docopt uses your specification of the help message to process everything you wish for your arguments!

The remainder of this post explains how to set this up yourself and how to use it. This guide assumes you have successfully configured your system for using Ruby, Rails and the bundle command; there are guides available on setting up RVM and on getting started with Rails.

Configuring your Rails project to use docopt

docopt is an argument-parsing library available for many different languages. For more details on what it does, have a look at its documentation. We use it as a Ruby gem; you can add it to your project by editing the Gemfile in your project root and adding:

gem 'docopt', '0.5.0'

Then run

$ bundle install

in your project directory. This should be sufficient to make your project capable of using the docopt features.

Anatomy of the argument specification string

First, we should elaborate a bit on how docopt knows what to expect and how to parse and validate your input. To make this work, you present docopt with a string that follows certain rules. As mentioned above, this is also the string that is shown as the help text. More specifically, it expects a string that follows this schema:

Usage: #{program_name} -- #{argument_spec}
#{program_name} -- --help

where program_name is the name of the command being run, the -- (double dash) is required by rake rather than docopt (more on that in a moment), and argument_spec can be anything you want to put there.

Let’s look at the aforementioned example:

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start-date> --end-date=<end-date>) | \
(--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]

Here, the program_name is stats:multi (the actual namespace and task name for our rake task), followed by the solo ‘--’, and the argument_spec is "--sites=<sites> [--aggregates=<aggregates>] [ (--start-date=<start-date> --end-date=<end-date>) | (--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]"

Now, let’s go into details of the argument_spec (split over multiple lines for readability):

--sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<start-date> --end-date=<end-date>) | \
(--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]

Basic rules

docopt considers arguments mandatory unless they are enclosed in brackets [], in which case they are optional. So in this example, our argument --sites is required. It also requires a value, given that it is followed by =<sites>. The <sites> placeholder could be anything; it is used to give the user an idea of what is expected as the value of the argument. If you enter --sites on the input without specifying a value, docopt will return an error that the value is missing. No effort needed on your end!

Optional arguments

The next argument, [--aggregates=<aggregates>], follows the same pattern, except that this one is fully optional. In our code we will handle the case where it is not specified and come up with some default values.

Grouping and mutual exclusion

The last, optional, argument is used to specify the dates we want to run our computation for, and we want three ways of doing so:

  • EITHER by specifically telling the start-date AND end-date
  • OR by specifying the number of days-ago before the time of the command, taking as an end date the date of the command being run (e.g. 7 days ago until now)
  • OR by specifying the number of days-ago until days-ago-end (e.g. to backfill something between 14 days ago and 7 days ago).

Here is where complicated things can be achieved in a simple manner. The format we used here is in fact:

[ ( a AND b ) | ( c [ d ] ) ]

docopt requires all arguments in a group (...) to be present on the input. If only a or only b is given, it will return an error informing us about the missing argument.

Similarly, a logical OR can be added via | (pipe). This will make either of the options a valid input.

Furthermore, you can combine optional arguments within a group, as we did with ( c [ d ] ). This makes the parser aware that d ([--days-ago-end=<days-ago-end>] in the real example above) is only valid when c (--days-ago=<days-ago> in the example) is present. Trying to use this parameter with --start-date instead will result in an error.

Note that this whole complex group is optional and we again will come up with some defaults in our code that handles the parsed arguments.
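To see exactly what the parser enforces for the group [ (a b) | (c [d]) ], here is the same rule written out as plain Ruby checks for the date arguments. This is purely illustrative; docopt derives all of this from the usage string for you.

```ruby
# The validation rule docopt derives from
# [ (--start-date= --end-date=) | (--days-ago= [--days-ago-end=]) ],
# spelled out by hand: either both dates, or days-ago (optionally with
# days-ago-end), or nothing at all.
def valid_date_args?(opts)
  dates    = [opts.key?("start-date"), opts.key?("end-date")]
  days_ago = opts.key?("days-ago")
  days_end = opts.key?("days-ago-end")

  return false if dates.any? && (days_ago || days_end) # groups exclude each other
  return false if dates.one?                           # start-date and end-date go together
  return false if days_end && !days_ago                # days-ago-end requires days-ago
  true                                                 # the whole group is optional
end
```

Passing an empty hash is valid (the whole group is optional), while a lone --start-date or a lone --days-ago-end is rejected, exactly as described above.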


Lastly, it’s noteworthy that flags (i.e. arguments that don’t take a value), such as --force, will result in a true/false value after parsing.

For more information and examples, consult the docopt documentation. The explanation above, however, should already get you a long way.


Now that you understand how the argument string defines the way your input will be parsed or rejected, we can write a class that wraps all of this functionality together, so that we only need to specify this string and get a map with the parsed values in return.

To this end, we wrote the RakeTaskArguments.parse_arguments method:

class RakeTaskArguments
  def self.parse_arguments(task_name, argument_spec, args)
    # Set up the docopt string that will be used to parse the input
    doc = <<DOCOPT
Usage: #{task_name} -- #{argument_spec}
       #{task_name} -- --help

DOCOPT
    # Prepare the return value
    arguments = {}
    # Because the new version of rake passes the -- along in the args
    # variable, we need to filter it out if it's present
    args.delete_at(1) if args.length >= 2 and args.second == '--'
    begin
      # Have docopt parse the provided args (via :argv) against the doc spec
      Docopt::docopt(doc, {:argv => args}).each do |key, value|
        # Store key/value, converting '--key' into 'key' for accessibility.
        # Per the docopt spec, the key '--' contains the actual task name
        # as a value, so we label it accordingly
        arguments[key == "--" ? 'task_name' : key.gsub('--', '')] = value
      end
    rescue Docopt::Exit => e
      # Invalid or missing arguments: print the usage string and stop
      abort e.message
    end
    return arguments
  end
end

The method takes 3 arguments:

  • the rake task_name that we want to execute
  • the argument_spec we discussed before
  • the args, the actual input that was provided when launching the task and that should be validated

Parsing and validation happens magically by

Docopt::docopt(doc, {:argv => args})

which returns a map with the keys and values for our input arguments. We iterate over the key-value pairs and strip the leading -- (double dash) from the keys so we can access them in the resulting map by name later on (e.g. ...['sites'] instead of ...['--sites']), which is just more practical to deal with.
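The key-stripping step on its own looks like this, as a small standalone sketch of the gsub shown above:

```ruby
# Standalone version of the key normalization in RakeTaskArguments:
# '--sites' becomes 'sites', and docopt's '--' entry becomes 'task_name'.
def normalize_keys(parsed)
  parsed.each_with_object({}) do |(key, value), out|
    out[key == "--" ? "task_name" : key.gsub("--", "")] = value
  end
end
```

For example, `normalize_keys({ "--sites" => "a,b", "--" => "stats:multi" })` yields `{ "sites" => "a,b", "task_name" => "stats:multi" }`.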

The solo -- (double dash) that keeps coming back

We keep seeing this solo -- floating around in the strings, as in stats:multi -- --sites=<sites>. As was pointed out on StackOverflow, it is needed to make the rake command itself stop parsing arguments. Without adding this -- immediately after the rake task you want to execute, rake would consider the subsequent arguments to be related to rake itself. Therefore, we also have it in our docopt spec

#{task_name} -- #{argument_spec}

so that the spec matches the input, -- included. It is inconvenient, but once you get used to it, this approach has far more benefits than drawbacks.

WARNING: It seems that in rake version 10.3.x this -- was not passed along in the ARGV list, but newer versions (10.4.x) DO pass it along. Therefore we added the following code:

args.delete_at(1) if args.length >= 2 and args.second == '--'

which removes this item from the list before we pass it to docopt. Note that this line removes the second element of the list, as the first element is always the program name.
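The same guard can be written in plain Ruby without ActiveSupport’s .second, which makes the intent explicit:

```ruby
# Version-proof guard: strip the solo '--' that newer rake versions pass
# through in ARGV. args[0] is the program name, so the separator, when
# present, sits at index 1.
def strip_rake_separator(args)
  args.delete_at(1) if args.length >= 2 && args[1] == "--"
  args
end
```

The function is a no-op when the separator is absent, so it is safe to call regardless of the rake version in use.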

Rake task with named arguments

Once you have the docopt gem installed and the RakeTaskArguments.rb class available in your project, we can specify the following demo rake task:

namespace :stats do
  desc "Compute pageload statistics for a list of sites and a "\
       "given window of time"
  task :pageloads, [:params] => :environment do |t, args|
    # Parse the arguments, either from ARGV in case of direct
    # invocation or from args[:params] in the case it was
    # called from other rake tasks
    parameters = RakeTaskArguments.parse_arguments(t.name,
        "--sites=<sites> [ (--start-date=<start-date> --end-date=<end-date>) | "\
        "(--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]",
        args[:params].nil? ? ARGV : args[:params])
    # Get the list of sites
    sites = parameters["sites"]
    # Validate and process the start and end date input
    start_date, end_date = RakeTaskArguments.get_dates_start_end(
        parameters["start-date"], parameters["end-date"],
        parameters["days-ago"], 0, parameters["days-ago-end"])
    # For each of the sites
    sites.split(',').each do |site|
      # Pretend to do something meaningful
      puts "Computing pageload stats for site='#{site}' "\
           "for dates between #{start_date} and #{end_date}"
    end # End site loop
  end
end

This basic rake task follows a really simple and straightforward template. First, however, we need to understand how this task gets its input arguments. As briefly mentioned before, the task receives its input in Ruby's ARGV variable. However, when a rake task calls another rake task, this variable might not contain the correct information. Therefore we enable parameter passing into the task by defining the following header:

task :pageloads, [:params] => :environment do |t, args|

This way, IF this task was called from another task, args will contain a field called :params holding the arguments that the parent task passed along. A detailed example of that follows later on. This matters because we decide at runtime what input to pass to the argument validation. So, to pass the input for validation, we just call

parameters = RakeTaskArguments.parse_arguments(t.name,
    "--sites=<sites> [(--start-date=<start-date> --end-date=<end-date>) | "\
    "(--days-ago=<days-ago> [--days-ago-end=<days-ago-end>])]",
    args[:params].nil? ? ARGV : args[:params])

This command passes the task_name (via t.name), the argument specification and the input (either ARGV or args[:params]) to docopt for validation. At this point, you are guaranteed that the parameters return value contains everything according to the schema you specified, or your code has already errored out.

If you then want to access some of the variables, you can simply use

sites = parameters["sites"]
start_date, end_date = RakeTaskArguments.get_dates_start_end(
parameters["start-date"], parameters["end-date"],
parameters["days-ago"], 0, parameters["days-ago-end"])

This last call sets up a start date and end date based on some validation and/or defaults we specified in a method that is not covered in this article. The code is available on GitHub, though.

Rake task calling other rake tasks

Finally, we cover the case where a meta-task is actually invoking other tasks (in case you want to group certain computations). As mentioned above, this has an impact on how the arguments get passed into the task. Let’s consider the following meta-task:

desc "Run multiple aggregate computations for a given "\
     "list of sites and a given window of time"
task :multi, [:params] => :environment do |t, args|
  # Parse the arguments, either from ARGV in case of
  # direct invocation or from args[:params] in the case
  # it was called from other rake tasks
  parameters = RakeTaskArguments.parse_arguments(t.name,
      "--sites=<sites> [--aggregates=<aggregates>] "\
      "[ (--start-date=<start-date> --end-date=<end-date>) "\
      "| (--days-ago=<days-ago> [--days-ago-end=<days-ago-end>]) ]",
      args[:params].nil? ? ARGV : args[:params])
  # Get the list of sites
  sites = parameters["sites"]
  # Just for demo purposes, you would normally fetch
  # this elsewhere
  available_aggregates = ["pageloads", "clicks"]
  # Fetch the list of aggregates, defaulting to all available ones;
  # the validation helper is not covered here (see the GitHub repo)
  aggregates = parameters["aggregates"].nil? ?
      available_aggregates.join(",") :
      RakeTaskArguments.validate_aggregates(
          parameters["aggregates"], available_aggregates)
  # Validate and process the start and end date input
  start_date, end_date = RakeTaskArguments.get_dates_start_end(
      parameters["start-date"], parameters["end-date"],
      parameters["days-ago"], 0, parameters["days-ago-end"])
  # For each of the sites
  sites.split(',').each do |site|
    # For each of the tables
    aggregates.split(',').each do |aggregate|
      # Prepare an array with values to pass to the sub rake-tasks
      parameters = []
      parameters.push("--sites=#{site}") # just one single site
      self.execute_rake("stats", aggregate, parameters)
    end # End aggregate loop
  end # End site loop
end

The template used in this task is very similar to a simple rake task. The main difference is that we added a list of aggregates you can specify on the input, which is validated against certain allowed values (again, out of the scope of this article). The :multi task then calls the appropriate tasks with the given parameters.

What’s new here is the way the meta-task calls the other tasks:

# Prepare an array with values to pass to the sub rake-tasks
parameters = []
parameters.push("--sites=#{site}") # just one single site
self.execute_rake("stats", aggregate, parameters)

Basically, we construct a list of arguments that emulates as if the input to the task was provided on the command line. We then call the other rake task using the following helper function:

# Helper method for invoking and re-enabling rake tasks
def self.execute_rake(namespace, task_name, parameters)
  # Invoke the actual rake task with the given arguments
  Rake::Task["#{namespace}:#{task_name}"].invoke(parameters)
  # Re-enable the rake task in case it is being called again with
  # different parameters (e.g. in a loop)
  Rake::Task["#{namespace}:#{task_name}"].reenable
end
As a rake task is generally intended to be run only once, invoking it again would have no effect. But as we launch the same tasks with different parameters, we re-enable the tasks for execution. This helper function shields us from these technicalities: we can just call the execute function with the namespace, the task name and the parameters. When Ruby calls .invoke(parameters) on a rake task, these parameters end up in the args[:params] field we discussed before.
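The invoke/re-enable behavior can be demonstrated in isolation with a throwaway task; this self-contained sketch uses the rake API directly:

```ruby
require 'rake'

# A throwaway task whose block counts how often it actually runs.
runs = 0
Rake::Task.define_task(:demo) { runs += 1 }

Rake::Task[:demo].invoke    # runs the block
Rake::Task[:demo].invoke    # no effect: the task has already run
Rake::Task[:demo].reenable  # mark the task as runnable again
Rake::Task[:demo].invoke    # runs the block a second time
```

After this sequence the block has run exactly twice: once for the first invoke, and once more only because reenable was called in between.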


So, that concludes our extensive article on how to add a lot of flexibility to the arguments you provide to rake. In the end, we covered:

  • How you can easily add docopt to a Rails project
  • What docopt argument specification strings look like and how they work
  • How you could write a wrapper class that encapsulates all that functionality
  • How you can plug this into simple rake tasks
  • How you can run meta rake tasks that call other tasks while keeping the flexibility of your input arguments

The full code and working examples of this article are available here on GitHub.

We hope this article helps you to get something like this set up for your own stacks as well, and that it increases your productivity. If you have any comments, questions or suggestions, feel free to let us know!


  • The code for this article can be found on GitHub
  • docopt.rb documentation
  • rake documentation
  • Brief mention of the double dash issue with Rake

How do you know if your model is going to work? Part 4: Cross-validation techniques

Concluding our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four part series on basic model testing.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.

Previously we worked on:

  • Part 1: Defining the scoring problem
  • Part 2: In-training set measures
  • Part 3: Out of sample procedures

Cross-validation techniques

Cross validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test and re-performing model fit and model evaluation.

For example, the variation called k-fold cross-validation splits the original data into k roughly equally sized sets. To score each set, we build a model on all data not in the set and then apply the model to that set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).

Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.
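The fold bookkeeping behind this is simple to sketch (illustrative Ruby; a real project would lean on a statistics library):

```ruby
# Assign n row indices to k folds; the model scored on a fold is trained
# on the remaining k-1 folds, i.e. on a 1 - 1/k fraction of the data.
def kfold_indices(n, k)
  (0...n).group_by { |i| i % k }.values
end

folds = kfold_indices(100, 20)                        # 20 folds of 5 rows each
training_fraction = 1.0 - folds.first.length / 100.0  # 0.95 when k = 20
```

Each index lands in exactly one fold, and with k = 20 each of the 20 models is trained on 95 of the 100 rows, matching the 1 - 1/k fraction above.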

Another variation called “leave one out” (which is essentially jackknife resampling) is very statistically efficient, as each datum is scored on a unique model built using all other data. However, it is very computationally inefficient, as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross validation techniques are measuring facts about the fitting procedure and not about the actual model in hand (so they are answering a different question than test/train split).

There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedures tend to produce good models) and employees are essentially Bayesian (they want to know the actual model they are turning in is likely good); see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation.

Read more.


How do you know if your model is going to work? Part 3: Out of sample procedures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?” we develop out of sample procedures.

Previously we worked on:

  • Part 1: Defining the scoring problem
  • Part 2: In-training set measures

Out of sample procedures

Let’s try working “out of sample” or with data not seen during training or construction of our model. The attraction of these procedures is they represent a principled attempt at simulating the arrival of new data in the future.

Hold-out tests

Hold-out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but that should not cause too much data leakage in practice. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum and Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).
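A minimal version of such a split can be sketched as follows (illustrative Ruby; the fixed seed keeps the split reproducible between runs):

```ruby
# Reserve a fixed fraction of the rows for testing; everything else is
# available for model construction and calibration.
def train_test_split(rows, test_fraction: 0.1, seed: 42)
  shuffled = rows.shuffle(random: Random.new(seed))
  n_test   = (rows.length * test_fraction).round
  [shuffled.drop(n_test), shuffled.take(n_test)]  # [train, test]
end
```

The shuffle matters: taking the first rows of an ordered file as training data can smuggle time or grouping structure into the split.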

The results of a test/train split produce graphs like the following:



The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice that on the test graphs, random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case, but instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate? Let’s get some estimates of the likely distribution of future model behavior.

Read more.


How do you know if your model is going to work? Part 2: In-training set measures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

Previously we worked on:

  • Part 1: Defining the scoring problem

In-training set measures

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to assess score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.



What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).

There are two more Gedankenexperiment models that any data scientist should always have in mind:

  1. The null model (on the graph as “null model”). This is the performance of the best constant model (a model that returns the same answer for all datums). In this case it is a model that scores each and every row as having an identical 7% chance of churning. This is an important model that you want to do better than. It is also a model you are often competing against as a data scientist, as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace). The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5), and packages like logistic regression routinely report this statistic.
  2. The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to outperform, as it represents the “maybe one of the columns is already the answer” case (if so, that would be very good for the business, as they could get good predictions without modeling infrastructure). The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model, you have not outperformed what an analyst can find with a single pivot table.

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Read more.


How do you know if your model is going to work? Part 1: The problem

This month we have a guest post series from our dear friend and advisor, John Mount, on building reliable predictive models. We are honored to share his hard won learnings with the world.

Authors: John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC.

“Essentially, all models are wrong, but some are useful.”
George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Bartolomeu Velho 1568
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)


Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner, meaning we are going to consider scoring-system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to be finished over the next 3 Tuesdays). This part (part 1) sets up the specific problem.

Read more.

Connecting Hive and Spark on AWS in five easy steps

Hive and Spark are great tools for storing, processing, and mining big data. They are usually deployed individually in many organizations. While they are useful on their own, the combination of the two is even more powerful. Here is the missing HOWTO on connecting them and turbocharging your big data adventures.

Step 1 : Set up Spark

If you have not set up Spark 1.X on AWS yet, see this blog post for a how-to. Or you can download and compile a standalone version directly from the main Spark site. Don’t forget to compile Spark with the -Phive option. On a standalone machine the following command would compile them together.

$ cd $SPARK_HOME; ./sbt/sbt -Phive assembly

Step 2 : You presumably have a Hive cluster set up using the EMR, Cloudera or Hortonworks distributions. Copy hive-site.xml from your Hive cluster to your $SPARK_HOME/conf/ directory. Edit the XML file and add these properties:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://XXXXXXXX:3306/metastore</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>XXXXXXXX</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>XXXXXXXX</value>
  <description>password to use against metastore database</description>
</property>

Step 3 : Download the MySQL JDBC connector and add it to the Spark CLASSPATH. The easiest way is to edit the bin/ script and add this line
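The exact line did not survive above; as a hypothetical sketch, one common approach in Spark 1.x was to append the connector jar to SPARK_CLASSPATH (the jar path and version below are placeholders, not a prescription):

```shell
# Hypothetical sketch: expose the MySQL JDBC connector to Spark by appending
# its jar to SPARK_CLASSPATH (adjust the path and version to your download).
MYSQL_JAR="$HOME/lib/mysql-connector-java-5.1.34-bin.jar"
export SPARK_CLASSPATH="${SPARK_CLASSPATH:+$SPARK_CLASSPATH:}$MYSQL_JAR"
echo "$SPARK_CLASSPATH"
```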


Step 4 : Check AWS security groups to make sure Spark instances and Hive Cluster can talk to each other.

Step 5 : Run Spark Shell to check if you are able to see Hive databases and tables.

$ cd $SPARK_HOME; ./bin/spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
scala> sqlContext.sql("show databases").collect().foreach(println);

Voila, you have connected two of the most powerful tools for data analysis and mining. Let us know about your adventures with Spark, and if you are interested in going on an adventure with us, we are hiring.

Running Apache Spark on AWS

Apache Spark is being adopted at a rapid pace by organizations big and small to speed up and simplify big data mining and analytics architectures. First invented by researchers at AMPLab at UC Berkeley, the Spark codebase is worked on by hundreds of open source contributors, and development is happening at breakneck speed. Keeping up with the latest stable releases has not been easy for organizations set up on AWS leveraging its vast infrastructure. In this post we wanted to share how we got it going; we hope you find setting up Spark 1.0 on AWS a breeze after this. Here it is in 5 simple steps.

Step 1) Log in to your favorite EC2 instance and install the latest AWS CLI (Command Line Interface) using pip, if you don’t have it yet

$ sudo pip install awscli

Step 2) Configure the AWS CLI and set up the secret key and access key of your AWS account

$ aws configure

Step 3) The latest AWS CLI comes with many options to configure and bring up Elastic MapReduce (EMR) clusters with Hive, Pig, Mahout, and Cascading pre-installed. Read this article by the AWS team on setting up an older version, Spark 0.8.1, to get an understanding of what’s involved. We will need to replace the bootstrap script and use the latest Amazon AMI to be able to install Spark 1.0. You might also want to restrict cluster access to your VPC and add your SSH keys. Combining all that, the install command would look like the one below. Feel free to read up on the options of the EMR CLI invocation and edit the command to fit your needs.

$ aws emr create-cluster --no-auto-terminate --ec2-attributes KeyName=key-name,SubnetId=subnet-XXXXXXX --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name="Spark/Shark" --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --log-uri s3://bucket-name/log-directory-path --name "Spark 1.0 Test Cluster" --use-default-roles

Step 4) The setup takes anywhere from 10 to 20 minutes depending upon the installed packages. You will see the cluster’s job flow id in the console output. You can log in to the master node of the Spark cluster using the following CLI command

$ aws emr ssh --cluster-id value --key-pair-file filename

or SSH in to the master node directly

$ ssh -i key-pair-file hadoop@master-hostname

Step 5) After you log in, you will see soft links in the /home/hadoop directory to many of the packages you need. Try

$ ./hive

or

$ ./spark/bin/spark-shell

You are all set. To terminate the cluster, use the following command. Let us know in the comments if you found this information helpful, and share your experiences.

$ aws emr terminate-clusters --cluster-ids

Setting up Hadoop 2.4 and Pig 0.12 on OSX locally

This is the first of many blog posts to come from our dev bootcamp. Oftentimes you want to test your scripts and run code locally before you hit the push button. As we run around the web figuring out how to do things ourselves, we want to share findings that we think will be helpful to the wider world.

Setting up Hadoop, Hive and Pig can be a hassle on your MacBook Pro. Here are the steps that worked for us.

1) First install Homebrew (brew), the easiest and safest way to install and manage many kinds of packages

2) Next make sure Java 1.7 or later is installed – for OSX 10.9+ you can download it from here

3) Set JAVA_HOME, and add it to your bashrc for future use

$ export JAVA_HOME=$(/usr/libexec/java_home)

4) Install Hadoop with brew; as of this writing it will download and install 2.4.1

$ brew install hadoop

5) To make Hadoop work as a single-node cluster you have to go through several steps, outlined here. Here they are in brief:

a) Set up ssh to connect to localhost without a login prompt

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

b) Test being able to log in; if you are not able to, you have to turn on Remote Login in System Preferences -> Sharing

$ ssh localhost

c) brew usually installs Hadoop in /usr/local/Cellar/hadoop/

$ cd /usr/local/Cellar/hadoop/2.4.1

d) Edit the following config files in the directory /usr/local/Cellar/hadoop/2.4.1/libexec/etc/hadoop

$vi hdfs-site.xml


$vi core-site.xml


$vi mapred-site.xml


$vi yarn-site.xml


6) Format and start HDFS and Yarn

$ cd /usr/local/Cellar/hadoop/2.4.1
$ ./bin/hdfs namenode -format
$ ./sbin/start-dfs.sh
$ ./bin/hdfs dfs -mkdir /user
$ ./bin/hdfs dfs -mkdir /user/<username>
$ ./sbin/start-yarn.sh

7) Test the example code that came with this Hadoop version

$ ./bin/hdfs dfs -put libexec/etc/hadoop input
$ ./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep input output 'dfs[a-z.]+'
$ ./bin/hdfs dfs -get output output
$ cat output/*

8) Remove the tmp files

$ ./bin/hdfs dfs -rmr /user/<username>/input
$ ./bin/hdfs dfs -rmr /user/<username>/output
$ rm -rf output/

9) Stop HDFS and Yarn after you are done

$ ./sbin/stop-dfs.sh
$ ./sbin/stop-yarn.sh

10) Add HADOOP_HOME and CONFIG to bashrc for future use

$ export HADOOP_HOME=/usr/local/Cellar/hadoop/2.4.1
$ export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop

11) Install Pig. The current Formula in brew is not compatible with Hadoop 2.4.1 and you will see errors; you can use this one instead (ht akiatoji)

$ brew install ant
$ brew install

12) Add PIG to your bashrc for future use

$ export PIG_HOME=/usr/local/Cellar/pig/0.12.0
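Putting steps 3, 10, and 12 together, a bashrc block might look like the sketch below (the java_home helper is OSX-specific, and the versions match the ones used in this post; adjust to your install):

```shell
# Sketch: all the environment variables from this post collected in one
# ~/.bashrc block. /usr/libexec/java_home exists only on OSX; the Cellar
# paths assume the brew-installed versions from this post.
export JAVA_HOME=$(/usr/libexec/java_home 2>/dev/null)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.4.1
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
export PIG_HOME=/usr/local/Cellar/pig/0.12.0
echo "$HADOOP_CONF_DIR"
```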

Yay, you should be all set!! Enjoy hadooping and take a swig 🙂