Big data is an exciting new field. Entrepreneurs and business people keep talking about it, even though half of them have only a vague idea of what “big data” actually is. That’s okay though. I am not going to focus on them. If you are a Ruby engineer who wants to start playing with machine learning in your favorite language, this article is for you.
Apache Mahout is an open source machine learning library written in Java that allows engineers to work with ridiculous amounts of data to build recommendation engines, classifiers and cluster analysis tools. Once scaled out with a Hadoop cluster, it becomes even more powerful, able to deal with billions of data points in a blink.
I always wondered why there is no good machine learning library for Ruby. Heck, there is not even one gem for Mahout out there either! So, I decided to create one and name it JRuby Mahout. Why JRuby? The answer is simple: it runs on the JVM and allows for easy integration with Java libraries. Basically, by switching to JRuby, you keep all of your regular Ruby goodness and can implement interfaces from Java libraries (and get shafted by deployment, hehe).
This is the first article in the series, where I will talk about the basics of JRuby Mahout and explain how to use it to generate real recommendations and more. By the end of this article we’ll have a very simple recommendation engine. What’s a recommendation engine? It’s a piece of software that provides recommendations based on your previous inputs. Look at Netflix or Amazon: they always try to sell you relevant stuff that you are more likely to buy. They collect inputs (purchases or views) from you and millions of other customers, look at what certain customers have in common and make recommendations based on the overlap of customers’ interests. Sounds complicated? Not so complicated when there are good libraries out there. Let’s get started!
Mahout and JRuby
Apache Mahout is a library for machine learning that effectively deals with recommendations, clustering, classification, pattern mining, regression and other related things. Mahout is the core of the JRuby Mahout gem that I am going to describe in this article.
Installing Mahout for basic purposes is easy: simply download Mahout 0.7 from one of the official mirrors, unzip the archive and set up the environment variable MAHOUT_DIR so that it points at your Mahout installation. I just added the corresponding export line to my shell profile.
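Assuming the archive was unpacked into the home directory (the path is an example; use your own location), the line looks like this:

```shell
# Point MAHOUT_DIR at the unpacked Mahout 0.7 distribution
export MAHOUT_DIR="$HOME/mahout-distribution-0.7"
```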
Mahout is a Java library, so you’ll have to install a JVM and JRuby on your machine. For JRuby installation instructions, check out the official website. I prefer to use rbenv for switching my Ruby version. Since most of my projects run on the MRI version of Ruby, I have it set as the global default. For projects that require JRuby (like this one), I simply pin the version locally in the project directory.
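With rbenv, that is a single command (the JRuby version below is an example; run `rbenv versions` to see what you have installed):

```shell
# Pin this project to JRuby; writes a .ruby-version file
rbenv local jruby-1.7.4
```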
Clean and simple. After you have JRuby and Mahout installed, it’s time to set up the project. Create a Gemfile in your project directory, hook up the gem and run bundle. That’s it. We are finally ready to dive into the world of recommendations!
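A minimal Gemfile for this setup might look like the following (the gem name matches the one on RubyGems; version pinning is up to you):

```ruby
# Gemfile
source "https://rubygems.org"

gem "jruby_mahout"
```

Then run `bundle install` under JRuby to pull the gem in.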
The Mahout recommender can provide all kinds of recommendations based on three basic notions: the user, the item and the preference. Before I move on with recommendations, let’s talk about the actual data that is used for mining recommendations. In its simplest form (the one we are talking about in this article), data is a collection of user IDs, item IDs and preference values. This collection can get enormous if you have a lot of activity going on in your system. There could be millions or billions of records. Once you get past a million records, it becomes difficult to serve recommendations in real time, unless you pre-cache them or set up a cluster of computers capable of distributing the computations.
IDs in Mahout are always integers that point at real users or items in your system. Preference values can be any integer, as long as a larger number represents stronger preference. Many things can be interpreted as preferences. For example, a page view can be represented as a preference with value “1”; thumbs up/down mechanism could be represented with preferences “0” for down and “1” for up; and an explicit five star rating system can have preference values of “1”, “2”, “3”, “4” and “5”. A lot of data that you are going to collect will be redundant or bad. For example, you might not be interested in movie ratings from a user who only rated two movies during their time on your site. The reason is that they won’t have much meaningful overlap with other users’ ratings: giving five stars to “Sex and the City” and “Die Hard”, for example, won’t be sufficient to make a conclusion that these two movies are similar and that you should recommend “Die Hard” to anyone who watched “Sex and the City”. Cleaning up data is beyond the scope of this article. We will talk about it in later parts of the series.
Now that you have a basic understanding of how data for recommenders is structured, let’s set up our first recommender. With JRuby Mahout it’s super easy.
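Here is a sketch of the initialization, modeled on the gem’s README (treat the exact constructor signature as an assumption and verify it against the version you install):

```ruby
require "jruby_mahout"

# Arguments: similarity metric, user neighborhood,
# recommender algorithm, and whether similarity is weighted
recommender = JrubyMahout::Recommender.new(
  "PearsonCorrelationSimilarity",  # similarity metric
  5,                               # nearest-5 user neighborhood
  "GenericUserBasedRecommender",   # recommender algorithm
  false                            # unweighted similarity
)
```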
What is going on here? The recommender class takes four arguments: similarity metric, user neighborhood, recommender algorithm and whether similarity is weighted or not. Let’s see what each parameter is responsible for.
Similarity metric defines similarities between users or between rated items. Mahout supports different mathematical models for measuring similarity that can be split into two major categories: user similarities (they define similarity between two users) and item similarities (they define similarity between two items). There is no single similarity metric that will work for all datasets, so you’ll have to experiment on your own with the data that you have. JRuby Mahout supports all major Mahout similarity metrics that are not experimental and have proved to work in production environments, including PearsonCorrelationSimilarity, EuclideanDistanceSimilarity, SpearmanCorrelationSimilarity and TanimotoCoefficientSimilarity. You can check the details for all implementations of user-based and item-based similarities in the Mahout docs.
User neighborhood defines the neighborhood of similar users that is used when computing recommendations. There are two types of user neighborhoods supported by Mahout: the nearest N user neighborhood (NearestNUserNeighborhood) and the threshold-based neighborhood (ThresholdUserNeighborhood). The former defines a constant number of the most similar users that are used to derive recommendations based on their similarities. The latter defines a “radius” within which all similar users are included. In a nutshell, the difference between the two methods is how users are included when measuring similarities: the nearest N neighborhood includes the N closest users, while the threshold-based neighborhood includes all users within a certain radius. In JRuby Mahout you don’t need to say explicitly which neighborhood method you want to use; it is selected automatically based on the parameter. Integers greater than 1 activate the nearest N neighborhood, and floats between 0.0 and 1.0 activate the threshold user neighborhood.
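In code, that selection looks like this (a sketch; the constructor mirrors the one shown earlier, and the values are examples):

```ruby
require "jruby_mahout"

# Integer > 1: nearest-10 user neighborhood
nearest = JrubyMahout::Recommender.new(
  "PearsonCorrelationSimilarity", 10, "GenericUserBasedRecommender", false
)

# Float between 0.0 and 1.0: threshold neighborhood with "radius" 0.8
threshold = JrubyMahout::Recommender.new(
  "PearsonCorrelationSimilarity", 0.8, "GenericUserBasedRecommender", false
)
```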
Recommender algorithm is the mechanism for generating recommendations. There are three basic recommender algorithms supported by JRuby Mahout. The first one is GenericUserBasedRecommender, which checks for similarities between users’ preferences and finds users that have similar preferences. GenericItemBasedRecommender works backwards: it finds items that are similar to other items and returns them as recommendations. Keep in mind that SpearmanCorrelationSimilarity can only be used with GenericUserBasedRecommender, because it doesn’t implement ItemSimilarity, which GenericItemBasedRecommender requires. Another important point: the user-based recommender gets slower with more users, and the item-based recommender gets slower with more items in the dataset. If the system has a more or less fixed set of items (e.g. a store inventory), recommendations generated by the item-based recommender can be cached for a long time, improving the responsiveness of your system. The same applies to the user-based recommender when the set of users is stable.
The third recommender supported by JRuby Mahout is SlopeOneRecommender, which requires neither a similarity metric nor a user neighborhood. This algorithm quickly provides very accurate results on relatively small datasets. Once the dataset starts to grow in size, you run the risk of running out of memory, so use SlopeOneRecommender on one machine only when you have fewer than five million records in your dataset.
Similarity weighting is an extra parameter that can be provided for similarities such as EuclideanDistanceSimilarity. It helps prevent situations where the correlation between two users is undefined due to the structure of the data and some similarity metric intricacies. Depending on your dataset, it can result in better recommendations.
Data model is the structure that contains your data in the format most suitable for Mahout. There are two general types of data models: in-memory and distributed. In this article we are only going to talk about in-memory models that work pretty well for datasets of up to several million entries. Beyond that you are running a risk of running out of memory. For large datasets distributed computations that use Hadoop have to be implemented.
JRuby Mahout’s in-memory data model supports two data sources: a CSV file and a Postgres database. The CSV source is literally a CSV file that you load your data from. After initializing a recommender, specify the data model.
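For the CSV source, this is a one-liner following the gem’s README (the `recommender` object comes from the earlier initialization; the file path is an example):

```ruby
# Load preferences from a CSV file into an in-memory data model
recommender.data_model = JrubyMahout::DataModel.new(
  "file", { :file_path => "data/preferences.csv" }
).data_model
```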
Your data file has to follow a simple three-column format.
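Each row is a user ID, an item ID and a preference value, with no header line; for example:

```
1,101,5
1,102,3
2,101,2
2,103,5
3,102,4
```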
A CSV file is obviously the easiest way to load data, especially if it doesn’t change very often. The other way is to use a real database. JRuby Mahout currently supports only Postgres; I plan to include MySQL support as well soon. When creating a table in Postgres, make sure to include the three required fields (user_id, item_id and preference), make user_id and item_id the primary key, and create indexes on them.
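A sketch of the DDL, modeled after Mahout’s JDBC data model conventions (the table and index names are illustrative):

```sql
CREATE TABLE taste_preferences (
  user_id    BIGINT NOT NULL,
  item_id    BIGINT NOT NULL,
  preference REAL   NOT NULL,
  PRIMARY KEY (user_id, item_id)
);

CREATE INDEX taste_preferences_user_id_index ON taste_preferences (user_id);
CREATE INDEX taste_preferences_item_id_index ON taste_preferences (item_id);
```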
In order to use JRuby Mahout’s Postgres-related goodness, you will have to install the JDBC 4 driver for Postgres 9.0 and up. JRuby Mahout uses the org.postgresql.ds.PGPoolingDataSource interface for certain things during its computations. Setting up the Postgres data model for your recommender takes just one call.
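A sketch of that call, following the gem’s README (connection parameters are placeholders; substitute your own):

```ruby
# Back the recommender with a Postgres table instead of a CSV file
recommender.data_model = JrubyMahout::DataModel.new(
  "postgres",
  { :host       => "localhost",
    :port       => 5432,
    :db_name    => "recommendations",
    :username   => "postgres",
    :password   => "secret",
    :table_name => "taste_preferences" }
).data_model
```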
JRuby Mahout includes a simple manager for Postgres that uses JDBC. You can create and delete tables and records with it. If you want to use it in your project, initialize the manager first.
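Initialization takes the same kind of connection hash as the data model (a sketch; parameters are placeholders):

```ruby
# Manager for creating/deleting tables and records over JDBC
manager = JrubyMahout::PostgresManager.new(
  { :host     => "localhost",
    :port     => 5432,
    :db_name  => "recommendations",
    :username => "postgres",
    :password => "secret" }
)
```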
Then you can call its self-explanatory methods.
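For instance (the method names below follow my recollection of the gem’s README; verify them against the version you install):

```ruby
# Create the preferences table, then tear it down again
manager.create_table("taste_preferences")
manager.delete_table("taste_preferences")
```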
In some cases you might want to implement your own JDBC adapter in JRuby to support more functionality. Unfortunately, you can’t use ActiveRecord for JRuby Mahout functionality, because Mahout requires a JDBC data source object that can’t be accessed through the default ActiveRecord methods.
Now you have a recommender and a data model set up. It’s time to generate some recommendations and evaluate your recommender!
Recommendations and Evaluations
We are finally at the fun part, where you tell the recommender to generate some recommendations for you. With JRuby Mahout it’s super simple.
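A sketch of the call (the argument order follows my reading of the gem’s README, so treat it as an assumption):

```ruby
# 10 recommendations for the user with ID 2; nil rescorer for now
recommendations = recommender.recommend(2, 10, nil)
```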
This will generate an array of 10 recommendations with estimated ratings for the user with ID 2. The last argument is the rescorer, which can help you override certain rules for recommendations. For example, you might want to include only movies of a certain genre in your recommendations when the user is browsing that genre. I am going to cover the rescorer in the next part of the series.
After running the recommender you might wonder: are these recommendations any good? That’s a very important question to answer, and JRuby Mahout provides an easy mechanism for it. The recommender has a method called evaluate that takes two parameters: the training percentage and the evaluation percentage. The former defines which part of your dataset should be used to “train” the recommender; the latter is used to evaluate it. Mahout basically tries to guess how users would rate individual items in the evaluation part of the dataset and then reports the average difference between real and guessed preferences. The lower the difference, the better: 0.0 is the perfect result, meaning that the recommender got all recommendations right (this pretty much never happens in reality), while 1.0 or less for a five star rating system would be a decent result.
Running the evaluator is a one-liner.
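For example (a sketch; the 70/30 split is just a common starting point):

```ruby
# Train on 70% of the dataset, evaluate on 30%
score = recommender.evaluate(0.7, 0.3)
puts score  # lower is better; 0.0 would be a perfect score
```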
To improve the quality of your recommendations, you will have to experiment with the similarity metric, the user neighborhood and the type of recommender. The final, most successful combination will depend entirely on your dataset.
After you have made some recommendations and evaluated the recommender, it’s time to do some other cool things. For example, the user-based recommender can help you find similar users in your dataset.
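A sketch of the call (method name and argument order per my reading of the gem’s README):

```ruby
# 5 users most similar to the user with ID 1; nil rescorer
similar = recommender.similar_users(1, 5, nil)
```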
This will return 5 users that are similar to the user with ID 1. The last argument is, again, the rescorer. This could be helpful if you want your system to suggest whom a particular user should follow or pay attention to.
The item-based recommender can show similar items the same way the user-based recommender shows similar users.
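For example (a sketch; the item ID is an example):

```ruby
# 10 items most similar to the item with ID 1; nil rescorer
similar_items = recommender.similar_items(1, 10, nil)
```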
This will return 10 items that are similar to the given item.
The item-based recommender can also list the items that were most influential in recommending a given item to a given user.
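A sketch of the call (IDs are examples; method name per my reading of the gem’s README):

```ruby
# 5 items that most influenced recommending item 101 to user 1
influencers = recommender.recommended_because(1, 101, 5)
```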
The first parameter is the user ID, the second is the item ID and the third is the number of influential items to return.
Finally, both user- and item-based recommenders can estimate the preference between a user and an item.
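For example (IDs are examples):

```ruby
# Estimated preference of user 1 for item 101
rating = recommender.estimate_preference(1, 101)
```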
The first argument is the user ID and the second one is the item ID. This method returns a float with an estimated rating.
Thank you for reading this article; I hope you enjoyed it! At this point you should be ready to start experimenting with recommendations and perhaps adding them to your projects. Stay tuned for the following parts of this series by following me on Twitter, GitHub or this blog.