We worked with a tech company that hosted digitized fliers and coupons for retailers. How did the company make money? Essentially, off clicks on the coupons.
We set out to build a model that predicted customer lifetime value. And, we productionized it and had it automatically run end-to-end on the servers to predict lifetime value for all the users.
I want to use that experience to highlight how different the deployment of an actual model is from how machine learning is taught in school.
Model objective and use case
We wanted to predict what a customer would be worth at the end of their first year. The use case was to use predicted customer lifetime value to tailor marketing strategies to users and to cap acquisition spend for each acquired user cohort.
Tech Stack
The model was built on Spark, and we used Scala MLlib for the model and transformer classes. Don't worry too much about MLlib if you aren't familiar with it — it's very similar to scikit-learn, except it's built to work on distributed datasets rather than local pandas dataframes.
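For readers coming from scikit-learn, here is a minimal sketch of what that looks like in MLlib. The column names are placeholders I made up for illustration, not the client's schema.

```scala
// A minimal sketch of the MLlib style: feature transformers and an estimator
// chained into a Pipeline, much like a scikit-learn Pipeline.
// Column names are illustrative placeholders.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Collect raw numeric columns into a single feature vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("revenue_day_1", "clicks_day_1", "sessions_day_1"))
  .setOutputCol("features")

// Any MLlib estimator slots in here; a linear model keeps the sketch simple.
val regressor = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("revenue_day_365")

// Stages run in order: assemble features, then fit the model.
val pipeline = new Pipeline().setStages(Array(assembler, regressor))
```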
How we set up the data
The data we had consisted of users and the engagement metrics those users accumulated over time. We could use those engagement metrics to calculate the revenue accumulated by any given user on any given day. I won't go too deep into the structure of the data, because the model was built for an actual client and I want to ensure nothing too specific is disclosed. But with their permission, here is an abstraction of how the data looked.
As you remember, this is a one-year lifetime value calculation. So, to build a model that predicts revenue one year out from a given date, we need to make sure we have at least 365 days of data for the users we train on.
Creating the base tables from the raw, unstructured data required quite a bit of aggregation. Scalability also became important, because aggregating huge amounts of unstructured data can be costly.
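To give a feel for what that aggregation involves, here is a hedged sketch in Spark. The event schema and the S3 path are abstractions invented for illustration, not the client's actual data.

```scala
// A sketch of the aggregation step: roll raw engagement events up into one
// row per user per day with cumulative metrics, and keep only users old
// enough to have a day-365 label. Schema and paths are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("ltv-base-tables").getOrCreate()

val rawEvents = spark.read.parquet("s3://bucket/raw-events/") // hypothetical location

// One row per user per day.
val dailyMetrics = rawEvents
  .groupBy(col("user_id"), col("event_date"))
  .agg(sum("clicks").as("clicks"), sum("revenue").as("revenue"))

// Running totals per user, ordered by date.
val byUser = Window
  .partitionBy("user_id")
  .orderBy("event_date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val baseTable = dailyMetrics
  .withColumn("cum_clicks", sum("clicks").over(byUser))
  .withColumn("cum_revenue", sum("revenue").over(byUser))
  .withColumn("first_seen", min("event_date").over(Window.partitionBy("user_id")))
  // Only users with at least 365 days of history can supply a training label.
  .filter(datediff(current_date(), col("first_seen")) >= 365)
```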
Also, a note about building features. We built many, many features, and then used feature selection, dimensionality reduction, and feature importance criteria to cut the number of features down.
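Here is one hedged sketch of the feature-importance side of that trimming, assuming a fitted random forest model. The helper and its argument names are illustrative, not our production code.

```scala
// Rank features by importance from a fitted tree ensemble and keep the top k.
import org.apache.spark.ml.regression.RandomForestRegressionModel

def topFeatures(fitted: RandomForestRegressionModel,
                featureNames: Array[String],
                k: Int): Array[String] =
  featureNames
    .zip(fitted.featureImportances.toArray) // one importance score per input column
    .sortBy { case (_, importance) => -importance }
    .take(k)
    .map { case (name, _) => name }
```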
The next step was splitting the data into train and test sets.
Setting up train and test
The dataset was then split into train and test subsets. The train dataset was used to train the model, and the test dataset was used to track how the model was performing.
In production, how did we use the test dataset? First, to select the best model. Second, to track the error of the trained models over time: we would train a model on day 1, then test it on day 1, day 2, day 3, and so on, and watch for the day on which the error began to increase. That day determined the interval at which we wanted to retrain the model.
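As a sketch of how that split and the day-by-day error tracking could look (the labeledData input, the column names, and the 80/20 split are illustrative assumptions):

```scala
// Split labeled users into train/test, and measure RMSE of a trained
// pipeline on a given day's test snapshot. A rising RMSE over successive
// days tells us it is time to retrain.
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame

def splitTrainTest(labeledData: DataFrame): (DataFrame, DataFrame) = {
  val Array(train, test) = labeledData.randomSplit(Array(0.8, 0.2), seed = 42L)
  (train, test)
}

def rmseOn(model: PipelineModel, testSnapshot: DataFrame): Double =
  new RegressionEvaluator()
    .setLabelCol("revenue_day_365")
    .setPredictionCol("prediction")
    .setMetricName("rmse")
    .evaluate(model.transform(testSnapshot))
```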
Models
We did some exploratory data analysis and dove into the models. We had a bunch of predictive variables like a user’s revenue on day 1, etc. And the variable we wanted to predict was the user’s revenue on day 365.
This was clearly a regression problem, so we threw a couple of regression algorithms at it, and the ones that worked best, given the fairly high dimensionality of the data, were regression trees and ensembles of regression trees.
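Here is roughly what those regressors look like in MLlib; the hyperparameter values shown are illustrative, not the tuned production settings.

```scala
// A single regression tree and a random forest ensemble, MLlib style.
import org.apache.spark.ml.regression.{DecisionTreeRegressor, RandomForestRegressor}

val tree = new DecisionTreeRegressor()
  .setFeaturesCol("features")
  .setLabelCol("revenue_day_365")
  .setMaxDepth(8)

val forest = new RandomForestRegressor()
  .setFeaturesCol("features")
  .setLabelCol("revenue_day_365")
  .setNumTrees(100)
  .setMaxDepth(8)
```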
We wanted to try different algorithms. We wanted to try algorithms with and without PCA to reduce dimensionality. We wanted to try different sets of feature columns. The list went on and on; we wanted to try many different permutations. How did we do it? We set up an interface function that allowed us to try many different parameters for each model iteration without rewriting the whole script.
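Here is a hedged sketch of what such an interface function might look like: one entry point that takes the feature columns, the estimator, and an optional PCA step, fits the pipeline, and reports the test RMSE. Every name in it is illustrative, and the estimator passed in is expected to read its features from a column called features.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.DataFrame

def runIteration(train: DataFrame,
                 test: DataFrame,
                 featureCols: Array[String],
                 estimator: PipelineStage,
                 pcaComponents: Option[Int] = None,
                 labelCol: String = "revenue_day_365"): (PipelineModel, Double) = {

  // With PCA: assemble raw columns, then project them down into "features".
  // Without PCA: assemble raw columns directly into "features".
  val featureStages: Array[PipelineStage] = pcaComponents match {
    case Some(k) => Array(
      new VectorAssembler().setInputCols(featureCols).setOutputCol("rawFeatures"),
      new PCA().setInputCol("rawFeatures").setOutputCol("features").setK(k))
    case None => Array(
      new VectorAssembler().setInputCols(featureCols).setOutputCol("features"))
  }

  val model = new Pipeline().setStages(featureStages :+ estimator).fit(train)

  val rmse = new RegressionEvaluator()
    .setLabelCol(labelCol)
    .setPredictionCol("prediction")
    .setMetricName("rmse")
    .evaluate(model.transform(test))

  (model, rmse)
}
```

The payoff of this kind of design is that trying a new permutation of algorithm, feature set, or dimensionality reduction becomes a single function call rather than a new script.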
Error metrics and monitoring error
The error function we picked was RMSE. So, every model iteration was aimed at minimizing the RMSE cost function.
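For reference, RMSE is the square root of the mean squared difference between predicted and actual day-365 revenue, i.e. sqrt((1/n) * Σ(prediction_i − actual_i)²), which penalizes large misses more heavily than small ones.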
The pipeline structure
So, let's think about it: how does the model work in production?
First, we create our train and test datasets. These jobs run daily, and we collect error metrics on the test datasets. That tells us roughly how long it takes for model performance to degrade significantly. Then, if performance degrades after x days, we retrain the models separately on the latest data.
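In code, that retraining decision can be as small as a threshold check. Here is a sketch; the tolerance is an invented number, not the one we actually used.

```scala
// Flag a retrain once today's test RMSE has drifted some tolerance above the
// RMSE measured on the day the model was trained. The 10% default is illustrative.
def shouldRetrain(rmseAtTraining: Double,
                  rmseToday: Double,
                  tolerance: Double = 0.10): Boolean =
  rmseToday > rmseAtTraining * (1.0 + tolerance)
```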
Then, the retrained models are saved to S3, and from S3 they are loaded daily to score the users. On any given day we pull all the users we have. Every user will be at a different age, and each user's accumulated metrics at that age are fed into the machine learning models. Once the models finish predicting, we have a prediction for every user. The final table is provided below.
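And here is a hedged sketch of what that daily scoring job could look like in Spark. The S3 paths and column names are placeholders, and the model is assumed to have been persisted by the retraining job.

```scala
// Daily scoring: load the latest saved pipeline from S3, score every user's
// accumulated metrics as of today, and write out one prediction per user.
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("ltv-daily-scoring").getOrCreate()

// The retraining job would have persisted the fitted pipeline with
// model.write.overwrite().save("s3://bucket/models/ltv/latest/").
val model = PipelineModel.load("s3://bucket/models/ltv/latest/")

val usersToday = spark.read.parquet("s3://bucket/base-tables/users-current/")

val scored = model.transform(usersToday)
  .select(col("user_id"), col("prediction").as("predicted_ltv_day_365"))

scored.write.mode("overwrite").parquet("s3://bucket/predictions/ltv/")
```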
Conclusion
I have barely touched on the complex logistics of how we deployed these machine learning models. What I wanted to do in this article is step away from the paradigm of how machine learning is taught in most courses and show how it is actually deployed. First, there is no clean dataset waiting for you to do machine learning on; realistically, most of the work you will do is building the features and base tables for machine learning. Also, when you fit actual machine learning algorithms to data, you will want to fit many permutations of parameters, so build your functions in a way that allows you to try different combinations.
Finally, it's good to think about how everything ties together. When will the new base tables be generated, and when will the machine learning models run? How will you use error metrics from the trained models to decide how often the model gets retrained? And how will you score your users? All of these nuances are what make machine learning so complex and fascinating.