Petter Hultin Gustafsson

Kickstarter with MLOps

Classifying the success of Kickstarter projects using PySpark and TensorFlow.



Introduction


Backing inventors on Kickstarter has, for me, in 99 % of cases led to years of waiting for a product that just never shows up at my doorstep. So, let’s once and for all remedy this with an amazing Kickstarter success classifier. We will use the MLOps platform to ingest our data source (you can find it here on Kaggle), then develop our preprocessing and model scripts locally using the SDK, and finally submit the whole package to our AWS account to get it versioned and ready for production. After all, this could be a real money maker.



Preprocessing with PySpark


Some might argue that using Spark for a 50 MB dataset is a bit overkill. But I like consistency, and after all, Spark works just as well for megabytes as for terabytes.

Let’s start with defining our main function:


As usual, I generate the script template using the MLOps platform, reading the data source from S3. The MLOps SDK gives me a nice wrapper for all the things we don’t want to think about, like reading, writing and version controlling the transformations, so that I can focus on what happens in my_transformations.
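The generated template itself isn’t reproduced here, but a minimal sketch of what it does could look like the following, with the SDK’s wrapper calls replaced by plain PySpark reads and writes and placeholder S3 paths:

```python
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("kickstarter-preprocessing").getOrCreate()

    # The template reads the raw data source from S3 (placeholder path)
    raw_df = spark.read.csv(
        "s3://my-bucket/kickstarter-projects.csv", header=True, inferSchema=True
    )

    # All the interesting work happens in my_transformations (defined below)
    out_df = my_transformations(raw_df)

    # The template also handles writing and version controlling the result
    out_df.write.mode("overwrite").parquet("s3://my-bucket/kickstarter-preprocessed/")


if __name__ == "__main__":
    main()
```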


Next up, let’s import the PySpark classes I’m going to use:
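Roughly the following, where the mlops module name is my assumption for the platform SDK:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DoubleType, TimestampType
from pyspark.ml.feature import (
    StringIndexer,
    StandardScaler,
    OneHotEncoder,
    VectorAssembler,
)

# The platform SDK; the exact module name is assumed here
import mlops
```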


Nothing fancy going on here: the usual Spark preprocessing classes, plus the MLOps SDK.

We are also going to need my favourite PySpark helper function, which transforms PySpark vectors into columns:
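The original helper isn’t shown here, but a common way to write it is a small UDF that turns the vector into an array and then splits it into one column per element:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType


def vector_to_columns(df, vector_col, prefix):
    """Explode a PySpark ML vector column into one plain double column per element."""
    # Convert the ML Vector into a plain array so we can index into it
    to_array = F.udf(
        lambda v: v.toArray().tolist() if v is not None else None,
        ArrayType(DoubleType()),
    )
    df = df.withColumn("_tmp_array", to_array(F.col(vector_col)))

    # Use the first row to figure out how many columns we need
    size = len(df.select("_tmp_array").first()[0])
    for i in range(size):
        df = df.withColumn(f"{prefix}_{i}", F.col("_tmp_array")[i])

    return df.drop("_tmp_array")
```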


So once we are done with that, we can move on to the transformation function, which takes the raw input DataFrame, applies all transformations and then returns the result to main for writing to disk. We start off by just cleaning up the DataFrame a bit, making sure that all values have the right types:
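A hedged sketch of that first step inside my_transformations, using the column names from the Kaggle dataset (goal, pledged, backers, launched, deadline):

```python
# Cast numeric and timestamp columns to proper types, and drop rows that are
# missing values we need further down the pipeline
df = (
    df.withColumn("goal", F.col("goal").cast(DoubleType()))
    .withColumn("pledged", F.col("pledged").cast(DoubleType()))
    .withColumn("backers", F.col("backers").cast(IntegerType()))
    .withColumn("launched", F.col("launched").cast(TimestampType()))
    .withColumn("deadline", F.col("deadline").cast(TimestampType()))
    .dropna(subset=["goal", "backers", "launched", "deadline", "state", "currency"])
)
```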


We also calculate the total “online” time of the campaign by subtracting the project’s start time from its end time, and filter out some small subsets, like currencies that are not USD (~10 % of the data) and states that are not success or fail (~6 % of the data).
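Something along these lines, assuming the successful/failed state values from the Kaggle dataset:

```python
# Number of days the campaign was online
df = df.withColumn("online_days", F.datediff(F.col("deadline"), F.col("launched")))

# Keep only USD projects and projects that actually finished as success or fail
df = df.filter(F.col("currency") == "USD")
df = df.filter(F.col("state").isin("successful", "failed"))
```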


Once we are done with that, we can move on to transforming all categorical string columns to integers:
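With a StringIndexer per column, roughly like this (the exact list of categorical columns is my assumption based on the Kaggle schema):

```python
# Index each categorical string column into an integer column
categorical_cols = ["category", "main_category", "country"]
for column in categorical_cols:
    indexer = StringIndexer(
        inputCol=column, outputCol=f"{column}_idx", handleInvalid="keep"
    )
    df = indexer.fit(df).transform(df)
```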


Next, we move on to scaling all columns. This is an important step for deep learning, since we don’t want our weights to go haywire. I will use the StandardScaler, and once that’s done I will also use the OneHotEncoder to transform my project state column (success/fail) into a one-hot encoded representation, since that’s the format TensorFlow desires:
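A sketch of that step, assuming Spark 3.x (where OneHotEncoder is an estimator you have to fit) and an assumed set of feature columns:

```python
# Spark's StandardScaler works on vector columns, so assemble the features
# into one vector first (feature selection here is my own assumption)
numeric_cols = ["goal", "backers", "online_days"] + [f"{c}_idx" for c in categorical_cols]
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features_raw")
df = assembler.transform(df)

scaler = StandardScaler(
    inputCol="features_raw", outputCol="features", withMean=True, withStd=True
)
df = scaler.fit(df).transform(df)

# Index the state column (successful/failed) and one-hot encode it,
# keeping BOTH categories with dropLast=False
state_indexer = StringIndexer(inputCol="state", outputCol="state_idx")
df = state_indexer.fit(df).transform(df)

encoder = OneHotEncoder(
    inputCols=["state_idx"], outputCols=["state_onehot"], dropLast=False
)
df = encoder.fit(df).transform(df)
```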


One important thing to notice, which you will only learn through painful experience, is the dropLast=False argument for the OneHotEncoder. I know they have a reason for defaulting it to True (the last category is redundant for linear models), but I honestly think someone decided that while being drunk.


Finally, we can explode the one-hot encoded vector into columns and return the DataFrame to main:
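Using the helper from earlier; the feature and label column prefixes are my own naming:

```python
# Turn the scaled feature vector and the one-hot encoded label vector into
# plain columns, then hand the result back to main for writing
df = vector_to_columns(df, "features", "feature")
df = vector_to_columns(df, "state_onehot", "label")

return df.select(
    [c for c in df.columns if c.startswith("feature_") or c.startswith("label_")]
)
```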


While I develop this script, I’m using the MLOps local testing environment available inside the platform (pip install mlops-local), so that I can quickly iterate on a subset of the data until I’m happy with my transformations:


The final result, with calculated metrics I can then inspect inside the Datasets view:





Training with TensorFlow


As always, training follows roughly the same setup as preprocessing: a little bit of template-generated code in the main function handles reading, writing and sending data back to the console.


Besides that, I have a my_network function where I define my architecture. In the console, I provide the script with my hyperparameters, which in this case are batch_size, learning_rate and epochs, and which are then available under mlops.hyperparameters along with the data matrices for training, validation and test. I will run this for 20 epochs, which might be a bit over the top for the amount of data:
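The actual architecture isn’t reproduced here, but a minimal Keras sketch of what my_network could look like follows; the layer sizes are my own guesses, and how the mlops object exposes the data matrices is an assumption:

```python
import tensorflow as tf


def my_network(mlops):
    params = mlops.hyperparameters
    # How the data matrices are exposed by the SDK is assumed here
    x_train, y_train = mlops.data["train"]
    x_val, y_val = mlops.data["validation"]

    # Small fully connected network ending in a two-unit softmax, matching
    # the one-hot encoded success/fail label from preprocessing
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(x_train.shape[1],)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=float(params["learning_rate"])),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

    model.fit(
        x_train,
        y_train,
        validation_data=(x_val, y_val),
        batch_size=int(params["batch_size"]),
        epochs=int(params["epochs"]),  # 20 in this run
    )
    return model
```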


As expected, we see a quick run up to almost 96 % accuracy, then stagnation while the validation loss goes crazy. Seriously overfitted. But then again, it got a solid 96 % on the test set, and this is just me playing around.


Putting into production


Since we are SO happy with our model, it’s time to make some money on it by exposing it to the world (not really: it’s hidden in a VPC that’s only accessible from your own cloud services). I will create a Live endpoint, meaning a hosted API that is up and running 24/7, scaling instances horizontally based on the CPU utilisation of the inference machines.

I will also set a data sampling percentage of 10 %, meaning that for every tenth request the MLOps platform will save the input and output of the inference and run analysis on data drift, schema correctness and so on. On top of this I can later set an alarm to notify my DevOps team if shit hits the fan:


So, that’s some end-to-end magic for you. Hope to see you soon in the Slack community, where we discuss machine learning and operations from the perspectives of data engineering, data science and DevOps, and how they can all come together!