By Moez Ali, Founder & Author of PyCaret
PyCaret is an open-source, low-code machine learning library and end-to-end model management tool built in Python for automating machine learning workflows. It is known for its ease of use, simplicity, and ability to quickly and efficiently build and deploy end-to-end ML prototypes.
PyCaret is an alternative low-code library that can replace hundreds of lines of code with just a few. This makes the experiment cycle much faster and more efficient.
PyCaret — An open-source, low-code machine learning library in Python
To learn more about PyCaret, you can check out their GitHub.
MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models, and the Model Registry.
MLflow is an open-source platform to manage ML lifecycle
To learn more about MLflow, you can check out GitHub.
Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries.
PyCaret’s default installation is a slim version of pycaret that only installs hard dependencies listed here.
# install slim version (default)
pip install pycaret

# install the full version
pip install pycaret[full]

When you install the full version of pycaret, all the optional dependencies as listed here are also installed. MLflow is one of PyCaret’s dependencies and hence does not need to be installed separately.
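If you want to confirm that both packages are visible from the environment you are working in, a small check like the one below can help. This is only a sketch using the standard library; the helper name `is_installed` is my own and is not part of PyCaret or MLflow.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report on the two packages this tutorial relies on.
for name in ("pycaret", "mlflow"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either one shows up as missing, check that the notebook kernel is using the same virtual environment you ran pip in.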
👉 Let’s get started
Before I talk about MLOps, let’s talk a little bit about the machine learning lifecycle at a high level:
Machine Learning Life Cycle — Image by Author (Read from left-to-right)
- Business Problem — This is the first step of the machine learning workflow. It may take from a few days to a few weeks to complete, depending on the use case and complexity of the problem. At this stage, data scientists meet with subject matter experts (SMEs) to gain an understanding of the problem, interview key stakeholders, collect information, and set the overall expectations of the project.
- Data Sourcing & ETL — Once the problem is understood, the next step is to use the information gained during the interviews to source the data from the enterprise database.
- Exploratory Data Analysis (EDA) — Modeling hasn’t started yet. EDA is where you analyze the raw data. Your goal is to explore the data and assess the quality of the data, missing values, feature distribution, correlation, etc.
- Data Preparation — Now it’s time to prepare the data for model training. This includes things like splitting the data into train and test sets, imputing missing values, one-hot-encoding, target encoding, feature engineering, feature selection, etc.
- Model Training & Selection — This is the step everyone is excited about. This involves training a bunch of models, tuning hyperparameters, model ensembling, evaluating performance metrics, model analysis such as AUC, Confusion Matrix, Residuals, etc, and finally selecting one best model to be deployed in production for business use.
- Deployment & Monitoring — This is the final step, which is mostly about MLOps. This includes things like packaging your final model, creating a Docker image, writing the scoring script, making it all work together, and finally publishing it as an API that can be used to obtain predictions on the new data coming through the pipeline.
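To make the Data Preparation step above a little more concrete, here is a minimal sketch of three of the tasks it mentions: a train/test split, mean imputation, and one-hot-encoding. It uses only the standard library, the rows are made-up toy values (not the dataset used later), and the helper names are my own. PyCaret does all of this for you, so treat this purely as an illustration of the idea.

```python
import random
from statistics import mean

# Toy rows with made-up values, one of them with a missing carat weight.
CUT_LEVELS = ["Good", "Ideal", "Very Good"]
rows = [
    {"carat": 1.10, "cut": "Ideal", "price": 5200},
    {"carat": 0.80, "cut": "Ideal", "price": 3500},
    {"carat": None, "cut": "Very Good", "price": 3200},
    {"carat": 0.90, "cut": "Good", "price": 4400},
    {"carat": 1.20, "cut": "Very Good", "price": 5800},
    {"carat": 1.00, "cut": "Good", "price": 4700},
]


def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def impute_with_train_mean(train, test, column):
    """Fill missing values in both sets with the mean learned from train only."""
    fill = mean(r[column] for r in train if r[column] is not None)

    def fix(row):
        return {**row, column: fill if row[column] is None else row[column]}

    return [fix(r) for r in train], [fix(r) for r in test]


def one_hot(rows, column, categories):
    """Replace a categorical column with one 0/1 indicator per category."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


train, test = split_rows(rows)
train, test = impute_with_train_mean(train, test, "carat")
train = one_hot(train, "cut", CUT_LEVELS)
test = one_hot(test, "cut", CUT_LEVELS)

print(len(train), "training rows,", len(test), "test rows")
print(train[0])
```

Notice that the imputation value is computed from the training rows only and then reused on the test rows, so nothing from the test set leaks into the preparation.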
The old way of doing all this is cumbersome, long, and requires a lot of technical know-how, and I cannot possibly cover it in one tutorial. However, in this tutorial, I will use PyCaret to demonstrate how easy it has become for a data scientist to do all this very efficiently. Before we get to the practical part, let’s talk a little bit more about MLOps.
👉 What is MLOps?
MLOps is an engineering discipline that aims to combine machine learning development, i.e. experimentation (model training, hyperparameter tuning, model ensembling, model selection, etc.), normally performed by data scientists, with ML engineering and operations in order to standardize and streamline the continuous delivery of machine learning models in production.
If you are an absolute beginner, chances are you have no idea what I am talking about. No worries. Let me give you a simple, non-technical definition:
MLOps is a bunch of technical engineering and operational tasks that allow your machine learning model to be used by other users and applications across the organization. Basically, it’s the way your work, i.e. your machine learning models, gets published online so other people can use it to satisfy some business objectives.
This is a very toned-down definition of MLOps. In reality, it involves a little more work and more benefits than this, but it’s a good start for you if you are new to all this.
Now let’s follow the same workflow as shown in the diagram above to do a practical demo. Make sure you have pycaret installed.
👉 Business Problem
For this tutorial, I will be using a very popular case study by the Darden School of Business, published in Harvard Business. The case is about two people who are going to be married in the future. A guy named Greg wants to buy a ring to propose to a girl named Sarah. The problem is to find the ring Sarah will like, but after a suggestion from his close friend, Greg decides to buy a diamond stone instead so that Sarah can make the choice herself. Greg then collects data on 6000 diamonds with their price and attributes like cut, color, shape, etc.
The goal of this tutorial is to predict the diamond price based on its attributes like carat weight, cut, color, etc. You can download the dataset from PyCaret’s repository.
# load the dataset from pycaret
from pycaret.datasets import get_data

data = get_data('diamond')
Sample rows from data
👉 Exploratory Data Analysis
Let’s do some quick visualization to assess the relationship of the independent features (weight, cut, color, clarity, etc.) with the target variable, i.e. Price.

# plot scatter carat_weight and Price
import plotly.express as px

fig = px.scatter(
    x=data['Carat Weight'],
    y=data['Price'],
    facet_col=data['Cut'],
    opacity=0.25,
    template='plotly_dark',
    trendline='ols',
    trendline_color_override='red',
    title='SARAH GETS A DIAMOND - A CASE STUDY',
)
fig.show()
Sarah gets a diamond case study
Let’s check the distribution of the target variable.
# plot histogram
fig = px.histogram(data, x=["Price"], template='plotly_dark', title='Histogram of Price')
fig.show()

Notice that the distribution of Price is right-skewed. We can quickly check whether a log transformation makes Price approximately normal, which gives a fighting chance to algorithms that assume normality.

import numpy as np

# create a copy of data
data_copy = data.copy()

# create a new feature Log_Price
data_copy['Log_Price'] = np.log(data['Price'])

# plot histogram
fig = px.histogram(data_copy, x=["Log_Price"], title='Histogram of Log Price', template='plotly_dark')
fig.show()

This confirms our hypothesis. The transformation will help us get rid of the skewness and make the target variable approximately normal. Based on this, we will transform the Price variable before training our models.
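The histograms are a visual check. If you prefer a number, you can compare the sample skewness before and after the log transformation. The sketch below does this on a synthetic right-skewed sample drawn from a log-normal distribution (an assumption made so the example runs on its own; it is not the diamond data), and the `skewness` helper is my own.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


# Synthetic right-skewed "prices" (log-normal), standing in for a real column.
rng = random.Random(0)
prices = [rng.lognormvariate(8.5, 0.6) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

print(f"skewness of raw values: {skewness(prices):.2f}")
print(f"skewness of log values: {skewness(log_prices):.2f}")
```

A value close to zero means the distribution is roughly symmetric, while a clearly positive value means a long right tail. On your own data you would pass the Price column and its log to the same helper.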
👉 Data Preparation
Common to all modules in PyCaret, setup is the first and the only mandatory step in any machine learning experiment using PyCaret. This function takes care of all the data preparation required prior to training models. Besides performing some basic default processing tasks, PyCaret also offers a wide array of pre-processing features. To learn more about all the preprocessing functionalities in PyCaret, you can see this link.

# initialize setup
from pycaret.regression import *

s = setup(data, target='Price', transform_target=True, log_experiment=True, experiment_name='diamond')
setup function in pycaret.regression module
When you initialize the setup function in PyCaret, it profiles the dataset and infers the data types for all input features. If all data types are correctly inferred, you can press enter to continue.
- I have passed log_experiment = True and experiment_name = 'diamond'. This tells PyCaret to automatically log all the metrics, hyperparameters, and model artifacts behind the scenes as you progress through the modeling phase. This is possible due to integration with MLflow.
- Also, I have used transform_target = True inside the setup. PyCaret will transform the Price variable behind the scenes using a Box-Cox transformation. It affects the distribution of the data in a similar way to a log transformation (though it is technically different). If you would like to learn more about Box-Cox transformations, you can refer to this link.
Output from setup — truncated for display
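To see why a Box-Cox transformation behaves like a log transformation yet is technically different, here is a small standard-library sketch of the textbook formula and its inverse. In practice the parameter lambda is estimated from the data. Here I simply pick values by hand, and the function names are my own, so this is an illustration of the math and not PyCaret’s internal code.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lam == 0:
        return math.log(y)  # the log transform is the special case lambda = 0
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z: float, lam: float) -> float:
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


price = 5000.0   # hypothetical price, for illustration only
lam = 0.5        # hand-picked lambda, not an estimated one

transformed = box_cox(price, lam)
restored = inverse_box_cox(transformed, lam)
print(transformed, restored)
```

The inverse is what lets predictions made on the transformed target be reported back on the original price scale.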
👉 Model Training & Selection
Now that the data is ready for modeling, let’s start the training process by using the compare_models function. It trains all the algorithms available in the model library and evaluates multiple performance metrics using k-fold cross-validation.

# compare all models
best = compare_models()
Output from compare_models
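If k-fold cross-validation is new to you, the idea is simple: split the data into k parts, hold out each part once for scoring while training on the rest, and average the k scores. The sketch below shows this with the standard library only. The “model” is a deliberately trivial baseline that predicts the mean of the training targets, the target values are made up, and the helper names are my own. Real implementations usually shuffle the rows first.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validated_mae(targets, k=5):
    """Score a mean-predicting baseline with k-fold CV; return (average, per-fold)."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(targets), k):
        prediction = mean(targets[i] for i in train_idx)  # "train" the baseline
        errors = [abs(targets[i] - prediction) for i in test_idx]
        fold_scores.append(mean(errors))  # mean absolute error on the held-out fold
    return mean(fold_scores), fold_scores


# Made-up target values, for illustration only.
toy_prices = [3200, 3500, 4400, 4700, 5200, 5800, 6100, 7300, 8000, 9900]
average_mae, per_fold = cross_validated_mae(toy_prices, k=5)
print("per-fold MAE:", per_fold)
print("average MAE:", average_mae)
```

Every row is used for scoring exactly once, which is why the averaged metric is a more stable estimate than a single train/test split.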
# check the residuals of trained model
plot_model(best, plot='residuals_interactive')
Residuals and QQ-Plot of the best model
# check feature importance
plot_model(best, plot='feature')
Finalize and Save Pipeline
Let’s now finalize the best model i.e. train the best model on the entire dataset including the test set and then save the pipeline as a pickle file.
# finalize the model
final_best = finalize_model(best)

# save model to disk
save_model(final_best, 'diamond-pipeline')

The save_model function saves the entire pipeline (including the model) as a pickle file on your local disk. By default, it saves the file in the same folder as your notebook or script, but you can pass the complete path as well if you would like.
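If pickling is unfamiliar, the sketch below shows the underlying idea with the standard library: an object is serialized to a .pkl file under a name (optionally including a folder) and read back later. The `save_pipeline` and `load_pipeline` helpers and the dictionary standing in for a pipeline are hypothetical and only mimic the pattern. They are not PyCaret functions.

```python
import pickle
import tempfile
from pathlib import Path


def save_pipeline(pipeline, model_name):
    """Pickle an object to '<model_name>.pkl', creating folders if needed."""
    path = Path(f"{model_name}.pkl")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(pipeline, f)
    return path


def load_pipeline(model_name):
    """Read back the object stored in '<model_name>.pkl'."""
    with Path(f"{model_name}.pkl").open("rb") as f:
        return pickle.load(f)


# A plain dictionary standing in for a trained pipeline (illustration only).
toy_pipeline = {
    "steps": ["imputer", "one_hot_encoder", "regressor"],
    "target_transform": "box-cox",
}

# Save under a full path inside a temporary folder, then load it again.
with tempfile.TemporaryDirectory() as tmp:
    full_name = str(Path(tmp) / "models" / "diamond-pipeline")
    saved_path = save_pipeline(toy_pipeline, full_name)
    restored = load_pipeline(full_name)

print(saved_path.name)
print(restored == toy_pipeline)
```

Keep in mind that a pickle file should only be loaded from a source you trust, because unpickling can execute arbitrary code.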
Remember we passed log_experiment = True in the setup function along with experiment_name = 'diamond'. Let’s see the magic PyCaret has done with the help of MLflow behind the scenes. To see the magic, let’s start the MLflow server:

# within notebook (notice the ! sign in front)
!mlflow ui

# on command line in the same folder
mlflow ui
Now open your browser and type “localhost:5000”. It will open a UI like this:
Each entry in the table above represents a training run resulting in a trained Pipeline and a bunch of metadata such as DateTime of a run, performance metrics, model hyperparameters, tags, etc. Let’s click on one of the models:
Part I — CatBoost Regressor
Part II — CatBoost Regressor (continued)
Notice that you have an address path for the logged_model. This is the trained Pipeline with the CatBoost Regressor. You can read this Pipeline using the load_model function.

# load model
from pycaret.regression import load_model

pipeline = load_model('C:/Users/moezs/mlruns/1/b8c10d259b294b28a3e233a9d2c209c0/artifacts/model/model')

# print pipeline
print(pipeline)
Output from print(pipeline)
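The path passed to load_model above follows a pattern: the tracking folder, then an experiment number, then the run ID, then the artifact location. If you need to build such a path for a different run, a tiny helper like the one below can do it. The layout here is simply read off the path shown in this tutorial and may differ in your setup or MLflow version, and `logged_model_path` is a hypothetical helper of my own.

```python
from pathlib import PurePosixPath


def logged_model_path(tracking_dir, experiment_id, run_id):
    """Compose '<tracking_dir>/<experiment_id>/<run_id>/artifacts/model/model'."""
    path = PurePosixPath(tracking_dir) / str(experiment_id) / run_id
    return str(path / "artifacts" / "model" / "model")


# Rebuild the path used in this tutorial from its parts.
path = logged_model_path(
    "C:/Users/moezs/mlruns",
    1,
    "b8c10d259b294b28a3e233a9d2c209c0",
)
print(path)
```

You can copy the run ID for any run from the MLflow UI and pass the resulting string to load_model.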
Let’s now use this Pipeline to generate predictions on the new data
# create a copy of data and drop Price
data2 = data.copy()
data2.drop('Price', axis=1, inplace=True)

# generate predictions
from pycaret.regression import predict_model

predictions = predict_model(pipeline, data=data2)
predictions.head()
Predictions generated from Pipeline
Woohoo! We now have inference from our trained Pipeline. Congrats, if this is your first one. Notice that all the transformations such as target transformation, one-hot-encoding, missing value imputation, etc. happened behind the scenes automatically. You get a data frame with predictions on the actual scale, and this is what you care about.
What I have shown today is one out of many ways you can serve trained Pipelines from PyCaret in production with the help of MLflow. In the next tutorial, I plan to show how you can use MLflow’s native serving functionality to register your models, version them, and serve them as an API.
There is no limit to what you can achieve using this lightweight workflow automation library in Python. If you find this useful, please do not forget to give us ⭐️ on our GitHub repository.
Join us on our Slack channel. Invite link here.
You may also be interested in:
Build your own AutoML in Power BI using PyCaret 2.0
Deploy Machine Learning Pipeline on Azure using Docker
Deploy Machine Learning Pipeline on Google Kubernetes Engine
Deploy Machine Learning Pipeline on AWS Fargate
Build and deploy your first machine learning web app
Deploy PyCaret and Streamlit app using AWS Fargate serverless
Build and deploy machine learning web app using PyCaret and Streamlit
Deploy Machine Learning App built using Streamlit and PyCaret on GKE
Want to learn about a specific module?
Click on the links below to see the documentation and working examples.
Bio: Moez Ali is a Data Scientist, and is Founder & Author of PyCaret.
Original. Reposted with permission.