
Machine Learning Notebooks Collection

This post shares my Machine Learning notebooks. There are three types of Machine Learning for predicting structured tabular data: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. The objective of supervised learning is to build a prediction model from a labeled training dataset and use it to predict an unseen test dataset. Supervised learning solves regression tasks (continuous output) and classification tasks (categorical output). Unsupervised learning aims to learn patterns in the dataset and simplify its information through clustering and dimensionality reduction: cluster analysis groups observations into clusters according to the similarity of their features, while dimensionality reduction reduces the number of dataset dimensions or features. Previously, I have written a post on basic Machine Learning here.

Fig. 1 Machine Learning illustration
Fig. 2 Machine Learning for tabular data

This post starts by demonstrating supervised learning for regression and classification. The following table links to my notebook for each Machine Learning problem; each notebook builds commonly used Machine Learning algorithms in Python. After stating the problem and introducing the provided dataset, each notebook performs simple data pre-processing, such as feature generation, feature selection or feature extraction, and feature scaling.
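As a minimal sketch of the feature-scaling step mentioned above (the toy arrays here are an assumption, not data from the notebooks), scikit-learn's StandardScaler is fit on the training set only, so the test set is scaled with the same statistics and no information leaks:

```python
# Feature scaling sketch: fit the scaler on training data only,
# then reuse its mean/std on the test data to avoid leakage.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy data
X_test = np.array([[2.5, 350.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature mean and std
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # roughly zero per feature after scaling
```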

Task | Scorer | Notebook
Regression | RMSE, MAE, R | Regression
Binary classification | Accuracy, F1-score | Binary Classification
Binary classification (with probability) | AUC, accuracy, F1-score | Binary Classification
Multi-class classification | Accuracy, F1-score | Multi-class Classification
Multi-label classification | — | coming soon
AutoML for Regression | RMSE, MAE, R | Part 1, Part 2
AutoML for Classification | Accuracy, F1-score | Part 1, Part 2
Clustering | — | coming soon
Dimensionality reduction | — | coming soon
Reinforcement learning | — | coming soon
Table 1 Machine Learning Notebooks collection

Regression

The regression notebook predicts house prices using the following algorithms: (1) Linear Regression, (2) Ridge Regression, (3) Lasso Regression, (4) Elastic-net, (5) K Nearest Neighbors, (6) Support Vector Machine, (7) Decision Tree, (8) Random Forest, (9) Gradient Boosting Machine (GBM), (10) Light GBM, (11) Extreme Gradient Boosting (XGBoost), and (12) Neural Network (Deep Learning). Each algorithm is trained with hyperparameter tuning using 5-fold cross-validation. The scorer metrics are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the correlation coefficient between the predicted and true target values (R). The predicted and true values are visualized in a scatter plot.
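The tuning-and-scoring step above can be sketched with one of the listed algorithms; this minimal example uses Ridge Regression on a synthetic dataset (an assumption, not the house-price data) and reports RMSE, MAE, and R:

```python
# Tune Ridge Regression's alpha with 5-fold cross-validation,
# then score on a held-out test split with RMSE, MAE, and R.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

pred = search.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
r = np.corrcoef(y_test, pred)[0, 1]  # correlation between true and predicted
print(search.best_params_, rmse, mae, r)
```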

Machine Learning is not only about making predictions. There can be other goals, such as understanding how each feature affects the output value. Algorithms that create equations, such as Linear Regression, Ridge Regression, and Lasso Regression, have coefficients or slopes for the features. Tree-based algorithms, like Decision Tree, GBM, and XGBoost, have feature importances.
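Both kinds of inspection can be sketched on a small synthetic dataset (an assumption for illustration): a linear model exposes its coefficients, and a tree ensemble exposes its feature importances:

```python
# Inspect how features affect the output: coefficients for a linear
# model, feature importances for a tree-based ensemble.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=150, n_features=4, random_state=1)

lasso = Lasso(alpha=1.0).fit(X, y)
print("coefficients:", lasso.coef_)  # one slope per feature

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)
print("importances:", forest.feature_importances_)  # non-negative, sum to 1
```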

Classification

There are three types of classification. Binary classification requires the model to predict one of only two output classes, for example, true or false: if an observation is not predicted as true, it is false. The scorer metrics are accuracy and F1-score (including precision and recall). The above notebook uses the algorithms (1) Logistic Regression, (2) Naive Bayes, (3) K Nearest Neighbors, (4) Support Vector Machine, (5) Decision Tree, (6) Random Forest, (7) Gradient Boosting Machine, (8) Light GBM, (9) Extreme Gradient Boosting, and (10) Neural Network (Deep Learning) to predict whether each Titanic passenger survived.
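A minimal binary-classification sketch (on synthetic data, not the Titanic set) fits one of the listed algorithms, Logistic Regression, and scores it with accuracy and F1:

```python
# Binary classification scored with accuracy and F1-score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))
```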

Binary classification can also predict the probability of the output. For instance, if the predicted probability of true is 0.8, then the chance of false is 0.2. The scorer for probability prediction is the Area Under the ROC Curve (AUC). The above notebook creates models to predict the probability that each location, date, and time was in high traffic. The result of binary classification can be shown in a confusion matrix: a matrix with 2 columns and 2 rows counting the true positives, true negatives, false positives, and false negatives.
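Probability prediction and the confusion matrix can be sketched together (again on assumed synthetic data, not the traffic dataset): `predict_proba` gives the probability of the positive class for the AUC, while the hard predictions fill the 2x2 matrix:

```python
# Probability prediction scored with AUC, plus a 2x2 confusion matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, proba)
cm = confusion_matrix(y_test, clf.predict(X_test))  # rows: true, cols: predicted
print(auc)
print(cm)  # [[TN, FP], [FN, TP]]
```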

Multi-class classification is similar to binary classification, except that there are more than two output classes to predict; the scorer metrics are again accuracy and F1-score, and class probabilities can also be predicted. Multi-label classification, in contrast, allows each observation to belong to more than one class at the same time. Just like regression models, classification models can be analyzed further by examining feature coefficients and feature importances.
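A multi-class sketch fits one of the earlier algorithms, Random Forest, on the classic three-class Iris dataset (chosen here as an assumption for illustration) and scores it with accuracy and macro-averaged F1:

```python
# Multi-class classification on a three-class dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)  # three output classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
# average="macro" computes F1 per class, then takes the unweighted mean
print(accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro"))
```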

Automated Machine Learning (AutoML) for Regression and Classification

AutoML refers to Machine Learning libraries that perform model creation, model selection, and hyperparameter tuning automatically. Some can even automatically perform pre-processing, ensemble methods, and summary visualization as well, saving many lines of code. My notebook collection provides examples of (1) Auto-Sklearn, (2) Tree-based Pipeline Optimization Tool (TPOT), (3) Hyperopt, (4) AutoKeras, (5) MLJAR, (6) AutoGluon, (7) H2O, (8) PyCaret, and (9) Variant Interpretable Machine Learning (AutoViML).
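The core loop these libraries automate can be hand-rolled in a few lines; this sketch (using plain scikit-learn rather than any of the libraries above) tries several candidate models, tunes each with cross-validation, and keeps the best:

```python
# A hand-rolled sketch of what AutoML automates: model selection
# plus hyperparameter tuning via cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [20, 50]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X, y)  # tune each candidate
    results[name] = search.best_score_

best = max(results, key=results.get)
print(best, results[best])  # the winning model and its mean CV accuracy
```

Real AutoML libraries extend this loop with larger search spaces, smarter search strategies, and automatic pre-processing and ensembling.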

This post will be updated every time a new relevant notebook is created.
