Machine Learning – Data Science and GIS

My Work Samples Portfolio

21st Sep 2023 (updated 1st May 2024) RendyK

This page stores work samples to use as cheat sheets. The notebooks collected here are therefore kept simple and can be modified to fit the required use case.

Overall, the work samples cover my skill sets, including:

delivering meaningful data-driven insights to support business goals,
automating data processing (with Python),
data analysis (tabular, time series, text/NLP, and image),
descriptive and inferential statistical analysis,
GIS or spatial data analysis,
data visualization and dashboard development,
Machine Learning modeling (regression, classification, clustering, dimensionality reduction, time series forecasting, recommender engine),
Deep Learning or Artificial Intelligence (regression and classification with MLP, image classification with CNN, time series forecasting with LSTM, text classification with LSTM),
web application development,
developing APIs,
Large Language Models (LLM),
Diffusion (image generation), etc.

Highlights:

Task Group	Tasks	Description	Notebook/Repo
Large Language Model (LLM)	Retrieval-Augmented Generation (RAG)	Developed a RAG pipeline to enhance LLMs with custom documents, with a Streamlit chatbot as the UI.	article, repository
Deep Learning	Image classification with CNN, Multi-label classification	Developed an image classification model using CNN to recognize building, forest, glacier, mountain, sea, and street images.	CNN
Deep Learning	Time series forecasting with vanilla LSTM, stacked LSTM, bidirectional LSTM, CNN LSTM, and Conv LSTM	Forecast carbon monoxide emissions using LSTM and performed time series analysis.	Time series
Deep Learning	Text Classification with Dense, LSTM, Bi-LSTM, GRU, CNN, CNN + GRU	Developed a text classification model to classify tweets into four emotions: joy, sadness, anger, and fear.	Text classification
Supervised Learning	Supervised Learning for Remote Sensing	Predicted the spatial distribution of land cover using Remote Sensing/satellite data. Published the result on a web app.	ML + Remote sensing, web app
NLP	NLP and Sentiment Analysis	Performed NLP analysis and text regression for sentiment analysis.	article, part 1, part 2

Table 1 Favorite notebooks

Others:

Task Group	Tasks	Description	Notebook/Repo
Supervised Learning	Regression	Predicted house prices with various regression algorithms.	Regression
Supervised Learning	Binary classification	Predicted passenger survival on the Titanic using various classification algorithms.	Binary Classification
Supervised Learning	Binary classification (with probability)	Predicted the probability of high traffic and evaluated the model with AUC, accuracy, and F1-score.	Binary Classification
Supervised Learning	Multi-class classification	Predicted household poverty as a multi-class classification problem.	Multi-class Classification
Supervised Learning	Imbalanced classification	Predicted whether an employee was a best performer as an imbalanced classification task.	Imbalanced
Supervised Learning	Bayesian Optimization: bayes_opt or fmin	Compared the libraries bayes_opt and fmin for Bayesian optimization in hyperparameter tuning.	Bayesian Optimization
Supervised Learning	Supervised Learning for Remote Sensing	Predicted the spatial distribution of land cover using Remote Sensing/satellite data. Published the result on a web app.	ML + Remote sensing, web app
AutoML	AutoML for Regression	Predicted house prices with various autoML regression algorithms.	Part 1, Part 2
AutoML	AutoML for Classification	Predicted household poverty classes using autoML classification algorithms.	Part 1, Part 2
Unsupervised Learning	Clustering	Segmented customers using k-means and hierarchical clustering.	k-means, hierarchical clustering
Unsupervised Learning	Geo-spatial clustering and point pattern analysis	Spatial pattern analysis (point/polygon pattern analysis, Spatially Constrained Hierarchical Clustering, etc.) of e-commerce customers in Brazil.	Geo-spatial clustering
Unsupervised Learning	Dimensionality reduction: PCA with SageMaker (upcoming)	Performed PCA on an environmental variables dataset.	PCA
Unsupervised Learning	Anomaly detection: Random Cut Forest with SageMaker	Performed anomaly detection on a daily climate dataset and deployed the model using SageMaker.	Random Cut Forest
Time series forecasting	Time series forecasting with SARIMAX	Forecast the cash in ATMs over time.	“not yet published”
Deep Learning	Image classification with CNN, Multi-label classification	Developed an image classification model using CNN to recognize building, forest, glacier, mountain, sea, and street images.	CNN
Deep Learning	Time series forecasting with LSTM	Forecast carbon monoxide emissions using LSTM and performed time series analysis.	Time series
Deep Learning	Text Classification with Dense, LSTM, Bi-LSTM, GRU, CNN, CNN + GRU	Developed a text classification model to classify tweets into four emotions: joy, sadness, anger, and fear.	Text classification
NLP	NLP and Sentiment Analysis	Performed NLP analysis and text regression for sentiment analysis.	article, part 1, part 2
Inferential Statistics	Inferential Statistics, hypothesis testing, etc.		“not yet published”
Dashboard	Shiny Dashboard	Visualized daily COVID-19 cases in a dashboard.	Shiny Dashboard
Dashboard	Tableau Dashboard	Visualized spatiotemporal analysis of house prices.	“not yet published” (upcoming)
Web application	Streamlit	Streamlit as the chatbot interface for a RAG or LLM application.	https://github.com/rendy-k/LLM-RAG
API	FastAPI		“not yet published”
SageMaker	SageMaker: classification	Developed and deployed a loan default probability classifier using AWS SageMaker.	classification
SageMaker	SageMaker: invoke model	Developed an API to invoke a deployed Machine Learning model.	invoke model
SageMaker	Multi-model deployment with SageMaker	Deployed multiple models on an AWS instance.	multi-model deployment
SageMaker	Recommender system	Built and deployed a recommender system to recommend anime titles using the AWS Factorization Machines algorithm.	recommender system
SageMaker	Time series forecasting	Built and deployed DeepAR to forecast the time series of New Delhi daily weather.	Deep AR
Large Language Model (LLM)	Retrieval-Augmented Generation (RAG)	Developed a RAG pipeline to enhance LLMs with custom documents, with a Streamlit chatbot as the UI.	article, repository

Table 2 Notebook collection

In the “Notebook/Repo” column, the links point to where the notebooks or repositories are stored. Some entries show “not yet published” instead of a link. These notebooks come from professional work and are available only on my local computer; they have not yet been adapted and published.

Data Science, Geographic Information System (GIS), Machine Learning, Remote Sensing

Portfolio

8th Jun 2021 RendyK

This post displays a selection of the projects I have completed.

Continue reading →

Data Science, Machine Learning

Machine Learning Notebooks Collection

1st Jun 2021 (updated 12th Jun 2021) RendyK

This post shares my Machine Learning notebooks. There are three types of Machine Learning for predicting structured tabular data: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. The objective of supervised learning is to build a prediction model from a training dataset and use it to predict an unseen test dataset. Supervised learning can solve regression tasks (for continuous output) and classification tasks (for categorical output). Unsupervised learning aims to learn the patterns in a dataset and simplify the information through clustering and dimensionality reduction. Cluster analysis groups observations into clusters according to the similarity of their features. Dimensionality reduction reduces the number of dataset dimensions or features. Previously, I wrote a post on basic Machine Learning here.

Continue reading →

Data Science, Machine Learning

Hierarchical Clustering

13th Oct 2020 RendyK

Hierarchical clustering is an unsupervised machine learning method that identifies the closest clusters and groups them together. An article on the basics of Machine Learning can be found here. Hierarchical clustering works by repeating only 2 steps. First, detect the 2 or more closest points or clusters. Second, group them together. These two steps are iterated until all of the data points are grouped into clusters. The illustration below describes how hierarchical clustering groups data points and builds a dendrogram at the same time.
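The two repeated steps can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the full article: it assumes single linkage (the distance between two clusters is the distance between their closest members), merges exactly two clusters per iteration, and uses made-up points. The function name `single_linkage` and the variable `merges` are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters=1):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history (cluster_a, cluster_b, distance), i.e. the dendrogram

    def cluster_distance(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Step 1: detect the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        # Step 2: group them together.
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters, merges


# Hypothetical 2-D points forming two visually separate groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0)]
clusters, merges = single_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Stopping at `n_clusters=2` corresponds to cutting the dendrogram at the level where two clusters remain; running with `n_clusters=1` records the full merge history.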

Continue reading →

Data Science, Machine Learning

K-means

13th Oct 2020 RendyK

k-means is an unsupervised Machine Learning method for cluster analysis. More discussion about Machine Learning can be found here. Cluster analysis divides a group of data points into clusters according to their similarity. k-means places k centroids among the data points, and each data point is assigned to its nearest centroid. Each centroid represents the center of its cluster. The centroids are positioned so that the distance from each centroid to the data points in its cluster is as small as possible. The value of k is the number of centroids, and therefore the number of clusters, into which the data points are divided.
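As a rough sketch of the idea, the loop below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points (often called Lloyd's algorithm). The data, the starting centroids, and the name `k_means` are assumptions made for illustration, not taken from the full article.

```python
from math import dist


def k_means(points, centroids, n_iter=10):
    """Alternate nearest-centroid assignment and centroid update."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assign each point to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if its cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical data: two groups of points, with k = 2 starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centroids)
```

The result depends on the starting centroids; in practice they are usually chosen at random and the algorithm is run several times.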

Continue reading →

Data Science, Machine Learning

Linear Regression (Supervised Machine Learning)

13th Oct 2020 (updated 18th Oct 2020) RendyK

Linear regression is a method in Machine Learning. The same term is also used in Statistics. To read about Machine Learning basics, please find my article here. Linear regression finds the relationship between one or more continuous predictor variables and the dependent variable to predict. Simple linear regression has only one predictor, or independent variable, to predict the dependent variable. Plot the variables and draw a fit line whose distance to the data points is as small as possible. The distance from the fit line to each data point represents the prediction error.

Below is the data of 20 apples with their mass (grams) and volume (cm³). Now, we want to create a model or formula to estimate the volume of an apple from its mass using linear regression.
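The apple data itself is not reproduced in this excerpt, so the sketch below uses five made-up mass/volume pairs purely for illustration. It fits the line by ordinary least squares, which is one common way to make the distances between the line and the points "as small as possible" (it minimizes the sum of squared vertical distances). The function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for simple linear regression: y ≈ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, NOT the 20 apples from the article.
mass_g = [100.0, 120.0, 140.0, 160.0, 180.0]
volume_cm3 = [118.0, 141.0, 166.0, 189.0, 211.0]

slope, intercept = fit_line(mass_g, volume_cm3)
# Prediction errors (residuals): vertical distance from each point to the line.
residuals = [v - (slope * m + intercept) for m, v in zip(mass_g, volume_cm3)]
print(slope, intercept, residuals)
```

With a fitted `slope` and `intercept`, the volume of a new apple would be estimated as `slope * mass + intercept`.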

Continue reading →

Data Science, Machine Learning

Naïve Bayes

13th Oct 2020 RendyK

Naïve Bayes is a supervised learning method that performs classification based on categorical parameters using Bayes' Theorem. If you want to read about basic Machine Learning, please refer to the article here, and come back to this article later. Still remember how to calculate probability from high school lessons? For instance, the probability of a die showing 5 is 1/6. It can finally be useful in machine learning.

It is called “naïve” because it assumes mutual independence among the predictors. For example, a monkey is identified as having 2 arms, being brown, and being good at jumping. All these characteristics contribute independently to identifying an animal as a monkey, even though they may actually depend on one another.
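The independence assumption can be made concrete with a small categorical Naïve Bayes written from scratch. Everything in this sketch is an assumption for illustration: the six training animals are invented, each feature is treated as a yes/no value, and Laplace smoothing (`alpha`) is added so that an unseen value does not force a probability of zero. The names `nb_train` and `nb_predict` are hypothetical.

```python
from collections import Counter, defaultdict


def nb_train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def nb_predict(row, class_counts, value_counts, alpha=1.0, n_values=2):
    """Score each class as P(class) times the product of P(value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # The "naive" step: multiply per-feature probabilities as if the
            # features were independent. n_values=2 assumes yes/no features.
            score *= (value_counts[(label, i)][value] + alpha) / (count + alpha * n_values)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical animals described by (has 2 arms, is brown, good at jumping).
rows = [
    ("yes", "yes", "yes"), ("yes", "yes", "yes"), ("yes", "no", "yes"),
    ("no", "yes", "no"), ("no", "no", "yes"), ("no", "yes", "no"),
]
labels = ["monkey", "monkey", "monkey", "other", "other", "other"]

class_counts, value_counts = nb_train(rows, labels)
label, scores = nb_predict(("yes", "yes", "yes"), class_counts, value_counts)
print(label, scores)
```

The scores are unnormalized; dividing each by their sum would give the posterior probability of each class.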

Continue reading →

Data Science, Machine Learning

Decision Tree/Classification and Regression Tree and Random Forest

3rd Oct 2020 RendyK

This article discusses two Machine Learning methods: Decision Tree (also known as Classification and Regression Tree) and, later, Random Forest. A Decision Tree, as its name suggests, takes the form of a tree to decide which class a new observation belongs to. We already use this method in daily life to decide things: if A happens, then do B; otherwise, do C. In Machine Learning, the Decision Tree is built according to the data the user feeds to the model. If you are not familiar with Machine Learning basics, please find an article discussing them here. After that, come back to this article.

How a Decision Tree works is explained in a simple way using the following data. The plot shows purchased goods with their price on the x axis and quality on the y axis. The goods are then divided into “sold out” and “not sold out”. This article builds a Decision Tree to predict whether an item will be sold out according to its price and quality.
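A minimal from-scratch sketch of such a tree is shown here. The plotted data is not part of this excerpt, so the (price, quality) values are invented, and the choice of Gini impurity as the split criterion is an assumption rather than something the article states. All function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 when every label in the node is the same."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Find the (feature, threshold) split with the lowest weighted impurity."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels):
    """Split recursively until a node is pure or cannot be split further."""
    split = best_split(rows, labels) if len(set(labels)) > 1 else None
    if split is None:
        return max(set(labels), key=labels.count)  # leaf: majority class
    _, f, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [l for _, l in left]),
            build_tree([r for r, _ in right], [l for _, l in right]))


def predict_tree(tree, row):
    """Follow "if feature <= threshold go left, else go right" down to a leaf."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


# Hypothetical goods as (price, quality) with their outcome.
rows = [(10, 8), (12, 9), (15, 7), (11, 3), (30, 9), (35, 8), (32, 2), (14, 2)]
labels = ["sold out"] * 3 + ["not sold out"] * 5

tree = build_tree(rows, labels)
print(predict_tree(tree, (13, 8)), "|", predict_tree(tree, (40, 9)))
```

Each internal node of the resulting tree is exactly an "if A, then B, else C" rule on price or quality.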

Continue reading →

Data Science, Machine Learning

K Nearest Neighbors (kNN)

3rd Oct 2020 RendyK

kNN is a supervised machine learning method that determines the class of a new observation according to its distance to the nearest neighboring training data. The k defines the number of nearest neighbors, or training data points, used to classify the observation. This article assumes that the reader understands basic Machine Learning. If not, please go to this article, which discusses Machine Learning basics, and then come back here.

The figure below illustrates how kNN classifies a new observation according to existing training data. The yellow square with a question mark inside represents the new observation that we want to classify. Blue circles and green triangles are labeled training data. They are plotted in a 2-dimensional diagram, illustrating a set of training data with 2 parameters and 2 classes: blue circle and green triangle.
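The same setup can be sketched in code. The coordinates below are invented stand-ins for the circles and triangles in the figure, Euclidean distance is assumed as the distance measure, and `knn_classify` is a hypothetical name.

```python
from collections import Counter
from math import dist


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the training data by Euclidean distance to the query, keep the k closest.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data with 2 parameters and 2 classes.
train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
train_labels = ["blue circle"] * 4 + ["green triangle"] * 4

# The "yellow square": a new observation to classify.
print(knn_classify(train_points, train_labels, query=(2.5, 2.5), k=3))
```

Choosing an odd k for a two-class problem avoids tied votes.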

Continue reading →

Data Science, Machine Learning

Introduction to Machine Learning

3rd Oct 2020 (updated 18th Oct 2020) RendyK

The term “machine learning” sounds like a robot-like machine learning something. In practice, machine learning involves the user feeding a large amount of training data into the machine. The machine then learns the patterns in the data and, as a result, creates a model. The model can classify, cluster, and predict test data according to the training data.

There are three kinds of machine learning: supervised learning, unsupervised learning, and reinforcement learning. This article discusses supervised and unsupervised learning only. Supervised learning can classify or predict test data from labeled training data. It learns the labels of the training dataset in order to classify or predict a new dataset according to its variables. Supervised learning can do classification and regression. If the label is categorical, it is called classification. If the label is a continuous number, it is called regression.

Continue reading →