Data Science, Geographic Information System (GIS), Machine Learning, Remote Sensing

My Work Samples Portfolio

This page stores work samples that I use as cheat sheets. The notebooks are therefore kept simple and can be adapted to the use case at hand.

Overall, the work samples cover my skill sets, including:

  • delivering meaningful data-driven insights to support business goals,
  • automating data processing (with Python),
  • data analysis (tabular, time series, text/NLP, and image),
  • descriptive and inferential statistical analysis,
  • GIS or spatial data analysis,
  • data visualization and dashboard development,
  • Machine Learning modeling (regression, classification, clustering, dimensionality reduction, time series forecasting, recommender engine),
  • Deep Learning or Artificial Intelligence (regression and classification with MLP, image classification with CNN, time series forecasting with LSTM, text classification with LSTM),
  • web application development,
  • developing APIs,
  • Large Language Models (LLM),
  • Diffusion (image generation), etc.

Highlights:

Task Group | Tasks | Description | Notebook/Repo
Large Language Model (LLM) | Retrieval-Augmented Generation (RAG) | Develop RAG to enhance LLMs with custom documents. Streamlit chatbot as the UI | article, repository
Deep Learning | Image classification with CNN, Multi-label classification | Developed image classification model using CNN to recognize buildings, forest, glacier, mountain, sea, and street images. | CNN
Deep Learning | Time series forecasting with vanilla LSTM, stacked LSTM, bidirectional LSTM, CNN LSTM, and Conv LSTM | Forecast carbon monoxide emission using LSTM and did time series analysis. | Time series
Deep Learning | Text Classification with Dense, LSTM, Bi-LSTM, GRU, CNN, CNN + GRU | Developed text classification model to distinguish tweets into 4 emotions: joy, sadness, anger, and fear. | Text classification
Supervised Learning | Supervised Learning for Remote Sensing | Predicted the spatial distribution of land cover using Remote Sensing/satellite data. Published the result on a web app. | ML + Remote sensing, web app
NLP | NLP and Sentiment Analysis | Performed NLP analysis and text regression for sentiment analysis. | article, part 1, part 2
Table 1 Favorite notebooks

Others:

Task Group | Tasks | Description | Notebook/Repo
Supervised Learning | Regression | Predicted house prices with various regression algorithms. | Regression
Supervised Learning | Binary classification | Predicted survival on the Titanic using various classification algorithms. | Binary Classification
Supervised Learning | Binary classification (with probability) | Predicted high traffic probability, evaluated with AUC, accuracy, and F1-score. | Binary Classification
Supervised Learning | Multi-class classification | Predicted household poverty as a multi-class classification problem. | Multi-class Classification
Supervised Learning | Imbalanced classification | Predicted whether an employee was a best performer as an imbalanced classification task. | Imbalanced
Supervised Learning | Bayesian Optimization: bayes_opt or fmin | Compared the libraries bayes_opt and fmin for Bayesian optimization in hyperparameter tuning. | Bayesian Optimization
Supervised Learning | Supervised Learning for Remote Sensing | Predicted the spatial distribution of land cover using Remote Sensing/satellite data. Published the result on a web app. | ML + Remote sensing, web app
AutoML | AutoML for Regression | Predicted house prices with various AutoML regression algorithms. | Part 1, Part 2
AutoML | AutoML for Classification | Predicted household poverty classes using AutoML classification algorithms. | Part 1, Part 2
Unsupervised Learning | Clustering | Segmented customers using k-means and hierarchical clustering. | k-means, hierarchical clustering
Unsupervised Learning | Geo-spatial clustering and point pattern analysis | Spatial pattern analysis (point/polygon pattern analysis, Spatially Constrained Hierarchical Clustering, etc.) of e-commerce customers in Brazil. | Geo-spatial clustering
Unsupervised Learning | Dimensionality reduction: PCA with SageMaker (upcoming) | Performed PCA on an environmental variables dataset. | PCA
Unsupervised Learning | Anomaly detection: Random Cut Forest with SageMaker | Performed anomaly detection on a daily climate dataset and deployed the model using SageMaker. | Random Cut Forest
Time series forecasting | Time series forecasting with SARIMAX | Forecast ATM cash over time. | “not yet published”
Deep Learning | Image classification with CNN, Multi-label classification | Developed image classification model using CNN to recognize buildings, forest, glacier, mountain, sea, and street images. | CNN
Deep Learning | Time series forecasting with LSTM | Forecast carbon monoxide emission using LSTM and did time series analysis. | Time series
Deep Learning | Text Classification with Dense, LSTM, Bi-LSTM, GRU, CNN, CNN + GRU | Developed text classification model to distinguish tweets into 4 emotions: joy, sadness, anger, and fear. | Text classification
NLP | NLP and Sentiment Analysis | Performed NLP analysis and text regression for sentiment analysis. | article, part 1, part 2
Inferential Statistics | Inferential Statistics, hypothesis testing, etc. | | “not yet published”
Dashboard | Shiny Dashboard | Visualized daily COVID cases in a dashboard. | Shiny Dashboard
Dashboard | Tableau Dashboard | Visualized spatiotemporal analysis of house prices. | “not yet published” (upcoming)
Web application | Streamlit | Streamlit as the chatbot interface for a RAG or LLM application. | https://github.com/rendy-k/LLM-RAG
API | FastAPI | | “not yet published”
SageMaker | SageMaker: classification | Developed and deployed a loan default probability classifier using AWS SageMaker. | classification
SageMaker | SageMaker: invoke model | Developed the API to invoke a deployed Machine Learning model. | invoke model
SageMaker | Multi-model deployment with SageMaker | Deployed multiple models on an AWS instance. | multi-model deployment
SageMaker | Recommender system | Built and deployed a recommender system to recommend anime titles using the AWS Factorization Machines algorithm. | recommender system
SageMaker | Time series forecasting | Built and deployed DeepAR to forecast the time series of New Delhi daily weather. | Deep AR
Large Language Model (LLM) | Retrieval-Augmented Generation (RAG) | Develop RAG to enhance LLMs with custom documents. Streamlit chatbot as the UI | article, repository
Table 2 Notebook collection

In the “Notebook/Repo” column, the URLs point to where the notebooks or repositories are stored. Some entries have no URL and are marked “not yet published” instead. These notebooks exist on a local computer as part of professional work and have not yet been adapted and published.

Data Science, Machine Learning

Machine Learning Notebooks Collection

This post shares my Machine Learning notebooks for predicting structured tabular data. There are three types of Machine Learning: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. The objective of supervised learning is to build a prediction model from a training dataset and use it to predict an unseen test dataset. Supervised learning can solve regression tasks (for continuous output) and classification tasks (for categorical output). Unsupervised learning aims to learn the patterns in a dataset and simplify the information through clustering and dimensionality reduction. Cluster analysis groups observations into clusters according to the similarity of their features. Dimensionality reduction reduces the number of dimensions, or features, of a dataset. I have previously written a post on basic Machine Learning here.

Data Science, Machine Learning

Hierarchical Clustering

Hierarchical clustering is an unsupervised Machine Learning method that identifies the closest clusters and groups them together. An article on the basics of Machine Learning can be found here. Hierarchical clustering repeats only 2 steps. First, detect the 2 or more closest points or clusters. Second, group them together. These two steps are iterated until all of the data points are grouped into clusters. The illustration below shows how hierarchical clustering groups data points and builds a dendrogram at the same time.
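
The two repeated steps can be sketched in a few lines of plain Python. This is only an illustration, not the code from the notebook: the function names are hypothetical, the distance between two clusters is assumed to be the distance between their closest points (single linkage), and exactly two clusters are merged per iteration.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Cluster distance = distance between the closest pair of points (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters and the merge history, which is the
    information a dendrogram is drawn from.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > n_clusters:
        # Step 1: detect the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = single_linkage(clusters[i], clusters[j])
        # Step 2: group them together.
        history.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, history
```

Calling `agglomerate(points, n_clusters=1)` runs the iteration to the end, and each entry of `history` records which clusters were merged and at what distance.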

Data Science, Machine Learning

K-means

k-means is an unsupervised Machine Learning method for cluster analysis. More discussion about Machine Learning can be found here. Cluster analysis divides a group of data points into clusters according to their similarity. k-means places k centroids among the data points, and each data point is assigned to its nearest centroid. Each centroid represents the center of its cluster. The centroids are positioned so that the distance from each centroid to the data points in its cluster is as small as possible. The value of k is the number of centroids, and therefore the number of clusters, into which the data points are divided.
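
A minimal sketch of this procedure in plain Python is shown below. It alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The function name is hypothetical, and taking the first k points as starting centroids is a simplifying assumption made here for reproducibility; libraries usually use a smarter initialization.

```python
from math import dist


def kmeans(points, k, n_iter=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[c]  # keep the old centroid if its cluster is empty
            for c, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters
```

The result depends on the starting centroids, so in practice the algorithm is often run several times with different starting points.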

Data Science, Machine Learning

Linear Regression (Supervised Machine Learning)

Linear regression is a method in Machine Learning. The same term is also used in Statistics. To read about Machine Learning basics, please find my article here. Linear regression finds the relationship between one or more continuous predictor variables and the dependent variable to predict. Simple linear regression has only one predictor, or independent variable, to predict the dependent variable. Plot the variables and draw a fit line whose distance to the data points is as small as possible. The distance from the fit line to each data point represents the prediction error.
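
As a minimal sketch, the fit line of a simple linear regression can be computed with the ordinary least squares formulas, which minimize the sum of squared vertical distances between the line and the data points. The function names below are hypothetical, and the sketch assumes one predictor with at least two distinct values.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Prediction error of each point: actual value minus value on the fit line."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
```

Given two equal-length lists of measurements, `fit_line` returns the coefficients and `residuals` returns the per-point prediction errors described above.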

Below is the data of 20 apples with their mass (gram) and volume (cm³). We want to use linear regression to create a model, or formula, that estimates the volume of an apple from its mass.

Data Science, Machine Learning

Naïve Bayes

Naïve Bayes is a supervised learning method that performs classification based on categorical parameters using Bayes' Theorem. If you want to read about basic Machine Learning first, please refer to the article here and come back to this one later. Do you still remember how to calculate probability from high school? For instance, the probability of a die showing 5 is 1/6. That knowledge finally becomes useful in machine learning.

It is called “naïve” because it assumes mutual independence among the predictors. For example, a monkey is identified as having 2 arms, being brown, and being good at jumping. All of these characteristics contribute independently to identifying an animal as a monkey, even though they may actually depend on one another.
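
The independence assumption can be made concrete with a small sketch: the score of each class is its prior probability multiplied by the per-feature probabilities, as if the features had nothing to do with one another. The function names are hypothetical, and the sketch uses raw counts without smoothing, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return the most probable class and the unnormalized score of every class."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # "Naive" step: multiply P(value | class) feature by feature,
            # treating the features as independent of each other.
            score *= feature_counts[label][i][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores
```

Each row is a tuple of categorical values such as number of arms, color, and jumping ability, and each label is the class of that row.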

Data Science, Machine Learning

Decision Tree/Classification and Regression Tree and Random Forest

This article discusses two Machine Learning methods: Decision Tree (also known as Classification and Regression Tree) and, later, Random Forest. A Decision Tree, as its name suggests, takes the form of a tree to decide which class a new observation belongs to. We already use this kind of decision making in daily life: if A happens, then do B; otherwise, do C. In Machine Learning, the Decision Tree is built from the data that the user feeds to the model. If you are not familiar with Machine Learning basics, please find an article discussing them here, and then come back to this article.

The following data shows, in a simple way, how a Decision Tree works. The plot shows purchased goods with their price on the x axis and quality on the y axis. The goods are divided into “sold out” and “not sold out”. This article builds a Decision Tree to detect whether an item will be sold out according to its price and quality.
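
One way to see how such a tree is built from data is to search for the single best "if value <= threshold" question. The sketch below is an assumption-laden illustration, not the article's implementation: the function names are hypothetical, and it scores candidate splits with Gini impurity, which is one common criterion. A full Decision Tree would repeat the same search inside each resulting group.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels are the same, higher when mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return ((feature_index, threshold), score) for the purest two-way split."""
    best, best_score = None, float("inf")
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            # Weighted average impurity of the two groups.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best, best_score
```

With rows given as `(price, quality)` tuples and labels such as "sold out" and "not sold out", the returned pair says which variable to test and at which value.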

Data Science, Machine Learning

K Nearest Neighbors (kNN)

kNN is a supervised machine learning method that assigns a class to a new observation according to its distance to the nearest neighboring training data points. The k defines the number of nearest neighbors, or training data points, used to classify the observation. This article assumes an understanding of basic Machine Learning. If you need it, please read this article on Machine Learning basics and then come back here.

The figure below illustrates how kNN classifies a new observation according to existing training data. The yellow square with a question mark inside represents the new observation that we want to classify. The blue circles and green triangles are labeled training data. They are placed in a 2-dimensional diagram, which illustrates a training set with 2 parameters and 2 classes: blue circle and green triangle.
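
The idea can be sketched in a few lines of plain Python: sort the training points by distance to the new observation, keep the k nearest, and let them vote. The function name is hypothetical, and Euclidean distance with a simple majority vote is an assumption made for this illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort labeled training points by Euclidean distance to the query.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda point_label: dist(point_label[0], query),
    )[:k]
    # Count the labels of the k nearest neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

An odd k is a common choice for two classes because it avoids tied votes.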

Data Science, Machine Learning

Introduction to Machine Learning

The term “machine learning” may sound like a robot-like machine learning something. In practice, machine learning means that the user feeds a large amount of training data into the machine. The machine learns the patterns in the data and, as a result, creates a model. That model can then classify, cluster, or predict test data according to what it learned from the training data.

There are three kinds of machine learning: supervised learning, unsupervised learning, and reinforcement learning. This article discusses supervised and unsupervised learning only. Supervised learning classifies or predicts test data based on labeled training data. It learns the labels of the training dataset in order to classify or predict a new dataset according to its variables. Supervised learning can do classification and regression. If the label is categorical, the task is called classification. If the label is a continuous number, it is called regression.
