Hierarchical clustering is an unsupervised machine learning method that identifies the closest clusters and groups them together. An article on Machine Learning basics can be found here. Hierarchical clustering repeats only two steps. First, detect the two (or more) closest points or clusters. Second, group them together. These two steps are iterated until all of the data points are grouped into clusters. The illustration below describes how hierarchical clustering groups data points and builds a dendrogram at the same time.
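The two repeated steps above can be sketched in plain Python. This is a minimal single-linkage sketch on a handful of made-up 2-D points, not a production implementation; the point coordinates and the `target_clusters` stopping rule are illustrative assumptions.

```python
from itertools import combinations

def agglomerate(points, target_clusters=1):
    """Single-linkage agglomerative clustering on 2-D points:
    repeatedly find the two closest clusters and merge them."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []  # record of merges, akin to the dendrogram's structure
    while len(clusters) > target_clusters:
        # Step 1: detect the two closest clusters (single linkage:
        # cluster distance = squared distance of their closest pair).
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Step 2: group them together.
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
final, merges = agglomerate(points, target_clusters=2)
```

Running until `target_clusters=1` would merge everything into one cluster; the list of merges then traces the full dendrogram from bottom to top.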
K-means
k-means is an unsupervised Machine Learning method to perform cluster analysis. More discussion about Machine Learning can be found here. Cluster analysis divides a group of data points into clusters according to their similarity pattern. k-means locates k centroids among the data points, and each data point is assigned to its nearest centroid. The location of each centroid represents the center of its cluster. The centroids are placed so that the distance from each centroid to the data points in its cluster is as small as possible. The number k is the number of centroids, i.e. the number of clusters the data points will be divided into.
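The centroid-placement loop described above can be sketched with Lloyd's algorithm, the usual way k-means is implemented. The 2-D points below are hypothetical, chosen only to show two well-separated clusters; the iteration count and seed are likewise illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm sketch: assign points to the nearest
    centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids among the data
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i:
                          (p[0] - centroids[i][0]) ** 2 +
                          (p[1] - centroids[i][1]) ** 2)
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = (sum(p[0] for p in g) / len(g),
                                sum(p[1] for p in g) / len(g))
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, groups = kmeans(points, k=2)
```

On this toy data the two centroids settle at the means of the two point groups, which is exactly the "center of their clusters" property described above.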
Linear Regression (Supervised Machine Learning)
Linear regression is a method in Machine Learning. The same term is also used in statistics. To read about Machine Learning basics, please find my article here. Linear regression finds the relationship between one or more continuous predictor variables and the dependent variable to be predicted. Simple linear regression has only one predictor (independent) variable to predict the dependent variable. Plot the variables and draw a fit line whose distance to the data points is as small as possible. The distance of the fit line to each data point represents the prediction error.
Below is data for 20 apples with their mass (gram) and volume (cm3). Now, we want to create a model, or formula, to estimate the volume of an apple from its mass using linear regression.
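The fit line can be computed directly with ordinary least squares. A minimal sketch follows; the five mass/volume pairs are hypothetical stand-ins for the 20-apple table, not the article's actual measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares for simple linear regression:
    slope = cov(x, y) / var(x); intercept = mean_y - slope * mean_x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical apple data (mass in grams, volume in cm3),
# standing in for the article's table.
mass = [100, 120, 140, 160, 180]
volume = [118, 141, 165, 187, 212]

slope, intercept = fit_line(mass, volume)
# The fitted model estimates volume from mass, e.g. for a 150 g apple:
predicted = slope * 150 + intercept
```

The residual `volume[i] - (slope * mass[i] + intercept)` for each apple is exactly the prediction error that the fit line minimizes (in the least-squares sense).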
Naïve Bayes
Naïve Bayes is a supervised learning method for classification based on categorical parameters using Bayes' Theorem. If you want to read about basic Machine Learning, please refer here, and come back to this article later. Still remember how to calculate probability from high school? For instance, the probability of a die showing 5 is 1/6. That lesson can finally be useful in machine learning.
It is called "naïve" because it assumes mutual independence among the predictors. For example, a monkey is identified as having 2 arms, being brown, and being good at jumping. All these characteristics contribute independently to identifying that an animal is a monkey, even though they actually depend on one another.
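The independence assumption means the class score is just the prior times the product of per-feature probabilities. A toy sketch with add-one smoothing follows; the tiny animal dataset is invented for illustration, not taken from any article's data.

```python
from collections import Counter

def naive_bayes(train, query):
    """Categorical Naive Bayes with add-one smoothing (toy sketch).
    train: list of (features_dict, label); query: features_dict."""
    label_counts = Counter(label for _, label in train)
    scores = {}
    for label, n in label_counts.items():
        rows = [f for f, l in train if l == label]
        score = n / len(train)  # prior P(label)
        for feat, value in query.items():
            matches = sum(1 for f in rows if f[feat] == value)
            values = {f[feat] for f, _ in train}  # distinct values of feat
            # P(value | label), smoothed so unseen values never zero out
            score *= (matches + 1) / (n + len(values))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical labeled animals echoing the monkey example above.
train = [
    ({"arms": 2, "color": "brown", "jumps": True}, "monkey"),
    ({"arms": 2, "color": "brown", "jumps": True}, "monkey"),
    ({"arms": 0, "color": "grey", "jumps": False}, "elephant"),
    ({"arms": 0, "color": "grey", "jumps": False}, "elephant"),
]
label, scores = naive_bayes(train, {"arms": 2, "color": "brown", "jumps": True})
```

Each feature multiplies the score independently, which is the "naïve" assumption in code form: no term looks at any other feature.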
Decision Tree/Classification and Regression Tree and Random Forest
This article discusses two Machine Learning methods: Decision Tree (also known as Classification and Regression Tree) and, later, Random Forest. A decision tree, as its name suggests, takes the form of a tree to decide which class new observations belong to. We have actually used this decision method in daily life: if A happens, then do B; or else, do C. Decision Tree in Machine Learning builds such a tree according to the data the user feeds to the model. If you are not familiar with Machine Learning basics, please find an article discussing them here. After that, come back to this article.
How a Decision Tree works is shown in a simple way using the following data. The plot shows purchased goods plotted by their price on the x axis and quality on the y axis. The purchased goods are divided into "sold out" and "not sold out". This article will build a Decision Tree to detect whether an item will be sold out or not according to its price and quality.
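The root of such a tree is a single "if value <= threshold" split. This toy sketch learns only that one-level split (a decision stump) on invented (price, quality, label) rows, not the article's plotted data; a full CART would also try the opposite branch direction and keep splitting each side recursively.

```python
def stump_accuracy(rows, feat, threshold):
    # Rule: predict "sold out" when the feature value <= threshold.
    hits = sum((r[feat] <= threshold) == (r[2] == "sold out") for r in rows)
    return hits / len(rows)

def fit_stump(rows):
    """Root split of a decision tree: choose the feature (0 = price,
    1 = quality) and threshold with the highest training accuracy."""
    candidates = [(feat, t)
                  for feat in (0, 1)
                  for t in sorted({r[feat] for r in rows})]
    return max(candidates, key=lambda c: stump_accuracy(rows, c[0], c[1]))

# Hypothetical (price, quality, label) purchases for illustration.
rows = [
    (10, 9, "sold out"), (12, 8, "sold out"), (15, 9, "sold out"),
    (30, 4, "not sold out"), (35, 3, "not sold out"), (40, 5, "not sold out"),
]
feat, threshold = fit_stump(rows)
```

On this data the stump recovers the everyday rule from above: if price <= 15, then "sold out"; or else, "not sold out". A Random Forest would train many such trees on random subsets of rows and features and let them vote.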
K Nearest Neighbors (kNN)
kNN is a supervised machine learning method that determines the class of a new observation according to its distance to the nearest neighboring training data. The k defines the number of nearest neighbors (training data points) used to classify the observation. This article assumes that we have understood basic Machine Learning. If not, please go to this article discussing Machine Learning basics, and then come back here.
The figure below illustrates how kNN classifies a new observation according to existing training data. The yellow square with a question mark represents the new observation that we want to classify. The blue circles and green triangles are labeled training data. They are located in a 2-dimensional diagram, illustrating a set of training data with 2 parameters and 2 classes: blue circle and green triangle.
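The majority vote among the k nearest neighbors is short enough to sketch directly. The labeled points below are hypothetical stand-ins for the figure's circles and triangles, and Euclidean distance is assumed.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a 2-D point by majority vote among its k nearest
    labeled training points (squared Euclidean distance)."""
    by_distance = sorted(train, key=lambda item:
                         (item[0][0] - query[0]) ** 2 +
                         (item[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: ((x, y), class).
train = [((1, 1), "circle"), ((1, 2), "circle"), ((2, 1), "circle"),
         ((6, 6), "triangle"), ((6, 7), "triangle"), ((7, 6), "triangle")]

# The "yellow square" query point, placed near the circle group.
label = knn_classify(train, query=(2, 2), k=3)
```

Choosing an odd k (here 3) for a two-class problem avoids tied votes, which is a common practical convention with kNN.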
Introduction to Machine Learning
The term "machine learning" sounds like a robot-looking machine learning something. Actually, machine learning is about the user feeding a large amount of training data into the machine to learn from. The machine then learns the pattern of the data and, as a result, creates a model. The model from machine learning can classify, cluster, and predict test data according to the training data.
There are three kinds of machine learning: supervised learning, unsupervised learning, and reinforcement learning. This article discusses supervised and unsupervised learning only. Supervised learning classifies or predicts test data from labeled training data: it learns the labels of the training dataset to classify or predict a new dataset according to its variables. Supervised learning can do classification and regression. If the label is categorical, it is called classification; if the label is a continuous number, it is called regression.