Naïve Bayes is a supervised learning algorithm that classifies observations based on categorical predictors using Bayes' Theorem. If you are new to Machine Learning, please read an introduction to it first and come back to this article later. Do you still remember how to calculate probability from your high school lessons? For instance, the probability that a fair die shows 5 is 1/6. That knowledge finally becomes useful in machine learning.
It is called “naïve” because it assumes that the predictors are mutually independent given the class. For example, a monkey may be identified as having 2 arms, being brown, and being good at jumping. All these characteristics are treated as contributing independently to the probability that an animal is a monkey, even though in reality they may depend on one another.
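For reference, Bayes' Theorem and the naïve independence assumption can be written as below. The symbols are notation chosen here for illustration: C stands for the class (for example "monkey") and x_1, ..., x_n for the observed predictor values.

```latex
P(C \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid C)\, P(C)}{P(x_1, \dots, x_n)},
\qquad
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
```

Combining the two, the class score is proportional to P(C) multiplied by the product of the individual P(x_i | C) terms, and the predicted class is the one with the largest score.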
We will practice Naïve Bayes using the “Hotel Customer” dataset I have made. The data contains information about the customers of a hotel. There are 500 observations (rows) in this data frame, and each observation represents one customer. There are 12 variables, with the following structure.
'data.frame': 500 obs. of 12 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 1 2 2 ...
$ Age : num 33 30 37 34 33 34 35 30 39 34 ...
$ Purpose : Factor w/ 2 levels "Business","Personal": 2 2 2 2 2 2 2 2 1 2 ...
$ RoomType : Factor w/ 3 levels "Double","Family",..: 1 2 3 1 1 1 1 2 1 1 ...
$ Food : num 21 32 46 72 84 67 56 10 73 97 ...
$ Bed : num 53 32 25 30 7 46 0 19 12 30 ...
$ Bathroom : num 24 18 29 15 43 16 0 1 62 26 ...
$ Cleanness : num 44 44 20 55 78 61 9 53 65 59 ...
$ Service : num 46 74 24 38 51 44 32 58 56 46 ...
$ Satisfaction: Factor w/ 3 levels "Dissatisfied",..: 1 1 1 1 2 2 1 1 2 2 ...
$ Repeat : Factor w/ 2 levels "No","Repeat": 2 1 1 2 2 1 1 2 2 1 ...
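The script later in this article relies on two data frames, trainCustomer and testCustomer, whose creation is not shown. The sketch below is one possible way to obtain such a split, assuming 400 training rows and 100 test rows (sizes consistent with the figures reported later). The data frame `customer`, its random contents, and the seed are hypothetical stand-ins, not the original data.

```r
# Hypothetical stand-in for the hotel data: 500 rows, only two columns for brevity.
set.seed(42)  # arbitrary seed so the split is reproducible
customer <- data.frame(
  Id = 1:500,
  Repeat = factor(sample(c("No", "Repeat"), 500, replace = TRUE))
)

# Draw 400 row indices at random for training; the remaining 100 rows form the test set.
trainIndex <- sample(nrow(customer), size = 400)
trainCustomer <- customer[trainIndex, ]
testCustomer  <- customer[-trainIndex, ]

dim(trainCustomer)
dim(testCustomer)
```

Sampling row indices without replacement keeps the two sets disjoint, so no customer is used both for training and for testing.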

We will focus only on predicting whether a customer will come back to the hotel, based on the customer's gender and purpose of stay. So, for this practice, please ignore the remaining variables. “Gender” has 2 levels, “Female” and “Male”, and “Purpose” has 2 levels, “Business” and “Personal”, which gives 4 combinations. The column “Repeat” is a factor with the levels “Repeat” and “No”. “Repeat” means that the customer stayed in the hotel again. “No” means that the customer stayed only once and never came back. The table below shows the number of repeat and non-repeat customers according to gender and purpose.
| Gender and Purpose (A) | Repeat (B) | No Repeat (C) | Repeat Percentage, D = B/(B+C) | No Repeat Percentage, E = C/(B+C) | Gender and Purpose Percentage, (B+C)/400 |
| --- | --- | --- | --- | --- | --- |
| Female-Business | 50 | 17 | 74.6% | 25.4% | 16.75% |
| Female-Personal | 26 | 94 | 21.7% | 78.3% | 30% |
| Male-Business | 39 | 11 | 78.0% | 22.0% | 12.5% |
| Male-Personal | 55 | 108 | 33.7% | 66.3% | 40.75% |
| Total | 170 | 230 | 42.5% | 57.5% | 100% |

The table above is derived from the training data, and these counts are what the Machine Learning model learns from. Let A denote the event “female with a business purpose” and B the event “repeat”. The table shows that 16.75% of the hotel customers in the training data are female with a business purpose: P(A) = 67/400 = 16.75%. It also shows that 42.5% of all customers in the training data stayed at the hotel again: P(B) = 170/400 = 42.5%.
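The arithmetic behind these percentages can be reproduced with a few lines of base R. This is only a sketch that re-enters the counts from the table by hand and applies Bayes' Theorem directly; the variable names are chosen here for illustration and it does not use any modelling package.

```r
# Training counts copied from the table above (rows: Gender-Purpose, columns: outcome).
counts <- matrix(
  c(50,  17,
    26,  94,
    39,  11,
    55, 108),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Female-Business", "Female-Personal", "Male-Business", "Male-Personal"),
    c("Repeat", "No")
  )
)

total      <- sum(counts)                              # 400 training customers
prior      <- colSums(counts) / total                  # P(Repeat) and P(No)
evidence   <- rowSums(counts) / total                  # P(Gender-Purpose)
likelihood <- sweep(counts, 2, colSums(counts), "/")   # P(Gender-Purpose | outcome)

# Bayes' Theorem: P(outcome | group) = P(group | outcome) * P(outcome) / P(group)
posterior <- sweep(likelihood, 2, prior, "*") / evidence

round(posterior, 4)
```

For Female-Business the posterior works out to 50/67 for Repeat and 17/67 for No, which are the 74.6% and 25.4% shown in the table.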
From the table above, we can conclude that if a new customer is a female who stays in the hotel for business purposes, she is likely to come back to the hotel in the future. In the existing data, 74.6% of female customers with a business purpose returned to stay in the hotel again. This is how Naïve Bayes works. Now, let’s examine the script.
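# Load dplyr for %>% and mutate()
library(dplyr)
# Create new training and test data by adding a new column combining "Gender" and "Purpose"
testBayes <- testCustomer %>% mutate(GenderPurpose = paste(Gender, Purpose, sep = "-"))
trainBayes <- trainCustomer %>% mutate(GenderPurpose = paste(Gender, Purpose, sep = "-"))
# Cross-tabulate GenderPurpose (column 13) against Repeat (column 12)
table(trainBayes[,13], trainBayes[,12])
# Load the naivebayes package (install it once with install.packages("naivebayes"))
library(naivebayes)
# Create the Naive Bayes model (laplace defaults to 0)
NBmodel <- naive_bayes(Repeat ~ GenderPurpose, data = trainBayes)
# Laplace smoothing is an argument of naive_bayes(), not of predict(), so fit a second model for it
NBmodelLaplace <- naive_bayes(Repeat ~ GenderPurpose, data = trainBayes, laplace = 1)
# Predict the test data: class labels first, then class probabilities
predict_NB <- predict(NBmodel, testBayes[13])
predict_NB2 <- predict(NBmodel, testBayes[13], type = "prob")
predict_NB3 <- predict(NBmodelLaplace, testBayes[13], type = "prob")
NBResult <- head(data.frame(testBayes[c(1,13,12)], predict_NB, predict_NB2, predict_NB3),20)
colnames(NBResult)[5:8] <- c("Not Repeat%", "Repeat%","Not Repeat Laplace%", "Repeat Laplace%")
NBResult[,c(1:6)]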
Here are the first 20 prediction results.
>NBResult[,c(1:6)]
Id GenderPurpose Repeat predict_NB Not Repeat% Repeat%
1 346 Female-Business No Repeat 0.2537313 0.7462687
2 186 Male-Personal Repeat No 0.6625767 0.3374233
3 280 Male-Personal No No 0.6625767 0.3374233
4 368 Male-Personal Repeat No 0.6625767 0.3374233
5 217 Female-Personal No No 0.7833333 0.2166667
6 115 Female-Business No Repeat 0.2537313 0.7462687
7 127 Male-Personal Repeat No 0.6625767 0.3374233
8 408 Female-Business No Repeat 0.2537313 0.7462687
9 281 Male-Business Repeat Repeat 0.2200000 0.7800000
10 219 Female-Business Repeat Repeat 0.2537313 0.7462687
11 225 Male-Personal No No 0.6625767 0.3374233
12 240 Female-Personal No No 0.7833333 0.2166667
13 6 Male-Personal No No 0.6625767 0.3374233
14 489 Female-Personal No No 0.7833333 0.2166667
15 414 Male-Business No Repeat 0.2200000 0.7800000
16 22 Female-Personal No No 0.7833333 0.2166667
17 14 Male-Personal No No 0.6625767 0.3374233
18 37 Male-Personal No No 0.6625767 0.3374233
19 137 Female-Personal No No 0.7833333 0.2166667
20 36 Female-Personal No No 0.7833333 0.2166667
> # check the accuracy percentage
> mean(predict_NB == testBayes[,12])
[1] 0.79
> table(predict_NB, testBayes[,12])
predict_NB No Repeat
No 56 14
Repeat 7 23
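The accuracy of 0.79 can be read directly off the confusion table above. The base R sketch below re-enters that table by hand and recomputes it; precision and recall for the “Repeat” class are added here as extra illustrations and are not part of the original script.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual).
confusion <- matrix(
  c(56, 14,
     7, 23),
  ncol = 2, byrow = TRUE,
  dimnames = list(predicted = c("No", "Repeat"), actual = c("No", "Repeat"))
)

# Share of all test customers whose predicted label matches the actual label.
accuracy  <- sum(diag(confusion)) / sum(confusion)                       # 79 / 100
# Of the customers predicted to repeat, the share that actually repeated.
precision <- confusion["Repeat", "Repeat"] / sum(confusion["Repeat", ])  # 23 / 30
# Of the customers that actually repeated, the share that was predicted to repeat.
recall    <- confusion["Repeat", "Repeat"] / sum(confusion[, "Repeat"])  # 23 / 37

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```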
In the example above, NBmodel is built with the default Laplace value of 0. In certain cases it is important to set Laplace to 1 or more. It can happen that, in the training data, a predictor value A never occurs together with a class B. The resulting model then assigns a probability of exactly 0 to class B whenever A is observed. Setting Laplace to a positive value adds a small count to every combination, so that combination receives a small non-zero probability instead.
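The effect of Laplace smoothing can be sketched in base R with standard additive smoothing. The counts below are hypothetical (they are not from the hotel data), the helper `smoothed_likelihood` is a name made up for this illustration, and the formula is assumed, not confirmed, to correspond to what the `laplace` argument does inside the package.

```r
# Hypothetical training counts: "Male-Business" never occurs together with "No".
counts <- matrix(
  c(20, 10,
    15,  0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Female-Business", "Male-Business"), c("Repeat", "No"))
)

# Conditional probabilities P(group | outcome) with additive (Laplace) smoothing:
# (count + laplace) / (outcome total + laplace * number of groups)
smoothed_likelihood <- function(counts, laplace = 0) {
  sweep(counts + laplace, 2, colSums(counts) + laplace * nrow(counts), "/")
}

smoothed_likelihood(counts, laplace = 0)["Male-Business", "No"]  # exactly 0
smoothed_likelihood(counts, laplace = 1)["Male-Business", "No"]  # 1 / 12
```

With laplace = 1 each column still sums to 1, so the smoothed values remain valid probabilities while the zero disappears.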

