Unsupervised Machine Learning: K-Means Clustering With R
Unsupervised machine learning uses algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.
One example of unsupervised learning in practice is K-Means clustering. Clustering is a broad set of techniques for finding subgroups of observations within a data set, and K-Means is one of the most widely used (a minimal example is sketched after the list below).
K-Means clustering is commonly used for:
- Customer Segmentation
- Fraud Detection
- Document Classification
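Before turning to the movie data, here is a minimal, self-contained sketch of the mechanics using R's built-in USArrests data. This toy dataset is chosen only for illustration and is not part of the project.
# A minimal sketch of K-Means on R's built-in USArrests data, just to show the mechanics
data("USArrests")
toy <- scale(na.omit(USArrests))   # standardize so no variable dominates the distance
toy_km <- kmeans(toy, centers = 2, nstart = 25)
toy_km$size      # number of observations in each cluster
toy_km$centers   # cluster centroids in standardized units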
This article is based on a short course project in my Data Science Specialization. The dataset used in this demo contains movie ratings from customers.
The data set I’m using is called customer_movie_rating. In this project, I clustered the rating data for each movie genre in the dataset. Here are the steps I followed:
Step 1: Import dataset
# Packages that I used
library(tidyverse)
library(gridExtra)
library(factoextra)
# Import the data and take a quick look at it
customer_movie_rating <- read.csv('customer_movie_rating.csv')
glimpse(customer_movie_rating)
# Copy the data, drop rows with missing values, and standardize the ratings
df <- customer_movie_rating
df <- na.omit(df)
df <- scale(df)
head(df)
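As an optional sanity check (not in the original write-up), each column of the scaled matrix should now have a mean of roughly 0 and a standard deviation of 1:
# Optional sanity check: after scale(), every column has mean ~0 and sd = 1
round(colMeans(df), 3)
apply(df, 2, sd)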
Step 2: K-Means Clustering
# K-Means with 2 clusters; nstart = 25 tries 25 random starts and keeps the best solution
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
fviz_cluster(k2, data = df)
# Plot the clusters on the Horror vs. Romcom ratings, labeled by customer id
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         customer = row.names(customer_movie_rating)) %>%
  ggplot(aes(Horror, Romcom, color = factor(cluster), label = customer)) +
  geom_text()
Interpretation of the visualization: Cluster 1 contains customers whose Horror ratings are higher than their Romcom ratings, while Cluster 2 contains customers whose Romcom ratings are higher than their Horror ratings.
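To back up this reading numerically (a quick check, not part of the original steps), the cluster sizes and centroids can be inspected directly; because df was standardized, a positive centroid value means the cluster rates that genre above the overall average.
# Cluster sizes and centroids (in standardized units, since df was scaled)
k2$size
k2$centers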
# This chunk fits K-Means for several values of k so the solutions can be compared
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Step 3: Determining Optimal Clusters
To determine the optimal number of clusters I use the Elbow Method. The basic idea behind partitioning methods such as K-Means is to define clusters so that the total intra-cluster variation, i.e. the total within-cluster sum of squares (the sum of squared distances from each observation to its cluster centroid), is as small as possible.
# Compute the total within-cluster sum of squares for k = 1..15
set.seed(123)
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
# fviz_nbclust() produces the same elbow diagnostic in one call, plus the average silhouette method
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
fviz_nbclust(df, kmeans, method = "silhouette")
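As an extra, optional cross-check beyond the elbow and silhouette plots, fviz_nbclust() also supports the gap statistic (this can be slow, since it resamples the data):
# Optional: gap statistic as a third check on the number of clusters
set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat", nboot = 50)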
Step 4: Final K-Means Result
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
# Average rating per genre within each of the four clusters, on the original scale
customer_movie_rating %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
Interpretation:
Cluster 1: customers who give low ratings to Romcom and Fantasy movies; their highest-rated genre is Action.
Cluster 2: customers who give low ratings to Romcom and Comedy movies; their highest-rated genre is Horror.
Cluster 3: customers who give low ratings to Romcom and Fantasy movies; their highest-rated genre is Comedy.
Cluster 4: customers who give low ratings to Horror movies; their highest-rated genre is Comedy.
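Finally, the four clusters can also be visualized in the same way as the earlier solutions, simply by reusing fviz_cluster with the final model (an optional extra plot):
# Visualize the final 4-cluster solution on the scaled data
fviz_cluster(final, data = df)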
In conclusion, K-Means clustering makes it easy to segment unlabeled data: in this project it grouped the customers in the movie rating dataset according to how they rate each movie genre.