What is Clustering?

Imagine you have a big box of mixed candies. Clustering is like sorting those candies into groups based on their colors. So, you might put all the red candies together, all the blue candies together, and so on. Each group of candies that are similar is called a “cluster.”

Types of Clustering

There are a few different ways to sort or cluster things:

  1. K-means Clustering: This is like having a set number of bins (say, 3 bins) that you choose up front. Each candy goes into the bin whose typical candy it resembles most, and the bins are adjusted until the sorting stops changing.
  2. Hierarchical Clustering: This is like building a family tree for candies. First, you find the two most similar candies and put them in a group. Then you keep merging the two most similar candies or groups, step by step, until all candies are part of one big group. You can stop at any level of the tree to get the number of groups you want.
  3. DBSCAN: This is like sorting candies based on how many friends (other candies) they have around them. If a candy has lots of similar candies around it, they all form a cluster. If a candy is alone, it might be considered an outlier and not part of any cluster.
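
To make the K-means idea a little more concrete, here is a minimal sketch of its "assign, then re-average" loop in plain NumPy. The function name kmeans_sketch, its parameters, and the six sample candies are made up for illustration; in practice you would normally reach for a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy K-means: repeat 'assign to nearest center' and 'move centers'."""
    rng = np.random.default_rng(seed)
    # Start with k distinct points picked at random as the bin centers
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest center
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty bin keeps its old center
                centers[j] = members.mean(axis=0)
    return labels, centers

# Hypothetical candies described by two numbers each (say, size and sweetness)
candies = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans_sketch(candies, k=2)
print(labels)  # the first three candies share one label, the last three the other
```

The label numbers themselves are arbitrary; what matters is which candies end up sharing a label.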

Where is Clustering Used?

Clustering is used in many places! Here are a few examples:

  • In Business: To group customers with similar buying habits.
  • In Schools: To group students with similar learning needs.
  • In Healthcare: To group patients with similar symptoms.

There are many more use cases like these.

How Clustering Helps Businesses

Imagine you are a data scientist or AI/ML engineer working with a client who owns a toy store, and your job is to help the client grow the business. The client has many customers, so how can you help? One option is clustering: you group the client's customers into different types based on what toys they buy. This helps you understand what kinds of toys each group likes. You can then:

  • Stock more of the toys that each group likes.
  • Create special promotions for different groups.
  • Improve customer satisfaction by giving them what they want.

How to Implement Clustering

Let’s use the toy store example:

  1. Collect Data: Gather information about what toys each customer buys.
  2. Choose a Clustering Method: Let’s use K-means for simplicity.
  3. Sort the Data: Use a computer program to sort the customers into clusters based on their buying habits.
  4. Analyze the Clusters: Look at what toys each cluster buys and see if there are any patterns.
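
The four steps above can be sketched in a few lines of pandas and scikit-learn. Everything in this example is hypothetical: the customer names, the two toy categories, and the purchase counts are invented, and two clusters are chosen only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Step 1 (hypothetical data): one row per customer and toy category
purchases = pd.DataFrame({
    "customer": ["ann", "ann", "bob", "bob", "cat", "cat", "dan", "dan"],
    "category": ["puzzles", "dolls"] * 4,
    "count":    [9, 1, 8, 0, 1, 7, 0, 9],
})

# Reshape so each customer is one row and each toy category is one column
features = purchases.pivot_table(index="customer", columns="category",
                                 values="count", aggfunc="sum", fill_value=0)

# Steps 2 and 3: K-means with two groups
model = KMeans(n_clusters=2, n_init=10, random_state=0)
features["cluster"] = model.fit_predict(features[["dolls", "puzzles"]])

# Step 4: average purchases per cluster show what each group likes
profile = features.groupby("cluster")[["dolls", "puzzles"]].mean()
print(features)
print(profile)
```

In this toy data, the puzzle buyers and the doll buyers should land in separate clusters, and the profile table shows the average basket of each.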

Business Use Case

Example: Imagine you run an online bookstore. You collect data on what books people buy.

  1. Step 1: Data Collection: You see that some customers buy mostly fiction, some buy textbooks, and others buy children’s books.
  2. Step 2: Choose Clustering Method: You decide to use K-means clustering.
  3. Step 3: Sort the Data: You use a program to sort customers into 3 clusters: Fiction Lovers, Students, and Parents.
  4. Step 4: Analyze the Clusters: You notice that Fiction Lovers often buy books on weekends, Students buy books at the start of the school year, and Parents buy children’s books before holidays.

Result: You can now tailor your marketing:

  • Offer weekend discounts on fiction books.
  • Have a back-to-school sale for textbooks.
  • Promote children’s books before holidays.

This way, clustering helps you understand your customers better and make smarter business decisions!

Simple Implementation Example (in Python)

Let’s dive deeper into the online bookstore use case, generate some data, and implement the three clustering methods: K-means, Hierarchical Clustering, and DBSCAN. We’ll visualize the data before and after clustering for each method.

We will create a dataset representing customers and their purchasing habits at an online bookstore. Each customer will have three features: the number of fiction books, textbooks, and children’s books they purchased. We will then visualize the initial data distribution.

  1. Generate Fiction Lovers: Customers who buy mostly fiction books.
  2. Generate Students: Customers who buy mostly textbooks.
  3. Generate Parents: Customers who buy mostly children’s books.

Dataset Explanation

  • Features:
    • Fiction: Number of fiction books purchased by a customer.
    • Textbooks: Number of textbooks purchased by a customer.
    • Children: Number of children’s books purchased by a customer.
  • TrueLabel:
    • 0: Fiction Lovers
    • 1: Students
    • 2: Parents

Step 1: Generate Sample Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)
n_customers = 100  # split three ways below, so 99 customers (33 per group) are actually generated

# Fiction Lovers: buy more fiction books
fiction_lovers = np.random.multinomial(30, [0.7, 0.2, 0.1], size=n_customers // 3)

# Students: buy more textbooks
students = np.random.multinomial(30, [0.1, 0.7, 0.2], size=n_customers // 3)

# Parents: buy more children's books
parents = np.random.multinomial(30, [0.2, 0.1, 0.7], size=n_customers // 3)

# Combine all customer data
data = np.vstack([fiction_lovers, students, parents])
labels_true = np.array([0] * (n_customers // 3) + [1] * (n_customers // 3) + [2] * (n_customers // 3))

# Convert to DataFrame for better visualization
df = pd.DataFrame(data, columns=['Fiction', 'Textbooks', 'Children'])
df['TrueLabel'] = labels_true
df
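
Before plotting, it can help to check what this recipe produces. The sketch below is a stand-alone sanity check, not part of the pipeline above: it uses NumPy's Generator API (an assumption on my part, in place of np.random.seed), so its numbers will not match the DataFrame above exactly. It shows that every simulated customer buys exactly 30 books and that 100 // 3 gives 33 customers per group.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumption: Generator API instead of np.random.seed
n_per_group = 100 // 3           # 33 customers per group, 99 in total

# Same recipe as the Fiction Lovers: 30 purchases split across
# (fiction, textbooks, children's books) with probabilities 0.7, 0.2, 0.1
sample = rng.multinomial(30, [0.7, 0.2, 0.1], size=n_per_group)

print(sample.shape)         # (33, 3)
print(sample.sum(axis=1))   # every entry is 30
print(sample.mean(axis=0))  # close to 30 * [0.7, 0.2, 0.1]
```

Because every row sums to 30, the three features are not independent: knowing two of them fixes the third.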

Step 2: Visualize initial data

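def plot_initial_data(df):
    plt.figure(figsize=(10, 4))
    # Color each customer by TrueLabel so the plot matches its title;
    # only two of the three features fit on a 2D scatter plot
    plt.scatter(df['Fiction'], df['Textbooks'], c=df['TrueLabel'],
                cmap='viridis', marker='o')
    plt.xlabel('Fiction Books')
    plt.ylabel('Textbooks')
    plt.title('Customer Purchase Data (True Labels)')
    plt.colorbar(ticks=[0, 1, 2], label='True Label')
    plt.clim(-0.5, 2.5)
    plt.show()

plot_initial_data(df)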

Step 3: Apply K-means Clustering

from sklearn.cluster import KMeans

# Apply K-means clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['KMeansLabel'] = kmeans.fit_predict(df[['Fiction', 'Textbooks', 'Children']])

# Visualize K-means clustering
def plot_kmeans_clusters(df):
    plt.figure(figsize=(10, 6))
    plt.scatter(df['Fiction'], df['Textbooks'], c=df['KMeansLabel'], 
                cmap='viridis', marker='o')
    plt.xlabel('Fiction Books')
    plt.ylabel('Textbooks')
    plt.title('K-means Clustering')
    plt.colorbar(ticks=[0, 1, 2], label='Cluster')
    plt.clim(-0.5, 2.5)
    plt.show()

plot_kmeans_clusters(df)
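
K-means numbers its clusters arbitrarily, so cluster 0 is not necessarily the Fiction Lovers. One simple way to give the clusters meaningful names, sketched below, is to look at each cluster's center (the average basket of its members) and name the cluster after its largest column. The block regenerates similar data with its own seed so it can run on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cols = ["Fiction", "Textbooks", "Children"]
X = pd.DataFrame(
    np.vstack([
        rng.multinomial(30, [0.7, 0.2, 0.1], size=33),  # fiction lovers
        rng.multinomial(30, [0.1, 0.7, 0.2], size=33),  # students
        rng.multinomial(30, [0.2, 0.1, 0.7], size=33),  # parents
    ]),
    columns=cols,
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One row per cluster: the average number of books of each type
centers = pd.DataFrame(km.cluster_centers_, columns=cols)

# Name each cluster after the book type its members buy most
names = centers.idxmax(axis=1)
print(centers.round(1))
print(names)
```

On data this well separated, each of the three book types should end up naming exactly one cluster.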

Step 4: Apply Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

# Apply Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=3)
df['HierarchicalLabel'] = hierarchical.fit_predict(df[['Fiction', 'Textbooks', 'Children']])

# Visualize Hierarchical clustering
def plot_hierarchical_clusters(df):
    plt.figure(figsize=(10, 4))
    plt.scatter(df['Fiction'], df['Textbooks'], c=df['HierarchicalLabel'], cmap='viridis', marker='o')
    plt.xlabel('Fiction Books')
    plt.ylabel('Textbooks')
    plt.title('Hierarchical Clustering')
    plt.colorbar(ticks=[0, 1, 2], label='Cluster')
    plt.clim(-0.5, 2.5)
    plt.show()

plot_hierarchical_clusters(df)
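
To see the "family tree" behind hierarchical clustering, SciPy can return the full sequence of merges. The sketch below uses six hypothetical customers and Ward linkage (which is also the default linkage of AgglomerativeClustering), then cuts the tree into three groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six hypothetical customers: (fiction, textbooks, children's books) counts
X = np.array([
    [21, 6, 3], [22, 5, 3],   # fiction-heavy
    [3, 21, 6], [2, 22, 6],   # textbook-heavy
    [6, 3, 21], [5, 4, 21],   # children's-book-heavy
])

# Each row of Z records one merge: (group a, group b, distance, new group size)
Z = linkage(X, method="ward")
print(Z.shape)  # (5, 4): n - 1 merges for n = 6 customers

# Cut the tree so that at most 3 groups remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

The first merges join the closest pairs of customers; the last merge joins everything into one group, which is the root of the tree.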

Step 5: Apply DBSCAN Clustering

from sklearn.cluster import DBSCAN

# Apply DBSCAN clustering; points labeled -1 are noise (outliers), not a cluster
dbscan = DBSCAN(eps=5, min_samples=5)
df['DBSCANLabel'] = dbscan.fit_predict(df[['Fiction', 'Textbooks', 'Children']])

# Visualize DBSCAN clustering
def plot_dbscan_clusters(df):
    plt.figure(figsize=(10, 4))
    plt.scatter(df['Fiction'], df['Textbooks'], c=df['DBSCANLabel'], cmap='viridis', marker='o')
    plt.xlabel('Fiction Books')
    plt.ylabel('Textbooks')
    plt.title('DBSCAN Clustering')
    plt.colorbar(label='Cluster')
    plt.show()

plot_dbscan_clusters(df)
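
Unlike the other two methods, DBSCAN does not take a number of clusters, and it marks isolated points with the label -1. The small sketch below, on hypothetical data with one deliberately isolated customer, counts the clusters and the noise points. The eps and min_samples values are chosen by hand for this toy data and are not a recommendation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [21, 6, 3], [22, 5, 3], [20, 7, 3], [21, 5, 4],   # dense group
    [3, 21, 6], [2, 22, 6], [4, 20, 6], [3, 22, 5],   # dense group
    [10, 10, 10],                                      # isolated customer
])

# A point is a core point if at least min_samples points (itself included)
# lie within distance eps of it
labels = DBSCAN(eps=3, min_samples=3).fit_predict(X)

n_noise = int((labels == -1).sum())
n_clusters = len(set(labels.tolist()) - {-1})
print(labels)
print("clusters:", n_clusters, "noise points:", n_noise)  # 2 clusters, 1 noise point
```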

This code generates customer purchasing data, applies K-means, Hierarchical Clustering, and DBSCAN, and then visualizes the results. Each method gets its own plot so you can compare how it groups the same customers. Note that the cluster numbers are arbitrary: cluster 0 from one method need not correspond to cluster 0 from another, or to TrueLabel 0.
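
Because this synthetic dataset comes with known groups, the results can also be scored instead of only eyeballed. One option, sketched below, is the adjusted Rand index from scikit-learn, which ignores how clusters are numbered: 1.0 means perfect agreement with the true groups and values near 0 mean random agreement. The block regenerates similar data with its own seed so that it runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = np.vstack([
    rng.multinomial(30, [0.7, 0.2, 0.1], size=33),  # fiction lovers
    rng.multinomial(30, [0.1, 0.7, 0.2], size=33),  # students
    rng.multinomial(30, [0.2, 0.1, 0.7], size=33),  # parents
])
y_true = np.repeat([0, 1, 2], 33)

predictions = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
}

# Compare each method's grouping with the known groups
scores = {name: adjusted_rand_score(y_true, pred)
          for name, pred in predictions.items()}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

Real customer data has no TrueLabel column, so this kind of check is only available on synthetic or hand-labeled data.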

Future Work

After completing this clustering project, there are several directions for future work that can further enhance and expand your analysis. Here are some suggestions:

Parameter Tuning: Experiment with different parameters for K-means (number of clusters), DBSCAN (epsilon and minimum samples), and Hierarchical Clustering (linkage criteria).
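
As a starting point for tuning the number of clusters, the sketch below tries several values of k and scores each with the silhouette coefficient (higher means tighter, better-separated clusters). This is one common heuristic among several, and the range of k values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([
    rng.multinomial(30, [0.7, 0.2, 0.1], size=33),  # fiction lovers
    rng.multinomial(30, [0.1, 0.7, 0.2], size=33),  # students
    rng.multinomial(30, [0.2, 0.1, 0.7], size=33),  # parents
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On this synthetic data the score should peak at k = 3, matching the three customer types that were generated.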

Advanced Clustering Techniques: Explore more sophisticated clustering methods such as Gaussian Mixture Models (GMM), Spectral Clustering, or Density Peaks Clustering.
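
As a first taste of these methods, here is a hedged sketch of a Gaussian Mixture Model with scikit-learn. Unlike K-means, it gives each customer a probability of belonging to each group rather than a single hard label. The choice of covariance_type="diag" is an assumption made for this dataset: every row sums to 30, so the three features are linearly dependent and a full covariance matrix would be nearly singular.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([
    rng.multinomial(30, [0.7, 0.2, 0.1], size=33),  # fiction lovers
    rng.multinomial(30, [0.1, 0.7, 0.2], size=33),  # students
    rng.multinomial(30, [0.2, 0.1, 0.7], size=33),  # parents
]).astype(float)

# Assumption: diagonal covariances, see the note above
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)

probs = gmm.predict_proba(X)        # one row per customer, one column per group
hard_labels = probs.argmax(axis=1)  # most likely group for each customer
print(probs[:3].round(3))
print(hard_labels[:10])
```

Customers near the boundary between two groups would show split probabilities, which is information a hard label hides.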

Conclusion

Clustering is like sorting things into groups based on their similarities. It’s used in many areas, especially in business, to understand and better serve different groups of customers. By using clustering, businesses can make more informed decisions and tailor their products or services to meet the needs of different customer groups.
