Customer Clustering#
All Customers#
Cluster Definition#
For clustering, we will use RFM metrics and the average number of unique products per order.
RFM metrics are well-suited for clustering. Adding the average number of unique products per order helps capture how diverse a customer's demand is across the assortment.
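For reference, a minimal sketch of how RFM metrics are typically derived from an orders table. The names here (df_orders, order_date, order_value) are hypothetical; the notebook's df_rfm is assumed to have been built earlier along these lines.

# Hypothetical orders table with columns: customer_unique_id, order_date, order_value
snapshot_date = df_orders['order_date'].max()
df_rfm_sketch = (
    df_orders.groupby('customer_unique_id')
    .agg(
        recency=('order_date', lambda d: (snapshot_date - d.max()).days)  # days since last purchase
        , frequency=('order_date', 'count')                               # number of orders
        , monetary=('order_value', 'sum')                                 # total spend
    )
)

In this notebook only the extra metric needs to be listed explicitly; the RFM columns are joined in from df_rfm below.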
selected_metrics = [
    'avg_unique_products_cnt'  # RFM metrics (recency, frequency, monetary) are merged in from df_rfm
]
Create a dataframe with the selected metrics.
We will only consider customers who have made at least one successful purchase.
mask = df_customers.buys_cnt.notna()
cols_to_drop = ['recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_segment']
df_processed = (
df_customers.loc[mask, ['customer_unique_id', *selected_metrics]]
.merge(df_rfm, on='customer_unique_id', how='left')
.drop(cols_to_drop, axis=1)
.set_index('customer_unique_id')
)
Check for highly correlated features.
corr_matrix = df_processed.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
cols_to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.6)]
cols_to_drop
[]
No need to delete anything.
Look for missing values in the columns.
df_processed.isna().sum().nlargest(5)
avg_unique_products_cnt 0
recency 0
frequency 0
monetary 0
dtype: int64
The check shows no missing values in these columns; in this dataset they could only arise for customers who made a single purchase and have not yet received their item.
As a safeguard, we drop any incomplete rows before standardization.
scaler = StandardScaler()
df_processed = df_processed.dropna()
X_scaled = scaler.fit_transform(df_processed)
Determine the optimal number of clusters using the elbow method and silhouette analysis.
wcss = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
px.line(
x=range(1, 15)
, y=wcss
, labels={'x': 'Number of clusters', 'y': 'WCSS'}
, title='The Elbow Method'
, width=600
, height=400
)

Key Observations:
The elbow method shows a clear break at 5 clusters. We’ll use 5 clusters.
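The break can also be located programmatically; a minimal sketch using the kneed package (an assumption: kneed is an extra dependency not used elsewhere in this notebook).

from kneed import KneeLocator

# The WCSS curve is convex and decreasing, so we look for its knee
knee = KneeLocator(list(range(1, 15)), wcss, curve='convex', direction='decreasing')
print(f'Detected elbow at k = {knee.elbow}')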
optimal_clusters = 5
kmeans = KMeans(
n_clusters=optimal_clusters
, init='k-means++'
, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
Evaluate clustering quality.
score = silhouette_score(X_scaled, cluster_labels)
print(f'Silhouette Score: {score:.3f}')
Silhouette Score: 0.492
Key Observations:
A silhouette score of 0.492 indicates reasonably good cluster separation.
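Per-cluster silhouette values give a finer view than the single average; a quick sketch:

from sklearn.metrics import silhouette_samples

# Mean silhouette per cluster: low values flag poorly separated clusters
sample_scores = silhouette_samples(X_scaled, cluster_labels)
for label in np.unique(cluster_labels):
    print(f'Cluster {label + 1}: {sample_scores[cluster_labels == label].mean():.3f}')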
Add cluster labels to the dataframe.
df_processed['cluster'] = cluster_labels + 1
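For interpretation, the cluster centroids can be mapped back to the original feature units; a sketch (note that centroid row i corresponds to cluster label i + 1):

# Centroids in original (unscaled) units
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_)
    , columns=df_processed.columns.drop('cluster')
)
centroids.round(2)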
Cluster Analysis#
Analyze the resulting clusters.
df_processed = df_processed.reset_index()
Distribution by Clustering Metrics
selected_metrics = [
'avg_unique_products_cnt',
'recency',
'frequency',
'monetary',
]
Provide more readable names for the metrics on the graphs.
metric_labels = {
'avg_unique_products_cnt': 'Avg Unique Products',
'recency': 'Recency',
'frequency': 'Frequency',
'monetary': 'Monetary',
}
cluster_label = {'cluster': 'Cluster'}
labels_for_polar={**cluster_label, **base_labels, **metric_labels}
fig = df_processed.analysis.segment_polar(
metrics=selected_metrics
, dimension='cluster'
, count_column='customer_unique_id'
, labels=labels_for_polar
)
pb.to_slide(fig, 'cluster all')
fig.show()

df_processed.analysis.segment_table(
metrics=selected_metrics
, dimension='cluster'
, count_column='customer_unique_id'
)
| cluster | 3 | 4 | 5 | 2 | 1 |
|---|---|---|---|---|---|
| % of Total Count | 2.42% | 3.03% | 2.95% | 38.95% | 52.66% |
| avg_unique_products_cnt | 1.00 | 2.00 | 1.00 | 1.00 | 1.00 |
| recency | 218.50 | 221.00 | 197.50 | 372.00 | 128.00 |
| frequency | 1.00 | 1.00 | 2.00 | 1.00 | 1.00 |
| monetary | 966.53 | 187.81 | 223.66 | 99.43 | 101.75 |

Key Observations:
Most customers fall in Cluster 1 (53%) and Cluster 2 (39%).
Cluster 1: No standout metrics
Cluster 2: Highest Recency
Cluster 3: Highest Monetary
Cluster 4: Highest Avg Unique Products
Cluster 5: Highest Frequency
Number of Customers by Clusters in Different Segments
Add the customer dimensions to the clustered dataframe.
df_processed = (
df_processed.merge(df_customers[['customer_unique_id', *customers_dim]], on='customer_unique_id', how='left')
)
df_processed.viz.update_plotly_settings(
labels={**base_labels, 'cluster': 'Cluster'}
)
pb.configure(
df = df_processed
, metric = 'customer_unique_id'
, metric_label = 'Share of Customers'
, agg_func = 'nunique'
, norm_by='all'
, axis_sort_order='descending'
, text_auto='.1%'
, plotly_kwargs= {'category_orders': {'cluster': list(range(1, 6))}}
)
By Activity Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'activity_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 1-4 are dominated by one-time purchasers.
Cluster 5 dominates all active segments except one-time purchases.
One-time purchase rates are similar across all clusters except Cluster 5.
By Purchase Amount Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'value_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 1-2: Mostly medium payment tier
Cluster 3: Entirely high-value segment
Clusters 1-2 are underrepresented in the high-value segment relative to their overall share
By absolute counts, Clusters 1-2 still dominate within each value segment
By Loyalty Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'loyalty_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 1-2 dominate across loyalty segments
Cluster 4 has notably more critics and fewer promoters
Cluster 5 has more neutrals
By Risk Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'risk_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 4-5 are more common among risky customers (with order cancellations)
By Installment Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'installment_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Installments don’t dominate Clusters 1-2
Installments clearly dominate Clusters 3-5
Highest installment rate in Cluster 3
By Average Number of Products per Order Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'products_cnt_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 4 consists entirely of customers averaging >1 product per order
By Average Order Weight Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'weight_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 3 has more heavy-weight orders
Cluster 1 dominates light-weight orders
By Top Payment Types
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_payment_types'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 1 has more debit card users
By Top Product Categories
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_top_product_categories'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 4 dominates Bed Bath Table category
By Top Generalized Product Categories
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_top_general_product_categories'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 3: Dominated by electronics
Cluster 4: Dominated by furniture
By Customer City
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_city'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 2 is more common in Rio de Janeiro
Cluster 1 is less common in Rio de Janeiro
Customers with Multiple Purchases#
Cluster Definition#
We will conduct a separate clustering of customers who have made more than one successful purchase.
We will use the same metrics as in the clustering of all customers.
selected_metrics = [
'avg_unique_products_cnt'
]
mask = df_customers.buys_cnt >= 2
cols_to_drop = ['recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_segment']
df_processed = (
df_customers.loc[mask, ['customer_unique_id', *selected_metrics]]
.merge(df_rfm, on='customer_unique_id', how='inner')
.drop(cols_to_drop, axis=1)
.set_index('customer_unique_id')
)
Check for missing values.
df_processed.isna().sum().nlargest(5)
avg_unique_products_cnt 0
recency 0
frequency 0
monetary 0
dtype: int64
There are no missing values.
scaler = StandardScaler()
df_processed = df_processed.dropna()
X_scaled = scaler.fit_transform(df_processed)
Determine the optimal number of clusters using the silhouette method.
silhouette_scores = []
for n_clusters in range(2, 15):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)
px.line(
x=range(2, 15)
, y=silhouette_scores
, labels={'x': 'Number of clusters', 'y': 'Silhouette Score'}
, title='Silhouette Score Method'
, width=600
, height=400
)

Key Observations:
Global peak at 7 clusters
Sharp drop after k=7 suggests that further splits fragment natural groups
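A quick programmatic confirmation of the peak (sketch):

# k values were evaluated over range(2, 15); pick the one with the highest score
best_k = list(range(2, 15))[int(np.argmax(silhouette_scores))]
print(f'Best k by silhouette: {best_k}')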
Examine the dendrogram of hierarchical clustering.
linked = linkage(X_scaled, method='ward')
fig = ff.create_dendrogram(
    X_scaled  # create_dendrogram expects the raw observation matrix, not a precomputed linkage
    , orientation='bottom'
    , linkagefun=lambda x: linkage(x, method='ward')  # ward linkage to match `linked`
)
fig.update_layout(
title='Dendrogram of hierarchical clustering',
xaxis_title='Observations',
yaxis_title='Distance',
width=800,
height=500,
margin=dict(l=50, r=50, b=50, t=50),
yaxis_mirror=False,
)
fig.update_xaxes(showticklabels=False, ticks='', mirror=False)
fig.show(config=dict(displayModeBar=False), renderer="png")

Key Observations:
Optimal cut at 6-7 clusters (where branches lengthen)
We’ll choose 7 clusters
optimal_clusters = 7
model = AgglomerativeClustering(n_clusters=optimal_clusters)
cluster_labels = model.fit_predict(X_scaled)
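As a sanity check, cutting the ward linkage tree built for the dendrogram at the same number of clusters should yield essentially the same partition; a sketch using scipy's fcluster:

from scipy.cluster.hierarchy import fcluster

# Flat cut of the dendrogram into 7 clusters; labels are 1-based, sklearn's are 0-based
flat_labels = fcluster(linked, t=optimal_clusters, criterion='maxclust')
print(pd.crosstab(flat_labels, cluster_labels))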
Evaluate quality.
score = silhouette_score(X_scaled, cluster_labels)
print(f'Silhouette Score: {score:.3f}')
Silhouette Score: 0.410
Key Observations:
A silhouette score of 0.41 indicates moderate cluster separation
Clusters remain distinguishable despite the score being below 0.5
Add cluster labels to the dataframe.
df_processed['cluster'] = cluster_labels + 1
Cluster Analysis#
Analyze the resulting clusters.
df_processed = df_processed.reset_index()
Distribution by Clustering Metrics
selected_metrics = [
'avg_unique_products_cnt',
'recency',
'frequency',
'monetary',
]
Provide more readable names for the metrics on the graphs.
metric_labels = {
'avg_unique_products_cnt': 'Avg Unique Products',
'recency': 'Recency',
'frequency': 'Frequency',
'monetary': 'Monetary',
}
cluster_label = {'cluster': 'Cluster'}
labels_for_polar={**cluster_label, **base_labels, **metric_labels}
fig = df_processed.analysis.segment_polar(
metrics=selected_metrics
, dimension='cluster'
, count_column='customer_unique_id'
, labels=labels_for_polar
)
pb.to_slide(fig, 'cluster repeat')
fig.show()

df_processed.analysis.segment_table(
metrics=selected_metrics
, dimension='cluster'
, count_column='customer_unique_id'
)
| cluster | 2 | 6 | 7 | 1 | 5 | 4 | 3 |
|---|---|---|---|---|---|---|---|
| % of Total Count | 3.97% | 0.13% | 2.91% | 8.33% | 8.47% | 42.72% | 33.47% |
| avg_unique_products_cnt | 1.00 | 1.00 | 2.00 | 1.00 | 1.50 | 1.00 | 1.00 |
| recency | 67.00 | 8.00 | 83.50 | 107.00 | 80.00 | 103.00 | 28.00 |
| frequency | 2.00 | 15.00 | 2.00 | 3.00 | 2.00 | 2.00 | 2.00 |
| monetary | 1308.12 | 879.27 | 432.62 | 480.52 | 311.05 | 266.76 | 291.46 |

Key Observations:
Most customers in Cluster 4 (43%) and Cluster 3 (34%)
Cluster 1: No standout metrics
Cluster 2: Highest Monetary
Cluster 6: High Frequency and Monetary
Cluster 7: High avg unique products
Number of Customers by Clusters in Different Segments
Add the customer dimensions to the clustered dataframe.
df_processed = (
df_processed.merge(df_customers[['customer_unique_id', *customers_dim]], on='customer_unique_id', how='left')
)
df_processed.viz.update_plotly_settings(
labels={**base_labels, 'cluster': 'Cluster'}
)
pb.configure(
df = df_processed
, metric = 'customer_unique_id'
, metric_label = 'Share of Customers'
, agg_func = 'nunique'
, norm_by='all'
, axis_sort_order='descending'
, text_auto='.1%'
)
By Activity Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'activity_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely core audience
Core also dominates Cluster 1
Cluster 4 dominates potential core segment
By Purchase Amount Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'value_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 2 and 6: Entirely high-value segment
Cluster 4 dominates medium-value segment
By Loyalty Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'loyalty_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely promoters
Cluster 4 dominates critics
Clusters 4-5 more common among critics
By Installment Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'installment_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely installment users
Cluster 7 has more full-payment users
By Average Number of Products per Order Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'products_cnt_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely single-product orders
Cluster 7: No single-product orders
Clusters 3-4 dominate single-product orders
Clusters 2 and 7 dominate bulk orders (>2 products)
Cluster 5 has more multi-product orders
By Average Order Weight Segment
pb.cat_compare(
cat1='cluster'
, cat2 = 'weight_segment'
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely light-weight orders
Cluster 2: Heavy-weight orders dominate
By Top Payment Types
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_payment_types'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely credit card users
Cluster 7 has more boleto users
By Top Product Categories
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_top_product_categories'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Clusters 1-2 dominate Watches Gifts category
Cluster 7 dominates Bed Bath Table
By Top Generalized Product Categories
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_top_general_product_categories'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 2 strongly dominates electronics
Cluster 7 dominates furniture and home/garden
By Customer City
pb.cat_compare(
cat1='cluster'
, cat2 = 'customer_city'
, trim_top_n_cat2=5
, visible_graphs = [2, 3]
)


Key Observations:
Cluster 6: Entirely São Paulo customers (top 5 cities)
Cluster 1 more common in Niterói
Cluster 3 more common in Campinas