Customer Clustering#

All Customers#

Cluster Definition#

  • For clustering, we will use RFM metrics and the average number of unique products per order.

  • RFM metrics are well-suited for clustering. Adding the average number of unique products will help account for assortment demand diversity.

selected_metrics = [
    'avg_unique_products_cnt'
]

Create a dataframe with the selected metrics.

We will only consider customers who have made at least one successful purchase.

mask = df_customers.buys_cnt.notna()
cols_to_drop = ['recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_segment']
df_processed = (
    df_customers.loc[mask, ['customer_unique_id', *selected_metrics]]
    .merge(df_rfm, on='customer_unique_id', how='left')
    .drop(cols_to_drop, axis=1)
    .set_index('customer_unique_id')
)

Check for highly correlated features.

corr_matrix = df_processed.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
cols_to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.6)]
cols_to_drop
[]

No need to delete anything.

Look for missing values in the columns.

df_processed.isna().sum().nlargest(5)
avg_unique_products_cnt    0
recency                    0
frequency                  0
monetary                   0
dtype: int64

There are very few missing values, and they relate to customers who made only one purchase and have not yet received their item.

We will remove these rows before standardization.

scaler = StandardScaler()
df_processed = df_processed.dropna()
X_scaled = scaler.fit_transform(df_processed)

Determine the optimal number of clusters using the elbow method and silhouette analysis.

wcss = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

px.line(
    x=range(1, 15)
    , y=wcss
    , labels={'x': 'Number of clusters', 'y': 'WCSS'}
    , title='The Elbow Method'
    , width=600
    , height=400
)
../../_images/27b5d12d90e1ea121c15e365c1279cc2b693dec5095d3c90df8ba72d65abccee.jpg

Key Observations:

  • The elbow method shows a clear break at 5 clusters. We’ll use 5 clusters.

optimal_clusters = 5
kmeans = KMeans(
    n_clusters=optimal_clusters
    , init='k-means++'
    , random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

Examine quality

score = silhouette_score(X_scaled, cluster_labels)
print(f'Silhouette Score: {score:.3f}')
Silhouette Score: 0.492

Key Observations:

  • Silhouette score of 0.492 indicates good cluster separation.

Add cluster labels to the dataframe.

df_processed['cluster'] = cluster_labels + 1

Cluster Analysis#

Analyze the resulting clusters.

df_processed = df_processed.reset_index()

Distribution by Clustering Metrics

selected_metrics = [
    'avg_unique_products_cnt',
    'recency', 
    'frequency',
    'monetary',
]

Provide more readable names for the metrics on the graphs.

metric_labels = {
    'avg_unique_products_cnt': 'Avg Unique Products',
    'recency': 'Recency', 
    'frequency': 'Frequency', 
    'monetary': 'Monetary', 
}
cluster_label = {'cluster': 'Cluster'}
labels_for_polar={**cluster_label, **base_labels, **metric_labels}
fig = df_processed.analysis.segment_polar(
    metrics=selected_metrics
    , dimension='cluster'
    , count_column='customer_unique_id'
    , labels=labels_for_polar
)
pb.to_slide(fig, 'cluster all')
fig.show()
../../_images/835184114ed83f1a99453aa0c7535b8638c446c0dfa66548a9fe5d822b81fc4d.jpg
df_processed.analysis.segment_table(
    metrics=selected_metrics
    , dimension='cluster'
    , count_column='customer_unique_id' 
)
fig.show()
Segment Analysis for Cluster
cluster 3 4 5 2 1
% of Total Count 2.42% 3.03% 2.95% 38.95% 52.66%
avg_unique_products_cnt 1.00 2.00 1.00 1.00 1.00
recency 218.50 221.00 197.50 372.00 128.00
frequency 1.00 1.00 2.00 1.00 1.00
monetary 966.53 187.81 223.66 99.43 101.75
../../_images/835184114ed83f1a99453aa0c7535b8638c446c0dfa66548a9fe5d822b81fc4d.jpg

Key Observations:

  • Most customers fall in Cluster 1 (53%) and Cluster 2 (39%).

  • Cluster 1: No standout metrics

  • Cluster 2: Highest Recency

  • Cluster 3: Highest Monetary, Cluster 4: Highest Avg Unique Products, Cluster 5: Highest Frequency


Number of Customers by Clusters in Different Segments

Add dimensions to the dataframe with clusters.

df_processed = (
    df_processed.merge(df_customers[['customer_unique_id', *customers_dim]], on='customer_unique_id', how='left')
)
df_processed.viz.update_plotly_settings(
    labels={**base_labels, 'cluster': 'Cluster'}
)
pb.configure(
    df = df_processed
    , metric = 'customer_unique_id'
    , metric_label = 'Share of Customers'
    , agg_func = 'nunique'
    , norm_by='all'
    , axis_sort_order='descending'    
    , text_auto='.1%'
    , plotly_kwargs= {'category_orders': {'cluster': list(range(1, 6))}}
)

By Activity Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'activity_segment' 
    , visible_graphs = [2, 3]
)
../../_images/1cfd3bd3e1ce926003fe6e91a2e45a9f27afc38364b1587d4c34bf4c064e4953.jpg ../../_images/5075314b4644e6a105a3e25e2a6cbd0e1cb2ce21686f0af513ed9ebb25deb083.jpg

Key Observations:

  • Clusters 1-4 are dominated by one-time purchasers.

  • Cluster 5 dominates all active segments except one-time purchases.

  • One-time purchase rates are similar across all clusters except Cluster 5.

By Purchase Amount Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'value_segment'
    , visible_graphs = [2, 3]
)
../../_images/d4916cfd64dfb88cd0ba1b86809853c686f74a4f8ec1c507955ab819bbb1d6d3.jpg ../../_images/9aa32f242475a005440dc30449c56e9b3ac8e4e6e5f6fab89469fb0e599c5bcb.jpg

Key Observations:

  • Clusters 1-2: Mostly medium payment tier

  • Cluster 3: Entirely high-value segment

  • Clusters 1-2 are less common in high-value segment

  • Clusters 1-2 dominate within each value segment

By Loyalty Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'loyalty_segment'
    , visible_graphs = [2, 3]
)
../../_images/410bc02cb052c060aefa2f1ff5cf5b655d9486ae10c45008afe3f0e6f3d90129.jpg ../../_images/7b5cf66ca39297bf959741bb73cc10e08d244273de71ef9c1afed7d33e50af06.jpg

Key Observations:

  • Cluster 4 has notably more critics

  • Clusters 1-2 dominate across loyalty segments

  • Fewer promoters in Cluster 4

  • More neutrals in Cluster 5

  • More critics in Cluster 4

By Risk Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'risk_segment'
    , visible_graphs = [2, 3]
)
../../_images/48d10af14353eea91f6e1341499f9e83f895e87a84b105b0f6ddd21fb086f637.jpg ../../_images/d79256f0aa83e9f8c844a3e2d989958fb988b8c284ae63d2b5fc59fb04b57a12.jpg

Key Observations:

  • Clusters 4-5 are more common among risky customers (with order cancellations)

By Installment Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'installment_segment'
    , visible_graphs = [2, 3]
)
../../_images/7d7a2f56f2415353a1ab9091a4e267fe7cb5c3c3b21fe2bb50c198ea64b04300.jpg ../../_images/479a28ae496ba75b4f667ee53e6730f63fa509cc6f0bab20c78d50bbaec93cf5.jpg

Key Observations:

  • Installments don’t dominate Clusters 1-2

  • Installments clearly dominate Clusters 3-5

  • Highest installment rate in Cluster 3

By Average Number of Products per Order Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'products_cnt_segment'
    , visible_graphs = [2, 3]
)
../../_images/e4c8b63d18df95cd6a42e7bfcfe2a8dfea13e99a3a8d26e4241ab2fcfa0032a9.jpg ../../_images/3c8af52a704814b0b4b7784ef1518e38b88e0dd049671f9a91bef4865e465123.jpg

Key Observations:

  • Cluster 4 consists entirely of customers averaging >1 product per order

By Average Order Weight Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'weight_segment'
    , visible_graphs = [2, 3]
)
../../_images/3fe4cb33a2fc162a7af477219b093984ebcc2a86760c199c7c4e8eee4b56e09b.jpg ../../_images/edc2d2a2d04455ded8207e2286dbc5ea9bb4b7869b3407665d44ce12ef620933.jpg

Key Observations:

  • Cluster 3 has more heavy-weight orders

  • Cluster 1 dominates light-weight orders

By Top Payment Types

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_payment_types'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/af1b06473330e0dc08ad4d15c1a5693c1845c9535944e443850663363e595033.jpg ../../_images/e4beb093bf254255eec2a4b43832a1b884eb46cf36e1da30bf1f0acd993077ac.jpg

Key Observations:

  • Cluster 1 has more debit card users

By Top Product Categories

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_top_product_categories'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/0d0bbac6589db42750e170b9d1e827c8f28214c6a31e238f3eec44610100d3c6.jpg ../../_images/e369b262c2d8fadfa13f2a80bb3f6b584b01a366ca10f1765c6ce40f82f4e7a2.jpg

Key Observations:

  • Cluster 4 dominates Bed Bath Table category

By Top Generalized Product Categories

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_top_general_product_categories'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/aa0e840210e17d3bd60d3649f4a65502d56df1f578b9581d5ab83f1c87b023ef.jpg ../../_images/15909fa28002f77b4b14b6aec4fefceeee4bff2dca192c420479ff2854c03f55.jpg

Key Observations:

  • Cluster 3: Dominated by electronics

  • Cluster 4: Dominated by furniture

By Customer State

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_city'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/b4bbea8fae7096cc56558d57c925e1840013a1e4df5e2dbf440450ba444eaa64.jpg ../../_images/9362044d97a8893f9082f57e508508715b8ee5804f1bae9769704a059749ac4b.jpg

Key Observations:

  • Cluster 2 is more common in Rio de Janeiro

  • Cluster 1 is less common in Rio de Janeiro

Customers with Multiple Purchases#

Cluster Definition#

We will conduct a separate clustering of customers who have made more than one successful purchase.

We will use the same metrics as in the clustering of all customers.

selected_metrics = [
    'avg_unique_products_cnt'
]
mask = df_customers.buys_cnt >= 2
cols_to_drop = ['recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_segment']
df_processed = (
    df_customers.loc[mask, ['customer_unique_id', *selected_metrics]]
    .merge(df_rfm[lambda x: x.rfm_segment=='Champions'], on='customer_unique_id', how='inner')
    .drop(cols_to_drop, axis=1)
    .set_index('customer_unique_id')
)

Check for missing values.

df_processed.isna().sum().nlargest(5)
avg_unique_products_cnt    0
recency                    0
frequency                  0
monetary                   0
dtype: int64

There are no missing values.

scaler = StandardScaler()
df_processed = df_processed.dropna()
X_scaled = scaler.fit_transform(df_processed)

Determine the optimal number of clusters using the silhouette method.

silhouette_scores = []
for n_clusters in range(2, 15):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)
px.line(
    x=range(2, 15)
    , y=silhouette_scores
    , labels={'x': 'Number of clusters', 'y': 'Silhouette Score'}
    , title='Silhouette Score Method'
    , width=600
    , height=400
)
../../_images/b0dcc8b8c77d1ff80a2728a6b59f67f12c45d400455657ce12fabaeb1fd7760c.jpg

Key Observations:

  • Global peak at 7 clusters

  • Sharp drop after k=7 suggests overfitting

Examine the dendrogram of hierarchical clustering.

linked = linkage(X_scaled, method='ward')
fig = ff.create_dendrogram(
    linked
    , orientation='bottom'
)
fig.update_layout(
    title='Dendrogram of hierarchical clustering',
    xaxis_title='Observations',
    yaxis_title='Distance',
    width=800,
    height=500,
    margin=dict(l=50, r=50, b=50, t=50),
    yaxis_mirror=False,
)
fig.update_xaxes(showticklabels=False, ticks='', mirror=False)
fig.show(config=dict(displayModeBar=False), renderer="png")
../../_images/f351f9d566cbdf03e43ebad27db5ee7a67db58f1243c1d69f4319a15271ebf5a.png

Key Observations:

  • Optimal cut at 6-7 clusters (where branches lengthen)

  • We’ll choose 7 clusters

optimal_clusters = 7
model = AgglomerativeClustering(n_clusters=optimal_clusters)
cluster_labels = model.fit_predict(X_scaled)

Evaluate quality.

score = silhouette_score(X_scaled, cluster_labels)
print(f'Silhouette Score: {score:.3f}')
Silhouette Score: 0.410

Key Observations:

  • Silhouette score of 0.41 indicates good clustering

  • Clusters remain distinguishable despite score <0.5

Add cluster labels to the dataframe.

df_processed['cluster'] = cluster_labels + 1

Cluster Analysis#

Analyze the resulting clusters.

df_processed = df_processed.reset_index()

Distribution by Clustering Metrics

selected_metrics = [
    'avg_unique_products_cnt',
    'recency', 
    'frequency',
    'monetary',
]

Provide more readable names for the metrics on the graphs.

metric_labels = {
    'avg_unique_products_cnt': 'Avg Unique Products',
    'recency': 'Recency', 
    'frequency': 'Frequency', 
    'monetary': 'Monetary', 
}
cluster_label = {'cluster': 'Cluster'}
labels_for_polar={**cluster_label, **base_labels, **metric_labels}
fig = df_processed.analysis.segment_polar(
    metrics=selected_metrics
    , dimension='cluster'
    , count_column='customer_unique_id'
    , labels=labels_for_polar
)
pb.to_slide(fig, 'cluster repeat')
fig.show()
../../_images/8d01f61d6c7c22c1302ea427225d5cc0bb3c8687fe4ee3a3fd1be7fdd8756e51.jpg
df_processed.analysis.segment_table(
    metrics=selected_metrics
    , dimension='cluster'
    , count_column='customer_unique_id' 
)
fig.show()
Segment Analysis for Cluster
cluster 2 6 7 1 5 4 3
% of Total Count 3.97% 0.13% 2.91% 8.33% 8.47% 42.72% 33.47%
avg_unique_products_cnt 1.00 1.00 2.00 1.00 1.50 1.00 1.00
recency 67.00 8.00 83.50 107.00 80.00 103.00 28.00
frequency 2.00 15.00 2.00 3.00 2.00 2.00 2.00
monetary 1308.12 879.27 432.62 480.52 311.05 266.76 291.46
../../_images/8d01f61d6c7c22c1302ea427225d5cc0bb3c8687fe4ee3a3fd1be7fdd8756e51.jpg

Key Observations:

  • Most customers in Cluster 4 (43%) and Cluster 3 (34%)

  • Cluster 1: No standout metrics

  • Cluster 2: Highest Monetary

  • Cluster 6: High Frequency and Monetary

  • Cluster 7: High avg unique products


Number of Customers by Clusters in Different Segments

Add dimensions to the dataframe with clusters.

df_processed = (
    df_processed.merge(df_customers[['customer_unique_id', *customers_dim]], on='customer_unique_id', how='left')
)
df_processed.viz.update_plotly_settings(
    labels={**base_labels, 'cluster': 'Cluster'}
)
pb.configure(
    df = df_processed
    , metric = 'customer_unique_id'
    , metric_label = 'Share of Customers'
    , agg_func = 'nunique'
    , norm_by='all'
    , axis_sort_order='descending'    
    , text_auto='.1%'
)

By Activity Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'activity_segment' 
    , visible_graphs = [2, 3]
)
../../_images/26a90abad47b2ebf935cfc8258b60d46c0c6ef2e2b445ae083be024f43ac497d.jpg ../../_images/87b915b27ff8f00358a2bdd42a8584793ea621b8acfd835c80db12e2410825f7.jpg

Key Observations:

  • Cluster 6: Entirely core audience

  • Core also dominates Cluster 1

  • Cluster 4 dominates potential core segment

By Purchase Amount Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'value_segment'
    , visible_graphs = [2, 3]
)
../../_images/22821968384e380fd69a04f17ff6e5ce25f96471abccbe86a859956eb543de2a.jpg ../../_images/0599e0475624e9bb867f6af669f2ee22275362becbc569f5861fa41cc113fe75.jpg

Key Observations:

  • Clusters 2 and 6: Entirely high-value segment

  • Cluster 4 dominates medium-value segment

By Loyalty Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'loyalty_segment'
    , visible_graphs = [2, 3]
)
../../_images/bce1d33800bdecd3cf65d2d04fb4a84df798e7d967ca13c22f7440a1efcba8eb.jpg ../../_images/0c2660136569fcf3566ff376f501aa69e9a47509df6edcf38cc105d09ab4344f.jpg

Key Observations:

  • Cluster 6: Entirely promoters

  • Cluster 4 dominates critics

  • Clusters 4-5 more common among critics

By Installment Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'installment_segment'
    , visible_graphs = [2, 3]
)
../../_images/897089ba0cd916b41deeb959ad8d5ed3d198e70c7022c0b46114538a0ad23b93.jpg ../../_images/2d3aa97073245f9820706fcd7e28b4bf7ef7ec045a5c74a5641d9a9161aee59f.jpg

Key Observations:

  • Cluster 6: Entirely installment users

  • Cluster 7 has more full-payment users

By Average Number of Products per Order Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'products_cnt_segment'
    , visible_graphs = [2, 3]
)
../../_images/ca3089c2ad7c9a5070c809cb744017362865992293d584356515e481e28c5dfc.jpg ../../_images/8aacd45855998aba9d89fa3f6707e2f6cca04e597329281065ebf6a131e3f9ff.jpg

Key Observations:

  • Cluster 6: Entirely single-product orders

  • Cluster 7: No single-product orders

  • Clusters 3-4 dominate single-product orders

  • Clusters 2 and 7 dominate bulk orders (>2 products)

  • Cluster 5 has more multi-product orders

By Average Order Weight Segment

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'weight_segment'
    , visible_graphs = [2, 3]
)
../../_images/99788d4a0e2f377eb499534a84f034d4d484610c7edadfe6209a701729d4e460.jpg ../../_images/3fb89afd7e442c6ef4f7f1bdba7e71914e1c3c6689eda700d2470a9f72e811a9.jpg

Key Observations:

  • Cluster 6: Entirely light-weight orders

  • Cluster 2: Heavy-weight orders dominate

By Top Payment Types

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_payment_types'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/292ec82cdd272bafd6798801f711de248ac0877bb49c25b60c1d56dffb4d81f3.jpg ../../_images/3e67b81d92ff2d2fd284f5f9828e81d71d31bc4fefae057db62d75d9c2673ba0.jpg

Key Observations:

  • Cluster 6: Entirely credit card users

  • Cluster 7 has more boleto users

By Top Product Categories

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_top_product_categories'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/3c24b3fadbe98a9dbb4976ff91446bd25a409c6324dff87aa8ae6080171a7854.jpg ../../_images/0cfe2e7e290335b02363d04d50f5d03215589475495e2f7838300e243ea50e0d.jpg

Key Observations:

  • Clusters 1-2 dominate Watches Gifts category

  • Cluster 7 dominates Bed Bath Table

By Top Generalized Product Categories

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_top_general_product_categories'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/39f03c33f058b8168bb6d33f93e274ff49f1b893a022b62b2fdb89316e69e500.jpg ../../_images/52c3e3deb1c7faa8d8a7463834251f11c84b678fa7c9d3550d68a0bcb6a8f2a1.jpg

Key Observations:

  • Cluster 2 strongly dominates electronics

  • Cluster 7 dominates furniture and home/garden

By Customer State

pb.cat_compare(
    cat1='cluster'
    , cat2 = 'customer_city'
    , trim_top_n_cat2=5
    , visible_graphs = [2, 3]
)
../../_images/b45960c3414c2f787040e71ec8765309db134dfc0d62755c95c8fed167cffc1a.jpg ../../_images/554eb68481734e3922c57b8f20aa5f59775a2e1827410f814feb597dbd9bef1b.jpg

Key Observations:

  • Cluster 6: Entirely São Paulo customers (top 5 cities)

  • Cluster 1 more common in Niterói

  • Cluster 3 more common in Campinas