Data Merging

Data Merging#

Enriching Table df_items#

Buyer-Seller Distance

Create a distance variable.

tmp_df_customers = df_customers.merge(df_geolocations, left_on='customer_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')
tmp_df_customers.rename(columns={'geolocation_lat': 'lat_customer', 'geolocation_lng': 'lng_customer'}, inplace=True)
tmp_df_customers.drop(['geolocation_zip_code_prefix', 'geolocation_zip_code_prefix_3_digits'], axis=1, inplace=True)

tmp_df_sellers = df_sellers.merge(df_geolocations, left_on='seller_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')
tmp_df_sellers.rename(columns={'geolocation_lat': 'lat_seller', 'geolocation_lng': 'lng_seller'}, inplace=True)
tmp_df_sellers.drop(['geolocation_zip_code_prefix', 'geolocation_zip_code_prefix_3_digits'], axis=1, inplace=True)

df_items = (
    df_items.merge(df_orders[['order_id', 'customer_id']], on='order_id', how='left')
    .merge(tmp_df_customers[['customer_id', 'lat_customer', 'lng_customer']], on='customer_id', how='left')
    .merge(tmp_df_sellers[['seller_id', 'lat_seller', 'lng_seller']], on='seller_id', how='left')
)

Calculate buyer-seller distances.

df_items['distance_km'] = fron.haversine_vectorized(
    df_items['lat_customer'].values,
    df_items['lng_customer'].values,
    df_items['lat_seller'].values,
    df_items['lng_seller'].values
)

df_items.drop(['customer_id', 'lat_customer', 'lng_customer', 'lat_seller', 'lng_seller'], axis=1, inplace=True)

Check for missing values.

df_items.distance_km.isna().sum()

np.int64(554)

Missing values occur when zip prefixes are absent from the geolocation table for some buyers/sellers.

The quantity is small - leave as is.

Carrier Handoff Delay

Create a carrier handoff delay variable.

df_items = df_items.merge(df_orders[['order_id', 'order_delivered_carrier_dt']], on='order_id', how='left')

df_items['carrier_delivery_delay_days'] = df_items['order_delivered_carrier_dt'] - df_items['shipping_limit_dt']

Convert to days…

df_items['carrier_delivery_delay_days'] = df_items['carrier_delivery_delay_days'].dt.total_seconds() / (24 * 3600)

df_items.drop(['order_delivered_carrier_dt'], axis=1, inplace=True)

Enriching Table df_orders#

From Table df_payments#

Create these order-level metrics:

Payment count
Payment sum
Average payment
Total installment count

Create these order-level dimensions:

Installment presence
Payment types

tmp_df_pay_agg = df_payments.copy()
tmp_df_pay_agg['has_installments'] = tmp_df_pay_agg['has_installments'] == 'Has Installments'

Examine unique payment types per order.

tmp_df_pay_agg.groupby('order_id')['payment_type'].nunique().value_counts()

payment_type
1    96851
2     2241
Name: count, dtype: int64

Since maximum is 2 payment types, concatenate them during aggregation.

tmp_df_pay_agg = (
    tmp_df_pay_agg.groupby('order_id', as_index=False)
    .agg(
        payments_cnt = ('payment_sequential', 'count')
        , total_payment = ('payment_value', 'sum')
        , avg_payment = ('payment_value', 'mean')
        , total_installments_cnt = ('payment_installments', 'sum')
        , order_has_installment = ('has_installments', 'any')
        , order_payment_types = ('payment_type', lambda x: ', '.join(sorted(set(x))))
    )
)

tmp_df_pay_agg['order_payment_types'] = tmp_df_pay_agg['order_payment_types'].astype('category')

tmp_df_pay_agg['order_has_installment'] = tmp_df_pay_agg.order_has_installment.map({True: 'Has Installments', False: 'No Installments'}).astype('category')

Merge with df_orders.

Check key mismatches.

fron.analyze_join_keys(df_orders, tmp_df_pay_agg, "order_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	99,092	99,092	99,092

Merge tables.

df_orders = df_orders.merge(tmp_df_pay_agg, on='order_id', how='left')

From Tables df_items and df_products#

Create these order-level metrics:

Total product count
Unique product count
Seller count
Unique category count
Total product price
Average product price
Total shipping cost
Total order value
Shipping cost ratio
Order weight
Order volume
Average buyer-seller distance
Average carrier handoff delay

Create these order-level dimensions:

Free shipping indicator
Generalized product categories

First merge items and products tables.

Check key mismatches.

fron.analyze_join_keys(df_items, df_products, "product_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:1	0	164	112,279	32,951	112,279

Merge tables.

tmp_df_items_prods = df_items.merge(df_products, on='product_id', how='left')

Prepare dataframe.

Examine unique categories per order.

tmp_df_items_prods.groupby('order_id')['product_category'].nunique().value_counts()

product_category
1    97570
2      765
3       18
Name: count, dtype: int64

Maximum 3 categories - concatenate during aggregation.

Examine unique generalized categories per order.

tmp_df_items_prods.groupby('order_id')['general_product_category'].nunique().value_counts()

general_product_category
1    97777
2      567
3        9
Name: count, dtype: int64

Maximum 3 categories - concatenate during aggregation.

tmp_df_items_prods_agg = (tmp_df_items_prods.groupby('order_id', as_index=False)
          .agg(
              products_cnt = ('product_id', 'count')
              , unique_products_cnt = ('product_id', 'nunique')
              , sellers_cnt = ('seller_id', 'nunique')
              , product_categories_cnt = ('product_category', 'nunique')
              , total_products_price = ('price', 'sum')
              , avg_products_price = ('price', 'mean')
              , total_freight_value = ('freight_value', 'sum')
              , total_order_price = ('total_price', 'sum')
              , total_weight_kg = ('product_weight_g', 'sum')
              , total_volume_cm3 = ('product_volume_cm3', 'sum')
              , avg_distance_km = ('distance_km', 'mean')
              , avg_carrier_delivery_delay_days = ('carrier_delivery_delay_days', 'mean')
              , order_product_categories = ('product_category', lambda x: ', '.join(sorted(set(x))))
              , order_general_product_categories = ('general_product_category', lambda x: ', '.join(sorted(set(x))))
          )
)

tmp_df_items_prods_agg['freight_ratio'] = tmp_df_items_prods_agg['total_freight_value'] / tmp_df_items_prods_agg['total_order_price']

tmp_df_items_prods_agg['order_product_categories'] = tmp_df_items_prods_agg['order_product_categories'].astype('category')
tmp_df_items_prods_agg['order_general_product_categories'] = tmp_df_items_prods_agg['order_general_product_categories'].astype('category')

tmp_df_items_prods_agg['total_weight_kg'] = (tmp_df_items_prods_agg['total_weight_kg'] / 1000).round(2)

tmp_df_items_prods_agg['order_is_free_shipping'] = tmp_df_items_prods_agg.total_freight_value == 0

tmp_df_items_prods_agg['order_is_free_shipping'] = tmp_df_items_prods_agg.order_is_free_shipping.map({True: 'Free Shipping', False: 'Paid Shipping'}).astype('category')

tmp_df_items_prods_agg['order_is_free_shipping'].value_counts(dropna=False)

order_is_free_shipping
Paid Shipping    98015
Free Shipping      338
Name: count, dtype: int64

Add new fields to df_orders.

Check key mismatches.

fron.analyze_join_keys(df_orders, tmp_df_items_prods_agg, "order_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	739	0	99,092	98,353	99,092

We previously identified orders in df_orders missing from df_items.

These are all either canceled or unavailable - missing values from items table are expected.

Merge tables.

df_orders = df_orders.merge(tmp_df_items_prods_agg, on='order_id', how='left')

From Table df_customers#

Add customer city and state.

Check key mismatches.

fron.analyze_join_keys(df_orders, df_customers, "customer_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	99,092	99,092	99,092

Add fields to df_orders.

df_orders = df_orders.merge(df_customers[['customer_id', 'customer_unique_id', 'customer_state', 'customer_city']], on='customer_id', how='left')

From Table df_reviews#

Create these order-level metrics:

Review count
Average review score

tmp_df_reviews_agg = (
    df_reviews.groupby('order_id', as_index=False)
    .agg(
        reviews_cnt = ('review_id', 'nunique')
        , order_avg_reviews_score = ('review_score', 'mean')
    )
)

Since secondary reviews tend to be lower-scored, round down.

tmp_df_reviews_agg['order_avg_reviews_score'] = np.floor(tmp_df_reviews_agg['order_avg_reviews_score']).astype(int)

Check key mismatches.

fron.analyze_join_keys(df_orders, tmp_df_reviews_agg, "order_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	99,092	99,092	99,092

Add fields to df_orders.

df_orders = df_orders.merge(tmp_df_reviews_agg, on='order_id', how='left')

Creating New Dimensions#

Create a new payment amount dimension

Let’s look at the quantiles in the column.

df_orders.total_payment.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

05    32.38
25    62.00
50   105.28
75   176.87
95   452.28
Name: total_payment, dtype: float64

Create the following categories:

Cheap: ≤50 R$
Medium: 50-200 R$
Expensive: >200 R$

labels = ['Cheap', 'Medium', 'Expensive']
bins = [-np.inf, 50, 200, np.inf]

df_orders['order_total_payment_cat'] = df_orders.total_payment.preproc.to_categorical(method='custom_bins', labels=labels, bins=bins)

	Count
Medium	61993
Expensive	20189
Cheap	16910

Create a new order weight dimension

Let’s look at the quantiles in the column.

df_orders.total_weight_kg.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

05    0.14
25    0.30
50    0.75
75    2.05
95   10.59
Name: total_weight_kg, dtype: float64

Categories for order weight:

Light: ≤500g
Medium: 500-5000g
Heavy: >5kg

labels = ['Light', 'Medium', 'Heavy']
bins = [-np.inf, 0.5, 5, np.inf]

df_orders['order_total_weight_cat'] = (
    df_orders.total_weight_kg.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Missing in Items')
)

	Count
Medium	45450
Light	39654
Heavy	13249
Missing in Items	739

Create a new order volume dimension

Let’s look at the quantiles in the column.

df_orders.total_volume_cm3.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

05      816.00
25    2,964.00
50    7,260.00
75   19,872.00
95   64,000.00
Name: total_volume_cm3, dtype: float64

Categories for order volume:

Small: ≤3500 cm3
Medium: 3500-10000 cm3
Large: >10000 cm3

labels = ['Small', 'Medium', 'Large']
bins = [-np.inf, 3500, 10000, np.inf]

df_orders['order_total_volume_cat'] = (
    df_orders.total_volume_cm3.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Missing in Items')
)

	Count
Large	39869
Medium	29455
Small	29029
Missing in Items	739

Create a new review score dimension

Categories:

Positive: 4-5
Neutral: 3
Negative: 1-2

rules = {
    "Positive": lambda x: x.isin([4, 5]),
    "Neutral": lambda x: x == 3,
    'Negative': lambda x: x.isin([1, 2]),
    "Missing Score": "default"
}
df_orders['order_review_sentiment'] = df_orders.order_avg_reviews_score.preproc.to_categorical(rules=rules)

	Count
Positive	75966
Negative	14882
Neutral	8244

Let’s look at the missing values.

df_orders.explore.detect_anomalies()

Missings by Column
	Count	Percent
order_approved_dt	121	0.12%
order_delivered_carrier_dt	1714	1.73%
order_delivered_customer_dt	2880	2.91%
from_purchase_to_approved_hours	121	0.12%
from_purchase_to_carrier_days	1714	1.73%
delivery_time_days	2880	2.91%
delivery_delay_days	2880	2.91%
from_approved_to_carrier_days	1714	1.73%
from_carrier_to_customer_days	2880	2.91%
delivery_time_days_cat	2880	2.91%
products_cnt	739	0.75%
unique_products_cnt	739	0.75%
sellers_cnt	739	0.75%
product_categories_cnt	739	0.75%
total_products_price	739	0.75%
avg_products_price	739	0.75%
total_freight_value	739	0.75%
total_order_price	739	0.75%
total_weight_kg	739	0.75%
total_volume_cm3	739	0.75%
avg_distance_km	1228	1.24%
avg_carrier_delivery_delay_days	1714	1.73%
order_product_categories	739	0.75%
order_general_product_categories	739	0.75%
freight_ratio	739	0.75%
order_is_free_shipping	739	0.75%

Missing values in time-related variables occur due to absent dates (normal for undelivered orders)
Missing values in product-related variables occur because some orders aren’t in the items table (canceled/unavailable). These won’t affect sales analysis.

For dimensions, replace missing values with ‘No Order In Items’

df_orders['order_product_categories'] = df_orders['order_product_categories'].cat.add_categories('No Order in Items').fillna('No Order in Items')
df_orders['order_general_product_categories'] = df_orders['order_general_product_categories'].cat.add_categories('No Order in Items').fillna('No Order in Items')
df_orders['order_is_free_shipping'] = df_orders['order_is_free_shipping'].cat.add_categories('No Order in Items').fillna('No Order in Items')

Creating Table df_sales#

The orders table contains order creation time, payment approval time, and order status. After examining statuses:

created: Few orders, old, comments indicate non-delivery
approved: Few orders, old, no comments
processing: >90% have 1-2 star reviews, most undelivered (some reviews mention stockouts)
invoiced: >80% have 1-2 stars, mostly undelivered (some mention stockouts)
shipped: >70% have 1-2 stars, mostly undelivered

Let’s look at how the total delivery time is distributed.

df_orders.viz.histogram(
    x='delivery_time_estimated_days'
    , labels={'delivery_time_estimated_days': 'Delivery time to customer, days'}
    , title='Distribution of delivery time to customer'
)

../../_images/20a78cac15bca9d4625c4526c1de5d96a6195480da5e577d167f83215ce7725f.jpg

Delivery typically takes ≤1 month. We’ll use this threshold to determine delivery status.

Let’s look at how the statuses are distributed for orders that have passed a month from the estimated delivery date.

tmp_last_date = df_orders.order_purchase_dt.max()

tmp_mask = lambda x: ((tmp_last_date - x.order_estimated_delivery_dt).dt.days > 31) & (x.order_status != 'Delivered')

df_orders[tmp_mask].order_status.value_counts()

order_status
Shipped        998
Unavailable    580
Canceled       464
Processing     298
Invoiced       261
Created          5
Approved         2
Delivered        0
Name: count, dtype: int64

Key Observations:

Since delivery usually takes ≤1 month, orders exceeding this are considered undelivered (confirmed by review content mentioning non-delivery/stockouts).

We’ll define a purchase as:

Orders without ‘canceled’/‘unavailable’ status
Orders with ‘delivered’ status
Orders without ‘delivered’ status where <31 days since purchase

tmp_mask = (
    (df_orders.order_status == 'Delivered') |
    (
        ((tmp_last_date - df_orders.order_estimated_delivery_dt).dt.days <= 31) & 
        (~df_orders.order_status.isin(['Canceled', 'Unavailable'])) 
    )
)

df_sales = df_orders[tmp_mask]

Let’s look at the count by statuses.

df_orders.order_status.value_counts()

order_status
Delivered      96211
Shipped         1097
Unavailable      602
Canceled         580
Processing       299
Invoiced         296
Created            5
Approved           2
Name: count, dtype: int64

We’ll create an order-level variable indicating purchase conversion.

df_orders['is_purchase'] = tmp_mask.map({True: 'Purchase', False: 'Not Purchase'}).astype('category')

df_orders.is_purchase.value_counts(dropna=False)

is_purchase
Purchase        96346
Not Purchase     2746
Name: count, dtype: int64

Create first purchase flag

To handle multiple purchases within the same timestamp, use ranking.

# Sort by customer and purchase date for correct ranking
df_sales = df_sales.sort_values(['customer_unique_id', 'order_purchase_dt'])

# Global first purchase flag (with tie-breaking via purchase_rank)
df_sales['customer_first_purchase_dt'] = df_sales.groupby('customer_unique_id')['order_purchase_dt'].transform('min')
df_sales['purchase_rank'] = df_sales.groupby('customer_unique_id').cumcount()
df_sales['sale_is_customer_first_purchase'] = (
    (df_sales['order_purchase_dt'] == df_sales['customer_first_purchase_dt']) 
    & (df_sales['purchase_rank'] == 0)
)

df_sales.drop(columns=['purchase_rank'], inplace=True)

Let’s look at the missing values.

df_sales.explore.detect_anomalies()

Missings by Column
	Count	Percent
order_delivered_carrier_dt	36	0.04%
order_delivered_customer_dt	135	0.14%
from_purchase_to_carrier_days	36	0.04%
delivery_time_days	135	0.14%
delivery_delay_days	135	0.14%
from_approved_to_carrier_days	36	0.04%
from_carrier_to_customer_days	135	0.14%
delivery_time_days_cat	135	0.14%
avg_distance_km	476	0.49%
avg_carrier_delivery_delay_days	36	0.04%

Results are as expected.

Enriching Table df_customers#

customer_unique_id isn’t unique in df_customers (data characteristic). For analysis, we’ll:

Save original table under new name
Remove duplicates from df_customers for user analysis

df_customers_origin = df_customers.copy()

df_customers = df_customers.drop_duplicates('customer_unique_id').drop('customer_id', axis=1)

From Table df_orders#

Create metrics for each customer:

Order count
Canceled order count
Cancelation rate

tmp_df_orders_agg = df_orders.copy()
tmp_df_orders_agg['is_canceled'] = tmp_df_orders_agg['order_status'] == 'Canceled'
tmp_df_orders_agg['is_not_delivered'] = tmp_df_orders_agg['order_status'] != 'Delivered'
tmp_df_orders_agg['is_customer_issue'] = tmp_df_orders_agg['delivery_issue_reason'] == 'Customer Issue'
tmp_df_orders_agg['is_service_issue'] = tmp_df_orders_agg['delivery_issue_reason'] == 'Service Issue'
tmp_df_orders_agg = (
    tmp_df_orders_agg.groupby('customer_unique_id', as_index=False)
    .agg(
        orders_cnt = ('order_id', 'nunique')
        , canceled_share = ('is_canceled', 'mean')
        , canceled_orders_cnt= ('is_canceled', 'sum')
        , not_delivered_share = ('is_not_delivered', 'mean')
        , customer_issue_share = ('is_customer_issue', 'mean')
        , service_issue_share = ('is_service_issue', 'mean')
    )
)

Check key mismatches.

fron.analyze_join_keys(df_orders, tmp_df_reviews_agg, "order_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	99,092	99,092	99,092

Merge tables.

df_customers = df_customers.merge(tmp_df_orders_agg, on='customer_unique_id', how='left')

From Table df_sales#

Create metrics for each customer:

Purchase count
Average delivery time
Average delivery delay
Delayed order rate
Weekend purchase rate
Average inter-purchase time
Average payment count
Total payments
Average order value
Average single payment amount
Installment order rate
Average product count per order
Average unique product count
Average seller count per order
Total product value
Average product value
Average single product value
Average shipping cost
Average order weight
Average order volume
Free shipping rate
Review count
Average review score
Active months count
Max consecutive active months

Create dimensions for each customer:

Installment usage
Payment types
Top 3 purchase weekdays
Top 3 product categories
Top 3 generalized categories
First purchase date

Top 3 Weekdays

df_sales.groupby('customer_unique_id')['purchase_weekday'].nunique().value_counts()

purchase_weekday
  91518
   1619
     76
     17
      5
Name: count, dtype: int64

Aggregate top 3 weekdays by purchase frequency.

top_purchase_weekdays = (
    df_sales.groupby(['customer_unique_id', 'purchase_weekday'], as_index=False, observed=True)
    .size()  
    .sort_values(by=['customer_unique_id', 'size'], ascending=[True, False]) 
    .groupby('customer_unique_id')
    .head(3) 
)
top_purchase_weekdays = (
    top_purchase_weekdays.groupby('customer_unique_id', as_index=False)
    .agg(
        customer_top_purchase_weekdays = ('purchase_weekday', lambda x: ', '.join(sorted(set(x))))
    )
)

Top 3 Payment Types:

payment_types = (
    df_sales.merge(df_payments, on='order_id', how='left')
)
payment_types.groupby('customer_unique_id')['payment_type'].nunique().value_counts()

payment_type
1    90805
2     2417
3       13
Name: count, dtype: int64

Concatenate into string.

payment_types = (
    payment_types.groupby('customer_unique_id', as_index=False)
    .agg(
        customer_payment_types = ('payment_type', lambda x: ', '.join(sorted(set(x))))
    )
)

Top 3 Product Categories:

top_product_categories = (
    df_sales.merge(df_items, on='order_id', how='left')
    .merge(df_products, on='product_id', how='left')
)
(top_product_categories.drop_duplicates(['customer_unique_id', 'order_id', 'product_category'])
.groupby(['customer_unique_id'])['product_category']
.nunique()
.value_counts()
)

product_category
  90973
   2125
    118
     13
      6
Name: count, dtype: int64

Aggregate top 3 categories by frequency.

top_product_categories = (
    top_product_categories.drop_duplicates(['customer_unique_id', 'order_id', 'product_category'])
    .groupby(['customer_unique_id', 'product_category'], as_index=False, observed=True)
    .size()  
    .sort_values(by=['customer_unique_id', 'size'], ascending=[True, False]) 
    .groupby('customer_unique_id')
    .head(3) 
)
top_product_categories = (
    top_product_categories.groupby('customer_unique_id', as_index=False)
    .agg(
        customer_top_product_categories = ('product_category', lambda x: ', '.join(sorted(set(x))))
    )
)

Top 3 Generalized Categories

top_general_product_categories = (
    df_sales.merge(df_items, on='order_id', how='left')
    .merge(df_products, on='product_id', how='left')
)
(top_general_product_categories.drop_duplicates(['customer_unique_id', 'order_id', 'general_product_category'])
.groupby(['customer_unique_id'])['general_product_category']
.nunique()
.value_counts()
)

general_product_category
  91416
   1737
     72
      9
      1
Name: count, dtype: int64

Aggregate top 3 categories by frequency.

top_general_product_categories = (
    top_general_product_categories.drop_duplicates(['customer_unique_id', 'order_id', 'general_product_category'])
    .groupby(['customer_unique_id', 'general_product_category'], as_index=False, observed=True)
    .size()  
    .sort_values(by=['customer_unique_id', 'size'], ascending=[True, False]) 
    .groupby('customer_unique_id')
    .head(3) 
)
top_general_product_categories = (
    top_general_product_categories.groupby('customer_unique_id', as_index=False)
    .agg(
        customer_top_general_product_categories = ('general_product_category', lambda x: ', '.join(sorted(set(x))))
    )
)

tmp_df_sales_agg = df_sales.copy()
tmp_df_sales_agg['is_delayed'] = tmp_df_sales_agg['is_delayed'] == 'Delayed'
tmp_df_sales_agg['is_purchase_weekend'] = tmp_df_sales_agg['purchase_day_type'] == 'Weekend'
tmp_df_sales_agg['order_has_installment'] = tmp_df_sales_agg['order_has_installment'] == 'Has Installments'
tmp_df_sales_agg['order_is_free_shipping'] = tmp_df_sales_agg['order_is_free_shipping'] == 'Free Shipping'

tmp_df_sales_agg = (
    tmp_df_sales_agg.groupby('customer_unique_id', as_index=False)
    .agg(
        buys_cnt = ('order_id', 'nunique')
        , avg_delivery_time_days = ('delivery_time_days', 'mean')
        , avg_delivery_delay_days = ('delivery_delay_days', 'mean')
        , delayed_orders_share = ('is_delayed', 'mean')
        , purchase_weekend_share = ('is_purchase_weekend', 'mean')
        , repeat_purchase_share = ('sale_is_customer_first_purchase', 'mean')
        , avg_payments_cnt = ('payments_cnt', 'mean')
        , total_customer_payment = ('total_payment', 'sum')
        , avg_total_order_payment = ('total_payment', 'mean')
        , avg_individual_payment = ('avg_payment', 'mean')
        , installment_orders_share = ('order_has_installment', 'mean')
        , avg_products_cnt = ('products_cnt', 'mean')
        , avg_unique_products_cnt = ('unique_products_cnt', 'mean')
        , avg_sellers_cnt = ('sellers_cnt', 'mean')
        , avg_order_total_products_price = ('total_products_price', 'mean')
        , avg_total_order_price = ('total_order_price', 'mean')
        , avg_products_price = ('avg_products_price', 'mean')
        , total_products_price = ('total_products_price', 'sum')
        , avg_order_total_freight_value = ('total_freight_value', 'mean')
        , avg_order_total_weight_kg = ('total_weight_kg', 'mean')
        , avg_order_total_volume_cm3 = ('total_volume_cm3', 'mean')
        , free_shipping_share = ('order_is_free_shipping', 'mean')
        , reviews_cnt = ('reviews_cnt', 'sum')
        , customer_avg_reviews_score = ('order_avg_reviews_score', 'mean')      
        , avg_distance_km = ('avg_distance_km', 'mean')
        , first_purchase_dt = ('order_purchase_dt', 'min')
        , last_purchase_dt=('order_purchase_dt', 'max')
        , from_first_to_second_days=('order_purchase_dt', lambda x: (x.nsmallest(2).iloc[-1] - x.min()).total_seconds() / (3600*24) if len(x) > 1 else np.nan)
        , from_first_to_last_days=('order_purchase_dt', lambda x: (x.max() - x.min()).total_seconds() / (3600*24) if len(x) > 1 else np.nan)
    )
    .merge(top_purchase_weekdays, on='customer_unique_id', how='left')
    .merge(payment_types, on='customer_unique_id', how='left')
    .merge(top_product_categories, on='customer_unique_id', how='left')
    .merge(top_general_product_categories, on='customer_unique_id', how='left')
          
)

We calculated the average value of the column ‘sale_is_customer_first_purchase’, but to get the proportion of repeat purchases, we need to subtract this value from 1.

tmp_df_sales_agg['repeat_purchase_share'] = 1 - tmp_df_sales_agg['repeat_purchase_share']

Missing values in dimensions for customers without df_items orders will show ‘No Order In Items’.

tmp_df_sales_agg['customer_top_product_categories'] = tmp_df_sales_agg['customer_top_product_categories'].fillna('No Order in Items')
tmp_df_sales_agg['customer_top_general_product_categories'] = tmp_df_sales_agg['customer_top_general_product_categories'].fillna('No Order in Items')

tmp_df_sales_agg['customer_top_purchase_weekdays'] = tmp_df_sales_agg['customer_top_purchase_weekdays'].astype('category')
tmp_df_sales_agg['customer_payment_types'] = tmp_df_sales_agg['customer_payment_types'].astype('category')
tmp_df_sales_agg['customer_top_product_categories'] = tmp_df_sales_agg['customer_top_product_categories'].astype('category')
tmp_df_sales_agg['customer_top_general_product_categories'] = tmp_df_sales_agg['customer_top_general_product_categories'].astype('category')

Inter-purchase Time

tmp_df_diff_days = (df_sales[['customer_unique_id', 'order_purchase_dt']]
          .sort_values(['customer_unique_id', 'order_purchase_dt'])
)
tmp_df_diff_days['avg_buys_diff_days'] = (tmp_df_diff_days.groupby(['customer_unique_id'])['order_purchase_dt']
                  .diff()
                  .apply(lambda x: x.days + x.seconds / (24 * 3600))
)
tmp_df_diff_days.dropna(subset='avg_buys_diff_days', inplace=True)
tmp_df_diff_days = (tmp_df_diff_days.groupby('customer_unique_id', as_index=False)
               .agg(avg_buys_diff_days = ('avg_buys_diff_days', 'mean')
               )
)

Check key mismatches.

fron.analyze_join_keys(tmp_df_sales_agg, tmp_df_diff_days, "customer_unique_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	90,443	0	93,235	2,792	93,235

NA for single-purchase customers (expected).

Merge tables.

tmp_df_sales_agg = tmp_df_sales_agg.merge(tmp_df_diff_days, on='customer_unique_id', how='left')

Active Months

tmp_df_months_with_buys = (
    df_sales.assign(
        year_month = lambda x: x.order_purchase_dt.dt.to_period('M')
    )
    .groupby('customer_unique_id', as_index=False)
    .agg(months_with_buys = ('year_month', 'nunique'))
)

tmp_df_months_with_buys.months_with_buys.value_counts(dropna=False)

months_with_buys
  91550
   1576
     87
     17
      3
      1
      1
Name: count, dtype: int64

Check key mismatches.

fron.analyze_join_keys(tmp_df_sales_agg, tmp_df_months_with_buys, "customer_unique_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	93,235	93,235	93,235

Merge tables.

tmp_df_sales_agg = tmp_df_sales_agg.merge(tmp_df_months_with_buys, on='customer_unique_id', how='left')

Consecutive Active Months

Count streaks of ≥2 months.

tmp_df_max_months_with_buys = df_sales[['order_purchase_dt', 'customer_unique_id', 'order_id']]
tmp_df_max_months_with_buys['year_month'] = tmp_df_max_months_with_buys.order_purchase_dt.dt.to_period('M')
tmp_df_max_months_with_buys = tmp_df_max_months_with_buys.drop_duplicates(['customer_unique_id', 'year_month'])

tmp_df_max_months_with_buys = tmp_df_max_months_with_buys.sort_values(['customer_unique_id', 'year_month'])
tmp_df_max_months_with_buys['diff'] = tmp_df_max_months_with_buys.groupby('customer_unique_id')['year_month'].diff()
tmp_df_max_months_with_buys.dropna(subset='diff', inplace=True)
tmp_df_max_months_with_buys.fillna(0, inplace=True)
tmp_df_max_months_with_buys['is_diff_one_month'] = tmp_df_max_months_with_buys['diff'] == pd.offsets.MonthEnd(1)

def get_max_consecutive_repeats_plus_on(x):
    res = x.ne(x.shift()).cumsum()[x].value_counts().max() + 1
    return res

tmp_df_max_months_with_buys = (
    tmp_df_max_months_with_buys.groupby('customer_unique_id')
    .agg(max_consecutive_months_with_buys = ('is_diff_one_month', get_max_consecutive_repeats_plus_on))
    .dropna()
    .astype(int)
    .reset_index()
)

Check key mismatches.

fron.analyze_join_keys(tmp_df_sales_agg, tmp_df_max_months_with_buys, "customer_unique_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	92,786	0	93,235	449	93,235

We have many users who made purchases in only one month, so they have a missing value in the new field. We will replace it with 1 after merging.

Merge tables.

tmp_df_sales_agg = tmp_df_sales_agg.merge(tmp_df_max_months_with_buys, on='customer_unique_id', how='left')

Replace NA with 1.

tmp_df_sales_agg.max_consecutive_months_with_buys = tmp_df_sales_agg.max_consecutive_months_with_buys.fillna(1).astype(int)
tmp_df_sales_agg.max_consecutive_months_with_buys.value_counts(dropna=False)

max_consecutive_months_with_buys
  92786
    438
      8
      1
      1
      1
Name: count, dtype: int64

Merge with df_customers

Check key mismatches.

fron.analyze_join_keys(df_customers, tmp_df_sales_agg, "customer_unique_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	2,539	0	95,774	93,235	95,774

Missing values occur for customers with only canceled orders.

Merge tables.

df_customers = df_customers.merge(tmp_df_sales_agg, on='customer_unique_id', how='left')

From Table df_geolocations#

Check key mismatches.

fron.analyze_join_keys(df_customers, df_geolocations, left_on='customer_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:1	156	4,193	95,774	19,015	95,774

As we already found out, there are customers whose prefixes are not present in the geolocation table.

We need full prefixes only to calculate the distance between the customer and the seller.

For geo-analysis, we will use the truncated prefixes. Let’s check if there are any missing rows.

fron.analyze_join_keys(df_customers, df_geolocations, left_on='customer_zip_code_prefix_3_digits', right_on='geolocation_zip_code_prefix_3_digits', how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:M	0	5	95,774	19,015	3,109,891

With geo-analysis, there will be no problems. The issue is only with calculating the distance.

We can try to fill in the coordinates based on the truncated prefixes and calculate the coordinates.

However, this will clearly result in significant error, and we want to preserve accuracy.

If we fill in these missing values, we will introduce more distortion into the data than gain any benefit.

Given that there are few missing rows, we will simply leave them as is and analyze the distance without them.

Merge tables.

df_customers = df_customers.merge(df_geolocations, left_on='customer_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')
df_customers.rename(columns={'geolocation_lat': 'lat_customer', 'geolocation_lng': 'lng_customer'}, inplace=True)

df_customers.drop(['geolocation_zip_code_prefix', 'geolocation_zip_code_prefix_3_digits'], axis=1, inplace=True)

Let’s look at the missing values.

df_customers.explore.detect_anomalies()

Missings by Column
	Count	Percent
buys_cnt	2539	2.65%
avg_delivery_time_days	2670	2.79%
avg_delivery_delay_days	2670	2.79%
delayed_orders_share	2539	2.65%
purchase_weekend_share	2539	2.65%
repeat_purchase_share	2539	2.65%
avg_payments_cnt	2539	2.65%
total_customer_payment	2539	2.65%
avg_total_order_payment	2539	2.65%
avg_individual_payment	2539	2.65%
installment_orders_share	2539	2.65%
avg_products_cnt	2539	2.65%
avg_unique_products_cnt	2539	2.65%
avg_sellers_cnt	2539	2.65%
avg_order_total_products_price	2539	2.65%
avg_total_order_price	2539	2.65%
avg_products_price	2539	2.65%
total_products_price	2539	2.65%
avg_order_total_freight_value	2539	2.65%
avg_order_total_weight_kg	2539	2.65%
avg_order_total_volume_cm3	2539	2.65%
free_shipping_share	2539	2.65%
reviews_cnt	2539	2.65%
customer_avg_reviews_score	2539	2.65%
avg_distance_km	2992	3.12%
first_purchase_dt	2539	2.65%
last_purchase_dt	2539	2.65%
from_first_to_second_days	92982	97.08%
from_first_to_last_days	92982	97.08%
customer_top_purchase_weekdays	2539	2.65%
customer_payment_types	2539	2.65%
customer_top_product_categories	2539	2.65%
customer_top_general_product_categories	2539	2.65%
avg_buys_diff_days	92982	97.08%
months_with_buys	2539	2.65%
max_consecutive_months_with_buys	2539	2.65%
lat_customer	268	0.28%
lng_customer	268	0.28%
in_south_america	268	0.28%

All missing values are expected.

Missing values are for customers who did not make any successful purchases.

We will replace the missing values in the measurements with ‘Never Converted’.

df_customers['customer_top_purchase_weekdays'] = (
    df_customers['customer_top_purchase_weekdays']
    .cat.add_categories('Never Converted')
    .fillna('Never Converted')
)
df_customers['customer_payment_types'] = (
    df_customers['customer_payment_types']
    .cat.add_categories('Never Converted')
    .fillna('Never Converted')
)

df_customers['customer_top_product_categories'] = (
    df_customers['customer_top_product_categories']
    .cat.add_categories('Never Converted')
    .fillna('Never Converted')
)
df_customers['customer_top_general_product_categories'] = (
    df_customers['customer_top_general_product_categories']
    .cat.add_categories('Never Converted')
    .fillna('Never Converted')
)

Customer Segmentation#

Based on customer metrics, we will segment the customers.

**Monetary Value **

Examine quantiles in the total_customer_payment column.

df_customers.total_customer_payment.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00        9.59
05       32.69
25       63.06
50      107.78
75      182.46
95      469.33
00   13,664.08
Name: total_customer_payment, dtype: float64

We create the following segments:

Low value: payments up to 63 R$ inclusive
Medium value: payments from 63 to 182 R$ inclusive
High value: payments above 182 R$

labels = ['Low', 'Medium', 'High']
bins = [-np.inf, 63, 182, np.inf]

df_customers['value_segment'] = (
    df_customers.total_customer_payment.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted')
)

	Count
Medium	46579
High	23370
Low	23286
Never Converted	2539

Activity

We identify the following segments:

Non-converted: customers who didn’t complete any successful purchases
Core customers:
- Completed 3 or more purchases
- Time between first and last purchase is 60 days or more
Potential core:
- Completed 2 or more purchases
- Time between first and last purchase is 30 days or less
Short-term repeaters:
- Completed more than 2 purchases
- Time between first and last purchase is less than 30 days
One-time buyers: completed only one purchase

now = df_sales.order_purchase_dt.max()

conditions = [
    df_customers['first_purchase_dt'].isna() # No Successful Purchases
    , (df_customers['buys_cnt'] >= 3) & (df_customers['from_first_to_last_days'] >= 60) # Core
    , (df_customers['buys_cnt'] >= 2) & (df_customers['from_first_to_last_days'] >= 30) # Potential Core
    , (df_customers['buys_cnt'] >= 2) & (df_customers['from_first_to_last_days'] < 30) # Short-Lived Repeat
    , (df_customers['buys_cnt'] == 1) # One-Time
]
choices = ['Never Converted', 'Core', 'Potential Core', 'Short-Lived Repeat', 'One Time']
df_customers['activity_segment'] = np.select(conditions, choices, default='Other')
df_customers['activity_segment'] = df_customers['activity_segment'].astype('category')
df_customers['activity_segment'].value_counts(dropna=False)

activity_segment
One Time              90443
Never Converted        2539
Short-Lived Repeat     1341
Potential Core         1293
Core                    158
Name: count, dtype: int64

Examine quantiles in the avg_buys_diff_days column.

df_customers.avg_buys_diff_days.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00     0.00
05     0.00
25     0.01
50    32.60
75   124.55
95   311.36
00   582.86
Name: avg_buys_diff_days, dtype: float64

We create the following segments:

Weekly
Monthly
Quarterly
Semiannual
Annual

labels = ['Weekly', 'Monthly', 'Quarterly', 'Semiannual', 'Annual']
bins = [-np.inf, 7, 30, 90, 180, np.inf]
df_customers['purchase_freq_segment'] = (
    df_customers.avg_buys_diff_days.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins)
)

	Count
NaN	92982
Weekly	971
Quarterly	534
Annual	460
Semiannual	431
Monthly	396

Missing values require two-step replacement as they occur for both non-converted customers and those without any purchases.

mask_one_time = df_customers['buys_cnt'] == 1
mask_never_converted = df_customers['buys_cnt'].isna()

df_customers['purchase_freq_segment'] = (
    df_customers['purchase_freq_segment'].cat.add_categories(['Non-Repeating', 'Never Converted'])
)

df_customers.loc[mask_one_time, 'purchase_freq_segment'] = 'Non-Repeating'
df_customers.loc[mask_never_converted, 'purchase_freq_segment'] = 'Never Converted'
df_customers['purchase_freq_segment'].value_counts()

purchase_freq_segment
Non-Repeating      90443
Never Converted     2539
Weekly               971
Quarterly            534
Annual               460
Semiannual           431
Monthly              396
Name: count, dtype: int64

Examine quantiles in the from_first_to_second_days column.

df_customers.from_first_to_second_days.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00     0.00
05     0.00
25     0.00
50    28.91
75   125.49
95   318.98
00   582.86
Name: from_first_to_second_days, dtype: float64

We create the following segments:

Fast repeaters: repurchase within 14 days
Medium repeaters: repurchase in 14-60 days
Slow repeaters: repurchase after 60 days

labels = ['Fast Repeat', 'Medium Repeat', 'Slow Repeat']
bins = [-np.inf, 14, 60, np.inf]

df_customers['repeat_segment'] = (
    df_customers.from_first_to_second_days.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins)
)

	Count
NaN	92982
Fast Repeat	1169
Slow Repeat	1087
Medium Repeat	536

We replace missing values.

df_customers['repeat_segment'] = (
    df_customers['repeat_segment'].cat.add_categories(['Non-Repeating', 'Never Converted'])
)

df_customers.loc[mask_one_time, 'repeat_segment'] = 'Non-Repeating'
df_customers.loc[mask_never_converted, 'repeat_segment'] = 'Never Converted'
df_customers['repeat_segment'].value_counts()

repeat_segment
Non-Repeating      90443
Never Converted     2539
Fast Repeat         1169
Slow Repeat         1087
Medium Repeat        536
Name: count, dtype: int64

Loyalty

Examine quantiles in the customer_avg_reviews_score column.

df_customers.customer_avg_reviews_score.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00   1.00
05   1.00
25   4.00
50   5.00
75   5.00
95   5.00
00   5.00
Name: customer_avg_reviews_score, dtype: float64

We create the following segments:

Critics: average score below 3
Neutral: average score 3-4
Promoters: average score 5

labels = ['Critic', 'Neutral', 'Promoter']
bins = [-np.inf, 3, 5, np.inf]
df_customers['loyalty_segment'] = (
    df_customers.customer_avg_reviews_score.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted', right=False)
)

	Count
Promoter	54386
Neutral	26534
Critic	12315
Never Converted	2539

Risk Segment

Examine quantiles in the canceled_share column.

df_customers.canceled_share.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00   0.00
05   0.00
25   0.00
50   0.00
75   0.00
95   0.00
00   1.00
Name: canceled_share, dtype: float64

We create the following segments:

Reliable: 0% cancellation rate
Risky: cancellation rate above 0%

labels = ['Reliable', 'Risky']
bins = [-np.inf, 0, np.inf]
df_customers['risk_segment'] = (
    df_customers.canceled_share.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins)
)

	Count
Reliable	95197
Risky	577

Customer Behavioral Characteristics

Examine quantiles in the purchase_weekend_share column.

df_customers.purchase_weekend_share.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00   0.00
05   0.00
25   0.00
50   0.00
75   0.00
95   1.00
00   1.00
Name: purchase_weekend_share, dtype: float64

We create the following segments:

Weekday buyers: 0% weekend purchases
Weekend buyers: weekend purchases above 0%

labels = ['Weekday', 'Weekend']
bins = [-np.inf, 0, np.inf]
df_customers['weekday_segment'] = (
    df_customers.purchase_weekend_share.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted')
)

	Count
Weekday	71456
Weekend	21779
Never Converted	2539

Examine quantiles in the installment_orders_share column.

df_customers.installment_orders_share.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00   0.00
05   0.00
25   0.00
50   1.00
75   1.00
95   1.00
00   1.00
Name: installment_orders_share, dtype: float64

We create the following segments:

Full payment: 0% installment orders
Installment users: installment orders above 0%

labels = ['Full Pay', 'Installment']
bins = [-np.inf, 0, np.inf]
df_customers['installment_segment'] = (
    df_customers.installment_orders_share.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted')
)

	Count
Installment	48124
Full Pay	45111
Never Converted	2539

Order Characteristics

Examine quantiles in the avg_products_cnt column.

df_customers.avg_products_cnt.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00    1.00
05    1.00
25    1.00
50    1.00
75    1.00
95    2.00
00   21.00
Name: avg_products_cnt, dtype: float64

We create the following segments:

Single product buyers: average 1 product per order
Multi-product buyers: average 1-2 products per order
Bulk buyers: average more than 2 products per order

labels = ['Single Product', 'Multi Product', 'Bulk Buyer']
bins = [-np.inf, 1, 2, np.inf]
df_customers['products_cnt_segment'] = (
    df_customers.avg_products_cnt.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted')
)

	Count
Single Product	83774
Multi Product	7372
Never Converted	2539
Bulk Buyer	2089

Examine quantiles in the avg_order_total_weight_kg column.

df_customers.avg_order_total_weight_kg.quantile([0, 0.05, 0.25, 0.5, 0.75, 0.95, 1])

00     0.00
05     0.15
25     0.30
50     0.75
75     2.10
95    10.48
00   184.40
Name: avg_order_total_weight_kg, dtype: float64

We create the following segments:

Light orders: average up to 0.5 kg
Medium orders: average 0.5-2.5 kg
Heavy orders: average more than 2.5 kg

labels = ['Light', 'Medium', 'Heavy']
bins = [-np.inf, 0.5, 2.5, np.inf]
df_customers['weight_segment'] = (
    df_customers.avg_order_total_weight_kg.preproc
    .to_categorical(method='custom_bins', labels=labels, bins=bins, fill_na_value='Never Converted')
)

	Count
Light	37333
Medium	35503
Heavy	20399
Never Converted	2539

Enriching Table df_products#

From Table df_orders#

We will create the following metrics for each product:

Count of canceled orders containing this product

tmp_df_products_canceled = (
    df_orders[lambda x: x.order_status == 'Canceled'][['order_id']]
    .merge(df_items, on='order_id', how='left')
    .groupby('product_id', as_index=False)
    .agg(
        product_canceled_orders_cnt = ('order_id', 'nunique')
    )
)

We will merge this with the main products dataframe.

Check key mismatches.

fron.analyze_join_keys(df_products, tmp_df_products_canceled, "product_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	32,528	0	32,951	423	32,951

Missing values will appear for products that were not part of canceled orders.

Merge tables.

df_products = df_products.merge(tmp_df_products_canceled, on='product_id', how='left')

From Table df_sales#

We will create the following additional metrics for each product:

Count of orders containing this product
Total units sold of this product
Total sales value for this product
Average quantity of this product per order
Average share of this product in order by quantity
Average share of this product in order by value
Average price of product over all time
Maximum price of product
Minimum price of product
Price range

tmp_df_products = (
    df_sales[['order_id']].merge(df_items, on='order_id', how='left')
    .groupby('product_id', as_index=False)
    .agg(
        product_sales_cnt = ('order_id', 'nunique')
        , total_units_sold = ('order_item_id', 'count')
        , total_sales_amount = ('price', 'sum')
        , avg_price = ('price', 'mean')
        , min_price = ('price', 'min')
        , max_price = ('price', 'max')
    )
)
tmp_df_products['price_range'] = tmp_df_products['max_price'] - tmp_df_products['min_price']

We will calculate average shares in orders.

tmp_df_products_share = (df_sales[['order_id']].merge(df_items, on='order_id', how='left')
                      .groupby(['order_id', 'product_id'], as_index=False)
                      .agg(
                          product_qty = ('order_item_id', 'count')
                          , product_total_price = ('price', 'sum')
                      )
)
tmp_df_products_share['products_cnt'] = tmp_df_products_share.groupby('order_id').product_qty.transform('sum')
tmp_df_products_share['order_total_price'] = tmp_df_products_share.groupby('order_id').product_total_price.transform('sum')
tmp_df_products_share['product_qty_share_per_order'] = tmp_df_products_share['product_qty'] / tmp_df_products_share['products_cnt']
tmp_df_products_share['order_total_price_share_per_order'] = tmp_df_products_share['product_total_price'] / tmp_df_products_share['order_total_price']

tmp_df_products_share = (tmp_df_products_share.groupby('product_id', as_index=False)
                      .agg(
                          avg_product_qty_per_order = ('product_qty', 'mean')
                          , avg_product_qty_share_per_order = ('product_qty_share_per_order', 'mean')
                          , avg_order_total_price_share_per_order = ('order_total_price_share_per_order', 'mean')
                      )
)

Merge tables.

Check key mismatches.

fron.analyze_join_keys(tmp_df_products, tmp_df_products_share, "product_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	32,122	32,122	32,122

Merge tables.

tmp_df_products = tmp_df_products.merge(tmp_df_products_share, on='product_id', how='left')

We will merge this with the main products dataframe.

Check key mismatches.

fron.analyze_join_keys(df_products, tmp_df_products, "product_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	829	0	32,951	32,122	32,951

Missing values will appear for products that were never sold.

Merge tables.

df_products = df_products.merge(tmp_df_products, on='product_id', how='left')

Let’s look at the missing values.

df_products.explore.detect_anomalies()

Missings by Column
	Count	Percent
product_name_lenght	610	1.85%
product_description_lenght	610	1.85%
product_canceled_orders_cnt	32528	98.72%
product_sales_cnt	829	2.52%
total_units_sold	829	2.52%
total_sales_amount	829	2.52%
avg_price	829	2.52%
min_price	829	2.52%
max_price	829	2.52%
price_range	829	2.52%
avg_product_qty_per_order	829	2.52%
avg_product_qty_share_per_order	829	2.52%
avg_order_total_price_share_per_order	829	2.52%

All missing values are expected.

Enriching Table df_sellers#

From Tables df_items and df_products#

From Table df_items we will select only sales.

Check key mismatches.

fron.analyze_join_keys(df_items, df_sales[['order_id']], "order_id", how='inner')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Inner join size
N:1	2,007	0	112,279	96,346	110,021

tmp_df_items = df_items.merge(df_sales[['order_id']], on='order_id', how='inner')

We will add product information.

Check key mismatches.

fron.analyze_join_keys(tmp_df_items, df_products, "product_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:1	0	829	110,021	32,951	110,021

Merge tables.

tmp_df_items_prods = tmp_df_items.merge(df_products, on='product_id', how='left')

We will create the following metrics for each seller:

Total products sold
Count of unique products sold
Number of orders
Total sales value
Average carrier handoff delay
Average number of items per order
Average order value
Average item price
Average product weight

tmp_df_items_prods_agg_1 = (
    tmp_df_items_prods.groupby('seller_id', as_index=False)
    .agg(
        products_cnt = ('product_id', 'count')
        , unique_products_cnt = ('product_id', 'nunique')
        , orders_cnt = ('order_id', 'nunique')
        , revenue = ('price', 'sum')
        , avg_carrier_delivery_delay_days = ('carrier_delivery_delay_days', 'mean')
    )
)

tmp_df_items_prods_agg_2 = (
    tmp_df_items_prods.groupby(['seller_id', 'order_id'], as_index=False)
    .agg(
        products_cnt = ('product_id', 'count')
        , order_total_price = ('price', 'sum')
        , avg_product_price = ('price', 'mean')
    )
    .groupby('seller_id', as_index=False)
    .agg(
        avg_prouducts_cnt = ('products_cnt', 'mean')
        , avg_order_total_price = ('order_total_price', 'mean')
        , avg_product_price = ('avg_product_price', 'mean')
    )
)

Important note for average weight calculation:

We must remove duplicates to calculate averages based on unique products only.

tmp_df_items_prods_agg_3 = (
     tmp_df_items_prods.drop_duplicates(subset = ['seller_id', 'product_id'])
     .groupby('seller_id', as_index=False)
     .agg(
          avg_product_weight_kg = ('product_weight_g', 'mean')
     )
)

tmp_df_items_prods_agg_3['avg_product_weight_kg'] = (tmp_df_items_prods_agg_3['avg_product_weight_kg'] / 1000).round(2)

We will merge intermediate dataframes.

fron.analyze_join_keys(tmp_df_items_prods_agg_1, tmp_df_items_prods_agg_2, "seller_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	2,947	2,947	2,947

tmp_df_items_prods_agg = tmp_df_items_prods_agg_1.merge(tmp_df_items_prods_agg_2, on='seller_id', how='left')

fron.analyze_join_keys(tmp_df_items_prods_agg, tmp_df_items_prods_agg_3, "seller_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	0	0	2,947	2,947	2,947

tmp_df_items_prods_agg = tmp_df_items_prods_agg.merge(tmp_df_items_prods_agg_3, on='seller_id', how='left')

We will add new fields to df_sellers.

Check key mismatches.

fron.analyze_join_keys(df_sellers, tmp_df_items_prods_agg, "seller_id", how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
1:1	148	0	3,095	2,947	3,095

Sellers whose products were never purchased (or whose orders were canceled) will have missing values. This is expected.

Merge tables.

df_sellers = df_sellers.merge(tmp_df_items_prods_agg, on='seller_id', how='left')

From Table df_geolocations#

Check key mismatches.

fron.analyze_join_keys(df_sellers, df_geolocations, left_on='seller_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:1	7	16,776	3,095	19,015	3,095

As previously established, some sellers’ zip prefixes are missing from the geolocation table.

We will check using truncated prefixes.

fron.analyze_join_keys(df_sellers, df_geolocations, left_on='seller_zip_code_prefix_3_digits', right_on='geolocation_zip_code_prefix_3_digits', how='left')

Join info
Type	Left-only keys	Right-only keys	Left size	Right size	Left join size
N:M	0	218	3,095	19,015	98,152

We will handle this the same way as with customers:

Keep as-is for geo-analysis (no missing data)
Maintain precision for distance calculations (accept missing values)
Use left join to preserve all sellers

df_sellers = df_sellers.merge(df_geolocations, left_on='seller_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')
df_sellers.rename(columns={'geolocation_lat': 'lat_seller', 'geolocation_lng': 'lng_seller'}, inplace=True)

df_sellers.drop(['geolocation_zip_code_prefix', 'geolocation_zip_code_prefix_3_digits'], axis=1, inplace=True)

Let’s look at the missing values.

df_sellers.explore.detect_anomalies()

Missings by Column
	Count	Percent
products_cnt	148	4.78%
unique_products_cnt	148	4.78%
orders_cnt	148	4.78%
revenue	148	4.78%
avg_carrier_delivery_delay_days	149	4.81%
avg_prouducts_cnt	148	4.78%
avg_order_total_price	148	4.78%
avg_product_price	148	4.78%
avg_product_weight_kg	148	4.78%
lat_seller	7	0.23%
lng_seller	7	0.23%
in_south_america	7	0.23%

We will clear temporary variables from memory.

for var_name in list(globals().keys()):
    if var_name.startswith('tmp_'):
        del globals()[var_name]

Data Merging

On this page

Data Merging#

Enriching Table df_items#

Enriching Table df_orders#

From Table df_payments#

From Tables df_items and df_products#

From Table df_customers#

From Table df_reviews#

Creating New Dimensions#

Creating Table df_sales#

Enriching Table df_customers#

From Table df_orders#

From Table df_sales#

From Table df_geolocations#

Customer Segmentation#

Enriching Table df_products#

From Table df_orders#

From Table df_sales#

Enriching Table df_sellers#

From Tables df_items and df_products#

From Table df_geolocations#