Missing Value Handling

Missing Value Handling#

Table orders#

Let’s examine missing values.

tmp_miss = df_orders.explore.detect_anomalies(return_mode='by_column')
Missings by Column
  Count Percent
order_approved_dt 135 0.14%
order_delivered_carrier_dt 1716 1.73%
order_delivered_customer_dt 2888 2.91%

To calculate time intervals between different dates, we should ideally fill missing values between dates if the next stage’s date is present.

We’ll replace missing timestamps.

order_delivered_carrier_dt

Let’s look at orders missing the carrier handoff date but having a delivery date or delivery status.

tmp_miss['order_delivered_carrier_dt'].order_status.value_counts()
order_status
Unavailable    602
Canceled       510
Processing     299
Invoiced       296
Created          5
Approved         2
Delivered        2
Shipped          0
Name: count, dtype: int64
tmp_mask = lambda x: (x.order_delivered_carrier_dt.isna()) & (x.order_delivered_customer_dt.notna() | (x.order_status == 'Delivered'))
df_orders[tmp_mask].head()
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt
73222 2aa91108853cecb43c84a5dc5b277475 afeb16c7f46396c0ed54acb45ccaaa40 Delivered 2017-09-29 08:52:58 2017-09-29 09:07:16 NaT 2017-11-20 19:44:47 2017-11-14
92643 2d858f451373b04fb5c984a1cc2defaf e08caf668d499a6d643dafd7c5cc498a Delivered 2017-05-25 23:22:43 2017-05-25 23:30:16 NaT NaT 2017-06-23

There are only 2 such orders—we’ll replace them with the median value.

order_carrier_time = (df_orders.order_delivered_carrier_dt - df_orders.order_approved_dt).median()
order_carrier_time
Timedelta('1 days 19:28:14.500000')
df_orders.loc[tmp_mask, 'order_delivered_carrier_dt'] = df_orders.order_approved_dt + order_carrier_time

order_approved_dt

Let’s examine orders missing the payment approval date but having a carrier handoff timestamp.

tmp_miss['order_approved_dt'].order_status.value_counts()
order_status
Canceled       116
Delivered       14
Created          5
Approved         0
Invoiced         0
Processing       0
Shipped          0
Unavailable      0
Name: count, dtype: int64
tmp_mask = lambda x: x.order_approved_dt.isna() & x.order_delivered_carrier_dt.notna()
print(f'Rows count: {df_orders[tmp_mask].shape[0]}')
df_orders[tmp_mask].head()
Rows count: 14
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt
5323 e04abd8149ef81b95221e88f6ed9ab6a 2127dc6603ac33544953ef05ec155771 Delivered 2017-02-18 14:40:00 NaT 2017-02-23 12:04:47 2017-03-01 13:25:33 2017-03-17
16567 8a9adc69528e1001fc68dd0aaebbb54a 4c1ccc74e00993733742a3c786dc3c1f Delivered 2017-02-18 12:45:31 NaT 2017-02-23 09:01:52 2017-03-02 10:05:06 2017-03-21
19031 7013bcfc1c97fe719a7b5e05e61c12db 2941af76d38100e0f8740a374f1a5dc3 Delivered 2017-02-18 13:29:47 NaT 2017-02-22 16:25:25 2017-03-01 08:07:38 2017-03-17
22663 5cf925b116421afa85ee25e99b4c34fb 29c35fc91fc13fb5073c8f30505d860d Delivered 2017-02-18 16:48:35 NaT 2017-02-22 11:23:10 2017-03-09 07:28:47 2017-03-31
23156 12a95a3c06dbaec84bcfb0e2da5d228a 1e101e0daffaddce8159d25a8e53f2b2 Delivered 2017-02-17 13:05:55 NaT 2017-02-22 11:23:11 2017-03-02 11:09:19 2017-03-20

Since there are only 14 such orders, we’ll simply fill the gaps using this time.

Let’s check the median time from order creation to payment approval.

order_processing_time_hours = (df_orders.order_approved_dt - df_orders.order_purchase_dt).median()
order_processing_time_hours
Timedelta('0 days 00:20:31')
df_orders.loc[tmp_mask, 'order_approved_dt'] = df_orders.order_purchase_dt + order_processing_time_hours

order_delivered_customer_dt

Let’s examine delivered orders missing the customer receipt date.

tmp_miss['order_delivered_customer_dt'].order_status.value_counts()
order_status
Shipped        1097
Unavailable     602
Canceled        579
Processing      299
Invoiced        296
Delivered         8
Created           5
Approved          2
Name: count, dtype: int64
tmp_miss['order_delivered_customer_dt'][lambda x: x.order_status == 'Delivered']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt
3002 2d1e2d5bf4dc7227b3bfebb81328c15f ec05a6d8558c6455f0cbbd8a420ad34f Delivered 2017-11-28 17:44:07 2017-11-28 17:56:40 2017-11-30 18:12:23 NaT 2017-12-18
20618 f5dd62b788049ad9fc0526e3ad11a097 5e89028e024b381dc84a13a3570decb4 Delivered 2018-06-20 06:58:43 2018-06-20 07:19:05 2018-06-25 08:05:00 NaT 2018-07-16
43834 2ebdfc4f15f23b91474edf87475f108e 29f0540231702fda0cfdee0a310f11aa Delivered 2018-07-01 17:05:11 2018-07-01 17:15:12 2018-07-03 13:57:00 NaT 2018-07-30
79263 e69f75a717d64fc5ecdfae42b2e8e086 cfda40ca8dd0a5d486a9635b611b398a Delivered 2018-07-01 22:05:55 2018-07-01 22:15:14 2018-07-03 13:57:00 NaT 2018-07-30
82868 0d3268bad9b086af767785e3f0fc0133 4f1d63d35fb7c8999853b2699f5c7649 Delivered 2018-07-01 21:14:02 2018-07-01 21:29:54 2018-07-03 09:28:00 NaT 2018-07-24
92643 2d858f451373b04fb5c984a1cc2defaf e08caf668d499a6d643dafd7c5cc498a Delivered 2017-05-25 23:22:43 2017-05-25 23:30:16 NaT NaT 2017-06-23
97647 ab7c89dc1bf4a1ead9d6ec1ec8968a84 dd1b84a7286eb4524d52af4256c0ba24 Delivered 2018-06-08 12:09:39 2018-06-08 12:36:39 2018-06-12 14:10:00 NaT 2018-06-26
98038 20edc82cf5400ce95e1afacc25798b31 28c37425f1127d887d7337f284080a0f Delivered 2018-06-27 16:09:12 2018-06-27 16:29:30 2018-07-03 19:26:00 NaT 2018-07-19

There are only 8 such orders—we’ll also replace them with the median value.

order_customer_time = (df_orders.order_delivered_customer_dt - df_orders.order_delivered_carrier_dt).median()
order_customer_time
Timedelta('7 days 02:29:46')
tmp_mask = df_orders.order_delivered_customer_dt.isna() & (df_orders.order_status=='Delivered')
df_orders.loc[tmp_mask, 'order_delivered_customer_dt'] = df_orders.order_delivered_carrier_dt + order_customer_time

Missing values in review_comment_title and review_comment_message indicate these fields were left blank, so processing them is unnecessary.

Table df_products#

Let’s examine missing values in df_products.

tmp_miss = df_products.explore.detect_anomalies(return_mode='by_column')
Missings by Column
  Count Percent
product_category_name 610 1.85%
product_name_lenght 610 1.85%
product_description_lenght 610 1.85%
product_photos_qty 610 1.85%
product_weight_g 2 0.01%
product_length_cm 2 0.01%
product_height_cm 2 0.01%
product_width_cm 2 0.01%

In product_category_name, we’ll replace missing values with ‘Missing in Products’.

df_products['product_category_name'] = df_products['product_category_name'].cat.add_categories(['Missing in Products'])
df_products['product_category_name'] = df_products['product_category_name'].fillna('Missing in Products')

Since only 2 orders lack specifications, we’ll replace them with the median value for their product category.

Let’s check group sizes if we replace within these groups.

category_columns = 'product_category_name'
df_products['product_weight_g'].preproc.check_group_counts(category_columns=category_columns)
============================ Group Analysis Report =============================
Grouping columns: product_category_name
Value column: product_weight_g

Total groups:                            74
Groups with missing values:              2.7%
Groups with ALL values missing:          0.0%
Total missing values:                    2
Missing in non-empty groups:             2

---------------------------- Group Size Statistics -----------------------------
Mean group size:               445.3
Median group size:             94.5
Minimum group size:            1
Maximum group size:            3,029
Standard deviation:            731.1

-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value:   2.7%
Groups with 2-5 missing values: 0.0%
Groups with 5+ missing values: 0.0%

---------------- Threshold Analysis (ontly groups with missings)----------------
Groups with 5+ elements:       100.0%
Groups with 10+ elements:      100.0%
Groups with 20+ elements:      100.0%
Groups with 30+ elements:      100.0%
Groups with 40+ elements:      100.0%
Groups with 50+ elements:      100.0%
================================================================================

The group sizes are sufficient, so we’ll replace with the group median.

df_products[df_products[['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']].isna().any(axis=1)]
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
8578 09ff539a621711667c43eba6a3bd8466 bebes 60.00 865.00 3.00 NaN NaN NaN NaN
18851 5eb564652db742ff8f28759cd8d2652a Missing in Products NaN NaN NaN NaN NaN NaN NaN
df_products['product_weight_g'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_length_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_height_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_width_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)

Missing values in product name length and product description length won’t be filled, as we won’t use them for analysis.

Missing values in photo count will be replaced with 1, as it’s both the median and mode.

df_products.loc[df_products.product_photos_qty.isna(), 'product_photos_qty'] = 1

Let’s check how many missing values remain.

df_products.explore.detect_anomalies()
Missings by Column
  Count Percent
product_name_lenght 610 1.85%
product_description_lenght 610 1.85%
del tmp_miss