Missing Value Handling#
Table df_orders#
Let’s examine the missing values in df_orders.
tmp_miss = df_orders.explore.detect_anomalies(return_mode='by_column')
| | Count | Percent |
| --- | --- | --- |
| order_approved_dt | 135 | 0.14% |
| order_delivered_carrier_dt | 1716 | 1.73% |
| order_delivered_customer_dt | 2888 | 2.91% |
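Here detect_anomalies is a custom exploration accessor; as a rough plain-pandas sketch of the Count/Percent table it prints (the helper name missing_summary is ours, not part of the library):
import pandas as pd

# Rough stand-in for the summary above (helper name is hypothetical).
def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    counts = df.isna().sum()
    out = pd.DataFrame({'Count': counts,
                        'Percent': (counts / len(df) * 100).round(2).astype(str) + '%'})
    return out[out['Count'] > 0]

missing_summary(df_orders)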
To calculate time intervals between order stages, we should fill a missing timestamp whenever a later stage’s timestamp (or status) shows the stage actually happened.
We’ll impute those missing timestamps below.
order_delivered_carrier_dt
Let’s look at orders that are missing the carrier handoff date but have a customer delivery date or the 'Delivered' status.
tmp_miss['order_delivered_carrier_dt'].order_status.value_counts()
order_status
Unavailable 602
Canceled 510
Processing 299
Invoiced 296
Created 5
Approved 2
Delivered 2
Shipped 0
Name: count, dtype: int64
tmp_mask = lambda x: (x.order_delivered_carrier_dt.isna()) & (x.order_delivered_customer_dt.notna() | (x.order_status == 'Delivered'))
df_orders[tmp_mask].head()
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 73222 | 2aa91108853cecb43c84a5dc5b277475 | afeb16c7f46396c0ed54acb45ccaaa40 | Delivered | 2017-09-29 08:52:58 | 2017-09-29 09:07:16 | NaT | 2017-11-20 19:44:47 | 2017-11-14 |
| 92643 | 2d858f451373b04fb5c984a1cc2defaf | e08caf668d499a6d643dafd7c5cc498a | Delivered | 2017-05-25 23:22:43 | 2017-05-25 23:30:16 | NaT | NaT | 2017-06-23 |
There are only 2 such orders, so we’ll fill the missing carrier handoff date with the approval date plus the median approval-to-handoff interval.
order_carrier_time = (df_orders.order_delivered_carrier_dt - df_orders.order_approved_dt).median()
order_carrier_time
Timedelta('1 days 19:28:14.500000')
df_orders.loc[tmp_mask, 'order_delivered_carrier_dt'] = df_orders.order_approved_dt + order_carrier_time
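As a quick sanity check (our own addition, not in the original notebook), the callable mask re-evaluates against the updated frame, so it should no longer select any rows:
# Empty result confirms every matching order now has a carrier handoff timestamp.
assert df_orders[tmp_mask].empty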
order_approved_dt
Let’s examine orders missing the payment approval date but having a carrier handoff timestamp.
tmp_miss['order_approved_dt'].order_status.value_counts()
order_status
Canceled 116
Delivered 14
Created 5
Approved 0
Invoiced 0
Processing 0
Shipped 0
Unavailable 0
Name: count, dtype: int64
tmp_mask = lambda x: x.order_approved_dt.isna() & x.order_delivered_carrier_dt.notna()
print(f'Rows count: {df_orders[tmp_mask].shape[0]}')
df_orders[tmp_mask].head()
Rows count: 14
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5323 | e04abd8149ef81b95221e88f6ed9ab6a | 2127dc6603ac33544953ef05ec155771 | Delivered | 2017-02-18 14:40:00 | NaT | 2017-02-23 12:04:47 | 2017-03-01 13:25:33 | 2017-03-17 |
| 16567 | 8a9adc69528e1001fc68dd0aaebbb54a | 4c1ccc74e00993733742a3c786dc3c1f | Delivered | 2017-02-18 12:45:31 | NaT | 2017-02-23 09:01:52 | 2017-03-02 10:05:06 | 2017-03-21 |
| 19031 | 7013bcfc1c97fe719a7b5e05e61c12db | 2941af76d38100e0f8740a374f1a5dc3 | Delivered | 2017-02-18 13:29:47 | NaT | 2017-02-22 16:25:25 | 2017-03-01 08:07:38 | 2017-03-17 |
| 22663 | 5cf925b116421afa85ee25e99b4c34fb | 29c35fc91fc13fb5073c8f30505d860d | Delivered | 2017-02-18 16:48:35 | NaT | 2017-02-22 11:23:10 | 2017-03-09 07:28:47 | 2017-03-31 |
| 23156 | 12a95a3c06dbaec84bcfb0e2da5d228a | 1e101e0daffaddce8159d25a8e53f2b2 | Delivered | 2017-02-17 13:05:55 | NaT | 2017-02-22 11:23:11 | 2017-03-02 11:09:19 | 2017-03-20 |
Since there are only 14 such orders, we’ll simply fill the gaps with the median time from order placement to payment approval.
Let’s compute that median.
order_processing_time_hours = (df_orders.order_approved_dt - df_orders.order_purchase_dt).median()
order_processing_time_hours
Timedelta('0 days 00:20:31')
df_orders.loc[tmp_mask, 'order_approved_dt'] = df_orders.order_purchase_dt + order_processing_time_hours
order_delivered_customer_dt
Let’s examine delivered orders missing the customer receipt date.
tmp_miss['order_delivered_customer_dt'].order_status.value_counts()
order_status
Shipped 1097
Unavailable 602
Canceled 579
Processing 299
Invoiced 296
Delivered 8
Created 5
Approved 2
Name: count, dtype: int64
tmp_miss['order_delivered_customer_dt'][lambda x: x.order_status == 'Delivered']
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3002 | 2d1e2d5bf4dc7227b3bfebb81328c15f | ec05a6d8558c6455f0cbbd8a420ad34f | Delivered | 2017-11-28 17:44:07 | 2017-11-28 17:56:40 | 2017-11-30 18:12:23 | NaT | 2017-12-18 |
| 20618 | f5dd62b788049ad9fc0526e3ad11a097 | 5e89028e024b381dc84a13a3570decb4 | Delivered | 2018-06-20 06:58:43 | 2018-06-20 07:19:05 | 2018-06-25 08:05:00 | NaT | 2018-07-16 |
| 43834 | 2ebdfc4f15f23b91474edf87475f108e | 29f0540231702fda0cfdee0a310f11aa | Delivered | 2018-07-01 17:05:11 | 2018-07-01 17:15:12 | 2018-07-03 13:57:00 | NaT | 2018-07-30 |
| 79263 | e69f75a717d64fc5ecdfae42b2e8e086 | cfda40ca8dd0a5d486a9635b611b398a | Delivered | 2018-07-01 22:05:55 | 2018-07-01 22:15:14 | 2018-07-03 13:57:00 | NaT | 2018-07-30 |
| 82868 | 0d3268bad9b086af767785e3f0fc0133 | 4f1d63d35fb7c8999853b2699f5c7649 | Delivered | 2018-07-01 21:14:02 | 2018-07-01 21:29:54 | 2018-07-03 09:28:00 | NaT | 2018-07-24 |
| 92643 | 2d858f451373b04fb5c984a1cc2defaf | e08caf668d499a6d643dafd7c5cc498a | Delivered | 2017-05-25 23:22:43 | 2017-05-25 23:30:16 | NaT | NaT | 2017-06-23 |
| 97647 | ab7c89dc1bf4a1ead9d6ec1ec8968a84 | dd1b84a7286eb4524d52af4256c0ba24 | Delivered | 2018-06-08 12:09:39 | 2018-06-08 12:36:39 | 2018-06-12 14:10:00 | NaT | 2018-06-26 |
| 98038 | 20edc82cf5400ce95e1afacc25798b31 | 28c37425f1127d887d7337f284080a0f | Delivered | 2018-06-27 16:09:12 | 2018-06-27 16:29:30 | 2018-07-03 19:26:00 | NaT | 2018-07-19 |
There are only 8 such orders, so we’ll fill them the same way: carrier handoff date plus the median handoff-to-delivery interval.
order_customer_time = (df_orders.order_delivered_customer_dt - df_orders.order_delivered_carrier_dt).median()
order_customer_time
Timedelta('7 days 02:29:46')
tmp_mask = df_orders.order_delivered_customer_dt.isna() & (df_orders.order_status=='Delivered')
df_orders.loc[tmp_mask, 'order_delivered_customer_dt'] = df_orders.order_delivered_carrier_dt + order_customer_time
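With the timestamps imputed, the stage intervals mentioned at the start of this section can now be computed without gaps for delivered orders; a small sketch (the interval names are our own, not from the original notebook):
import pandas as pd

# Stage durations the imputation enables (column names are hypothetical).
df_intervals = pd.DataFrame({
    'approval_time': df_orders.order_approved_dt - df_orders.order_purchase_dt,
    'handoff_time': df_orders.order_delivered_carrier_dt - df_orders.order_approved_dt,
    'delivery_time': df_orders.order_delivered_customer_dt - df_orders.order_delivered_carrier_dt,
})
df_intervals.describe()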
Missing values in review_comment_title and review_comment_message simply mean the customer left those fields blank, so no processing is needed.
Table df_products#
Let’s examine missing values in df_products.
tmp_miss = df_products.explore.detect_anomalies(return_mode='by_column')
| | Count | Percent |
| --- | --- | --- |
| product_category_name | 610 | 1.85% |
| product_name_lenght | 610 | 1.85% |
| product_description_lenght | 610 | 1.85% |
| product_photos_qty | 610 | 1.85% |
| product_weight_g | 2 | 0.01% |
| product_length_cm | 2 | 0.01% |
| product_height_cm | 2 | 0.01% |
| product_width_cm | 2 | 0.01% |
In product_category_name, we’ll replace missing values with ‘Missing in Products’. Because the column is categorical, the new label must be registered via add_categories before fillna will accept it.
df_products['product_category_name'] = df_products['product_category_name'].cat.add_categories(['Missing in Products'])
df_products['product_category_name'] = df_products['product_category_name'].fillna('Missing in Products')
Since only 2 products lack physical specifications (weight and dimensions), we’ll replace them with the median values for their product category.
Let’s first check the group sizes to make sure the category-level medians rest on enough data.
category_columns = 'product_category_name'
df_products['product_weight_g'].preproc.check_group_counts(category_columns=category_columns)
============================ Group Analysis Report =============================
Grouping columns: product_category_name
Value column: product_weight_g
Total groups: 74
Groups with missing values: 2.7%
Groups with ALL values missing: 0.0%
Total missing values: 2
Missing in non-empty groups: 2
---------------------------- Group Size Statistics -----------------------------
Mean group size: 445.3
Median group size: 94.5
Minimum group size: 1
Maximum group size: 3,029
Standard deviation: 731.1
-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value: 2.7%
Groups with 2-5 missing values: 0.0%
Groups with 5+ missing values: 0.0%
---------------- Threshold Analysis (only groups with missing values) ----------------
Groups with 5+ elements: 100.0%
Groups with 10+ elements: 100.0%
Groups with 20+ elements: 100.0%
Groups with 30+ elements: 100.0%
Groups with 40+ elements: 100.0%
Groups with 50+ elements: 100.0%
================================================================================
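check_group_counts comes from a custom preprocessing accessor; the key figures of the report above can be approximated in plain pandas roughly like this (the variable names are ours):
# Rough plain-pandas approximation of the report above (our own sketch).
grp = df_products.groupby(category_columns, observed=True)['product_weight_g']
group_sizes = grp.size()
missing_per_group = grp.agg(lambda s: s.isna().sum())
print('Total groups:', len(group_sizes))
print('Groups with missing values:', round((missing_per_group > 0).mean() * 100, 1), '%')
print('Median group size:', group_sizes.median())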
The group sizes are sufficient, so we’ll replace with the group median.
df_products[df_products[['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']].isna().any(axis=1)]
| | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8578 | 09ff539a621711667c43eba6a3bd8466 | bebes | 60.00 | 865.00 | 3.00 | NaN | NaN | NaN | NaN |
| 18851 | 5eb564652db742ff8f28759cd8d2652a | Missing in Products | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df_products['product_weight_g'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_length_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_height_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_width_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
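fill_missing_by_category is also part of the custom accessor; assuming it fills each gap with its category’s median, a plain-pandas equivalent is a groupby-transform:
# Plain-pandas sketch, assuming the accessor fills gaps with the per-category median.
for col in ['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']:
    group_medians = df_products.groupby(category_columns, observed=True)[col].transform('median')
    df_products[col] = df_products[col].fillna(group_medians)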
Missing values in product_name_lenght and product_description_lenght won’t be filled, as we won’t use these columns in the analysis.
Missing values in product_photos_qty will be replaced with 1, which is both the median and the mode of the column.
df_products.loc[df_products.product_photos_qty.isna(), 'product_photos_qty'] = 1
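A quick check (our own addition) of the claim that 1 is both the median and the mode of the photo count:
# Median and most frequent value of product_photos_qty.
print(df_products.product_photos_qty.median(), df_products.product_photos_qty.mode().iloc[0])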
Let’s check how many missing values remain.
df_products.explore.detect_anomalies()
| | Count | Percent |
| --- | --- | --- |
| product_name_lenght | 610 | 1.85% |
| product_description_lenght | 610 | 1.85% |
del tmp_miss