Missing Value Handling#
Table df_orders#
Let’s examine the missing values in df_orders.
tmp_miss = df_orders.explore.detect_anomalies(return_mode='by_column')
| | Count | Percent |
| --- | --- | --- |
| order_approved_dt | 135 | 0.14% |
| order_delivered_carrier_dt | 1716 | 1.73% |
| order_delivered_customer_dt | 2888 | 2.91% |
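Here detect_anomalies is a custom exploration accessor; as a rough plain-pandas sketch of the Count/Percent table it prints (the helper name missing_summary is ours, not part of the library):
import pandas as pd

# Rough stand-in for the summary above (helper name is hypothetical).
def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    counts = df.isna().sum()
    out = pd.DataFrame({'Count': counts,
                        'Percent': (counts / len(df) * 100).round(2).astype(str) + '%'})
    return out[out['Count'] > 0]

missing_summary(df_orders)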
To calculate time intervals between order stages, we should fill a missing timestamp whenever a later stage’s timestamp (or status) shows the stage actually happened.
We’ll impute those missing timestamps below.
order_delivered_carrier_dt
Let’s look at orders that are missing the carrier handoff date but have a customer delivery date or the 'Delivered' status.
tmp_miss['order_delivered_carrier_dt'].order_status.value_counts()
order_status
Unavailable 602
Canceled 510
Processing 299
Invoiced 296
Created 5
Approved 2
Delivered 2
Shipped 0
Name: count, dtype: int64
tmp_mask = lambda x: (x.order_delivered_carrier_dt.isna()) & (x.order_delivered_customer_dt.notna() | (x.order_status == 'Delivered'))
df_orders[tmp_mask].head()
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 73222 | 2aa91108853cecb43c84a5dc5b277475 | afeb16c7f46396c0ed54acb45ccaaa40 | Delivered | 2017-09-29 08:52:58 | 2017-09-29 09:07:16 | NaT | 2017-11-20 19:44:47 | 2017-11-14 |
| 92643 | 2d858f451373b04fb5c984a1cc2defaf | e08caf668d499a6d643dafd7c5cc498a | Delivered | 2017-05-25 23:22:43 | 2017-05-25 23:30:16 | NaT | NaT | 2017-06-23 |
There are only 2 such orders, so we’ll fill the missing carrier handoff date with the approval date plus the median approval-to-handoff interval.
order_carrier_time = (df_orders.order_delivered_carrier_dt - df_orders.order_approved_dt).median()
order_carrier_time
Timedelta('1 days 19:28:14.500000')
df_orders.loc[tmp_mask, 'order_delivered_carrier_dt'] = df_orders.order_approved_dt + order_carrier_time
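As a quick sanity check (our own addition, not in the original notebook), the callable mask re-evaluates against the updated frame, so it should no longer select any rows:
# Empty result confirms every matching order now has a carrier handoff timestamp.
assert df_orders[tmp_mask].empty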
order_approved_dt
Let’s examine orders missing the payment approval date but having a carrier handoff timestamp.
tmp_miss['order_approved_dt'].order_status.value_counts()
order_status
Canceled 116
Delivered 14
Created 5
Approved 0
Invoiced 0
Processing 0
Shipped 0
Unavailable 0
Name: count, dtype: int64
tmp_mask = lambda x: x.order_approved_dt.isna() & x.order_delivered_carrier_dt.notna()
print(f'Rows count: {df_orders[tmp_mask].shape[0]}')
df_orders[tmp_mask].head()
Rows count: 14
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5323 | e04abd8149ef81b95221e88f6ed9ab6a | 2127dc6603ac33544953ef05ec155771 | Delivered | 2017-02-18 14:40:00 | NaT | 2017-02-23 12:04:47 | 2017-03-01 13:25:33 | 2017-03-17 |
| 16567 | 8a9adc69528e1001fc68dd0aaebbb54a | 4c1ccc74e00993733742a3c786dc3c1f | Delivered | 2017-02-18 12:45:31 | NaT | 2017-02-23 09:01:52 | 2017-03-02 10:05:06 | 2017-03-21 |
| 19031 | 7013bcfc1c97fe719a7b5e05e61c12db | 2941af76d38100e0f8740a374f1a5dc3 | Delivered | 2017-02-18 13:29:47 | NaT | 2017-02-22 16:25:25 | 2017-03-01 08:07:38 | 2017-03-17 |
| 22663 | 5cf925b116421afa85ee25e99b4c34fb | 29c35fc91fc13fb5073c8f30505d860d | Delivered | 2017-02-18 16:48:35 | NaT | 2017-02-22 11:23:10 | 2017-03-09 07:28:47 | 2017-03-31 |
| 23156 | 12a95a3c06dbaec84bcfb0e2da5d228a | 1e101e0daffaddce8159d25a8e53f2b2 | Delivered | 2017-02-17 13:05:55 | NaT | 2017-02-22 11:23:11 | 2017-03-02 11:09:19 | 2017-03-20 |
Since there are only 14 such orders, we’ll simply fill the gaps with the median time from order placement to payment approval.
Let’s compute that median.
order_processing_time_hours = (df_orders.order_approved_dt - df_orders.order_purchase_dt).median()
order_processing_time_hours
Timedelta('0 days 00:20:31')
df_orders.loc[tmp_mask, 'order_approved_dt'] = df_orders.order_purchase_dt + order_processing_time_hours
order_delivered_customer_dt
Let’s examine delivered orders missing the customer receipt date.
tmp_miss['order_delivered_customer_dt'].order_status.value_counts()
order_status
Shipped 1097
Unavailable 602
Canceled 579
Processing 299
Invoiced 296
Delivered 8
Created 5
Approved 2
Name: count, dtype: int64
tmp_miss['order_delivered_customer_dt'][lambda x: x.order_status == 'Delivered']
| | order_id | customer_id | order_status | order_purchase_dt | order_approved_dt | order_delivered_carrier_dt | order_delivered_customer_dt | order_estimated_delivery_dt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3002 | 2d1e2d5bf4dc7227b3bfebb81328c15f | ec05a6d8558c6455f0cbbd8a420ad34f | Delivered | 2017-11-28 17:44:07 | 2017-11-28 17:56:40 | 2017-11-30 18:12:23 | NaT | 2017-12-18 |
| 20618 | f5dd62b788049ad9fc0526e3ad11a097 | 5e89028e024b381dc84a13a3570decb4 | Delivered | 2018-06-20 06:58:43 | 2018-06-20 07:19:05 | 2018-06-25 08:05:00 | NaT | 2018-07-16 |
| 43834 | 2ebdfc4f15f23b91474edf87475f108e | 29f0540231702fda0cfdee0a310f11aa | Delivered | 2018-07-01 17:05:11 | 2018-07-01 17:15:12 | 2018-07-03 13:57:00 | NaT | 2018-07-30 |
| 79263 | e69f75a717d64fc5ecdfae42b2e8e086 | cfda40ca8dd0a5d486a9635b611b398a | Delivered | 2018-07-01 22:05:55 | 2018-07-01 22:15:14 | 2018-07-03 13:57:00 | NaT | 2018-07-30 |
| 82868 | 0d3268bad9b086af767785e3f0fc0133 | 4f1d63d35fb7c8999853b2699f5c7649 | Delivered | 2018-07-01 21:14:02 | 2018-07-01 21:29:54 | 2018-07-03 09:28:00 | NaT | 2018-07-24 |
| 92643 | 2d858f451373b04fb5c984a1cc2defaf | e08caf668d499a6d643dafd7c5cc498a | Delivered | 2017-05-25 23:22:43 | 2017-05-25 23:30:16 | NaT | NaT | 2017-06-23 |
| 97647 | ab7c89dc1bf4a1ead9d6ec1ec8968a84 | dd1b84a7286eb4524d52af4256c0ba24 | Delivered | 2018-06-08 12:09:39 | 2018-06-08 12:36:39 | 2018-06-12 14:10:00 | NaT | 2018-06-26 |
| 98038 | 20edc82cf5400ce95e1afacc25798b31 | 28c37425f1127d887d7337f284080a0f | Delivered | 2018-06-27 16:09:12 | 2018-06-27 16:29:30 | 2018-07-03 19:26:00 | NaT | 2018-07-19 |
There are only 8 such orders, so we’ll fill them the same way: carrier handoff date plus the median handoff-to-delivery interval.
order_customer_time = (df_orders.order_delivered_customer_dt - df_orders.order_delivered_carrier_dt).median()
order_customer_time
Timedelta('7 days 02:29:46')
tmp_mask = df_orders.order_delivered_customer_dt.isna() & (df_orders.order_status=='Delivered')
df_orders.loc[tmp_mask, 'order_delivered_customer_dt'] = df_orders.order_delivered_carrier_dt + order_customer_time
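With the timestamps imputed, the stage intervals mentioned at the start of this section can now be computed without gaps for delivered orders; a small sketch (the interval names are our own, not from the original notebook):
import pandas as pd

# Stage durations the imputation enables (column names are hypothetical).
df_intervals = pd.DataFrame({
    'approval_time': df_orders.order_approved_dt - df_orders.order_purchase_dt,
    'handoff_time': df_orders.order_delivered_carrier_dt - df_orders.order_approved_dt,
    'delivery_time': df_orders.order_delivered_customer_dt - df_orders.order_delivered_carrier_dt,
})
df_intervals.describe()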
Missing values in review_comment_title and review_comment_message simply mean the customer left those fields blank, so no processing is needed.
Table df_products#
Let’s examine missing values in df_products.
tmp_miss = df_products.explore.detect_anomalies(return_mode='by_column')
| | Count | Percent |
| --- | --- | --- |
| product_category_name | 610 | 1.85% |
| product_name_lenght | 610 | 1.85% |
| product_description_lenght | 610 | 1.85% |
| product_photos_qty | 610 | 1.85% |
| product_weight_g | 2 | 0.01% |
| product_length_cm | 2 | 0.01% |
| product_height_cm | 2 | 0.01% |
| product_width_cm | 2 | 0.01% |
In product_category_name, we’ll replace missing values with ‘Missing in Products’. Because the column is categorical, the new label must be registered via add_categories before fillna will accept it.
df_products['product_category_name'] = df_products['product_category_name'].cat.add_categories(['Missing in Products'])
df_products['product_category_name'] = df_products['product_category_name'].fillna('Missing in Products')
Since only 2 products lack physical specifications (weight and dimensions), we’ll replace them with the median values for their product category.
Let’s first check the group sizes to make sure the category-level medians rest on enough data.
category_columns = 'product_category_name'
df_products['product_weight_g'].preproc.check_group_counts(category_columns=category_columns)
============================ Group Analysis Report =============================
Grouping columns: product_category_name
Value column: product_weight_g
Total groups: 74
Groups with missing values: 2.7%
Groups with ALL values missing: 0.0%
Total missing values: 2
Missing in non-empty groups: 2
---------------------------- Group Size Statistics -----------------------------
Mean group size: 445.3
Median group size: 94.5
Minimum group size: 1
Maximum group size: 3,029
Standard deviation: 731.1
-------------------------- Missing Value Distribution --------------------------
Groups with 1 missing value: 2.7%
Groups with 2-5 missing values: 0.0%
Groups with 5+ missing values: 0.0%
---------------- Threshold Analysis (only groups with missing values) ----------------
Groups with 5+ elements: 100.0%
Groups with 10+ elements: 100.0%
Groups with 20+ elements: 100.0%
Groups with 30+ elements: 100.0%
Groups with 40+ elements: 100.0%
Groups with 50+ elements: 100.0%
================================================================================
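check_group_counts comes from a custom preprocessing accessor; the key figures of the report above can be approximated in plain pandas roughly like this (the variable names are ours):
# Rough plain-pandas approximation of the report above (our own sketch).
grp = df_products.groupby(category_columns, observed=True)['product_weight_g']
group_sizes = grp.size()
missing_per_group = grp.agg(lambda s: s.isna().sum())
print('Total groups:', len(group_sizes))
print('Groups with missing values:', round((missing_per_group > 0).mean() * 100, 1), '%')
print('Median group size:', group_sizes.median())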
The group sizes are sufficient, so we’ll replace with the group median.
df_products[df_products[['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']].isna().any(axis=1)]
| | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8578 | 09ff539a621711667c43eba6a3bd8466 | bebes | 60.00 | 865.00 | 3.00 | NaN | NaN | NaN | NaN |
| 18851 | 5eb564652db742ff8f28759cd8d2652a | Missing in Products | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df_products['product_weight_g'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_length_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_height_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
df_products['product_width_cm'].preproc.fill_missing_by_category(category_columns=category_columns, inplace=True)
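fill_missing_by_category is also part of the custom accessor; assuming it fills each gap with its category’s median, a plain-pandas equivalent is a groupby-transform:
# Plain-pandas sketch, assuming the accessor fills gaps with the per-category median.
for col in ['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']:
    group_medians = df_products.groupby(category_columns, observed=True)[col].transform('median')
    df_products[col] = df_products[col].fillna(group_medians)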
Missing values in product_name_lenght and product_description_lenght won’t be filled, as we won’t use these columns in the analysis.
Missing values in product_photos_qty will be replaced with 1, which is both the median and the mode of the column.
df_products.loc[df_products.product_photos_qty.isna(), 'product_photos_qty'] = 1
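A quick check (our own addition) of the claim that 1 is both the median and the mode of the photo count:
# Median and most frequent value of product_photos_qty.
print(df_products.product_photos_qty.median(), df_products.product_photos_qty.mode().iloc[0])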
Let’s check how many missing values remain.
df_products.explore.detect_anomalies()
| | Count | Percent |
| --- | --- | --- |
| product_name_lenght | 610 | 1.85% |
| product_description_lenght | 610 | 1.85% |
del tmp_miss