Outlier Handling
Let’s examine where we have zero values.
for key, df in dfs:
    print(f'DataFrame {key}')
    df.explore.detect_anomalies(anomaly_type='zero')
DataFrame orders
No numeric columns to analyze zero
DataFrame items
| | Count | Percent |
|---|---|---|
| freight_value | 383 | 0.34% |
DataFrame reviews
No zero found in specified columns
DataFrame products
| | Count | Percent |
|---|---|---|
| product_weight_g | 4 | 0.01% |
DataFrame geolocations
No zero found in specified columns
DataFrame sellers
No zero found in specified columns
DataFrame payments
| | Count | Percent |
|---|---|---|
| payment_installments | 2 | 0.00% |
| payment_value | 8 | 0.01% |
DataFrame customers
No zero found in specified columns
DataFrame categories
No numeric columns to analyze zero
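The `explore.detect_anomalies` accessor used above is a project-specific helper, so its internals are not shown here. As a rough plain-pandas equivalent, a zero summary could be sketched as follows. The function name `zero_summary` and the toy data are hypothetical, and this is not the accessor's actual implementation.

```python
import pandas as pd


def zero_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count zeros per numeric column (hypothetical helper, not the accessor's code)."""
    numeric = df.select_dtypes(include="number")
    counts = (numeric == 0).sum()
    counts = counts[counts > 0]  # keep only columns that contain zeros
    return pd.DataFrame({
        "Count": counts,
        "Percent": (counts / len(df) * 100).round(2),
    })


# Toy data standing in for the items table (values are made up)
toy_items = pd.DataFrame({
    "price": [10.0, 25.5, 7.9, 12.0],
    "freight_value": [0.0, 3.2, 0.0, 5.1],
})
summary = zero_summary(toy_items)
print(summary)
```

Columns without zeros are dropped from the summary, which mirrors the shape of the tables above.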
We leave zeros in delivery cost (freight_value) unchanged, as they may indicate free shipping.
Zeros in product weight likely mean the value was not specified for the product.
Since there are few of them and they all belong to the same category (cama_mesa_banho), we’ll replace them with the median value for that category.
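# Fill zero weights with the median weight of the cama_mesa_banho category
in_category = df_products['product_category_name'] == 'cama_mesa_banho'
median_for_fill = df_products.loc[in_category, 'product_weight_g'].median()
df_products.loc[df_products['product_weight_g'] == 0, 'product_weight_g'] = median_for_fill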
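Before filling, it can be worth confirming that the zero weights really sit in a single category. The sketch below does that check on toy data and then shows a variation of the fill, not the approach used above: it computes the median per category and excludes the zeros themselves from the median. The frame `toy_products` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the products table (values are made up)
toy_products = pd.DataFrame({
    "product_category_name": ["cama_mesa_banho"] * 4 + ["perfumaria"] * 2,
    "product_weight_g": [0.0, 500.0, 700.0, 900.0, 200.0, 300.0],
})

zero_mask = toy_products["product_weight_g"] == 0

# Which categories do the zero-weight products belong to?
zero_categories = toy_products.loc[zero_mask, "product_category_name"].value_counts()
print(zero_categories)

# Per-category median over non-zero weights only (zeros are masked to NaN first)
category_median = (
    toy_products["product_weight_g"]
    .where(~zero_mask)
    .groupby(toy_products["product_category_name"])
    .transform("median")
)

# Replace each zero with the median of its own category
toy_products.loc[zero_mask, "product_weight_g"] = category_median[zero_mask]
```

With only four zeros in a large category, excluding them from the median barely changes the result, so this mainly matters when zeros are more common.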
Zeros in payment_installments will be replaced with 1, as it is both the mode and median.
df_payments.loc[df_payments['payment_installments'] == 0, 'payment_installments'] = 1
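The claim that 1 is both the mode and the median is easy to verify before replacing anything. A minimal sketch on toy data (the frame `toy_payments` and its values are made up):

```python
import pandas as pd

# Toy stand-in for the payments table (values are made up)
toy_payments = pd.DataFrame({"payment_installments": [1, 1, 1, 0, 2, 1, 3, 10]})
installments = toy_payments["payment_installments"]

mode_value = installments.mode().iloc[0]  # most frequent value
median_value = installments.median()
print(mode_value, median_value)

# Only replace zeros when both statistics agree on the fill value
if mode_value == median_value:
    toy_payments.loc[installments == 0, "payment_installments"] = int(mode_value)
```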
We leave zeros in payment_value unchanged, as there may be a payment-specific reason for them.
We won’t modify outliers in order value because they carry business significance: these are large purchases.
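Large orders can still be flagged for later reference without changing their values. The sketch below uses the common 1.5 × IQR rule, which is an assumption on our part, since the section does not state how these outliers were identified. The column name `order_value` and the numbers are hypothetical.

```python
import pandas as pd

# Toy order values (made up); the last one is a large purchase
toy_orders = pd.DataFrame({"order_value": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

q1 = toy_orders["order_value"].quantile(0.25)
q3 = toy_orders["order_value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Add a flag column; the original values are left untouched
toy_orders["is_value_outlier"] = ~toy_orders["order_value"].between(lower, upper)
flagged_values = toy_orders.loc[toy_orders["is_value_outlier"], "order_value"].tolist()
print(flagged_values)
```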
We have orders where reviews were created before the orders themselves.
We’ll consider these outliers, as they violate temporal logic.
But we don’t know whether the issue lies in the order creation date or the review creation date.
Modifying them could introduce bias, and since most of these orders were canceled and there are very few, we’ll leave them as is.
We’ve recorded them as anomalies and will note them in the conclusions.
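Recording these anomalies without modifying them could look like the sketch below. It assumes the orders table has `order_purchase_timestamp` and `order_status` columns and the reviews table has `review_creation_date`, joined on `order_id`. These column names and the toy rows are assumptions, not taken from the text above.

```python
import pandas as pd

# Toy stand-ins for the orders and reviews tables (values are made up)
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-10 09:00", "2018-02-05 14:30", "2018-03-01 08:15"]
    ),
})
toy_reviews = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "review_creation_date": pd.to_datetime(["2018-01-20", "2018-02-01", "2018-03-09"]),
})

# Attach order information to each review; each review maps to one order
merged = toy_reviews.merge(toy_orders, on="order_id", how="inner", validate="many_to_one")

# Flag reviews dated before the order itself; no dates are changed
merged["review_before_order"] = (
    merged["review_creation_date"] < merged["order_purchase_timestamp"]
)
flagged = merged.loc[merged["review_before_order"], ["order_id", "order_status"]]

# Share of each order status among the flagged rows
status_share = flagged["order_status"].value_counts(normalize=True)
print(flagged)
print(status_share)
```

Keeping the flag as a separate column leaves both timestamps intact, so the anomalies can be excluded or reported later without committing to which date is wrong.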