Intermediate Conclusion
Key preprocessing steps completed:
Replaced zero values in payment_installments with 1 (both the mode and the median of the column)
Removed complete duplicate rows from df_geolocations
Averaged coordinates by geolocation_zip_code_prefix
Replaced missing product dimensions and weights with the median for each product category
Replaced missing photo counts with 1 (both the mode and the median)
Replaced missing product category names with 'unknown'
Converted product dimensions/weights to integer type
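The preprocessing steps above can be sketched in pandas roughly as follows. This is a minimal illustration on made-up toy data, not the notebook's actual code. Only `payment_installments`, `df_geolocations`, and `geolocation_zip_code_prefix` are named in the summary; `df_payments`, `df_products`, and the remaining column names are assumptions. The sketch also fills the category name before imputing medians so that uncategorized rows form their own group, which may differ from the order used in the original analysis.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
df_payments = pd.DataFrame({"payment_installments": [0, 1, 3, 0]})
df_geolocations = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 1001, 2002],
    "geolocation_lat": [-23.5, -23.5, -23.7, -22.9],   # assumed column name
    "geolocation_lng": [-46.6, -46.6, -46.8, -43.2],   # assumed column name
})
df_products = pd.DataFrame({                            # assumed table and columns
    "product_category_name": ["toys", "toys", "toys", None],
    "product_weight_g": [100.0, 300.0, None, 500.0],
    "product_photos_qty": [2.0, None, 1.0, 1.0],
})

# Zero installments are treated as a single payment.
df_payments["payment_installments"] = df_payments["payment_installments"].replace(0, 1)

# Drop fully duplicated rows, then average the coordinates per zip code prefix.
df_geolocations = df_geolocations.drop_duplicates()
df_geo_mean = (
    df_geolocations
    .groupby("geolocation_zip_code_prefix", as_index=False)[["geolocation_lat", "geolocation_lng"]]
    .mean()
)

# Fill the category first so rows without a category get their own group
# (groupby would otherwise skip rows whose key is missing).
df_products["product_category_name"] = df_products["product_category_name"].fillna("unknown")

# Impute missing weights with the median of the same category, then cast to int.
# The cast would fail if a whole category had no weights at all.
category_median = (
    df_products.groupby("product_category_name")["product_weight_g"].transform("median")
)
df_products["product_weight_g"] = (
    df_products["product_weight_g"].fillna(category_median).astype(int)
)

# Missing photo counts are set to 1.
df_products["product_photos_qty"] = df_products["product_photos_qty"].fillna(1).astype(int)
```

The same `groupby(...).transform("median")` pattern would apply to the length, height, and width columns.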
New variables created:
Time-related metrics:
  Order processing time
  Total delivery time
  Carrier delivery time
  Difference between actual and estimated delivery times
  Review response time
Total product cost (including shipping)
Product volume
Weight-to-volume ratio
Delivery time categories (Fast/Medium/Slow)
Review character length
Average review score per order
Data filtering:
Restricted order data to the period from January 2017 through August 2018
Created order cancellation flag
Orders present in the orders table but missing from the items table were either canceled or unavailable
Data integration:
Merged datasets and created analysis-ready dataframes
Calculated buyer-seller distances
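The summary does not say how the buyer-seller distances were computed. One common choice for latitude and longitude pairs is the haversine great-circle distance, sketched below as an assumption rather than as the method actually used. The function name, the column names, and the coordinates are all hypothetical; in practice the coordinates would come from merging the averaged zip code prefix locations onto buyers and sellers.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Toy buyer and seller coordinates (hypothetical column names and values).
df_pairs = pd.DataFrame({
    "buyer_lat": [-23.55, -23.55],
    "buyer_lng": [-46.63, -46.63],
    "seller_lat": [-22.91, -23.55],
    "seller_lng": [-43.17, -46.63],
})

# The function is vectorized, so it works directly on whole columns.
df_pairs["distance_km"] = haversine_km(
    df_pairs["buyer_lat"], df_pairs["buyer_lng"],
    df_pairs["seller_lat"], df_pairs["seller_lng"],
)
```

The haversine formula treats the Earth as a sphere, which is usually accurate enough for comparing delivery distances at this scale.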
All missing values were expected and properly handled according to their business context.