Intermediate Conclusion

Key preprocessing steps completed:

  • Replaced zero values in payment_installments with 1 (both the mode and the median)

  • Removed complete duplicate rows from df_geolocations

  • Averaged coordinates by geolocation_zip_code_prefix

  • Replaced missing product dimensions/weights with median values by product category

  • Replaced missing photo counts with 1 (both the mode and the median)

  • Replaced missing product category names with ‘unknown’

  • Converted product dimensions/weights to integer type
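The imputation steps above can be sketched in pandas roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the frame names df_products and df_payments and the product column names are assumptions (only payment_installments appears in the summary above), and a single weight column stands in for all dimensions and weights.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the products table; column names are assumed.
df_products = pd.DataFrame({
    "product_category_name": ["toys", "toys", "toys", None, "books"],
    "product_weight_g": [100.0, np.nan, 300.0, 50.0, 200.0],
    "product_photos_qty": [2.0, np.nan, 1.0, 1.0, 3.0],
})

# Label missing categories first so those rows form their own group.
df_products["product_category_name"] = (
    df_products["product_category_name"].fillna("unknown")
)

# Fill missing weights with the median weight of the same category.
category_median = (
    df_products.groupby("product_category_name")["product_weight_g"]
    .transform("median")
)
df_products["product_weight_g"] = (
    df_products["product_weight_g"].fillna(category_median)
)

# Fill missing photo counts with 1.
df_products["product_photos_qty"] = df_products["product_photos_qty"].fillna(1)

# With no gaps left, the numeric columns can be cast to integers.
df_products = df_products.astype(
    {"product_weight_g": "int64", "product_photos_qty": "int64"}
)

# Toy stand-in for the payments table: zero installments become 1.
df_payments = pd.DataFrame({"payment_installments": [0, 1, 3, 0]})
df_payments["payment_installments"] = (
    df_payments["payment_installments"].replace(0, 1)
)
```

Filling the category name before the group-wise median keeps rows without a category from being dropped by groupby, which skips missing keys by default.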

New variables created:

  • Time-related metrics:

    • Order processing time

    • Total delivery time

    • Carrier delivery time

    • Difference between actual and estimated delivery times

    • Review response time

  • Total product cost (including shipping)

  • Product volume

  • Weight-to-volume ratio

  • Delivery time categories (Fast/Medium/Slow)

  • Review character length

  • Average review score per order
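A few of the derived variables listed above might be computed along these lines. This is a hedged sketch on toy data: the column names, the exact definition of each interval (for example, processing time taken as purchase to approval), and the Fast/Medium/Slow thresholds of 7 and 14 days are all assumptions made for illustration, not values taken from the analysis.

```python
import pandas as pd

# Toy stand-in for the orders table; column names are assumed.
df_orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01 10:00", "2018-01-05 09:00"]),
    "order_approved_at": pd.to_datetime(["2018-01-01 12:00", "2018-01-05 09:30"]),
    "order_delivered_carrier_date": pd.to_datetime(["2018-01-02 12:00", "2018-01-07 09:30"]),
    "order_delivered_customer_date": pd.to_datetime(["2018-01-04 10:00", "2018-01-25 09:00"]),
    "order_estimated_delivery_date": pd.to_datetime(["2018-01-10", "2018-01-20"]),
})


def days_between(later, earlier):
    """Return the interval between two datetime columns in fractional days."""
    return (later - earlier).dt.total_seconds() / 86400


# Assumed definitions of the time-related metrics.
df_orders["processing_days"] = days_between(
    df_orders["order_approved_at"], df_orders["order_purchase_timestamp"]
)
df_orders["total_delivery_days"] = days_between(
    df_orders["order_delivered_customer_date"], df_orders["order_purchase_timestamp"]
)
df_orders["carrier_delivery_days"] = days_between(
    df_orders["order_delivered_customer_date"], df_orders["order_delivered_carrier_date"]
)
# Positive values mean the order arrived later than estimated.
df_orders["delay_vs_estimate_days"] = days_between(
    df_orders["order_delivered_customer_date"], df_orders["order_estimated_delivery_date"]
)

# Hypothetical thresholds: up to 7 days Fast, up to 14 Medium, otherwise Slow.
df_orders["delivery_category"] = pd.cut(
    df_orders["total_delivery_days"],
    bins=[0, 7, 14, float("inf")],
    labels=["Fast", "Medium", "Slow"],
)

# Product volume and weight-to-volume ratio on an assumed products table.
df_products = pd.DataFrame({
    "product_length_cm": [10], "product_height_cm": [20],
    "product_width_cm": [5], "product_weight_g": [500],
})
df_products["product_volume_cm3"] = (
    df_products["product_length_cm"]
    * df_products["product_height_cm"]
    * df_products["product_width_cm"]
)
df_products["weight_to_volume"] = (
    df_products["product_weight_g"] / df_products["product_volume_cm3"]
)
```

Expressing every interval in fractional days keeps the metrics directly comparable and makes the binning step a single pd.cut call.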

Data filtering:

  • Trimmed order data to January 2017 - August 2018

  • Created order cancellation flag

  • Orders present in the orders table but missing from the items table were all either canceled or unavailable
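The filtering steps above could look roughly like this. The sketch uses toy data, and the names df_orders, df_items, order_status, and is_canceled are assumptions for illustration rather than the notebook's own identifiers.

```python
import pandas as pd

# Toy stand-ins for the orders and items tables; names are assumed.
df_orders = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_status": ["delivered", "canceled", "unavailable", "delivered"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2016-10-03", "2017-03-01", "2018-02-10", "2018-08-20"]
    ),
})
df_items = pd.DataFrame({"order_id": ["d", "d"]})

# Keep January 2017 through August 2018 inclusive (half-open interval).
start, end = pd.Timestamp("2017-01-01"), pd.Timestamp("2018-09-01")
purchased = df_orders["order_purchase_timestamp"]
df_orders = df_orders.loc[(purchased >= start) & (purchased < end)].copy()

# Boolean cancellation flag.
df_orders["is_canceled"] = df_orders["order_status"].eq("canceled")

# Orders with no matching rows in the items table.
orders_without_items = df_orders.loc[
    ~df_orders["order_id"].isin(df_items["order_id"])
]
status_counts = orders_without_items["order_status"].value_counts()
```

Inspecting status_counts is one way to confirm that the orders without items are limited to the canceled and unavailable statuses.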

Data integration:

  • Merged datasets and created analysis-ready dataframes

  • Calculated buyer-seller distances
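One possible way to go from deduplicated coordinates to buyer-seller distances is sketched below. The summary does not say which distance formula was used, so the haversine (great-circle) formula here is an assumption, as are the column names other than geolocation_zip_code_prefix, the pairs frame, and the helper haversine_km.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_geolocations; coordinate column names are assumed.
df_geolocations = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 1001, 2002],
    "geolocation_lat": [-23.0, -23.0, -24.0, -22.0],
    "geolocation_lng": [-46.0, -46.0, -47.0, -43.0],
})

# Drop fully duplicated rows, then average coordinates per zip code prefix.
df_geolocations = df_geolocations.drop_duplicates()
centroids = df_geolocations.groupby("geolocation_zip_code_prefix")[
    ["geolocation_lat", "geolocation_lng"]
].mean()


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km (assumed method, mean Earth radius 6371 km)."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# Hypothetical buyer-seller pairs keyed by zip code prefix.
pairs = pd.DataFrame({
    "customer_zip_code_prefix": [1001],
    "seller_zip_code_prefix": [2002],
})

# Look up the averaged coordinates for each side of the pair.
buyer = centroids.loc[pairs["customer_zip_code_prefix"]].reset_index(drop=True)
seller = centroids.loc[pairs["seller_zip_code_prefix"]].reset_index(drop=True)

pairs["distance_km"] = haversine_km(
    buyer["geolocation_lat"], buyer["geolocation_lng"],
    seller["geolocation_lat"], seller["geolocation_lng"],
)
```

The lookup with .loc raises a KeyError for prefixes absent from the geolocation table, so real data would likely need a merge with how="left" instead.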

All missing values were expected and properly handled according to their business context.