Intermediate Conclusion

Intermediate Conclusion#

Table Relationships

  • Orders table contains 1 order missing from payments table

  • Orders table contains 775 orders missing from order items table (mostly canceled/unavailable status)

  • Customers table has 157 zip_code_prefixes missing from geolocations

  • Geolocations table has 4,178 zip_code_prefixes missing from customers

  • Sellers table has 7 zip_code_prefixes missing from geolocations

  • Geolocations table has 16,776 zip_code_prefixes missing from sellers

Duplicates

  • df_geolocations: 26% full row duplicates

  • df_reviews: 827 duplicate review_ids

  • df_reviews: 559 duplicate order_ids

  • df_geolocations: 97% duplicates in geolocation_zip_code_prefix

  • Some zip prefixes map to:

    • 2 unique states

    • 5 unique cities

Missing Values

  • There are 160 missing values in order_approved_dt (<1% of the total number of rows).

  • There are 14 missing values in order_approved_dt for orders with the delivered status.

  • All delivered orders with missing values in order_approved_dt have the payment type boleto.

  • There are 1.78k missing values in order_delivered_carrier_dt (2% of the total number of rows).

  • There are 2 delivered orders with missing values in order_delivered_carrier_dt.

  • All orders with the status unavailable have missing values in order_delivered_carrier_dt.

  • Both orders with missing values in order_delivered_carrier_dt were paid with a credit card.

  • There are 2.96k missing values in order_delivered_customer_dt (3% of the total number of rows).

  • There are 2% missing values in product_description_lenght.

  • There are 2% missing values in product_photos_qty.

  • There are 2 missing values in product_weight_g.

  • There are 2 missing values in product_length_cm.

  • There are 2% missing values in product_category_name.

  • There are 2 missing values in product_height_cm.

  • There are 2 missing values in product_width_cm.

  • Missing values in product category name, product description length, product description length, and product photos quantity are located in the same rows.

  • Missing values in product length, width, height, and weight are located in the same rows.

Zero Values

  • There are 383 zero values in freight_value.

  • There are 2 orders that have a value of 0 in payment_installments.

  • There are 9 zero payments in payment_value.

  • There are 4 zero values in product_weight_g.

  • There are 797 (95%) zero values in declared_monthly_revenue.

  • The main part of zero values in freight_value occurred from April 22 to June 14, 2018. There might have been a free shipping promotion.

  • All 4 products with zero weight have the category cama_mesa_banho (textile and home furnishings).

  • Orders with zero payment amount have a payment type of either voucher or not_defined.

Other Anomalies

  • There are 65 orders for which reviews were created before the orders were created. This is strange. Out of these, 58 orders were canceled, 6 were delivered, and 1 was in the delivery process.

  • 6 orders were made outside South America, although the customer’s state is in South America.

  • The table product_category_name has 71 unique product categories, whereas the table products has 73 categories.

  • One user made orders totaling 13,664. This value clearly stands out from the general distribution.

  • There are also several users who made purchases totaling 6,000 or more.

  • In the column with the total order amount per user, around 10% are outliers. This exceeds the usual norm (usually 5%), but is not a critical value.

  • In the product price and freight value, there are slightly less than 10% outliers. This exceeds the usual norm (usually 5%), but is not a critical value.

  • In the payment value, there are 10% outliers. This exceeds the usual norm (usually 5%), but is not a critical value.

  • In the number of installments, there are less than 1% outliers. This is a normal value.

  • On November 24, 2017, there were many outliers in the payment value.

  • The proportion of outliers in the number of product photos, length, and width is within the norm.

  • The proportion of outliers in weight (9.95%) and height (8.97%) of the product exceeds the norm (usually 5%), but is not critical.

  • There is a strange maximum date in shipping_limit_dt. The value is significantly large compared to other time variables.

  • Four orders have an abnormally large shipping_limit_dt. However, the estimated delivery time for these orders is normal.

  • In the table with orders, there are 1,359 orders where order_delivered_carrier_dt is earlier than order_approved_dt.

  • In the table with orders, there are 23 orders where order_delivered_customer_dt is earlier than order_delivered_carrier_dt.

  • The dataset contains 8,707 orders where the payment amount does not match the order amount.

  • In the table with orders, there are orders with the status canceled, but they also have a delivery time to the customer.

  • In the table with order items, there are product IDs that were sold by different sellers. There are more than 1,000 such products. Additionally, there are products sold by sellers located in different states.

Key Observations

  • All orders created after August 2018 were canceled, except for one order.

  • There were few orders before 2017. Either the data is incomplete or there were few orders.

  • 97% of orders have the status ā€œdeliveredā€.

  • Most customers are from the city of sao paulo (16%).

  • Most customers are from the state SP (42%).

  • geolocation_city has the most records for the city of sao paulo (14%).

  • geolocation_state has the most records for the state SP (40%).

  • The maximum number of product units in an order is 21.

  • Product prices range from 0.85 to 6.74k.

  • Most products are sold for prices ranging from 39.9 to 134.9.

  • The median product price is 74.99.

  • The maximum number of payment methods for one order is 29.

  • The maximum number of installments for one product is 24.

  • The median number of installments is 1.

  • 75% of orders have installment plans of 4 or fewer parts.

  • The maximum payment is 13.66k. The median payment is 100.

  • 74% of payments were made using credit cards.

  • The payment_type column has undefined payment types (<1%).

  • The review_creation_dt column has 9% missing days.

  • The review_answer_dt column has 5% missing days.

  • More than half of the reviews (57%) have maximum ratings of 5.

  • Most reviews (8%) have the title ā€˜recomendo’.

  • 58% of orders do not have reviews.

  • Only 36% of unique comments in reviews.

  • Most comments in reviews contain the word ā€œmuito bomā€ (1%).

  • The maximum length of a product name is 76 characters.

  • The maximum length of a product description is 3.99k characters.

  • The minimum length of a product description is 4 characters.

  • The maximum number of photos for one product is 20.

  • 51% of products have 1 photo.

  • The maximum product weight is 40.42k grams.

  • The maximum product length is 105 cm. The minimum length is 7 cm. The median length is 25 cm.

  • The maximum product height is 105 cm. The minimum height is 2 cm. The median height is 13 cm.

  • The maximum product width is 118 cm. The minimum width is 6 cm. The median width is 20 cm.

  • There are a total of 73 unique product categories.

  • Most sellers are from the city of sao paulo (22%).

  • Most sellers are from the state SP (60%).