In [None]:
%run ../../_pre_run.ipynb

# Hypothesis Testing

As we previously determined, the following metrics have a right-skewed distribution:

- from_purchase_to_approved_hours
- total_payment
- delivery_time_days
- total_weight_kg
- avg_products_price
- total_freight_value

Given the skewness of these metrics and the fact that most of our hypotheses will involve more than two categories, we will use regression analysis to test our hypotheses. 

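Fitting a single regression model to all observations means that each comparison with the reference category relies on a dispersion estimate pooled across the whole sample, not only on the two groups being compared.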

To account for the right-skewness, we will employ a Generalized Linear Model (GLM) regression with a Gamma distribution and a log-link function.
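To make the model choice concrete, here is a minimal sketch in plain Python of how a Gamma GLM with a log link is fitted by iteratively reweighted least squares (IRLS) for the simplest case used below: an intercept plus one dummy variable. It is not the statsmodels routine or the notebook's `stats.glm` helper. The function name and the toy values are hypothetical and chosen only for illustration.

```python
import math


def fit_gamma_log_two_groups(y_ref, y_other, n_iter=25):
    """IRLS for a Gamma GLM with log link and design: intercept + one dummy.

    With a log link and Gamma variance V(mu) = mu**2 the IRLS weights are all 1,
    so each step is an ordinary least-squares fit to the working response
    z = eta + (y - mu) / mu. For an intercept-plus-dummy design that fit is
    simply the mean of z within each group.
    """
    all_y = y_ref + y_other
    # Start both groups at the log of the overall mean.
    eta_ref = eta_other = math.log(sum(all_y) / len(all_y))
    for _ in range(n_iter):
        mu_ref, mu_other = math.exp(eta_ref), math.exp(eta_other)
        eta_ref = sum(eta_ref + (y - mu_ref) / mu_ref for y in y_ref) / len(y_ref)
        eta_other = sum(eta_other + (y - mu_other) / mu_other for y in y_other) / len(y_other)
    # Treatment coding: intercept = reference group, slope = difference on the log scale.
    return eta_ref, eta_other - eta_ref


# Hypothetical processing times in hours (not taken from df_sales).
night_hours = [5.0, 7.0, 9.0, 3.0]
morning_hours = [2.0, 4.0, 3.0, 3.0]

intercept, slope = fit_gamma_log_two_groups(night_hours, morning_hours)
print("reference mean:", math.exp(intercept))
print("ratio of means (other / reference):", math.exp(slope))
```

Because the link is logarithmic, `exp(slope)` is the ratio of a category's mean to the reference category's mean. A negative coefficient therefore indicates a mean below the reference, and a positive one a mean above it. This is how the direction of each effect is read in the results below.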

For multiple comparisons, we will apply the Holm correction.
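For reference, the Holm step-down procedure can be sketched in a few lines of plain Python. This only illustrates the standard procedure. Whether `p_adjust='holm'` in the notebook's `stats.glm` helper is implemented this way is an assumption, and the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and 1.0 is the cap.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons against a reference category.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

A comparison is declared significant at the 0.05 level when its adjusted p-value is below 0.05.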

## Time Patterns

**Do orders take longer to process at night?**

For each time of day N (except Night):

- $H_{0}^{N}:$ The mean order processing time at night equals the mean order processing time during time period N.
- $H_{1}^{N}:$ The mean order processing time at night does not equal the mean order processing time during time period N.

In [None]:
df_sales.assign(from_purchase_to_approved_hours = lambda x: x.from_purchase_to_approved_hours + 1e-6).stats.glm(
    formula='from_purchase_to_approved_hours ~ C(purchase_time_of_day, Treatment(reference="Night"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
    , p_adjust='holm'
)

**Result:**

- At the 0.05 significance level, the mean order processing time at night is statistically significantly different from the mean order processing time at any other time of day.
- Moreover, the processing time at night is statistically significantly longer than during other periods.
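The `+ 1e-6` shift applied to `from_purchase_to_approved_hours` before fitting is presumably there because the Gamma family is defined only for strictly positive responses, while some durations may be exactly zero. The sketch below uses the standard Gamma unit deviance with hypothetical values to show why a zero response cannot be evaluated and a slightly shifted one can.

```python
import math


def gamma_unit_deviance(y, mu):
    """Unit deviance of the Gamma family: 2 * ((y - mu) / mu - log(y / mu))."""
    return 2.0 * ((y - mu) / mu - math.log(y / mu))


mu = 2.0  # hypothetical fitted mean in hours

# A response of exactly zero breaks the log term.
try:
    gamma_unit_deviance(0.0, mu)
    zero_is_valid = True
except ValueError:
    zero_is_valid = False

# The same observation shifted by a tiny constant is finite.
shifted = gamma_unit_deviance(0.0 + 1e-6, mu)
print(zero_is_valid, shifted)
```

The shifted observation yields a finite but large deviance contribution, so the size of the constant has some influence on the fit.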

---

**Do orders take longer to process on weekdays?**

- $H_{0}:$ The mean order processing time on weekdays equals the mean order processing time on weekends.
- $H_{1}:$ The mean order processing time on weekdays does not equal the mean order processing time on weekends.

In [None]:
df_sales.assign(from_purchase_to_approved_hours = lambda x: x.from_purchase_to_approved_hours + 1e-6).stats.glm(
    formula='from_purchase_to_approved_hours ~ C(purchase_day_type, Treatment(reference="Weekday"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean order processing time on weekdays is statistically significantly different from the mean order processing time on weekends.
- Moreover, the processing time on weekdays is statistically significantly shorter than on weekends.

## Customer Review Scores

**Do orders with a review score of 1 have a higher average order value?**

For each rating N (except 1):

- $H_{0}^{N}:$ The mean order value for rating 1 equals the mean order value for rating N.
- $H_{1}^{N}:$ The mean order value for rating 1 does not equal the mean order value for rating N.

In [None]:
df_sales.stats.glm(
    formula='total_payment ~ C(order_avg_reviews_score, Treatment(reference=1))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
    , p_adjust='holm'
)

**Result:**

- At the 0.05 significance level, the mean order value for rating 1 is statistically significantly different from the mean order value for any other rating.
- Moreover, the mean order value for rating 1 is statistically significantly higher than for other ratings.

---

**Do orders with a review score of 1 take longer to deliver?**

For each rating N (except 1):

- $H_{0}^{N}:$ The mean delivery time for rating 1 equals the mean delivery time for rating N.
- $H_{1}^{N}:$ The mean delivery time for rating 1 does not equal the mean delivery time for rating N.

In [None]:
df_sales.stats.glm(
    formula='delivery_time_days ~ C(order_avg_reviews_score, Treatment(reference=1))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
    , p_adjust='holm'
)

**Result:**

- At the 0.05 significance level, the mean delivery time for rating 1 is statistically significantly different from the mean delivery time for any other rating.
- Moreover, the mean delivery time for rating 1 is statistically significantly longer than for other ratings.

## Installments

**Are installment orders processed faster?**

- $H_{0}:$ The mean processing time of installment-based orders equals the mean processing time of non-installment orders.
- $H_{1}:$ The mean processing time of installment-based orders does not equal the mean processing time of non-installment orders.

In [None]:
df_sales.assign(from_purchase_to_approved_hours = lambda x: x.from_purchase_to_approved_hours + 1e-6).stats.glm(
    formula='from_purchase_to_approved_hours ~ C(order_has_installment, Treatment(reference="Has Installments"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean order processing time for installment payments is statistically significantly different from the mean order processing time for non-installment payments.
- Moreover, the mean order processing time for installment payments is statistically significantly shorter than for non-installment payments.

---

**Do installment orders have a higher average order value?**

- $H_{0}:$ The mean order value of installment-based orders equals the mean order value of non-installment orders.
- $H_{1}:$ The mean order value of installment-based orders does not equal the mean order value of non-installment orders.

In [None]:
df_sales.stats.glm(
    formula='total_payment ~ C(order_has_installment, Treatment(reference="Has Installments"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean order value for installment payments is statistically significantly different from the mean order value for non-installment payments.
- Moreover, the mean order value for installment payments is statistically significantly higher than for non-installment payments.

---

**Do installment orders have a higher average order weight?**

- $H_{0}:$ The mean weight of installment-based orders equals the mean weight of non-installment orders.
- $H_{1}:$ The mean weight of installment-based orders does not equal the mean weight of non-installment orders.

In [None]:
df_sales.assign(total_weight_kg = lambda x: x.total_weight_kg + 1e-6).stats.glm(
    formula='total_weight_kg ~ C(order_has_installment, Treatment(reference="Has Installments"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean order weight for installment payments is statistically significantly different from the mean order weight for non-installment payments.
- Moreover, the mean order weight for installment payments is statistically significantly higher than for non-installment payments.

---

**Do installment orders have a higher average product price in the order?**

- $H_{0}:$ The mean product price in orders with installment payments equals the mean product price in orders without installment payments.
- $H_{1}:$ The mean product price in orders with installment payments does not equal the mean product price in orders without installment payments.

In [None]:
df_sales.stats.glm(
    formula='avg_products_price ~ C(order_has_installment, Treatment(reference="Has Installments"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean product price in orders with installment payments is statistically significantly different from the mean product price in orders without installment payments.
- Moreover, the mean product price in orders with installment payments is statistically significantly higher than in orders without installment payments.

---

**Do installment orders have a higher average delivery cost?**

- $H_{0}:$ The mean delivery cost for orders with installment payments equals the mean delivery cost for orders without installment payments.
- $H_{1}:$ The mean delivery cost for orders with installment payments does not equal the mean delivery cost for orders without installment payments.

In [None]:
df_sales.assign(total_freight_value = lambda x: x.total_freight_value + 1e-6).stats.glm(
    formula='total_freight_value ~ C(order_has_installment, Treatment(reference="Has Installments"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean delivery cost for orders with installment payments is statistically significantly different from the mean delivery cost for orders without installment payments.
- Moreover, the mean delivery cost for orders with installment payments is statistically significantly higher than for orders without installment payments.

## Order Processing and Delivery

**Do delayed orders have a higher average order value?**

- $H_{0}:$ The mean value of delayed orders equals the mean value of non-delayed orders.
- $H_{1}:$ The mean value of delayed orders does not equal the mean value of non-delayed orders.

In [None]:
df_sales.stats.glm(
    formula='total_payment ~ C(is_delayed, Treatment(reference="Delayed"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean value of delayed orders is statistically significantly different from the mean value of non-delayed orders.
- Moreover, the mean value of delayed orders is statistically significantly higher than that of non-delayed orders.

---

**Do delayed orders have a higher average order weight?**

- $H_{0}:$ The mean weight of delayed orders equals the mean weight of non-delayed orders.
- $H_{1}:$ The mean weight of delayed orders does not equal the mean weight of non-delayed orders.

In [None]:
df_sales.assign(total_weight_kg = lambda x: x.total_weight_kg + 1e-6).stats.glm(
    formula='total_weight_kg ~ C(is_delayed, Treatment(reference="Delayed"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean weight of delayed orders is statistically significantly different from the mean weight of non-delayed orders.
- Moreover, the mean weight of delayed orders is statistically significantly higher than that of non-delayed orders.

---

**Do delayed orders have a higher average product price in the order?**

- $H_{0}:$ The mean product price in delayed orders equals the mean product price in non-delayed orders.
- $H_{1}:$ The mean product price in delayed orders does not equal the mean product price in non-delayed orders.

In [None]:
df_sales.stats.glm(
    formula='avg_products_price ~ C(is_delayed, Treatment(reference="Delayed"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean product price in delayed orders is statistically significantly different from the mean product price in non-delayed orders.
- Moreover, the mean product price in delayed orders is statistically significantly higher than in non-delayed orders.

---

**Do delayed orders have a higher average delivery cost?**

- $H_{0}:$ The mean delivery cost for delayed orders equals the mean delivery cost for non-delayed orders.
- $H_{1}:$ The mean delivery cost for delayed orders does not equal the mean delivery cost for non-delayed orders.

In [None]:
df_sales.assign(total_freight_value = lambda x: x.total_freight_value + 1e-6).stats.glm(
    formula='total_freight_value ~ C(is_delayed, Treatment(reference="Delayed"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
)

**Result:**

- At the 0.05 significance level, the mean delivery cost for delayed orders is statistically significantly different from the mean delivery cost for non-delayed orders.
- Moreover, the mean delivery cost for delayed orders is statistically significantly higher than for non-delayed orders.

---

**Does delivery delay affect the distribution of order ratings?**

- $H_{0}:$ The distribution of order ratings is identical for delayed and non-delayed orders.
- $H_{1}:$ The distribution of order ratings differs between delayed and non-delayed orders.

We will use:

- OrderedModel, since the rating is an ordinal variable.
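As a rough illustration of what an ordered (cumulative-link) model estimates, the sketch below computes rating probabilities from a set of cut points and a single coefficient. It assumes a logit link, and the notebook's `stats.ordered_model` helper may use a different one, such as probit. The thresholds and the coefficient are hypothetical values, not estimates from `df_sales`.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def rating_probabilities(thresholds, linear_predictor):
    """P(rating = k), k = 1..K, under a cumulative-logit model.

    P(rating <= k) = logistic(threshold_k - linear_predictor), so a larger
    linear predictor shifts probability mass toward higher ratings.
    """
    cumulative = [logistic(t - linear_predictor) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


thresholds = [-2.0, -1.0, 0.0, 1.0]  # hypothetical cut points between ratings 1..5
beta_not_delayed = 1.5               # hypothetical coefficient vs. the "Delayed" reference

probs_delayed = rating_probabilities(thresholds, 0.0)
probs_on_time = rating_probabilities(thresholds, beta_not_delayed)
print(probs_delayed)
print(probs_on_time)
```

With the reference level set to "Delayed", a positive coefficient for non-delayed orders means a lower probability of rating 1 and a higher probability of rating 5, which is the sense in which one group has "higher chances of receiving a higher rating".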

In [None]:
(
    df_sales.assign(
        order_avg_reviews_score=lambda x: pd.Categorical(
            x.order_avg_reviews_score,
            categories=[1, 2, 3, 4, 5],
            ordered=True
        )
    )
    .stats.ordered_model(
        formula = 'order_avg_reviews_score ~ C(is_delayed, Treatment(reference="Delayed"))'
    )
)

**Result:**

- At the 0.05 significance level, the distribution of order ratings differs between delayed and non-delayed orders.
- Moreover, non-delayed orders have statistically significantly higher chances of receiving a higher rating.

---

**Do expensive orders take longer to deliver?**

For each order price category N (except Expensive):

- $H_{0}^{N}:$ The mean delivery time for the Expensive price category equals the mean delivery time for category N.
- $H_{1}^{N}:$ The mean delivery time for the Expensive price category does not equal the mean delivery time for category N.

In [None]:
df_sales.stats.glm(
    formula='delivery_time_days ~ C(order_total_payment_cat, Treatment(reference="Expensive"))'
    , family=sm.families.Gamma(link=sm.families.links.Log())
    , p_adjust='holm'
)

**Result:**

- At the 0.05 significance level, the mean delivery time for the Expensive order category is statistically significantly different from the mean delivery time for any other category.
- Moreover, the mean delivery time for the Expensive category is statistically significantly longer than for other categories.

In [None]:
%run ../../_post_run.ipynb