Data Exploration

Table df_orders

Let’s look at the information about the dataframe.

df_orders.explore.info()
Dataframe Overview

Summary                            Column Types
Rows               99.44k          Text         2
Features           8               Categorical  1
Missing cells      4.91k (<1%)     Int          0
Exact Duplicates   ---             Float        0
Fuzzy Duplicates   ---             Datetime     5
Memory Usage (Mb)  19
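The `.explore.info()` accessor appears to come from a profiling extension; if it is unavailable, the same overview numbers can be reproduced with plain pandas. A minimal sketch on a toy frame (column names mirror df_orders, the data is made up):

```python
import pandas as pd

# Toy stand-in for df_orders; the real frame has ~99.44k rows.
df = pd.DataFrame({
    "order_id": ["a1", "b2", "c3", "d4"],
    "order_status": pd.Categorical(["Delivered", "Shipped", "Delivered", "Delivered"]),
    "order_purchase_dt": pd.to_datetime(["2017-01-01", "2017-01-02", None, "2017-01-04"]),
})

n_rows, n_cols = df.shape                            # Rows / Features
missing_cells = int(df.isna().sum().sum())           # Missing cells
dtype_counts = df.dtypes.astype(str).value_counts()  # Column Types
memory_mb = df.memory_usage(deep=True).sum() / 1024 ** 2  # Memory Usage (Mb)

print(n_rows, n_cols, missing_cells)
```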

Initial Column Analysis

We will examine each column individually.

order_id

df_orders.order_id.explore.info(plot=False)
Summary Statistics for "order_id" (Type: Text)

Summary                                 Text Metrics
Total Values            99.44k (100%)   Avg Word Count         1.0
Missing Values          ---             Max Length (chars)     32.0
Empty Strings           ---             Avg Length (chars)     32.0
Distinct Values         99.44k (100%)   Median Length (chars)  32.0
Non-Duplicates          99.44k (100%)   Min Length (chars)     32.0
Exact Duplicates        ---             Most Common Length     32 (100.0%)
Fuzzy Duplicates        ---             Avg Digit Ratio        0.63
Values with Duplicates  ---
Memory Usage            8

customer_id

df_orders.customer_id.explore.info(plot=False)
Summary Statistics for "customer_id" (Type: Text)

Summary                                 Text Metrics
Total Values            99.44k (100%)   Avg Word Count         1.0
Missing Values          ---             Max Length (chars)     32.0
Empty Strings           ---             Avg Length (chars)     32.0
Distinct Values         99.44k (100%)   Median Length (chars)  32.0
Non-Duplicates          99.44k (100%)   Min Length (chars)     32.0
Exact Duplicates        ---             Most Common Length     32 (100.0%)
Fuzzy Duplicates        ---             Avg Digit Ratio        0.62
Values with Duplicates  ---
Memory Usage            8

order_status

df_orders.order_status.explore.info()
Summary Statistics for "order_status" (Type: Categorical)

Summary                                Text Metrics                         Value Counts
Total Values            99.44k (100%)  Avg Word Count         1.0           Delivered    96.48k (97%)
Missing Values          ---            Max Length (chars)     11.0          Shipped      1.11k (1%)
Empty Strings           ---            Avg Length (chars)     9.0           Canceled     625 (<1%)
Distinct Values         8 (<1%)        Median Length (chars)  9.0           Unavailable  609 (<1%)
Non-Duplicates          ---            Min Length (chars)     7.0           Invoiced     314 (<1%)
Exact Duplicates        99.43k (99%)   Most Common Length     9 (97.0%)     Processing   301 (<1%)
Fuzzy Duplicates        99.43k (99%)   Avg Digit Ratio        0.00          Created      5 (<1%)
Values with Duplicates  8 (<1%)                                             Approved     2 (<1%)
Memory Usage            <1 Mb

Key Observations:

  • 97% of all orders were delivered
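The 97% figure is simply the normalized value counts of order_status; a sketch with a toy series in place of the real column:

```python
import pandas as pd

# Toy status column with the same dominant category as df_orders.
status = pd.Series(["Delivered"] * 97 + ["Shipped", "Canceled", "Invoiced"])

shares = status.value_counts(normalize=True)
delivered_share = shares["Delivered"]
print(f"{delivered_share:.0%}")  # prints "97%"
```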

order_purchase_dt

df_orders.order_purchase_dt.explore.info()
Summary Statistics for "order_purchase_dt" (Type: Datetime)

Summary                         Data Quality Stats            Temporal Stats
First date          2016-09-04  Values       99.44k (100%)    Missing Years        ---
Last date           2018-10-17  Zeros        ---              Missing Months       1 (4%)
Avg Days Frequency  0.01        Missings     ---              Missing Weeks        11 (10%)
Min Days Interval   0           Distinct     98.88k (99%)     Missing Days         140 (18%)
Max Days Interval   62          Duplicates   566 (<1%)        Weekend Percentage   23.0%
Memory Usage        1           Dup. Values  556 (<1%)        Most Common Weekday  Monday

Key Observations:

  • In order_purchase_dt, 4% of months, 10% of weeks, and 18% of days are missing.
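The "Missing Days" statistic counts calendar days between the first and last date that contain no observations. A sketch of the idea on made-up dates:

```python
import pandas as pd

# Made-up purchase timestamps spanning ten calendar days.
purchase_dt = pd.to_datetime(pd.Series(
    ["2017-01-01", "2017-01-02", "2017-01-05", "2017-01-05", "2017-01-10"]))

observed_days = set(purchase_dt.dt.normalize())
full_range = pd.date_range(purchase_dt.min().normalize(),
                           purchase_dt.max().normalize(), freq="D")
missing_days = [d for d in full_range if d not in observed_days]

# 4 of the 10 calendar days are observed, so 6 are "missing".
print(len(missing_days), len(full_range))
```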

order_approved_dt

df_orders.order_approved_dt.explore.info()
Summary Statistics for "order_approved_dt" (Type: Datetime)

Summary                         Data Quality Stats            Temporal Stats
First date          2016-09-15  Values       99.28k (99%)     Missing Years        ---
Last date           2018-09-03  Zeros        ---              Missing Months       1 (4%)
Avg Days Frequency  0.01        Missings     160 (<1%)        Missing Weeks        11 (11%)
Min Days Interval   0           Distinct     90.73k (91%)     Missing Days         108 (15%)
Max Days Interval   66          Duplicates   8.71k (9%)       Weekend Percentage   21.4%
Memory Usage        1           Dup. Values  7.04k (7%)       Most Common Weekday  Tuesday

Key Observations:

  • order_approved_dt has 160 missing values (<1% of total rows).

  • In order_approved_dt, 4% of months, 11% of weeks, and 15% of days are missing.

order_delivered_carrier_dt

df_orders.order_delivered_carrier_dt.explore.info()
Summary Statistics for "order_delivered_carrier_dt" (Type: Datetime)

Summary                         Data Quality Stats            Temporal Stats
First date          2016-10-08  Values       97.66k (98%)     Missing Years        ---
Last date           2018-09-11  Zeros        ---              Missing Months       ---
Avg Days Frequency  0.01        Missings     1.78k (2%)       Missing Weeks        2 (2%)
Min Days Interval   0           Distinct     81.02k (81%)     Missing Days         157 (22%)
Max Days Interval   20          Duplicates   18.42k (19%)     Weekend Percentage   1.9%
Memory Usage        1           Dup. Values  10.09k (10%)     Most Common Weekday  Tuesday

Key Observations:

  • order_delivered_carrier_dt has 1.78k missing values (2% of total rows).

  • In order_delivered_carrier_dt, 2% of weeks and 22% of days are missing.

order_delivered_customer_dt

df_orders.order_delivered_customer_dt.explore.info()
Summary Statistics for "order_delivered_customer_dt" (Type: Datetime)

Summary                         Data Quality Stats            Temporal Stats
First date          2016-10-11  Values       96.48k (97%)     Missing Years        ---
Last date           2018-10-17  Zeros        ---              Missing Months       ---
Avg Days Frequency  0.01        Missings     2.96k (3%)       Missing Weeks        3 (3%)
Min Days Interval   0           Distinct     95.66k (96%)     Missing Days         92 (12%)
Max Days Interval   13          Duplicates   3.78k (4%)       Weekend Percentage   6.9%
Memory Usage        1           Dup. Values  804 (<1%)        Most Common Weekday  Monday

Key Observations:

  • order_delivered_customer_dt has 2.96k missing values (3% of total rows).

  • In order_delivered_customer_dt, 3% of weeks and 12% of days are missing.

order_estimated_delivery_dt

df_orders.order_estimated_delivery_dt.explore.info(plot=False)
Summary Statistics for "order_estimated_delivery_dt" (Type: Datetime)

Summary                         Data Quality Stats            Temporal Stats
First date          2016-09-30  Values       99.44k (100%)    Missing Years        ---
Last date           2018-11-12  Zeros        ---              Missing Months       ---
Avg Days Frequency  0.01        Missings     ---              Missing Weeks        5 (4%)
Min Days Interval   0           Distinct     459 (<1%)        Missing Days         315 (41%)
Max Days Interval   16          Duplicates   98.98k (99%)     Weekend Percentage   0.0%
Memory Usage        1           Dup. Values  438 (<1%)        Most Common Weekday  Wednesday

Key Observations:

  • In order_estimated_delivery_dt, 4% of weeks and 41% of days are missing.

Adding Temporary Dimensions

To study anomalies across different dimensions, we will add temporary columns.

We will prefix their names with ‘tmp_’ to mark them as temporary metrics to be removed later.

They are temporary because the data may change during preprocessing, so the primary metrics will be created after preprocessing.

Let’s check the initial DataFrame size and save it to ensure no data is lost later.

print(df_orders.shape[0])
tmp_ids = df_orders.order_id
99441
tmp_df_reviews = (
    df_reviews.groupby('order_id', as_index=False)
    .agg(tmp_avg_reviews_score = ('review_score', 'mean'))
)
tmp_df_reviews['tmp_avg_reviews_score'] = np.floor(tmp_df_reviews['tmp_avg_reviews_score']).astype(int).astype('category')

tmp_df_payments = (
    df_payments.groupby('order_id', as_index=False)
    .agg(tmp_payment_types = ('payment_type', lambda x: ', '.join(x.unique())))
)
tmp_df_items = (
    df_items.merge(df_products, on='product_id', how='left')
    .assign(product_category_name = lambda x: x['product_category_name'].cat.add_categories(['missed in df_products']))
    .fillna({'product_category_name': 'missed in df_products'})
    .groupby('order_id', as_index=False)
    .agg(tmp_product_categories = ('product_category_name', lambda x: ', '.join(x.unique())))
)

df_orders = (
    df_orders.merge(tmp_df_reviews, on='order_id', how='left')
    .merge(tmp_df_payments, on='order_id', how='left')
    .merge(tmp_df_items, on='order_id', how='left')
    .merge(df_customers[['customer_id', 'customer_state']], on='customer_id', how='left')
    .rename(columns={'customer_state': 'tmp_customer_state'})
)

df_orders['tmp_product_categories'] = df_orders['tmp_product_categories'].fillna('Missing in Items').astype('category')

df_orders['tmp_payment_types'] = df_orders['tmp_payment_types'].fillna('Missing in Pays').astype('category')

df_orders['tmp_order_purchase_month'] = df_orders['order_purchase_dt'].dt.month_name().fillna('Missing purchase dt').astype('category')

df_orders['tmp_order_purchase_weekday'] = df_orders['order_purchase_dt'].dt.day_name().fillna('Missing purchase dt').astype('category')

conditions = [
    df_orders['order_purchase_dt'].isna()                      
    , df_orders['order_purchase_dt'].dt.hour.between(4,11)
    , df_orders['order_purchase_dt'].dt.hour.between(12,16)
    , df_orders['order_purchase_dt'].dt.hour.between(17,22)
    , df_orders['order_purchase_dt'].dt.hour.isin([23, 0, 1, 2, 3])
]
choices = ['Missing purchase dt', 'Morning', 'Afternoon', 'Evening', 'Night']
df_orders['tmp_purchase_time_of_day'] = np.select(conditions, choices, default='Missing purchase dt')
df_orders['tmp_purchase_time_of_day'] = df_orders['tmp_purchase_time_of_day'].astype('category')

conditions = [
    df_orders['order_delivered_customer_dt'].isna() | df_orders['order_estimated_delivery_dt'].isna()
    , df_orders['order_delivered_customer_dt'] > df_orders['order_estimated_delivery_dt']            
    , df_orders['order_delivered_customer_dt'] <= df_orders['order_estimated_delivery_dt']                           
]
choices = ['Missing delivery dt', 'Delayed', 'Not Delayed']
df_orders['tmp_is_delayed'] = np.select(conditions, choices, default='Missing delivery dt')
df_orders['tmp_is_delayed'] = df_orders['tmp_is_delayed'].astype('category')

conditions = [
    df_orders['order_status'].isna(), 
    df_orders['order_status'] == 'Delivered',               
    df_orders['order_status'] != 'Delivered',  
]
choices = ['Missing Status', 'Delivered', 'Not Delivered']
df_orders['tmp_is_delivered'] = np.select(conditions, choices, default='Missing Status')
df_orders['tmp_is_delivered'] = df_orders['tmp_is_delivered'].astype('category')

del tmp_df_reviews, tmp_df_payments, tmp_df_items

Let’s verify that nothing was lost.

df_orders.shape[0]
99441
set(df_orders.order_id) == set(tmp_ids)
True

All good.

Exploring Missing Values

Let’s examine which columns contain missing values.

df_orders.explore.anomalies_report(
    anomaly_type='missing'
    , width=600
)
Missings by Column

                              Count  Percent
order_approved_dt               160     0.2%
order_delivered_carrier_dt     1783     1.8%
order_delivered_customer_dt    2965     3.0%

Co-occurring Missings (Pairwise)

                             order_approved_dt    order_delivered_carrier_dt    order_delivered_customer_dt
order_approved_dt
order_delivered_carrier_dt   < 8.2% / ^ 91.2%
order_delivered_customer_dt  < 4.9% / ^ 91.2%     < 60.1% / ^ 99.9%
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_is_delayed Missing delivery dt 2965 2965 100.0% 3.0% 99.5% 96.5%
tmp_is_delivered Not Delivered 2963 2957 99.8% 3.0% 99.2% 96.2%
tmp_avg_reviews_score 1 11756 2086 17.7% 11.8% 70.0% 58.2%
order_status Shipped 1107 1107 100.0% 1.1% 37.1% 36.0%
tmp_product_categories Missing in Items 775 775 100.0% 0.8% 26.0% 25.2%
order_status Canceled 625 619 99.0% 0.6% 20.8% 20.1%
order_status Unavailable 609 609 100.0% 0.6% 20.4% 19.8%
order_status Invoiced 314 314 100.0% 0.3% 10.5% 10.2%
order_status Processing 301 301 100.0% 0.3% 10.1% 9.8%
Sample Missings Rows
  order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
95049 aef214f769de82d355c5d2ab49e15891 d83a72326816260884edd05238aaa508 Invoiced 2016-10-06 13:07:13 2016-10-06 16:03:13 NaT NaT 2016-12-02 00:00:00 5 Credit Card telefonia SP October Thursday Afternoon Missing delivery dt Not Delivered
14994 263f5778d1130e9c186958780172a107 7324ecb0ff143f561193d22bea7d63fb Canceled 2017-10-12 08:39:30 2017-10-12 09:06:26 NaT NaT 2017-11-06 00:00:00 1 Credit Card, Voucher dvds_blu_ray SP October Thursday Morning Missing delivery dt Not Delivered
3786 b07abc8b9acaf00e79b4657419f469f3 15308b044c9608fc82e57f2e5d6878f6 Unavailable 2017-03-09 17:45:04 2017-03-09 17:45:04 NaT NaT 2017-03-28 00:00:00 1 Credit Card Missing in Items SP March Thursday Evening Missing delivery dt Not Delivered
87216 560f8dd770a2b2c009a4fc935c507014 8ba0374666bfb38c409875762a5af81e Shipped 2017-12-16 18:51:09 2017-12-16 19:10:22 2017-12-19 20:28:56 NaT 2018-01-12 00:00:00 1 Credit Card telefonia RJ December Saturday Evening Missing delivery dt Not Delivered
72190 3757dc827b45811e32f858867877df67 442c2a88842315cb04a3d46658371689 Processing 2017-08-21 10:59:48 2017-08-23 02:46:19 NaT NaT 2017-09-14 00:00:00 2 Boleto informatica_acessorios SP August Monday Morning Missing delivery dt Not Delivered

Key Observations:

  • Missing values in these columns likely belong to orders that did not reach a certain status.
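One way to check this hypothesis without the anomalies report is to cross-tabulate each status against a missingness flag. A sketch on a toy frame (column names mirror df_orders, the data is made up):

```python
import pandas as pd

# Toy frame: delivery timestamps are only present for delivered orders.
df = pd.DataFrame({
    "order_status": ["Delivered", "Delivered", "Shipped", "Canceled"],
    "order_delivered_customer_dt": pd.to_datetime(
        ["2017-01-05", "2017-01-06", None, None]),
})

# Rows: status; columns: is the delivery timestamp missing?
ct = pd.crosstab(df["order_status"], df["order_delivered_customer_dt"].isna())
print(ct)
```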

We will analyze missing values in each column separately.

Missing Values in order_approved_dt

tmp_miss = df_orders[df_orders['order_approved_dt'].isna()]

Let’s examine missing values in the payment approval time over time, using the order purchase time as the time axis.

df_orders['order_approved_dt'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , anomaly_type='missing'
    , freq='W'
)

Key Observations:

  • In February 2017 and August 2018, there was a spike in orders missing payment approval timestamps.
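anomalies_over_time essentially counts missing values per time bucket; a plain-pandas sketch of the weekly aggregation, on made-up data:

```python
import pandas as pd

# Made-up orders: approval timestamps go missing more often in week 2.
df = pd.DataFrame({
    "order_purchase_dt": pd.to_datetime(
        ["2017-02-01", "2017-02-02", "2017-02-09", "2017-02-10"]),
    "order_approved_dt": pd.to_datetime(
        ["2017-02-01", None, None, None]),
})

# Missing approvals per purchase week.
weekly_missing = (
    df["order_approved_dt"].isna()
    .groupby(df["order_purchase_dt"].dt.to_period("W"))
    .sum()
)
print(weekly_missing)
```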

Let’s analyze by order status.

df_orders['order_approved_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='order_status'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
order_status Canceled 625 141 22.6% 0.6% 88.1% 87.5%
order_status Created 5 5 100.0% 0.0% 3.1% 3.1%
order_status Delivered 96478 14 0.0% 97.0% 8.8% -88.3%

Key Observations:

  • Missing values for the “canceled” and “created” statuses are expected.

  • However, 14 missing values in order_approved_dt for orders with “delivered” status are unusual.

Let’s examine these 14 delivered orders with missing order_approved_dt.

tmp_miss[lambda x: x.order_status == 'Delivered']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
5323 e04abd8149ef81b95221e88f6ed9ab6a 2127dc6603ac33544953ef05ec155771 Delivered 2017-02-18 14:40:00 NaT 2017-02-23 12:04:47 2017-03-01 13:25:33 2017-03-17 4 Boleto eletroportateis SP February Saturday Afternoon Not Delayed Delivered
16567 8a9adc69528e1001fc68dd0aaebbb54a 4c1ccc74e00993733742a3c786dc3c1f Delivered 2017-02-18 12:45:31 NaT 2017-02-23 09:01:52 2017-03-02 10:05:06 2017-03-21 5 Boleto construcao_ferramentas_seguranca RS February Saturday Afternoon Not Delayed Delivered
19031 7013bcfc1c97fe719a7b5e05e61c12db 2941af76d38100e0f8740a374f1a5dc3 Delivered 2017-02-18 13:29:47 NaT 2017-02-22 16:25:25 2017-03-01 08:07:38 2017-03-17 5 Boleto missed in df_products SP February Saturday Afternoon Not Delayed Delivered
22663 5cf925b116421afa85ee25e99b4c34fb 29c35fc91fc13fb5073c8f30505d860d Delivered 2017-02-18 16:48:35 NaT 2017-02-22 11:23:10 2017-03-09 07:28:47 2017-03-31 5 Boleto cool_stuff CE February Saturday Afternoon Not Delayed Delivered
23156 12a95a3c06dbaec84bcfb0e2da5d228a 1e101e0daffaddce8159d25a8e53f2b2 Delivered 2017-02-17 13:05:55 NaT 2017-02-22 11:23:11 2017-03-02 11:09:19 2017-03-20 5 Boleto cool_stuff RJ February Friday Afternoon Not Delayed Delivered
26800 c1d4211b3dae76144deccd6c74144a88 684cb238dc5b5d6366244e0e0776b450 Delivered 2017-01-19 12:48:08 NaT 2017-01-25 14:56:50 2017-01-30 18:16:01 2017-03-01 4 Boleto esporte_lazer SP January Thursday Afternoon Not Delayed Delivered
38290 d69e5d356402adc8cf17e08b5033acfb 68d081753ad4fe22fc4d410a9eb1ca01 Delivered 2017-02-19 01:28:47 NaT 2017-02-23 03:11:48 2017-03-02 03:41:58 2017-03-27 5 Boleto moveis_decoracao SP February Sunday Night Not Delayed Delivered
39334 d77031d6a3c8a52f019764e68f211c69 0bf35cac6cc7327065da879e2d90fae8 Delivered 2017-02-18 11:04:19 NaT 2017-02-23 07:23:36 2017-03-02 16:15:23 2017-03-22 5 Boleto esporte_lazer SP February Saturday Morning Not Delayed Delivered
48401 7002a78c79c519ac54022d4f8a65e6e8 d5de688c321096d15508faae67a27051 Delivered 2017-01-19 22:26:59 NaT 2017-01-27 11:08:05 2017-02-06 14:22:19 2017-03-16 2 Boleto moveis_decoracao MG January Thursday Evening Not Delayed Delivered
61743 2eecb0d85f281280f79fa00f9cec1a95 a3d3c38e58b9d2dfb9207cab690b6310 Delivered 2017-02-17 17:21:55 NaT 2017-02-22 11:42:51 2017-03-03 12:16:03 2017-03-20 5 Boleto ferramentas_jardim MG February Friday Evening Not Delayed Delivered
63052 51eb2eebd5d76a24625b31c33dd41449 07a2a7e0f63fd8cb757ed77d4245623c Delivered 2017-02-18 15:52:27 NaT 2017-02-23 03:09:14 2017-03-07 13:57:47 2017-03-29 5 Boleto moveis_decoracao MG February Saturday Afternoon Not Delayed Delivered
67697 88083e8f64d95b932164187484d90212 f67cd1a215aae2a1074638bbd35a223a Delivered 2017-02-18 22:49:19 NaT 2017-02-22 11:31:06 2017-03-02 12:06:06 2017-03-21 4 Boleto telefonia RJ February Saturday Evening Not Delayed Delivered
72407 3c0b8706b065f9919d0505d3b3343881 d85919cb3c0529589c6fa617f5f43281 Delivered 2017-02-17 15:53:27 NaT 2017-02-22 11:31:30 2017-03-03 11:47:47 2017-03-23 3 Boleto cama_mesa_banho RS February Friday Afternoon Not Delayed Delivered
84999 2babbb4b15e6d2dfe95e2de765c97bce 74bebaf46603f9340e3b50c6b086f992 Delivered 2017-02-18 17:15:03 NaT 2017-02-22 11:23:11 2017-03-03 18:43:43 2017-03-31 4 Boleto cool_stuff MA February Saturday Evening Not Delayed Delivered

Key Observations:

  • All delivered orders with missing order_approved_dt used “boleto” as the payment method. This may be a characteristic of “boleto” usage.

  • All these orders were placed in January and February 2017. There may have been a system issue where approval timestamps were not saved.
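The observation above can be verified directly: filter delivered orders with a missing approval timestamp and inspect their payment types and purchase months. A sketch on a toy frame mimicking those rows:

```python
import pandas as pd

# Toy frame: delivered orders whose approval timestamp was never recorded.
df = pd.DataFrame({
    "order_status": ["Delivered", "Delivered", "Shipped"],
    "order_approved_dt": pd.to_datetime([None, None, "2017-03-01"]),
    "order_purchase_dt": pd.to_datetime(["2017-01-19", "2017-02-18", "2017-03-01"]),
    "tmp_payment_types": ["Boleto", "Boleto", "Credit Card"],
})

suspect = df[(df["order_status"] == "Delivered") & df["order_approved_dt"].isna()]

all_boleto = (suspect["tmp_payment_types"] == "Boleto").all()
months = sorted(suspect["order_purchase_dt"].dt.month.unique().tolist())
print(all_boleto, months)  # all boleto, January and February
```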

Let’s examine the 5 “created” orders with missing payment approval timestamps.

tmp_miss[lambda x: x.order_status == 'Created']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
7434 b5359909123fa03c50bdb0cfed07f098 438449d4af8980d107bf04571413a8e7 Created 2017-12-05 01:07:52 NaT NaT NaT 2018-01-11 1 Credit Card Missing in Items SP December Tuesday Night Missing delivery dt Not Delivered
9238 dba5062fbda3af4fb6c33b1e040ca38f 964a6df3d9bdf60fe3e7b8bb69ed893a Created 2018-02-09 17:21:04 NaT NaT NaT 2018-03-07 1 Boleto Missing in Items DF February Friday Evening Missing delivery dt Not Delivered
21441 7a4df5d8cff4090e541401a20a22bb80 725e9c75605414b21fd8c8d5a1c2f1d6 Created 2017-11-25 11:10:33 NaT NaT NaT 2017-12-12 1 Boleto Missing in Items RJ November Saturday Morning Missing delivery dt Not Delivered
55086 35de4050331c6c644cddc86f4f2d0d64 4ee64f4bfc542546f422da0aeb462853 Created 2017-12-05 01:07:58 NaT NaT NaT 2018-01-08 1 Credit Card Missing in Items RS December Tuesday Night Missing delivery dt Not Delivered
58958 90ab3e7d52544ec7bc3363c82689965f 7d61b9f4f216052ba664f22e9c504ef1 Created 2017-11-06 13:12:34 NaT NaT NaT 2017-12-01 5 Credit Card Missing in Items PR November Monday Afternoon Missing delivery dt Not Delivered

Key Observations:

  • Orders with “created” status and missing payment approval timestamps were placed long ago and never delivered. The data may not have been processed.

Let’s analyze by average order review score.

df_orders['order_approved_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='tmp_avg_reviews_score'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 67 0.6% 11.8% 41.9% 30.1%
tmp_avg_reviews_score 2 3244 16 0.5% 3.3% 10.0% 6.7%
tmp_avg_reviews_score 3 8268 18 0.2% 8.3% 11.2% 2.9%
tmp_avg_reviews_score 4 19129 16 0.1% 19.2% 10.0% -9.2%
tmp_avg_reviews_score 5 57044 43 0.1% 57.4% 26.9% -30.5%

Key Observations:

  • The difference in proportions is significantly higher for score 1. These orders were likely not delivered.
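The "% Diff" column in these reports is the category's share among missing rows minus its share among all rows; a sketch of the computation on a toy frame:

```python
import pandas as pd

# Toy frame: score 1 makes up 20% of all rows but 100% of missing rows.
df = pd.DataFrame({
    "tmp_avg_reviews_score": [1, 1, 5, 5, 5, 5, 5, 5, 5, 5],
    "order_approved_dt": pd.to_datetime([None, "2017-01-01"] + ["2017-01-01"] * 8),
})

total_pct = df["tmp_avg_reviews_score"].value_counts(normalize=True)
anomaly_pct = (
    df.loc[df["order_approved_dt"].isna(), "tmp_avg_reviews_score"]
    .value_counts(normalize=True)
)
# Categories absent from the missing rows get minus their total share.
pct_diff = (anomaly_pct - total_pct).fillna(-total_pct)
print(pct_diff)
```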

Let’s examine a word cloud from review messages.

tmp_miss = tmp_miss.merge(df_reviews, on='order_id', how='left')
tmp_miss.viz.wordcloud('review_comment_message')

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_miss.analysis.sentiment('review_comment_message')

Key Observations:

  • The sentiment is not predominantly negative.

Let’s randomly sample 20 review comments. We’ll repeat this several times.

messages = (
    tmp_miss['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['contact lenses are positive for farsightedness or negative for myopia i have farsightedness i need plus lenses negative lenses were delivered no one answers me to resolve',
 'product did not come as purchased',
 'the product i ordered came in a number different from the one i ordered',
 'the product arrived at me diverging from what was reported on the website about how it works missing speed reducer button',
 'contact lenses are positive for farsightedness or negative for myopia i have farsightedness i need plus lenses negative lenses were delivered no one answers me to resolve',
 'although the product was out of stock the company made contact to notify and canceled the order',
 'more than 30 days after the purchase was made i received an email stating that the product i purchased was no longer available simply a disregard for the consumer and a regrettable service',
 'excellent',
 'i believe the order has been cancelled',
 'they exchanged my product they sent me someone else s product',
 'delivery of the product different from the one ordered i await guidance for the exchange',
 'congratulations',
 'i did not receive the product it was canceled on request before billing',
 'i always buy from lannister stores and i ve never had any problems i think it s safe',
 'product was supposed to arrive on the 04/01th and until today nothing a huge amount of nonsense i want urgent provisions to be taken',
 'i m waiting could you send a confirmation if it will be delivered',
 'easy to assemble',
 'product different than advertised',
 'i am super disappointed the product came before the deadline but it has come wrong and so far i could not exchange or cancel the purchase',
 'and remold is not reliable']

Key Observations:

  • Based on review messages, many orders were not delivered, but a significant number were delivered.

  • Therefore, missing payment approval timestamps cannot be assumed to indicate order cancellation.

Let’s analyze by payment type.

df_orders['order_approved_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='tmp_payment_types'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Voucher 1621 71 4.4% 1.6% 44.4% 42.7%
tmp_payment_types Not Defined 3 3 100.0% 0.0% 1.9% 1.9%
tmp_payment_types Voucher, Credit Card 1118 1 0.1% 1.1% 0.6% -0.5%
tmp_payment_types Credit Card, Voucher 1127 1 0.1% 1.1% 0.6% -0.5%
tmp_payment_types Boleto 19784 30 0.2% 19.9% 18.8% -1.1%
tmp_payment_types Credit Card 74259 54 0.1% 74.7% 33.8% -40.9%

Key Observations:

  • The “voucher” payment type is markedly over-represented among orders with missing payment approval timestamps, so the missingness is likely a characteristic of how voucher payments are processed.
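The same table can be approximated with a groupby: the mean of the missingness flag per payment type is the anomaly rate. A sketch on toy data:

```python
import pandas as pd

# Toy payments: vouchers lose the approval timestamp more often.
df = pd.DataFrame({
    "tmp_payment_types": ["Voucher", "Voucher", "Credit Card", "Credit Card"],
    "order_approved_dt": pd.to_datetime(
        [None, "2017-01-01", "2017-01-01", "2017-01-01"]),
})

miss_rate = (
    df["order_approved_dt"].isna()
    .groupby(df["tmp_payment_types"])
    .mean()
    .sort_values(ascending=False)
)
print(miss_rate)
```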

Let’s analyze by month.

df_orders['order_approved_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='tmp_order_purchase_month'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_order_purchase_month August 10843 59 0.5% 10.9% 36.9% 26.0%
tmp_order_purchase_month September 4305 19 0.4% 4.3% 11.9% 7.5%
tmp_order_purchase_month October 4959 15 0.3% 5.0% 9.4% 4.4%
tmp_order_purchase_month February 8508 18 0.2% 8.6% 11.2% 2.7%
tmp_order_purchase_month December 5674 7 0.1% 5.7% 4.4% -1.3%
tmp_order_purchase_month November 7544 9 0.1% 7.6% 5.6% -2.0%
tmp_order_purchase_month May 10573 10 0.1% 10.6% 6.2% -4.4%
tmp_order_purchase_month January 8069 4 0.0% 8.1% 2.5% -5.6%
tmp_order_purchase_month July 10318 6 0.1% 10.4% 3.8% -6.6%
tmp_order_purchase_month March 9893 5 0.1% 9.9% 3.1% -6.8%
tmp_order_purchase_month April 9343 4 0.0% 9.4% 2.5% -6.9%
tmp_order_purchase_month June 9412 4 0.0% 9.5% 2.5% -7.0%

Key Observations:

  • August has a noticeably higher proportion of missing values than other months. This is also visible in the graph above.


Missing Values in order_delivered_carrier_dt

tmp_miss = df_orders[df_orders['order_delivered_carrier_dt'].isna()]

Let’s examine the distribution of missing values in the carrier handover time.

df_orders['order_delivered_carrier_dt'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , anomaly_type='missing'
    , freq='W'
)

Key Observations:

  • In November 2017, there was a spike in orders missing carrier handover timestamps. This may be related to Black Friday.

Let’s analyze by order status.

df_orders['order_delivered_carrier_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='order_status'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
order_status Unavailable 609 609 100.0% 0.6% 34.2% 33.5%
order_status Canceled 625 550 88.0% 0.6% 30.8% 30.2%
order_status Invoiced 314 314 100.0% 0.3% 17.6% 17.3%
order_status Processing 301 301 100.0% 0.3% 16.9% 16.6%
order_status Created 5 5 100.0% 0.0% 0.3% 0.3%
order_status Approved 2 2 100.0% 0.0% 0.1% 0.1%
order_status Delivered 96478 2 0.0% 97.0% 0.1% -96.9%

Key Observations:

  • There are 2 delivered orders with missing order_delivered_carrier_dt.

  • All orders with “unavailable” status have missing order_delivered_carrier_dt.
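The claim that every “unavailable” order lacks a carrier handover timestamp is easy to assert directly; a sketch on a toy frame (columns mirror df_orders):

```python
import pandas as pd

# Toy frame: "Unavailable" orders never reached a carrier.
df = pd.DataFrame({
    "order_status": ["Unavailable", "Unavailable", "Delivered"],
    "order_delivered_carrier_dt": pd.to_datetime([None, None, "2017-01-03"]),
})

unavailable = df[df["order_status"] == "Unavailable"]
all_missing = bool(unavailable["order_delivered_carrier_dt"].isna().all())
print(all_missing)  # prints "True"
```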

Let’s examine these 2 delivered orders.

tmp_miss[lambda x: x.order_status == 'Delivered'].merge(df_payments, on='order_id', how='left')
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered payment_sequential payment_type payment_installments payment_value
0 2aa91108853cecb43c84a5dc5b277475 afeb16c7f46396c0ed54acb45ccaaa40 Delivered 2017-09-29 08:52:58 2017-09-29 09:07:16 NaT 2017-11-20 19:44:47 2017-11-14 5 Credit Card moveis_decoracao SP September Friday Morning Delayed Delivered 1 Credit Card 1 193.98
1 2d858f451373b04fb5c984a1cc2defaf e08caf668d499a6d643dafd7c5cc498a Delivered 2017-05-25 23:22:43 2017-05-25 23:30:16 NaT NaT 2017-06-23 5 Credit Card esporte_lazer RS May Thursday Night Missing delivery dt Delivered 1 Credit Card 4 194.00

Key Observations:

  • Both orders with missing order_delivered_carrier_dt were paid via credit card.

Let’s analyze by average review score.

df_orders['order_delivered_carrier_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='tmp_avg_reviews_score'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 1350 11.5% 11.8% 75.7% 63.9%
tmp_avg_reviews_score 2 3244 133 4.1% 3.3% 7.5% 4.2%
tmp_avg_reviews_score 3 8268 107 1.3% 8.3% 6.0% -2.3%
tmp_avg_reviews_score 4 19129 64 0.3% 19.2% 3.6% -15.6%
tmp_avg_reviews_score 5 57044 129 0.2% 57.4% 7.2% -50.1%

Key Observations:

  • Review score 1 is by far the most over-represented category among orders with missing carrier handover timestamps, suggesting these orders were not delivered and customers were highly dissatisfied.

Let’s examine a word cloud from review messages.

tmp_miss = tmp_miss.merge(df_reviews, on='order_id', how='left')
tmp_miss.viz.wordcloud('review_comment_message')

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_miss.analysis.sentiment('review_comment_message')

Key Observations:

  • Negative reviews outnumber positive ones, and the boxplot body lies mostly below 0.

Let’s randomly sample 20 review comments. We’ll repeat this several times.

messages = (
    tmp_miss['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['the supplier got in touch but did not provide any solution or delivery time it was not responsible for the delay it blamed a third party i just bought it and i want to receive it',
 'until now i have not received my product i would like an opinion on delivery',
 'product not delivered',
 'i have not received the product because the seller cannot invoice in the name of the company with the cnpj requested purchase cancellation',
 'the company doesn t have the product to deliver and canceled the purchase but the charge came on my credit card statement and they didn t even give a satisfaction about the reversal of the charge',
 'i bought a product that was not available i suggest that you review the site every hour to avoid this type of inconvenience',
 'no communication was passed to me',
 'i bought the charger thinking of taking it on a trip i trusted the delivery time but i was fooled because the product has not yet arrived i requested the cancellation of the purchase disappointing',
 'i have not received the product',
 'dear; to date the product has not been delivered to me i would like to know what steps lannisters com will take so that i don t end up at a loss it was my 1st purchase to test the store',
 'top product came on time',
 'i bought it and until today it has not been delivered',
 'i didn t receive the product',
 'not at all satisfied because we bought the products they did not arrive on the expected date we are very upset with the site we need an urgent response from you',
 'it took too long to inform that the product didn t have it in stock',
 'the seller did not have the ordered product canceled',
 'the product has not been delivered',
 'it s been a month since i bought it and so far i haven t received all the merchandise purchased',
 'product unavailable after purchase approval i would not recommend a partner i was very dissatisfied',
 'the purchase was canceled without me asking']

Key Observations:

  • Based on review messages, many orders were not delivered, but a significant number were delivered.

  • Orders with missing carrier handover timestamps were more frequently undelivered compared to those with missing payment approval timestamps.

  • Some products may have been out of stock, and sellers did not hand them over to carriers.

  • However, since many orders were still delivered, missing values cannot be assumed to indicate order cancellation.

Let’s analyze by customer state.

df_orders['order_delivered_carrier_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , include_columns='tmp_customer_state'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state SP 41746 888 2.1% 42.0% 49.8% 7.8%
tmp_customer_state RO 253 10 4.0% 0.3% 0.6% 0.3%
tmp_customer_state PI 495 11 2.2% 0.5% 0.6% 0.1%
tmp_customer_state PR 5045 91 1.8% 5.1% 5.1% 0.0%
tmp_customer_state RR 46 1 2.2% 0.0% 0.1% 0.0%
tmp_customer_state AP 68 1 1.5% 0.1% 0.1% -0.0%
tmp_customer_state SE 350 6 1.7% 0.4% 0.3% -0.0%
tmp_customer_state AL 413 7 1.7% 0.4% 0.4% -0.0%
tmp_customer_state MA 747 12 1.6% 0.8% 0.7% -0.1%
tmp_customer_state PB 536 8 1.5% 0.5% 0.4% -0.1%
tmp_customer_state AM 148 1 0.7% 0.1% 0.1% -0.1%
tmp_customer_state MS 715 11 1.5% 0.7% 0.6% -0.1%
tmp_customer_state TO 280 2 0.7% 0.3% 0.1% -0.2%
tmp_customer_state RN 485 4 0.8% 0.5% 0.2% -0.3%
tmp_customer_state SC 3637 60 1.6% 3.7% 3.4% -0.3%
tmp_customer_state GO 2020 31 1.5% 2.0% 1.7% -0.3%
tmp_customer_state MG 11635 203 1.7% 11.7% 11.4% -0.3%
tmp_customer_state CE 1336 18 1.3% 1.3% 1.0% -0.3%
tmp_customer_state PE 1652 23 1.4% 1.7% 1.3% -0.4%
tmp_customer_state BA 3380 53 1.6% 3.4% 3.0% -0.4%
tmp_customer_state DF 2140 30 1.4% 2.2% 1.7% -0.5%
tmp_customer_state PA 975 9 0.9% 1.0% 0.5% -0.5%
tmp_customer_state MT 907 6 0.7% 0.9% 0.3% -0.6%
tmp_customer_state RS 5466 84 1.5% 5.5% 4.7% -0.8%
tmp_customer_state ES 2033 19 0.9% 2.0% 1.1% -1.0%
tmp_customer_state RJ 12852 194 1.5% 12.9% 10.9% -2.0%

Key Observations:

  • São Paulo accounts for 49.8% of the missing values but only 42.0% of all orders (a 7.8-point gap, the largest among the states).
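The table above can be approximated with plain pandas. The sketch below is a hypothetical helper (not part of the `explore` accessor) that compares each category’s share of all rows with its share of rows where a value is missing; a positive difference means the category is overrepresented among the missing values:

```python
import pandas as pd

# Hypothetical plain-pandas equivalent of anomalies_by_categories(anomaly_type='missing').
def missing_share_by_category(df: pd.DataFrame, value_col: str, cat_col: str) -> pd.DataFrame:
    # Share of each category among all rows.
    total_pct = df[cat_col].value_counts(normalize=True)
    # Share of each category among rows where value_col is missing.
    missing_pct = df.loc[df[value_col].isna(), cat_col].value_counts(normalize=True)
    out = pd.DataFrame({'total_pct': total_pct, 'missing_pct': missing_pct}).fillna(0.0)
    out['pct_diff'] = out['missing_pct'] - out['total_pct']
    return out.sort_values('pct_diff', ascending=False)

# e.g. missing_share_by_category(df_orders, 'order_delivered_carrier_dt', 'tmp_customer_state')
```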


Missing Values in order_delivered_customer_dt

tmp_miss = df_orders[df_orders['order_delivered_customer_dt'].isna()]

Let’s examine the distribution of missing values in customer delivery time.

df_orders['order_delivered_customer_dt'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , anomaly_type='missing'
    , freq='W'
)
[Figure: weekly count of orders with missing order_delivered_customer_dt, by purchase date]

Key Observations:

  • In November 2017, there was a spike in orders missing customer delivery timestamps.

Let’s analyze by order status.

df_orders['order_delivered_customer_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='order_status'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
order_status Shipped 1107 1107 100.0% 1.1% 37.3% 36.2%
order_status Canceled 625 619 99.0% 0.6% 20.9% 20.2%
order_status Unavailable 609 609 100.0% 0.6% 20.5% 19.9%
order_status Invoiced 314 314 100.0% 0.3% 10.6% 10.3%
order_status Processing 301 301 100.0% 0.3% 10.2% 9.8%
order_status Created 5 5 100.0% 0.0% 0.2% 0.2%
order_status Approved 2 2 100.0% 0.0% 0.1% 0.1%
order_status Delivered 96478 8 0.0% 97.0% 0.3% -96.8%

Key Observations:

  • There are 8 orders with “delivered” status but missing delivery timestamps.

Let’s analyze by customer state.

df_orders['order_delivered_customer_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , include_columns='tmp_customer_state'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state RJ 12852 499 3.9% 12.9% 16.8% 3.9%
tmp_customer_state BA 3380 124 3.7% 3.4% 4.2% 0.8%
tmp_customer_state CE 1336 57 4.3% 1.3% 1.9% 0.6%
tmp_customer_state PE 1652 59 3.6% 1.7% 2.0% 0.3%
tmp_customer_state MA 747 30 4.0% 0.8% 1.0% 0.3%
tmp_customer_state SP 41746 1251 3.0% 42.0% 42.2% 0.2%
tmp_customer_state SE 350 15 4.3% 0.4% 0.5% 0.2%
tmp_customer_state PI 495 19 3.8% 0.5% 0.6% 0.1%
tmp_customer_state AL 413 16 3.9% 0.4% 0.5% 0.1%
tmp_customer_state RR 46 5 10.9% 0.0% 0.2% 0.1%
tmp_customer_state PB 536 19 3.5% 0.5% 0.6% 0.1%
tmp_customer_state GO 2020 63 3.1% 2.0% 2.1% 0.1%
tmp_customer_state RO 253 10 4.0% 0.3% 0.3% 0.1%
tmp_customer_state PA 975 29 3.0% 1.0% 1.0% -0.0%
tmp_customer_state AP 68 1 1.5% 0.1% 0.0% -0.0%
tmp_customer_state AM 148 3 2.0% 0.1% 0.1% -0.0%
tmp_customer_state AC 81 1 1.2% 0.1% 0.0% -0.0%
tmp_customer_state TO 280 6 2.1% 0.3% 0.2% -0.1%
tmp_customer_state RN 485 11 2.3% 0.5% 0.4% -0.1%
tmp_customer_state DF 2140 60 2.8% 2.2% 2.0% -0.1%
tmp_customer_state MT 907 21 2.3% 0.9% 0.7% -0.2%
tmp_customer_state MS 715 14 2.0% 0.7% 0.5% -0.2%
tmp_customer_state SC 3637 90 2.5% 3.7% 3.0% -0.6%
tmp_customer_state ES 2033 38 1.9% 2.0% 1.3% -0.8%
tmp_customer_state PR 5045 122 2.4% 5.1% 4.1% -1.0%
tmp_customer_state RS 5466 122 2.2% 5.5% 4.1% -1.4%
tmp_customer_state MG 11635 280 2.4% 11.7% 9.4% -2.3%

Key Observations:

  • Rio de Janeiro accounts for 16.8% of the missing values versus 12.9% of all orders, the largest gap among the states.

Let’s examine these 8 delivered orders.

tmp_miss[lambda x: x.order_status == 'Delivered'].merge(df_payments, on='order_id', how='left')
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered payment_sequential payment_type payment_installments payment_value
0 2d1e2d5bf4dc7227b3bfebb81328c15f ec05a6d8558c6455f0cbbd8a420ad34f Delivered 2017-11-28 17:44:07 2017-11-28 17:56:40 2017-11-30 18:12:23 NaT 2017-12-18 5 Credit Card automotivo SP November Tuesday Evening Missing delivery dt Delivered 1 Credit Card 3 134.83
1 f5dd62b788049ad9fc0526e3ad11a097 5e89028e024b381dc84a13a3570decb4 Delivered 2018-06-20 06:58:43 2018-06-20 07:19:05 2018-06-25 08:05:00 NaT 2018-07-16 5 Debit Card industria_comercio_e_negocios SP June Wednesday Morning Missing delivery dt Delivered 1 Debit Card 1 354.24
2 2ebdfc4f15f23b91474edf87475f108e 29f0540231702fda0cfdee0a310f11aa Delivered 2018-07-01 17:05:11 2018-07-01 17:15:12 2018-07-03 13:57:00 NaT 2018-07-30 5 Credit Card relogios_presentes SP July Sunday Evening Missing delivery dt Delivered 1 Credit Card 3 158.07
3 e69f75a717d64fc5ecdfae42b2e8e086 cfda40ca8dd0a5d486a9635b611b398a Delivered 2018-07-01 22:05:55 2018-07-01 22:15:14 2018-07-03 13:57:00 NaT 2018-07-30 5 Credit Card relogios_presentes SP July Sunday Evening Missing delivery dt Delivered 1 Credit Card 1 158.07
4 0d3268bad9b086af767785e3f0fc0133 4f1d63d35fb7c8999853b2699f5c7649 Delivered 2018-07-01 21:14:02 2018-07-01 21:29:54 2018-07-03 09:28:00 NaT 2018-07-24 5 Credit Card brinquedos SP July Sunday Evening Missing delivery dt Delivered 1 Credit Card 4 204.62
5 2d858f451373b04fb5c984a1cc2defaf e08caf668d499a6d643dafd7c5cc498a Delivered 2017-05-25 23:22:43 2017-05-25 23:30:16 NaT NaT 2017-06-23 5 Credit Card esporte_lazer RS May Thursday Night Missing delivery dt Delivered 1 Credit Card 4 194.00
6 ab7c89dc1bf4a1ead9d6ec1ec8968a84 dd1b84a7286eb4524d52af4256c0ba24 Delivered 2018-06-08 12:09:39 2018-06-08 12:36:39 2018-06-12 14:10:00 NaT 2018-06-26 1 Credit Card informatica_acessorios SP June Friday Afternoon Missing delivery dt Delivered 1 Credit Card 5 120.12
7 20edc82cf5400ce95e1afacc25798b31 28c37425f1127d887d7337f284080a0f Delivered 2018-06-27 16:09:12 2018-06-27 16:29:30 2018-07-03 19:26:00 NaT 2018-07-19 5 Credit Card livros_interesse_geral SP June Wednesday Afternoon Missing delivery dt Delivered 1 Credit Card 1 54.97

Key Observations:

  • 7 out of 8 orders with missing order_delivered_customer_dt were paid via credit card, and 1 was paid via debit card.

Let’s analyze by average review score.

df_orders['order_delivered_customer_dt'].explore.anomalies_by_categories(
    anomaly_type='missing'
    , pct_diff_threshold=-100
    , include_columns='tmp_avg_reviews_score'
)
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 2086 17.7% 11.8% 70.4% 58.5%
tmp_avg_reviews_score 2 3244 223 6.9% 3.3% 7.5% 4.3%
tmp_avg_reviews_score 3 8268 235 2.8% 8.3% 7.9% -0.4%
tmp_avg_reviews_score 4 19129 156 0.8% 19.2% 5.3% -14.0%
tmp_avg_reviews_score 5 57044 265 0.5% 57.4% 8.9% -48.4%

Key Observations:

  • The difference in proportions is significantly higher for score 1. These orders were likely not delivered.

Let’s examine a word cloud from review messages.

tmp_miss = tmp_miss.merge(df_reviews, on='order_id', how='left')
tmp_miss.viz.wordcloud('review_comment_message')
[Figure: word cloud of review_comment_message]

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_miss.analysis.sentiment('review_comment_message')
[Figure: sentiment distribution of review_comment_message]

Key Observations:

  • Negative reviews outnumber positive ones, and the boxplot body lies mostly below 0.
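For intuition only: lexicon-based scorers assign polarity by counting positive and negative words, normalized by text length. The word lists below are invented for illustration and this is not the model behind `analysis.sentiment`:

```python
# Toy lexicon-based polarity scorer (illustrative only; word lists are made up).
POSITIVE = {'good', 'great', 'excellent', 'satisfied', 'recommend', 'arrived'}
NEGATIVE = {'not', 'never', 'bad', 'terrible', 'delay', 'disappointed', 'canceled'}

def toy_polarity(text: str) -> float:
    """Return a score in [-1, 1]: (positive hits - negative hits) / word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)
```

Real sentiment models also handle negation ("not good"), which a bag-of-words count like this cannot.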

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

messages = (
    tmp_miss['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['good afternoon to date i still have not received my purchase i tried to contact you and i couldn t i want the amount paid to be refunded or the merchandise urgently awaiting return',
 'good morning i received my order on time but the pushing ball came dry and peeling how should i proceed to receive one in good condition awaiting vanessa',
 'total neglect of the store does not receive the product and they did not contact me a long time later they sent an email to my sister and did not even leave a phone number for me to contact the store i do not recommend',
 'until now i have not received my product i would like an opinion on delivery',
 'i am disappointed with delay in delivery and the lack of satisfaction or explanation of the store',
 'i didn t receive the product',
 'i m waiting until today the product has already completed 30 days of purchase',
 'they don t deliver bad store i do not recommend i want my money back',
 'i have not received the product yet please help',
 'they exchanged my product they sent me someone else s product',
 'i do not recommend buying from this site',
 'the product arrived at me diverging from what was reported on the website about how it works missing speed reducer button',
 'i want the value back',
 'product if you want it left to be delivered i had to buy it elsewhere',
 'i have not received this product and no estimated arrival date',
 'it seemed to be good at first but the delivery period has already expired and i have no information about the product',
 'i am still waiting for the delivery of the product i ordered and it has already exceeded the expected delivery date i am very dissatisfied',
 'i wait to receive the product',
 'i didn t receive the product',
 'seller still not send my order']

Key Observations:

  • Based on review messages, some orders were not delivered, but this is less frequent than with missing payment approval or carrier handover timestamps.

  • Many messages confirm order receipt. Thus, these orders cannot be assumed canceled.

del tmp_miss

Anomalies in Order Status#

Many orders have a status other than “delivered,” which is unusual. Let’s investigate.

Let’s examine by status.

df_orders.order_status.value_counts()  
order_status
Delivered      96478
Shipped         1107
Canceled         625
Unavailable      609
Invoiced         314
Processing       301
Created            5
Approved           2
Name: count, dtype: int64

Let’s look at missing values in the timestamps by order status.

columns = [
    "order_status",
    "order_purchase_dt",
    "order_approved_dt",
    "order_delivered_carrier_dt",
    "order_delivered_customer_dt",
    "order_estimated_delivery_dt",
]
(
    df_orders[columns].pivot_table(
        index='order_status',
        aggfunc=lambda x: x.isna().sum(),
        observed=True,
    )
    .reset_index()
    [columns]
)
order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt
0 Approved 0 0 2 2 0
1 Canceled 0 141 550 619 0
2 Created 0 5 5 5 0
3 Delivered 0 14 2 8 0
4 Invoiced 0 0 314 314 0
5 Processing 0 0 301 301 0
6 Shipped 0 0 0 1107 0
7 Unavailable 0 0 609 609 0
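The same pivot can be written by grouping the boolean `isna()` frame directly; a plain-pandas sketch, assuming the column names used above:

```python
import pandas as pd

# Count missing timestamps per order status: take the boolean isna() frame
# and sum True values within each status group.
def missing_counts_by_status(df: pd.DataFrame, ts_cols: list[str]) -> pd.DataFrame:
    return df[ts_cols].isna().groupby(df['order_status'], observed=True).sum()
```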

Let’s look at the number of orders without the delivered status over time.

labels = dict(
    order_purchase_dt = 'Date',
    order_id = 'Number of Orders', 
    order_status = 'Order Status', 
)
df_orders[lambda x: x.order_status != 'Delivered'].viz.line(
    x='order_purchase_dt',
    y='order_id',
    color='order_status',
    agg_func='nunique',
    freq='ME',
    labels=labels,
    markers=True,
    title='Number of Orders without Delivered Status by Month and Order Status',   
)
[Figure: Number of Orders without Delivered Status by Month and Order Status]

Key Observations:

  • In March and April 2018, there was a sharp spike in orders stuck in the “shipped” status.

  • In February and August 2018, there were spikes in the “canceled” status.

  • In November 2017, there was a spike in the “unavailable” status. This month included Black Friday.
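Without the custom `viz` accessor, the monthly aggregation behind this chart can be sketched with Period months (a plain-pandas approximation, not the library’s implementation):

```python
import pandas as pd

# Unique non-delivered orders per month and status.
def monthly_status_counts(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.loc[df['order_status'] != 'Delivered']
        # Bucket purchase timestamps into calendar months.
        .assign(month=lambda d: d['order_purchase_dt'].dt.to_period('M'))
        .groupby(['month', 'order_status'], observed=True)['order_id']
        .nunique()
        .unstack(fill_value=0)  # one column per status, 0 where absent
    )
```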

Let’s examine each status separately.

created

Let’s look at the orders with the status ‘created’.

df_orders[lambda x: x.order_status == 'Created']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
7434 b5359909123fa03c50bdb0cfed07f098 438449d4af8980d107bf04571413a8e7 Created 2017-12-05 01:07:52 NaT NaT NaT 2018-01-11 1 Credit Card Missing in Items SP December Tuesday Night Missing delivery dt Not Delivered
9238 dba5062fbda3af4fb6c33b1e040ca38f 964a6df3d9bdf60fe3e7b8bb69ed893a Created 2018-02-09 17:21:04 NaT NaT NaT 2018-03-07 1 Boleto Missing in Items DF February Friday Evening Missing delivery dt Not Delivered
21441 7a4df5d8cff4090e541401a20a22bb80 725e9c75605414b21fd8c8d5a1c2f1d6 Created 2017-11-25 11:10:33 NaT NaT NaT 2017-12-12 1 Boleto Missing in Items RJ November Saturday Morning Missing delivery dt Not Delivered
55086 35de4050331c6c644cddc86f4f2d0d64 4ee64f4bfc542546f422da0aeb462853 Created 2017-12-05 01:07:58 NaT NaT NaT 2018-01-08 1 Credit Card Missing in Items RS December Tuesday Night Missing delivery dt Not Delivered
58958 90ab3e7d52544ec7bc3363c82689965f 7d61b9f4f216052ba664f22e9c504ef1 Created 2017-11-06 13:12:34 NaT NaT NaT 2017-12-01 5 Credit Card Missing in Items PR November Monday Afternoon Missing delivery dt Not Delivered

Key Observations:

  • One order has a rating of 5, while four orders have a rating of 1.

  • The process stops after purchase, before payment approval.
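The “where does the process stop” reading can be made explicit by inferring each order’s furthest completed stage from which timestamps are present; a hypothetical helper, assuming the column order reflects the fulfillment flow:

```python
import pandas as pd

# Pipeline stages in order, as named in this dataframe.
STAGE_COLS = [
    'order_purchase_dt',
    'order_approved_dt',
    'order_delivered_carrier_dt',
    'order_delivered_customer_dt',
]

def last_completed_stage(row: pd.Series) -> str:
    """Return the last stage column that has a recorded timestamp."""
    present = [c for c in STAGE_COLS if pd.notna(row.get(c))]
    return present[-1] if present else 'none'

# e.g. df_orders[STAGE_COLS].apply(last_completed_stage, axis=1).value_counts()
```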

Let’s look at the review messages.

messages = (
    df_orders[lambda x: x.order_status == 'Created']
    .merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .tolist()
)
display(messages)
['the product has not arrived until today',
 'i have not received my product i am not satisfied i am waiting to this day',
 nan,
 'i have been awaiting the product since december and i have not received it deceptualized with the americana because besides the delay i gave a waterway that does not fulfill the advertising delay my order',
 'although the product was out of stock the company made contact to notify and canceled the order']

Key Observations:

  • Based on review comments, these orders were not delivered.


approved

Let’s look at the rows.

df_orders[lambda x: x.order_status == 'Approved']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
44897 a2e4c44360b4a57bdff22f3a4630c173 8886130db0ea6e9e70ba0b03d7c0d286 Approved 2017-02-06 20:18:17 2017-02-06 20:30:19 NaT NaT 2017-03-01 1 Credit Card informatica_acessorios MG February Monday Evening Missing delivery dt Not Delivered
88457 132f1e724165a07f6362532bfb97486e b2191912d8ad6eac2e4dc3b6e1459515 Approved 2017-04-25 01:25:34 2017-04-30 20:32:41 NaT NaT 2017-05-22 4 Credit Card utilidades_domesticas SP April Tuesday Night Missing delivery dt Not Delivered

Key Observations:

  • One order received a rating of 1, the other a 4.

  • The process stops after payment approval, before carrier handover.

Let’s look at the review messages.

messages = (
    df_orders[lambda x: x.order_status == 'Approved']
    .merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .tolist()
)
display(messages)
[nan, nan]

Key Observations:

  • No comments were left for these orders.


processing

Let’s look at orders with the status ‘processing’ by month.

df_orders['order_status'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=df_orders.order_status == 'Processing'
    , freq='ME'
)
[Figure: monthly count of orders with the ‘processing’ status]

Let’s count the non-missing timestamps for these orders.

tmp_anomal = df_orders[lambda x: x.order_status == 'Processing']
(
    tmp_anomal[['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt', 'order_estimated_delivery_dt']]
    .count()    
    .to_frame('count')
)
count
order_purchase_dt 301
order_approved_dt 301
order_delivered_carrier_dt 0
order_delivered_customer_dt 0
order_estimated_delivery_dt 301

Key Observations:

  • The process stops after payment approval, before carrier handover.

Let’s look at it broken down by the average order rating.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Processing'
    , include_columns='tmp_avg_reviews_score'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 260 2.2% 11.8% 86.4% 74.6%
tmp_avg_reviews_score 2 3244 19 0.6% 3.3% 6.3% 3.1%
tmp_avg_reviews_score 3 8268 10 0.1% 8.3% 3.3% -5.0%
tmp_avg_reviews_score 4 19129 6 0.0% 19.2% 2.0% -17.2%
tmp_avg_reviews_score 5 57044 6 0.0% 57.4% 2.0% -55.4%

Key Observations:

  • 86% of orders with “processing” status have a rating of 1.

  • 6% of orders have a rating of 2.

  • Customers are clearly dissatisfied.

Let’s look at it broken down by payment type.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Processing'
    , include_columns='tmp_payment_types'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Boleto 19784 70 0.4% 19.9% 23.3% 3.4%
tmp_payment_types Voucher 1621 7 0.4% 1.6% 2.3% 0.7%
tmp_payment_types Credit Card, Voucher 1127 3 0.3% 1.1% 1.0% -0.1%
tmp_payment_types Voucher, Credit Card 1118 1 0.1% 1.1% 0.3% -0.8%
tmp_payment_types Debit Card 1527 2 0.1% 1.5% 0.7% -0.9%
tmp_payment_types Credit Card 74259 218 0.3% 74.7% 72.4% -2.3%

Key Observations:

  • Orders paid via boleto are slightly overrepresented among “processing” orders (23.3% vs 19.9% of all orders).

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

tmp_anomal = tmp_anomal.merge(df_reviews, on='order_id', how='left')
messages = (
    tmp_anomal['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['i made a purchase my orders came separate so far so good but a product was missing and it has not yet been delivered and that s what i paid dearly for shipping the deadline was the 6th it s already the 8th and i m not even satisfied',
 'the deadline has passed and the product has not arrived yet what happened',
 'i m still waiting for my product it hasn t arrived yet',
 'i always bought on this site and everything arrived correctly only in this last purchase that i had problems with the delivery of one of the products which did not arrive',
 'the seller contacted me but did not complete the sale',
 'i did not receive the product i wanted i am waiting for the refund the service was so bad they didn t even issue a note',
 'i no longer buy from this baratheon partner for 30 days since i bought it and the invoice has not yet been issued',
 'i have not received the product purchased since 12/26/17 i had no response from the store giving satisfaction about the purchase i request a refund for the purchase',
 'terrible because i ordered a product i already paid it said it would arrive by 24 11 2017 and the product did not arrive on top of that it said that it was unavailable',
 'i am disappointed with the late delivery',
 'my product is overdue it was supposed to be delivered on 03/10',
 'they did not send the product and did not give satisfaction',
 'i bought it for the first time for never again deceptionedx',
 'i did not receive the product they were delivering me to di 15 11 2017 and then it was for the day 25 11 2017 and nothing i am disappointed with the seller and lannister because they took the order i can not life',
 'i just didn t receive the product',
 'i already made purchases through baratheon and everything went well only this request presented problems and if i didn t go after it i would never know about the situation the invoice has already been paid i have not received the product and not even a satisfaction',
 'i m waiting for the product to arrive',
 'absurd to buy a product and it is not available',
 'did not deliver',
 'i buy it all the time never had a problem the shirts purchased received and returned were three dobby i am currently choosing others to use the vale']

Key Observations:

  • Based on review messages, these orders were not delivered.

  • Some reviews mention items being out of stock.

Let’s examine a word cloud from review messages.

tmp_anomal.viz.wordcloud('review_comment_message')
[Figure: word cloud of review_comment_message]

Key Observations:

  • Most words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_anomal.analysis.sentiment('review_comment_message')
[Figure: sentiment distribution of review_comment_message]

Key Observations:

  • Negative reviews significantly outnumber positive ones, and the boxplot lies in the negative zone.


invoiced

Let’s look at orders with the status ‘invoiced’ by month.

df_orders['order_status'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=df_orders.order_status == 'Invoiced'
    , freq='ME'
)
[Figure: monthly count of orders with the ‘invoiced’ status]

Let’s count the non-missing timestamps for these orders.

tmp_anomal = df_orders[lambda x: x.order_status == 'Invoiced']
(
    tmp_anomal[['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt', 'order_estimated_delivery_dt']]
    .count()    
    .to_frame('count')
)
count
order_purchase_dt 314
order_approved_dt 314
order_delivered_carrier_dt 0
order_delivered_customer_dt 0
order_estimated_delivery_dt 314

Key Observations:

  • The process stops after payment approval, before carrier handover.

Let’s look at it broken down by the average order rating.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Invoiced'
    , include_columns='tmp_avg_reviews_score'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 232 2.0% 11.8% 73.9% 62.1%
tmp_avg_reviews_score 2 3244 28 0.9% 3.3% 8.9% 5.7%
tmp_avg_reviews_score 3 8268 17 0.2% 8.3% 5.4% -2.9%
tmp_avg_reviews_score 4 19129 13 0.1% 19.2% 4.1% -15.1%
tmp_avg_reviews_score 5 57044 24 0.0% 57.4% 7.6% -49.7%

Key Observations:

  • 74% of orders with “invoiced” status have a rating of 1.

  • 9% of orders have a rating of 2.

  • Customers are clearly dissatisfied.

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

tmp_anomal = tmp_anomal.merge(df_reviews, on='order_id', how='left')
messages = (
    tmp_anomal['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['my product did not arrive delivery time of 34 days and still not delivered i do not recommend',
 'i bought it to give to my daughter for christmas i hope it arrives on time',
 'good morning i purchased more than 30 days ago and i did not receive the product i requested a cancellation but i did not receive a response',
 'a simple computer mouse shouldn t take so long to deliver',
 'they gave me a huge deadline to send the product and on the last day they sent me an email saying they don t have the product now i have the bureaucracy to get my money back',
 'i like shopping in this store',
 'takes too long to deliver i hope the product still comes',
 'terrible company did not deliver the product and does not respond to my cancellation requests i already paid the first installment and had no response',
 'after a long time only today i received a message with the following information: good afternoon walter thanks for getting in touch we are sorry for the delay in getting back to you your order was not',
 'unfortunately it happened in no other purchase of several that i have already made on lannister com i did not receive the product but this time i did not receive it a pity',
 'it was the first and last time i buy from this company never',
 'no information total sloppiness',
 'expensive i requested the cancellation of this order on the same day that the invoice was issued i called the supplier informed in the nf and canceled the shipment of the goods i am waiting for an american refund form',
 'i don t recommend this store because they sold me a backpack that they didn t have in stock',
 'the product has not yet arrived it has passed and much of the deadline and no one gives me a return',
 'i bought two white stools on the website to be delivered by 03/13 and so far i haven t received my house well i m waiting for action or i m going to look for my consumer rights',
 'absurd they said that until the 10/04th serious delivery but so far nothing and it s been paid for my orders for a while',
 'i have already bought several times on the lannister site;but this last time i made a purchase of a toner at 04 10 16 and only promised for 25 11 16 and still not i received the product',
 'very disappointed with lannister s targaryen partner purchased product paid and not delivered a lot of disappointment please speed up delivery or want my money back',
 'the product order was placed on 07/11 with delivery forecast for the 21st the invoice was only issued on the 22nd; it s already the 24th and the product hasn t even left for delivery']

Key Observations:

  • Review messages indicate orders were not delivered.

  • Some reviews mention items being out of stock.

Let’s examine a word cloud from review messages.

tmp_anomal.viz.wordcloud('review_comment_message')
[Figure: word cloud of review_comment_message]

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_anomal.analysis.sentiment('review_comment_message')
[Figure: sentiment distribution of review_comment_message]

Key Observations:

  • Negative reviews significantly outnumber positive ones, and the boxplot mostly lies below 0.


unavailable

Let’s look at orders with the status ‘unavailable’ by month.

df_orders['order_status'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=df_orders.order_status == 'Unavailable'
    , freq='ME'
)
[Figure: monthly count of orders with the ‘unavailable’ status]

Let’s count the non-missing timestamps for these orders.

tmp_anomal = df_orders[lambda x: x.order_status == 'Unavailable']
( 
    tmp_anomal[['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt', 'order_estimated_delivery_dt']]
    .count()    
    .to_frame('count') 
) 
count
order_purchase_dt 609
order_approved_dt 609
order_delivered_carrier_dt 0
order_delivered_customer_dt 0
order_estimated_delivery_dt 609

Key Observations:

  • The process stops after payment approval, before carrier handover.

Let’s analyze by customer state.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Unavailable'
    , pct_diff_threshold=1
    , include_columns='tmp_customer_state'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state SP 41746 292 0.7% 42.0% 47.9% 6.0%
tmp_customer_state PR 5045 40 0.8% 5.1% 6.6% 1.5%

Key Observations:

  • São Paulo accounts for 47.9% of “unavailable” orders but only 42.0% of all orders.

Let’s analyze by product category.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Unavailable'
    , pct_diff_threshold=0
    , include_columns='tmp_product_categories'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_product_categories Missing in Items 775 603 77.8% 0.8% 99.0% 98.2%

Key Observations:

  • 99% of the “unavailable” orders have no product category, meaning they do not appear in the items table.
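“Missing in Items” suggests these order_ids simply never joined to the items table. A standard way to verify that is an anti-join via `merge`’s indicator flag (a `df_items` table with an `order_id` column is assumed here):

```python
import pandas as pd

# Anti-join: keep only orders whose order_id never appears in the items table.
def orders_missing_in(items: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    merged = orders.merge(
        items[['order_id']].drop_duplicates(),
        on='order_id', how='left', indicator=True,
    )
    # '_merge' is 'left_only' for rows with no match on the right side.
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
```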

Let’s look at it broken down by payment type.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Unavailable'
    , pct_diff_threshold=0
    , include_columns='tmp_payment_types'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Boleto 19784 150 0.8% 19.9% 24.6% 4.7%
tmp_payment_types Voucher, Credit Card 1118 9 0.8% 1.1% 1.5% 0.4%
tmp_payment_types Credit Card, Voucher 1127 8 0.7% 1.1% 1.3% 0.2%
tmp_payment_types Voucher 1621 10 0.6% 1.6% 1.6% 0.0%

Key Observations:

  • Orders paid via boleto are slightly overrepresented among “unavailable” orders (24.6% vs 19.9% of all orders).

Let’s look at it broken down by the average order rating.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Unavailable'
    , pct_diff_threshold=0
    , include_columns='tmp_avg_reviews_score'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 472 4.0% 11.8% 77.5% 65.7%
tmp_avg_reviews_score 2 3244 46 1.4% 3.3% 7.6% 4.3%

Key Observations:

  • The difference in proportions is much higher for a rating of 1.

  • 78% of orders with “unavailable” status have a rating of 1.

  • 8% of orders have a rating of 2.

  • Customers are clearly dissatisfied.

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

tmp_anomal = tmp_anomal.merge(df_reviews, on='order_id', how='left')
messages = (
    tmp_anomal['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['i m waiting until today the product has already completed 30 days of purchase',
 'almost a month ago the watch was ordered and paid via boleem if you want to and so far nothing i sent two emails and they didn t even give me satisfaction; oslit stores for me is out lannister without prestige',
 'allowed to complete the purchase but did not have the product in stock greater fatigue to get the money back',
 'total dissatisfaction with this store sells a product is not delivery tremendous lack of respect',
 'i was outraged why it sold',
 'i still haven t received the product the deadlines have passed and so far no response now bear the headache to resolve',
 'product did not arrive within the time listed',
 'in 2 months of waiting i did not receive the product i divided it into two times paid both times and did not receive the product a shame i m going to the procon because baratheon doesn t answer the phone shame',
 'i m super dissatisfied i needed the material urgently and i only bought it because lannister stores never had any problems since the supplier doesn t say the same there was no commitment',
 'you sold something you didn t have in stock this procedure is problematic my daughter is still waiting for the doll i bought christmas passed and nothing comes the new year tomorrow it s nothing i want the product',
 'i didn t receive my product i was already charged in cartaxo',
 'shame on such a partner he sold the product confirmed it deducted the amount from the card and now wants to cancel the order because he claims not to have it in stock that s a crime and the stark follows in silence',
 'i would not recommend it because i made my purchase made the payment and after a few days i received an email that the partners would not deliver in my region',
 'i didn t receive the product',
 'i bought a skate through the store and the delivery was scheduled for 05/01/18 until the present date i have not received it',
 'didn t give me any satisfaction',
 'the scheduled days passed',
 'lack of respect for the consumer after everything is finalized they make excuses not to deliver never again the lannister website cannot have a link with this type of seller',
 'i am filing a complaint with the procon i can t talk to anyone in this loha',
 'deadline for delivery 22/12 two days before the store warned that it does not have the product and does not even know when it will be available so i had to go looking for the skate to deliver for christmas to my grandson i want a refund']

Key Observations:

  • Review messages indicate orders were not delivered.

  • Some reviews mention items being out of stock.

Let’s examine a word cloud from review messages.

tmp_anomal.viz.wordcloud('review_comment_message')
../../_images/5ddc26919a00b4d2cd5fde226e30047f1451278e2c77a968c1657d5ed60f8ec7.jpg

Key Observations:

  • Many words relate to delivery.
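`viz.wordcloud` belongs to the same custom toolkit; the frequencies it visualizes can be sketched with a plain word counter (toy messages and a hypothetical stop-word list, not the library’s actual preprocessing):

```python
import re
from collections import Counter

# Toy review messages standing in for review_comment_message.
messages = [
    'product did not arrive within the time listed',
    'i still have not received the product',
    'the delivery deadline has passed',
]
stop_words = {'i', 'the', 'did', 'not', 'have', 'has', 'within'}  # illustrative only

tokens = [
    word
    for msg in messages
    for word in re.findall(r'[a-z]+', msg.lower())
    if word not in stop_words
]
freq = Counter(tokens)
print(freq.most_common(3))
```

A word cloud then simply scales each word by such a frequency.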

Let’s analyze the sentiment of the text.

tmp_anomal.analysis.sentiment('review_comment_message')
../../_images/74a5549a0d8dc53d75cf6e11863f96464604c001479e209ff1757fe32bc84640.jpg

Key Observations:

  • Negative reviews outnumber positive ones, and the boxplot mostly lies below 0.
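`analysis.sentiment` is likewise a custom accessor whose internals are not shown here; as a rough illustration of lexicon-based polarity scoring (toy lexicon, not the library’s actual model):

```python
# Toy polarity lexicon: +1 for positive words, -1 for negative (illustrative only).
lexicon = {'good': 1, 'great': 1, 'recommend': 1,
           'shame': -1, 'defect': -1, 'problem': -1}

def polarity(text: str) -> float:
    """Mean lexicon score over the words of a message (0.0 when no words match)."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

reviews = [
    'great product i recommend',
    'total shame on this store',
    'good quality but delivery problem',
]
scores = [polarity(r) for r in reviews]
print(scores)
```

Scores below 0 indicate complaint-like texts, which is what the boxplot above summarizes across all reviews.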


canceled

Let’s look at orders with the status ‘canceled’ by month.

df_orders['order_status'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=df_orders.order_status == 'Canceled'
    , freq='ME'
)
../../_images/468e4c2f5d878e8dd1e3923de357ee10d1b4a1c0960811c05dd5f770a89c8b2f.jpg

Order cancellation can occur at different stages, so there may be missing values at various points.

Let’s look at the missing values.

tmp_anomal = df_orders[lambda x: x.order_status == 'Canceled']
tmp_anomal.explore.detect_anomalies()
Missings by Column
  Count Percent
order_approved_dt 141 22.56%
order_delivered_carrier_dt 550 88.00%
order_delivered_customer_dt 619 99.04%
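The missing-value summary above is produced by the custom `detect_anomalies` helper; the same numbers can be reproduced in plain pandas (a sketch on synthetic timestamps):

```python
import pandas as pd

# Synthetic canceled orders: later stages are missing more and more often.
df = pd.DataFrame({
    'order_approved_dt': pd.to_datetime(['2018-01-01', None, '2018-01-03', '2018-01-04']),
    'order_delivered_carrier_dt': pd.to_datetime(['2018-01-02', None, None, None]),
})

missing = (
    df.isna().sum()
    .to_frame('count')
    .assign(percent=lambda x: (x['count'] * 100 / len(df)).round(2))
    .loc[lambda x: x['count'] > 0]  # keep only columns that have gaps
)
print(missing)
```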

Conversion at different stages

Let’s look at how many canceled orders have a timestamp at each stage.

But first, let’s check that there are no gaps between the dates, i.e. a later timestamp present while an earlier one is missing.

mask = tmp_anomal['order_delivered_carrier_dt'].isna() & tmp_anomal['order_delivered_customer_dt'].notna()
tmp_anomal.loc[mask, 'order_delivered_carrier_dt'] 
SeriesOn([], Name: order_delivered_carrier_dt, dtype: datetime64[ns])
mask = tmp_anomal['order_approved_dt'].isna() & tmp_anomal['order_delivered_carrier_dt'].notna()
tmp_anomal.loc[mask, 'order_approved_dt']
SeriesOn([], Name: order_approved_dt, dtype: datetime64[ns])

All good.

tmp_funnel = (
    tmp_anomal[['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt']]
    .count()  # non-missing timestamps per stage
    .to_frame('count')
    # share of each stage relative to the number of purchases
    .assign(share = lambda x: (x['count']*100 / x['count']['order_purchase_dt']).round(1).astype(str) + '%')
    .reset_index(names='stage')
)
px.funnel(
    tmp_funnel, 
    x='count', 
    y='stage', 
    text='share',
    width=600,
    title='Conversion of Different Order Stages with "Canceled" Status'
)
../../_images/f5d385c92ccb1cbc40d421c847c7058d77d4d92bd7cb428164b9de118efc8d47.jpg

Key Observations:

  • The process stops at different stages, most often between payment approval and carrier handover.

Let’s look at the conversion at each stage by month.

For this, we will count the number of canceled orders with specific timestamps in each period and divide by the number of canceled orders at the time of purchase.

tmp_res_df = (
    tmp_anomal.resample('ME', on='order_purchase_dt')
    .agg(
        purchase = ('order_id', 'count')
        , approved = ('order_approved_dt', 'count')
        , delivered_carrier = ('order_delivered_carrier_dt', 'count')
        , delivered_customer = ('order_delivered_customer_dt', 'count')
    )
) 
tmp_res_df = tmp_res_df.div(tmp_res_df['purchase'], axis=0)
tmp_res_df = (  
    tmp_res_df.reset_index(names='date')
    .melt(id_vars='date', var_name='date_type', value_name='count')
)

Let’s look at the normalized values; that is, each count with a specific timestamp divided by the total number of canceled orders in the period.

labels = dict(
    date = 'Date',
    date_type = 'Date Type',
    count = 'Conversion'
)
tmp_res_df.viz.line(
    x='date'
    , y='count'
    , color='date_type'
    , labels=labels
    , title='Conversion of Different Order Stages with "Canceled" Status by Month'
)
../../_images/f6659e158da5a622526cd6d252f966a2340f907285064b6e97666d643c0b131c.jpg

Key Observations:

  • Canceled orders almost never have delivery timestamps, which is logical.

  • From December 2017 to March 2018, there was a significant spike in canceled orders that had carrier handover timestamps but no delivery timestamps, indicating delivery issues during this period.

  • About 80% of canceled orders have payment approval timestamps, but this proportion increased significantly starting January 2018, approaching 100%.

Number of Last Stages

Let’s look at the last stage that orders with the status ‘canceled’ reach, and how this changes over time.

For this:

  • reshape the wide table into a long one, turning the timestamp column names into a stage category;

  • drop the rows whose timestamp is missing (the value column produced by melt);

  • convert the stage names into an ordered categorical type in pandas;

  • group by order;

  • take the first purchase time in each group (every row in a group shares the same time);

  • take the maximum stage (since the categories are ordered, this is the last stage the order reached).
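The key trick in the steps above is the ordered categorical: once the stage names carry an order, the last stage an order reached is simply the group maximum. A minimal sketch on synthetic rows:

```python
import pandas as pd

# Synthetic long table: one row per (order, stage) that has a timestamp.
stages = ['purchase', 'approved', 'delivered_carrier', 'delivered_customer']
long_df = pd.DataFrame({
    'order_id': ['a', 'a', 'a', 'b'],
    'date_stage': ['purchase', 'approved', 'delivered_carrier', 'purchase'],
})
long_df['date_stage'] = pd.Categorical(
    long_df['date_stage'], categories=stages, ordered=True
)

# The maximum of an ordered categorical is the furthest stage reached.
last_stage = long_df.groupby('order_id')['date_stage'].max()
print(last_stage)
```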

tmp_df_orders_canceled = df_orders[lambda x: x.order_status == 'Canceled'].copy()  # explicit copy avoids SettingWithCopyWarning
tmp_df_orders_canceled['tmp_date'] = tmp_df_orders_canceled['order_purchase_dt']
tmp_df_orders_canceled = (
    tmp_df_orders_canceled.rename(
        columns={
            'order_purchase_dt': 'purchase'
            , 'order_approved_dt': 'approved'
            , 'order_delivered_carrier_dt': 'delivered_carrier'
            , 'order_delivered_customer_dt': 'delivered_customer'
        }
    )
    .melt(
        id_vars=['tmp_date', 'order_id']
        , value_vars=['purchase', 'approved', 'delivered_carrier', 'delivered_customer']
        , var_name='date_stage'
    )
    .dropna(subset='value')
    .drop('value', axis=1)
)
date_stage_order = ['purchase', 'approved', 'delivered_carrier', 'delivered_customer']
tmp_df_orders_canceled['date_stage'] = (
    tmp_df_orders_canceled['date_stage']
    .astype('category')
    .cat.reorder_categories(date_stage_order, ordered=True)
)
tmp_df_orders_canceled = (
    tmp_df_orders_canceled.groupby('order_id', as_index=False)
    .agg(
        tmp_date = ('tmp_date', 'first')
        , last_date_stage = ('date_stage', 'max')
    )
)

Let’s look at it over time.

labels = dict(
    date = 'Date',
    order_id = 'Number of Orders',
    last_date_stage = 'Last Stage'
)
tmp_df_orders_canceled.viz.line(
    x='tmp_date'
    , y='order_id'
    , color='last_date_stage'
    , agg_func='nunique'
    , freq='ME'
    , labels=labels
    , markers=True
    , title='Number of Orders by Month and Last Stage'
)
../../_images/b7af5e07b0797c8238a6a3deb54a64d2b59dcabe1a28ce4627002d243cc9a126.jpg

Key Observations:

  • In most months, the process stops after payment approval.

  • From December 2017 to March 2018, there was a spike in orders that stopped after carrier handover.

  • In August 2018, there was a sharp peak in orders that stopped immediately after purchase.

Let’s look by the customer’s state.

tmp_anomal = df_orders[lambda x: x.order_status == 'Canceled']
df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Canceled'
    , pct_diff_threshold=0
    , include_columns='tmp_customer_state'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state SP 41746 327 0.8% 42.0% 52.3% 10.3%
tmp_customer_state RJ 12852 86 0.7% 12.9% 13.8% 0.8%
tmp_customer_state RO 253 3 1.2% 0.3% 0.5% 0.2%
tmp_customer_state PI 495 4 0.8% 0.5% 0.6% 0.1%
tmp_customer_state RR 46 1 2.2% 0.0% 0.2% 0.1%
tmp_customer_state GO 2020 13 0.6% 2.0% 2.1% 0.0%

Key Observations:

  • The proportion of canceled orders from São Paulo (52.3%) is significantly higher than the state’s share of all orders (42.0%).

Let’s look by product category.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Canceled'
    , pct_diff_threshold=1
    , include_columns='tmp_product_categories'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_product_categories Missing in Items 775 164 21.2% 0.8% 26.2% 25.5%
tmp_product_categories brinquedos 3835 31 0.8% 3.9% 5.0% 1.1%

Key Observations:

  • Missing product categories have a much higher proportion difference, possibly due to items being out of stock.

Let’s look at it broken down by payment type.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Canceled'
    , pct_diff_threshold=0
    , include_columns='tmp_payment_types'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Voucher 1621 76 4.7% 1.6% 12.2% 10.5%
tmp_payment_types Voucher, Credit Card 1118 12 1.1% 1.1% 1.9% 0.8%
tmp_payment_types Not Defined 3 3 100.0% 0.0% 0.5% 0.5%

Key Observations:

  • The “voucher” payment type has a noticeably higher proportion difference.

Let’s look at it broken down by the average order rating.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Canceled'
    , pct_diff_threshold=0
    , include_columns='tmp_avg_reviews_score'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 434 3.7% 11.8% 69.4% 57.6%
tmp_avg_reviews_score 2 3244 45 1.4% 3.3% 7.2% 3.9%

Key Observations:

  • 69% of orders with “canceled” status have a rating of 1.

  • Another 7% have a rating of 2.

  • Customers are clearly dissatisfied.

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

tmp_anomal = tmp_anomal.merge(df_reviews, on='order_id', how='left')
messages = (
    tmp_anomal['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['the delivery date has passed and the product has not arrived',
 'i haven t received anything so far',
 'i didn t send my order',
 'i received the product it s of great quality only the padlock that came with the product that came with a defect',
 'i did not receive the product or contact the supplier i already sent an email via the lannister website and nothing happened i m waiting for a return because i don t want to give up the purchase i want the product',
 'my order was canceled without my request on the 20/11th and so far they have not refunded my money it was a gift there was a lot of lack of professionalism',
 'not received',
 'i no longer recommend this store i lost credibility with you',
 'until now i have not received the product',
 'easy to assemble',
 'they gave me a period of 26 days for delivery and with 4 remaining they inform me that it would not be possible to deliver my product simply ridiculous',
 'good morning my products arrived but the iron was missing and i have urgently to receive it protocol 2517102738811 thanks',
 'i bought a 110v grinder and they sent me a 220v i haven t had a response so far',
 'good quality product',
 'i always buy from lannister stores and i ve never had any problems i think it s safe',
 'there was a problem with the delivery the post office did not deliver it to the person at the agency even with the tracking code and document they informed that the product was not in the unit',
 'i canceled the order 02 672741651 but is being charged',
 'fake and with scratches on the screen',
 'i still haven t received the product can i cancel the parcel of the order and try to buy another one',
 'i received the product full of defects i already requested a refund i m waiting for a refund']

Key Observations:

  • Review messages indicate orders were not delivered.

  • Some reviews mention items being out of stock.

Let’s examine a word cloud from review messages.

tmp_anomal.viz.wordcloud('review_comment_message')
../../_images/28cbd764faf758379151653fe1042b545e2709242a38033f36eda7fd49a1bea3.jpg

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_anomal.analysis.sentiment('review_comment_message')
../../_images/05083116af802d4c1d0c6b42ab01b25211ffb6509927de68adb74e89655a147b.jpg

Key Observations:

  • Negative messages significantly outnumber positive ones, and the boxplot lies below 0.


shipped

Let’s look at the number of orders with the status ‘shipped’ by month.

df_orders['order_status'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=df_orders.order_status == 'Shipped'
    , freq='ME'
)
../../_images/4d7a0da0a223c79bd1fb296a5a433e7ef048c6adbdce3df65fa012100ed3012e.jpg

Let’s look at the count of non-missing timestamps at each stage.

tmp_anomal = df_orders[lambda x: x.order_status == 'Shipped']
(
    tmp_anomal[['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt', 'order_estimated_delivery_dt']]
    .count()    
    .to_frame('count')
)
count
order_purchase_dt 1107
order_approved_dt 1107
order_delivered_carrier_dt 1107
order_delivered_customer_dt 0
order_estimated_delivery_dt 1107

Key Observations:

  • The process stops after carrier handover, before customer delivery.

Let’s look by the customer’s state.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Shipped'
    , pct_diff_threshold=1
    , include_columns='tmp_customer_state'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state RJ 12852 289 2.2% 12.9% 26.1% 13.2%
tmp_customer_state BA 3380 68 2.0% 3.4% 6.1% 2.7%
tmp_customer_state CE 1336 38 2.8% 1.3% 3.4% 2.1%
tmp_customer_state PE 1652 35 2.1% 1.7% 3.2% 1.5%

Key Observations:

  • The proportion of “shipped” (stuck in transit) orders from Rio de Janeiro (26.1%) is significantly higher than the state’s share of all orders (12.9%).

Let’s look at it broken down by the average order rating.

df_orders['order_status'].explore.anomalies_by_categories(
    custom_mask=df_orders.order_status == 'Shipped'
    , pct_diff_threshold=1
    , include_columns='tmp_avg_reviews_score'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_avg_reviews_score 1 11756 686 5.8% 11.8% 62.0% 50.1%
tmp_avg_reviews_score 2 3244 85 2.6% 3.3% 7.7% 4.4%
tmp_avg_reviews_score 3 8268 123 1.5% 8.3% 11.1% 2.8%

Key Observations:

  • 62% of orders with “shipped” status have a rating of 1.

  • Another 8% have a rating of 2.

  • Customers are clearly dissatisfied.

Let’s randomly sample 20 review comments.
We’ll repeat this several times.

tmp_anomal = tmp_anomal.merge(df_reviews, on='order_id', how='left')
messages = (
    tmp_anomal['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['does not deliver order on time',
 '1 month delay in delivery last purchase i make',
 'did not arrive',
 'i always bought it at lannister but this time their partner did not deliver the product he said it was a risk area he left it in the mail when the notice arrived on friday afternoon there was no time to pick it up',
 'my product was not delivered how am i going to do it now',
 'the product has not been delivered yet',
 'not received',
 'the product has not arrived',
 'i don t know if i recommend it the product was marked on the website for a type of cell phone but when it arrived it did not fit the cell phone',
 'i didn t like it i made the purchase paid for shipping and the product is at the post office far from my residence i want my reversal',
 'i canceled that order but i received the tv panel i loved it',
 'i bought the product but heard a return because it was not what i wanted i asked to cancel the purchase',
 'i have not received the product so far and no contact advising the reason',
 'the product was stolen in possession of the post office but i don t blame the store',
 'merchandise burned at the distribution center in fortaleza ce i canceled the purchase to refund the amount paid',
 'i am dissatisfied with the lack of attention to the customer it was asked at the beginning to cancel the order in which i was not answered i don t intend to buy more anytime soon around here',
 'awaiting solution',
 'i receive deliveries at my house and at my service but i could not receive this one targaryen',
 'i would like to know how to proceed to receive the product because i still want it',
 'it is very good to buy in this store except that the post office did not send a notice for the object to be removed']

Key Observations:

  • Review messages indicate most orders were not delivered.

Let’s examine a word cloud from review messages.

tmp_anomal.viz.wordcloud('review_comment_message')
../../_images/e21683015e97c7f1e65839ec8dddfcc2d0cb397907b1eac8e2d5904612989cfc.jpg

Key Observations:

  • Many words relate to delivery.

Let’s analyze the sentiment of the text.

tmp_anomal.analysis.sentiment('review_comment_message')
../../_images/f6066c6ce62b3689d680cb5bf22e1ac2c5878b1b71773b2180bc8979791f97f4.jpg

Key Observations:

  • Negative messages outnumber positive ones, and the boxplot mostly lies below 0.

Status and Delivery Mismatches#

Delivery status missing but delivery timestamp present

Let’s check if there are orders without “delivered” status that still have a delivery timestamp.

df_orders[lambda x: (x.order_status != 'Delivered') & ~x.order_delivered_customer_dt.isna()]
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
2921 1950d777989f6a877539f53795b4c3c3 1bccb206de9f0f25adc6871a1bcf77b2 Canceled 2018-02-19 19:48:52 2018-02-19 20:56:05 2018-02-20 19:57:13 2018-03-21 22:03:51 2018-03-09 3 Credit Card beleza_saude MG February Monday Evening Delayed Not Delivered
8791 dabf2b0e35b423f94618bf965fcb7514 5cdec0bb8cbdf53ffc8fdc212cd247c6 Canceled 2016-10-09 00:56:52 2016-10-09 13:36:58 2016-10-13 13:36:59 2016-10-16 14:36:59 2016-11-30 5 Credit Card perfumaria, beleza_saude SP October Sunday Night Not Delayed Not Delivered
58266 770d331c84e5b214bd9dc70a10b829d0 6c57e6119369185e575b36712766b0ef Canceled 2016-10-07 14:52:30 2016-10-07 15:07:10 2016-10-11 15:07:11 2016-10-14 15:07:11 2016-11-29 1 Credit Card perfumaria RJ October Friday Afternoon Not Delayed Not Delivered
59332 8beb59392e21af5eb9547ae1a9938d06 bf609b5741f71697f65ce3852c5d2623 Canceled 2016-10-08 20:17:50 2016-10-09 14:34:30 2016-10-14 22:45:26 2016-10-19 18:47:43 2016-11-30 1 Credit Card brinquedos RJ October Saturday Evening Not Delayed Not Delivered
92636 65d1e226dfaeb8cdc42f665422522d14 70fc57eeae292675927697fe03ad3ff5 Canceled 2016-10-03 21:01:41 2016-10-04 10:18:57 2016-10-25 12:14:28 2016-11-08 10:58:34 2016-11-25 1 Credit Card esporte_lazer RJ October Monday Evening Not Delayed Not Delivered
94399 2c45c33d2f9cb8ff8b1c86cc28c11c30 de4caa97afa80c8eeac2ff4c8da5b72e Canceled 2016-10-09 15:39:56 2016-10-10 10:40:49 2016-10-14 10:40:50 2016-11-09 14:53:50 2016-12-08 1 Credit Card fashion_bolsas_e_acessorios SC October Sunday Afternoon Not Delayed Not Delivered

Key Observations:

  • There are orders without “delivered” status that have delivery timestamps. Most likely these orders were canceled after delivery.

Let’s examine their reviews.

messages = (
    df_orders[lambda x: (x.order_status != 'Delivered') & ~x.order_delivered_customer_dt.isna()]
    .merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .dropna()
    .tolist()
)
display(messages)
['the product was not shipped because i made a mistake typing my home number',
 'i would buy again because they meet the delivery deadline and the products are super packed',
 'the product is fake ridiculous',
 'the product was not delivered no invoice was sent telephone for complaint does not work',
 'the product was purchased on 10/10/2016 it has not arrived and now i receive this review asking what i thought very dissatisfied',
 'it says that my order was delivered but it did not arrive is it possible to inform the signature of the person who received it on the spot']

Key Observations:

  • Review messages indicate these orders were not delivered.


Status is “delivered” but delivery timestamp is missing

Let’s check if there are orders with “delivered” status but missing delivery timestamps.

df_orders[lambda x: x.order_status.isin(['Delivered']) & x.order_delivered_customer_dt.isna()]
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
3002 2d1e2d5bf4dc7227b3bfebb81328c15f ec05a6d8558c6455f0cbbd8a420ad34f Delivered 2017-11-28 17:44:07 2017-11-28 17:56:40 2017-11-30 18:12:23 NaT 2017-12-18 5 Credit Card automotivo SP November Tuesday Evening Missing delivery dt Delivered
20618 f5dd62b788049ad9fc0526e3ad11a097 5e89028e024b381dc84a13a3570decb4 Delivered 2018-06-20 06:58:43 2018-06-20 07:19:05 2018-06-25 08:05:00 NaT 2018-07-16 5 Debit Card industria_comercio_e_negocios SP June Wednesday Morning Missing delivery dt Delivered
43834 2ebdfc4f15f23b91474edf87475f108e 29f0540231702fda0cfdee0a310f11aa Delivered 2018-07-01 17:05:11 2018-07-01 17:15:12 2018-07-03 13:57:00 NaT 2018-07-30 5 Credit Card relogios_presentes SP July Sunday Evening Missing delivery dt Delivered
79263 e69f75a717d64fc5ecdfae42b2e8e086 cfda40ca8dd0a5d486a9635b611b398a Delivered 2018-07-01 22:05:55 2018-07-01 22:15:14 2018-07-03 13:57:00 NaT 2018-07-30 5 Credit Card relogios_presentes SP July Sunday Evening Missing delivery dt Delivered
82868 0d3268bad9b086af767785e3f0fc0133 4f1d63d35fb7c8999853b2699f5c7649 Delivered 2018-07-01 21:14:02 2018-07-01 21:29:54 2018-07-03 09:28:00 NaT 2018-07-24 5 Credit Card brinquedos SP July Sunday Evening Missing delivery dt Delivered
92643 2d858f451373b04fb5c984a1cc2defaf e08caf668d499a6d643dafd7c5cc498a Delivered 2017-05-25 23:22:43 2017-05-25 23:30:16 NaT NaT 2017-06-23 5 Credit Card esporte_lazer RS May Thursday Night Missing delivery dt Delivered
97647 ab7c89dc1bf4a1ead9d6ec1ec8968a84 dd1b84a7286eb4524d52af4256c0ba24 Delivered 2018-06-08 12:09:39 2018-06-08 12:36:39 2018-06-12 14:10:00 NaT 2018-06-26 1 Credit Card informatica_acessorios SP June Friday Afternoon Missing delivery dt Delivered
98038 20edc82cf5400ce95e1afacc25798b31 28c37425f1127d887d7337f284080a0f Delivered 2018-06-27 16:09:12 2018-06-27 16:29:30 2018-07-03 19:26:00 NaT 2018-07-19 5 Credit Card livros_interesse_geral SP June Wednesday Afternoon Missing delivery dt Delivered

Key Observations:

  • The dataset contains 8 orders with “delivered” status but missing delivery timestamps.

Let’s examine their reviews.

messages = (
    df_orders[lambda x: x.order_status.isin(['Delivered']) & x.order_delivered_customer_dt.isna()]
    .merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .dropna()
    .tolist()
)
display(messages)
['arrived quickly everything ok',
 'new product very good',
 'the product arrived much earlier than expected came in great condition and very well packaged i recommend',
 'i bought a product from a brand and received another inferior to what i bought not to mention that they charged me shipping and i had to pick it up at the post office',
 'i loved']

Key Observations:

  • Review messages suggest the products were actually delivered.


Order canceled or unavailable but has delivery timestamp

Let’s check if there are orders with “canceled” or “unavailable” status that still have delivery timestamps.

df_orders[lambda x: x.order_status.isin(['Canceled', 'Unavailable']) & ~x.order_delivered_customer_dt.isna()]
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
2921 1950d777989f6a877539f53795b4c3c3 1bccb206de9f0f25adc6871a1bcf77b2 Canceled 2018-02-19 19:48:52 2018-02-19 20:56:05 2018-02-20 19:57:13 2018-03-21 22:03:51 2018-03-09 3 Credit Card beleza_saude MG February Monday Evening Delayed Not Delivered
8791 dabf2b0e35b423f94618bf965fcb7514 5cdec0bb8cbdf53ffc8fdc212cd247c6 Canceled 2016-10-09 00:56:52 2016-10-09 13:36:58 2016-10-13 13:36:59 2016-10-16 14:36:59 2016-11-30 5 Credit Card perfumaria, beleza_saude SP October Sunday Night Not Delayed Not Delivered
58266 770d331c84e5b214bd9dc70a10b829d0 6c57e6119369185e575b36712766b0ef Canceled 2016-10-07 14:52:30 2016-10-07 15:07:10 2016-10-11 15:07:11 2016-10-14 15:07:11 2016-11-29 1 Credit Card perfumaria RJ October Friday Afternoon Not Delayed Not Delivered
59332 8beb59392e21af5eb9547ae1a9938d06 bf609b5741f71697f65ce3852c5d2623 Canceled 2016-10-08 20:17:50 2016-10-09 14:34:30 2016-10-14 22:45:26 2016-10-19 18:47:43 2016-11-30 1 Credit Card brinquedos RJ October Saturday Evening Not Delayed Not Delivered
92636 65d1e226dfaeb8cdc42f665422522d14 70fc57eeae292675927697fe03ad3ff5 Canceled 2016-10-03 21:01:41 2016-10-04 10:18:57 2016-10-25 12:14:28 2016-11-08 10:58:34 2016-11-25 1 Credit Card esporte_lazer RJ October Monday Evening Not Delayed Not Delivered
94399 2c45c33d2f9cb8ff8b1c86cc28c11c30 de4caa97afa80c8eeac2ff4c8da5b72e Canceled 2016-10-09 15:39:56 2016-10-10 10:40:49 2016-10-14 10:40:50 2016-11-09 14:53:50 2016-12-08 1 Credit Card fashion_bolsas_e_acessorios SC October Sunday Afternoon Not Delayed Not Delivered

Key Observations:

  • The dataset contains 6 orders with “canceled” status that nevertheless have customer delivery timestamps.

Let’s examine their reviews.

messages = (
    df_orders[lambda x: x.order_status.isin(['Canceled', 'Unavailable']) & ~x.order_delivered_customer_dt.isna()]
    .merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .dropna()
    .tolist()
)
display(messages)
['the product was not shipped because i made a mistake typing my home number',
 'i would buy again because they meet the delivery deadline and the products are super packed',
 'the product is fake ridiculous',
 'the product was not delivered no invoice was sent telephone for complaint does not work',
 'the product was purchased on 10/10/2016 it has not arrived and now i receive this review asking what i thought very dissatisfied',
 'it says that my order was delivered but it did not arrive is it possible to inform the signature of the person who received it on the spot']

Key Observations:

  • Review messages indicate some items were delivered while others were not.

Date Inconsistencies#

order_purchase_dt

Let’s check if there are timestamps earlier than purchase dates.

for col_dt in ['order_approved_dt', 'order_delivered_carrier_dt', 'order_delivered_customer_dt']:
    rows_cnt = df_orders[~(df_orders['order_purchase_dt'].isna() | df_orders[col_dt].isna()
                          | (df_orders['order_purchase_dt'] <= df_orders[col_dt]))].shape[0]
    if rows_cnt:
        print(f'{col_dt} < order_purchase_dt, rows count: {rows_cnt}')
order_delivered_carrier_dt < order_purchase_dt, rows count: 166

Key Observations:

  • There are 166 orders where carrier handover time is earlier than purchase time. This is unusual.
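The loop above compares each timestamp only against the purchase time; the same check extends to every consecutive pair of stages (a sketch on synthetic data with one deliberately broken row):

```python
import pandas as pd

# Synthetic orders: in the second row the carrier handover precedes approval.
df = pd.DataFrame({
    'order_purchase_dt': pd.to_datetime(['2018-05-01 10:00', '2018-05-02 12:00']),
    'order_approved_dt': pd.to_datetime(['2018-05-01 11:00', '2018-05-02 12:30']),
    'order_delivered_carrier_dt': pd.to_datetime(['2018-05-02 09:00', '2018-05-02 11:00']),
})

stage_cols = ['order_purchase_dt', 'order_approved_dt', 'order_delivered_carrier_dt']
bad_counts = {}
for earlier, later in zip(stage_cols, stage_cols[1:]):
    # NaT comparisons evaluate to False, so missing timestamps are not flagged.
    bad_counts[later] = int((df[later] < df[earlier]).sum())
    if bad_counts[later]:
        print(f'{later} < {earlier}: {bad_counts[later]} rows')
```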

Let’s examine the dataframe.

tmp_mask = ~(df_orders['order_purchase_dt'].isna() | df_orders['order_delivered_carrier_dt'].isna()
                          | (df_orders['order_purchase_dt'] <= df_orders['order_delivered_carrier_dt']))
tmp_df_orders = df_orders[tmp_mask]
print(f'rows: {tmp_df_orders.shape[0]}')
display(tmp_df_orders.head(5))
rows: 166
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
615 b9afddbdcfadc9a87b41a83271c3e888 85c6af75161b8b2b1af98e82b5a6a5a5 Delivered 2018-08-16 13:50:48 2018-08-16 14:05:13 2018-08-16 13:27:00 2018-08-24 14:58:37 2018-09-04 5 Credit Card cama_mesa_banho RJ August Thursday Afternoon Not Delayed Delivered
1111 ad133696906f6a78826daa0911b7daec e6f5b234bb0d847f10eebd70130c5d49 Delivered 2018-06-15 15:41:22 2018-06-15 16:19:23 2018-06-15 14:52:00 2018-06-22 18:09:37 2018-07-18 5 Credit Card beleza_saude SC June Friday Afternoon Not Delayed Delivered
1329 74e033208dc13a7b8127eb8e73d09b76 72fcbb1145f2889eddcba6d6c1d6c090 Delivered 2018-05-02 10:48:44 2018-05-02 11:13:45 2018-05-02 09:49:00 2018-05-07 23:06:36 2018-05-29 5 Credit Card beleza_saude PR May Wednesday Morning Not Delayed Delivered
1372 a6b58794fd2ba533359a76c08df576e3 ccd8e3459ad58ae538c9d2cf35532ba4 Delivered 2018-05-14 15:18:23 2018-05-14 15:33:35 2018-05-14 13:46:00 2018-05-19 19:33:32 2018-06-08 4 Credit Card cama_mesa_banho RS May Monday Afternoon Not Delayed Delivered
1864 5792e0b1c8c8a2bf53af468c9a422c88 5c427f06e7fcf902e309743516e2c580 Delivered 2018-07-26 13:25:14 2018-07-26 13:35:14 2018-07-26 12:42:00 2018-07-30 14:45:02 2018-08-09 5 Credit Card relogios_presentes SP July Thursday Afternoon Not Delayed Delivered

Let’s analyze by day.

tmp_df_orders.explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=tmp_mask
    , freq='D'
)
../../_images/5691373a2c4ff9371b8bc8b48447568f93e65e43a1dd46d3196ed855befe3c18.jpg

Key Observations:

  • These anomalies only occurred between 25 April and 24 August 2018.

Let’s analyze by order status.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='order_status'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
order_status Delivered 96478 165 0.2% 97.0% 99.4% 2.4%
order_status Shipped 1107 1 0.1% 1.1% 0.6% -0.5%

Let’s look at it broken down by payment type.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_payment_types'
)
Custom anomalies distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Credit Card 74259 155 0.2% 74.7% 93.4% 18.7%
tmp_payment_types Debit Card 1527 5 0.3% 1.5% 3.0% 1.5%
tmp_payment_types Voucher 1621 3 0.2% 1.6% 1.8% 0.2%
tmp_payment_types Boleto 19784 3 0.0% 19.9% 1.8% -18.1%

Key Observations:

  • Over 90% of anomalous orders were paid by credit card.

Let’s analyze by time of day.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_purchase_time_of_day'
)
Custom mask distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_purchase_time_of_day Afternoon 32211 136 0.4% 32.4% 81.9% 49.5%
tmp_purchase_time_of_day Morning 22634 28 0.1% 22.8% 16.9% -5.9%
tmp_purchase_time_of_day Evening 36127 2 0.0% 36.3% 1.2% -35.1%

Key Observations:

  • Most anomalies occurred in the afternoon.

Let’s examine their reviews.

messages = (
    tmp_df_orders.merge(df_reviews, on='order_id', how='left')
    ['review_comment_message']
    .dropna()
    .sample(20)
    .tolist()
)
display(messages)
['kidnaps',
 'very good product and fast delivery',
 'beautiful just like the photo',
 'i haven t tested the product yet but apparently it looks good after i test complete my curriculum mentarium',
 'product as photo and specifications very good',
 'i thought it was good and it was delivered well before the deadline',
 'all well satisfied',
 'i don t have the product in hand',
 'i really liked the product i highly recommend it',
 'very good',
 'the product is very good it delivers what it promises',
 'arrived before the deadline established everything perfect',
 'i m testing the product i hope it s good',
 'i really liked great quality',
 'request answered correctly grateful',
 'they should only put the watch in suitable packaging came wrapped in plastic',
 'i loved the table what a pity that i got confused in the color i bought it thinking it would be white but it is nude but it s ok very beautiful it s worth it',
 'is in the process of returning the printer did not accept the cartridge',
 'the product is much smaller than what it shows in the photo this type of product testifies against the american with',
 'delivery was ahead of schedule i recommend']

Key Observations:

  • Nothing unusual found.


order_approved_dt

Let’s check if there are timestamps that should occur after approval but appear earlier.

for col_dt in ['order_delivered_carrier_dt', 'order_delivered_customer_dt']:
    rows_cnt = df_orders[~(df_orders['order_approved_dt'].isna() | df_orders[col_dt].isna()
                          | (df_orders['order_approved_dt'] <= df_orders[col_dt]))].shape[0]
    if rows_cnt:
        print(f'{col_dt} < order_approved_dt, rows count: {rows_cnt}')
order_delivered_carrier_dt < order_approved_dt, rows count: 1359
order_delivered_customer_dt < order_approved_dt, rows count: 61

Key Observations:

  • There are 1,359 orders where carrier handover time is earlier than payment approval time.

  • There are 61 orders where delivery time is earlier than payment approval time.
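This pairwise timestamp-ordering check can be factored into a small reusable helper; a hedged sketch in plain pandas (column names mirror the schema above, the helper itself is illustrative):

```python
import pandas as pd

def count_order_violations(df: pd.DataFrame, earlier: str, later: str) -> int:
    """Count rows where `later` is strictly before `earlier`, ignoring NaT."""
    mask = ~(df[earlier].isna() | df[later].isna() | (df[earlier] <= df[later]))
    return int(mask.sum())

# Toy data: the first order was approved *after* the carrier handover.
toy = pd.DataFrame({
    'order_approved_dt': pd.to_datetime(['2018-06-12 23:31', '2018-06-01 10:00']),
    'order_delivered_carrier_dt': pd.to_datetime(['2018-06-11 14:54', '2018-06-02 09:00']),
})
print(count_order_violations(toy, 'order_approved_dt', 'order_delivered_carrier_dt'))  # 1
```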

Let’s examine each one separately.

order_delivered_carrier_dt < order_approved_dt

Let’s examine the dataframe.

tmp_mask = ~(df_orders['order_approved_dt'].isna() | df_orders['order_delivered_carrier_dt'].isna()
                          | (df_orders['order_approved_dt'] <= df_orders['order_delivered_carrier_dt']))
tmp_df_orders = df_orders[tmp_mask]
print(f'rows: {tmp_df_orders.shape[0]}')
display(tmp_df_orders.head(5))
rows: 1359
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
15 dcb36b511fcac050b97cd5c05de84dc3 3b6828a50ffe546942b7a473d70ac0fc Delivered 2018-06-07 19:03:12 2018-06-12 23:31:02 2018-06-11 14:54:00 2018-06-21 15:34:32 2018-07-04 5 Credit Card perfumaria GO June Thursday Evening Not Delayed Delivered
64 688052146432ef8253587b930b01a06d 81e08b08e5ed4472008030d70327c71f Delivered 2018-04-22 08:48:13 2018-04-24 18:25:22 2018-04-23 19:19:14 2018-04-24 19:31:58 2018-05-15 4 Credit Card informatica_acessorios, malas_acessorios MG April Sunday Morning Not Delayed Delivered
199 58d4c4747ee059eeeb865b349b41f53a 1755fad7863475346bc6c3773fe055d3 Delivered 2018-07-21 12:49:32 2018-07-26 23:31:53 2018-07-24 12:57:00 2018-07-25 23:58:19 2018-07-31 5 Boleto automotivo SP July Saturday Afternoon Not Delayed Delivered
210 412fccb2b44a99b36714bca3fef8ad7b c6865c523687cb3f235aa599afef1710 Delivered 2018-07-22 22:30:05 2018-07-23 12:31:53 2018-07-23 12:24:00 2018-07-24 19:26:42 2018-07-31 5 Credit Card industria_comercio_e_negocios SP July Sunday Evening Not Delayed Delivered
415 56a4ac10a4a8f2ba7693523bb439eede 78438ba6ace7d2cb023dbbc81b083562 Delivered 2018-07-22 13:04:47 2018-07-27 23:31:09 2018-07-24 14:03:00 2018-07-28 00:05:39 2018-08-06 5 Boleto eletronicos SP July Sunday Afternoon Not Delayed Delivered

Let’s examine by day.

tmp_df_orders.explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=tmp_mask
    , freq='D'
)
../../_images/e04ab5829608921c7790476c74169f1b7b79af5ed417dae02edf09c38489caa0.jpg

Key Observations:

  • Days with most anomalies:

    • 19-23 April 2018

    • 3-4 July 2018

  • Possible system issues caused delayed payment approvals.

Let’s analyze by order status.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='order_status'
)
Custom mask distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
order_status Delivered 96478 1350 1.4% 97.0% 99.3% 2.3%
order_status Shipped 1107 9 0.8% 1.1% 0.7% -0.5%

Key Observations:

  • Nearly all orders were eventually delivered.

Let’s analyze by time of day.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_purchase_time_of_day'
)
Custom mask distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_purchase_time_of_day Afternoon 32211 555 1.7% 32.4% 40.8% 8.4%
tmp_purchase_time_of_day Morning 22634 326 1.4% 22.8% 24.0% 1.2%
tmp_purchase_time_of_day Night 8469 94 1.1% 8.5% 6.9% -1.6%
tmp_purchase_time_of_day Evening 36127 384 1.1% 36.3% 28.3% -8.1%

Key Observations:

  • More anomalies occurred in the afternoon.

order_delivered_customer_dt < order_approved_dt

tmp_mask = ~(df_orders['order_approved_dt'].isna() | df_orders['order_delivered_customer_dt'].isna()
                          | (df_orders['order_approved_dt'] <= df_orders['order_delivered_customer_dt']))
tmp_df_orders = df_orders[tmp_mask]

Let’s examine by day.

tmp_df_orders.explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , custom_mask=tmp_mask
    , freq='D'
)
../../_images/8681d4e634772e605259c8698cd81ed47dc72b91baa5604353b89d0ea43328ff.jpg

Key Observations:

  • Anomalies occurred sporadically on specific dates.

Let’s look at it broken down by payment type.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_payment_types'
)
Custom mask distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Boleto 19784 21 0.1% 19.9% 34.4% 14.5%
tmp_payment_types Debit Card 1527 7 0.5% 1.5% 11.5% 9.9%
tmp_payment_types Credit Card, Voucher 1127 1 0.1% 1.1% 1.6% 0.5%
tmp_payment_types Voucher 1621 1 0.1% 1.6% 1.6% 0.0%
tmp_payment_types Credit Card 74259 31 0.0% 74.7% 50.8% -23.9%

Key Observations:

  • “Boleto” payments had significantly more anomalies.

Let’s analyze by customer state.

df_orders.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_customer_state'
)
Custom mask distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_customer_state SP 41746 53 0.1% 42.0% 86.9% 44.9%
tmp_customer_state CE 1336 1 0.1% 1.3% 1.6% 0.3%
tmp_customer_state PR 5045 3 0.1% 5.1% 4.9% -0.2%
tmp_customer_state MG 11635 2 0.0% 11.7% 3.3% -8.4%
tmp_customer_state RJ 12852 2 0.0% 12.9% 3.3% -9.6%

Key Observations:

  • Most anomalies occurred in São Paulo.

order_delivered_carrier_dt

Let’s check if there are timestamps that should occur after carrier handover but appear earlier.

tmp_mask = ~(df_orders['order_delivered_carrier_dt'].isna() | df_orders['order_delivered_customer_dt'].isna()
                        | (df_orders['order_delivered_carrier_dt'] <= df_orders['order_delivered_customer_dt']))
rows_cnt = df_orders[tmp_mask].shape[0]
if rows_cnt:
    print(f'order_delivered_customer_dt < order_delivered_carrier_dt, rows count: {rows_cnt}')
order_delivered_customer_dt < order_delivered_carrier_dt, rows count: 23

Key Observations:

  • There are 23 orders where delivery time is earlier than carrier handover time.

review_creation_dt < order_purchase_dt

We have order creation time and review creation time. Let’s check if any reviews were created before their corresponding orders.

temp_df = df_orders.merge(df_reviews, on='order_id', how='left')
temp_df = temp_df[lambda x: x.order_purchase_dt.dt.normalize() > x.review_creation_dt]
temp_df.shape[0]
65

The dataset contains 65 orders where reviews were created before the orders themselves.
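The merge-and-filter step behind that count looks like this on toy data (normalizing the purchase timestamp to midnight keeps the comparison date-to-date; the frames here are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': ['a', 'b'],
    'order_purchase_dt': pd.to_datetime(['2018-09-20 13:54:16', '2018-08-01 10:00:00']),
})
reviews = pd.DataFrame({
    'order_id': ['a', 'b'],
    'review_creation_dt': pd.to_datetime(['2018-07-28', '2018-08-05']),
})

merged = orders.merge(reviews, on='order_id', how='left')
# Keep reviews whose creation date precedes the purchase date.
early = merged[merged.order_purchase_dt.dt.normalize() > merged.review_creation_dt]
print(early.order_id.tolist())  # ['a']
```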

Let’s examine them.

temp_df.head()
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered review_id review_score review_comment_title review_comment_message review_creation_dt review_answer_dt
1811 ed3efbd3a87bea76c2812c66a0b32219 191984a8ba4cbb2145acb4fe35b69664 Canceled 2018-09-20 13:54:16 NaT NaT NaT 2018-10-17 2 Voucher Missing in Items MG September Thursday Afternoon Missing delivery dt Not Delivered 6e4344680dbd30c75f78394e0dcfffdf 2 NaN the product came with a defect it does not tur... 2018-07-28 2018-07-30 11:06:16
2040 8d4c637f1accf7a88a4555f02741e606 b1dd715db389a2077f43174e7a675d07 Canceled 2018-08-29 16:27:49 NaT NaT NaT 2018-09-13 5 Voucher Missing in Items SP August Wednesday Afternoon Missing delivery dt Not Delivered 0f54fea9e89c2a9398b2bd56e3880eda 5 i loved NaN 2018-08-26 2018-08-26 20:21:38
3110 5290c34bd38a8a095b885f13958db1e1 92af427e290117f39d9ff908566072e0 Canceled 2018-08-21 10:25:18 NaT NaT NaT 2018-09-06 3 Voucher Missing in Items SP August Tuesday Morning Missing delivery dt Not Delivered 3a5a10d592a6e639cdb9330c9e6d2a87 3 3 NaN 2018-08-18 2018-08-21 00:47:13
3702 03310aa823a66056268a3bab36e827fb 25dbbf0c477fd4ae0880aaffbb12e8b3 Canceled 2018-08-07 16:33:59 NaT NaT NaT 2018-09-04 1 Voucher Missing in Items SC August Tuesday Afternoon Missing delivery dt Not Delivered 08528f70f579f0c830189efc523d2182 1 wrong product delivery of the product different from the one... 2018-08-03 2018-08-06 00:09:52
4396 4c8b9947280829d0a8b7e81cc249b875 403c35c4d8813bf67b3d396b91ca1619 Canceled 2018-08-09 14:54:47 NaT NaT NaT 2018-08-21 1 Voucher Missing in Items SP August Thursday Afternoon Missing delivery dt Not Delivered 18071cb247f057c348988037fa94a0cd 1 ambiguous ad goes wrong contact lenses are positive for farsightedness... 2018-08-02 2018-08-03 01:34:51

Let’s look at how many orders do not have a payment approval date.

temp_df.order_approved_dt.isna().sum()
np.int64(57)

Let’s look at how many orders do not have a delivery date.

temp_df.order_delivered_customer_dt.isna().sum()

Let’s look at how many of them were canceled.

temp_df.order_status.value_counts() 
order_status
Canceled       58
Delivered       6
Shipped         1
Approved        0
Created         0
Invoiced        0
Processing      0
Unavailable     0
Name: count, dtype: int64

Key Observations:

  • The dataset contains 65 orders whose reviews were created before the orders themselves: 58 were canceled, 6 were delivered, and 1 was still in transit.

Let’s examine the 6 delivered orders.

temp_df[temp_df.order_status == 'Delivered']
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered review_id review_score review_comment_title review_comment_message review_creation_dt review_answer_dt
32582 ebc94658c583ab37ad4f8e9091c4bef2 92dbb92ac5c6b1822c3aa951d33e61d0 Delivered 2018-04-14 08:22:43 2018-04-24 18:04:50 2018-04-26 15:57:00 2018-04-30 20:39:01 2018-05-03 2 Voucher utilidades_domesticas MG April Saturday Morning Not Delayed Delivered d25026282e14fc531f0711246e121cc2 2 NaN it was delivered on time but the product with ... 2018-03-30 2018-04-02 00:32:35
44345 82fd1196a459f594fb1d66e667fc74c4 ef9b9b1cb4921466d4190e3116fe248e Delivered 2018-04-10 21:09:18 2018-04-11 09:30:27 2018-04-17 16:04:51 2018-04-21 00:51:34 2018-05-02 2 Voucher cama_mesa_banho SP April Tuesday Evening Not Delayed Delivered 19194f53840fc6c2857dd8809972a209 4 NaN NaN 2018-01-23 2018-01-24 12:43:58
50153 4bb9c2002502ca416276dc1ff5efb1b3 e0e553d46d86daa33b733b1fd2687731 Delivered 2018-06-23 14:53:03 2018-06-27 14:20:08 2018-06-29 15:30:00 2018-07-17 17:08:47 2018-07-27 1 Voucher audio BA June Saturday Afternoon Not Delayed Delivered 0857872f03ef22e3dbe8391c4dd331f6 1 was not delivered NaN 2018-05-18 2018-05-21 18:15:06
70100 4a62fb19d5fa08fb514619dfcc617b3d 4d1bb9a4d16191ced3d0d1c473ccbc97 Delivered 2018-05-09 19:23:46 2018-05-10 14:32:42 2018-05-11 14:44:00 2018-05-15 18:12:35 2018-05-25 3 Voucher relogios_presentes SP May Wednesday Evening Not Delayed Delivered a935b3d67f406bac4853776d80a05220 3 finishing i m trying to get in touch with stark to check... 2018-05-05 2018-05-08 02:06:50
70219 450c49623c365a4edcf0c5a2c93aa7c9 5a1739996fb7a15067fa38c3a0a1097e Delivered 2017-03-01 08:08:10 2017-03-02 15:32:28 2017-03-03 13:42:55 2017-03-08 10:26:04 2017-03-29 3 Voucher papelaria MG March Wednesday Morning Not Delayed Delivered 778826f9af6467ff6268bbcf436cd3c5 1 NaN the product was sent defective 2017-02-22 2017-02-22 14:05:09
90334 96f5be02bc9ffc589f3274500a64a7e2 5396ffb28b2d1cbea8d6ec46201f06ba Delivered 2018-04-23 14:31:55 2018-04-27 16:11:21 2018-05-11 12:26:00 2018-05-16 18:19:00 2018-06-01 2 Voucher moveis_escritorio RJ April Monday Afternoon Not Delayed Delivered e2fc0692ed820a5802c1505a77cf7870 1 NaN the product delivered was not the product purc... 2018-04-04 2018-04-05 03:20:55

We previously determined that one order can have multiple reviews and one review can cover multiple orders.

Let’s check for duplicates in these orders and reviews.

temp_unique_orders = temp_df.order_id.unique()
temp_unique_reviews = temp_df.review_id.unique()
(
    df_reviews[df_reviews.review_id.isin(temp_unique_reviews)]
    .merge(df_orders, on='order_id', how='left')
    .sort_values('review_id')
    .head()
)
review_id order_id review_score review_comment_title review_comment_message review_creation_dt review_answer_dt customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
41 047fd109ced39e02296f6aeb74f6a6f1 236e6ec6171c1870d4bcf4ccfad87f49 4 NaN NaN 2018-07-31 2018-08-02 15:11:49 0fa12cb295663a40f53876e73aa7c460 Canceled 2018-08-21 15:12:31 NaT NaT NaT 2018-09-11 4 Voucher Missing in Items SP August Tuesday Afternoon Missing delivery dt Not Delivered
10 047fd109ced39e02296f6aeb74f6a6f1 a89abace0dcc01eeb267a9660b5ac126 4 NaN NaN 2018-07-31 2018-08-02 15:11:49 2f0524a7b1b3845a1a57fcf3910c4333 Canceled 2018-09-06 18:45:47 NaT NaT NaT 2018-09-27 4 Voucher Missing in Items SP September Thursday Evening Missing delivery dt Not Delivered
31 08528f70f579f0c830189efc523d2182 03310aa823a66056268a3bab36e827fb 1 wrong product delivery of the product different from the one... 2018-08-03 2018-08-06 00:09:52 25dbbf0c477fd4ae0880aaffbb12e8b3 Canceled 2018-08-07 16:33:59 NaT NaT NaT 2018-09-04 1 Voucher Missing in Items SC August Tuesday Afternoon Missing delivery dt Not Delivered
65 08528f70f579f0c830189efc523d2182 53c71d3953507c6239ff73917ed358c9 1 wrong product delivery of the product different from the one... 2018-08-03 2018-08-06 00:09:52 26dcb450c4b5b390e79e6d5d0f2c6535 Delivered 2018-07-24 20:41:01 2018-07-24 20:55:31 2018-07-27 15:24:00 2018-08-02 18:41:16 2018-08-21 1 Credit Card cool_stuff SC July Tuesday Evening Not Delayed Delivered
2 08528f70f579f0c830189efc523d2182 7813842ae95e8c497fc0233232ae815a 1 wrong product delivery of the product different from the one... 2018-08-03 2018-08-06 00:09:52 040d94f8ba8ca26014bd6f7e8a6e0c0d Canceled 2018-08-17 20:06:36 NaT NaT NaT 2018-09-17 1 Voucher Missing in Items SC August Friday Evening Missing delivery dt Not Delivered

Even accounting for these duplicates, the review creation dates still precede the purchase dates of the orders they are linked to.

Let’s check if any review responses were created before the reviews themselves:

df_reviews[lambda x: x.review_creation_dt >= x.review_answer_dt]
review_id order_id review_score review_comment_title review_comment_message review_creation_dt review_answer_dt

No such cases found.

Table df_payments#

Let’s look at the information about the dataframe.

df_payments.explore.info()
Dataframe Overview
Summary Column Types
Rows 103.89k Text 1
Features 5 Categorical 1
Missing cells --- Int 2
Exact Duplicates --- Float 1
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) 11

Initial Column Analysis#

We will examine each column individually.

order_id

df_payments['order_id'].explore.info(plot=False)
Summary Statistics for "order_id" (Type: Text)
Summary Text Metrics
Total Values 103.89k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 99.44k (96%) Median Length (chars) 32.0
Non-Duplicates 96.48k (93%) Min Length (chars) 32.0
Exact Duplicates 4.45k (4%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 4.45k (4%) Avg Digit Ratio 0.63
Values with Duplicates 2.96k (3%)
Memory Usage 8

Key Observations:

  • No issues found: the 4% of duplicated order_id values are expected, since a single order can have several payment records.

payment_sequential

df_payments['payment_sequential'].explore.info(plot=False)
Summary Statistics for "payment_sequential" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 103.89k (100%) Max 29 Mean 1.09 1 99.36k (96%)
Missing --- 99% 3 Trimmed Mean (10%) 1 2 3.04k (3%)
Distinct 29 (<1%) 95% 1 Mode 1 3 581 (<1%)
Non-Duplicate 3 (<1%) 75% 1 Range 28 4 278 (<1%)
Duplicates 103.86k (99%) 50% 1 IQR 0 5 170 (<1%)
Dup. Values 26 (<1%) 25% 1 Std 0.71 6 118 (<1%)
Zeros --- 5% 1 MAD 0 7 82 (<1%)
Negative --- 1% 1 Kurt 370.59 8 54 (<1%)
Memory Usage 1 Min 1 Skew 16.18 9 43 (<1%)

Key Observations:

  • The maximum number of payments recorded for a single order is 29.

payment_type

df_payments['payment_type'].explore.info(plot=True)
Summary Statistics for "payment_type" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 103.89k (100%) Avg Word Count 1.8 Credit Card 76.80k (74%)
Missing Values --- Max Length (chars) 11.0 Boleto 19.78k (19%)
Empty Strings --- Avg Length (chars) 9.8 Voucher 5.78k (6%)
Distinct Values 5 (<1%) Median Length (chars) 11.0 Debit Card 1.53k (1%)
Non-Duplicates --- Min Length (chars) 6.0 Not Defined 3 (<1%)
Exact Duplicates 103.88k (99%) Most Common Length 11 (73.9%)
Fuzzy Duplicates 103.88k (99%) Avg Digit Ratio 0.00
Values with Duplicates 5 (<1%)
Memory Usage <1 Mb
../../_images/f4932efd5566436942466e70a6ccb8966340e556b882599f9e31c7ecc50db7af.jpg

Key Observations:

  • 74% of payments were made using credit cards.

  • The payment_type field contains undefined payment types (<1%).

payment_installments

df_payments['payment_installments'].explore.info()
Summary Statistics for "payment_installments" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 103.89k (100%) Max 24 Mean 2.85 1 52.55k (51%)
Missing --- 99% 10 Trimmed Mean (10%) 2.28 2 12.41k (12%)
Distinct 24 (<1%) 95% 10 Mode 1 3 10.46k (10%)
Non-Duplicate 2 (<1%) 75% 4 Range 24 4 7.10k (7%)
Duplicates 103.86k (99%) 50% 1 IQR 3 10 5.33k (5%)
Dup. Values 22 (<1%) 25% 1 Std 2.69 5 5.24k (5%)
Zeros 2 (<1%) 5% 1 MAD 0 8 4.27k (4%)
Negative --- 1% 1 Kurt 2.55 6 3.92k (4%)
Memory Usage 1 Min 0 Skew 1.66 7 1.63k (2%)
../../_images/a61ec134f712b5b02a5f428314c357277a069917d0b63c624d8e29effd195473.jpg

Key Observations:

  • The maximum number of installments for a single payment is 24.

  • The median number of installments is 1.

  • 75% of payments have 4 or fewer installments.

  • There are 2 records with a value of 0 in payment_installments.

payment_value

df_payments['payment_value'].explore.info()
Summary Statistics for "payment_value" (Type: Float)
Summary Percentiles Detailed Stats Value Counts
Total 103.89k (100%) Max 13.66k Mean 154.10 50 324 (<1%)
Missing --- 99% 1.04k Trimmed Mean (10%) 114.85 20 274 (<1%)
Distinct 29.08k (28%) 95% 437.63 Mode 50 100 255 (<1%)
Non-Duplicate 13.10k (13%) 75% 171.84 Range 13.66k 77.57 250 (<1%)
Duplicates 74.81k (72%) 50% 100 IQR 115.05 35 165 (<1%)
Dup. Values 15.98k (15%) 25% 56.79 Std 217.49 73.34 160 (<1%)
Zeros 9 (<1%) 5% 26.11 MAD 76.43 30 133 (<1%)
Negative --- 1% 6.69 Kurt 241.83 116.94 131 (<1%)
Memory Usage 1 Min 0 Skew 9.25 56.78 122 (<1%)
../../_images/30167ca3c064fb6302b137ec31628ab7b7ab942da7d38d43faed479656e4b78f.jpg

Key Observations:

  • There are 9 zero-value payments in payment_value.

  • The maximum payment is 13.66k. The median payment is 100.

  • The 13.66k payment is clearly an outlier.

Exploring Outliers#

df_payments.explore.anomalies_report(
    anomaly_type='outlier'
)
Outliers by Column
  Count Percent
payment_sequential 4526 4.4%
payment_installments 343 0.3%
payment_value 10390 10.0%
Co-occurring Outliers (Pairwise) (method: quantile, threshold: 0.05)
  payment_sequential payment_installments payment_value
payment_sequential
payment_installments < under 1% / ^ under 1%
payment_value < 19.7% / ^ 45.1% < under 1% / ^ 21.9%
Outliers distribution across categories (method: quantile, threshold: 0.05)
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
payment_type Voucher 5775 4485 77.7% 5.6% 34.1% 28.6%
Sample Outliers Rows
  order_id payment_sequential payment_type payment_installments payment_value
58222 242c9821e1036401ddc1d061ffcb034c 1 Credit Card 1 11.57
16695 4b73f647e26dc08ed1ac8602a80ca3ee 1 Boleto 1 741.44
53766 350508df3bad3da1cf8a3ac50f8c0406 1 Credit Card 8 565.91
32758 cec3185bcfe3a9c60e73df2ae9bfa4ac 3 Voucher 1 63.45
96404 7d0ed38550fa692c0dff9a4d0eddb8e7 1 Credit Card 1 8.39
../../_images/8d3a62e204f9473a8644fcf8f3221ad5dc0bc91f34a2464f68e41887e92c2ac1.jpg

Let’s examine payments exceeding 5,000.

df_payments[df_payments.payment_value > 5_000]
order_id payment_sequential payment_type payment_installments payment_value
34370 736e1922ae60d0d6a89247b851902527 1 Boleto 1 7,274.88
41419 0812eb902a67711a1cb742b3cdaa65ae 1 Credit Card 8 6,929.31
49581 fefacc66af859508bf1a7934eab1e97f 1 Boleto 1 6,922.21
52107 03caa2c082116e1d31e67e9ae3700499 1 Credit Card 1 13,664.08
62409 2cc9089445046817a7539d90805e6e5a 1 Boleto 1 6,081.54
85539 f5136e38d1a14a4dbd87dff67da82701 1 Boleto 1 6,726.66

Let’s check for outliers in total order amounts per user.

(
    df_customers.merge(df_orders, on='customer_id', how='left')
    .merge(df_payments, on='order_id', how='left')
    .groupby('customer_unique_id')['payment_value']
    .sum()
    .sort_values(ascending=False)
    .to_frame()
    .head(10)
)
payment_value
customer_unique_id
0a0a92112bd4c708ca5fde585afaa872 13,664.08
46450c74a0d8c5ca9395da1daac6c120 9,553.02
da122df9eeddfedc1dc1f5349a1a690c 7,571.63
763c8b1c9c68a0229c42c9fc6f662b93 7,274.88
dc4802a71eae9be1dd28f5d788ceb526 6,929.31
459bef486812aa25204be022145caa62 6,922.21
ff4159b92c40ebe40454e3e6a7c35ed6 6,726.66
4007669dec559734d6f53e029e360987 6,081.54
5d0a2980b292d049061542014e8960bf 4,809.44
eebb5dda148d3893cdaf5b5ca3040ccb 4,764.34

Key Observations:

  • One user made orders totaling 13,664. This clearly stands out from the rest.

  • There are also several users who made purchases totaling 6,000 or more.
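The per-user totals above come from chaining two merges and a groupby; a self-contained sketch with toy frames (in this dataset each order carries its own customer_id, so totals must be grouped by the person-level customer_unique_id):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': ['c1', 'c2', 'c3'],
                          'customer_unique_id': ['u1', 'u1', 'u2']})
orders = pd.DataFrame({'order_id': ['o1', 'o2', 'o3'],
                       'customer_id': ['c1', 'c2', 'c3']})
payments = pd.DataFrame({'order_id': ['o1', 'o2', 'o3', 'o3'],
                         'payment_value': [100.0, 50.0, 20.0, 5.0]})

# customer_id is unique per order, so group by the person-level id.
totals = (
    customers.merge(orders, on='customer_id', how='left')
    .merge(payments, on='order_id', how='left')
    .groupby('customer_unique_id')['payment_value']
    .sum()
    .sort_values(ascending=False)
)
print(totals.to_dict())  # {'u1': 150.0, 'u2': 25.0}
```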

Let’s identify outliers using quantiles.

We’ll consider values outside the 5th and 95th percentiles as outliers.

df_payments.explore.detect_anomalies(
    anomaly_type='outlier'
    , method='quantile'
    , threshold=0.05
)
Outliers by Column (method: quantile, threshold: 0.05)
  Count Percent
payment_sequential 4526 4.36%
payment_installments 343 0.33%
payment_value 10390 10.00%

Key Observations:

  • 10% of payment values are flagged as outliers. With 5th/95th percentile cutoffs this is expected by construction (roughly 5% in each tail), so it isn’t critical.

  • For payment installments, outliers account for less than 1%, since the values are heavily concentrated at low counts.
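The quantile rule itself is simple to reproduce; a sketch on synthetic values (the method and threshold follow the call above, everything else is illustrative):

```python
import pandas as pd

values = pd.Series(range(1, 101), dtype=float)  # 1.0 .. 100.0

# Flag anything outside the 5th and 95th percentiles.
lo, hi = values.quantile(0.05), values.quantile(0.95)
outliers = values[(values < lo) | (values > hi)]

# With continuous data this flags ~5% in each tail, ~10% overall.
print(len(outliers))  # 10
```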

payment_value

Let’s examine the distribution of payment value outliers over time.

tmp_outl = df_payments.merge(df_orders, on='order_id', how='left')
tmp_outl['payment_value'].explore.anomalies_over_time(
    time_column='order_purchase_dt'
    , anomaly_type='outlier'
    , freq='D'
)
../../_images/71c0aae788153c2ffbb646596ea1e8de0f01d8a7681103933b7fbed9a6d7c926.jpg

Key Observations:

  • Many payment outliers occurred between November 20-26, 2017, likely related to Black Friday.

del tmp_outl

Exploring Other Anomalies#

Let’s explore zero values.

df_payments.explore.anomalies_report(
    anomaly_type='zero'
    , sample_size=20
)
Zeros by Column
  Count Percent
payment_installments 2 0.0%
payment_value 9 0.0%
Co-occurring Zeros (Pairwise)
  payment_installments payment_value
payment_installments
payment_value 0
Zeros distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
payment_type Voucher 5775 6 0.1% 5.6% 54.5% 49.0%
payment_type Not Defined 3 3 100.0% 0.0% 27.3% 27.3%
Sample Zeros Rows
  order_id payment_sequential payment_type payment_installments payment_value
77885 fa65dad1b0e818e3ccc5cb0e39231352 13 Voucher 1 0.0
36822 fa65dad1b0e818e3ccc5cb0e39231352 14 Voucher 1 0.0
62674 45ed6e85398a87c253db47c2d9f48216 3 Voucher 1 0.0
100766 b23878b3e8eb4d25a158f57d96331b18 4 Voucher 1 0.0
43744 6ccb433e00daae1283ccc956189c82ae 4 Voucher 1 0.0
79014 1a57108394169c0b47d8f876acc9ba2d 2 Credit Card 0 129.94
46982 744bade1fcf9ff3f31d860ace076d422 2 Credit Card 0 58.69
51280 4637ca194b6387e2d538dc89b124b0ee 1 Not Defined 1 0.0
94427 c8c528189310eaa44a745b8d9d26908b 1 Not Defined 1 0.0
57411 00b1cb0320190ca0daa2c88b35206009 1 Not Defined 1 0.0
19922 8bcbe01d44d147f901cd3192671144db 4 Voucher 1 0.0
../../_images/aeccf9e6e1ae4ed4ffb65d7ae588695d9697707a6bb9c3f43bffefb35cb27336.jpg

Key Observations:

  • Orders with zero payment amounts have either “voucher” or “not_defined” as their payment type.
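That category breakdown can be approximated with a groupby over a zero-value mask; a minimal sketch on toy payments (the values are illustrative):

```python
import pandas as pd

payments = pd.DataFrame({
    'payment_type': ['Voucher', 'Voucher', 'Credit Card', 'Not Defined'],
    'payment_value': [0.0, 31.70, 36.21, 0.0],
})

# Share of zero-value payments within each payment type.
zero_rate = (
    payments.assign(is_zero=payments.payment_value.eq(0))
    .groupby('payment_type')['is_zero']
    .mean()
)
print(zero_rate['Not Defined'])  # 1.0: every 'Not Defined' payment here is zero
```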

Let’s examine zeros in each column separately.

Zeros in payment_installments

df_payments[df_payments.payment_installments == 0]
order_id payment_sequential payment_type payment_installments payment_value
46982 744bade1fcf9ff3f31d860ace076d422 2 Credit Card 0 58.69
79014 1a57108394169c0b47d8f876acc9ba2d 2 Credit Card 0 129.94

Since payment_sequential shows 2, there should have been another payment. Let’s examine these orders.

df_payments[df_payments.order_id == '744bade1fcf9ff3f31d860ace076d422']
order_id payment_sequential payment_type payment_installments payment_value
46982 744bade1fcf9ff3f31d860ace076d422 2 Credit Card 0 58.69
df_payments[df_payments.order_id == '1a57108394169c0b47d8f876acc9ba2d']
order_id payment_sequential payment_type payment_installments payment_value
79014 1a57108394169c0b47d8f876acc9ba2d 2 Credit Card 0 129.94

Let’s check these orders in df_items.

df_items[df_items.order_id == '744bade1fcf9ff3f31d860ace076d422']
order_id order_item_id product_id seller_id shipping_limit_dt price freight_value
51167 744bade1fcf9ff3f31d860ace076d422 1 0cf573090c66bb30ac5e53c82bdb0403 7202e2ba20579a9bd1acb29e61fe71f6 2018-04-26 12:31:06 45.90 12.79
df_items[df_items.order_id == '1a57108394169c0b47d8f876acc9ba2d']
order_id order_item_id product_id seller_id shipping_limit_dt price freight_value
11589 1a57108394169c0b47d8f876acc9ba2d 1 db35a562fb6ba63e19fa42a15349dc04 282f23a9769b2690c5dda22e316f9941 2018-05-18 16:31:54 41.69 23.28
11590 1a57108394169c0b47d8f876acc9ba2d 2 db35a562fb6ba63e19fa42a15349dc04 282f23a9769b2690c5dda22e316f9941 2018-05-18 16:31:54 41.69 23.28

As we can see, these orders weren’t fully recorded in df_payments: the first payment (payment_sequential = 1) is missing in both cases.
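A systematic way to find such partially recorded orders is to check that each order’s payment_sequential forms a complete 1..n sequence; a hedged sketch on toy data (the gap-check helper is illustrative):

```python
import pandas as pd

payments = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2', 'o3', 'o3'],
    'payment_sequential': [1, 2, 2, 1, 3],  # o2 misses payment 1, o3 misses payment 2
})

def has_gaps(seq: pd.Series) -> bool:
    """A complete sequence is exactly 1..n with no holes."""
    return sorted(seq) != list(range(1, len(seq) + 1))

incomplete = (
    payments.groupby('order_id')['payment_sequential']
    .apply(has_gaps)
    .loc[lambda s: s]
    .index.tolist()
)
print(incomplete)  # ['o2', 'o3']
```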


Zeros in payment_value

df_payments[df_payments.payment_value == 0]
order_id payment_sequential payment_type payment_installments payment_value
19922 8bcbe01d44d147f901cd3192671144db 4 Voucher 1 0.00
36822 fa65dad1b0e818e3ccc5cb0e39231352 14 Voucher 1 0.00
43744 6ccb433e00daae1283ccc956189c82ae 4 Voucher 1 0.00
51280 4637ca194b6387e2d538dc89b124b0ee 1 Not Defined 1 0.00
57411 00b1cb0320190ca0daa2c88b35206009 1 Not Defined 1 0.00
62674 45ed6e85398a87c253db47c2d9f48216 3 Voucher 1 0.00
77885 fa65dad1b0e818e3ccc5cb0e39231352 13 Voucher 1 0.00
94427 c8c528189310eaa44a745b8d9d26908b 1 Not Defined 1 0.00
100766 b23878b3e8eb4d25a158f57d96331b18 4 Voucher 1 0.00

Let’s look at the other payments for order 8bcbe01d44d147f901cd3192671144db.

df_payments[df_payments.order_id == '8bcbe01d44d147f901cd3192671144db']
order_id payment_sequential payment_type payment_installments payment_value
5163 8bcbe01d44d147f901cd3192671144db 3 Voucher 1 6.25
19922 8bcbe01d44d147f901cd3192671144db 4 Voucher 1 0.00
20963 8bcbe01d44d147f901cd3192671144db 1 Credit Card 1 36.21
63762 8bcbe01d44d147f901cd3192671144db 2 Voucher 1 31.70

Key Observations:

  • One payment was processed as zero, and it was the last payment in the sequence.

  • Moreover, all zero payments have either “voucher” or “not_defined” as their type.

  • There might be some specific payment logic here.

  • It’s better not to modify these zeros.

Table df_items#

Let’s look at the information about the dataframe.

df_items.explore.info()
Dataframe Overview
Summary Column Types
Rows 112.65k Text 2
Features 7 Categorical 1
Missing cells --- Int 1
Exact Duplicates --- Float 2
Fuzzy Duplicates --- Datetime 1
Memory Usage (Mb) 30

Initial Column Analysis#

We will examine each column individually.

order_id

df_items['order_id'].explore.info(plot=False)
Summary Statistics for "order_id" (Type: Text)
Summary Text Metrics
Total Values 112.65k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 98.67k (88%) Median Length (chars) 32.0
Non-Duplicates 88.86k (79%) Min Length (chars) 32.0
Exact Duplicates 13.98k (12%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 13.98k (12%) Avg Digit Ratio 0.62
Values with Duplicates 9.80k (9%)
Memory Usage 9

order_item_id

df_items['order_item_id'].explore.info()
Summary Statistics for "order_item_id" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 112.65k (100%) Max 21 Mean 1.20 1 98.67k (88%)
Missing --- 99% 4 Trimmed Mean (10%) 1.03 2 9.80k (9%)
Distinct 21 (<1%) 95% 2 Mode 1 3 2.29k (2%)
Non-Duplicate 1 (<1%) 75% 1 Range 20 4 965 (<1%)
Duplicates 112.63k (99%) 50% 1 IQR 0 5 460 (<1%)
Dup. Values 20 (<1%) 25% 1 Std 0.71 6 256 (<1%)
Zeros --- 5% 1 MAD 0 7 58 (<1%)
Negative --- 1% 1 Kurt 103.86 8 36 (<1%)
Memory Usage 1 Min 1 Skew 7.58 9 28 (<1%)
../../_images/3dd77cf8bb6f4d9b02c38ba43c82385a0c643043be278e9147145ac6e808f4f4.jpg

Key Observations:

  • The maximum quantity of items in a single order is 21.

product_id

df_items['product_id'].explore.info(plot=False)
Summary Statistics for "product_id" (Type: Text)
Summary Text Metrics
Total Values 112.65k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 32.95k (29%) Median Length (chars) 32.0
Non-Duplicates 18.12k (16%) Min Length (chars) 32.0
Exact Duplicates 79.70k (71%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 79.70k (71%) Avg Digit Ratio 0.63
Values with Duplicates 14.83k (13%)
Memory Usage 9

seller_id

df_items['seller_id'].explore.info(plot=False)
Summary Statistics for "seller_id" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 112.65k (100%) Avg Word Count 1.0 6560211a19b47992c3666cc44a7e94c0 2.03k (2%)
Missing Values --- Max Length (chars) 32.0 4a3ca9315b744ce9f8e9374361493884 1.99k (2%)
Empty Strings --- Avg Length (chars) 32.0 1f50f920176fa81dab994f9023523100 1.93k (2%)
Distinct Values 3.10k (3%) Median Length (chars) 32.0 cc419e0650a3c5ba77189a1882b7556a 1.77k (2%)
Non-Duplicates 509 (<1%) Min Length (chars) 32.0 da8622b14eb17ae2831f4ac5b9dab84a 1.55k (1%)
Exact Duplicates 109.56k (97%) Most Common Length 32 (100.0%) 955fee9216a65b617aa5c0531780ce60 1.50k (1%)
Fuzzy Duplicates 109.56k (97%) Avg Digit Ratio 0.63 1025f0e2d44d7041d6cf58b6550e0bfa 1.43k (1%)
Values with Duplicates 2.59k (2%) 7c67e1448b00f6e969d365cea6b010ab 1.36k (1%)
Memory Usage 9 ea8482cd71df3c1969d7b9473ff13abc 1.20k (1%)

shipping_limit_dt

df_items['shipping_limit_dt'].explore.info()
Summary Statistics for "shipping_limit_dt" (Type: Datetime)
Summary Data Quality Stats Temporal Stats
First date 2016-09-19 Values 112.65k (100%) Missing Years 1 (20%)
Last date 2020-04-09 Zeros --- Missing Months 18 (41%)
Avg Days Frequency 0.01 Missings --- Missing Weeks 88 (47%)
Min Days Interval 0 Distinct 93.32k (83%) Missing Days 744 (57%)
Max Days Interval 502 Duplicates 19.33k (17%) Weekend Percentage 2.9%
Memory Usage 1 Dup. Values 13.48k (12%) Most Common Weekday Thursday

Key Observations:

  • In shipping_limit_dt: 20% missing years, 41% missing months, 47% missing weeks, 57% missing days.

  • The maximum date in shipping_limit_dt is 2020-04-09.

price

df_items['price'].explore.info()
Summary Statistics for "price" (Type: Float)
Summary Percentiles Detailed Stats Value Counts
Total 112.65k (100%) Max 6.74k Mean 120.65 59.90 2.48k (2%)
Missing --- 99% 890 Trimmed Mean (10%) 87.19 69.90 1.99k (2%)
Distinct 5.97k (5%) 95% 349.90 Mode 59.90 49.90 1.95k (2%)
Non-Duplicate 2.34k (2%) 75% 134.90 Range 6.73k 89.90 1.55k (1%)
Duplicates 106.68k (95%) 50% 74.99 IQR 95 99.90 1.43k (1%)
Dup. Values 3.63k (3%) 25% 39.90 Std 183.63 39.90 1.34k (1%)
Zeros --- 5% 17 MAD 62.40 29.90 1.32k (1%)
Negative --- 1% 9.99 Kurt 120.83 79.90 1.21k (1%)
Memory Usage 1 Min 0.85 Skew 7.92 19.90 1.20k (1%)
../../_images/ece2ab2c7bd45edd5e3cf074f2421feaaad51d02c2b5caf6472703ae5f0b24da.jpg

Key Observations:

  • Half of all products are priced between 39.90 and 134.90 (the interquartile range).

  • The median product price is 74.99.

freight_value

df_items['freight_value'].explore.info()
Summary Statistics for "freight_value" (Type: Float)
Summary Percentiles Detailed Stats Value Counts
Total 112.65k (100%) Max 409.68 Mean 19.99 15.10 3.71k (3%)
Missing --- 99% 84.52 Trimmed Mean (10%) 17.22 7.78 2.26k (2%)
Distinct 7.00k (6%) 95% 45.12 Mode 15.10 14.10 1.88k (2%)
Non-Duplicate 2.08k (2%) 75% 21.15 Range 409.68 11.85 1.85k (2%)
Duplicates 105.65k (94%) 50% 16.26 IQR 8.07 18.23 1.57k (1%)
Dup. Values 4.92k (4%) 25% 13.08 Std 15.81 7.39 1.52k (1%)
Zeros 383 (<1%) 5% 7.78 MAD 5.35 16.11 1.15k (1%)
Negative --- 1% 4.42 Kurt 59.79 15.23 1.01k (<1%)
Memory Usage 1 Min 0 Skew 5.64 8.72 921 (<1%)
../../_images/43e00470003e5aece7840711b0ed12e570ec04bb29fb60a4e63a38788df0f7fe.jpg

Key Observations:

  • There are zero values in freight_value.

Exploring Outliers#

df_items.explore.anomalies_report(
    anomaly_type='outlier'
    , exclude_columns='seller_id'
)
Outliers by Column
  Count Percent
order_item_id 4181 3.7%
price 11076 9.8%
freight_value 10239 9.1%
Co-occurring Outliers (Pairwise) (method: quantile, threshold: 0.05)
  order_item_id price freight_value
order_item_id
price < 5.3% / ^ 14.0%
freight_value < 6.5% / ^ 16.0% < 23.2% / ^ 21.4%
Sample Outliers Rows
  order_id order_item_id product_id seller_id shipping_limit_dt price freight_value
2311 0540bf9189ef17aab470c4942f760e78 3 44d958961868395bfe852e70d48081f4 1ca7077d890b907f89be8c954a02686a 2018-03-21 22:28:08 45.99 7.39
1074 027e4543814c3b3509bbe890c569a841 1 e940125d0a3c309f58f41cd21e39af06 6039e27294dc75811c0d8a39069f52c0 2018-04-20 15:10:30 329.0 54.12
7207 1052fc490084f258baac0395a94fa58d 1 6047d1b3d3841a528e860a939daa2035 11bfa66332777660bd0640ee84d47006 2018-05-28 15:55:25 15.0 12.79
158 006127b8b9a1681a982313ed7129c3c0 1 f819f0c84a64f02d3a5606ca95edd272 7d13fca15225358621be4086e1eb0964 2018-03-15 02:30:23 540.0 16.22
32118 48c5decbdf6fcf15cdf335a81ef879c0 1 bc3d15dd90a6794ea7710f1642aae319 6b89abe95848c850399130d149a39b63 2018-04-12 21:29:27 145.9 55.28
../../_images/b2dba6c3236d833db04d99089009fa22bbf1d02adde4e0362a7d6289b06fa4c8.jpg

Key Observations:

  • About 10% of product prices and shipping costs are flagged as outliers. This exceeds the typical norm (usually around 5%) but isn’t critical.
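For reference, the quantile method with a 0.05 threshold roughly corresponds to flagging everything outside the 5th–95th percentile band. A hand-rolled sketch on synthetic prices (not the library’s actual implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic prices: mostly moderate values plus a heavy right tail
price = pd.Series(np.concatenate([rng.normal(100, 30, 950),
                                  rng.normal(2000, 500, 50)]))

# Flag values outside the 5th-95th percentile band (threshold 0.05 per tail)
lo, hi = price.quantile(0.05), price.quantile(0.95)
outliers = price[(price < lo) | (price > hi)]
print(f"{len(outliers) / len(price):.1%} flagged as outliers")
```

By construction this flags about 10% of values, which is why a quantile threshold of 0.05 always reports roughly that share on a continuous column.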

Product Sales Inconsistencies#

Checking if any products in the items table have multiple sellers:

df_items.groupby('product_id')['seller_id'].nunique().sort_values(ascending=False).head(10).to_frame('sellers_cnt')
sellers_cnt
product_id
d285360f29ac7fd97640bf0baef03de0 8
69455f41626a745aea9ee9164cb9eafd 8
4298b7e67dc399c200662b569563a2b2 7
36f60d45225e60c7da4558b070ce4b60 7
656e0eca68dcecf6a31b8ececfabe3e8 7
dbb67791e405873b259e4656bf971246 6
7dd3fec3502f7ef333da2a73e00db394 5
461f43be3bdf8844e65b62d9ac2c7a5a 5
d04e48982547095af81c231c3d581cb6 5
e0d64dcfaa3b6db5c54ca298ae101d05 5

Key Observations:

  • Some product IDs were sold by different sellers.

df_items.groupby('product_id')['seller_id'].nunique().value_counts().to_frame('products_cnt')
products_cnt
seller_id
1 31726
2 1030
3 147
4 32
5 10
7 3
8 2
6 1

Key Observations:

  • Over 1,200 products have more than one seller, and about 200 have more than two.

Examining product d285360f29ac7fd97640bf0baef03de0

df_products[lambda x: x.product_id == 'd285360f29ac7fd97640bf0baef03de0']
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
24426 d285360f29ac7fd97640bf0baef03de0 relogios_presentes 46.00 501.00 4.00 363.00 18.00 13.00 13.00
tmp_df_res = (df_items[lambda x: x.product_id == 'd285360f29ac7fd97640bf0baef03de0']
        [['shipping_limit_dt', 'price', 'freight_value', 'seller_id']]
        .merge(df_sellers, on='seller_id', how='left')
)
tmp_df_res.seller_id.unique()
array(['2eb70248d66e0e3ef83659f71b244378',
       '4869f7a5dfa277a7dca6462dcf3b52b2',
       'd650b663c3b5f6fb392b6326366efa9a',
       'b33e7c55446eabf8fe1a42d037ac7d6d',
       'fa1c13f2614d7b5c4749cbc52fecda94',
       '7d13fca15225358621be4086e1eb0964',
       '7e93a43ef30c4f03f38b393420bc753a',
       '01ed254b9ff8407dfb9d99ba1e17d923'], dtype=object)
tmp_df_res.seller_state.unique()
['SP']
Categories (23, object): ['AC', 'AM', 'BA', 'CE', ..., 'RS', 'SC', 'SE', 'SP']
tmp_df_res.seller_city.unique()
['Campinas', 'Guariba', 'Cotia', 'Pradopolis', 'Sumare', 'Ribeirao Preto', 'Barueri', 'Mogi Das Cruzes']
Categories (605, object): ['04482255', 'Abadia De Goias', 'Afonso Claudio', 'Aguas Claras Df', ..., 'Votorantim', 'Votuporanga', 'Xanxere', 'Xaxim']

Key Observations:

  • Sellers are located in different cities.

  • This might not be an anomaly - different sellers could legitimately sell identical products with matching IDs.

Checking if any products were sold across different seller states

(df_items.merge(df_sellers, on='seller_id', how='left')
 .groupby('product_id')['seller_state'].nunique().sort_values(ascending=False).head(10).to_frame('states_cnt')
)
states_cnt
product_id
46292acbc08d49b95138891146f7287b 4
da051bd767b7457572d55aba23fd846a 3
64315bd8c0c47303179dd2e25b579d00 3
04df667e73ca2fab4b9a97c86466b699 3
ab328a14d2698a726fe1ed2695f5f43c 3
b5c9804aab5c4cfd555ebfbc609de43d 3
09982590713913e225d3ce5ccb670688 3
8215db96a612d69839c27866feabf350 3
027cdd14a677a5834bc67a9789db5021 3
d41d8cd98f00b204e9800998ecf8427e 3

Key Observations:

  • Some products were sold by sellers in different states.

Verifying city consistency for customer_id in the customers table (as this is our join key)

(df_customers.groupby('customer_id')[['customer_state', 'customer_city']].nunique() > 1).sum()
customer_state    0
customer_city     0
dtype: int64

All is well…

Date Inconsistencies#

shipping_limit_dt

Analyzing anomalous shipping_limit_dt values

df_items[df_items.shipping_limit_dt > '2018-12-31'].merge(df_orders, on='order_id', how='left')
order_id order_item_id product_id seller_id shipping_limit_dt price freight_value customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
0 13bdf405f961a6deec817d817f5c6624 1 96ea060e41bdecc64e2de00b97068975 7a241947449cc45dbfda4f9d0798d9d0 2020-02-05 03:30:51 69.99 14.66 b279a1d441c73c1974d7a63618855aa0 Canceled 2017-03-16 02:30:51 2017-03-16 02:30:51 NaT NaT 2017-08-08 1 Boleto utilidades_domesticas RJ March Thursday Night Missing delivery dt Not Delivered
1 9c94a4ea2f7876660fa6f1b59b69c8e6 1 282b126b2354516c5f400154398f616d 7a241947449cc45dbfda4f9d0798d9d0 2020-02-03 20:23:22 75.99 14.70 22e922696a7d1ab9a19c6b702fedc387 Shipped 2017-03-14 19:23:22 2017-03-14 19:23:22 2017-03-16 14:31:15 NaT 2017-08-04 1 Boleto utilidades_domesticas DF March Tuesday Evening Missing delivery dt Not Delivered
2 c2bb89b5c1dd978d507284be78a04cb2 1 87b92e06b320e803d334ac23966c80b1 7a241947449cc45dbfda4f9d0798d9d0 2020-04-09 22:35:08 99.99 61.44 6357fffb5704244d552615bbfcea1442 Delivered 2017-05-23 22:28:36 2017-05-24 22:35:08 2017-05-29 02:03:28 2017-06-09 13:35:54 2017-10-11 5 Credit Card utilidades_domesticas SP May Tuesday Evening Not Delayed Delivered
3 c2bb89b5c1dd978d507284be78a04cb2 2 87b92e06b320e803d334ac23966c80b1 7a241947449cc45dbfda4f9d0798d9d0 2020-04-09 22:35:08 99.99 61.44 6357fffb5704244d552615bbfcea1442 Delivered 2017-05-23 22:28:36 2017-05-24 22:35:08 2017-05-29 02:03:28 2017-06-09 13:35:54 2017-10-11 5 Credit Card utilidades_domesticas SP May Tuesday Evening Not Delayed Delivered

Key Observations:

  • Found four order items (three distinct orders) with abnormally large shipping_limit_dt values in 2020, despite having normal estimated delivery dates.
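A simple way to surface such rows is to compare the shipping limit against the purchase date. A sketch on a toy frame (column names follow the tables above; the one-year cutoff is an arbitrary choice):

```python
import pandas as pd

# Toy stand-in for the merged items/orders frame
orders_items = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3'],
    'order_purchase_dt': pd.to_datetime(['2017-03-16', '2017-05-23', '2017-11-02']),
    'shipping_limit_dt': pd.to_datetime(['2020-02-05', '2017-05-30', '2017-11-09']),
})

# Flag rows where the shipping limit lands more than a year after purchase
gap = orders_items.shipping_limit_dt - orders_items.order_purchase_dt
suspect = orders_items[gap > pd.Timedelta(days=365)]
print(suspect.order_id.tolist())
```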

Exploring Other Anomalies#

Examining zero values

tmp_zeros = df_items.explore.detect_anomalies(
    anomaly_type='zero'
    , return_mode='by_column'
)['freight_value']
Zeros by Column
  Count Percent
freight_value 383 0.34%

Key Observations:

  • Zero freight values may indicate free shipping.

Examining sample rows

tmp_zeros.sample(5)
order_id order_item_id product_id seller_id shipping_limit_dt price freight_value
58418 84f89b4f4b956843277c6b990bd06b2f 5 53b36df67ebb7c41585e8d54d6772e08 7d13fca15225358621be4086e1eb0964 2018-05-01 23:35:09 99.90 0.00
37202 546fb5e88f96553369c04bd920b9becf 1 422879e10f46682990de24d770e7f83d 1f50f920176fa81dab994f9023523100 2018-05-07 02:15:24 53.90 0.00
97216 dc8c9fb3edc11055d2fa5198744dff2b 1 7a10781637204d8d10485c71a6108a2e 4869f7a5dfa277a7dca6462dcf3b52b2 2018-05-15 02:54:17 199.00 0.00
58271 84a649c5b7af51c49e0041fb5e3486d8 1 53b36df67ebb7c41585e8d54d6772e08 7d13fca15225358621be4086e1eb0964 2018-05-14 07:55:13 99.90 0.00
40807 5ced629739c2b3247b4fcd4f0498c3bb 1 53b36df67ebb7c41585e8d54d6772e08 4869f7a5dfa277a7dca6462dcf3b52b2 2018-04-26 19:30:44 106.90 0.00

Reviewing zero values over time

df_items.freight_value.explore.anomalies_over_time(
    time_column='shipping_limit_dt'
    , anomaly_type='zero'
    , freq='W'
)
../../_images/3ec6bbff4196711e2b38a9a85647417c7338407e50143f5e870bf0cd857cda45.jpg

Key Observations:

  • Most zero shipping costs occurred between April and July 2018.

Table df_customers#

Let’s look at the information about the dataframe.

df_customers.explore.info()
Dataframe Overview
Summary Column Types
Rows 99.44k Text 2
Features 5 Categorical 2
Missing cells --- Int 1
Exact Duplicates --- Float 0
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) 17

Initial Column Analysis#

We will examine each column individually.

customer_id

df_customers['customer_id'].explore.info(plot=False)
Summary Statistics for "customer_id" (Type: Text)
Summary Text Metrics
Total Values 99.44k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 99.44k (100%) Median Length (chars) 32.0
Non-Duplicates 99.44k (100%) Min Length (chars) 32.0
Exact Duplicates --- Most Common Length 32 (100.0%)
Fuzzy Duplicates --- Avg Digit Ratio 0.62
Values with Duplicates ---
Memory Usage 8

customer_unique_id

df_customers['customer_unique_id'].explore.info(plot=False)
Summary Statistics for "customer_unique_id" (Type: Text)
Summary Text Metrics
Total Values 99.44k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 96.10k (97%) Median Length (chars) 32.0
Non-Duplicates 93.10k (94%) Min Length (chars) 32.0
Exact Duplicates 3.35k (3%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 3.35k (3%) Avg Digit Ratio 0.63
Values with Duplicates 3.00k (3%)
Memory Usage 8

Key Observations:

  • customer_unique_id has 3% duplicates - acceptable as this field doesn’t require uniqueness in this table.
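Since customer_id is unique per order while customer_unique_id identifies the person, the duplicate share directly measures repeat buyers. A sketch on a toy frame standing in for df_customers:

```python
import pandas as pd

# Toy stand-in for df_customers: u2 placed three orders
customers = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c3', 'c4', 'c5'],
    'customer_unique_id': ['u1', 'u2', 'u2', 'u3', 'u2'],
})

# Duplicated customer_unique_id values correspond to repeat purchases
repeat_share = customers.customer_unique_id.duplicated().mean()
print(f"{repeat_share:.0%} of rows belong to returning customers")
```

On the real table this gives the ~3% figure reported above.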

customer_zip_code_prefix

df_customers['customer_zip_code_prefix'].explore.info(plot=False)
Summary Statistics for "customer_zip_code_prefix" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 99.44k (100%) Max 99.99k Mean 35.14k 22790 142 (<1%)
Missing --- 99% 97.51k Trimmed Mean (10%) 32.04k 24220 124 (<1%)
Distinct 14.99k (15%) 95% 90.55k Mode 22790 22793 121 (<1%)
Non-Duplicate 3.01k (3%) 75% 58.90k Range 98987 24230 117 (<1%)
Duplicates 84.45k (85%) 50% 24.42k IQR 47.55k 22775 110 (<1%)
Dup. Values 11.98k (12%) 25% 11.35k Std 29.80k 29101 101 (<1%)
Zeros --- 5% 3.31k MAD 24.29k 13212 95 (<1%)
Negative --- 1% 1.33k Kurt -0.79 35162 93 (<1%)
Memory Usage 1 Min 1.00k Skew 0.78 22631 89 (<1%)

customer_city

df_customers['customer_city'].explore.info()
Summary Statistics for "customer_city" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 99.44k (100%) Avg Word Count 1.8 Sao Paulo 15.54k (16%)
Missing Values --- Max Length (chars) 32.0 Rio De Janeiro 6.88k (7%)
Empty Strings --- Avg Length (chars) 10.3 Belo Horizonte 2.77k (3%)
Distinct Values 4.11k (4%) Median Length (chars) 9.0 Brasilia 2.13k (2%)
Non-Duplicates 1.14k (1%) Min Length (chars) 3.0 Curitiba 1.52k (2%)
Exact Duplicates 95.33k (96%) Most Common Length 9 (22.4%) Campinas 1.44k (1%)
Fuzzy Duplicates 95.33k (96%) Avg Digit Ratio 0.00 Porto Alegre 1.38k (1%)
Values with Duplicates 2.97k (3%) Salvador 1.25k (1%)
Memory Usage 1 Guarulhos 1.19k (1%)
../../_images/f4dde73104284ed247853dcd7e2dc43824c017639332dfedf5bee340b89c3843.jpg

Key Observations:

  • Most customers are from São Paulo city (16%).

customer_state

df_customers['customer_state'].explore.info()
Summary Statistics for "customer_state" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 99.44k (100%) Avg Word Count 1.0 SP 41.75k (42%)
Missing Values --- Max Length (chars) 2.0 RJ 12.85k (13%)
Empty Strings --- Avg Length (chars) 2.0 MG 11.63k (12%)
Distinct Values 27 (<1%) Median Length (chars) 2.0 RS 5.47k (5%)
Non-Duplicates --- Min Length (chars) 2.0 PR 5.04k (5%)
Exact Duplicates 99.41k (99%) Most Common Length 2 (100.0%) SC 3.64k (4%)
Fuzzy Duplicates 99.41k (99%) Avg Digit Ratio 0.00 BA 3.38k (3%)
Values with Duplicates 27 (<1%) DF 2.14k (2%)
Memory Usage <1 Mb ES 2.03k (2%)
../../_images/3289c0b8f57736f38ef1ab1c289c308c251439588a17db923ba921ca97713363.jpg

Key Observations:

  • Most customers are from SP state (42%).

Table df_reviews#

Let’s look at the information about the dataframe.

df_reviews.explore.info()
Dataframe Overview
Summary Column Types
Rows 100.00k Text 3
Features 7 Categorical 1
Missing cells 146.56k (21%) Int 1
Exact Duplicates --- Float 0
Fuzzy Duplicates --- Datetime 2
Memory Usage (Mb) 28

Initial Column Analysis#

We will examine each column individually.

review_id

df_reviews['review_id'].explore.info(plot=False)
Summary Statistics for "review_id" (Type: Text)
Summary Text Metrics
Total Values 100.00k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 99.17k (99%) Median Length (chars) 32.0
Non-Duplicates 98.37k (98%) Min Length (chars) 32.0
Exact Duplicates 827 (<1%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 827 (<1%) Avg Digit Ratio 0.62
Values with Duplicates 802 (<1%)
Memory Usage 8

Key Observations:

  • review_id contains 827 duplicates.

order_id

df_reviews['order_id'].explore.info(plot=False)
Summary Statistics for "order_id" (Type: Text)
Summary Text Metrics
Total Values 100.00k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 99.44k (99%) Median Length (chars) 32.0
Non-Duplicates 98.89k (99%) Min Length (chars) 32.0
Exact Duplicates 559 (<1%) Most Common Length 32 (100.0%)
Fuzzy Duplicates 559 (<1%) Avg Digit Ratio 0.63
Values with Duplicates 555 (<1%)
Memory Usage 8

Key Observations:

  • order_reviews table has 559 duplicate order_ids.

review_score

df_reviews['review_score'].explore.info(column_type='categorical')
Summary Statistics for "review_score" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 100.00k (100%) Avg Word Count 1.0 5 57.42k (57%)
Missing Values --- Max Length (chars) 1.0 4 19.20k (19%)
Empty Strings --- Avg Length (chars) 1.0 1 11.86k (12%)
Distinct Values 5 (<1%) Median Length (chars) 1.0 3 8.29k (8%)
Non-Duplicates --- Min Length (chars) 1.0 2 3.23k (3%)
Exact Duplicates 100.00k (99%) Most Common Length 1 (100.0%)
Fuzzy Duplicates 100.00k (99%) Avg Digit Ratio 1.00
Values with Duplicates 5 (<1%)
Memory Usage 5
../../_images/32d16edaa523da1fb93b769ef5e519b61de675dbe0fd28b4d629463df8d5b998.jpg

Key Observations:

  • Over half of reviews (57%) give the maximum 5-star rating.

review_comment_title

df_reviews['review_comment_title'].explore.info(column_type='text')
Summary Statistics for "review_comment_title" (Type: Categorical)
Summary Text Metrics
Total Values 100.00k (100%) Avg Word Count 2.1
Missing Values 88.29k (88%) Max Length (chars) 43.0
Empty Strings 30 (<1%) Avg Length (chars) 12.3
Distinct Values 2.98k (3%) Median Length (chars) 11.0
Non-Duplicates 2.39k (2%) Min Length (chars) 0.0
Exact Duplicates 97.02k (97%) Most Common Length 9 (1.9%)
Fuzzy Duplicates 97.03k (97%) Avg Digit Ratio 0.03
Values with Duplicates 585 (<1%)
Memory Usage 5
../../_images/4d83670e8103d59d219d16010ec9f5a66a435057adf08bb9e7e621456990355a.jpg

Key Observations:

  • 88% of review titles are missing.

  • Most common review title (8%) is ‘recomendo’.

review_comment_message

df_reviews['review_comment_message'].explore.info() 
Summary Statistics for "review_comment_message" (Type: Text)
Summary Text Metrics
Total Values 100.00k (100%) Avg Word Count 13.3
Missing Values 58.27k (58%) Max Length (chars) 347.0
Empty Strings 114 (<1%) Avg Length (chars) 71.3
Distinct Values 33.49k (33%) Median Length (chars) 54.0
Non-Duplicates 32.03k (32%) Min Length (chars) 0.0
Exact Duplicates 66.51k (67%) Most Common Length 9 (1.4%)
Fuzzy Duplicates 66.53k (67%) Avg Digit Ratio 0.01
Values with Duplicates 1.46k (1%)
Memory Usage 8
../../_images/77377124bbfa5baa69f71f2a86c7db07625f7e965a4e7fce9991fc856347f77f.jpg

Key Observations:

  • 58% of reviews lack a comment message.

  • Only 36% of review comments are unique.

  • Most frequent comment (1%) contains “muito bom”.

review_creation_dt

df_reviews['review_creation_dt'].explore.info()
Summary Statistics for "review_creation_dt" (Type: Datetime)
Summary Data Quality Stats Temporal Stats
First date 2016-10-02 Values 100.00k (100%) Missing Years ---
Last date 2018-08-31 Zeros --- Missing Months ---
Avg Days Frequency 0.01 Missings --- Missing Weeks ---
Min Days Interval 0 Distinct 637 (<1%) Missing Days 62 (9%)
Max Days Interval 9 Duplicates 99.36k (99%) Weekend Percentage 24.5%
Memory Usage 1 Dup. Values 608 (<1%) Most Common Weekday Wednesday

Key Observations:

  • review_creation_dt has 9% missing days.

review_answer_dt

df_reviews['review_answer_dt'].explore.info()
Summary Statistics for "review_answer_dt" (Type: Datetime)
Summary Data Quality Stats Temporal Stats
First date 2016-10-07 Values 100.00k (100%) Missing Years ---
Last date 2018-10-29 Zeros --- Missing Months ---
Avg Days Frequency 0.01 Missings --- Missing Weeks ---
Min Days Interval 0 Distinct 99.01k (99%) Missing Days 37 (5%)
Max Days Interval 6 Duplicates 990 (<1%) Weekend Percentage 25.5%
Memory Usage 1 Dup. Values 959 (<1%) Most Common Weekday Friday

Key Observations:

  • review_answer_dt has 5% missing days.

Exploring Missing Values#

Checking columns with missing values:

df_reviews.explore.anomalies_report(
    anomaly_type='missing'
    , pct_diff_threshold=10
    , show_by_categories=False
    , show_sample=False
    , width=600
)
Missings by Column
  Count Percent
review_comment_title 88289 88.3%
review_comment_message 58275 58.3%
Co-occurring Missings (Pairwise)
  review_comment_title review_comment_message
review_comment_title
review_comment_message < 97.0% / ^ 64.0%
../../_images/e85824bac939ccd5025001bc4e13d6feee5455e5cc45d6f1d705b60f8ff5b5bd.jpg

Key Observations:

  • Missing review titles/messages aren’t anomalies - they were simply not provided.

Exploring Duplicates#

Examining duplicates in order_id and review_id:

df_reviews[['order_id', 'review_id']].duplicated().sum()
np.int64(0)

No instances where both order_id and review_id are duplicated simultaneously.

Theoretical possibility: one order could have multiple reviews, but multiple orders sharing one review is unusual.
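Both directions can be measured separately with duplicated(). A toy sketch:

```python
import pandas as pd

# Toy reviews frame: r2 covers two orders, while order o3 has two reviews
reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r2', 'r3', 'r4'],
    'order_id':  ['o1', 'o2', 'o5', 'o3', 'o3'],
})

# Pairs are never duplicated together, but each key can repeat on its own
assert not reviews[['order_id', 'review_id']].duplicated().any()
one_review_many_orders = reviews.review_id.duplicated().sum()
many_reviews_one_order = reviews.order_id.duplicated().sum()
print(one_review_many_orders, many_reviews_one_order)
```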

Analyzing order_id and review_id duplicates separately

review_id

tmp_dupl = df_reviews[df_reviews.review_id.duplicated()]

Reviewing review_id duplicate distribution over time

df_reviews.review_id.explore.anomalies_over_time(
    time_column='review_creation_dt'
    , anomaly_type='duplicate'
    , freq='W'
)
../../_images/3befeb46bb86c111f3f3d4591c1cd941d5db96c9c06a097379367c4e56c0970e.jpg

Key Observations:

  • March 2018 saw a significant spike in duplicate review_ids (one review applied to multiple orders).

Checking for duplicates with different customers

tmp_dupl = (
    tmp_dupl.merge(df_orders, on='order_id', how='left')
    .merge(df_customers, on='customer_id', how='left')
)
tmp_dupl.groupby('review_id')['customer_unique_id'].nunique().value_counts()
customer_unique_id
1    802
Name: count, dtype: int64

Key Observations:

  • No duplicates with different customers.

Examining product quantities in these orders

(
    tmp_dupl.merge(df_items, on='order_id', how='left')
    .groupby('order_id')['product_id']
    .nunique()
    .value_counts()
)
product_id
1    700
2     41
0     33
3      5
4      3
Name: count, dtype: int64

Key Observations:

  • Most orders contain one product.

  • Some orders show no products (due to missing records in the items table).

tmp_dupl = (tmp_dupl.merge(df_payments, on='order_id', how='left')
            .merge(df_items, on='order_id', how='left')
)
tmp_dupl = tmp_dupl[['review_id', 'order_id', 'review_score', 'review_comment_title', 'review_comment_message'
                   , 'review_creation_dt', 'order_delivered_customer_dt', 'order_status', 'payment_type'
                   , 'payment_value', 'product_id', 'price', 'freight_value']].sort_values('review_id').drop_duplicates()
tmp_dupl.head()
review_id order_id review_score review_comment_title review_comment_message review_creation_dt order_delivered_customer_dt order_status payment_type payment_value product_id price freight_value
214 00130cbe1f9d422698c812ed8ded1919 dfcdfc43867d1c1381bfaf62d6b9c195 1 NaN the original hp 60xl cartridge is not recogniz... 2018-03-07 2018-02-26 18:18:45 Delivered Credit Card 115.05 36f60d45225e60c7da4558b070ce4b60 89.15 25.90
827 0115633a9c298b6a98bcbe4eee75345f 78a4201f58af3463bdab842eea4bc801 5 NaN NaN 2017-09-21 2017-09-20 22:24:03 Delivered Credit Card 325.41 7814c273ab16783d73a9863ebfa8b141 229.00 96.41
865 0174caf0ee5964646040cd94e15ac95e f93a732712407c02dce5dd5088d0f47b 1 NaN product delivered in the supplier s packaging ... 2018-03-07 2018-03-06 21:39:06 Delivered Credit Card 73.43 aed5efaf7a7554165f61fd016846916f 61.50 11.93
979 017808d29fd1f942d97e50184dfb4c13 b1461c8882153b5fe68307c46a506e39 5 NaN NaN 2018-03-02 2018-03-01 17:52:00 Delivered Credit Card 45.17 d5d9e871a07a5ba84cef719ad92f0ab9 35.90 9.27
930 0254bd905dc677a6078990aad3331a36 331b367bdd766f3d1cf518777317b5d9 1 NaN the order consists of 2 products and so far i ... 2017-09-09 2017-09-14 20:22:47 Delivered Credit Card 283.20 7814c273ab16783d73a9863ebfa8b141 229.00 54.20

Comparing duplicate values across columns (replacing missing values with the placeholder __na__):

(tmp_dupl.fillna({'review_comment_message': '__na__'})
 .groupby('review_id')
 [['review_comment_message', 'review_score', 'order_status', 'payment_type', 'payment_value', 'product_id', 'price']]
 .nunique()
 .apply(pd.Series.value_counts)
)
review_comment_message review_score order_status payment_type payment_value product_id price
0 NaN NaN NaN NaN NaN 26.00 26.00
1 802.00 802.00 796.00 784.00 771.00 712.00 724.00
2 NaN NaN 6.00 18.00 23.00 55.00 48.00
3 NaN NaN NaN NaN 4.00 5.00 4.00
4 NaN NaN NaN NaN 3.00 4.00 NaN
7 NaN NaN NaN NaN 1.00 NaN NaN

Key Observations:

  • Identical reviews were left for different orders with matching ratings and descriptions, but varying products/prices.

  • This is unusual - could indicate bulk reviews for multiple orders or data collection errors.


order_id

Analyzing order_id duplicate distribution over time

tmp_dupl = df_reviews[df_reviews.order_id.duplicated()]
df_reviews.order_id.explore.anomalies_over_time(
    time_column='review_creation_dt'
    , anomaly_type='duplicate'
    , freq='W'
)
../../_images/d8e99508b01913a44d52fe1bdfacecb93a92c8c8c3bd125ab443bc945c333e31.jpg

Key Observations:

  • March 2018 showed a spike in order_id duplicates (multiple reviews for single orders).

tmp_dupl = (tmp_dupl.merge(df_orders, on='order_id', how='left')
            .merge(df_customers, on='customer_id', how='left')
)
tmp_dupl = tmp_dupl[['order_id', 'review_id', 'review_score', 'review_comment_title', 'review_comment_message'
                   , 'review_creation_dt', 'order_delivered_customer_dt', 'order_status']].sort_values('order_id').drop_duplicates()
tmp_dupl.head(10)
order_id review_id review_score review_comment_title review_comment_message review_creation_dt order_delivered_customer_dt order_status
34 0035246a40f520710769010f752e7507 89a02c45c340aeeb1354a24e7d4b2c1e 5 NaN NaN 2017-08-29 2017-08-28 19:14:03 Delivered
251 013056cfe49763c6f66bda03396c5ee3 73413b847f63e02bc752b364f6d05ee9 4 NaN NaN 2018-03-04 2018-03-03 12:09:00 Delivered
377 0176a6846bcb3b0d3aa3116a9a768597 d8e8c42271c8fb67b9dad95d98c8ff80 5 NaN NaN 2017-12-30 2017-12-29 21:22:35 Delivered
446 02355020fd0a40a0d56df9f6ff060413 0c8e7347f1cdd2aede37371543e3d163 3 NaN one of the products delivery02 purchased in th... 2018-03-21 2018-04-10 14:48:49 Delivered
159 029863af4b968de1e5d6a82782e662f5 61fe4e7d1ae801bbe169eb67b86c6eda 4 NaN NaN 2017-07-19 2017-07-13 20:42:36 Delivered
65 02e0b68852217f5715fb9cc885829454 fa493ead9b093fb0fa6f7d4905b0ef3b 4 NaN i liked it and delivered quickly 2017-12-01 2017-11-30 18:19:22 Delivered
92 02e723e8edb4a123d414f56cc9c4665e 39b4603793c1c7f5f36d809b4a218664 5 NaN excellent 2017-09-01 NaT Canceled
128 03515a836bb855b03f7df9dee520a8fc 5e1ee2e924a1dd10d80b99555383c0a7 5 NaN NaN 2018-01-31 2018-02-05 14:33:00 Delivered
134 03c939fd7fd3b38f8485a0f95798f1f6 f4bb9d6dd4fb6dcc2298f0e7b17b8e1e 4 NaN NaN 2018-03-29 2018-03-19 21:48:52 Delivered
254 03c939fd7fd3b38f8485a0f95798f1f6 405eb2ea45e1dbe2662541ae5b47e2aa 3 NaN it would be great if they had delivered the 3 ... 2018-03-06 2018-03-19 21:48:52 Delivered

Key Observations:

  • Customers sometimes left multiple reviews per order (e.g., one pre-delivery and one post-delivery) - not necessarily anomalous.

Let’s examine how many duplicates share identical values across different columns.

For text fields, we’ll replace missing values with the placeholder __na__.

(tmp_dupl.fillna({'review_comment_message': '__na__'})
 .groupby('order_id')
 [['review_comment_message', 'review_score', 'order_status']]
 .nunique()
 .apply(pd.Series.value_counts)
)
review_comment_message review_score order_status
1 553 554 555.00
2 2 1 NaN

Secondary Review Rating Comparison

We’ll analyze whether follow-up reviews for the same order had higher or lower ratings:

  • Compare each order’s first rating with the mean of all its ratings.

  • If the first rating is below the mean, at least one follow-up rating was higher.

(tmp_dupl.sort_values(['order_id', 'review_creation_dt'])
 .groupby('order_id')
 .agg(
     first_review_score = ('review_score', 'first')
     , mean_review_score = ('review_score', 'mean')
 )
 .assign(
     is_first_less_mean = lambda x: x.first_review_score < x.mean_review_score
 )
 ['is_first_less_mean']
 .value_counts()
)
is_first_less_mean
False    554
True       1
Name: count, dtype: int64

Key Observations:

  • Follow-up reviews almost always repeated the initial rating; in the single order where the scores differed, the later review was higher.

del tmp_dupl

Table df_products#

Let’s look at the information about the dataframe.

df_products.explore.info()
Dataframe Overview
Summary Column Types
Rows 32.95k Text 1
Features 9 Categorical 1
Missing cells 2.45k (<1%) Int 7
Exact Duplicates --- Float 0
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) 4

Initial Column Analysis#

We will examine each column individually.

product_id

df_products['product_id'].explore.info(plot=False)
Summary Statistics for "product_id" (Type: Text)
Summary Text Metrics
Total Values 32.95k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 32.95k (100%) Median Length (chars) 32.0
Non-Duplicates 32.95k (100%) Min Length (chars) 32.0
Exact Duplicates --- Most Common Length 32 (100.0%)
Fuzzy Duplicates --- Avg Digit Ratio 0.63
Values with Duplicates ---
Memory Usage 3

product_category_name

df_products['product_category_name'].explore.info()
Summary Statistics for "product_category_name" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 32.95k (100%) Avg Word Count 1.0 cama_mesa_banho 3.03k (9%)
Missing Values 610 (2%) Max Length (chars) 46.0 esporte_lazer 2.87k (9%)
Empty Strings --- Avg Length (chars) 15.0 moveis_decoracao 2.66k (8%)
Distinct Values 73 (<1%) Median Length (chars) 15.0 beleza_saude 2.44k (7%)
Non-Duplicates 1 (<1%) Min Length (chars) 3.0 utilidades_domesticas 2.33k (7%)
Exact Duplicates 32.88k (99%) Most Common Length 10.0 (15.1%) automotivo 1.90k (6%)
Fuzzy Duplicates 32.88k (99%) Avg Digit Ratio 0.00 informatica_acessorios 1.64k (5%)
Values with Duplicates 72 (<1%) brinquedos 1.41k (4%)
Memory Usage <1 Mb relogios_presentes 1.33k (4%)
../../_images/9d0fbab5f468f9d1ef7421c359f03c3198f1739485beed944559cb168acbf4ba.jpg

Key Observations:

  • product_category_name contains 2% missing values.

  • The dataset contains 73 distinct product categories.

product_name_lenght

df_products['product_name_lenght'].explore.info()
Summary Statistics for "product_name_lenght" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.34k (98%) Max 76 Mean 48.48 60 2.18k (7%)
Missing 610 (2%) 99% 63 Trimmed Mean (10%) 49.63 59 2.02k (6%)
Distinct 66 (<1%) 95% 60 Mode 60 58 1.89k (6%)
Non-Duplicate 7 (<1%) 75% 57 Range 71 57 1.72k (5%)
Duplicates 32.88k (99%) 50% 51 IQR 15 55 1.68k (5%)
Dup. Values 59 (<1%) 25% 42 Std 10.25 56 1.68k (5%)
Zeros --- 5% 29 MAD 10.38 54 1.44k (4%)
Negative --- 1% 20 Kurt 0.19 53 1.33k (4%)
Memory Usage <1 Mb Min 5 Skew -0.90 52 1.26k (4%)

Key Observations:

  • product_name_lenght has 2% missing values

  • Maximum product name length: 76 characters

product_description_lenght

df_products['product_description_lenght'].explore.info()
Summary Statistics for "product_description_lenght" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.34k (98%) Max 3.99k Mean 771.50 404 94 (<1%)
Missing 610 (2%) 99% 3.29k Trimmed Mean (10%) 662.91 729 86 (<1%)
Distinct 2.96k (9%) 95% 2.06k Mode 404 651 66 (<1%)
Non-Duplicate 669 (2%) 75% 972 Range 3.99k 703 66 (<1%)
Duplicates 29.99k (91%) 50% 595 IQR 633 236 65 (<1%)
Dup. Values 2.29k (7%) 25% 339 Std 635.12 184 65 (<1%)
Zeros --- 5% 150 MAD 434.40 303 63 (<1%)
Negative --- 1% 84 Kurt 4.83 352 62 (<1%)
Memory Usage <1 Mb Min 4 Skew 1.96 375 60 (<1%)

Key Observations:

  • There are 2% missing values in product_description_lenght.

  • The maximum length of the product description is 3.99k characters.

  • The minimum length of the product description is 4 characters.

product_photos_qty

df_products['product_photos_qty'].explore.info(column_type='categorical')
Summary Statistics for "product_photos_qty" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 32.95k (100%) Avg Word Count 1.0 1.0 16.49k (50%)
Missing Values 610 (2%) Max Length (chars) 4.0 2.0 6.26k (19%)
Empty Strings --- Avg Length (chars) 3.0 3.0 3.86k (12%)
Distinct Values 19 (<1%) Median Length (chars) 3.0 4.0 2.43k (7%)
Non-Duplicates 2 (<1%) Min Length (chars) 3.0 5.0 1.48k (5%)
Exact Duplicates 32.93k (99%) Most Common Length 3 (97.5%) 6.0 968 (3%)
Fuzzy Duplicates 32.93k (99%) Avg Digit Ratio 0.67 7.0 343 (1%)
Values with Duplicates 17 (<1%) 8.0 192 (<1%)
Memory Usage 2 9.0 105 (<1%)

Key Observations:

  • In product_photos_qty, 2% of values are missing.

  • The maximum number of photos for a single product is 20.

  • 50% of products have 1 photo.

product_weight_g

df_products['product_weight_g'].explore.info()
Summary Statistics for "product_weight_g" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.95k (99%) Max 40.42k Mean 2.28k 200 2.08k (6%)
Missing 2 (<1%) 99% 22.54k Trimmed Mean (10%) 1.20k 300 1.56k (5%)
Distinct 2.20k (7%) 95% 10.85k Mode 200 150 1.26k (4%)
Non-Duplicate 1.03k (3%) 75% 1.90k Range 40.42k 400 1.21k (4%)
Duplicates 30.75k (93%) 50% 700 IQR 1.60k 100 1.19k (4%)
Dup. Values 1.17k (4%) 25% 300 Std 4.28k 500 1.11k (3%)
Zeros 4 (<1%) 5% 105 MAD 741.30 250 1.00k (3%)
Negative --- 1% 60 Kurt 15.13 600 957 (3%)
Memory Usage <1 Mb Min 0 Skew 3.60 350 832 (3%)

Key Observations:

  • In product_weight_g, there are 2 missing values.

  • In product_weight_g, there are 4 zero values.

  • The maximum product weight is 40.42k grams.

  • The product weight of 40.42k grams is clearly an outlier.

product_length_cm

df_products['product_length_cm'].explore.info()
Summary Statistics for "product_length_cm" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.95k (99%) Max 105 Mean 30.82 16 5.52k (17%)
Missing 2 (<1%) 99% 100 Trimmed Mean (10%) 27.79 20 2.82k (9%)
Distinct 99 (<1%) 95% 65 Mode 16 30 2.03k (6%)
Non-Duplicate 1 (<1%) 75% 38 Range 98 18 1.50k (5%)
Duplicates 32.85k (99%) 50% 25 IQR 20 25 1.39k (4%)
Dup. Values 98 (<1%) 25% 18 Std 16.91 17 1.31k (4%)
Zeros --- 5% 16 MAD 11.86 19 1.27k (4%)
Negative --- 1% 16 Kurt 3.51 40 1.22k (4%)
Memory Usage <1 Mb Min 7 Skew 1.75 22 972 (3%)

Key Observations:

  • In product_length_cm, there are 2 missing values.

  • The maximum product length is 105 cm. The minimum is 7 cm. The median is 25 cm.

product_height_cm

df_products['product_height_cm'].explore.info()
Summary Statistics for "product_height_cm" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.95k (99%) Max 105 Mean 16.94 10 2.55k (8%)
Missing 2 (<1%) 99% 69 Trimmed Mean (10%) 14.74 15 2.02k (6%)
Distinct 102 (<1%) 95% 44 Mode 10 20 1.99k (6%)
Non-Duplicate 3 (<1%) 75% 21 Range 103 16 1.59k (5%)
Duplicates 32.85k (99%) 50% 13 IQR 13 11 1.55k (5%)
Dup. Values 99 (<1%) 25% 8 Std 13.64 5 1.53k (5%)
Zeros --- 5% 3 MAD 8.90 12 1.52k (5%)
Negative --- 1% 2 Kurt 6.68 8 1.47k (4%)
Memory Usage <1 Mb Min 2 Skew 2.14 2 1.36k (4%)

Key Observations:

  • In product_height_cm, there are 2 missing values.

  • The maximum product height is 105 cm. The minimum is 2 cm. The median is 13 cm.

product_width_cm

df_products['product_width_cm'].explore.info()
Summary Statistics for "product_width_cm" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 32.95k (99%) Max 118 Mean 23.20 11 3.72k (11%)
Missing 2 (<1%) 99% 63 Trimmed Mean (10%) 21.39 20 3.05k (9%)
Distinct 95 (<1%) 95% 47 Mode 11 16 2.81k (9%)
Non-Duplicate 9 (<1%) 75% 30 Range 112 15 2.39k (7%)
Duplicates 32.85k (99%) 50% 20 IQR 15 30 1.79k (5%)
Dup. Values 86 (<1%) 25% 15 Std 12.08 12 1.54k (5%)
Zeros --- 5% 11 MAD 8.90 25 1.33k (4%)
Negative --- 1% 11 Kurt 4.07 14 1.26k (4%)
Memory Usage <1 Mb Min 6 Skew 1.67 13 1.13k (3%)

Key Observations:

  • In product_width_cm, there are 2 missing values.

  • The maximum product width is 118 cm. The minimum is 6 cm. The median is 20 cm.

Exploring Missing Values#

Let’s see which columns have missing values.

df_products.explore.anomalies_report(
    anomaly_type='missing'
    , pct_diff_threshold=10
)
Missings by Column
  Count Percent
product_category_name 610 1.9%
product_name_lenght 610 1.9%
product_description_lenght 610 1.9%
product_photos_qty 610 1.9%
product_weight_g 2 0.0%
product_length_cm 2 0.0%
product_height_cm 2 0.0%
product_width_cm 2 0.0%
Co-occurring Missings (Pairwise)
  product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
product_category_name
product_name_lenght < 100.0% / ^ 100.0%
product_description_lenght < 100.0% / ^ 100.0% < 100.0% / ^ 100.0%
product_photos_qty < 100.0% / ^ 100.0% < 100.0% / ^ 100.0% < 100.0% / ^ 100.0%
product_weight_g < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1%
product_length_cm < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 100.0% / ^ 100.0%
product_height_cm < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 100.0% / ^ 100.0% < 100.0% / ^ 100.0%
product_width_cm < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 50.0% / ^ under 1% < 100.0% / ^ 100.0% < 100.0% / ^ 100.0% < 100.0% / ^ 100.0%
Missings distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
product_category_name nan 610 610 100.0% 1.9% 99.8% 98.0%
Sample Missings Rows
  product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
23260 96aca2f53bcaed6f466449f7fb18ae75 nan nan nan nan 400.0 16.0 16.0 16.0
428 629beb8e7317703dcc5f35b5463fd20e nan nan nan nan 1400.0 25.0 25.0 25.0
14367 f18333439cbbfd0ae30acf09ae09c0f0 nan nan nan nan 23300.0 56.0 48.0 40.0
16816 f55cd6a68f96c3041176998ce1850ff3 nan nan nan nan 200.0 20.0 5.0 20.0
18850 1e9e02777a02de03c17eb432be7daf71 nan nan nan nan 200.0 16.0 5.0 15.0

Let’s hypothesize that missing values in the following columns are in the same rows:

  • product category name

  • product name length

  • product description length

  • number of product photos

df_products.explore.detect_simultaneous_anomalies(['product_category_name', 'product_name_lenght', 'product_description_lenght', 'product_photos_qty'])
Simultaneous Missing Anomalies Analysis
Column Individual Anomalies % of Total % in Simultaneous
product_category_name 610 1.85% 100.00%
product_name_lenght 610 1.85% 100.00%
product_description_lenght 610 1.85% 100.00%
product_photos_qty 610 1.85% 100.00%

Key Observations:

  • Missing values in product category name, product name length, product description length, and number of product photos are in the same rows.

Let’s hypothesize that missing values in the following columns are in the same rows:

  • product length

  • product width

  • product height

  • product weight

df_products.explore.detect_simultaneous_anomalies(['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm'])
Simultaneous Missing Anomalies Analysis
Column Individual Anomalies % of Total % in Simultaneous
product_weight_g 2 0.01% 100.00%
product_length_cm 2 0.01% 100.00%
product_height_cm 2 0.01% 100.00%
product_width_cm 2 0.01% 100.00%

Key Observations:

  • Missing values in product length, width, height, and weight are located in the same rows.
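The same co-occurrence check can be reproduced with plain pandas. This is a sketch on a hypothetical miniature of `df_products`: if every row is either fully present or fully missing across the group of columns, the per-row count of missing cells can only be 0 or the group size.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df_products:
# two rows miss all four dimension columns together
df_products = pd.DataFrame({
    'product_weight_g':  [500, np.nan, 200, np.nan],
    'product_length_cm': [30,  np.nan, 16,  np.nan],
    'product_height_cm': [10,  np.nan, 5,   np.nan],
    'product_width_cm':  [20,  np.nan, 15,  np.nan],
})

dims = ['product_weight_g', 'product_length_cm',
        'product_height_cm', 'product_width_cm']

# Count missing cells per row across the group
row_missing = df_products[dims].isna().sum(axis=1)

# True only if missings are strictly simultaneous (all-or-nothing per row)
print(row_missing.isin([0, len(dims)]).all())
```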

Exploring Outliers#

df_products.explore.anomalies_report(
    anomaly_type='outlier'
)
Outliers by Column
  Count Percent
product_name_lenght 2322 7.0%
product_description_lenght 3212 9.7%
product_photos_qty 849 2.6%
product_weight_g 3279 10.0%
product_length_cm 1694 5.1%
product_height_cm 2956 9.0%
product_width_cm 1616 4.9%
Co-occurring Outliers (Pairwise) (method: quantile, threshold: 0.05)
  product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
product_name_lenght
product_description_lenght < 8.2% / ^ 11.3%
product_photos_qty < 13.9% / ^ 5.1% < 13.1% / ^ 3.5%
product_weight_g < 7.2% / ^ 10.1% < 11.8% / ^ 12.1% < 2.7% / ^ 10.5%
product_length_cm < 7.5% / ^ 5.5% < 11.2% / ^ 5.9% < 3.7% / ^ 7.4% < 22.0% / ^ 11.3%
product_height_cm < 7.1% / ^ 9.0% < 7.6% / ^ 7.0% < 3.6% / ^ 12.6% < 30.7% / ^ 27.7% < 5.9% / ^ 10.3%
product_width_cm < 8.5% / ^ 5.9% < 13.7% / ^ 6.9% < 2.7% / ^ 5.1% < 37.6% / ^ 18.5% < 30.3% / ^ 28.9% < 18.3% / ^ 10.0%
Outliers distribution across categories (method: quantile, threshold: 0.05)
Column Category % Diff
Sample Outliers Rows
  product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
18231 b29c9ac1d3d4e8866f6c4ad814ae9591 beleza_saude 18.0 1177.0 1.0 500.0 31.0 11.0 23.0
15521 c413fa08b405af6edb9d49e85e241c08 telefonia 37.0 326.0 2.0 100.0 18.0 4.0 12.0
14354 975c79322e3ae1a967d8f5b82608ffd7 livros_importados 52.0 502.0 1.0 500.0 30.0 2.0 20.0
15381 e2e9a22be498ad23b97170cbc20c477c automotivo 39.0 360.0 1.0 14350.0 63.0 63.0 21.0
5751 fe31b2791161c24e921f90788d443e1c consoles_games 60.0 2462.0 5.0 1100.0 22.0 13.0 18.0

Key Observations:

  • The proportion of outliers in the number of photos, length, and width of products is within normal limits.

  • The proportion of outliers in product weight and height exceeds the norm (usually 5%), but is not critical.

Exploring Other Anomalies#

Examining zero values

df_products.explore.anomalies_report(
    anomaly_type='zero'
)
Zeros by Column
  Count Percent
product_weight_g 4 0.0%
Not enough columns with zero to analyze combinations
Zeros distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
product_category_name cama_mesa_banho 3029 4 0.1% 9.2% 100.0% 90.8%
Sample Zeros Rows
  product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
13683 8038040ee2a71048d4bdbbdc985b69ab cama_mesa_banho 48.0 528.0 1.0 0.0 30.0 25.0 30.0
32079 e673e90efa65a5409ff4196c038bb5af cama_mesa_banho 53.0 528.0 1.0 0.0 30.0 25.0 30.0
9769 81781c0fed9fe1ad6e8c81fca1e1cb08 cama_mesa_banho 51.0 529.0 1.0 0.0 30.0 25.0 30.0
14997 36ba42dd187055e1fbe943b2d11430ca cama_mesa_banho 53.0 528.0 1.0 0.0 30.0 25.0 30.0
Not enough columns with anomalies for correlation analysis

Key Observations:

  • All 4 products with zero weight belong to the category cama_mesa_banho (home textiles).

Table df_categories#

Let’s look at the information about the dataframe.

df_categories.explore.info()
Dataframe Overview
Summary Column Types
Rows 71 Text 0
Features 2 Categorical 2
Missing cells --- Int 0
Exact Duplicates --- Float 0
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) <1 Mb

Initial Column Analysis#

We will examine each column individually.

product_category_name

df_categories['product_category_name'].explore.info(plot=False)
Summary Statistics for "product_category_name" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 71 (100%) Avg Word Count 1.0 agro_industria_e_comercio 1 (1%)
Missing Values --- Max Length (chars) 46.0 instrumentos_musicais 1 (1%)
Empty Strings --- Avg Length (chars) 17.0 market_place 1 (1%)
Distinct Values 71 (100%) Median Length (chars) 16.0 malas_acessorios 1 (1%)
Non-Duplicates 71 (100%) Min Length (chars) 3.0 livros_tecnicos 1 (1%)
Exact Duplicates --- Most Common Length 15 (9.9%) livros_interesse_geral 1 (1%)
Fuzzy Duplicates --- Avg Digit Ratio 0.00 livros_importados 1 (1%)
Values with Duplicates --- la_cuisine 1 (1%)
Memory Usage <1 Mb informatica_acessorios 1 (1%)

Key Observations:

  • The df_categories table has 71 unique product categories, while the products table has 73.
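The two categories without a translation can be found with a plain set difference between the two columns. A sketch on hypothetical miniatures of the tables (the sample category names are illustrative, not taken from the report):

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two tables
df_products = pd.DataFrame({'product_category_name': [
    'cama_mesa_banho', 'pc_gamer', 'esporte_lazer',
    'portateis_cozinha_e_preparadores_de_alimentos', None,
]})
df_categories = pd.DataFrame({'product_category_name': [
    'cama_mesa_banho', 'esporte_lazer',
]})

# Categories used by products but absent from the translation table
missing_translation = (
    set(df_products['product_category_name'].dropna())
    - set(df_categories['product_category_name'])
)
print(sorted(missing_translation))
```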

product_category_name_english

df_categories['product_category_name_english'].explore.info(plot=False)
Summary Statistics for "product_category_name_english" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 71 (100%) Avg Word Count 2.3 Agro Industry And Commerce 1 (1%)
Missing Values --- Max Length (chars) 39.0 Home Appliances 2 1 (1%)
Empty Strings --- Avg Length (chars) 16.1 Kitchen Dining Laundry Garden Furniture 1 (1%)
Distinct Values 71 (100%) Median Length (chars) 15.0 Industry Commerce And Business 1 (1%)
Non-Duplicates 71 (100%) Min Length (chars) 3.0 Housewares 1 (1%)
Exact Duplicates --- Most Common Length 10 (8.5%) Home Construction 1 (1%)
Fuzzy Duplicates --- Avg Digit Ratio 0.00 Home Confort 1 (1%)
Values with Duplicates --- Home Comfort 2 1 (1%)
Memory Usage <1 Mb Home Appliances 1 (1%)

Table df_sellers#

Let’s look at the information about the dataframe.

df_sellers.explore.info()
Dataframe Overview
Summary Column Types
Rows 3.10k Text 1
Features 4 Categorical 2
Missing cells --- Int 1
Exact Duplicates --- Float 0
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) <1 Mb

Initial Column Analysis#

We will examine each column individually.

seller_id

df_sellers['seller_id'].explore.info(plot=False)
Summary Statistics for "seller_id" (Type: Text)
Summary Text Metrics
Total Values 3.10k (100%) Avg Word Count 1.0
Missing Values --- Max Length (chars) 32.0
Empty Strings --- Avg Length (chars) 32.0
Distinct Values 3.10k (100%) Median Length (chars) 32.0
Non-Duplicates 3.10k (100%) Min Length (chars) 32.0
Exact Duplicates --- Most Common Length 32 (100.0%)
Fuzzy Duplicates --- Avg Digit Ratio 0.62
Values with Duplicates ---
Memory Usage <1 Mb

seller_zip_code_prefix

df_sellers['seller_zip_code_prefix'].explore.info(plot=False)
Summary Statistics for "seller_zip_code_prefix" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 3.10k (100%) Max 99.73k Mean 32.29k 14940 49 (2%)
Missing --- 99% 95.90k Trimmed Mean (10%) 28.67k 13660 10 (<1%)
Distinct 2.25k (73%) 95% 89.29k Mode 14940 13920 9 (<1%)
Non-Duplicate 1.71k (55%) 75% 64.55k Range 98729 16200 9 (<1%)
Duplicates 849 (27%) 50% 14.94k IQR 57.46k 87050 8 (<1%)
Dup. Values 537 (17%) 25% 7.09k Std 32.71k 14020 8 (<1%)
Zeros --- 5% 2.46k MAD 16.18k 1026 8 (<1%)
Negative --- 1% 1.14k Kurt -0.86 37540 7 (<1%)
Memory Usage <1 Mb Min 1.00k Skew 0.92 13481 7 (<1%)

seller_city

df_sellers['seller_city'].explore.info()
Summary Statistics for "seller_city" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 3.10k (100%) Avg Word Count 1.7 Sao Paulo 695 (22%)
Missing Values --- Max Length (chars) 38.0 Curitiba 127 (4%)
Empty Strings --- Avg Length (chars) 10.2 Rio De Janeiro 96 (3%)
Distinct Values 605 (20%) Median Length (chars) 9.0 Belo Horizonte 68 (2%)
Non-Duplicates 337 (11%) Min Length (chars) 2.0 Ribeirao Preto 52 (2%)
Exact Duplicates 2.49k (80%) Most Common Length 9 (29.4%) Guarulhos 50 (2%)
Fuzzy Duplicates 2.49k (80%) Avg Digit Ratio 0.00 Ibitinga 49 (2%)
Values with Duplicates 268 (9%) Santo Andre 45 (1%)
Memory Usage <1 Mb Campinas 41 (1%)

Key Observations:

  • Most sellers are from the city of Sao Paulo (22%).

seller_state

df_sellers['seller_state'].explore.info()
Summary Statistics for "seller_state" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 3.10k (100%) Avg Word Count 1.0 SP 1.85k (60%)
Missing Values --- Max Length (chars) 2.0 PR 349 (11%)
Empty Strings --- Avg Length (chars) 2.0 MG 244 (8%)
Distinct Values 23 (<1%) Median Length (chars) 2.0 SC 190 (6%)
Non-Duplicates 5 (<1%) Min Length (chars) 2.0 RJ 171 (6%)
Exact Duplicates 3.07k (99%) Most Common Length 2 (100.0%) RS 129 (4%)
Fuzzy Duplicates 3.07k (99%) Avg Digit Ratio 0.00 GO 40 (1%)
Values with Duplicates 18 (<1%) DF 30 (<1%)
Memory Usage <1 Mb ES 23 (<1%)

Key Observations:

  • Most sellers are from the state of SP (60%).

Table df_geolocations#

Let’s look at the information about the dataframe.

df_geolocations.explore.info()
Dataframe Overview
Summary Column Types
Rows 1.00m Text 0
Features 5 Categorical 2
Missing cells --- Int 1
Exact Duplicates 279.76k (28%) Float 2
Fuzzy Duplicates --- Datetime 0
Memory Usage (Mb) 26

Key Observations:

  • The df_geolocations table has 28% fully duplicated rows.
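The share of fully duplicated rows reported above can be computed directly in pandas. A minimal sketch on a hypothetical miniature table:

```python
import pandas as pd

# Hypothetical miniature with one fully duplicated row out of three
geo = pd.DataFrame({
    'geolocation_zip_code_prefix': [1001, 1001, 1002],
    'geolocation_lat': [-23.55, -23.55, -22.90],
    'geolocation_lng': [-46.63, -46.63, -43.20],
})

# duplicated() flags each repeat of an earlier identical row;
# the mean of the boolean mask is the duplicate share
dup_share = geo.duplicated().mean()
print(f"{dup_share:.0%} of rows are exact duplicates")
```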

Initial Column Analysis#

We will examine each column individually.

geolocation_zip_code_prefix

df_geolocations['geolocation_zip_code_prefix'].explore.info(plot=False)
Summary Statistics for "geolocation_zip_code_prefix" (Type: Integer)
Summary Percentiles Detailed Stats Value Counts
Total 1.00m (100%) Max 99.99k Mean 36.57k 24220 1.15k (<1%)
Missing --- 99% 98.01k Trimmed Mean (10%) 33.75k 24230 1.10k (<1%)
Distinct 19.02k (2%) 95% 91.75k Mode 24220 38400 965 (<1%)
Non-Duplicate 1.04k (<1%) 75% 63.50k Range 98989 35500 907 (<1%)
Duplicates 981.15k (98%) 50% 26.53k IQR 52.43k 11680 879 (<1%)
Dup. Values 17.97k (2%) 25% 11.07k Std 30.55k 22631 832 (<1%)
Zeros --- 5% 3.22k MAD 28.07k 30140 810 (<1%)
Negative --- 1% 1.32k Kurt -0.94 11740 788 (<1%)
Memory Usage 8 Min 1.00k Skew 0.69 38408 773 (<1%)

geolocation_lat

df_geolocations['geolocation_lat'].explore.info(plot=False)
Summary Statistics for "geolocation_lat" (Type: Float)
Summary Percentiles Detailed Stats Value Counts
Total 1.00m (100%) Max 45.07 Mean -21.18 -27.10 314 (<1%)
Missing --- 99% -2.50 Trimmed Mean (10%) -21.98 -23.50 190 (<1%)
Distinct 717.36k (72%) 95% -7.67 Mode -27.10 -23.51 141 (<1%)
Non-Duplicate 584.67k (58%) 75% -19.98 Range 81.67 -23.49 127 (<1%)
Duplicates 282.80k (28%) 50% -22.92 IQR 3.62 -23.01 102 (<1%)
Dup. Values 132.69k (13%) 25% -23.60 Std 5.72 -23.00 89 (<1%)
Zeros --- 5% -28.98 MAD 1.65 -22.97 89 (<1%)
Negative 998.83k (99%) 1% -30.10 Kurt 2.85 -15.84 85 (<1%)
Memory Usage 8 Min -36.61 Skew 1.57 -23.54 83 (<1%)

geolocation_lng

df_geolocations['geolocation_lng'].explore.info(plot=False)
Summary Statistics for "geolocation_lng" (Type: Float)
Summary Percentiles Detailed Stats Value Counts
Total 1.00m (100%) Max 121.11 Mean -46.39 -48.63 314 (<1%)
Missing --- 99% -34.94 Trimmed Mean (10%) -46.44 -46.87 190 (<1%)
Distinct 717.61k (72%) 95% -38.50 Mode -48.63 -46.72 141 (<1%)
Non-Duplicate 585.03k (58%) 75% -43.77 Range 222.57 -46.87 127 (<1%)
Duplicates 282.55k (28%) 50% -46.64 IQR 4.81 -43.38 102 (<1%)
Dup. Values 132.59k (13%) 25% -48.57 Std 4.27 -46.55 91 (<1%)
Zeros --- 5% -53.22 MAD 3.84 -43.39 89 (<1%)
Negative 1.00m (99%) 1% -57.65 Kurt 4.73 -43.32 89 (<1%)
Memory Usage 8 Min -101.47 Skew -0.10 -48.02 85 (<1%)

geolocation_city

df_geolocations['geolocation_city'].explore.info()
Summary Statistics for "geolocation_city" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 1.00m (100%) Avg Word Count 1.8 Sao Paulo 160.72k (16%)
Missing Values --- Max Length (chars) 36.0 Rio De Janeiro 62.15k (6%)
Empty Strings --- Avg Length (chars) 10.5 Belo Horizonte 27.80k (3%)
Distinct Values 5.93k (<1%) Median Length (chars) 9.0 Curitiba 16.59k (2%)
Non-Duplicates 420 (<1%) Min Length (chars) 2.0 Porto Alegre 13.52k (1%)
Exact Duplicates 994.23k (99%) Most Common Length 9 (22.9%) Brasilia 12.65k (1%)
Fuzzy Duplicates 994.23k (99%) Avg Digit Ratio 0.00 Salvador 11.87k (1%)
Values with Duplicates 5.51k (<1%) Guarulhos 11.34k (1%)
Memory Usage 2 Sao Bernardo Do Campo 9.63k (<1%)

Key Observations:

  • In geolocation_city, most entries are for the city of Sao Paulo (16%).

geolocation_state

df_geolocations['geolocation_state'].explore.info()
Summary Statistics for "geolocation_state" (Type: Categorical)
Summary Text Metrics Value Counts
Total Values 1.00m (100%) Avg Word Count 1.0 SP 404.27k (40%)
Missing Values --- Max Length (chars) 2.0 MG 126.34k (13%)
Empty Strings --- Avg Length (chars) 2.0 RJ 121.17k (12%)
Distinct Values 27 (<1%) Median Length (chars) 2.0 RS 61.85k (6%)
Non-Duplicates --- Min Length (chars) 2.0 PR 57.86k (6%)
Exact Duplicates 1.00m (99%) Most Common Length 2 (100.0%) SC 38.33k (4%)
Fuzzy Duplicates 1.00m (99%) Avg Digit Ratio 0.00 BA 36.05k (4%)
Values with Duplicates 27 (<1%) GO 20.14k (2%)
Memory Usage 1 ES 16.75k (2%)

Key Observations:

  • In geolocation_state, most entries are for the state of SP (40%).

Exploring Duplicates#

We have complete row duplicates in the geolocation table. Let’s examine them.

Let’s check whether geolocation_zip_code_prefix still has duplicates after dropping full-row duplicates.

tmp_geo = df_geolocations.drop_duplicates()
tmp_geo.explore.detect_anomalies('duplicate', columns=['geolocation_zip_code_prefix'])
Duplicates by Column
  Count Percent
geolocation_zip_code_prefix 701390 97.36%

Key Observations:

  • In the df_geolocations table, there are 97% duplicates in the geolocation_zip_code_prefix column.

Let’s see why there are duplicates.

tmp_geo.groupby('geolocation_zip_code_prefix').nunique().head(10)
geolocation_lat geolocation_lng geolocation_city geolocation_state
geolocation_zip_code_prefix
1001 10 10 1 1
1002 6 6 1 1
1003 10 10 1 1
1004 14 14 1 1
1005 11 11 1 1
1006 7 7 1 1
1007 15 15 1 1
1008 12 12 1 1
1009 7 7 1 1
1010 5 5 1 1

This makes sense: a single geolocation_zip_code_prefix can correspond to many distinct coordinates.

But we need to take this into account when joining tables, since we only have zip_code_prefix in the customer and seller tables.

When joining, we may get many duplicates.

We can average the coordinates, but we can’t do the same with cities and states.

Let’s check if we have multiple states for a single prefix.

tmp_geo.groupby('geolocation_zip_code_prefix').geolocation_state.nunique().sort_values(ascending=False).head()
geolocation_zip_code_prefix
79750    2
78557    2
72915    2
4011     2
2116     2
Name: geolocation_state, dtype: int64

Let’s see the maximum number of cities with the same prefix.

tmp_geo.groupby('geolocation_zip_code_prefix').geolocation_city.nunique().sort_values(ascending=False).head()
geolocation_zip_code_prefix
65935    4
45936    4
13318    3
72270    3
9780     3
Name: geolocation_city, dtype: int64

Key Observations:

  • In the df_geolocations table, there are prefixes with 2 unique states.

  • In the df_geolocations table, there are prefixes with 4 unique cities.

There’s nothing we can do about this. We’ll need to account for this when analyzing geolocation coordinates.

Since we have states in the customer and seller tables, we can avoid using city and state from the geolocation table.

And we can average the coordinates.
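Averaging the coordinates per prefix before any join could look like the sketch below (hypothetical miniature data; the city and state columns are deliberately left out, per the reasoning above):

```python
import pandas as pd

# Hypothetical miniature of the deduplicated geolocation table
tmp_geo = pd.DataFrame({
    'geolocation_zip_code_prefix': [1001, 1001, 1002],
    'geolocation_lat': [-23.54, -23.56, -22.90],
    'geolocation_lng': [-46.62, -46.64, -43.20],
})

# One row per prefix with mean coordinates -> a safe 1:1 join key
geo_avg = (
    tmp_geo
    .groupby('geolocation_zip_code_prefix', as_index=False)
    [['geolocation_lat', 'geolocation_lng']]
    .mean()
)
print(geo_avg)
```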

Exploring Outliers#

Let’s see how many sales we have outside South America.

tmp_geo = df_geolocations.copy()
tmp_geo['in_south_america'] = (
    (tmp_geo.geolocation_lat >= -53.90) &  # Southern border
    (tmp_geo.geolocation_lat <= 12.45) &   # Northern border
    (tmp_geo.geolocation_lng >= -81.32) &  # Western border
    (tmp_geo.geolocation_lng <= -34.79)    # Eastern border
)
temp_df = (df_orders[['order_id', 'customer_id']].merge(df_customers, on='customer_id', how='left')
              .merge(tmp_geo.drop_duplicates(subset=["geolocation_zip_code_prefix"]), left_on='customer_zip_code_prefix', right_on='geolocation_zip_code_prefix', how='left')
              .dropna()
              [['in_south_america', 'customer_city', 'customer_state', 'geolocation_lat', 'geolocation_lng']]
)
temp_df.in_south_america.value_counts()
in_south_america
True     99157
False        6
Name: count, dtype: int64

Key Observations:

  • There are 6 sales outside South America in the dataset.

Let’s look at these orders.

temp_df[temp_df.in_south_america == False]
in_south_america customer_city customer_state geolocation_lat geolocation_lng
728 False Ilha Dos Valadares PR 42.18 -8.72
5634 False Porto Trombetas PA 41.15 -8.58
12766 False Areia Branca Dos Assis PR 39.06 -9.40
74197 False Areia Branca Dos Assis PR 39.06 -9.40
89139 False Areia Branca Dos Assis PR 39.06 -9.40
96816 False Porto Trombetas PA 41.15 -8.58

These coordinates are outside South America. Either it’s an error, or the order was placed outside South America.

  • (42.18, -8.72) is off the coast of Galicia, Spain

  • (41.15, -8.58) is near Porto, Portugal (the same point repeats in two orders)

  • (39.06, -9.40) is on the Portuguese coast west of Lisbon (the same point repeats in three orders)

del temp_df, tmp_geo

Exploring Cross-Table Anomalies#

Temporal Boundary Checks#

Examining time interval boundaries across different tables.

for key, df in dfs:
    datetime_cols = [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]
    for col in datetime_cols:
        min_date = df[col].min()
        max_date = df[col].max()    
        print(f"[{min_date.date()} - {max_date.date()}] DataFrame '{key}', Column '{col}':")
[2016-09-04 - 2018-10-17] DataFrame 'orders', Column 'order_purchase_dt':
[2016-09-15 - 2018-09-03] DataFrame 'orders', Column 'order_approved_dt':
[2016-10-08 - 2018-09-11] DataFrame 'orders', Column 'order_delivered_carrier_dt':
[2016-10-11 - 2018-10-17] DataFrame 'orders', Column 'order_delivered_customer_dt':
[2016-09-30 - 2018-11-12] DataFrame 'orders', Column 'order_estimated_delivery_dt':
[2016-09-19 - 2020-04-09] DataFrame 'items', Column 'shipping_limit_dt':
[2016-10-02 - 2018-08-31] DataFrame 'reviews', Column 'review_creation_dt':
[2016-10-07 - 2018-10-29] DataFrame 'reviews', Column 'review_answer_dt':

Key Observations:

  • The latest date in order_approved_dt (2018-09-03) is earlier than the latest date in order_purchase_dt (2018-10-17).

  • The maximum date in shipping_limit_dt (2020-04-09) is anomalously late compared to the other temporal variables.
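Such far-future timestamps could be isolated for inspection with a simple cutoff filter. A hedged sketch on a hypothetical miniature of df_items, using the latest observed purchase date from the report as the cutoff:

```python
import pandas as pd

# Hypothetical miniature of df_items
df_items = pd.DataFrame({
    'order_id': ['a', 'b', 'c'],
    'shipping_limit_dt': pd.to_datetime(
        ['2018-05-02', '2020-04-09', '2018-08-30']
    ),
})

# Anything past the last observed purchase date is suspicious
cutoff = pd.Timestamp('2018-10-17')  # max order_purchase_dt from the report
suspect = df_items[df_items['shipping_limit_dt'] > cutoff]
print(suspect)
```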

Payment-Order Amount Mismatches#

Checking for orders where payment total differs from order value.

temp_df = (
    df_items[['order_id', 'price', 'freight_value']]
    .groupby('order_id')
    .sum()
    .assign(total_price=lambda x: x['price'] + x['freight_value'])
    .drop(columns=['price', 'freight_value'])
    .reset_index()
    .merge(df_payments, on='order_id', how='inner')
    .merge(df_orders, on='order_id', how='inner')
    .dropna(subset=['payment_value', 'total_price'])
)
temp_df['payment_matches_total'] = temp_df['payment_value'].round(2) == temp_df['total_price'].round(2)
temp_df['payment_matches_total'].value_counts()
payment_matches_total
True     95179
False     7877
Name: count, dtype: int64

Key Observations:

  • Dataset contains 7,877 orders with payment-amount discrepancies.
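Exact equality after rounding can still flag sub-cent floating-point noise as a mismatch; `numpy.isclose` with an absolute tolerance is a more forgiving variant. A sketch with hypothetical values (not the real order data):

```python
import numpy as np
import pandas as pd

# Hypothetical payments vs. order totals, with one sub-cent float discrepancy
pay_check = pd.DataFrame({
    'payment_value': [100.00, 59.990000001, 45.10],
    'total_price':   [100.00, 59.99,        44.00],
})

# Treat any difference below one cent as a match
pay_check['payment_matches_total'] = np.isclose(
    pay_check['payment_value'], pay_check['total_price'],
    atol=0.01, rtol=0.0,
)
print(pay_check['payment_matches_total'].value_counts())
```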

tmp_mask = ~temp_df.payment_matches_total

Let’s analyze by payment type.

temp_df.explore.anomalies_by_categories(
    custom_mask=tmp_mask
    , pct_diff_threshold=-100
    , include_columns='tmp_payment_types'
)
Customs distribution across categories
Column Category Total Anomaly Anomaly Rate Total % Anomaly % % Diff
tmp_payment_types Voucher, Credit Card 2907 2907 100.0% 2.8% 36.9% 34.1%
tmp_payment_types Credit Card, Voucher 2424 2424 100.0% 2.4% 30.8% 28.4%
tmp_payment_types Voucher 2540 1425 56.1% 2.5% 18.1% 15.6%
tmp_payment_types Credit Card, Debit Card 2 2 100.0% 0.0% 0.0% 0.0%
tmp_payment_types Debit Card 1521 15 1.0% 1.5% 0.2% -1.3%
tmp_payment_types Boleto 19614 69 0.4% 19.0% 0.9% -18.2%
tmp_payment_types Credit Card 74048 1035 1.4% 71.9% 13.1% -58.7%

Key Observations:

  • Payment type analysis shows that most mismatches involve voucher payments (likely a systemic issue).

Exploring Relationships Between Tables#

Reviewing inter-table connections for future joins and key consistency.

df_orders and df_payments

fron.analyze_join_keys(df_orders, df_payments, "order_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: one-to-many (1:N)
- Left total duplicates (keep=False): 0 (0% of rows)
- Right total duplicates (keep=False): 7,407 (7.1% of rows)

Coverage:
- Left-only keys: 1
- Right-only keys: 0

Counts:
- Left keys: 99,441 unique
- Right keys: 99,440 unique
- Common keys: 99,440

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 99,441 rows
- Right table size: 103,886 rows
- Inner join size: 103,886 rows
- Left join size: 103,887 rows
- Right join size: 103,886 rows
- Outer join size: 103,887 rows

Key Observations:

  • Orders table contains 1 order_id missing from payments table.

Let’s look at what this order is.

temp_df = df_orders.merge(df_payments, on='order_id', how='left')
temp_df[temp_df.payment_value.isna()]
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered payment_sequential payment_type payment_installments payment_value
32133 bfbd0f9bdef84302105ad712db648a6c 86dc2ffce2dfff336de2f386a786e574 Delivered 2016-09-15 12:16:38 2016-09-15 12:16:38 2016-11-07 17:11:53 2016-11-09 07:47:38 2016-10-04 1 Missing in Pays beleza_saude SP September Thursday Afternoon Delayed Delivered NaN NaN NaN NaN

df_orders and df_items

fron.analyze_join_keys(df_orders, df_items, "order_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: one-to-many (1:N)
- Left total duplicates (keep=False): 0 (0% of rows)
- Right total duplicates (keep=False): 23,787 (21.1% of rows)

Coverage:
- Left-only keys: 775
- Right-only keys: 0

Counts:
- Left keys: 99,441 unique
- Right keys: 98,666 unique
- Common keys: 98,666

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 99,441 rows
- Right table size: 112,650 rows
- Inner join size: 112,650 rows
- Left join size: 113,425 rows
- Right join size: 112,650 rows
- Outer join size: 113,425 rows

Key Observations:

  • Orders table contains 775 order_ids missing from items table.

Since we also have a payments table, let’s check whether the orders missing from items are present there.

missing_orders = (
    df_orders
    .merge(df_items, on='order_id', how='left')
    .loc[lambda x: x['order_item_id'].isna(), 'order_id']
    .unique()
)
len(missing_orders)
775
df_payments[df_payments.order_id.isin(missing_orders)].order_id.nunique()
775

All these orders are present in the payments table.

Let’s check the status breakdown of these orders.

df_orders[df_orders['order_id'].isin(missing_orders)].order_status.value_counts()
order_status
Unavailable    603
Canceled       164
Created          5
Invoiced         2
Shipped          1
Approved         0
Delivered        0
Processing       0
Name: count, dtype: int64

Key Observations:

  • These orders are mostly unavailable (603) or canceled (164); a handful were just created, invoiced, or even shipped.

Let’s examine orders with “shipped” status.

df_orders[df_orders['order_id'].isin(missing_orders) & (df_orders.order_status == 'Shipped')]
order_id customer_id order_status order_purchase_dt order_approved_dt order_delivered_carrier_dt order_delivered_customer_dt order_estimated_delivery_dt tmp_avg_reviews_score tmp_payment_types tmp_product_categories tmp_customer_state tmp_order_purchase_month tmp_order_purchase_weekday tmp_purchase_time_of_day tmp_is_delayed tmp_is_delivered
23254 a68ce1686d536ca72bd2dadc4b8671e5 d7bed5fac093a4136216072abaf599d5 Shipped 2016-10-05 01:47:40 2016-10-07 03:11:22 2016-11-07 16:37:37 NaT 2016-12-01 1 Boleto Missing in Items RS October Wednesday Night Missing delivery dt Not Delivered

df_orders and df_customers

fron.analyze_join_keys(df_orders, df_customers, "customer_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: one-to-one (1:1)
- Left total duplicates (keep=False): 0 (0% of rows)
- Right total duplicates (keep=False): 0 (0% of rows)

Coverage:
- Left-only keys: 0
- Right-only keys: 0

Counts:
- Left keys: 99,441 unique
- Right keys: 99,441 unique
- Common keys: 99,441

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 99,441 rows
- Right table size: 99,441 rows
- Inner join size: 99,441 rows
- Left join size: 99,441 rows
- Right join size: 99,441 rows
- Outer join size: 99,441 rows

Key Observations:

  • All is well…
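When a join is expected to be one-to-one, pandas can enforce that expectation at merge time via the `validate` argument, which raises `pandas.errors.MergeError` if either side has duplicate keys. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical stand-ins for df_orders / df_customers.
orders = pd.DataFrame({"customer_id": [1, 2, 3], "order_id": ["a", "b", "c"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "state": ["SP", "RJ", "MG"]})

# validate="1:1" raises MergeError if the keys are not unique on both sides.
joined = orders.merge(customers, on="customer_id", validate="1:1")
print(len(joined))  # 3
```

Using `validate` in production merge code turns a silent fan-out into a loud failure if the data ever drifts from the assumed cardinality.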

df_orders and df_reviews

fron.analyze_join_keys(df_orders, df_reviews, "order_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: one-to-many (1:N)
- Left total duplicates (keep=False): 0 (0% of rows)
- Right total duplicates (keep=False): 1,114 (1.1% of rows)

Coverage:
- Left-only keys: 0
- Right-only keys: 0

Counts:
- Left keys: 99,441 unique
- Right keys: 99,441 unique
- Common keys: 99,441

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 99,441 rows
- Right table size: 100,000 rows
- Inner join size: 100,000 rows
- Left join size: 100,000 rows
- Right join size: 100,000 rows
- Outer join size: 100,000 rows

Key Observations:

  • All is well…

df_items and df_products

fron.analyze_join_keys(df_items, df_products, "product_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: many-to-one (N:1)
- Left total duplicates (keep=False): 94,533 (83.9% of rows)
- Right total duplicates (keep=False): 0 (0% of rows)

Coverage:
- Left-only keys: 0
- Right-only keys: 0

Counts:
- Left keys: 32,951 unique
- Right keys: 32,951 unique
- Common keys: 32,951

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 112,650 rows
- Right table size: 32,951 rows
- Inner join size: 112,650 rows
- Left join size: 112,650 rows
- Right join size: 112,650 rows
- Outer join size: 112,650 rows

Key Observations:

  • All is well…

df_items and df_sellers

fron.analyze_join_keys(df_items, df_sellers, "seller_id", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: many-to-one (N:1)
- Left total duplicates (keep=False): 112,141 (99.5% of rows)
- Right total duplicates (keep=False): 0 (0% of rows)

Coverage:
- Left-only keys: 0
- Right-only keys: 0

Counts:
- Left keys: 3,095 unique
- Right keys: 3,095 unique
- Common keys: 3,095

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 112,650 rows
- Right table size: 3,095 rows
- Inner join size: 112,650 rows
- Left join size: 112,650 rows
- Right join size: 112,650 rows
- Outer join size: 112,650 rows

Key Observations:

  • All is well…

df_customers and df_geolocations

fron.analyze_join_keys(df_customers, df_geolocations, left_on = 'customer_zip_code_prefix', right_on = "geolocation_zip_code_prefix", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: many-to-many (N:M)
- Left total duplicates (keep=False): 96,429 (97% of rows)
- Right total duplicates (keep=False): 999,120 (99.9% of rows)

Coverage:
- Left-only keys: 157
- Right-only keys: 4,178

Counts:
- Left keys: 14,994 unique
- Right keys: 19,015 unique
- Common keys: 14,837

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 99,441 rows
- Right table size: 1,000,163 rows
- Inner join size: 15,083,455 rows
- Left join size: 15,083,733 rows
- Right join size: 15,115,319 rows
- Outer join size: 15,115,597 rows

Key Observations:

  • In the df_customers table, there are 157 zip_code_prefixes not present in df_geolocations.

  • In the df_geolocations table, there are 4,178 zip_code_prefixes not present in df_customers.
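The ~15M-row inner join is the expected fan-out of an N:M key: each common key contributes left_count × right_count rows to the result. The join size can be predicted before running the merge (sketch with toy data):

```python
import pandas as pd

# Toy N:M key columns (hypothetical data).
left = pd.DataFrame({"zip": [1, 1, 2]})          # key 1 appears twice
right = pd.DataFrame({"zip": [1, 1, 1, 2, 3]})   # key 1 appears three times

lc = left["zip"].value_counts()
rc = right["zip"].value_counts()
# Inner join size = sum over common keys of left_count * right_count.
# Multiplying the two value_counts Series aligns on key; non-common keys become NaN.
expected = int((lc * rc).dropna().sum())
print(expected)  # 2*3 + 1*1 = 7
assert expected == len(left.merge(right, on="zip"))
```

This kind of pre-check is cheap and can flag a join that would blow up memory, as a customers-to-geolocations merge would here.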

df_sellers and df_geolocations

fron.analyze_join_keys(df_sellers, df_geolocations, left_on = 'seller_zip_code_prefix', right_on = "geolocation_zip_code_prefix", short_result=False)
============================= Join Analysis Report =============================

Relationship:
- Type: many-to-many (N:M)
- Left total duplicates (keep=False): 1,386 (44.8% of rows)
- Right total duplicates (keep=False): 999,120 (99.9% of rows)

Coverage:
- Left-only keys: 7
- Right-only keys: 16,776

Counts:
- Left keys: 2,246 unique
- Right keys: 19,015 unique
- Common keys: 2,239

Missing Values:
- Left missing: 0 (0%)
- Right missing: 0 (0%)

Join Result Sizes:
- Left table size: 3,095 rows
- Right table size: 1,000,163 rows
- Inner join size: 435,087 rows
- Left join size: 435,094 rows
- Right join size: 1,166,434 rows
- Outer join size: 1,166,441 rows

Key Observations:

  • In the df_sellers table, there are 7 zip_code_prefixes not present in df_geolocations.

  • In the df_geolocations table, there are 16,776 zip_code_prefixes not present in df_sellers.

Drop the temporary (tmp_-prefixed) columns.

df_orders = df_orders[[col for col in df_orders.columns if not col.startswith('tmp_')]]
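An equivalent, arguably more idiomatic way is to filter on the column index directly with `.loc` and a boolean mask (a sketch with a hypothetical dataframe; same effect as the list comprehension above):

```python
import pandas as pd

# Hypothetical dataframe with tmp_-prefixed helper columns.
df = pd.DataFrame({"order_id": [1], "tmp_flag": [True], "tmp_month": ["May"]})

# Keep only columns that do not start with the tmp_ prefix.
df = df.loc[:, ~df.columns.str.startswith("tmp_")]
print(list(df.columns))  # ['order_id']
```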

Free the memory held by temporary variables.

# Remove any tmp_-prefixed variables from the global namespace.
# list() makes a snapshot so we don't mutate globals() while iterating.
for var_name in list(globals().keys()):
    if var_name.startswith('tmp_'):
        del globals()[var_name]