Geographical Analysis#

Geo-analysis by ZIP Codes#

Data Preparation#

We will aggregate data by the first 3 digits of the customer’s ZIP code. Each point on the map will represent a unique truncated ZIP code.

Without truncation, we would get excessive detail and no noticeable differences.

We will prepare the data for visualization.

We have 6 orders that were made outside South America.

After truncating the prefixes, we may get additional coordinates outside South America.

We will remove them to avoid interfering with the map analysis.

tmp_df_res = df_sales.copy()
tmp_df_res['is_delayed'] = tmp_df_res['is_delayed'] == 'Delayed'
tmp_df_res['order_has_installment'] = tmp_df_res['order_has_installment'] == 'Has Installments'
tmp_df_res = (
    tmp_df_res.merge(df_customers_origin[['customer_id', 'customer_zip_code_prefix_3_digits']], on='customer_id')
    .groupby('customer_zip_code_prefix_3_digits', as_index=False)
    .agg(
        total_orders=('order_id', 'nunique')
        , orders_delayed_share = ('is_delayed', 'mean')
        , total_payment = ('total_payment', 'sum')
        , aov = ('total_payment', 'mean')
        , avg_installments = ('total_installments_cnt', 'mean')
        , first_orders_cnt = ('sale_is_customer_first_purchase', 'sum')
        , installment_orders_cnt = ('order_has_installment', 'sum')
        , total_reviews = ('reviews_cnt', 'sum')
        , avg_review_score = ('order_avg_reviews_score', 'mean')
        , avg_delivery_delay_days = ('delivery_delay_days', 'mean')
        , avg_delivery_time_days = ('delivery_time_days', 'mean')
        , avg_products_cnt = ('products_cnt', 'mean')
        , avg_freight_ratio = ('freight_ratio', 'mean')
        , avg_order_weight_kg = ('total_weight_kg', 'mean')
        , avg_order_volume_cm3 = ('total_volume_cm3', 'mean')
    ) 
)
tmp_df_res = (
    df_orders.assign(
        is_canceled = lambda x: x.order_status=='Canceled'
    )
    .merge(df_customers_origin, on='customer_unique_id')
    .groupby('customer_zip_code_prefix_3_digits', as_index=False)
    .agg(
        cancel_rate = ('is_canceled', 'mean')
    )
    .merge(tmp_df_res, on='customer_zip_code_prefix_3_digits', how='right')
    .merge(df_geolocations[lambda x:x.in_south_america], left_on = 'customer_zip_code_prefix_3_digits', right_on='geolocation_zip_code_prefix_3_digits')
)
tmp_df_res['repeat_purchase_rate'] = (tmp_df_res['total_orders'] - tmp_df_res['first_orders_cnt']) / tmp_df_res['total_orders']
tmp_df_res['installment_orders_rate'] = tmp_df_res['installment_orders_cnt'] / tmp_df_res['total_orders']

We will calculate the average MAU by the truncated ZIP code.

temp_mau = (
    df_sales.merge(df_customers_origin[['customer_id', 'customer_zip_code_prefix_3_digits']], on='customer_id')
    .groupby(['customer_zip_code_prefix_3_digits', pd.Grouper(key='order_purchase_dt', freq='ME')], observed=False)
    .agg(mau = ('customer_unique_id', 'nunique'))
    .groupby('customer_zip_code_prefix_3_digits', observed=False)
    .mean()
)
tmp_df_res = tmp_df_res.merge(temp_mau, on='customer_zip_code_prefix_3_digits', how='left')
del temp_mau

Data Visualization#

We will create labels for displaying on the maps.

labels_for_map = dict(
    total_orders = 'Number of Sales'
    , total_payment = 'Sales Amount, R$'
    , aov = 'Average Order Value, R$'
    , mau = 'MAU'
    , avg_freight_ratio = 'Average Freight Ratio'
    , avg_delivery_time_days = 'Average Delivery Time, days'
    , avg_review_score = 'Average Review Score'
    , orders_delayed_share = 'Percentage of Delayed Orders'
    , avg_products_cnt = 'Average Number of Products in Order'
    , installment_orders_rate = 'Installment Payment Rate'
    , avg_installments = 'Average Number of Installments in Order'
    , avg_order_weight_kg = 'Average Order Weight, kg'
    , avg_order_volume_cm3 = 'Average Order Volume, cm3'
    , repeat_purchase_rate = 'Repeat Purchase Rate'
    , cancel_rate = 'Cancel Rate'
    , geolocation_lat = 'Latitude'
    , geolocation_lng = 'Longitude'
    , customer_zip_code_prefix_3_digits = 'Zip Code Prefix'
)

We will create a function for visualization.

def plot_map_zip(metric: str):
    """Create plotly map by 3-digit zip code prefix"""
    title = f"Distribution of {labels_for_map[metric].split(',')[0]} by 3-Digit Zip Code Prefix"
    colorbar_title = labels_for_map[metric].split(',')[1] if ',' in labels_for_map[metric] else None
    hover_data = {'geolocation_lat': False, 'geolocation_lng': False, 'customer_zip_code_prefix_3_digits': True}
    is_percentage = metric in [
        'avg_freight_ratio', 'orders_delayed_share', 
        'installment_orders_rate', 'repeat_purchase_rate', 
        'cancel_rate']
    if metric != 'total_orders':
        hover_data['total_orders'] = ':.2s'
    if not is_percentage:
        hover_data[metric] = ':.2s'
    else:
        hover_data[metric] = ':.1%'
    fig = px.scatter_map(
        tmp_df_res,
        lat='geolocation_lat',
        lon='geolocation_lng',
        color=metric,
        labels=labels_for_map,
        zoom=3,
        height=650,
        hover_data=hover_data,
        width=700,
        title=title,
        color_continuous_scale="matter",
        center={"lat": -14.235004, "lon": -55.92528}
    )
    if is_percentage:
        fig.update_coloraxes(
            colorbar_tickformat=".0%"
        )
    fig.update_layout(
        margin = dict(l=10, r=10, b=10, t=30)
        , title_y=0.99
        , coloraxis_colorbar_title_text = colorbar_title
    )
    pb.to_slide(fig)
    return fig

Where are the sales volume higher?

plot_map_zip('total_orders')
../../_images/ed8250e9c89918aa810b0a1e0e791d787c75138f9bd6a9b34fbdfd6af5cba876.jpg

Where is the sales amount higher?

plot_map_zip('total_payment')
../../_images/34a1b9175c8e72bb2ee23a2315fa4ce88ce5e05bed34c0591c8f78dd6055764e.jpg

Where is the average order value higher?

plot_map_zip('aov')
../../_images/5c241030b7454a6b957f249ee381eb37ed556d374fd66f13ef8d88b57403c11b.jpg

Where is average MAU higher?

plot_map_zip('mau')
../../_images/0a963da4532f9a2185e9ac5e6c2167fad0fd91a3cca24a7b492e808e78f1507c.jpg

Where do customers pay more for delivery?

plot_map_zip('avg_freight_ratio')
../../_images/8823d62dd1f24aafeba29ab3611aec633bee01463fb6782afa2ce98f865843f2.jpg

How is delivery time distributed across regions?

plot_map_zip('avg_delivery_time_days')
../../_images/dffaadac01934417a46111d52bc08e5adf10a99d2d6074321c9183173471ba91.jpg

What is the average rating by regions?

plot_map_zip('avg_review_score')
../../_images/ed7975a9fbe2904f743ed77017d32d8f22c50cfe5fe04f2bdef3c61ed5b3e711.jpg

How are delayed orders distributed across regions?

plot_map_zip('orders_delayed_share')
../../_images/cbc9782c3bf6187f59ceda564d01b0aaa3ddf368af871289f5e74a8a85340c59.jpg

How is the number of items per order distributed across regions?

plot_map_zip('avg_products_cnt')
../../_images/f52b7cce8e059769b372b91d90e8eb8c0bdfd429463bf3b5bbd914b1c67a29e1.jpg

What is the higher proportion of installment payments in which regions?

plot_map_zip('installment_orders_rate')
../../_images/19a8c191699f0824cbfc11955ff1e1908673139fb491bf4c8493ea54382cb55a.jpg

What regions have more installments per order?

plot_map_zip('avg_installments')
../../_images/22149e4a185cbdeacb6d70c4142fbf18be59c71dd2b64aaf9cf0a3645cce283a.jpg

What regions have the heaviest orders?

plot_map_zip('avg_order_weight_kg')
../../_images/1c453d13d7a68255bb30c86119ad1080598f1abc2ccc5881742a0d3875c43ba2.jpg

What regions have the largest volume orders?

plot_map_zip('avg_order_volume_cm3')
../../_images/4577cec52be4b116c1b48c24b393b7988daf430ef52e743e91966c15cdc75418.jpg

What regions have a higher repeat purchases rate?

plot_map_zip('repeat_purchase_rate')
../../_images/929e6d4198d9d0683255aa3e49d0c0666b700ab29b3e25fde7c8707764071549.jpg

What regions have a higher proportion of canceled orders?

plot_map_zip('cancel_rate')
../../_images/70f17f0d501ca05a0a79b36e2b72f3d44dcfb5233452fcc9caee8f1b65333311.jpg

Geo-analysis by State#

Data Preparation#

Creating dataframe for visualization

tmp_df_res = (
    df_sales.merge(df_customers_origin[['customer_id', 'customer_state_short', 'population', ]], on='customer_id', how='left')
)
tmp_df_res['is_delayed'] = tmp_df_res['is_delayed'] == 'Delayed'
tmp_df_res['order_has_installment'] = tmp_df_res['order_has_installment'] == 'Has Installments'

Calculate average MAU by state.

temp_mau = (
    tmp_df_res.groupby(['customer_state_short', pd.Grouper(key='order_purchase_dt', freq='ME')], observed=False)
    .agg(mau = ('customer_unique_id', 'nunique'))
    .groupby('customer_state_short', observed=False)
    .mean()
)
tmp_df_res = (
    tmp_df_res.groupby('customer_state_short', observed=False, as_index=False)
    .agg(
        total_orders=('order_id', 'nunique')
        , total_payment = ('total_payment', 'sum')
        , aov = ('total_payment', 'mean')
        , first_orders_cnt = ('sale_is_customer_first_purchase', 'sum')
        , installment_orders_cnt = ('order_has_installment', 'sum')
        , total_reviews = ('reviews_cnt', 'sum')
        , avg_review_score = ('order_avg_reviews_score', 'mean')
        , avg_delivery_time_days = ('delivery_time_days', 'mean')
        , avg_delivery_delay_days = ('delivery_delay_days', 'mean')
        , avg_installments = ('total_installments_cnt', 'mean')
        , avg_products_cnt = ('products_cnt', 'mean')
        , population = ('population', 'first')
        , avg_order_weight_kg = ('total_weight_kg', 'mean')
        , avg_order_volume_cm3 = ('total_volume_cm3', 'mean')
    )
    .merge(temp_mau, on='customer_state_short', how='left')
)
tmp_df_res = (
    df_orders.assign(
        is_canceled = lambda x: x.order_status=='Canceled'
    )
    .merge(df_customers_origin[['customer_id', 'customer_state_short']], on='customer_id')
    .groupby('customer_state_short', as_index=False)
    .agg(
        cancel_rate = ('is_canceled', 'mean')
    )
    .merge(tmp_df_res, on='customer_state_short', how='right')
)
tmp_df_res['orders_per_thousand_person'] = tmp_df_res['total_orders'] * 1000 / tmp_df_res['population']
tmp_df_res['total_payment_per_thousand_person'] = tmp_df_res['total_payment'] * 1000 / tmp_df_res['population']
tmp_df_res['repeat_purchase_rate'] = (tmp_df_res['total_orders'] - tmp_df_res['first_orders_cnt']) / tmp_df_res['total_orders']
tmp_df_res['installment_orders_rate'] = tmp_df_res['installment_orders_cnt'] / tmp_df_res['total_orders']
del temp_mau

Let’s calculate the median retention of the first lifetime by state through cohorts. We will define the period as 30 days.

retention_1st_month_state = (
    df_sales.merge(df_customers_origin[['customer_id', 'customer_state_short']], on='customer_id', how='left')
)
retention_1st_month_state['cohort'] = retention_1st_month_state['customer_first_purchase_dt'].dt.to_period('M')
retention_1st_month_state['lifetime'] = (retention_1st_month_state['order_purchase_dt'] - retention_1st_month_state['customer_first_purchase_dt']).dt.days // 30
retention_1st_month_state = (
    retention_1st_month_state.groupby(['customer_state_short', 'cohort', 'lifetime'])['customer_unique_id']
    .nunique()
    .unstack()
    .fillna(0)
)
retention_1st_month_state = retention_1st_month_state.div(retention_1st_month_state[0], axis=0)[1].reset_index()
retention_1st_month_state = (
    retention_1st_month_state.groupby('customer_state_short', as_index=False, observed=False)
    .median()
)
retention_1st_month_state['retention_1st_month_state'] = retention_1st_month_state[1]
retention_1st_month_state = retention_1st_month_state[['customer_state_short', 'retention_1st_month_state']]
tmp_df_res = tmp_df_res.merge(retention_1st_month_state, on='customer_state_short', how='left')

Data Visualization#

Add some metrics in labels_for_map

labels_for_map.update(dict(
    total_reviews = 'Number of Reviews'
    , avg_delivery_delay_days = 'Average Delivery Delay, days'
    , orders_per_thousand_person = 'Number of Orders per Thousand Residents'
    , total_payment_per_thousand_person = 'Sales Amount per Thousand Residents, R$'
    , retention_1st_month_state = 'Retention 1st Month by State'
    , customer_state_short = 'Customer State'
))
labels_for_map.update({'mau': 'Average MAU'})

Let’s calculate the centroids of the states.

brazil_states_geojson = "https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/brazil-states.geojson"
with urlopen(brazil_states_geojson) as response:
    geojson = json.load(response)

gdf = gpd.GeoDataFrame.from_features(geojson['features'])
gdf['centroid'] = gdf['geometry'].centroid
gdf['centroid_lon'] = gdf['centroid'].x
gdf['centroid_lat'] = gdf['centroid'].y

state_centroids = gdf[['sigla', 'centroid_lon', 'centroid_lat']]
state_centroids.head(1)
sigla centroid_lon centroid_lat
0 AC -70.44 -9.31
def plot_map_state(metric: str):
    """Create plotly map by states"""
    title = f"Distribution of {labels_for_map[metric].split(',')[0]} by State"
    colorbar_title = labels_for_map[metric].split(',')[1] if ',' in labels_for_map[metric] else None
    df_for_color_text = tmp_df_res[['customer_state_short', metric]]
    color_level = (tmp_df_res[metric].max() - tmp_df_res[metric].min()) * 0.7 + tmp_df_res[metric].min()
    df_for_color_text['color'] = df_for_color_text[metric].apply(lambda x: 'rgba(255, 255, 255, 0.8)'  if x < color_level else 'rgba(50, 50, 50, 0.8)')

    # print(tmp_df_res)
    hover_data = dict()
    is_percentage = metric in [
        'avg_freight_ratio', 'orders_delayed_share', 
        'installment_orders_rate', 'repeat_purchase_rate', 
        'cancel_rate', 'retention_1st_month_state']
    if metric != 'total_orders':
        hover_data['total_orders'] = ':.2s'
    if not is_percentage:
        hover_data[metric] = ':.2s'
    else:
        if metric == 'retention_1st_month_state':
            hover_data[metric] = ':.2%'
        else:
            hover_data[metric] = ':.1%'

    fig = px.choropleth(
        tmp_df_res,
        geojson=brazil_states_geojson,
        locations='customer_state_short',  
        featureidkey="properties.sigla", 
        color=metric,
        color_continuous_scale="Viridis",
        title=title,
        labels=labels_for_map,
        hover_data=hover_data
    )

    fig.add_trace(
        go.Scattergeo(
            lon=state_centroids['centroid_lon'],
            lat=state_centroids['centroid_lat'],
            text=state_centroids['sigla'],
            hoverinfo='none',
            mode='text',
            textfont=dict(
                color='gray',
                size=10
            ),
            showlegend=False
        )
    )

    color_dict = dict(zip(df_for_color_text['customer_state_short'], df_for_color_text['color']))
    for trace in fig.data:
        if trace.type == 'scattergeo':
            states_in_trace = trace.text
            ordered_colors = [color_dict[state] for state in states_in_trace]
            trace.textfont.color = ordered_colors

    if is_percentage:
        fig.update_coloraxes(
            colorbar_tickformat=".0%" if metric != 'retention_1st_month_state' else ".2%"
        )
        
    fig.update_geos(
        visible=False,
        lataxis_range=[-40.7, 7.3],  
        lonaxis_range=[-85, -34.5],  
        projection_scale=1.2,        
        center=dict(lat=-15, lon=-55)  
    )
    
    fig.update_layout(
        margin={"r":0,"t":50,"l":0,"b":0}
        , width=550
        , height=500
        , coloraxis_colorbar_title_text = colorbar_title
    )    
    pb.to_slide(fig)
    return fig

In which states is the sales volume higher?

plot_map_state('total_orders')
../../_images/bf60cb24df813b952bbaab4883b982aae899d9b4b4cba2a679f2361ef49c1741.jpg

What is the number of orders per resident?

plot_map_state('orders_per_thousand_person')
../../_images/0683bb11aae7bb4127b6b4bd30e2f46ce4ee7848770125a3c92fc5ca0e6e70a4.jpg

In which states is the sales amount higher?

plot_map_state('total_payment')
../../_images/1d22c401adb8e11eee35d146c3116c395275a91dc1e4d93d326fde7fe59ec96f.jpg

What is the sales amount per resident?

plot_map_state('total_payment_per_thousand_person')
../../_images/99f84e015f27c2df82f2aef736e4c9c310c0405ffabb005f429211622fa7552e.jpg

In which states is the average MAU higher?

plot_map_state('mau')
../../_images/b297e35d9bfa7ca48d7abe561b4d1d6c0d9d4152f090ce2992e0013954b5b0f7.jpg

What is the average order value higher in which states?

plot_map_state('aov')
../../_images/d38bd53386baceda4796f984a98a649f164fdeabd99e8e3b1ce44acfc7482b6c.jpg

What is the proportion of repeat purchases by states?

plot_map_state('repeat_purchase_rate')
../../_images/75fb5fb6741d5cdd00d998060c5a64e35a7886349b48b38789633affacd7656e.jpg

In which states is the number of reviews higher?

plot_map_state('total_reviews')
../../_images/84a5ed3ab09b04ef490b8c74e9fa0709f96adfa6a913031a8222f2845a86beca.jpg

What is the review score higher in which states?

plot_map_state('avg_review_score')
../../_images/5d175ea08eb20fd12c8c3597cf18f9f71b06f4fae6634c1754f5b432a1770a0a.jpg

What is the average delivery time higher in which states?

plot_map_state('avg_delivery_time_days')
../../_images/49a82b943bb7a5411a4832f0d3b438482b0f218c8fb6a7be262fc4709d2ee3d2.jpg

What is the average delivery delay higher in which states?

plot_map_state('avg_delivery_delay_days')
../../_images/1c3c8f2927a0a7427f5a0ad4d3bc64f099be5f7539d148f6d43bdf16438eb578.jpg

What is the proportion of canceled orders higher in which states?

plot_map_state('cancel_rate')
../../_images/31625b716f43ce6457e59969c6b0fac0940f30820f58ec66df2f6ba987d95f94.jpg

What is the proportion of orders with installment payments higher in which states?

plot_map_state('installment_orders_rate')
../../_images/b3efd86cf356751be8e7d0a60d77120cefc0104267d5e13378b873f76744b82c.jpg

What is the number of installments higher in which states?

plot_map_state('avg_installments')
../../_images/f8ddb6c17aeac0709e0d37cc68552c6c5f1e6915973cc35aab74db4b8c1ab9b9.jpg

What is the number of items per order higher in which states?

plot_map_state('avg_products_cnt')
../../_images/144fcafc1f322d5a93d952b645a8d1526359c4e85c2be2380b6faf4df572f0f4.jpg

In which states are the heaviest orders?

plot_map_state('avg_order_weight_kg')
../../_images/54b44738f8f8e436d54d93c6913bdc18dd0fc22bde8d189c7392c0771b81716a.jpg

In which states are the largest volume orders?

plot_map_state('avg_order_volume_cm3')
../../_images/c6050b9f1131317d0f27c6f04a3308976b7f78ff6c0b6f34efa75b1e7c53271e.jpg

What is the median retention of the first month higher in which states?

plot_map_state('retention_1st_month_state')
../../_images/3870f51deca9481761ec76929b31964521b1a1393253b3c1cbb2b676f368e2d0.jpg