Duplicate Handling

The geolocation table contains fully duplicated rows, so we’ll remove them.

df_geolocations.drop_duplicates(inplace=True)
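Before dropping rows, it can help to count how many full duplicates there are. The snippet below is a minimal sketch on a made-up frame; `toy_geo` and its values are hypothetical stand-ins for `df_geolocations`, not data from the actual table.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_geolocations (values are invented).
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 1001, 2002],
    "geolocation_lat": [-23.5, -23.5, -23.7, -22.9],
    "geolocation_lng": [-46.6, -46.6, -46.8, -43.2],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = toy_geo.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = toy_geo.drop_duplicates()

print(n_duplicates, len(toy_geo), len(deduplicated))
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison; either form removes the same rows.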

The customer and seller tables already contain city and state information, while the geolocation table has multiple city/state entries for a single zip prefix. We’ll therefore average the coordinates by zip prefix and ignore the geolocation table’s city/state data. Otherwise, a single zip prefix would match several geolocation rows, and we wouldn’t be able to accurately map customer/seller coordinates by zip prefix.

We’ll also calculate average coordinates for each state, as they’ll be needed for geo-analysis.

We’ll verify that each zip prefix maps to only one state.
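One way to run this check is to count the distinct states per zip prefix. This is a sketch only: it assumes the state column is named `geolocation_state` (that name does not appear in the code here), and `toy_geo` is a hypothetical frame with invented values.

```python
import pandas as pd

# Hypothetical toy frame; 'geolocation_state' is an assumed column name.
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 2002, 2002, 3003],
    "geolocation_state": ["SP", "SP", "RJ", "RJ", "MG"],
})

# Number of distinct states seen for each zip prefix.
states_per_prefix = (
    toy_geo.groupby("geolocation_zip_code_prefix")["geolocation_state"].nunique()
)

# Prefixes that map to more than one state; empty when the mapping is unique.
ambiguous_prefixes = states_per_prefix[states_per_prefix > 1]

print(ambiguous_prefixes.empty)
```

If `ambiguous_prefixes` is not empty on the real data, those prefixes would need to be inspected before the state is treated as a property of the zip prefix.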

df_geolocations = df_geolocations.groupby('geolocation_zip_code_prefix')[['geolocation_lat', 'geolocation_lng']].mean().reset_index()
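Note that the aggregation above keeps only the zip prefix and the two coordinate columns, so the state column is no longer present in `df_geolocations` afterwards. The state-level averages would therefore have to be computed from the frame as it was before this step. A minimal sketch of both aggregations side by side, assuming a state column named `geolocation_state` and using a hypothetical `toy_geo` frame with invented values:

```python
import pandas as pd

# Hypothetical toy frame; 'geolocation_state' is an assumed column name.
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 2002],
    "geolocation_lat": [-23.5, -23.7, -22.9],
    "geolocation_lng": [-46.6, -46.8, -43.2],
    "geolocation_state": ["SP", "SP", "RJ"],
})

coord_cols = ["geolocation_lat", "geolocation_lng"]

# State-level averages, taken while the state column still exists.
state_coords = toy_geo.groupby("geolocation_state")[coord_cols].mean().reset_index()

# Zip-prefix averages, mirroring the aggregation used in this section.
zip_coords = (
    toy_geo.groupby("geolocation_zip_code_prefix")[coord_cols].mean().reset_index()
)

print(state_coords)
print(zip_coords)
```

Keeping the two results in separate variables (here `state_coords` and `zip_coords`, both illustrative names) avoids losing the state information when the original frame is overwritten.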