Duplicate Handling
The geolocation table contains fully duplicated rows, so we’ll remove them.
df_geolocations.drop_duplicates(inplace=True)
The customer and seller tables already contain city and state information, whereas the geolocation table has multiple city/state entries for a single zip prefix. We’ll therefore average the coordinates by zip prefix and ignore the geolocation city/state data.
Without this aggregation, a zip prefix would match several coordinate rows, and we couldn’t map customers and sellers to a single location by zip prefix.
For states, we’ll also calculate average coordinates, since they’ll be needed for geo-analysis.
We’ll verify that each zip prefix maps to only one state.
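One way to check that each zip prefix maps to a single state is to count distinct states per prefix. The sketch below runs on a small toy frame rather than the real data; the column name `geolocation_state` and the names `df_geo_sample`, `states_per_prefix`, and `ambiguous_prefixes` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df_geolocations; values are illustrative and the
# geolocation_state column name is an assumption.
df_geo_sample = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 1002, 1002],
    "geolocation_lat": [-23.55, -23.56, -23.60, -22.90],
    "geolocation_lng": [-46.63, -46.64, -46.70, -43.20],
    "geolocation_state": ["SP", "SP", "SP", "RJ"],
})

# Count the distinct states recorded for each zip prefix.
states_per_prefix = (
    df_geo_sample
    .groupby("geolocation_zip_code_prefix")["geolocation_state"]
    .nunique()
)

# Prefixes listed under more than one state need a closer look.
ambiguous_prefixes = states_per_prefix[states_per_prefix > 1]
print(ambiguous_prefixes)
```

An empty result on the real table would mean that every zip prefix belongs to exactly one state.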
df_geolocations = df_geolocations.groupby('geolocation_zip_code_prefix')[['geolocation_lat', 'geolocation_lng']].mean().reset_index()
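The aggregation above keeps only the zip prefix and the averaged coordinates, so per-state averages would have to be computed from the table before that step. A minimal sketch on toy data, assuming a `geolocation_state` column; `df_geo_sample`, `df_state_coords`, and `df_zip_coords` are hypothetical names, not ones used in the notebook.

```python
import pandas as pd

# Toy stand-in for the deduplicated geolocation table (illustrative values).
df_geo_sample = pd.DataFrame({
    "geolocation_zip_code_prefix": [1001, 1001, 2001],
    "geolocation_lat": [-23.0, -24.0, -22.0],
    "geolocation_lng": [-46.0, -47.0, -43.0],
    "geolocation_state": ["SP", "SP", "RJ"],  # assumed column name
})

coord_cols = ["geolocation_lat", "geolocation_lng"]

# Per-state average coordinates, taken while the state column still exists.
df_state_coords = (
    df_geo_sample.groupby("geolocation_state")[coord_cols].mean().reset_index()
)

# Per-zip-prefix averages, following the same groupby pattern as above.
df_zip_coords = (
    df_geo_sample.groupby("geolocation_zip_code_prefix")[coord_cols]
    .mean()
    .reset_index()
)

# One row per prefix keeps a later merge on zip prefix unambiguous.
assert df_zip_coords["geolocation_zip_code_prefix"].is_unique
```

Computing the state-level table first avoids reloading the data after the zip-level aggregation has dropped the state column.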