How to bulk-clean LinkedIn 'Location' fields into city/country pairs
Recruiters and sales ops teams inherit LinkedIn exports full of "Greater London," "Bay Area," and "remote." A practical workflow for turning those into clean city/country/region columns — at any scale.
Anyone who has worked with a LinkedIn export has hit this: thousands of rows where the Location column reads "Greater London," "San Francisco Bay Area," "Remote," "travelling — currently Lisbon," or just an emoji of a flag. Filtering by country breaks. Territory assignment breaks. Reporting breaks. Cleaning these up by hand is unworkable past a few hundred rows; the right approach is a pipeline.
What you actually want at the end
The output schema is small and fixed (sketched as a dataclass just after this list):
- city (string or null) — the most specific city you can confidently identify.
- region (string or null) — the state, province, or metro area, when a city alone is ambiguous ("Portland" → which Portland?).
- country (string) — the country name in your house style.
- country_code (string) — ISO 3166-1 alpha-2, the canonical join key.
- raw (string) — the original input. Keep this for audit and so you can re-run the pipeline when the parser improves.
- confidence (low | medium | high) — so you can route low-confidence rows to a review queue.
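If it helps to make that concrete, here is a minimal sketch of the record as a Python dataclass (the `ParsedLocation` name is just illustrative); `country` and `country_code` are optional here so the "too vague to answer" case discussed under tier 3 has somewhere to go:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ParsedLocation:
    raw: str                            # original input, kept for audit and re-runs
    city: Optional[str] = None          # most specific city confidently identified
    region: Optional[str] = None        # state / province / metro when the city alone is ambiguous
    country: Optional[str] = None       # country name in your house style
    country_code: Optional[str] = None  # ISO 3166-1 alpha-2, the canonical join key
    confidence: Literal["low", "medium", "high"] = "low"
```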
Don't try to add latitude/longitude unless you actually need it. Geocoding adds cost and rate limits with no upside if all you're doing is filtering by country.
The three-tier pipeline
No single library handles every flavor of free-text location well. Layer three approaches, escalating only when the previous tier fails (a dispatcher sketch follows this list):
- Try libpostal first. It's deterministic and free, and it correctly handles the structured 70% of inputs ("London, UK", "São Paulo, Brazil", "NYC, NY, USA") without any network call.
- Geocode the rest with Nominatim or the Google Geocoding API. Catches informal phrasings — "Bay Area", "Greater London", "DMV" — because it searches a real index instead of parsing a string.
- Fall back to an LLM with a strict schema for the messiest cases: "remote, mostly Lisbon-based", "GMT+1", multi-city contractors. The schema forces consistent output and an explicit confidence score.
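Wired together, the escalation is just a chain of fallbacks. A minimal sketch, assuming the `parse_with_libpostal` and `geocode_fallback` functions defined in the tier sections below, plus a hypothetical `llm_fallback` for tier 3:

```python
def parse_location(raw: str) -> dict:
    # Tier 1: deterministic, offline parse.
    result = parse_with_libpostal(raw)
    if result.get("country"):
        # You'd still normalize the country name to an ISO code downstream.
        return {**result, "raw": raw, "confidence": "high"}

    # Tier 2: geocode informal phrasings ("Bay Area", "Greater London", "DMV").
    result = geocode_fallback(raw)
    if result and result.get("country_code"):
        return {**result, "raw": raw, "confidence": "medium"}

    # Tier 3: LLM with a strict schema for the genuinely messy remainder.
    return llm_fallback(raw)  # hypothetical; returns the same shape with its own confidence
```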
Tier 1: libpostal
libpostal is a C library with Python bindings (pypostal) trained on OpenStreetMap. It returns typed components (city, state, country, postcode):
```python
from postal.parser import parse_address

def parse_with_libpostal(s: str):
    # parse_address returns (value, label) pairs, e.g. ("london", "city")
    parsed = dict((label, value) for value, label in parse_address(s))
    return {
        "city": parsed.get("city"),
        "region": parsed.get("state"),
        "country": parsed.get("country"),
    }
```
This handles cleanly formatted strings instantly with no network. Anything where it returns a `None` country is your trigger to go to tier 2.
Tier 2: Nominatim or Google
When libpostal can't resolve the country ("NYC", "Bay Area", "DMV"), geocode the string. Nominatim is free with a 1 req/sec rate limit; Google Geocoding is paid with a much higher limit. For LinkedIn-scale work you'll want Google or self-hosted Nominatim.
```python
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="my-app")

def geocode_fallback(s: str):
    loc = geo.geocode(s, addressdetails=True, language="en")
    if not loc:
        return None
    addr = loc.raw["address"]
    return {
        # Nominatim labels the settlement differently by size, so fall through.
        "city": addr.get("city") or addr.get("town") or addr.get("village"),
        "region": addr.get("state"),
        "country": addr.get("country"),
        "country_code": addr.get("country_code", "").upper(),
    }
```
Caching is non-optional at scale: "Greater London" appears thousands of times in a typical LinkedIn dataset. Hash the input string and cache the geocode result indefinitely; the answer doesn't change.
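A minimal way to get both, sketched with Python's shelve module for a persistent cache keyed on a hash of the input, plus geopy's RateLimiter to respect Nominatim's 1 req/sec policy (the cache filename and key normalization are illustrative choices, and `geo` is the Nominatim geocoder from above):

```python
import hashlib
import shelve

from geopy.extra.rate_limiter import RateLimiter

# Never call Nominatim faster than once per second.
rate_limited_geocode = RateLimiter(geo.geocode, min_delay_seconds=1)

def cached_geocode(s: str, cache_path: str = "geocode_cache.db"):
    key = hashlib.sha256(s.strip().lower().encode("utf-8")).hexdigest()
    with shelve.open(cache_path) as cache:
        if key in cache:
            return cache[key]
        loc = rate_limited_geocode(s, addressdetails=True, language="en")
        result = loc.raw["address"] if loc else None
        cache[key] = result  # cache misses too, so repeated failures don't re-hit the API
        return result
```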
Tier 3: LLM with a strict schema
The remaining 5–10% are genuinely ambiguous ("remote", "distributed across EU", "travelling"). For those, an LLM with a typed JSON schema is the only consistent way:
{ "city": null, "region": "European Union", "country": null, "country_code": null, "confidence": "low", "raw": "distributed across EU" }
Crucially, treat "the input is too vague to answer" as a valid output. An LLM that always returns a country will guess wrong on "Earth" or "the cloud". A schema that lets country be null and confidence be `low` makes those rows reviewable rather than poisoned.
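One way to make the schema enforceable is to define it with pydantic and hand the generated JSON Schema to whatever structured-output mechanism your LLM provider exposes; the sketch below (pydantic v2, model name illustrative) mirrors the output schema from the start of the piece, with every geographic field nullable:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class LocationGuess(BaseModel):
    city: Optional[str] = None
    region: Optional[str] = None
    country: Optional[str] = None
    country_code: Optional[str] = None  # ISO 3166-1 alpha-2, or null if unknowable
    confidence: Literal["low", "medium", "high"]
    raw: str

# LocationGuess.model_json_schema() gives you the JSON Schema to send with the request;
# LocationGuess.model_validate_json(response_text) rejects any reply that doesn't conform.
```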
Where to do this in practice
You don't have to build the pipeline yourself if the volume isn't worth it. For one-off cleanups of recruiter spreadsheets or sales-territory reassignments, you can drop the whole CSV into a tool that does the parsing in one pass and gives you back a normalized export.
Edge cases worth handling explicitly
- Disputed regions (Taiwan, Crimea, Kosovo, Western Sahara). Pick a stance and document it. Geocoders disagree, and the disagreement matters for compliance and contracts.
- Non-Latin scripts. "東京都" and "Tokyo" should normalize to the same country_code. libpostal handles common transliterations; for the long tail, the LLM tier picks them up.
- Multi-city contractors ("London / Berlin / Lisbon"). Either pick the first as primary and store the rest in a secondary column, or split the row. Document the choice.
- Time zones as a proxy ("GMT+1, mostly Spain"). Parse both the time zone and the inferred country; record confidence as medium.
- ISO 3166-1 alpha-2 vs alpha-3. Pick one and stick with it across your data warehouse — converting every join is more painful than picking once. If you do need to convert, see the snippet after this list.
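If you're stuck converting between the two anyway, pycountry does it without a hand-maintained mapping table (the function name here is just illustrative):

```python
import pycountry

def alpha_2_to_alpha_3(code: str):
    country = pycountry.countries.get(alpha_2=code.upper())
    return country.alpha_3 if country else None

print(alpha_2_to_alpha_3("GB"))  # GBR
print(alpha_2_to_alpha_3("DE"))  # DEU
```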
When to call it good
On a typical LinkedIn-style dataset, the three tiers cumulatively resolve about 70%, 95%, and 99% of rows. Don't chase the last 1%. Set a confidence threshold, route low-confidence rows to a review queue with the original string visible, and move on. The point of cleaning location data is to make filtering work — not to win an argument with the user who wrote "galaxy far, far away".