How to extract city, state, and country from a location string
"San Francisco Bay Area," "Greater London," "NYC." Free-text location fields are messy. Here's how to parse them into clean city/state/country with libpostal, geopy, or an LLM.
Location fields in CRM and HR systems are notoriously messy. The same person enters "NYC," "New York City," "New York, NY," "Manhattan," or "Greater New York Area" depending on the form. Parsing these into structured city/state/country is harder than it looks because the answer often isn't in the string.
1. Try libpostal first
libpostal is a C library with Python bindings (pypostal). It parses a freeform address string into typed components: city, state, country, postcode, etc. It's trained on OpenStreetMap data, so it handles non-English spellings, abbreviations, and informal forms:
```python
from postal.parser import parse_address

parse_address("São Paulo, Brazil")
# [('são paulo', 'city'), ('brazil', 'country')]

parse_address("NYC")
# [('nyc', 'city')]  <- city only, no country
```
Strength: deterministic, fast, free. Weakness: doesn't infer ("NYC" → US is something a human knows but libpostal doesn't).
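One pragmatic workaround for that inference gap is a small alias table applied before libpostal, expanding the informal names your data actually contains. A sketch — the `ALIASES` entries here are hypothetical examples, not a shipped list:

```python
# Hypothetical alias table: expand informal names libpostal can't infer
# before handing the string to parse_address. Tune to your own data.
ALIASES = {
    "nyc": "New York, NY, USA",
    "sf": "San Francisco, CA, USA",
    "san francisco bay area": "San Francisco, CA, USA",
    "greater london": "London, United Kingdom",
}

def expand_aliases(raw: str) -> str:
    # Fall through unchanged when the string isn't a known alias.
    return ALIASES.get(raw.strip().lower(), raw)
```

With this pre-pass, "NYC" becomes "New York, NY, USA" before parsing, and libpostal can then emit city, state, and country components deterministically.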
2. Geocode with Nominatim or Google
When the string is too informal for libpostal, geocode it. Nominatim (OpenStreetMap, free, with rate limits) returns structured address parts:
```python
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="my-app")
loc = geo.geocode("Greater London", addressdetails=True)
print(loc.raw["address"])
# {'city': 'London', 'country': 'United Kingdom', ...}
```
Strength: handles informal phrasings ("Bay Area", "NYC") because it searches a real index. Weakness: rate-limited (1 req/sec for Nominatim), and you're sending strings to a third party — check your data policy.
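At 1 request per second, deduplication matters: CRM exports repeat the same strings constantly. A minimal cache-and-throttle wrapper — a sketch, where `geocode` is any callable (for example, geopy's `Nominatim(...).geocode`):

```python
import time

def cached_geocode(geocode, query, cache, min_delay=1.0):
    """Memoize geocoding results so repeated strings are free and the
    rate limit only applies to cache misses."""
    key = query.strip().lower()
    if key in cache:
        return cache[key]
    result = geocode(query)
    cache[key] = result
    time.sleep(min_delay)  # stay under Nominatim's 1 req/sec policy
    return result
```

In a batch job, pass one shared `cache` dict (or swap it for something persistent) so a column of 50,000 rows collapses to a few thousand actual requests.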
3. LLM with a strict schema
For the messiest fields where someone wrote "remote, mostly Lisbon-based" or "travelling — currently Bali," an LLM with a typed schema is the only thing that gets close to consistent output. Schema:
```
{
  city: string | null,
  state_or_region: string | null,
  country: string,
  country_code: string,   // ISO 3166-1 alpha-2
  raw: string,
  confidence: "high" | "medium" | "low"
}
```
Pair this with libpostal as a first pass — let libpostal handle the easy 70%, fall back to LLM only for the strings it can't parse confidently. Cuts cost and adds determinism where you can have it.
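That two-tier routing can be sketched as follows. Here `parse` is a pypostal-style parser returning `(token, label)` pairs, and `llm_parse` stands in for whatever schema-constrained LLM call you use — both are injected, and the "confidently parsed" rule (libpostal found a city or country) is an assumption to tune:

```python
def parse_location(raw, parse, llm_parse):
    """Route a location string: cheap deterministic parse first,
    LLM fallback only when no usable components come back."""
    # pypostal returns [(token, label), ...]; invert to {label: token}
    components = {label: token for token, label in parse(raw)}
    if "city" in components or "country" in components:
        return {
            "city": components.get("city"),
            "state_or_region": components.get("state"),
            "country": components.get("country"),
            "source": "libpostal",
        }
    # Nothing usable: hand the messy string to the LLM.
    return {**llm_parse(raw), "source": "llm"}
```

Tagging each row with its `source` also makes it easy to audit how often the expensive path actually fires.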
Country from a string
If you only need the country: pycountry (Python) resolves official names, common names, and ISO codes to canonical ISO 3166 entries, and its fuzzy search catches near-misses. Combine it with the libpostal output's country field to get a clean ISO 3166-1 alpha-2 code.
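The idea in miniature — this alias table is a hypothetical hand-rolled stand-in for illustration; in practice pycountry ships the full ISO 3166 list and handles the lookup for you:

```python
# Hypothetical mini alias table mapping spellings to alpha-2 codes.
# pycountry covers the real, complete list.
COUNTRY_ALIASES = {
    "usa": "US", "united states": "US", "u.s.": "US",
    "uk": "GB", "united kingdom": "GB", "great britain": "GB",
    "brazil": "BR", "brasil": "BR",
    "germany": "DE", "deutschland": "DE",
}

def to_alpha2(name):
    # None signals "unknown"; route those rows to a fallback.
    return COUNTRY_ALIASES.get(name.strip().lower())
```

Returning `None` for unknowns (rather than guessing) keeps the failure mode explicit, so unresolved rows can flow into the LLM fallback or a review queue.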
Edge cases
- Disputed regions (Taiwan, Kosovo, Crimea) — pick a stance and document it; the geocoders disagree.
- Multi-city contractors ("London / Berlin / Lisbon") — schema needs to support a list, not a single value.
- Time zones as proxy ("GMT+1, mostly Spain") — extract both the time zone and the inferred country.
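For the multi-city case, a pre-split before parsing lets every downstream parser deal with a single place. A sketch — the separator set is an assumption to tune against your data (note it deliberately excludes commas, which "New York, NY" needs intact):

```python
import re

def split_locations(raw):
    """Split a multi-location string on common separators
    ("/", "|", ";", "and") so each part parses independently."""
    parts = re.split(r"\s*(?:/|\||;|\band\b)\s*", raw)
    return [p.strip() for p in parts if p.strip()]
```

Each resulting part can then go through the normal libpostal/geocoder/LLM pipeline, and the schema stores the results as a list.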
Where this comes up most
Recruiting (LinkedIn-style location fields), CRM data hygiene, sales territory assignment, and analytics on customer addresses. The cleanest path in production: libpostal + Nominatim + a small LLM fallback, in that order, with confidence scores per field so you can route low-confidence rows to a human review queue.
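The review-queue routing at the end of that pipeline can be as simple as partitioning on the schema's confidence field. A sketch — field names follow the schema above, and the auto-accept threshold is a policy choice, not a fixed rule:

```python
def partition_by_confidence(rows, auto_accept=("high",)):
    """Split parsed rows into auto-accepted records and a human
    review queue, based on each row's confidence label."""
    accepted, review = [], []
    for row in rows:
        bucket = accepted if row.get("confidence") in auto_accept else review
        bucket.append(row)
    return accepted, review
```

Widening `auto_accept` to `("high", "medium")` trades review workload for error rate; measuring the review queue's correction rate tells you which side of that trade you're on.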