Extract city and country from a location string: Python, JS, and LLM fallback

How to identify city, region, and country from location strings like Leiston, Greater London, NYC, or San Francisco Bay Area with Python, JavaScript, geocoding, and schema-based AI fallback.

To extract city and country from a location string, first parse obvious address components, then geocode ambiguous places, and finally use a schema-based AI fallback for informal text. A string like "Leiston" needs lookup context to resolve to Leiston, Suffolk, United Kingdom; a string like "Greater London" maps to London, United Kingdom, but only with region-level confidence, because it names a region rather than an exact city.

Location fields in CRM and HR systems are notoriously messy. The same place arrives as "NYC," "New York City," "New York, NY," "Manhattan," or "Greater New York Area" depending on the form and on who filled it in. Parsing these into structured city/state/country is harder than it looks because the country is often not written in the string at all.

Location string examples

| Input | City | Region | Country | Confidence |
| --- | --- | --- | --- | --- |
| Leiston | Leiston | Suffolk | United Kingdom | medium |
| Greater London | London | Greater London | United Kingdom | medium |
| NYC | New York City | New York | United States | high |
| San Francisco Bay Area | San Francisco | California | United States | medium |
| remote, mostly Lisbon-based | Lisbon | | Portugal | low |

Return confidence with every parsed row. "NYC" is high-confidence in most English business datasets; "Leiston" is medium without user country, timezone, or company context; "remote, mostly Lisbon-based" is low because it mixes work mode and location.
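One way to carry confidence with every row is a small record type. This is a minimal sketch, not a prescribed format: the `ParsedLocation` name and its fields are illustrative assumptions, and the sample rows are copied from the examples above.

```python
from dataclasses import asdict, dataclass
from typing import Literal, Optional

Confidence = Literal["high", "medium", "low"]


@dataclass(frozen=True)
class ParsedLocation:
    """One parsed row; the confidence label travels with the data."""
    raw: str
    city: Optional[str]
    region: Optional[str]
    country: Optional[str]
    confidence: Confidence


# Sample rows taken from the examples in this article.
rows = [
    ParsedLocation("NYC", "New York City", "New York", "United States", "high"),
    ParsedLocation("Leiston", "Leiston", "Suffolk", "United Kingdom", "medium"),
    ParsedLocation("remote, mostly Lisbon-based", "Lisbon", None, "Portugal", "low"),
]

for row in rows:
    print(asdict(row))
```

Keeping `raw` on the record means a reviewer can always see what the person actually typed.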

The three-step parser

There's no single library that handles every form of free-text location well. The pipeline that works in production layers three approaches, escalating only when each one fails:

  1. Run the string through libpostal first. It's deterministic and handles structured strings like "São Paulo, Brazil" or "London, UK" cleanly without any network call.
  2. Geocode anything libpostal didn't fully resolve with Nominatim or the Google Geocoding API. This catches informal phrasings like "NYC," "Bay Area," or "Greater London."
  3. Fall back to an LLM with a strict JSON schema for the messiest fields — "remote, mostly Lisbon-based," "GMT+1," or multi-city contractors. The schema forces consistent output and a confidence score per field.
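The escalation logic itself can be sketched independently of the three tools. In the sketch below, `parse_location`, `is_resolved`, and the two stand-in steps are hypothetical names used only for illustration; real steps would wrap libpostal, a geocoder, and an LLM call. "Resolved" is assumed here to mean "a country was found", which you may want to tighten.

```python
from typing import Callable, Optional

# Each step takes the raw string and returns a dict with "city" and
# "country" keys, or None when it cannot resolve the input.
Step = Callable[[str], Optional[dict]]


def is_resolved(result: Optional[dict]) -> bool:
    """Assumption: a result counts as resolved once it names a country."""
    return bool(result and result.get("country"))


def parse_location(raw: str, steps: list[tuple[str, Step]]) -> dict:
    """Try each step in order and stop at the first resolved result."""
    for name, step in steps:
        result = step(raw)
        if is_resolved(result):
            return {**result, "raw": raw, "source": name}
    return {"city": None, "country": None, "raw": raw, "source": None}


# Stand-in steps for illustration only.
def fake_libpostal(raw: str) -> Optional[dict]:
    if "," in raw:
        city, country = (part.strip() for part in raw.split(",", 1))
        return {"city": city, "country": country}
    return None


def fake_geocoder(raw: str) -> Optional[dict]:
    known = {"NYC": {"city": "New York City", "country": "United States"}}
    return known.get(raw)


steps = [("libpostal", fake_libpostal), ("geocoder", fake_geocoder)]
print(parse_location("London, UK", steps)["source"])  # libpostal
print(parse_location("NYC", steps)["source"])         # geocoder
```

Recording which step produced the answer (`source`) makes it easier to audit how often each layer is actually needed.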

Below: the code for each step, where it shines, and where it falls over.

1. Try libpostal first

libpostal is a C library with Python bindings (pypostal). It parses a freeform address string into labeled components such as city, state, country, and postcode. Because it is trained on OpenStreetMap data, it handles non-English spellings, abbreviations, and informal forms:

    from postal.parser import parse_address

    parse_address("São Paulo, Brazil")
    # [('são paulo', 'city'), ('brazil', 'country')]

    parse_address("NYC")
    # [('nyc', 'city')]  <- city only, no country

Strength: deterministic, fast, free. Weakness: doesn't infer ("NYC" → US is something a human knows but libpostal doesn't).
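To decide when to escalate, you can check which labels libpostal actually returned. The helpers below are a sketch with hypothetical names (`components_to_dict`, `needs_geocoding`); they work on the list of (value, label) pairs and do not need libpostal installed. The rule "escalate unless both city and country are present" is an assumption you may want to adjust.

```python
def components_to_dict(parsed: list[tuple[str, str]]) -> dict[str, str]:
    """Turn (value, label) pairs into a label -> value dict.

    If a label repeats, the first value wins.
    """
    out: dict[str, str] = {}
    for value, label in parsed:
        out.setdefault(label, value)
    return out


def needs_geocoding(parsed: list[tuple[str, str]]) -> bool:
    """Escalate when either the city or the country label is missing."""
    fields = components_to_dict(parsed)
    return not ("city" in fields and "country" in fields)


# Pairs copied from the parse_address outputs shown above.
print(needs_geocoding([("são paulo", "city"), ("brazil", "country")]))  # False
print(needs_geocoding([("nyc", "city")]))                               # True
```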

2. Geocode with Nominatim or Google

When the string is too informal for libpostal, geocode it. Nominatim (OpenStreetMap, free, with rate limits) returns structured address parts:

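    from geopy.geocoders import Nominatim

    geo = Nominatim(user_agent="my-app")  # use a name that identifies your app
    loc = geo.geocode("Greater London", addressdetails=True)

    # geocode() returns None when nothing matches
    if loc is not None:
        print(loc.raw["address"])
        # e.g. {'city': 'London', 'country': 'United Kingdom', ...}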

Strength: handles informal phrasings ("Bay Area", "NYC") because it searches a real index. Weakness: rate-limited (1 req/sec for Nominatim), and you're sending strings to a third party — check your data policy.
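To stay under a one-request-per-second limit, calls can be spaced out with a small wrapper. The `Throttled` class below is a hypothetical, standard-library-only sketch; the injectable `clock` and `sleep` parameters exist only to make it easy to test. geopy also ships its own `RateLimiter` helper in `geopy.extra.rate_limiter`, which is likely the better choice if you already depend on geopy.

```python
import time
from typing import Callable, Optional


class Throttled:
    """Wrap a callable so consecutive calls are at least min_interval apart."""

    def __init__(self, func: Callable, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._func = func
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def __call__(self, *args, **kwargs):
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        return self._func(*args, **kwargs)


# Demo with a stand-in function instead of a real geocoder.
lookup = Throttled(lambda q: q.title(), min_interval=0.2)
start = time.monotonic()
lookup("greater london")
lookup("bay area")
print(f"two calls took {time.monotonic() - start:.1f}s")
```

With a real geocoder you would wrap the lookup function once, for example `geocode = Throttled(geo.geocode, min_interval=1.0)`, and call `geocode(...)` everywhere else. This sketch is not thread-safe.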

3. LLM with a strict schema

For the messiest fields where someone wrote "remote, mostly Lisbon-based" or "travelling — currently Bali," an LLM with a typed schema is the only thing that gets close to consistent output. Schema:

    {
      city: string | null,
      state_or_region: string | null,
      country: string,
      country_code: string,  // ISO 3166-1 alpha-2
      raw: string,
      confidence: "high" | "medium" | "low"
    }

Pair this with libpostal as a first pass: let libpostal handle the easy 70% and fall back to the LLM only for the strings it can't parse confidently. That cuts cost and keeps the pipeline deterministic wherever it can be.
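A schema only helps if you reject output that does not match it. The function below is a minimal sketch of that check using only the standard library; `validate_llm_location` is a hypothetical name, and the country-code test only verifies the shape (two uppercase letters), not that the code is actually assigned in ISO 3166-1.

```python
import json
from typing import Optional

CONFIDENCE_LEVELS = {"high", "medium", "low"}
NULLABLE_TEXT = ("city", "state_or_region")
REQUIRED_TEXT = ("country", "country_code", "raw")


def validate_llm_location(payload: str) -> Optional[dict]:
    """Return the parsed object if it matches the schema above, else None."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key in NULLABLE_TEXT:
        if key not in data or not (data[key] is None or isinstance(data[key], str)):
            return None
    for key in REQUIRED_TEXT:
        if not isinstance(data.get(key), str) or not data[key]:
            return None
    code = data["country_code"]
    if len(code) != 2 or not code.isalpha() or not code.isupper():
        return None
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in CONFIDENCE_LEVELS:
        return None
    return data


good = ('{"city": "Lisbon", "state_or_region": null, "country": "Portugal", '
        '"country_code": "PT", "raw": "remote, mostly Lisbon-based", '
        '"confidence": "low"}')
bad = '{"city": "Lisbon", "country": "Portugal", "confidence": "certain"}'

print(validate_llm_location(good) is not None)  # True
print(validate_llm_location(bad) is not None)   # False
```

Rows that fail validation can be retried or sent to review rather than written to the database.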

JavaScript option

In JavaScript, use a geocoding API rather than trying to maintain a world city list in your app. Normalize the API response into your own small schema so you can swap providers later:

    type ParsedLocation = {
      city: string | null;
      region: string | null;
      country: string | null;
      countryCode: string | null;
      confidence: "high" | "medium" | "low";
    };

    async function parseLocation(input: string): Promise<ParsedLocation> {
      const url = new URL("https://nominatim.openstreetmap.org/search");
      url.searchParams.set("q", input);
      url.searchParams.set("format", "jsonv2");
      url.searchParams.set("addressdetails", "1");
      url.searchParams.set("limit", "1");

      const res = await fetch(url, {
        headers: { "User-Agent": "your-app/1.0" },
      });
      // Fail loudly on rate limiting or server errors instead of parsing an error body.
      if (!res.ok) throw new Error(`Geocoder returned HTTP ${res.status}`);

      // An empty result array leaves `match` undefined.
      const [match] = await res.json();
      const a = match?.address ?? {};

      return {
        city: a.city ?? a.town ?? a.village ?? null,
        region: a.state ?? a.county ?? null,
        country: a.country ?? null,
        countryCode: a.country_code?.toUpperCase() ?? null,
        confidence: match ? "medium" : "low",
      };
    }

Nominatim requires a real User-Agent and conservative rate limiting. For production CRM enrichment, cache results by normalized input string and use a paid geocoder when you need service-level guarantees.
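The caching advice is not specific to JavaScript. Here is a minimal sketch in Python, assuming an in-memory dict is enough; `cache_key` and `CachedGeocoder` are hypothetical names, and a production system would more likely back this with a database or a key-value store.

```python
import unicodedata
from typing import Callable, Optional


def cache_key(raw: str) -> str:
    """Normalize a location string so trivial variants share one key."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.casefold().split())


class CachedGeocoder:
    """Memoize a geocoding function by normalized input string."""

    def __init__(self, geocode: Callable[[str], Optional[dict]]) -> None:
        self._geocode = geocode
        self._cache: dict[str, Optional[dict]] = {}

    def __call__(self, raw: str) -> Optional[dict]:
        key = cache_key(raw)
        if key not in self._cache:
            # Misses (None) are cached too, so bad inputs are not retried.
            self._cache[key] = self._geocode(raw)
        return self._cache[key]


# Stand-in geocoder that records how often it is really called.
calls = []


def fake_geocode(raw: str) -> Optional[dict]:
    calls.append(raw)
    return {"city": "London", "country": "United Kingdom"}


geocode = CachedGeocoder(fake_geocode)
geocode("Greater London")
geocode("  greater   LONDON ")
print(len(calls))  # 1
```

This normalization only folds case, Unicode compatibility forms, and whitespace. Punctuation variants such as "London,UK" and "London, UK" still produce different keys.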

Country from a string

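If you only need the country: pycountry (Python) resolves ISO codes and official or common country names to a canonical record. Informal spellings that are neither, such as "UK", may need a small alias table of your own. Combine it with the country field from the libpostal output to get a clean ISO 3166-1 alpha-2 code.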
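A sketch of that lookup, with an alias table checked first. The `ALIASES` entries and the `to_alpha2` name are illustrative assumptions; `pycountry.countries.lookup` is the library's case-insensitive lookup and raises `LookupError` when nothing matches. pycountry is imported lazily here so the alias path still works when the package is not installed.

```python
from typing import Optional

# Informal names that are neither ISO names nor ISO codes.
# Illustrative only; extend this table from your own data.
ALIASES = {
    "uk": "GB",
    "great britain": "GB",
    "usa": "US",
    "u.s.a.": "US",
    "holland": "NL",
}


def to_alpha2(text: str) -> Optional[str]:
    """Map a country string to an ISO 3166-1 alpha-2 code, or None."""
    key = " ".join(text.casefold().split())
    if key in ALIASES:
        return ALIASES[key]
    try:
        import pycountry  # third-party: pip install pycountry
    except ImportError:
        return None
    try:
        return pycountry.countries.lookup(key).alpha_2
    except LookupError:
        return None


print(to_alpha2("UK"))      # GB
print(to_alpha2(" usa "))   # US
print(to_alpha2("brazil"))  # BR when pycountry is installed, otherwise None
```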

Edge cases

  • Disputed regions (Taiwan, Kosovo, Crimea) — pick a stance and document it; the geocoders disagree.
  • Multi-city contractors ("London / Berlin / Lisbon") — schema needs to support a list, not a single value.
  • Time zones as proxy ("GMT+1, mostly Spain") — extract both the time zone and the inferred country.
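Two of these edge cases can be pre-processed before any parser sees the string. The helpers below are a rough sketch with hypothetical names; the separator set (`/`, `;`, `|`) and the GMT/UTC offset pattern are assumptions, and splitting on `/` will misfire on values like "N/A".

```python
import re
from typing import Optional

# Matches offsets such as "GMT+1", "UTC-5", or "UTC+05:30".
TZ_PATTERN = re.compile(
    r"\b(?:GMT|UTC)\s*([+-])\s*(\d{1,2})(?::?(\d{2}))?\b", re.IGNORECASE
)


def split_locations(raw: str) -> list[str]:
    """Split a multi-city string into separate candidates."""
    parts = re.split(r"\s*(?:/|;|\|)\s*", raw)
    return [p.strip() for p in parts if p.strip()]


def extract_utc_offset(raw: str) -> tuple[Optional[str], str]:
    """Return (offset, remainder); offset is None when no time zone is found."""
    match = TZ_PATTERN.search(raw)
    if not match:
        return None, raw.strip()
    sign = match.group(1)
    hours = int(match.group(2))
    minutes = match.group(3) or "00"
    remainder = (raw[:match.start()] + raw[match.end():]).strip(" ,")
    return f"{sign}{hours:02d}:{minutes}", remainder


print(split_locations("London / Berlin / Lisbon"))
# ['London', 'Berlin', 'Lisbon']
print(extract_utc_offset("GMT+1, mostly Spain"))
# ('+01:00', 'mostly Spain')
```

Each candidate from `split_locations` can then go through the normal pipeline, and the remainder from `extract_utc_offset` can be parsed as a location while the offset is stored alongside it.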

Where this comes up most

Recruiting (LinkedIn-style location fields), CRM data hygiene, sales territory assignment, and analytics on customer addresses. The cleanest path in production: libpostal + Nominatim + a small LLM fallback, in that order, with confidence scores per field so you can route low-confidence rows to a human review queue.
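Routing on per-field confidence can be as simple as the sketch below. The row layout, the `route` name, and the policy "a row is only as trustworthy as its weakest field" are assumptions for illustration.

```python
RANK = {"low": 0, "medium": 1, "high": 2}


def row_confidence(field_confidence: dict[str, str]) -> str:
    """Assumed policy: a row is only as trustworthy as its weakest field."""
    return min(field_confidence.values(), key=RANK.__getitem__)


def route(rows: list[dict], minimum: str = "medium") -> tuple[list[dict], list[dict]]:
    """Split rows into (accepted, needs_review) by their weakest field."""
    accepted, review = [], []
    for row in rows:
        weakest = row_confidence(row["confidence"])
        target = accepted if RANK[weakest] >= RANK[minimum] else review
        target.append(row)
    return accepted, review


rows = [
    {"raw": "NYC", "confidence": {"city": "high", "country": "high"}},
    {"raw": "remote, mostly Lisbon-based",
     "confidence": {"city": "low", "country": "medium"}},
]

accepted, review = route(rows)
print([r["raw"] for r in review])  # ['remote, mostly Lisbon-based']
```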
