How to extract phone numbers from text
A regex that handles most cases, the libphonenumber library that handles the rest, and what to do when the phone numbers are trapped inside PDFs, screenshots, or messaging exports.
"Find all the phone numbers in this blob of text" sounds simple. It is, until you hit international formats, extensions, country codes written without the +, and the fact that a 9-digit string might be a phone number or might be an order ID.
The regex that works for 80% of cases
For US/EU-style numbers in reasonably clean text:
\+?[\d\s\-().]{7,}\d
This catches +44 20 7946 0958, (415) 555-2671, and 020-7946-0958. It also catches things that aren't phone numbers, which is why this is a starting point, not a finished tool.
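To see both the hits and the false positives, here is a minimal Python sketch that applies the pattern above and then drops matches with an implausible digit count. The helper name find_candidates and the 7-to-15 digit window are illustrative assumptions (15 is the maximum length E.164 allows), not part of any library.

```python
import re

# The starting-point pattern from above: optional "+", then 7 or more digits,
# whitespace, or punctuation characters, ending on a digit.
PHONE_RE = re.compile(r"\+?[\d\s\-().]{7,}\d")


def find_candidates(text, min_digits=7, max_digits=15):
    """Return regex matches whose digit count is plausible for a phone number."""
    candidates = []
    for match in PHONE_RE.finditer(text):
        raw = match.group().strip()  # the pattern can pick up leading whitespace
        digit_count = sum(ch.isdigit() for ch in raw)
        if min_digits <= digit_count <= max_digits:
            candidates.append(raw)
    return candidates


sample = (
    "Call (415) 555-2671 or +44 20 7946 0958. "
    "Card 4111 1111 1111 1111, order 2024-01-15-000123."
)
print(find_candidates(sample))
# ['(415) 555-2671', '+44 20 7946 0958', '2024-01-15-000123']
```

The 16-digit card number is filtered out by the digit window, but the order ID survives because it has 14 digits. A length check alone cannot tell it apart from a phone number.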
The right answer: libphonenumber
Google's libphonenumber library encodes the numbering plans of essentially every country and territory, including length rules, valid prefixes, and formatting. Google maintains the Java, C++, and JavaScript versions; community ports cover Python, C#, Ruby, and Go.
import phonenumbers

for match in phonenumbers.PhoneNumberMatcher(text, "US"):
    num = match.number
    print(phonenumbers.format_number(num, phonenumbers.PhoneNumberFormat.E164))
PhoneNumberMatcher walks the text and yields only the matches that pass libphonenumber's validation. The country hint ("US") tells it how to interpret numbers without a + prefix. The E164 output format (E.164) normalizes everything to + followed by the country code and national number, digits only, which is what you want for any database or downstream system.
WhatsApp / messaging exports
Group exports often hide phone numbers behind display names. The contact list is in a separate vCard or CSV that the platform doesn't always export. If all you have is a chat log, libphonenumber on the message bodies will catch numbers people have typed in messages, but won't surface participant numbers that the platform never wrote down in the export.
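As a rough sketch of scanning only the message bodies, the snippet below splits each log line into sender and body before looking for numbers. The line layout ("<timestamp> - <sender>: <message>") is an assumption; real exports differ by platform and locale. The simple regex from earlier stands in for the extractor so the example has no dependencies, and PhoneNumberMatcher could replace it. The function name numbers_by_sender is hypothetical.

```python
import re

# Assumed layout: "<timestamp> - <sender>: <message>". Adjust for your export.
LINE_RE = re.compile(r"^(?P<stamp>.+?) - (?P<sender>[^:]+): (?P<body>.*)$")
PHONE_RE = re.compile(r"\+?[\d\s\-().]{7,}\d")


def numbers_by_sender(chat_log):
    """Map each sender to the phone-like strings typed in their messages."""
    found = {}
    sender = None
    for line in chat_log.splitlines():
        m = LINE_RE.match(line)
        if m:
            sender, body = m.group("sender"), m.group("body")
        else:
            body = line  # treat as a continuation of the previous message
        if sender is None:
            continue  # text before the first recognizable message line
        hits = [h.strip() for h in PHONE_RE.findall(body)]
        if hits:
            found.setdefault(sender, []).extend(hits)
    return found


log = (
    "1/15/24, 9:41 AM - Alice: call me on +44 20 7946 0958\n"
    "1/15/24, 9:42 AM - Bob: ok\n"
    "1/15/24, 9:43 AM - Bob: my new one is\n"
    "(415) 555-2671"
)
print(numbers_by_sender(log))
# {'Alice': ['+44 20 7946 0958'], 'Bob': ['(415) 555-2671']}
```

Keeping the timestamp out of the scanned text avoids matching date and time digits, and the result only ever contains numbers that someone typed into a message.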
Phone numbers inside PDFs and images
Business cards, screenshots of contact pages, scanned forms — the text isn't selectable, so regex doesn't help. Two paths: OCR the image first (Tesseract, ocrmypdf), then run libphonenumber on the output; or use ExtractFox's image data extractor with a prompt like "extract every phone number as a column" and skip the OCR step.
Validation and deduplication
Always normalize to E.164 before deduping. (415) 555-2671 and +1-415-555-2671 are the same number, assuming a US default region, but a database comparing strings sees two different values. Once both are stored as +14155552671, they dedupe correctly.