Engineering · May 3, 2026 · 5 min read

How to extract sentiment from text

VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.

By Dawid Sibinski

Sentiment extraction looks like one task and is really four: lexicon scoring ("how positive does this text sound?"), classification (positive / negative / neutral), aspect-based sentiment ("food great, service slow"), and emotion detection (anger, joy, fear, sadness). Pick the wrong one and you'll either miss what you actually need or build a complicated pipeline for what should be a one-liner.

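VADER: lexicon scoring, minimal setup

Ships with NLTK; the only setup is a one-time download of the vader_lexicon resource. Tuned for social media: handles emoji, slang, capitalization-as-emphasis, and exclamation points. Returns neg, neu, and pos proportions plus a compound score from -1 to +1.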

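    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

    sia = SentimentIntensityAnalyzer()
    sia.polarity_scores("This update is AMAZING!! 🎉")
    # {'neg': 0.0, 'neu': 0.32, 'pos': 0.68, 'compound': 0.84}  (illustrative values)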

Strength: zero-cost, deterministic, runs offline, fine for tweets and comments. Weakness: struggles with long-form text, sarcasm, and any domain (medical, legal, financial) where the lexicon doesn't fit.
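
If you need a label rather than a score, the compound value can be bucketed. The sketch below assumes the ±0.05 cutoffs commonly used with VADER; the function name `label_from_compound` and its `threshold` parameter are illustrative, not part of NLTK, and the cutoffs are worth tuning on your own data.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Bucket a VADER compound score (-1 to +1) into a coarse label.

    The 0.05 default is an assumed, conventional cutoff, not a fixed rule.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError(f"compound score out of range: {compound}")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Feed it the 'compound' entry of the dict returned by polarity_scores().
print(label_from_compound(0.84))
print(label_from_compound(-0.02))
```

Widening the threshold sends more borderline text to "neutral", which is often the safer default when the labels feed a dashboard or an alert.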

TextBlob: simpler API

    from textblob import TextBlob

    TextBlob("This was disappointing.").sentiment
    # Sentiment(polarity=-0.75, subjectivity=0.75)  (illustrative values)

Polarity from -1 to +1, subjectivity from 0 to 1. Slightly more conservative than VADER on emphatic text. Good baseline; not a production choice.

Fine-tuned transformers via Hugging Face

The middle ground for production English-language sentiment. Pretrained models include cardiffnlp/twitter-roberta-base-sentiment-latest (trained on tweets, three-way labels) and distilbert-base-uncased-finetuned-sst-2-english (trained on movie reviews, positive/negative only):

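    from transformers import pipeline

    # Name the model explicitly rather than relying on the pipeline default
    clf = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    clf("The new release crashes constantly.")
    # [{'label': 'NEGATIVE', 'score': 0.998}]  (score is illustrative)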

Strength: handles nuance VADER misses. Weakness: domain mismatch is real — a model fine-tuned on movie reviews underperforms on legal documents. Pick a model trained on text resembling yours.
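
A binary model such as the SST-2 one has no neutral class. One rough workaround is to report low-confidence predictions as neutral. The sketch below works on the list-of-dicts shape the pipeline returns; the helper name `to_three_way` and the 0.75 cutoff are assumptions for illustration. Low confidence is not the same thing as neutral sentiment, so treat this as a stopgap rather than a replacement for a three-way model.

```python
from typing import Dict, List


def to_three_way(results: List[Dict], min_score: float = 0.75) -> List[str]:
    """Map pipeline-style results to positive / negative / neutral.

    Each item is expected to look like {'label': 'NEGATIVE', 'score': 0.998}.
    Predictions scoring below min_score (an assumed cutoff) become 'neutral'.
    """
    labels = []
    for item in results:
        label = str(item["label"]).lower()
        if label not in ("positive", "negative", "neutral"):
            raise ValueError(f"unrecognized label: {item['label']!r}")
        labels.append(label if item["score"] >= min_score else "neutral")
    return labels


# Hand-written inputs in the pipeline's output shape, not real model output.
print(to_three_way([
    {"label": "NEGATIVE", "score": 0.998},
    {"label": "POSITIVE", "score": 0.55},
]))
```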

LLM with a typed schema

When you need more than a single label (aspect-based sentiment, emotion dimensions, evidence quotes), an LLM constrained to a typed schema can return all of it in one call, which none of the methods above do out of the box. Schema:

    {
      overall_sentiment: "positive" | "negative" | "neutral" | "mixed",
      confidence: number,
      aspects: {
        aspect: string,
        sentiment: string,
        evidence_quote: string
      }[],
      emotions: {
        emotion: "joy" | "anger" | "sadness" | "fear" | "surprise" | "disgust",
        intensity: number
      }[]
    }

Aspect-based sentiment is where this pattern shines: "food great, service slow, room dated" decomposes into three aspect-sentiment pairs, which the single-score methods above cannot produce.
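
Whatever model or provider you use, it helps to validate the reply against the schema before trusting it. Below is a minimal standard-library sketch that mirrors the schema above with dataclasses. The names `parse_sentiment` and `sample_reply` are illustrative, the sample is hand-written rather than real model output, and the call to the LLM itself is left out because it depends on your provider.

```python
import json
from dataclasses import dataclass
from typing import List

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
ALLOWED_EMOTIONS = {"joy", "anger", "sadness", "fear", "surprise", "disgust"}


@dataclass
class Aspect:
    aspect: str
    sentiment: str
    evidence_quote: str


@dataclass
class Emotion:
    emotion: str
    intensity: float


@dataclass
class SentimentResult:
    overall_sentiment: str
    confidence: float
    aspects: List[Aspect]
    emotions: List[Emotion]


def parse_sentiment(raw: str) -> SentimentResult:
    """Parse a JSON reply and reject values outside the schema's enums."""
    data = json.loads(raw)
    if data["overall_sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected overall_sentiment: {data['overall_sentiment']!r}")
    # Aspect(**a) raises TypeError if a key is missing or unexpected.
    aspects = [Aspect(**a) for a in data["aspects"]]
    emotions = [Emotion(**e) for e in data["emotions"]]
    for e in emotions:
        if e.emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"unexpected emotion: {e.emotion!r}")
    return SentimentResult(
        overall_sentiment=data["overall_sentiment"],
        confidence=float(data["confidence"]),
        aspects=aspects,
        emotions=emotions,
    )


# Hand-written reply for "food great, service slow, room dated" (not real model output).
sample_reply = json.dumps({
    "overall_sentiment": "mixed",
    "confidence": 0.8,
    "aspects": [
        {"aspect": "food", "sentiment": "positive", "evidence_quote": "food great"},
        {"aspect": "service", "sentiment": "negative", "evidence_quote": "service slow"},
        {"aspect": "room", "sentiment": "negative", "evidence_quote": "room dated"},
    ],
    "emotions": [{"emotion": "joy", "intensity": 0.4}],
})

result = parse_sentiment(sample_reply)
print([(a.aspect, a.sentiment) for a in result.aspects])
```

Failing loudly on an out-of-schema value is a deliberate choice here: a retry or a logged rejection is usually cheaper than letting a malformed label flow into downstream reports.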

RapidMiner

RapidMiner's sentiment operators wrap a few classical methods (lexicon-based and SVM-based classifiers) behind a no-code GUI. Useful if you're already in the RapidMiner ecosystem; you can do the same in 5 lines of Python with VADER or transformers.

Choosing

  • Tweets, comments, short text → VADER. Done.
  • Movie/product reviews → distilbert SST-2 from Hugging Face; for social-media text that needs more nuance than VADER, the cardiffnlp Twitter model.
  • Domain-specific (medical, legal, financial) → fine-tune your own model on a labeled set, or use an LLM with a schema.
  • Aspect-based sentiment, emotion dimensions, evidence quotes → LLM with typed schema.
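
The decision list above can be written down as a small routing function, which is handy when one service receives several kinds of text. This is only a sketch: the category names (`short_social`, `reviews`, `domain_specific`) and the returned method identifiers are made up for illustration.

```python
def choose_method(text_kind: str, needs_structure: bool = False) -> str:
    """Return a method identifier following the decision list above.

    needs_structure covers aspects, emotion dimensions, and evidence quotes.
    """
    if needs_structure:
        return "llm_with_schema"
    routes = {
        "short_social": "vader",
        "reviews": "huggingface_pretrained",
        "domain_specific": "fine_tuned_or_llm_with_schema",
    }
    if text_kind not in routes:
        raise ValueError(f"unknown text kind: {text_kind!r}")
    return routes[text_kind]


print(choose_method("short_social"))
print(choose_method("reviews", needs_structure=True))
```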

When the text is locked in a document

Customer feedback often arrives as PDFs (survey exports), images (screenshots of reviews), or scattered across emails. The sentiment-analysis libraries above all assume you have clean text strings. ExtractFox handles the document-to-text step — the structured output drops straight into any of the methods above.
