Sentiment Analysis
What you'll learn
- How VADER combines a sentiment lexicon with grammatical rules to score text
- How SentiWordNet extends WordNet with positivity, negativity, and objectivity scores
- The role of negation, intensifiers, capitalization, and punctuation in sentiment
- How to compute compound sentiment scores and interpret them
Introduction
Sentiment analysis asks a deceptively simple question: is this text positive, negative, or neutral? While humans can answer intuitively, automating the task requires choosing between two broad strategies: lexicon-based methods that use pre-scored word lists, and machine-learned classifiers trained on labeled data. This chapter focuses on the lexicon-based approach, which is transparent, interpretable, and surprisingly effective.
We'll study two influential systems: VADER, which layers grammatical heuristics on top of a crowd-sourced lexicon, and SentiWordNet, which assigns sentiment scores to every synset in WordNet. Together they illustrate the key trade-offs in rule-based sentiment analysis: simplicity vs. coverage, and precision vs. word-sense ambiguity.
VADER (Valence Aware Dictionary and sEntiment Reasoner)
Think of it this way: VADER starts with a dictionary where every word gets a human-rated valence score from -4 (extremely negative) to +4 (extremely positive). Then it applies five grammatical rules that humans naturally use when conveying sentiment: negation ("not good"), intensification ("very good"), capitalization ("GOOD"), punctuation emphasis ("good!!!"), and conjunctions ("good but not great"). The final compound score normalizes everything to [-1, +1].
The VADER Rule Pipeline
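Before summing, VADER adjusts each word's base valence according to the rules listed above. The worked example below applies three of them in turn (intensification, negation, and punctuation emphasis) and then normalizes the result.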
Worked Example
Sentence: "The food here is not very good!!!"
- Look up base valence scores: "good" = +1.9 (the only sentiment word found in the lexicon).
- Apply intensifier rule: "very" is an intensifier (boost = 0.293): \(1.9 \times (1 + 0.293) = 2.457\)
- Apply negation rule: "not" appears before "good" (within 3 words): \(2.457 \times (-0.74) = -1.818\)
- Apply punctuation rule: 3 exclamation marks, each pushing the score 0.292 further from zero. Since the adjusted score is negative: \(s = -1.818 - 3 \times 0.292 = -2.694\)
- Compute compound score (\(\alpha = 15\)): \(\text{compound} = \frac{-2.694}{\sqrt{(-2.694)^2 + 15}} = \frac{-2.694}{\sqrt{7.258 + 15}} = \frac{-2.694}{4.718} = -0.571\)
A compound score of -0.571 indicates moderately negative sentiment. The negation of "good" plus the exclamation emphasis made a would-be positive sentence clearly negative.
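The arithmetic above can be reproduced with a short Python sketch. It is a toy illustration, not VADER's actual code: the one-word lexicon, the constant names, and the functions `word_scores`, `compound`, and `label` are all assumptions made for this example. It lowercases the input, so the capitalization and conjunction rules are left out. The ±0.05 cutoff in `label` is the convention commonly cited for VADER; the text above does not state one.

```python
import math
import re

# Toy values taken from the worked example; a real lexicon has thousands of entries.
LEXICON = {"good": 1.9}
INTENSIFIERS = {"very": 0.293}
NEGATORS = {"not"}
NEGATION_SCALAR = -0.74
NEGATION_WINDOW = 3        # how many tokens back to look for a negator
EXCLAMATION_BOOST = 0.292  # emphasis added per "!"
ALPHA = 15                 # normalization constant


def word_scores(tokens):
    """Return one adjusted valence per token (0.0 for words not in the lexicon)."""
    scores = []
    for i, token in enumerate(tokens):
        score = LEXICON.get(token, 0.0)
        if score != 0.0:
            # Intensifier rule: boost if the previous token is an intensifier.
            if i > 0 and tokens[i - 1] in INTENSIFIERS:
                score *= 1 + INTENSIFIERS[tokens[i - 1]]
            # Negation rule: flip and dampen if a negator is within the window.
            window = tokens[max(0, i - NEGATION_WINDOW):i]
            if any(t in NEGATORS for t in window):
                score *= NEGATION_SCALAR
        scores.append(score)
    return scores


def compound(text):
    """Sum adjusted valences, add punctuation emphasis, squash into (-1, +1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = sum(word_scores(tokens))
    emphasis = text.count("!") * EXCLAMATION_BOOST
    if total > 0:
        total += emphasis
    elif total < 0:
        total -= emphasis
    return total / math.sqrt(total * total + ALPHA)


def label(compound_score, threshold=0.05):
    """Map a compound score to a coarse class (threshold is an assumed convention)."""
    if compound_score >= threshold:
        return "positive"
    if compound_score <= -threshold:
        return "negative"
    return "neutral"


score = compound("The food here is not very good!!!")
print(round(score, 3), label(score))  # -0.571 negative
```

Dropping "not" from the sentence leaves the intensified score positive, so the same function returns a positive compound. That makes it easy to see how much each rule contributes.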
Proportion Scores
Beyond the compound score, VADER also reports what proportion of the text is positive, negative, and neutral. These are computed by classifying each word-level adjusted score as positive (> 0), negative (< 0), or neutral (= 0), then normalizing:
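A minimal Python sketch of one plausible normalization follows. It assumes that positive and negative words contribute the magnitude of their adjusted score and that each neutral word contributes 1. That weighting, and the function name `proportion_scores`, are assumptions for illustration, so the output may differ from what a real VADER implementation reports.

```python
def proportion_scores(adjusted_scores):
    """Return (pos, neg, neu) proportions for a list of word-level adjusted scores."""
    pos_total = sum(s for s in adjusted_scores if s > 0)
    neg_total = sum(-s for s in adjusted_scores if s < 0)
    neu_count = sum(1 for s in adjusted_scores if s == 0)
    total = pos_total + neg_total + neu_count
    if total == 0:
        # Empty input: nothing to normalize.
        return 0.0, 0.0, 0.0
    return pos_total / total, neg_total / total, neu_count / total


# Word-level scores for "The food here is not very good" from the worked example:
# six neutral words and one negated, intensified "good".
pos, neg, neu = proportion_scores([0, 0, 0, 0, 0, 0, -1.818])
print(round(pos, 3), round(neg, 3), round(neu, 3))
```

Under this assumption the three proportions always sum to 1 for non-empty input, and a single strongly scored word can outweigh several neutral ones.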
SentiWordNet
Think of it this way: WordNet groups words into synsets (sets of synonyms representing a concept). SentiWordNet adds three scores to each synset: positivity, negativity, and objectivity, which always sum to 1. The word "bank" in its financial sense might score (0.0, 0.0, 1.0) — completely objective — while "excellent" scores (0.75, 0.0, 0.25) — mostly positive.
Worked Example
Word: "cold" — has multiple senses in WordNet:
- Sense 1 (temperature: "cold water"): pos = 0.0, neg = 0.125, obj = 0.875 → sentiment = 0.0 - 0.125 = -0.125
- Sense 2 (personality: "a cold person"): pos = 0.0, neg = 0.5, obj = 0.5 → sentiment = 0.0 - 0.5 = -0.5
- Sense 3 (illness: "caught a cold"): pos = 0.0, neg = 0.25, obj = 0.75 → sentiment = 0.0 - 0.25 = -0.25
- Average across all senses (simple disambiguation): \(\text{sentiment}(\text{cold}) = \frac{-0.125 + (-0.5) + (-0.25)}{3} = -0.292\)
Averaging across senses is a blunt approach. Without context, we cannot know which sense of "cold" the speaker intends, so the score is a compromise. This is the core challenge of SentiWordNet: word sense disambiguation.
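The sense-averaging step can be sketched in Python with the three "cold" senses hard-coded from the example. The `SenseScore` class and both helper functions are hypothetical names introduced here. A real system would read the scores from the SentiWordNet data rather than a literal list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseScore:
    gloss: str
    pos: float
    neg: float

    @property
    def obj(self):
        # The three scores sum to 1, so objectivity is whatever is left over.
        return 1.0 - self.pos - self.neg

    @property
    def sentiment(self):
        return self.pos - self.neg


# Scores copied from the worked example above.
COLD_SENSES = [
    SenseScore("temperature", pos=0.0, neg=0.125),
    SenseScore("personality", pos=0.0, neg=0.5),
    SenseScore("illness", pos=0.0, neg=0.25),
]


def average_sentiment(senses):
    """Context-free fallback: mean sentiment over every sense of the word."""
    return sum(s.sentiment for s in senses) / len(senses)


def sentiment_for_sense(senses, gloss):
    """Score to use once some disambiguation step has picked a sense."""
    for s in senses:
        if s.gloss == gloss:
            return s.sentiment
    raise KeyError(gloss)


print(round(average_sentiment(COLD_SENSES), 3))         # -0.292
print(sentiment_for_sense(COLD_SENSES, "personality"))  # -0.5
```

The gap between the averaged score (-0.292) and the "personality" sense (-0.5) shows what is lost when the sense is not resolved.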
Summary: VADER vs. SentiWordNet
| Aspect | VADER | SentiWordNet |
|---|---|---|
| Approach | Lexicon + grammatical rules | WordNet synset annotation |
| Score Range | Compound: [-1, +1] | Per-synset pos, neg, obj each in [0, 1]; pos - neg in [-1, +1] |
| Word Senses | Single valence per word (no WSD needed) | Multiple senses per word (WSD needed) |
| Handles Negation | Yes (rule-based flipping) | No (requires external rules) |
| Handles Intensifiers | Yes (boosts/dampens scores) | No |
| Best For | Social media, reviews, informal text | Formal text, fine-grained analysis |
| Coverage | ~7,500 words + emoji | ~117,000 synsets (all of WordNet 3.0) |
| Transparency | High (can trace every rule applied) | Medium (scores are semi-automatic) |
Lexicon-based sentiment analysis is interpretable and requires no training data, making it ideal for quick analysis and domains where labeled data is scarce. VADER's strength is its rule system that captures how humans actually modulate sentiment through grammar. SentiWordNet's strength is its massive coverage grounded in a linguistic ontology. In practice, many systems combine both approaches or use them as features in a machine-learned classifier.