Chapter 10

Sentiment Analysis

2 formulas + rule system Intermediate

What you'll learn

  • How VADER combines a sentiment lexicon with grammatical rules to score text
  • How SentiWordNet extends WordNet with positivity, negativity, and objectivity scores
  • The role of negation, intensifiers, capitalization, and punctuation in sentiment
  • How to compute compound sentiment scores and interpret them
Prerequisites: Chapter 3 (Term Weighting) helps with understanding how individual word scores aggregate into document-level measures.

Introduction

Sentiment analysis asks a deceptively simple question: is this text positive, negative, or neutral? While humans can answer intuitively, automating the task requires choosing between two broad strategies: lexicon-based methods that use pre-scored word lists, and machine-learned classifiers trained on labeled data. This chapter focuses on the lexicon-based approach, which is transparent, interpretable, and surprisingly effective.

We'll study two influential systems: VADER, which layers grammatical heuristics on top of a crowd-sourced lexicon, and SentiWordNet, which assigns sentiment scores to every synset in WordNet. Together they illustrate the key trade-offs in rule-based sentiment analysis: simplicity vs. coverage, and precision vs. word-sense ambiguity.

VADER (Valence Aware Dictionary and sEntiment Reasoner)

Think of it this way: VADER starts with a dictionary where every word gets a human-rated valence score from -4 (extremely negative) to +4 (extremely positive). Then it applies five grammatical rules that humans naturally use when conveying sentiment: negation ("not good"), intensification ("very good"), capitalization ("GOOD"), punctuation emphasis ("good!!!"), and conjunctions ("good but not great"). The final compound score normalizes everything to [-1, +1].

VADER Compound Score
$$\text{compound} = \frac{\textcolor{#db2777}{s}}{\sqrt{\textcolor{#db2777}{s}^2 + \textcolor{#2563eb}{\alpha}}}$$
\(s\): The raw sum of all rule-adjusted valence scores in the text
\(\alpha\): Normalization constant (default = 15). Controls how quickly the score saturates toward ±1
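
To make the saturation behavior concrete, here is a minimal Python sketch of the normalization above. The function name `normalize` and the sample inputs are illustrative choices, not part of any library API.

```python
import math


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded raw valence sum s into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Larger |s| pushes the result toward +1 or -1, but never reaches it.
for s in (0.5, 2.0, 4.0, 10.0):
    print(f"s = {s:5.1f} -> compound = {normalize(s):.3f}")
```

Doubling the raw sum does not double the compound score: the square root in the denominator grows with s, so the output flattens as it approaches ±1. A larger α would slow that saturation.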

The VADER Rule Pipeline

Before summing, VADER modifies each word's base valence according to these rules:

Negation: A negation word (not, never, no, ...) within the three words before a sentiment word flips and dampens its valence: the score is multiplied by -0.74.
Intensifiers: Words like "very" or "extremely" amplify the sentiment by ~29.3%. Dampeners like "somewhat" reduce it by a similar amount.
ALL CAPS: If a sentiment word is in ALL CAPS while the surrounding text is mixed-case, the magnitude of its score is increased by ~0.733 (added to positive scores, subtracted from negative ones).
Punctuation: Exclamation marks boost the magnitude of the sentiment score. Each "!" adds ~0.292, and at most 4 exclamation marks are counted.
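
The word-level rules can be sketched as a single function that adjusts one token's valence. This is a simplified illustration, not the vaderSentiment source: the word lists are tiny hypothetical subsets, the intensifier is applied as a ~29.3% multiplier as in this chapter (the reference implementation adds a fixed increment instead), and the ALL CAPS increment is treated as additive. Punctuation is left out because it acts on the sentence total rather than on a single word.

```python
NEGATIONS = {"not", "never", "no"}                  # illustrative subset only
BOOSTERS = {"very": 0.293, "extremely": 0.293,      # intensifiers
            "somewhat": -0.293}                     # dampener
CAPS_INCREMENT = 0.733
NEGATION_SCALAR = -0.74


def adjust_valence(tokens: list[str], i: int, base: float) -> float:
    """Return the rule-adjusted valence of tokens[i], given its lexicon score."""
    if base == 0:
        return 0.0
    score = base
    sign = 1.0 if base > 0 else -1.0

    # ALL CAPS counts as emphasis only when the rest of the text is not all caps.
    if tokens[i].isupper() and not all(t.isupper() for t in tokens):
        score += sign * CAPS_INCREMENT

    # Intensifier or dampener directly before the word (chapter's ~29.3% model).
    if i > 0 and tokens[i - 1].lower() in BOOSTERS:
        score *= 1 + BOOSTERS[tokens[i - 1].lower()]

    # Negation anywhere in the three preceding tokens flips and dampens.
    window = tokens[max(0, i - 3):i]
    if any(t.lower() in NEGATIONS for t in window):
        score *= NEGATION_SCALAR
    return score


tokens = "The food is not very good".split()
print(round(adjust_valence(tokens, 5, 1.9), 3))
```

The order of the checks matters: boosting before negating means the negation scales the already amplified score, which is the order the chapter's examples follow.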

Worked Example

Sentence: "The food here is not very good!!!"

  1. Look up base valence scores:
    "good" = +1.9  (only sentiment word found in lexicon)
  2. Apply intensifier rule: "very" is an intensifier (boost = 0.293):
    \(1.9 \times (1 + 0.293) = 2.457\)
  3. Apply negation rule: "not" appears before "good" (within 3 words):
    \(2.457 \times -0.74 = -1.818\)
  4. Apply punctuation rule: 3 exclamation marks, each adding ±0.292:
    Since the adjusted score is negative: \(s = -1.818 - 3 \times 0.292 = -2.694\)
  5. Compute compound score (\(\alpha = 15\)):
    \(\text{compound} = \frac{-2.694}{\sqrt{(-2.694)^2 + 15}} = \frac{-2.694}{\sqrt{7.258 + 15}} = \frac{-2.694}{4.718}\) = -0.571

A compound score of -0.571 indicates moderately negative sentiment. The negation of "good" plus the exclamation emphasis made a would-be positive sentence clearly negative.
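
The arithmetic in this example can be checked with a short Python trace. The variable names are illustrative, and the constants are the ones quoted in the steps above; nothing here calls a sentiment library.

```python
import math

sentence = "The food here is not very good!!!"

base = 1.9                             # lexicon valence of "good"
boosted = base * (1 + 0.293)           # step 2: intensifier "very"
negated = boosted * -0.74              # step 3: negation "not" within the window
marks = min(sentence.count("!"), 4)    # step 4: at most 4 marks are counted
emphasis = marks * 0.292
# Emphasis pushes the score further in the direction it already points.
raw = negated - emphasis if negated < 0 else negated + emphasis
compound = raw / math.sqrt(raw ** 2 + 15)   # step 5: normalize with alpha = 15

for name, value in [("boosted", boosted), ("negated", negated),
                    ("raw", raw), ("compound", compound)]:
    print(f"{name:>8}: {value:+.3f}")
```

Carrying full precision through every step gives the same values as the hand calculation when rounded to three decimals.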

Proportion Scores

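Beyond the compound score, VADER also reports what proportion of the text is positive, negative, and neutral. Each word-level adjusted score \(s_i\) is classified as positive (> 0), negative (< 0), or neutral (= 0). Every positive or negative word contributes its magnitude plus 1, which puts it on the same scale as a neutral word, counted as 1. The three totals are then normalized by their sum:

Proportion Scores
$$\text{pos} = \frac{\textcolor{#059669}{P}}{\textcolor{#059669}{P} + \textcolor{#e11d48}{N} + \textcolor{#6b7280}{n_{\text{neutral}}}}, \quad \text{neg} = \frac{\textcolor{#e11d48}{N}}{\textcolor{#059669}{P} + \textcolor{#e11d48}{N} + \textcolor{#6b7280}{n_{\text{neutral}}}}, \quad \text{neu} = \frac{\textcolor{#6b7280}{n_{\text{neutral}}}}{\textcolor{#059669}{P} + \textcolor{#e11d48}{N} + \textcolor{#6b7280}{n_{\text{neutral}}}}$$
$$\textcolor{#059669}{P} = \sum \left(s_i^+ + 1\right), \quad \textcolor{#e11d48}{N} = \sum \left(|s_i^-| + 1\right)$$
\(P\), \(N\): Weighted totals over the positive and negative words, respectively
\(n_{\text{neutral}}\): The number of words whose adjusted score is 0
Interpretation guide: A compound score ≥ 0.05 is typically classified as positive, ≤ -0.05 as negative, and between -0.05 and 0.05 as neutral. Because they share a denominator, the three proportion scores sum to 1.0, up to rounding.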
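
A small Python sketch of the proportion scores and the threshold rule follows. It assumes the normalization used by the reference vaderSentiment implementation, where each nonzero word contributes its magnitude plus 1 and the three totals are divided by their combined sum; the function names and the sample scores are hypothetical.

```python
def proportions(scores: list[float]) -> tuple[float, float, float]:
    """Return (pos, neg, neu) shares from per-token adjusted valences.

    Neutral tokens carry a score of 0.0 and count as 1 each.
    """
    pos = sum(s + 1 for s in scores if s > 0)
    neg = sum(-s + 1 for s in scores if s < 0)
    neu = sum(1 for s in scores if s == 0)
    total = pos + neg + neu
    if total == 0:
        return 0.0, 0.0, 0.0
    return pos / total, neg / total, neu / total


def label(compound: float, threshold: float = 0.05) -> str:
    """Apply the conventional +/-0.05 cutoffs to a compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Six neutral tokens and one negated sentiment word.
scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.818]
pos, neg, neu = proportions(scores)
print(f"pos={pos:.3f} neg={neg:.3f} neu={neu:.3f}", label(-0.571))
```

The proportions describe how much of the text carries sentiment, while the compound score describes its overall direction and strength, so the two are usually read together.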

SentiWordNet

Think of it this way: WordNet groups words into synsets (sets of synonyms representing a concept). SentiWordNet adds three scores to each synset: positivity, negativity, and objectivity, which always sum to 1. Writing the scores in that order, the word "bank" in its financial sense might score (0.0, 0.0, 1.0), which is completely objective, while a sense of "excellent" might score (0.75, 0.0, 0.25), which is mostly positive.

SentiWordNet Word Sentiment
$$\text{sentiment}(w) = \frac{1}{\textcolor{#2563eb}{|S_w|}} \sum_{s \in \textcolor{#2563eb}{S_w}} \left(\textcolor{#059669}{\text{pos}(s)} - \textcolor{#e11d48}{\text{neg}(s)}\right)$$
\(S_w\): The set of all synsets (word senses) for word \(w\) in WordNet
\(\text{pos}(s)\): Positivity score of synset \(s\) (0.0 to 1.0)
\(\text{neg}(s)\): Negativity score of synset \(s\) (0.0 to 1.0)
SentiWordNet Synset Constraint
$$\textcolor{#059669}{\text{pos}(s)} + \textcolor{#e11d48}{\text{neg}(s)} + \textcolor{#6b7280}{\text{obj}(s)} = 1$$

Worked Example

Word: "cold" — has multiple senses in WordNet:

  1. Sense 1 (temperature: "cold water"):
    pos = 0.0, neg = 0.125, obj = 0.875 → sentiment = 0.0 - 0.125 = -0.125
  2. Sense 2 (personality: "a cold person"):
    pos = 0.0, neg = 0.5, obj = 0.5 → sentiment = 0.0 - 0.5 = -0.5
  3. Sense 3 (illness: "caught a cold"):
    pos = 0.0, neg = 0.25, obj = 0.75 → sentiment = 0.0 - 0.25 = -0.25
  4. Average across all senses (a simple fallback when the intended sense is unknown):
    \(\text{sentiment}(\text{cold}) = \frac{-0.125 + (-0.5) + (-0.25)}{3} = \frac{-0.875}{3}\) = -0.292

Averaging across senses is a blunt approach. Without context, we cannot know which sense of "cold" the speaker intends, so the score is a compromise. This is the core challenge of SentiWordNet: word sense disambiguation.
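
The averaging can be sketched in a few lines of Python. The `SynsetScore` class and the hard-coded scores below are illustrative stand-ins that mirror the worked example; they are not read from the SentiWordNet data files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScore:
    gloss: str
    pos: float
    neg: float

    @property
    def obj(self) -> float:
        # Objectivity is whatever remains, since pos + neg + obj = 1.
        return 1.0 - self.pos - self.neg


COLD = [
    SynsetScore("temperature: cold water", 0.0, 0.125),
    SynsetScore("personality: a cold person", 0.0, 0.5),
    SynsetScore("illness: caught a cold", 0.0, 0.25),
]


def word_sentiment(senses: list[SynsetScore]) -> float:
    """Average pos - neg over every sense of a word (no disambiguation)."""
    return sum(s.pos - s.neg for s in senses) / len(senses)


print(round(word_sentiment(COLD), 3))
for sense in COLD:
    print(f"{sense.gloss:<28} obj={sense.obj:.3f}")
```

If a word sense disambiguation step has already selected one sense, its `pos - neg` difference can be used directly instead of the average, which avoids diluting a strongly polar sense with mostly objective ones.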

vs. VADER: VADER assigns a single valence per word (no sense ambiguity) and adds grammatical rules. SentiWordNet is more linguistically principled (grounded in WordNet synsets) but requires word sense disambiguation for accurate results. VADER excels on social media and informal text; SentiWordNet is better suited for formal text when paired with a WSD system.

Summary: VADER vs. SentiWordNet

| Aspect | VADER | SentiWordNet |
| --- | --- | --- |
| Approach | Lexicon + grammatical rules | WordNet synset annotation |
| Score Range | Compound: [-1, +1] | Per-synset pos - neg: [-1, +1] (pos, neg, and obj each in [0, 1]) |
| Word Senses | Single valence per word (no WSD needed) | Multiple senses per word (WSD needed) |
| Handles Negation | Yes (rule-based flipping) | No (requires external rules) |
| Handles Intensifiers | Yes (boosts/dampens scores) | No |
| Best For | Social media, reviews, informal text | Formal text, fine-grained analysis |
| Coverage | ~7,500 words + emoji | ~117,000 synsets (every synset in WordNet 3.0) |
| Transparency | High (can trace every rule applied) | Medium (scores are semi-automatic) |

Key Takeaway

Lexicon-based sentiment analysis is interpretable and requires no training data, making it ideal for quick analysis and domains where labeled data is scarce. VADER's strength is its rule system that captures how humans actually modulate sentiment through grammar. SentiWordNet's strength is its massive coverage grounded in a linguistic ontology. In practice, many systems combine both approaches or use them as features in a machine-learned classifier.