What is Lemmatization?
Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as a lemma.
🔍 Example:
| Word Form | Lemma |
|-----------|-------|
| running | run |
| better | good |
| was | be |
So, in a sentence like:
“He was running faster than anyone.”
Lemmatization would convert:
• “was” → “be”
• “running” → “run”
🧠 Why Lemmatization Matters
Lemmatization helps NLP systems understand the core meaning of words by stripping away inflections, tenses, or comparative forms. This improves:
- Search accuracy (e.g., searching “run” also finds “ran” or “running”; see the sketch after this list)
- Text classification
- Sentiment analysis
- Information retrieval
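A minimal sketch of the search case, using spaCy (introduced later in this article) and its small English model; the matching-by-lemma logic is the point, not the specific library:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

documents = ["He ran to the store.", "She is running a marathon.", "The paint is dry."]
query = "run"

# A document matches if the query appears among its lemmas,
# so "ran" and "running" both match the query "run".
for text in documents:
    lemmas = {token.lemma_.lower() for token in nlp(text)}
    if query in lemmas:
        print("match:", text)
```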
🔁 Lemmatization vs. Stemming
| Feature | Stemming | Lemmatization |
|---------|----------|---------------|
| Approach | Heuristic (rule-based) | Linguistic (dictionary-based) |
| Output | Crude root form | Real word (lemma) |
| Example | “studies” → “studi” | “studies” → “study” |
| Speed | Faster | Slower (but more accurate) |
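To make the contrast concrete, here is a small sketch using NLTK’s PorterStemmer next to its WordNet lemmatizer (the WordNet data must be downloaded first):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires: import nltk; nltk.download("wordnet")

print(stemmer.stem("studies"))          # studi  (crude root, not a real word)
print(lemmatizer.lemmatize("studies"))  # study  (real dictionary form)
```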
🛠️ How It Works
Lemmatization uses:
- Part of Speech (POS) tagging
- Lexicons or dictionaries
- Morphological analysis
Example using Python’s NLTK:
```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: import nltk; nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # Output: run
print(lemmatizer.lemmatize("better", pos="a"))   # Output: good
```
Implementing Lemmatization in spaCy and other libraries
🔍 1. spaCy: Industrial-Strength NLP
spaCy’s lemmatizer is fast, accurate, and context-aware, designed for real-world usage.
🔧 How It Works
- Uses a rule-based lemmatizer, backed by lookup tables and exception lists.
- It considers Part of Speech (POS) to select the correct lemma.
- Includes language-specific lemmatizers (e.g., English, German, French).
🧪 Example
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running faster than their parents.")
for token in doc:
    print(f"{token.text} → {token.lemma_} ({token.pos_})")
```
Output
```
The → the (DET)
children → child (NOUN)
were → be (AUX)
running → run (VERB)
faster → fast (ADV)
```
⚙️ Under the Hood
- Rules in lemmatizer.py and lookups.py
- Custom exception handling for irregular forms
- Uses lexical attributes and POS mappings from the en_core_web_* model (you can inspect some of this, as sketched below)
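A small inspection sketch (the exact table names are indicative and can vary by spaCy version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

print(lemmatizer.mode)                  # "rule" for the English pipeline
print(list(lemmatizer.lookups.tables))  # e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']
```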
Pros:
- Highly accurate
- Multi-language support
- Integrates tightly with other spaCy features like NER and dependency parsing
🧠 2. NLTK: Academic & Educational Toolkit
NLTK’s lemmatizer is WordNetLemmatizer (the toolkit also ships stemmers, such as the ISRI stemmer for Arabic, but those are stemmers rather than lemmatizers).
WordNetLemmatizer
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos="v")  # Output: 'run'
lemmatizer.lemmatize("better", pos="a")   # Output: 'good'
```
How It Works
- Uses WordNet, a lexical database
- Requires correct POS tag to be most effective
- Doesn’t do any sentence-level analysis
🔁 POS Mapping Needed
```python
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the WordNet POS constants."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # WordNet's default POS
```
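Putting it together, a sketch of a full tag-then-lemmatize pass (assumes the punkt, averaged_perceptron_tagger, and wordnet NLTK data packages are installed; the exact output can vary by NLTK version):

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("The children were running faster"))
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)  # approximately: ['The', 'child', 'be', 'run', 'faster']
```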
Pros:
- Simple and fast
- Good for educational purposes or light NLP tasks
Cons:
- Not context-aware
- Can produce inaccurate lemmas without correct POS
💬 3. TextBlob: Friendly API on Top of NLTK
TextBlob wraps NLTK and provides a higher-level API.
```python
from textblob import Word

w = Word("running")
print(w.lemmatize("v"))  # Output: 'run'
```
How It Works
- Internally uses NLTK’s WordNetLemmatizer
- Provides easier syntax but doesn’t add accuracy
- Good for prototyping, sentiment analysis, or light preprocessing (a sentence-level sketch follows below)
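TextBlob can also lemmatize a whole sentence via its word list. A minimal sketch; note that without POS hints every word is treated as a noun:

```python
from textblob import TextBlob

blob = TextBlob("The cats were running")
# WordList.lemmatize() applies the default noun POS to every word,
# so verbs like "were" and "running" pass through unchanged.
print(blob.words.lemmatize())  # ['The', 'cat', 'were', 'running']
```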
🧪 4. Other Libraries (Quick Mentions)
🔡 Gensim
- Not a lemmatizer, but often paired with NLTK for preprocessing
- Tokenizes and filters stop words, but expects lemmatization to happen externally (see the pairing sketch below)
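A hedged sketch of that pairing, using Gensim’s simple_preprocess for tokenization and spaCy for the lemmatization step:

```python
from gensim.utils import simple_preprocess
import spacy

nlp = spacy.load("en_core_web_sm")

text = "The children were running faster than their parents."
tokens = simple_preprocess(text)  # lowercases and tokenizes; no lemmatization
lemmas = [tok.lemma_ for tok in nlp(" ".join(tokens))]
print(lemmas)
```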
🌐 Stanza (by Stanford NLP)
- Deep learning-based
- Uses neural models trained on Universal Dependencies
- Provides lemmatization with higher accuracy on many languages
```python
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma')
doc = nlp("The children were running.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text} → {word.lemma}")
```
🔍 Comparison Table
| Feature/Aspect | spaCy | NLTK + WordNet | TextBlob | Stanza |
|----------------|-------|----------------|----------|--------|
| Context-aware | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| POS-sensitive | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Language support | ✅ Many | ⚠️ Mostly English | ⚠️ Mostly English | ✅ Many (UD models) |
| Speed | ⚡ Fast | ⚡ Fast | ⚡ Fast | 🐢 Slower (neural) |
| Accuracy | 🎯 High (rule-based) | 🎯 Medium (lexical) | 🎯 Medium | 🎯 Very High (neural) |
| Use Case | Best Tool |
|----------|-----------|
| Real-world applications | spaCy |
| Educational or lexical exploration | NLTK |
| Prototyping with simple syntax | TextBlob |
| High-accuracy, multilingual projects | Stanza |
Building a custom medical lemmatizer
Creating a custom lemmatizer for a niche domain like medical text is a powerful move, especially when general-purpose lemmatizers (such as spaCy’s or WordNet’s) miss domain-specific terms such as “diagnoses” or “hypoglycemic”.
Here’s a step-by-step guide to building a custom medical lemmatizer, using spaCy, domain-specific vocabulary, and optional ML for edge cases.
🧩 1. Why You Need a Custom Lemmatizer in Medical NLP
| Term | General Lemma | Desired Medical Lemma |
|------|---------------|-----------------------|
| diagnoses | diagnose | diagnosis |
| hypoglycemic | hypoglycem | hypoglycemia |
| myocardial | myocard | myocardium |
Off-the-shelf lemmatizers can’t handle:
- Irregular medical inflections
- Abbreviations like “BP” or “CXR”
- Root forms that don’t exist in general English (e.g. “neoplasia” from “neoplastic”)
🏗️ 2. Tools and Frameworks
- spaCy: for base NLP pipeline
- Custom lookup table: for domain-specific lemma mappings
- (Optional) ML classifier: for ambiguous words
- UMLS or SNOMED CT: for canonical forms (if you have access)
⚙️ 3. Set Up a Custom Rule-Based Lemmatizer in spaCy
In spaCy 3, the built-in Lemmatizer is configured through the pipeline rather than constructed directly from a Lookups object, and switching it wholesale into lookup mode would discard its general-English rules. A simple working pattern is to keep the default lemmatizer and layer a small custom component after it that overrides lemmas from your domain lookup table:

```python
import spacy
from spacy.language import Language

# Step 1: Load the base model (its default lemmatizer covers general English)
nlp = spacy.load("en_core_web_sm")

# Step 2: Create a custom lookup table of domain-specific lemmas
custom_lemmas = {
    "diagnoses": "diagnosis",
    "hypoglycemic": "hypoglycemia",
    "myocardial": "myocardium",
    "tachypneic": "tachypnea",
    "dyspneic": "dyspnea",
}

# Step 3: Add a component that overrides lemmas after the default lemmatizer
@Language.component("medical_lemmatizer")
def medical_lemmatizer(doc):
    for token in doc:
        lemma = custom_lemmas.get(token.lower_)
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp.add_pipe("medical_lemmatizer", after="lemmatizer")

# Test
doc = nlp("The patient is hypoglycemic and tachypneic.")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
```
Output
```
The → the
patient → patient
is → be
hypoglycemic → hypoglycemia
and → and
tachypneic → tachypnea
. → .
```
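Because the override runs after spaCy’s own lemmatizer, general English still lemmatizes normally (“is” → “be”), while the custom table wins only for the medical terms it knows; unknown domain words simply keep their default lemmas.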
📚 4. Extend It with a Medical Lexicon
Build or scrape a lemma dictionary using:
- UMLS Metathesaurus: Contains synonyms and canonical forms
- SNOMED CT: Ontology for medical terminology
- SciSpacy: Has entity linking and domain-specific models
SciSpacy Example
```python
import scispacy
import spacy
from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("The patient presented with dyspnea and tachypnea.")
for ent in doc.ents:
    # Recent scispacy releases expose the UMLS links as ent._.kb_ents
    # (older releases called this attribute umls_ents).
    print(ent.text, "→", ent._.kb_ents)
```