What is Lemmatization?
Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as a lemma.
🔍 Example:

| Word Form | Lemma |
| --- | --- |
| running | run |
| better | good |
| was | be |
So, in a sentence like:
“He was running faster than anyone.”
Lemmatization would convert:
• “was” → “be”
• “running” → “run”
🧠 Why Lemmatization Matters
Lemmatization helps NLP systems understand the core meaning of words by stripping away inflections, tenses, or comparative forms. This improves:
- Search accuracy (e.g., searching "run" also finds "ran" or "running"; see the sketch below)
- Text classification
- Sentiment analysis
- Information retrieval
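To make the search point concrete, here is a minimal sketch of lemma-based matching using spaCy (covered in detail below). The helper `lemma_match` is illustrative, not a library API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemma_match(query: str, document: str) -> bool:
    """Return True if any query lemma appears among the document's lemmas."""
    query_lemmas = {token.lemma_.lower() for token in nlp(query)}
    doc_lemmas = {token.lemma_.lower() for token in nlp(document)}
    return bool(query_lemmas & doc_lemmas)

print(lemma_match("run", "He was running faster than anyone."))  # True
print(lemma_match("run", "He walked home."))                     # False
```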
🔁 Lemmatization vs. Stemming

| Feature | Stemming | Lemmatization |
| --- | --- | --- |
| Approach | Heuristic (rule-based) | Linguistic (dictionary-based) |
| Output | Crude root form | Real word (lemma) |
| Example | "studies" → "studi" | "studies" → "study" |
| Speed | Faster | Slower (but more accurate) |
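A quick way to see the difference is to run NLTK's PorterStemmer alongside its WordNetLemmatizer; a minimal sketch:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # one-time setup: nltk.download('wordnet')

for word, pos in [("studies", "n"), ("running", "v"), ("better", "a")]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word, pos=pos)}")

# studies: stem=studi, lemma=study
# running: stem=run, lemma=run
# better: stem=better, lemma=good
```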
🛠️ How It Works
Lemmatization uses:
- Part of Speech (POS) tagging
- Lexicons or dictionaries
- Morphological analysis
Example using Python’s NLTK:
```python
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # Output: run
print(lemmatizer.lemmatize("better", pos="a"))   # Output: good
```
Implementing Lemmatization in spaCy and other libraries
🔍 1. spaCy: Industrial-Strength NLP
spaCy’s lemmatizer is fast, accurate, and context-aware, designed for real-world usage.
🔧 How It Works
- Uses a rule-based lemmatizer, backed by lookup tables and exception lists
- Considers Part of Speech (POS) to select the correct lemma
- Includes language-specific lemmatizers (e.g., English, German, French)
🧪 Example
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running faster than their parents.")
for token in doc:
    print(f"{token.text} → {token.lemma_} ({token.pos_})")
```
Output (first five tokens shown):
```
The → the (DET)
children → child (NOUN)
were → be (AUX)
running → run (VERB)
faster → fast (ADV)
...
```
⚙️ Under the Hood
- Rules in lemmatizer.py and lookups.py
- Custom exception handling for irregular forms
- Uses lexical attributes and POS mappings from the en_core_web_* model
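You can inspect these internals directly; a minimal sketch (the exact table names depend on the spaCy version and model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)            # "rule" for the English models
print(lemmatizer.lookups.tables)  # e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']
```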
Pros:
- Highly accurate
- Multi-language support
- Integrates tightly with other spaCy features like NER and dependency parsing
🧠 2. NLTK: Academic & Educational Toolkit
NLTK's primary lemmatizer is WordNetLemmatizer (it also ships the ISRIStemmer for Arabic, though that is a stemmer, not a lemmatizer).
WordNetLemmatizer
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos="v")  # 'run'
lemmatizer.lemmatize("better", pos="a")   # 'good'
```
How It Works
- Uses WordNet, a lexical database
- Requires the correct POS tag to be most effective
- Doesn't do any sentence-level analysis
🔁 POS Mapping Needed
```python
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map Penn Treebank POS tags to WordNet POS constants."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # WordNet's default POS
```
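Putting the mapping to work, a sketch that tags a sentence with nltk.pos_tag and feeds the mapped tags to the lemmatizer (the tokenizer and tagger models need a one-time download; resource names may vary by NLTK version):

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The children were running")
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)  # ['The', 'child', 'be', 'run']
```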
Pros:
- Simple and fast
- Good for educational purposes or light NLP tasks

Cons:
- Not context-aware
- Can produce inaccurate lemmas without the correct POS
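The second point is easy to demonstrate: without a POS hint, lemmatize defaults to treating the word as a noun.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
```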
💬 3. TextBlob: Friendly API on Top of NLTK
TextBlob wraps NLTK and provides a higher-level API.
```python
from textblob import Word

w = Word("running")
print(w.lemmatize("v"))  # 'run'
```
How It Works
- Internally uses NLTK's WordNetLemmatizer
- Provides easier syntax but doesn't add accuracy (see the sketch below)
- Good for prototyping, sentiment analysis, or light preprocessing
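A minimal sketch of the same limitation at sentence level: WordList.lemmatize() defaults to noun lemmas, so verbs pass through unchanged unless you supply the POS yourself.

```python
from textblob import TextBlob

# One-time setup: python -m textblob.download_corpora
blob = TextBlob("The children were running")
print(blob.words.lemmatize())  # WordList(['The', 'child', 'were', 'running']); verbs untouched
```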
🧪 4. Other Libraries (Quick Mentions)
🔡 Gensim
- Not a lemmatizer itself, but often paired with NLTK for preprocessing
- Tokenizes and filters stop words, but expects external lemmatization
🌐 Stanza (by Stanford NLP)
- Deep learning-based
- Uses neural models trained on Universal Dependencies
- Provides lemmatization with higher accuracy across many languages
```python
import stanza

stanza.download('en')  # one-time model download
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma')
doc = nlp("The children were running.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text} → {word.lemma}")
```
🔍 Comparison Table

| Feature/Aspect | spaCy | NLTK + WordNet | TextBlob | Stanza |
| --- | --- | --- | --- | --- |
| Context-aware | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| POS-sensitive | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Language support | ✅ Many | ⚠️ Mostly English | ⚠️ Mostly English | ✅ Many (UD models) |
| Speed | ⚡ Fast | ⚡ Fast | ⚡ Fast | 🐢 Slower (neural) |
| Accuracy | 🎯 High (rule-based) | 🎯 Medium (lexical) | 🎯 Medium | 🎯 Very High (neural) |
| Use Case | Best Tool |
| --- | --- |
| Real-world applications | spaCy |
| Educational or lexical exploration | NLTK |
| Prototyping with simple syntax | TextBlob |
| High-accuracy, multilingual projects | Stanza |
Building a custom medical lemmatizer
Creating a custom lemmatizer for a niche domain like medical text is a powerful move, especially when general-purpose lemmatizers (like spaCy's or WordNet's) miss domain-specific terms such as "diagnoses" or "hypoglycemic".
Here’s a step-by-step guide to building a custom medical lemmatizer, using spaCy, domain-specific vocabulary, and optional ML for edge cases.
🧩 1. Why You Need a Custom Lemmatizer in Medical NLP

| Term | General Lemma | Desired Medical Lemma |
| --- | --- | --- |
| diagnoses | diagnose | diagnosis |
| hypoglycemic | hypoglycem | hypoglycemia |
| myocardial | myocard | myocardium |
Off-the-shelf lemmatizers can’t handle:
- Irregular medical inflections
- Abbreviations like "BP" or "CXR"
- Root forms that don't exist in general English (e.g., "neoplasia" from "neoplastic")
🏗️ 2. Tools and Frameworks
- spaCy: for the base NLP pipeline
- Custom lookup table: for domain-specific lemma mappings
- (Optional) ML classifier: for ambiguous words
- UMLS or SNOMED CT: for canonical forms (if you have access)
⚙️ 3. Setup a Custom Rule-Based Lemmatizer in spaCy
(Note: spaCy v3 no longer accepts a Lemmatizer(lookups) constructor, so the simplest robust approach is a small custom pipeline component that overrides token.lemma_ after the default lemmatizer runs.)

```python
import spacy
from spacy.language import Language

# Step 1: Load base model
nlp = spacy.load("en_core_web_sm")

# Step 2: Define the custom lemma mappings
custom_lemmas = {
    "diagnoses": "diagnosis",
    "hypoglycemic": "hypoglycemia",
    "myocardial": "myocardium",
    "tachypneic": "tachypnea",
    "dyspneic": "dyspnea",
}

# Step 3: Add a component that overrides lemmas after the default lemmatizer
@Language.component("medical_lemmatizer")
def medical_lemmatizer(doc):
    for token in doc:
        lemma = custom_lemmas.get(token.lower_)
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp.add_pipe("medical_lemmatizer", after="lemmatizer")

# Test
doc = nlp("The patient is hypoglycemic and tachypneic.")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
```
Output
```
The → the
patient → patient
is → be
hypoglycemic → hypoglycemia
and → and
tachypneic → tachypnea
. → .
```
📚 4. Extend It with a Medical Lexicon
Build or scrape a lemma dictionary using:
- UMLS Metathesaurus: contains synonyms and canonical forms
- SNOMED CT: an ontology for medical terminology
- SciSpacy: has entity linking and domain-specific models
SciSpacy Example
```python
import spacy
import scispacy  # registers the scispacy pipeline components
from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
doc = nlp("The patient presented with dyspnea and tachypnea.")
for ent in doc.ents:
    print(ent.text, "→", ent._.kb_ents)  # (CUI, score) candidates from UMLS
```
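From here, one way to seed the custom lookup table is to map each linked entity to its UMLS canonical name. A sketch, assuming the scispacy_linker pipe from the previous snippet is loaded (kb.cui_to_entity and canonical_name are attributes of scispacy's linker):

```python
# Continuing from the previous snippet
linker = nlp.get_pipe("scispacy_linker")

medical_lemmas = {}
for ent in doc.ents:
    if ent._.kb_ents:                  # take the top-scoring UMLS candidate
        cui, score = ent._.kb_ents[0]
        canonical = linker.kb.cui_to_entity[cui].canonical_name
        medical_lemmas[ent.text.lower()] = canonical.lower()

print(medical_lemmas)  # e.g. {'dyspnea': 'dyspnea', 'tachypnea': 'tachypnea'}
```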