Fine-tuning transformer models for real-world NLP tasks is now easier than ever. In this hands-on guide, we’ll walk through building a sentiment analysis model with BERT: training it on the IMDb dataset, visualizing how it works under the hood, and even deploying it as a web app with Gradio.
🛠 Step 1: Install Required Libraries
Run the following to get started:
pip install transformers datasets torch scikit-learn
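Note: recent versions of transformers also require accelerate when using the Trainer API with PyTorch. If Step 6 raises an ImportError, install it as well:
pip install accelerate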
📅 Step 2: Load IMDb Dataset
We’ll use the IMDb dataset via Hugging Face's datasets library:
from datasets import load_dataset
dataset = load_dataset("imdb")
print(dataset)
Output (note the extra unlabeled split, which we won’t use here):
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
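It’s worth peeking at one raw example before tokenizing (in this dataset, label 0 is negative and 1 is positive):

sample = dataset["train"][0]
print(sample["text"][:200])  # first 200 characters of the review
print(sample["label"])       # 0 = negative, 1 = positive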
🔤 Step 3: Tokenize Text
We’ll tokenize the text using the bert-base-uncased tokenizer. We truncate long reviews to BERT’s 512-token limit; padding is left to the data collator in the next step, which pads each batch dynamically instead of padding everything up front:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True)
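To see what the tokenizer produces, encode one sentence by hand. BERT wraps the text in [CLS]/[SEP] markers and splits rare words into WordPiece subtokens:

example = tokenizer("The movie was great!")
print(list(example.keys()))  # ['input_ids', 'token_type_ids', 'attention_mask']
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '!', '[SEP]']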
🏷️ Step 4: Prepare DataLoaders
These PyTorch DataLoaders are only needed if you want to write a custom training loop; the Trainer in Step 6 builds its own. Note that we drop the raw text column first, since the collator can only batch tensor-friendly fields:

from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
loader_dataset = tokenized_dataset.remove_columns(["text"])
train_loader = DataLoader(loader_dataset['train'], batch_size=16, shuffle=True, collate_fn=data_collator)
test_loader = DataLoader(loader_dataset['test'], batch_size=16, collate_fn=data_collator)
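A quick way to verify the loaders work is to pull one batch and inspect the tensor shapes (the sequence length varies from batch to batch because padding is dynamic):

batch = next(iter(train_loader))
print({k: v.shape for k, v in batch.items()})
# e.g. input_ids: [16, seq_len], attention_mask: [16, seq_len], labels: [16]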
🧠 Step 5: Load Pre-trained BERT Model
We use a BERT model with a classification head. Transformers will warn that some weights are newly initialized; that’s expected, since the classification head is exactly what we’re about to train:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
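As an optional sanity check, run one dummy example through the untrained classifier; the output should be a [1, 2] tensor of logits, one per class:

import torch

sample = tokenizer("A quick sanity check.", return_tensors="pt")
with torch.no_grad():
    out = model(**sample)
print(out.logits.shape)  # torch.Size([1, 2]): one example, two class logits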
🚂 Step 6: Set Up Training
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./bert-imdb",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    logging_dir="./logs",
    logging_steps=100
)
We also define a compute_metrics function (this is where the scikit-learn install from Step 1 comes in), so that evaluation reports accuracy alongside loss:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
🚀 Step 7: Train Your Model
trainer.train()
This fine-tunes the model for two epochs, logging loss every 100 steps and evaluating at the end of each epoch. Expect it to take a while on CPU; a GPU is strongly recommended.
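If training is interrupted, the Trainer can pick up from the most recent checkpoint saved in output_dir:

trainer.train(resume_from_checkpoint=True)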
✅ Step 8: Evaluate the Model
trainer.evaluate()
Sample output (your exact numbers will vary):
{'eval_loss': 0.34, 'eval_accuracy': 0.87}
🔮 Step 9: Make Predictions
text = "The movie was absolutely fantastic!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
outputs = model(**inputs)
pred = outputs.logits.argmax(dim=1).item()
print("Positive" if pred == 1 else "Negative")
🎨 Visualize BERT Attention Weights
To inspect attention, we load a separate BertModel with output_attentions=True, using new variable names so we don’t overwrite the fine-tuned classifier from the steps above:

from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
import seaborn as sns

attn_model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
attn_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The movie was absolutely fantastic!"
inputs = attn_tokenizer(sentence, return_tensors="pt")
outputs = attn_model(**inputs)

# attentions is a tuple with one (batch, heads, seq, seq) tensor per layer;
# take the last layer, head 0
attention = outputs.attentions[-1][0, 0].detach().numpy()
tokens = attn_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

plt.figure(figsize=(10, 8))
sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
plt.title("BERT Attention (Last Layer, Head 0)")
plt.show()
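A single head can be idiosyncratic; averaging over all heads in the layer often gives a more readable picture. A minimal variation on the plot above, reusing outputs and tokens:

mean_attn = outputs.attentions[-1][0].mean(dim=0).detach().numpy()
plt.figure(figsize=(10, 8))
sns.heatmap(mean_attn, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
plt.title("BERT Attention (Last Layer, Mean Over Heads)")
plt.show()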
🌐 Build a Web App with Gradio
The base bert-base-uncased checkpoint has a randomly initialized classification head, so pointing a pipeline at it would give meaningless predictions. Instead, save your fine-tuned model and tokenizer to a local directory (the name ./bert-imdb-final is just our choice) and load the pipeline from there. With the default label names, LABEL_1 is the positive class:

import gradio as gr
from transformers import pipeline

model.save_pretrained("./bert-imdb-final")
tokenizer.save_pretrained("./bert-imdb-final")
classifier = pipeline("sentiment-analysis", model="./bert-imdb-final")

def classify_sentiment(text):
    result = classifier(text)[0]
    label = "Positive" if result["label"] == "LABEL_1" else "Negative"
    return f"{label} ({round(result['score'] * 100, 2)}%)"

demo = gr.Interface(fn=classify_sentiment, inputs="text", outputs="text", title="BERT Sentiment Analyzer")
demo.launch()
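If you’re running locally (e.g., in a notebook) and want a temporary public link to share, Gradio supports:

demo.launch(share=True)  # creates a temporary public URL via Gradio's tunneling service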
🚗 Deploy to Hugging Face Spaces
1. Create a new Space at huggingface.co/spaces
2. Select Gradio as the SDK
3. Add your app.py and requirements.txt files (both shown below)
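Because the Space runs in the cloud, it can’t load your local checkpoint. Push the fine-tuned model and tokenizer to the Hub first (the repo id your-username/bert-imdb is a placeholder; use your own):

model.push_to_hub("your-username/bert-imdb")
tokenizer.push_to_hub("your-username/bert-imdb")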
app.py
import gradio as gr
from transformers import pipeline

# Load the fine-tuned model from the Hub (placeholder repo id; use your own)
classifier = pipeline("sentiment-analysis", model="your-username/bert-imdb")

def classify_sentiment(text):
    result = classifier(text)[0]
    label = "Positive" if result["label"] == "LABEL_1" else "Negative"
    return f"{label} ({round(result['score'] * 100, 2)}%)"

gr.Interface(fn=classify_sentiment, inputs="text", outputs="text").launch()
requirements.txt
transformers
torch
gradio
Use Git to push:
git lfs install
git clone https://huggingface.co/spaces/your-username/bert-sentiment-analyzer
cd bert-sentiment-analyzer
cp ../app.py .
cp ../requirements.txt .
git add .
git commit -m "Initial commit"
git push
🔄 Recap
| Task | Tools Used |
| --- | --- |
| Training Sentiment Model | Transformers + IMDb |
| Visualizing Attention | Matplotlib + Seaborn |
| Building a Web Interface | Gradio |
| Cloud Deployment | Hugging Face Spaces |
🔧 Conclusion
In this tutorial, you learned how to fine-tune a BERT model for sentiment analysis using the IMDb dataset and Hugging Face Transformers. You saw how to prepare data, train a model, evaluate performance, visualize attention mechanisms, and even deploy your project as a fully functional web app. This workflow can easily be extended to other NLP tasks, domains, or languages. Whether you're building sentiment tools for product reviews, social media, or customer feedback, this approach gives you a production-ready foundation.