Introduction
Personality traits, such as extroversion and introversion, shape how individuals interact with their social environments. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to uncovering fascinating insights.
Classification will be attempted using 5 different models.
- Logistic Regression
- Support Vector Machine
- Random Forest Classifier
- XGBoost
- Neural Network
The data will be explored and preprocessed accordingly, then scaled for use in the training process.
Models will be evaluated across four metrics.
- Accuracy Score
- Precision Score
- Recall Score
- F1 Score
Importing Libraries
import math
import warnings
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
from xgboost import XGBClassifier
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import Adam, lr_scheduler
# Suppress warnings
warnings.filterwarnings("ignore")
Handling the Data
Now I load the data into a pandas DataFrame.
# Load the dataset
df = pd.read_csv("personality_dataset.csv")
Exploring the Data
df.head()
df.describe()
df.info()
for col in df.columns:
    empty_count = df[col].isnull().sum()
    empty_percent = (empty_count / len(df)) * 100
    print(f"Column [{col}] has {empty_count} empty values - {empty_percent:.2f}% empty")
Data Visualization
Now we check the data with a few plots.
Pair Plot
sns.pairplot(df)
plt.show()
![Pair Plot]()
Box Plot
![Box Plot]()
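The box-plot cell itself is only available as an image above. A minimal sketch of what it might look like, restricted to the numeric columns since the categorical ones have not been encoded yet:

# Box plots of the numeric columns to check spread and potential outliers.
plt.figure(figsize=(12, 6))
sns.boxplot(data=df.select_dtypes(include='number'))
plt.xticks(rotation=45)
plt.show()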
Handling Missing Values
Filling Categorical Data with Mode
df['Stage_fear'] = df['Stage_fear'].fillna(df['Stage_fear'].mode()[0])
df['Drained_after_socializing'] = df['Drained_after_socializing'].fillna(df['Drained_after_socializing'].mode()[0])
Filling Numerical Data with Median and Mean
df['Time_spent_Alone'] = df['Time_spent_Alone'].fillna(df['Time_spent_Alone'].median())
df['Friends_circle_size'] = df['Friends_circle_size'].fillna(df['Friends_circle_size'].median())
df['Post_frequency'] = df['Post_frequency'].fillna(df['Post_frequency'].median())
df['Social_event_attendance'] = df['Social_event_attendance'].fillna(df['Social_event_attendance'].mean())
df['Going_outside'] = df['Going_outside'].fillna(df['Going_outside'].mean())
Encoding Categorical Data
df['Stage_fear'] = df['Stage_fear'].map({'Yes': 1.0, 'No': 0.0})
df['Drained_after_socializing'] = df['Drained_after_socializing'].map({'Yes': 1.0, 'No': 0.0})
df['Personality'] = df['Personality'].map({'Introvert': 1.0, 'Extrovert': 0.0})
Data Splitting
X_train, X_test, y_train, y_test = train_test_split(
    df[
        [
            'Time_spent_Alone',
            'Stage_fear',
            'Social_event_attendance',
            'Going_outside',
            'Drained_after_socializing',
            'Friends_circle_size',
            'Post_frequency'
        ]
    ],
    df['Personality'],
    test_size=0.2,
    random_state=42
)
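Feature Scaling
The introduction mentions scaling the features before training, and StandardScaler is imported above, but the scaling cell is not shown. A minimal sketch of that step; the names X_train_scaled and X_test_scaled are my own and are reused in the model sketches below:

# Fit the scaler on the training set only to avoid leaking test-set statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)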
Modeling
Logistic Regression
![LogisticRegression]()
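The training cell is only shown as an image; a minimal sketch of what it likely contains, assuming the scaled splits from the previous step and the metrics imported earlier:

# Fit a logistic regression model and evaluate it on the held-out test set.
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
print("Accuracy: ", accuracy_score(y_test, y_pred_lr))
print("Precision:", precision_score(y_test, y_pred_lr))
print("Recall:   ", recall_score(y_test, y_pred_lr))
print("F1:       ", f1_score(y_test, y_pred_lr))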
SVM
![SVM]()
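Again, only the image of the cell survives; a comparable sketch using the SVC class imported above (the RBF kernel is scikit-learn's default and an assumption here):

# Fit a support vector classifier on the scaled features.
svm_clf = SVC(kernel='rbf', random_state=42)
svm_clf.fit(X_train_scaled, y_train)
y_pred_svm = svm_clf.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("F1:      ", f1_score(y_test, y_pred_svm))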
Neural Network
Converting the Data into Tensors
![Tensors]()
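The conversion cell is not shown; a minimal sketch of turning the scaled arrays into tensors and DataLoaders, with batch size chosen arbitrarily:

# Convert the scaled arrays and labels to float tensors and wrap them in DataLoaders.
X_train_t = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
X_test_t = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_t = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test_t, y_test_t), batch_size=32)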
Training the Model
![Training the Model]()
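The training cell is only an image, so the architecture, epoch count, and checkpointing rule below are assumptions; this sketch simply shows one plausible loop built from the imports at the top (nn, Adam, lr_scheduler):

# A small feed-forward network trained with binary cross-entropy and Adam.
model = nn.Sequential(
    nn.Linear(X_train_t.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)
criterion = nn.BCELoss()
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

best_loss = math.inf
for epoch in range(50):
    model.train()
    epoch_loss = 0.0
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step()
    # Keep the checkpoint with the lowest training loss (assumed criterion).
    if epoch_loss < best_loss:
        best_loss = epoch_loss
        torch.save(model.state_dict(), "best_model.pt")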
Loading the Best Model
![Best Model]()
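A minimal sketch of reloading the checkpoint, assuming it was saved to best_model.pt as in the training sketch above:

# Restore the best checkpoint and score the network on the test set.
model.load_state_dict(torch.load("best_model.pt"))
model.eval()
with torch.no_grad():
    y_pred_nn = (model(X_test_t) >= 0.5).float().numpy().ravel()
print("Accuracy:", accuracy_score(y_test, y_pred_nn))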
Comparing Models
Functions
![Functions]()
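The helper cell is only an image; a small sketch of the kind of function it likely defines, returning the four metrics listed in the introduction for one model:

def evaluate_model(name, y_true, y_pred):
    """Return the four evaluation metrics for one model as a dict."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }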
Overview
![Overview]()
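The comparison cell is not shown either; a sketch that collects the metrics into one table and plots them, covering only the models sketched above (the original notebook compares all five):

# Build a comparison table and a bar chart of the per-model metrics.
results = pd.DataFrame([
    evaluate_model("Logistic Regression", y_test, y_pred_lr),
    evaluate_model("SVM", y_test, y_pred_svm),
    evaluate_model("Neural Network", y_test, y_pred_nn),
])
print(results)
results.set_index("Model").plot(kind="bar", figsize=(10, 5), ylim=(0, 1))
plt.title("Model Comparison")
plt.tight_layout()
plt.show()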
Conclusion
It was noted that the SVM, XGBoost, and Neural Network models all performed identically on the test set. The best model should therefore be chosen based on other factors:
- For faster training, XGBoost is recommended.
- For stronger theoretical guarantees, SVM is recommended.
- For future scaling with more data, the Neural Network is recommended.