Data Science  

Personality Classification with Supervised Learning

Introduction

Personality traits such as extroversion and introversion shape how individuals interact with their social environments. This dataset records behaviors such as time spent alone, social event attendance, and social media engagement, with applications in psychology, sociology, marketing, and machine learning. Here it is used to predict whether a person is an introvert or an extrovert.

Classification is attempted with five different models:

  • Logistic Regression
  • Support Vector Machine
  • Random Forest Classifier
  • XGBoost
  • Neural Network

The data will be explored and preprocessed accordingly, then scaled for use in the training process.

Models are evaluated on four metrics:

  • Accuracy Score
  • Precision Score
  • Recall Score
  • F1 Score

Importing Libraries

import math
import warnings
import joblib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

from xgboost import XGBClassifier

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import Adam, lr_scheduler

# Suppress warnings
warnings.filterwarnings("ignore")

Handling the Data

Now load the data into a pandas DataFrame.

# Load the dataset (pandas is already imported above as pd)
df = pd.read_csv("personality_dataset.csv")

Exploring the Data

df.head()
df.describe()
df.info()
for col in df.columns:
    empty_count = df[col].isnull().sum()
    empty_percent = (empty_count / len(df)) * 100
    print(f"Column [{col}] has {empty_count} empty values - {empty_percent:.2f}% empty")

Data Visualization

Now we check the data with plots.

Pair Plot

sns.pairplot(df)
plt.show()

Box Plot
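A box plot of the numeric features can be sketched as follows. This is a minimal illustration, not the original code: the DataFrame `demo_df` holds synthetic stand-in values under the dataset's column names, and plain matplotlib is used here, although seaborn's `sns.boxplot` would work equally well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real DataFrame (hypothetical values, real column names).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "Time_spent_Alone": rng.integers(0, 12, 200).astype(float),
    "Social_event_attendance": rng.integers(0, 11, 200).astype(float),
    "Going_outside": rng.integers(0, 8, 200).astype(float),
    "Friends_circle_size": rng.integers(0, 16, 200).astype(float),
    "Post_frequency": rng.integers(0, 11, 200).astype(float),
})
demo_df.loc[::25, "Post_frequency"] = np.nan  # inject a few missing values

numeric_cols = list(demo_df.columns)
# Matplotlib's boxplot does not skip NaN, so drop missing values per column.
series = [demo_df[col].dropna().to_numpy() for col in numeric_cols]

fig, ax = plt.subplots(figsize=(10, 5))
boxes = ax.boxplot(series)
ax.set_xticks(range(1, len(numeric_cols) + 1))
ax.set_xticklabels(numeric_cols, rotation=30, ha="right")
ax.set_title("Box plot of numeric features")
fig.tight_layout()
```

Points drawn beyond the whiskers are candidate outliers, which is useful to know before choosing between mean and median imputation.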

Handling Missing Values

Filling Categorical Data with Mode

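# Assign the result back: fillna(inplace=True) on a selected column is chained
# assignment and may leave df unchanged in recent pandas versions.
df['Stage_fear'] = df['Stage_fear'].fillna(df['Stage_fear'].mode()[0])
df['Drained_after_socializing'] = df['Drained_after_socializing'].fillna(df['Drained_after_socializing'].mode()[0])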

Filling Numerical Data

# Median for these three columns; assign back instead of using inplace=True
df['Time_spent_Alone'] = df['Time_spent_Alone'].fillna(df['Time_spent_Alone'].median())
df['Friends_circle_size'] = df['Friends_circle_size'].fillna(df['Friends_circle_size'].median())
df['Post_frequency'] = df['Post_frequency'].fillna(df['Post_frequency'].median())
# Mean for the remaining two columns
df['Social_event_attendance'] = df['Social_event_attendance'].fillna(
    df['Social_event_attendance'].mean()
)
df['Going_outside'] = df['Going_outside'].fillna(
    df['Going_outside'].mean()
)

Encoding Categorical Data

df['Stage_fear'] = df['Stage_fear'].map({'Yes': 1.0, 'No': 0.0})
df['Drained_after_socializing'] = df['Drained_after_socializing'].map({'Yes': 1.0, 'No': 0.0})
df['Personality'] = df['Personality'].map({'Introvert': 1.0, 'Extrovert': 0.0})

Data Splitting

X_train, X_test, y_train, y_test = train_test_split(
    df[
        [
            'Time_spent_Alone',
            'Stage_fear',
            'Social_event_attendance',
            'Going_outside',
            'Drained_after_socializing',
            'Friends_circle_size',
            'Post_frequency'
        ]
    ],
    df['Personality'],
    test_size=0.2,
    random_state=42
)
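The introduction mentions scaling the data before training. A minimal sketch of that step, assuming `StandardScaler` is the scaler used (it is the one imported earlier); the arrays here are synthetic stand-ins and the `_d` and `_scaled` names are illustrative. The scaler is fitted on the training split only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 7 features, binary target (hypothetical values).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 7))
y_demo = rng.integers(0, 2, size=200)

X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training split only ...
X_train_scaled = scaler.fit_transform(X_train_d)
# ... and reuse those statistics to transform the test split.
X_test_scaled = scaler.transform(X_test_d)
```

After this step every training feature has zero mean and unit variance, which matters most for the SVM, logistic regression, and neural network models.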

Modeling

Logistic Regression
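A minimal sketch of fitting and scoring a logistic regression classifier on scaled features. The data comes from `make_classification` as a stand-in for the personality dataset, and `max_iter=1000` is an assumed setting rather than a value taken from the original run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 7 features, like the real feature set.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_tr, y_tr)
y_pred = log_reg.predict(X_te)

# The four metrics used throughout this article.
log_reg_scores = {
    "Accuracy": accuracy_score(y_te, y_pred),
    "Precision": precision_score(y_te, y_pred),
    "Recall": recall_score(y_te, y_pred),
    "F1": f1_score(y_te, y_pred),
}
print(log_reg_scores)
```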

SVM
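A minimal sketch of the support vector machine step. Wrapping the scaler and the classifier in one pipeline is a design choice of this sketch, and the RBF kernel with `C=1.0` is an assumed configuration; the data is again a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data and applies it
# automatically at prediction time, so the two steps cannot drift apart.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, random_state=42))
svm_clf.fit(X_tr, y_tr)

svm_pred = svm_clf.predict(X_te)
svm_accuracy = accuracy_score(y_te, svm_pred)
print(f"SVM accuracy: {svm_accuracy:.3f}")
```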

Neural Network

Converting the Data into Tensors
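A minimal sketch of the conversion, assuming the scaled features are NumPy arrays and the target is a 0/1 vector. The array contents, the variable names, and the batch size of 32 are illustrative. With a pandas Series as target, call `.to_numpy()` first.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for the scaled training features and the encoded target.
rng = np.random.default_rng(42)
X_train_scaled = rng.normal(size=(160, 7))
y_train_arr = rng.integers(0, 2, size=160)

# float32 is the default dtype of nn.Linear weights; the target gets an extra
# dimension so its shape (N, 1) matches the single-output network.
X_train_t = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_t = torch.tensor(y_train_arr, dtype=torch.float32).unsqueeze(1)

train_ds = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```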

Training the Model
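A minimal sketch of a training loop built from the imports at the top of the article (`Adam`, `lr_scheduler`). The network architecture, learning rate, `StepLR` schedule, epoch count, and validation slice are all assumptions of this sketch, and the data is synthetic. The weights with the lowest validation loss are kept so that the best model can be restored later.

```python
import copy

import torch
import torch.nn as nn
from torch.optim import Adam, lr_scheduler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic, linearly separable stand-in data with 7 features.
X = torch.randn(200, 7)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
X_val, y_val = X[160:], y[160:]

# Hypothetical small network; the final layer outputs one logit per sample.
model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = Adam(model.parameters(), lr=1e-2)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

best_val_loss = float("inf")
best_state = None

for epoch in range(30):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()  # halve the learning rate every 10 epochs

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

print(f"best validation loss: {best_val_loss:.4f}")
```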

Loading the Best Model
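A minimal sketch of saving and restoring the best weights through a `state_dict` checkpoint. The helper `build_model`, the architecture, and the file name are hypothetical; the only requirement is that the restored model has the same architecture as the one that was saved.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical architecture; must match the network that was trained.
    return nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))


torch.manual_seed(42)
trained = build_model()  # stands in for the trained network

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "best_model_demo.pt")
    torch.save(trained.state_dict(), checkpoint_path)

    best_model = build_model()
    best_model.load_state_dict(torch.load(checkpoint_path))

best_model.eval()  # switch to inference mode before predicting

x = torch.randn(4, 7)
with torch.no_grad():
    same = torch.allclose(trained(x), best_model(x))
print("restored model matches:", same)
```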

Comparing Models

Functions
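A minimal sketch of a helper that computes the four metrics for one model. The name `evaluate_model` and the dictionary layout are assumptions, chosen so the results can be collected into a table.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate_model(name, y_true, y_pred):
    """Return the four evaluation metrics for one model as a dict."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }


# Tiny hand-made example: 2 true positives, 1 false negative,
# 1 false positive, 2 true negatives.
demo = evaluate_model("demo", [1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(demo)
```

In this example all four metrics come out to 2/3, since precision and recall are both 2 out of 3 and 4 of the 6 predictions are correct.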

Overview
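A minimal sketch of collecting the metrics for several models into one table. Only the three scikit-learn models are shown, trained on synthetic stand-in data with assumed default-like settings; XGBoost and the neural network could be added as further rows in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_pred),
        "Precision": precision_score(y_te, y_pred),
        "Recall": recall_score(y_te, y_pred),
        "F1": f1_score(y_te, y_pred),
    })

# One row per model, best F1 first.
overview = pd.DataFrame(rows).set_index("Model").sort_values("F1", ascending=False)
print(overview.round(3))
```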

Conclusion

The SVM, XGBoost, and Neural Network models all achieved the same scores. Since the metrics do not separate them, the choice depends on other factors:

  • For faster training, XGBoost is recommended.
  • For good theoretical guarantees, SVM is recommended.
  • For future scaling with more data, the Neural Network is recommended.
