Kaggle Titanic

In this article, we’ll walk step by step through participating in the Kaggle Competition – Titanic Machine Learning from Disaster. We’ll dive into the competition, build a machine learning model to predict which passengers survived the wreck of the Titanic, and then save and submit our results. The result is scored on the competition’s global leaderboard, so we can compare how we perform. Later, we’ll also tune the model’s hyperparameters to try to improve our score.

Let us participate in the competition and build our machine learning model.

Step 1

First of all, visit the Kaggle website and search for the Titanic competition. It falls under the Getting Started category, which we discussed in our previous article, Kaggle Competition.

Step 2

Under Code in the menu, select New Notebook.

Step 3

Now, before we start our machine learning process, let's explore our dataset. For this, visit the Data section.

We can see the overview details along with the data dictionary.

Under the data explorer, we can see three files – gender_submission.csv, test.csv, and train.csv.

gender_submission.csv is a sample submission file, while train.csv and test.csv contain the data we use to train and test our machine learning model.

Step 4

Now, let us visit our Notebook.

Here, we can see the following code already written. It imports the necessary libraries, such as numpy and pandas, and lists the input files from the folder that we will use for training and testing later.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
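Running this cell for the Titanic competition should print the paths of the three dataset files we saw in the data explorer (the order may vary):

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv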

Training Data Exploration 

Step 5

Now, let us explore our training data.

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

Here, we can see that this code lists the first 5 rows of data with the column headings. This gives us a brief overview of the dataset, where we can see the PassengerId, the Survived flag, the passenger class (Pclass), the name of the traveller, Sex, Age, and other details.
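Beyond head(), a quick additional check (not part of the original walkthrough) is to look at the dataset’s size and the missing values per column; Age and Cabin in particular have gaps:

print(train_data.shape)           # (number of rows, number of columns)
print(train_data.isnull().sum())  # missing values per column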

Testing Data Exploration 

Step 6

Similar to the above, here we can see the first 5 rows of the test.csv file, which shows the values used to test our model. Note that the Survived column isn’t here, as this is precisely what our model must predict. These predictions are what we present in the submission file to compete against our global competitors on the leaderboard.

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()
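To confirm that only the Survived column separates the two files (a quick check, not in the original notebook):

# Columns present in train.csv but absent from test.csv
print(set(train_data.columns) - set(test_data.columns))  # {'Survived'}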

Step 7

Here, we simply explore the percentage of men who survived the Titanic disaster.

men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

We can see that only about 18.9% of men survived the disaster. This could be because women and children were given priority for the lifeboats, while men were expected to clear the way to safety for them. Furthermore, since it was 1912, we can expect that much of the crew was male, which would further lower the percentage: crew members were responsible for the passengers and were expected to see them all to safety before leaving the ship themselves.
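For comparison, the same computation for women (mirroring the snippet above) shows a far higher survival rate of roughly 74%:

women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)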

Model Training

Step 8

Here, we train our model using the RandomForestClassifier from the sklearn library. We set n_estimators, which is the number of trees; max_depth, which limits the maximum depth of each tree; and random_state, which seeds the randomness so the results are reproducible.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]  # target: 1 if the passenger survived, 0 otherwise

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])       # one-hot encode the categorical Sex column
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=1)
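pd.get_dummies converts the categorical Sex column into numeric Sex_female and Sex_male columns, since scikit-learn models require numeric input; the integer columns Pclass, SibSp, and Parch pass through unchanged. Encoding train and test separately works here because both files contain the same categories, but a more defensive version (an optional addition, not in the original notebook) aligns the test columns to the training columns:

# Fill any category missing from test with 0 so the column layouts match
X_test = X_test.reindex(columns=X.columns, fill_value=0)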

Prediction 

Step 9

Next, we fit the model and make predictions on the test data.

model.fit(X, y)
predictions = model.predict(X_test)
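Before preparing the submission, a quick sanity check (an optional step, not in the original walkthrough) is to look at accuracy on the training data itself; this overestimates leaderboard performance, but it catches gross mistakes such as mismatched features:

# Accuracy on the training data; optimistic, but useful as a smoke test
print("Training accuracy:", model.score(X, y))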

Submission File Preparation 

Step 10

Finally, we prepare our output file, which is sent for submission to the global leaderboard. It contains the PassengerId and the predicted Survived value for each passenger.

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Once the submission file is saved, we see the confirmation output.
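We can also verify that our file matches the expected format by comparing it with the provided sample submission (a quick check added here, assuming the standard /kaggle/input/titanic/ path):

# Compare our submission's structure against the sample submission file
sample = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")
assert list(output.columns) == list(sample.columns)  # PassengerId, Survived
assert len(output) == len(sample)                    # one row per test passenger (418)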

Submission 

Step 11

Now, on the right, we can see the Submit button.

With this, the file is submitted, and the progress can be seen on the left-hand side too.

Public Scoring

Now, under My Submissions, we can see our score and our rank on the leaderboard.

Here, we can see the score is 0.77511, which places my machine learning model at 7549 in the global ranking.

Hyperparameter Tuning

Step 12

Next, we change some of the hyperparameters in the hope of improving our result.

Here, for the RandomForestClassifier parameters, let us set n_estimators to 200, max_depth to 10, and random_state to 1.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Now, with this version of the submission, we can see there is no significant change in our model’s ranking. In fact, the score decreased slightly.
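Rather than spending a leaderboard submission on every parameter change, we can estimate accuracy locally with cross-validation (a sketch using scikit-learn’s cross_val_score; this is not part of the original walkthrough):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("CV accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))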

Gradient Boosting Classifier 

Step 13

Now, let us explore another algorithm – Gradient Boosting.

from sklearn.ensemble import GradientBoostingClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Here we set n_estimators to 100, learning_rate to 0.1, and max_depth to 1 for the gradient boosting classifier; with a max_depth of 1, each boosted tree is a single-split decision stump.

As we submit, we obtain a score of 0.77751, which is better than the previous best of 0.77511, and our rank has improved from 7549 to 4097. Just a 0.0024 improvement in the score lifted my rank by almost 3500 places.
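If we wanted to search the gradient boosting parameters more systematically than by hand, a grid search could automate the trial and error (a sketch; the parameter grid below is an illustrative assumption, not taken from the article):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative grid; these ranges are assumptions chosen for demonstration
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [1, 2, 3],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)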

Conclusion 

Thus, in this article, we learned the process of participating in a Kaggle competition, explored the data, and prepared our machine learning model. We trained our model and then tested it. Then, we created an output file for submission and submitted it to score against the global competitors, and we viewed our rank. Next, we performed some hyperparameter tuning with the objective of improving our model and ranking better. Initially, we couldn’t find any significant change in our global ranking on the leaderboard; we made numerous further changes to the hyperparameters and evaluated the results time and again. As we couldn’t get a better score with simple hyperparameter tuning alone, we chose another algorithm, which scored better and lifted us over 3500 places in the global ranking.

