Unveiling the Secrets of the Titanic Dataset | Episode 1

Unveiling the Secrets of the Titanic Dataset | Episode 1

Welcome to the very first edition of Kaggle Stories! In this series, we'll take you on a thrilling journey through the world of data science, where we'll reveal the stories hidden within various datasets. Our first adventure starts with a dataset that has captured the attention of data enthusiasts all over the globe: the Titanic dataset.

The Titanic Dataset: A Glimpse into History and Machine Learning

The Titanic dataset is a classic in the realm of data science, offering a unique glimpse into the passengers aboard the ill-fated RMS Titanic during its maiden voyage in 1912. This dataset, comprising information about passengers such as age, class, fare, and survival status, challenges us to explore the factors that influenced survival and to build predictive models that can discern who lived and who tragically perished.

Why the Titanic Dataset?

The Titanic dataset serves as an excellent starting point for our Kaggle Stories series for several reasons. First, its historical significance captivates our imagination and invites us to ponder the human stories behind the data points. Second, the dataset's structure presents opportunities for data exploration, preprocessing, and the application of machine learning techniques, making it an ideal canvas for both beginners and seasoned data scientists.

What to Expect in This Series

In each episode of Kaggle Stories, we will dive headfirst into a new dataset, unraveling its intricacies and showcasing the power of data science in transforming raw information into actionable insights. From exploratory data analysis to machine learning model deployment, we'll walk you through the entire data science lifecycle, one story at a time.

So, fasten your seatbelts as we set sail on this data-driven adventure. Join us in deciphering the secrets concealed within the Titanic dataset, and discover how data science can bring history to life.

Bon voyage! 🚢✨


Data Exploration

Acquire data

The Python Pandas packages help us work with our datasets. We start by acquiring the training and testing datasets into Pandas DataFrames. We also combine these datasets to run certain operations on both datasets together.

Data Source Link: https://www.kaggle.com/competitions/titanic

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df, test_df]
Analyze by describing data

Pandas also helps describe the datasets answering the following questions early in our project.

Which features are available in the dataset?

Noting the feature names for directly manipulating or analyzing these. These feature names are described on the Kaggle data age here.

print(train_df.columns.values)

Which features are categorical?

These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

  • Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

Which features are numerical?

Which features are numerical? These values change from sample to sample. Within numerical features are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.

  • Continuous: Age, Fare. Discrete: SibSp, Parch.
train_df.head()

Which features are mixed data types?

Numerical, alphanumeric data within the same feature. These are candidates for correcting goals.

  • Ticket is a mix of numeric and alphanumeric data types. The cabin is alphanumeric.

Which features may contain errors or typos?

This is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting.

  • Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
train_df.tail()

Which features contain blank, null, or empty values?

These will require correcting.

  • Cabin > Age > Embarked features contain several null values in that order for the training dataset.

  • Cabin > Age is incomplete in the case of the test dataset.

What are the data types for various features?

Helping us during converting the goal.

  • Seven features are integers or floats. Six in the case of the test dataset.

  • Five features are strings (objects).

train_df.info()
print('_'*40)
test_df.info()

What is the distribution of numerical feature values across the samples?

This helps us determine, among other early insights, how representative is the training dataset of the actual problem domain.

  • The total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).

  • Survived is a categorical feature with 0 or 1 values.

  • Around 38% of samples survived representative of the actual survival rate at 32%.

  • Most passengers (> 75%) did not travel with parents or children.

  • Nearly 30% of the passengers had siblings and/or spouses aboard.

  • Fares varied significantly with few passengers (<1%) paying as high as $512.

  • Few elderly passengers (<1%) within the age range 65-80.

train_df.describe()

What is the distribution of categorical features?

  • Names are unique across the dataset (count=unique=891)

  • Sex variable as two possible values with 65% male (top=male, freq=577/count=891).

  • Cabin values have several duplicates across samples. Alternatively, several passengers shared a cabin.

  • Embarked takes three possible values. S port used by most passengers (top=S)

  • The ticket feature has a high ratio (22%) of duplicate values (unique=681).

train_df.describe(include=['O'])


Data Preprocessing

Assumptions based on data analysis

We arrive at the following assumptions based on data analysis done so far. We may validate these assumptions further before taking appropriate actions.

Correlating.

We want to know how well each feature correlates with Survival. We want to do this early in our project and match these quick correlations with modeled correlations later in the project.

Completing.

  1. We may want to complete the Age feature as it is correlated to survival.

  2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.

Correcting.

  1. The ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.

  2. The cabin feature may be dropped as it is highly incomplete or contains many null values both in the training and test datasets.

  3. PassengerId may be dropped from the training dataset as it does not contribute to survival.

  4. The name feature is relatively non-standard, and may not contribute directly to survival, so may be dropped.

Creating.

  1. We may want to create a new feature called Family based on Parch and SibSp to get the total count of family members on board.

  2. We may want to engineer the Name feature to extract Title as a new feature.

  3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.

  4. We may also want to create a fair range feature if it helps our analysis.

Classifying.

We may also add to our assumptions based on the problem description noted earlier.

  1. Women (Sex=female) were more likely to have survived.

  2. Children (Age<?) were more likely to have survived.

  3. The upper-class passengers (Pclass=1) were more likely to have survived.

Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. We can only do so at this stage for features that do not have any empty values. It also makes sense to do so only for features that are categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch) type.

  • Pclass We observe a significant correlation (>0.5) among Pclass=1 and Survived (classifying #3). We decided to include this feature in our model.

  • Sex We confirm the observation during problem definition that Sex=female had a very high survival rate at 74% (classifying #1).

  • SibSp and Parch These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features (creating #1).

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Analyze by visualizing data

Now we can continue confirming some of our assumptions using visualizations for analyzing the data.

Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (Survived).

A histogram chart is useful for analyzing continuous numerical variables like Age where banding or ranges will help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (Did infants have a better survival rate?)

Note that the x-axis in histogram visualizations represents the count of samples or passengers.

Observations.

  • Infants (Age <=4) had a high survival rate.

  • The oldest passengers (Age = 80) survived.

  • A large number of 15-25-year-olds did not survive.

  • Most passengers are in the 15-35 age range.

Decisions.

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

  • We should consider Age (our assumption classifying #2) in our model training.

  • Complete the Age feature for null values (completing #1).

  • We should bandage groups (creating #3).

g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

Observations.

  • Pclass=3 had the most passengers, however, most did not survive. This confirms our classifying assumption #2.

  • Infant passengers in Pclass=2 and Pclass=3 mostly survived. This further qualifies our classifying assumption #2.

  • Most passengers in Pclass=1 survived. This confirms our classifying assumption #3.

  • Pclass varies in terms of Age distribution of passengers.

Decisions.

  • Consider Pclass for model training.
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

Correlating categorical features

Now we can correlate categorical features with our solution goal.

Observations.

  • Female passengers had a much better survival rate than males. Confirms classifying (#1).

  • Exception in Embarked=C where males had higher survival rate. This could be a correlation between Pclass and Embarked and in turn, Pclass and Survived, not necessarily a direct correlation between Embarked and Survived.

  • Males had a better survival rate in Pclass=3 when compared with Pclass =2 for C and Q ports. Completing (#2).

  • Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating (#1).

Decisions.

  • Add Sex feature to model training.

  • Complete and add Embarked feature to model training.

grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()

Correlating categorical and numerical features

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), and Fare (Numeric continuous), with Survived (Categorical numeric).

Observations.

  • Higher fare-paying passengers had better survival. This confirms our assumption for creating (#4) fare ranges.

  • Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).

Decisions.

  • Consider banding Fare feature.
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()

Wrangle data

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.

Correcting by dropping features

This is a good starting goal to execute. By dropping features we are dealing with fewer data points. Speeds up our notebook and eases the analysis.

Based on our assumptions and decisions we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape

Creating new features extracted from existing

We want to analyze if the Name feature can be engineered to extract titles and test the correlation between titles and survival, before dropping Name and PassengerId features.

In the following code, we extract the Title feature using regular expressions. The RegEx pattern (\w+\.) matches the first word which ends with a dot character within the Name feature. The expand=False flag returns a data frame.

Observations.

When we plot Title, Age, and Survived, we note the following observations.

  • Most titles band Age groups accurately. For example The master title has an Age mean of 5 years.

  • Survival among Title Age bands varies slightly.

  • Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

Decision.

  • We decided to retain the new Title feature for model training.
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])

for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
     'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()

Completing a numerical continuous feature

Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.

We can consider three methods to complete a numerical continuous feature.

  1. A simple way is to generate random numbers between mean and standard deviation.

  2. A more accurate way of guessing missing values is to use other correlated features. In our case, we note a correlation between Age, Gender, and Pclass. Guess Age values using median values for Age across sets of Pclass and Gender feature combinations. So, the median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...

  3. Combine methods 1 and 2. So instead of guessing age values based on the median, use random numbers between mean and standard deviation, based on sets of Pclass and Gender combinations.

Method 1 and 3 will introduce random noise into our models. The results from multiple executions might vary. We will prefer method 2.

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5

    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age']

train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]

for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.head()

Completing a categorical feature

The embarked feature takes S, Q, and C values based on a port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)

train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)

for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]

Final Look

Train data:

Test Data:


Model, predict, and solve

Now we are ready to train a model and predict the required solution. There are 60+ predictive modeling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models that we can evaluate. Our problem is a classification and regression problem. We want to identify the relationship between output (Survived or not) with other variables or features (Gender, Age, Port...). We are also performing a category of machine learning which is called supervised learning as we are training our model with a given dataset. With these two criteria - Supervised Learning plus Classification and Regression, we can narrow down our choice of models to a few. These include:

  • Logistic Regression

  • KNN or k-Nearest Neighbors

  • Support Vector Machines

  • Naive Bayes classifier

  • Decision Tree

  • Random Forrest

  • Perceptron

  • Artificial neural network

  • RVM or Relevance Vector Machine

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

# k-Nearest Neighbors algorithm
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same, we choose to use Random Forest as they correct for decision trees' habit of overfitting to their training set.

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Decent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)


Conclusion: Charting the Course Forward

As we reach the end of our maiden voyage with the Titanic dataset, we've navigated through the turbulent seas of data, uncovering fascinating insights and honing our data science skills along the way. The Titanic dataset, with its poignant tales of survival and loss, has served as a compelling canvas for our exploration into the world of data science.

Key Takeaways

From the first whistle to the final splash, we've delved into the depths of data exploration, revealing the nuances that shaped passengers' fates. Through meticulous data preprocessing, we've cleansed our dataset, ensuring a smoother journey for our machine-learning models. Our predictive models have set sail, and as we gaze upon the performance metrics, we gain a clearer understanding of the factors that influenced survival on that fateful night.

The Journey Continues

As we conclude this chapter in our Kaggle Stories series, our journey is far from over. The Titanic dataset has been a captivating starting point, but countless datasets are awaiting our exploration. We encourage you, dear reader, to embark on your data-driven adventures, armed with the skills and insights gained from our Titanic expedition.

Your Turn to Sail

We invite you to replicate our analysis, tweak the models, and challenge the assumptions. Share your findings, reflections, and questions in the comments section. Kaggle Stories is not just our narrative—it's a collaborative journey where each reader contributes to the unfolding tale of data exploration.

Anchors Aweigh

As we bid adieu to the Titanic dataset, we set our sights on new horizons. Join us in the next chapter of Kaggle Stories, where we'll embark on another exciting journey into the heart of data science. Until then, may your datasets be rich, your insights profound, and your code ever bug-free.

Fair winds and following seas! ⚓🌊


References

This notebook has been created based on great work done solving the Titanic competition and other sources.


By the way…

Call to action

Hi, Everydaycodings— I’m building a newsletter that covers deep topics in the space of engineering. If that sounds interesting, subscribe and don’t miss anything. If you have some thoughts you’d like to share or a topic suggestion, reach out to me via LinkedIn or X.

References

And if you’re interested in diving deeper into these concepts, here are some great starting points:

  • Kaggle Stories - Each episode of Kaggle Stories takes you on a journey behind the scenes of a Kaggle notebook project, breaking down tech stuff into simple stories.

  • Machine Learning - This series covers ML fundamentals & techniques to apply ML to solve real-world problems using Python & real datasets while highlighting best practices & limits.

Did you find this article valuable?

Support NeuralRealm by becoming a sponsor. Any amount is appreciated!