📝 Feature Engineering¶
Introduction¶
Feature engineering is a crucial step in the data preprocessing pipeline, aimed at enhancing the predictive power of machine learning models. For the Titanic dataset, this involves creating new features and modifying existing ones to better capture the underlying patterns that influence passenger survival.
The Titanic dataset includes various columns such as 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', and 'Embarked'. Each of these features holds potential insights into the survival outcomes, but they often require transformation and enrichment to become more effective for predictive modeling.
Feature engineering is an iterative process that involves experimenting with different transformations and evaluating their impact on model performance. By carefully crafting and selecting features, we can significantly improve the accuracy and robustness of predictive models for the Titanic dataset.
Feature Engineering¶
This stage offers many opportunities for deeper analysis, especially through interactions between columns. For our practical purposes, however, we will follow this approach:
- We will remove the 'Name' and 'Ticket' columns, as we will not use them in this first iteration of the model.
- For the 'Age' variable, we will fill the missing values with the (rounded) mean age from the training set.
- For the 'Cabin' column, we will fill missing values with the placeholder 'N' and keep only the first letter, which identifies the deck.
- We will convert 'Pclass', 'SibSp', and 'Parch' to strings so that they are treated as categorical variables.
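Taken together, these steps amount to a small preprocessing routine that the cells below implement one at a time. A consolidated sketch (the `preprocess` name and the hard-coded fill value are illustrative, not part of the notebook):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Apply the four steps above to one split of the Titanic data."""
    # 1. Drop columns we will not use in this iteration.
    df = df.drop(columns=["Name", "Ticket"], errors="ignore")
    # 2. Impute missing ages with a value computed on the training split.
    df["Age"] = df["Age"].fillna(age_fill)
    # 3. Fill missing cabins with the placeholder 'N', keep the deck letter.
    df["Cabin"] = df["Cabin"].fillna("N").str[0]
    # 4. Treat these count/ordinal columns as categorical strings.
    cols = ["Pclass", "SibSp", "Parch"]
    df[cols] = df[cols].astype(str)
    return df
```

Both splits would be passed through the same function, with `age_fill` computed on the training set only.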
# Libraries
from loguru import logger
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from itables import init_notebook_mode, show
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
init_notebook_mode(all_interactive=True)
logger.info("Read Data")
# Paths
path_raw = "../../data/raw/"
path_processed = "../../data/processed/"
path_final = "../../data/final/"
# Read data
train = pd.read_csv(path_raw + "train.csv")
test = pd.read_csv(path_raw + "test.csv")
2024-06-10 08:07:27.662 | INFO | __main__:<module>:1 - Read Data
# Get column names by data types
target = 'Survived'
float_columns = [x for x in train.select_dtypes(include=['float64']).columns if x != target]
integer_columns = [x for x in train.select_dtypes(include=['int32', 'int64']).columns if x != target]
object_columns = [x for x in train.select_dtypes(include=['object']).columns if x != target]
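On a toy frame (hypothetical data), the partition behaves like this; note that the target column is excluded from every bucket:

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1],          # int64, but excluded as the target
    "Fare": [7.25, 71.28],       # float64
    "Pclass": [3, 1],            # int64
    "Sex": ["male", "female"],   # object
})
target = "Survived"
floats = [c for c in df.select_dtypes(include=["float64"]).columns if c != target]
ints = [c for c in df.select_dtypes(include=["int32", "int64"]).columns if c != target]
objs = [c for c in df.select_dtypes(include=["object"]).columns if c != target]
print(floats, ints, objs)  # → ['Fare'] ['Pclass'] ['Sex']
```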
logger.info("Remove variables: 'Name' and 'Ticket'")
cols_delete = ['Name', 'Ticket']
train = train.drop(cols_delete, axis=1)
test = test.drop(cols_delete, axis=1)
2024-06-10 08:07:27.699 | INFO | __main__:<module>:1 - Remove variables: 'Name' and 'Ticket'
logger.info("Fill 'Age' with the mean")
age_mean = round(train['Age'].mean())
train['Age'] = train['Age'].fillna(age_mean)
test['Age'] = test['Age'].fillna(age_mean)
2024-06-10 08:07:27.713 | INFO | __main__:<module>:1 - Fill 'Age' with the mean
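One detail worth keeping: the fill value is computed on the training split only and then applied to both splits, which avoids leaking test-set statistics into preprocessing. A minimal sketch with made-up ages:

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({"Age": [22.0, np.nan, 38.0]})
test = pd.DataFrame({"Age": [np.nan, 26.0]})

# Compute the imputation value on train only, then apply it to both splits.
age_mean = round(train["Age"].mean())  # (22 + 38) / 2 = 30
train["Age"] = train["Age"].fillna(age_mean)
test["Age"] = test["Age"].fillna(age_mean)
```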
logger.info("Modify and fill missing values in 'Cabin'")
train['Cabin'] = train['Cabin'].fillna('N').str[0]
test['Cabin'] = test['Cabin'].fillna('N').str[0]
2024-06-10 08:07:27.731 | INFO | __main__:<module>:1 - Modify and fill missing values in 'Cabin'
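The effect of the 'Cabin' transformation is easiest to see on a few sample values (made up here): missing cabins become the placeholder deck 'N', and multi-cabin strings collapse to their first deck letter.

```python
import pandas as pd

cabins = pd.Series(["C85", None, "E46", "B57 B59"])
decks = cabins.fillna("N").str[0]  # placeholder 'N', then first character
print(decks.tolist())  # → ['C', 'N', 'E', 'B']
```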
logger.info("Change data type: 'Pclass', 'SibSp', and 'Parch'")
columns_to_convert = ['Pclass', 'SibSp', 'Parch']
train[columns_to_convert] = train[columns_to_convert].astype(str)
test[columns_to_convert] = test[columns_to_convert].astype(str)
2024-06-10 08:07:27.746 | INFO | __main__:<module>:1 - Change data type: 'Pclass', 'SibSp', and 'Parch'
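Casting to `str` makes downstream encoders treat these columns as categorical. A memory-friendlier alternative, should it be needed, is pandas' `category` dtype (a sketch with made-up rows):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 2], "SibSp": [1, 0, 0], "Parch": [0, 0, 2]})
cols = ["Pclass", "SibSp", "Parch"]
# Same effect as .astype(str), but stores each distinct value only once.
df[cols] = df[cols].astype(str).astype("category")
print(df["Pclass"].dtype)  # → category
```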
# Display train and test dataset
logger.info("New train data")
show(train, classes="display nowrap compact", maxBytes=0)
2024-06-10 08:07:27.760 | INFO | __main__:<module>:2 - New train data
(Interactive table output: PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked)
logger.info("New test data")
show(test, classes="display nowrap compact", maxBytes=0)
2024-06-10 08:07:27.784 | INFO | __main__:<module>:1 - New test data
(Interactive table output: PassengerId, Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked)
logger.info("Save Results")
train.to_csv(path_processed + 'train.csv', sep=',', index=False)
test.to_csv(path_processed + 'test.csv', sep=',', index=False)
2024-06-10 08:07:27.800 | INFO | __main__:<module>:1 - Save Results
Conclusion¶
In our feature engineering process for the Titanic dataset, we undertook several steps to prepare the data for effective modeling:
- Removal of Unused Columns: We removed the 'Name' and 'Ticket' columns, which we chose not to use in this first iteration of the model.
- Handling Missing Values:
- For the 'Age' column, missing values were filled with the rounded mean age from the training set to avoid data loss.
- For the 'Cabin' column, missing values were replaced with the placeholder 'N', and only the first letter (the deck) was retained to simplify the data.
- Data Type Conversion: The columns 'Pclass', 'SibSp', and 'Parch' were converted from numeric to string type so that downstream models treat them as categorical rather than continuous.
- Data Saving: The processed training and test datasets were saved for future modeling and analysis.
These feature engineering steps have improved the quality and usability of the dataset, ensuring that it is well-prepared for subsequent analysis and machine learning tasks. By addressing missing values, simplifying categorical data, and removing unnecessary columns, we have created a more robust and interpretable dataset for predicting passenger survival on the Titanic.