📊 Exploratory Data Analysis¶
Introduction¶
The Titanic dataset is one of the most well-known datasets in the field of data science and machine learning. It contains detailed information about the passengers aboard the Titanic, a British passenger liner that tragically sank in the North Atlantic Ocean on April 15, 1912, after hitting an iceberg. This disaster resulted in the loss of over 1,500 lives and has since become a poignant example of maritime tragedy.
Exploratory Data Analysis (EDA) is a crucial step in any data analysis project. It involves examining the dataset to uncover underlying patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. For the Titanic dataset, EDA helps us understand the factors that influenced survival rates, such as passenger demographics, socio-economic status, and travel details.
The dataset comprises variables such as passenger age, gender, ticket class, fare paid, and whether or not the passenger survived. By analyzing these variables, we can gain insights into which groups of passengers were more likely to survive and the reasons behind these trends. For instance, we might explore questions like:
- Did gender play a significant role in survival rates?
- Were first-class passengers more likely to survive than those in lower classes?
- How did the age of passengers affect their chances of survival?
Through various visualizations and statistical analyses, EDA provides a foundation for more complex modeling and predictive analysis. It allows us to clean and preprocess the data, handle missing values, and create new features that might improve the performance of machine learning models.
About Data¶
# libraries
from loguru import logger
import pandas as pd
from IPython.display import display
from itables import init_notebook_mode, show
from great_tables import GT, html
from utils import *
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
init_notebook_mode(all_interactive=True)
def gt_tables(table, col, vo):
    # Flatten the MultiIndex columns
    table.columns = [f'{i}_{j}' for i, j in table.columns]
    # Reset the index to make the range column (e.g. 'FareRange') a regular column again
    table.reset_index(inplace=True)
    # Build the great_tables object
    gt_table = (
        GT(table)
        .tab_header(
            title="Count and Percentage Table",
            subtitle=f"{col} vs {vo}"
        )
        .tab_spanner(
            label="0",
            columns=["Count_0", "Percentage_0"]
        )
        .tab_spanner(
            label="1",
            columns=["Count_1", "Percentage_1"]
        )
        .tab_spanner(
            label=vo,
            columns=["Count_0", "Percentage_0", "Count_1", "Percentage_1"]
        )
        .cols_label(
            Count_0=html("Count"),
            Count_1=html("Count"),
            Percentage_0=html("Percentage"),
            Percentage_1=html("Percentage")
        )
    )
    gt_table = gt_table.fmt_number(columns=["Percentage_0", "Percentage_1"],
                                   decimals=2)  # .opt_stylize(style=1, color="blue")
    gt_table = gt_table.tab_options(
        table_background_color="white",
        # table_font_color="darkblue",
        table_font_style="italic",
        table_font_names="Times New Roman",
        heading_background_color="skyblue"
    )
    return gt_table
logger.info("Read Data")
# paths
path_raw = "../../data/raw/"
path_processed = "../../data/processed/"
path_final = "../../data/final/"
# read data
train = pd.read_csv(path_raw + "train.csv")
test = pd.read_csv(path_raw + "test.csv")
# display data
show(train, classes="display nowrap compact",maxBytes = 0)
2024-06-10 08:38:23.420 | INFO | __main__:<module>:1 - Read Data
[Interactive itables view of the train DataFrame with columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
# information about the data types and non-null values in each column
print("TRAIN:")
train.info()
TRAIN:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
🔑Note: For now, we will focus on the train dataset, but the same should be done for the test dataset.
# check for duplicates in the dataset
duplicates = train.duplicated()
# count the number of duplicate rows
num_duplicates = duplicates.sum()
print("duplicate rows:", num_duplicates)
duplicate rows: 0
To perform an Exploratory Data Analysis (EDA) with visualizations, whether univariate or bivariate, it is essential to consider the type of data we are working with. Additionally, we can do a broad scan of all columns or an in-depth look at individual columns, depending on the desired speed and level of detail of our EDA. The goal is to present all findings in a clear and detailed manner to ensure maximum understanding.
# get column names by data types
target = 'Survived'
float_columns = [x for x in list(train.select_dtypes(include=['float64']).columns) if x != target]
integer_columns = [x for x in list(train.select_dtypes(include=['int32', 'int64']).columns) if x != target]
object_columns = [x for x in list(train.select_dtypes(include=['object']).columns) if x != target]
# display column names by data type
print(f"Target: {target}")
print()
print("Total float columns:", len(float_columns))
print("Float columns:", float_columns)
print()
print("Total integer columns:", len(integer_columns))
print("Integer columns:", integer_columns)
print()
print("Total object columns:", len(object_columns))
print("Object columns:", object_columns)
Target: Survived

Total float columns: 2
Float columns: ['Age', 'Fare']

Total integer columns: 4
Integer columns: ['PassengerId', 'Pclass', 'SibSp', 'Parch']

Total object columns: 5
Object columns: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
In this case, the 'Age' column should be an integer, but it contains null values (`NaN`), which automatically converts it to a `float` column in pandas. For now, we will treat it as a `float` column.
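If we wanted to keep Age as whole numbers despite the missing values, pandas also offers the nullable Int64 extension dtype. A minimal sketch, where the rounding and the temporary variable are illustrative choices rather than part of the original analysis:

# Optional: represent Age as a nullable integer instead of float.
# The "Int64" extension dtype stores missing values as <NA>
# rather than forcing the column to float64.
# Some ages are fractional (e.g. infants), so we round first for illustration.
age_int = train['Age'].round().astype('Int64')
print(age_int.dtype)          # Int64
print(age_int.isna().sum())   # the missing values are preserved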
About EDA¶
There are columns used to identify each individual, typically indicated by the term 'ID' in their name. It is essential that these identifier columns contain no null values.
It is also important that these identifiers are not duplicated, unless the dataset intentionally holds more than one record per individual because of its relationship with other columns (e.g., a time period). In those specific cases, duplication is expected and tied to the analytical context involving those other variables or periods.
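Besides the null check below, it is worth confirming that the identifier is actually unique; a minimal sketch using plain pandas:

# Check that the identifier column has no duplicated IDs
duplicated_ids = train['PassengerId'].duplicated().sum()
print(f"Duplicated PassengerId values: {duplicated_ids}")
print("Is unique:", train['PassengerId'].is_unique)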
logger.info("EDA")
logger.info('PassengerId')
total_nulls = train['PassengerId'].isnull().sum()
print(f"Total null values: {total_nulls} ")
2024-06-10 08:38:23.500 | INFO | __main__:<module>:1 - EDA
2024-06-10 08:38:23.500 | INFO | __main__:<module>:2 - PassengerId
Total null values: 0
# Set as index
train = train.set_index('PassengerId')
show(train, classes="display nowrap compact",maxBytes = 0)
[Interactive itables view of the train DataFrame indexed by PassengerId, with columns: Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Now, we will proceed to work on the remaining columns.
To analyze variables with float values, the approach depends on the distribution of the data. For univariate analysis, a histogram is usually the starting point (first case). However, if the dataset is large or there is a noticeable concentration of values around zero (a common scenario), it is more effective to transform the data into discrete intervals (second case).
The definition of these intervals can be done automatically with a function that generates equidistant ranges, as sketched below. However, it is sometimes preferable to define the intervals manually, since automation can create too many bins and make interpretation and analysis more difficult. In such cases, a manual approach allows the intervals to be adjusted to the specific nature of the data, facilitating better understanding and analysis.
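For reference, a minimal sketch of the automatic approach using plain pandas/NumPy, where the number of bins and the example column are illustrative choices:

import numpy as np
# Generate equidistant bin edges for a float column
n_bins = 8
edges = np.linspace(train['Fare'].min(), train['Fare'].max(), n_bins + 1)
# Discretize the column into those intervals and inspect the counts
fare_ranges = pd.cut(train['Fare'], bins=edges, include_lowest=True)
print(fare_ranges.value_counts().sort_index())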
logger.info(f"floats: {float_columns}")
bins = {
'Age': [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
'Fare': [0, 10, 25, 50, 100, 1000]
}
target = 'Survived'
for col in float_columns:
print(f"Column: {col}\n")
print("Univariate Analysis")
plot_histogram(train, col)
plot_range_distribution(train, col, bins[col], figsize=(10, 5))
print("Bivariate Analysis")
plot_histogram_vo(train, col, target)
plot_range_distribution_vo(train, col, bins[col], target, figsize=(10, 5))
print("Tables")
table = calculate_percentage_vo(train, col, bins[col], target)
table = gt_tables(table,col,target)
display(table)
2024-06-10 08:38:23.553 | INFO | __main__:<module>:1 - floats: ['Age', 'Fare']
Column: Age

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Age vs Survived

| AgeRange | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| [0, 10) | 24.0 | 0.06 | 38.0 | 0.13 |
| [10, 20) | 61.0 | 0.14 | 41.0 | 0.14 |
| [20, 30) | 143.0 | 0.34 | 77.0 | 0.27 |
| [30, 40) | 94.0 | 0.22 | 73.0 | 0.25 |
| [40, 50) | 55.0 | 0.13 | 34.0 | 0.12 |
| [50, 60) | 28.0 | 0.07 | 20.0 | 0.07 |
| [60, 70) | 13.0 | 0.03 | 6.0 | 0.02 |
| [70, 80) | 6.0 | 0.01 | 0.0 | 0.00 |
| [80, 90) | 0.0 | 0.00 | 1.0 | 0.00 |
Column: Fare

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Fare vs Survived

| FareRange | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| [0, 10) | 269.0 | 0.49 | 67.0 | 0.20 |
| [10, 25) | 128.0 | 0.23 | 93.0 | 0.27 |
| [25, 50) | 100.0 | 0.18 | 73.0 | 0.21 |
| [50, 100) | 38.0 | 0.07 | 70.0 | 0.20 |
| [100, 1000) | 14.0 | 0.03 | 39.0 | 0.11 |
To visually represent variables of type `int` or `object`, it is initially recommended to use the `value_counts` method from pandas to count the unique values in the column. However, different considerations should be taken into account:

- When the number of unique values is small, it is appropriate to use `value_counts` directly for both `int` and `object` type variables.
- For `int` type variables with a large number of categories, it is useful to work with value intervals before creating graphical visualizations.
- For `object` type variables with many categories, it may help to keep the most frequent ones and group the rest under a general label such as "others", as sketched below. However, if most values are unique, the variable may not provide relevant information for graphical representation (e.g., addresses, phone numbers, emails).
logger.info(f"Integers: {integer_columns}")
logger.info(f"Objects: {object_columns}")
for col in ['Pclass', 'SibSp', 'Parch', 'Sex', 'Embarked']:
print(col)
print("Univariate Analysis")
plot_barplot(train, col)
print("Bivariate Analysis")
plot_barplot_vo(train, col, target, figsize=(10, 5))
print("Tables")
table = calculate_percentage_vo_int(train, col, target).fillna(0)
table = gt_tables(table,col,target)
display(table)
2024-06-10 08:38:24.822 | INFO | __main__:<module>:1 - Integers: ['PassengerId', 'Pclass', 'SibSp', 'Parch']
2024-06-10 08:38:24.823 | INFO | __main__:<module>:2 - Objects: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Pclass

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Pclass vs Survived

| Pclass | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 1 | 80.0 | 0.15 | 136.0 | 0.40 |
| 2 | 97.0 | 0.18 | 87.0 | 0.25 |
| 3 | 372.0 | 0.68 | 119.0 | 0.35 |
SibSp

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: SibSp vs Survived

| SibSp | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 0 | 398.0 | 0.72 | 210.0 | 0.61 |
| 1 | 97.0 | 0.18 | 112.0 | 0.33 |
| 2 | 15.0 | 0.03 | 13.0 | 0.04 |
| 3 | 12.0 | 0.02 | 4.0 | 0.01 |
| 4 | 15.0 | 0.03 | 3.0 | 0.01 |
| 5 | 5.0 | 0.01 | 0.0 | 0.00 |
| 8 | 7.0 | 0.01 | 0.0 | 0.00 |
Parch

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Parch vs Survived

| Parch | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 0 | 445.0 | 0.81 | 233.0 | 0.68 |
| 1 | 53.0 | 0.10 | 65.0 | 0.19 |
| 2 | 40.0 | 0.07 | 40.0 | 0.12 |
| 3 | 2.0 | 0.00 | 3.0 | 0.01 |
| 4 | 4.0 | 0.01 | 0.0 | 0.00 |
| 5 | 4.0 | 0.01 | 1.0 | 0.00 |
| 6 | 1.0 | 0.00 | 0.0 | 0.00 |
Sex

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Sex vs Survived

| Sex | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| female | 81.0 | 0.15 | 233.0 | 0.68 |
| male | 468.0 | 0.85 | 109.0 | 0.32 |
Embarked

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Embarked vs Survived

| Embarked | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| C | 75.0 | 0.14 | 93.0 | 0.27 |
| Q | 47.0 | 0.09 | 30.0 | 0.09 |
| S | 427.0 | 0.78 | 217.0 | 0.64 |
To perform exploratory analysis on the 'Cabin', 'Name', and 'Ticket' columns of the Titanic dataset, various approaches are possible depending on the information they contain and the specific objectives of the analysis, for example extracting derived features such as a passenger's title or the cabin deck, as sketched below.
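As one possible direction (not part of the analysis above), a minimal sketch of simple feature extraction from these three columns, where the regular expressions and new column names are illustrative assumptions:

# Illustrative derived features from the text-heavy columns
eda = train.copy()
# Title from the Name column (e.g. Mr, Mrs, Miss, Master)
eda['Title'] = eda['Name'].str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()
# Deck letter from the Cabin column; missing cabins get their own label
eda['Deck'] = eda['Cabin'].str[0].fillna('Unknown')
# Whether the ticket contains an alphabetic prefix
eda['TicketHasPrefix'] = eda['Ticket'].str.contains(r'[A-Za-z]', na=False)
print(eda['Title'].value_counts().head())
print(eda['Deck'].value_counts())
print(eda['TicketHasPrefix'].value_counts())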
Conclusion¶
In conclusion, we conducted an Exploratory Data Analysis (EDA) on the Titanic dataset, focusing on understanding the characteristics and distribution of various columns. We identified the data types and non-null values, checked for duplicates, and performed univariate and bivariate analyses on numerical and categorical variables. Through graphical representations and interval-based transformations, we gained insights into the data's structure and key factors influencing survival rates. This comprehensive EDA serves as a foundation for further in-depth analysis and modeling.