📊 Exploratory Data Analysis¶
Introduction¶
The Titanic dataset is one of the most well-known datasets in the field of data science and machine learning. It contains detailed information about the passengers aboard the Titanic, a British passenger liner that tragically sank in the North Atlantic Ocean on April 15, 1912, after hitting an iceberg. This disaster resulted in the loss of over 1,500 lives and has since become a poignant example of maritime tragedy.
Exploratory Data Analysis (EDA) is a crucial step in any data analysis project. It involves examining the dataset to uncover underlying patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. For the Titanic dataset, EDA helps us understand the factors that influenced survival rates, such as passenger demographics, socio-economic status, and travel details.
The dataset comprises variables such as passenger age, gender, ticket class, fare paid, and whether or not the passenger survived. By analyzing these variables, we can gain insights into which groups of passengers were more likely to survive and the reasons behind these trends. For instance, we might explore questions like:
- Did gender play a significant role in survival rates?
- Were first-class passengers more likely to survive than those in lower classes?
- How did the age of passengers affect their chances of survival?
Through various visualizations and statistical analyses, EDA provides a foundation for more complex modeling and predictive analysis. It allows us to clean and preprocess the data, handle missing values, and create new features that might improve the performance of machine learning models.
About Data¶
# libraries
from loguru import logger
import pandas as pd
from IPython.display import display
from itables import init_notebook_mode, show
from great_tables import GT, html
from utils import *
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
init_notebook_mode(all_interactive=True)
def gt_tables(table, col, vo):
    # Flatten the MultiIndex columns
    table.columns = [f'{i}_{j}' for i, j in table.columns]
    # Reset the index to make the range column (e.g. 'FareRange') a regular column again
    table.reset_index(inplace=True)
    # Build the great_tables object
    gt_table = (
        GT(table)
        .tab_header(
            title="Count and Percentage Table",
            subtitle=f"{col} vs {vo}"
        )
        .tab_spanner(
            label="0",
            columns=["Count_0", "Percentage_0"]
        )
        .tab_spanner(
            label="1",
            columns=["Count_1", "Percentage_1"]
        )
        .tab_spanner(
            label=vo,
            columns=["Count_0", "Percentage_0", "Count_1", "Percentage_1"]
        )
        .cols_label(
            Count_0=html("Count"),
            Count_1=html("Count"),
            Percentage_0=html("Percentage"),
            Percentage_1=html("Percentage")
        )
    )
    gt_table = gt_table.fmt_number(columns=["Percentage_0", "Percentage_1"],
                                   decimals=2)  # .opt_stylize(style=1, color="blue")
    gt_table = gt_table.tab_options(
        table_background_color="white",
        # table_font_color="darkblue",
        table_font_style="italic",
        table_font_names="Times New Roman",
        heading_background_color="skyblue"
    )
    return gt_table
logger.info("Read Data")
# paths
path_raw = "../../data/raw/"
path_processed = "../../data/processed/"
path_final = "../../data/final/"
# read data
train = pd.read_csv(path_raw + "train.csv")
test = pd.read_csv(path_raw + "test.csv")
# display data
show(train, classes="display nowrap compact",maxBytes = 0)
2024-06-10 08:38:23.420 | INFO | __main__:<module>:1 - Read Data
[Interactive itables view of the train DataFrame with columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
# information about the data types and non-null values in each column
print("TRAIN:")
train.info()
TRAIN:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
🔑Note: For now, we will focus on the train dataset, but the same should be done for the test dataset.
# check for duplicates in the dataset
duplicates = train.duplicated()
# count the number of duplicate rows
num_duplicates = duplicates.sum()
print("duplicate rows:", num_duplicates)
duplicate rows: 0
To perform an Exploratory Data Analysis (EDA) with visualizations, whether univariate or bivariate, it is essential to consider the type of data we are working with. Additionally, we can do a broad scan of all columns or an in-depth look at individual columns, depending on the desired speed and level of detail of our EDA. The goal is to present all findings in a clear and detailed manner to ensure maximum understanding.
# get column names by data types
target = 'Survived'
float_columns = [x for x in list(train.select_dtypes(include=['float64']).columns) if x != target]
integer_columns = [x for x in list(train.select_dtypes(include=['int32', 'int64']).columns) if x != target]
object_columns = [x for x in list(train.select_dtypes(include=['object']).columns) if x != target]
# display column names by data type
print(f"Target: {target}")
print()
print("Total float columns:", len(float_columns))
print("Float columns:", float_columns)
print()
print("Total integer columns:", len(integer_columns))
print("Integer columns:", integer_columns)
print()
print("Total object columns:", len(object_columns))
print("Object columns:", object_columns)
Target: Survived

Total float columns: 2
Float columns: ['Age', 'Fare']

Total integer columns: 4
Integer columns: ['PassengerId', 'Pclass', 'SibSp', 'Parch']

Total object columns: 5
Object columns: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
In this case, the 'Age' column should be an integer, but it contains null values (`NaN`), which automatically converts it to a `float` column in pandas. For now, we will treat it as a `float` column.
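If we wanted to keep Age as whole numbers despite the missing values, pandas also offers the nullable Int64 extension dtype. A minimal sketch, where the rounding and the temporary variable are illustrative choices rather than part of the original analysis:

# Optional: represent Age as a nullable integer instead of float.
# The "Int64" extension dtype stores missing values as <NA>
# rather than forcing the column to float64.
# Some ages are fractional (e.g. infants), so we round first for illustration.
age_int = train['Age'].round().astype('Int64')
print(age_int.dtype)          # Int64
print(age_int.isna().sum())   # the missing values are preserved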
About EDA¶
There are columns used to identify each individual, typically indicated by the term 'ID' in their name. It is essential that these identifier columns contain no null values.
It is also important that these identifiers are not duplicated, unless the dataset intentionally holds more than one record per individual because of its relationship with other columns (e.g., a time period). In those specific cases, duplication is expected and tied to the analytical context involving those other variables or periods.
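Besides the null check below, it is worth confirming that the identifier is actually unique; a minimal sketch using plain pandas:

# Check that the identifier column has no duplicated IDs
duplicated_ids = train['PassengerId'].duplicated().sum()
print(f"Duplicated PassengerId values: {duplicated_ids}")
print("Is unique:", train['PassengerId'].is_unique)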
logger.info("EDA")
logger.info('PassengerId')
total_nulls = train['PassengerId'].isnull().sum()
print(f"Total null values: {total_nulls} ")
2024-06-10 08:38:23.500 | INFO | __main__:<module>:1 - EDA
2024-06-10 08:38:23.500 | INFO | __main__:<module>:2 - PassengerId
Total null values: 0
# Set as index
train = train.set_index('PassengerId')
show(train, classes="display nowrap compact",maxBytes = 0)
[Interactive itables view of the train DataFrame indexed by PassengerId, with columns: Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Now, we will proceed to work on the remaining columns.
To analyze variables with float values, the approach depends on the distribution of the data. For univariate analysis, a histogram is usually the starting point (first case). However, if the dataset is large or there is a noticeable concentration of values around zero (a common scenario), it is more effective to transform the data into discrete intervals (second case).
The definition of these intervals can be done automatically with a function that generates equidistant ranges, as sketched below. However, it is sometimes preferable to define the intervals manually, since automation can create too many bins and make interpretation and analysis more difficult. In such cases, a manual approach allows the intervals to be adjusted to the specific nature of the data, facilitating better understanding and analysis.
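For reference, a minimal sketch of the automatic approach using plain pandas/NumPy, where the number of bins and the example column are illustrative choices:

import numpy as np
# Generate equidistant bin edges for a float column
n_bins = 8
edges = np.linspace(train['Fare'].min(), train['Fare'].max(), n_bins + 1)
# Discretize the column into those intervals and inspect the counts
fare_ranges = pd.cut(train['Fare'], bins=edges, include_lowest=True)
print(fare_ranges.value_counts().sort_index())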
logger.info(f"floats: {float_columns}")
bins = {
'Age': [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
'Fare': [0, 10, 25, 50, 100, 1000]
}
target = 'Survived'
for col in float_columns:
print(f"Column: {col}\n")
print("Univariate Analysis")
plot_histogram(train, col)
plot_range_distribution(train, col, bins[col], figsize=(10, 5))
print("Bivariate Analysis")
plot_histogram_vo(train, col, target)
plot_range_distribution_vo(train, col, bins[col], target, figsize=(10, 5))
print("Tables")
table = calculate_percentage_vo(train, col, bins[col], target)
table = gt_tables(table,col,target)
display(table)
2024-06-10 08:38:23.553 | INFO | __main__:<module>:1 - floats: ['Age', 'Fare']
Column: Age

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Age vs Survived

| AgeRange | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| [0, 10) | 24.0 | 0.06 | 38.0 | 0.13 |
| [10, 20) | 61.0 | 0.14 | 41.0 | 0.14 |
| [20, 30) | 143.0 | 0.34 | 77.0 | 0.27 |
| [30, 40) | 94.0 | 0.22 | 73.0 | 0.25 |
| [40, 50) | 55.0 | 0.13 | 34.0 | 0.12 |
| [50, 60) | 28.0 | 0.07 | 20.0 | 0.07 |
| [60, 70) | 13.0 | 0.03 | 6.0 | 0.02 |
| [70, 80) | 6.0 | 0.01 | 0.0 | 0.00 |
| [80, 90) | 0.0 | 0.00 | 1.0 | 0.00 |
Column: Fare

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Fare vs Survived

| FareRange | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| [0, 10) | 269.0 | 0.49 | 67.0 | 0.20 |
| [10, 25) | 128.0 | 0.23 | 93.0 | 0.27 |
| [25, 50) | 100.0 | 0.18 | 73.0 | 0.21 |
| [50, 100) | 38.0 | 0.07 | 70.0 | 0.20 |
| [100, 1000) | 14.0 | 0.03 | 39.0 | 0.11 |
To visually represent variables of type `int` or `object`, it is initially recommended to use the `value_counts` method from pandas to count the unique values in the column. However, different considerations should be taken into account:

- When the number of unique values is small, it is appropriate to use `value_counts` directly for both `int` and `object` type variables.
- For `int` type variables with a large number of categories, it is useful to work with value intervals before creating graphical visualizations.
- For `object` type variables with many categories, it may help to keep the most frequent ones and group the rest under a general label such as "others", as sketched below. However, if most values are unique, the variable may not provide relevant information for graphical representation (e.g., addresses, phone numbers, emails).
logger.info(f"Integers: {integer_columns}")
logger.info(f"Objects: {object_columns}")
for col in ['Pclass', 'SibSp', 'Parch', 'Sex', 'Embarked']:
print(col)
print("Univariate Analysis")
plot_barplot(train, col)
print("Bivariate Analysis")
plot_barplot_vo(train, col, target, figsize=(10, 5))
print("Tables")
table = calculate_percentage_vo_int(train, col, target).fillna(0)
table = gt_tables(table,col,target)
display(table)
2024-06-10 08:38:24.822 | INFO | __main__:<module>:1 - Integers: ['PassengerId', 'Pclass', 'SibSp', 'Parch']
2024-06-10 08:38:24.823 | INFO | __main__:<module>:2 - Objects: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Pclass

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Pclass vs Survived

| Pclass | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 1 | 80.0 | 0.15 | 136.0 | 0.40 |
| 2 | 97.0 | 0.18 | 87.0 | 0.25 |
| 3 | 372.0 | 0.68 | 119.0 | 0.35 |
SibSp

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: SibSp vs Survived

| SibSp | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 0 | 398.0 | 0.72 | 210.0 | 0.61 |
| 1 | 97.0 | 0.18 | 112.0 | 0.33 |
| 2 | 15.0 | 0.03 | 13.0 | 0.04 |
| 3 | 12.0 | 0.02 | 4.0 | 0.01 |
| 4 | 15.0 | 0.03 | 3.0 | 0.01 |
| 5 | 5.0 | 0.01 | 0.0 | 0.00 |
| 8 | 7.0 | 0.01 | 0.0 | 0.00 |
Parch

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Parch vs Survived

| Parch | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| 0 | 445.0 | 0.81 | 233.0 | 0.68 |
| 1 | 53.0 | 0.10 | 65.0 | 0.19 |
| 2 | 40.0 | 0.07 | 40.0 | 0.12 |
| 3 | 2.0 | 0.00 | 3.0 | 0.01 |
| 4 | 4.0 | 0.01 | 0.0 | 0.00 |
| 5 | 4.0 | 0.01 | 1.0 | 0.00 |
| 6 | 1.0 | 0.00 | 0.0 | 0.00 |
Sex

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Sex vs Survived

| Sex | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| female | 81.0 | 0.15 | 233.0 | 0.68 |
| male | 468.0 | 0.85 | 109.0 | 0.32 |
Embarked

Univariate Analysis
Bivariate Analysis
Tables

Count and Percentage Table: Embarked vs Survived

| Embarked | Count (Survived=0) | Percentage (Survived=0) | Count (Survived=1) | Percentage (Survived=1) |
|---|---|---|---|---|
| C | 75.0 | 0.14 | 93.0 | 0.27 |
| Q | 47.0 | 0.09 | 30.0 | 0.09 |
| S | 427.0 | 0.78 | 217.0 | 0.64 |
To perform exploratory analysis on the 'Cabin', 'Name', and 'Ticket' columns of the Titanic dataset, various approaches are possible depending on the information they contain and the specific objectives of the analysis, for example extracting derived features such as a passenger's title or the cabin deck, as sketched below.
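As one possible direction (not part of the analysis above), a minimal sketch of simple feature extraction from these three columns, where the regular expressions and new column names are illustrative assumptions:

# Illustrative derived features from the text-heavy columns
eda = train.copy()
# Title from the Name column (e.g. Mr, Mrs, Miss, Master)
eda['Title'] = eda['Name'].str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()
# Deck letter from the Cabin column; missing cabins get their own label
eda['Deck'] = eda['Cabin'].str[0].fillna('Unknown')
# Whether the ticket contains an alphabetic prefix
eda['TicketHasPrefix'] = eda['Ticket'].str.contains(r'[A-Za-z]', na=False)
print(eda['Title'].value_counts().head())
print(eda['Deck'].value_counts())
print(eda['TicketHasPrefix'].value_counts())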
Conclusion¶
In conclusion, we conducted an Exploratory Data Analysis (EDA) on the Titanic dataset, focusing on understanding the characteristics and distribution of various columns. We identified the data types and non-null values, checked for duplicates, and performed univariate and bivariate analyses on numerical and categorical variables. Through graphical representations and interval-based transformations, we gained insights into the data's structure and key factors influencing survival rates. This comprehensive EDA serves as a foundation for further in-depth analysis and modeling.