Artificial intelligence April 09, 2025

Exploratory Data Analysis (EDA) with Python

Exploratory Data Analysis (EDA) is a fundamental step in the data science process. It involves summarizing the main characteristics of a dataset using both visual and quantitative methods. The goal of EDA is to gain insights, detect patterns, identify anomalies, and understand relationships between variables before applying machine learning models.

In this guide, we will perform a comprehensive EDA on the Titanic dataset using Python.

Dataset Used: Titanic (from Seaborn)

We will use the Titanic dataset available through the Seaborn library.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = sns.load_dataset('titanic')
df.head()

1. Understanding the Dataset

The first step in EDA is to understand the structure of the dataset, including data types, missing values, and summary statistics.

df.info()
df.describe(include='all')

Observations:

Numerical columns: age, fare, sibsp, parch
Categorical columns: sex, embarked, class, who
Target variable: survived
Presence of missing values in columns like age, embarked, and deck
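To put numbers on the missing values noted above, a small helper can tabulate them per column. This is a minimal sketch: the function name `missing_summary` is our own and not part of pandas or Seaborn.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(df)` on the Titanic frame loaded earlier should list the incomplete columns with the most affected one at the top.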

2. Univariate Analysis

This involves analyzing the distribution of individual variables.

a. Age Distribution

plt.figure(figsize=(8, 5))
sns.histplot(df['age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

The histogram shows how passenger ages are spread, and the KDE curve makes the skewness and the peaks of the distribution easier to see. Rows with a missing age are dropped before plotting.

b. Passenger Class Distribution

sns.countplot(data=df, x='pclass')
plt.title("Passenger Class Count")
plt.xlabel("Passenger Class")
plt.ylabel("Number of Passengers")
plt.show()

This chart shows how many passengers belonged to each travel class.
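The count plot can be backed by exact figures. The sketch below tabulates counts and percentage shares for any categorical column; `category_shares` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def category_shares(values: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category."""
    # value_counts skips missing entries by default; sort by category label
    counts = values.value_counts().sort_index()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})
```

For the chart above, `category_shares(df['pclass'])` would give the number and share of passengers in each class.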

3. Bivariate Analysis

This step focuses on examining relationships between two variables.

a. Survival Count by Gender

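# Map the 0/1 target to readable labels so the legend text always matches the bar colors
labeled = df.assign(outcome=df['survived'].map({0: "Not Survived", 1: "Survived"}))
sns.countplot(data=labeled, x='sex', hue='outcome', hue_order=["Not Survived", "Survived"])
plt.title("Survival Count by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()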

This visualization compares the number of survivors and non-survivors for each gender. In this dataset, a much larger share of women survived than men.
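Raw counts can mislead when groups differ in size, so it often helps to compute the survival rate per group as well. A minimal sketch, assuming a 0/1 target column; the helper name `rate_by_group` is ours.

```python
import pandas as pd


def rate_by_group(frame: pd.DataFrame, group_col: str, target_col: str = "survived") -> pd.DataFrame:
    """Return the group size and the mean of a 0/1 target for each group."""
    grouped = frame.groupby(group_col, observed=True)[target_col]
    # The mean of a 0/1 column is the fraction of rows equal to 1
    return pd.DataFrame({"n": grouped.size(), "rate": grouped.mean().round(3)})
```

With the Titanic frame, `rate_by_group(df, 'sex')` returns one row per gender with its passenger count and survival rate.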

b. Fare Distribution by Class

sns.boxplot(data=df, x='pclass', y='fare')
plt.title("Fare Distribution by Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

This boxplot helps detect variability and outliers in fare across classes.
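The quantities a boxplot draws (quartiles and the interquartile range) can also be computed directly. The following is a sketch under the assumption that pandas' default linear quantile interpolation is acceptable; `quartiles_by_group` is a hypothetical helper name.

```python
import pandas as pd


def quartiles_by_group(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Return n, Q1, median, Q3, and IQR of value_col for each group."""
    grouped = frame.groupby(group_col, observed=True)[value_col]
    # One row per group, one column per requested quantile
    table = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    table.columns = ["q1", "median", "q3"]
    table["iqr"] = table["q3"] - table["q1"]
    table.insert(0, "n", grouped.count())
    return table
```

For the plot above, the call would be `quartiles_by_group(df, 'pclass', 'fare')`.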

4. Multivariate Analysis

This step involves analyzing interactions among more than two variables.

a. Pairwise Relationships

sns.pairplot(df[['age', 'fare', 'pclass', 'survived']].dropna(), hue='survived')
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

This plot shows pairwise relationships among age, fare, and passenger class, with points colored by survival status.

b. Correlation Heatmap

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

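The heatmap quantifies the linear (Pearson) correlation between each pair of numeric columns. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.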
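When one column is the target, it is often easier to read a ranked list than a full matrix. A minimal sketch, assuming Pearson correlation (the pandas default); the helper name `correlations_with` is ours.

```python
import pandas as pd


def correlations_with(frame: pd.DataFrame, target: str = "survived") -> pd.Series:
    """Rank numeric columns by the absolute value of their correlation with target."""
    # numeric_only=True skips text and categorical columns
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)
```

Calling `correlations_with(df)` lists the numeric Titanic columns from the strongest to the weakest linear association with `survived`. Keep in mind that a strong correlation does not by itself imply causation.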

5. Missing Values Visualization

Visualizing missing data helps identify patterns and plan data cleaning.

# missingno is a third-party package; install it first with: pip install missingno
import missingno as msno

msno.matrix(df)
plt.show()

msno.heatmap(df)
plt.show()

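The matrix shows where values are missing in each column. The heatmap shows nullity correlation: how strongly the presence or absence of a value in one column is associated with the presence or absence of a value in another.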
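If missingno is not available, a similar nullity correlation can be approximated with pandas alone. This is a sketch of the idea, not missingno's exact implementation, and `nullity_correlation` is a hypothetical helper name.

```python
import pandas as pd


def nullity_correlation(frame: pd.DataFrame) -> pd.DataFrame:
    """Correlate the missing/present indicators of columns with some missing values."""
    missing = frame.isna()
    # Columns that are fully present or fully missing have no variance; skip them
    varying = missing.loc[:, missing.nunique() > 1]
    return varying.astype(int).corr()
```

A value near 1 means two columns tend to be missing in the same rows, and a value near -1 means one tends to be present when the other is missing.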

6. Outlier Detection

Outliers can distort statistical analyses and modeling results.

Box Plot for Fare

sns.boxplot(x=df['fare'])
plt.title("Boxplot for Fare")
plt.show()

This boxplot reveals that a few passengers paid significantly higher fares than the rest. Points plotted beyond the whiskers are flagged as potential outliers.
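The points a boxplot flags can be listed explicitly with the common 1.5 × IQR rule. A minimal sketch: the helper name `iqr_outliers` and the default multiplier `k=1.5` are our choices, and other thresholds may suit your data better.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]
```

For example, `iqr_outliers(df['fare'])` returns the fares that fall outside the fences, indexed by their original row so they can be inspected before deciding whether to keep, cap, or remove them.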

7. Distribution Check and Normality Test

Checking if variables like age follow a normal distribution can be useful for some machine learning algorithms.

sns.histplot(df['age'].dropna(), kde=True)
plt.title("Age Distribution")
plt.show()

from scipy.stats import skew, kurtosis

print("Skewness:", skew(df['age'].dropna()))
print("Kurtosis:", kurtosis(df['age'].dropna()))

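Skewness measures asymmetry: a value greater than 0 indicates right skew, and a value less than 0 indicates left skew.
Kurtosis measures how heavy the tails of the distribution are. By default, scipy.stats.kurtosis returns excess kurtosis, so a normal distribution scores about 0.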
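Skewness and kurtosis describe the shape, but a formal normality test needs one more step. The sketch below uses the Shapiro-Wilk test from SciPy as one possible choice; the helper name `normality_report` and the significance level `alpha=0.05` are assumptions made for illustration.

```python
import pandas as pd
from scipy import stats


def normality_report(values, alpha: float = 0.05) -> dict:
    """Summarize shape statistics and a Shapiro-Wilk test for a numeric sample."""
    clean = pd.Series(values).dropna().to_numpy(dtype=float)
    statistic, p_value = stats.shapiro(clean)
    return {
        "n": int(clean.size),
        "skewness": float(stats.skew(clean)),
        # Excess kurtosis: about 0 for a normal distribution
        "excess_kurtosis": float(stats.kurtosis(clean)),
        "shapiro_w": float(statistic),
        "p_value": float(p_value),
        # A small p-value is evidence against normality
        "reject_normality": bool(p_value < alpha),
    }
```

Running `normality_report(df['age'])` tests the age column after dropping missing values. With large samples, even small departures from normality can produce a small p-value, so it is worth reading the result alongside the histogram.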

Summary of EDA Tasks and Purposes

EDA Task	Purpose
Univariate Analysis	Understand distribution of individual variables
Bivariate Analysis	Identify relationships between two variables
Multivariate Analysis	Analyze interactions among multiple features
Outlier Detection	Detect and handle anomalies that can skew analysis or models
Missing Value Analysis	Visualize and strategize imputation or removal
Distribution Check	Validate assumptions of modeling techniques such as linear regression

Conclusion

EDA serves as the foundation for any data-driven project. It offers an opportunity to develop a deep understanding of the dataset, identify potential issues, and generate hypotheses for further analysis or modeling. Without thorough EDA, the risk of misinterpreting data or building inaccurate models significantly increases.

By combining visualizations with statistical analysis, you can uncover insights that lead to better decision-making and model performance.

If needed, the next logical step would be to preprocess this dataset, handle missing values, engineer new features, and prepare the data for machine learning.

Next Blog- Feature Engineering and Feature Scaling

Purnima
