Artificial intelligence, April 04, 2025

Exploratory Data Analysis (EDA) with Python

Exploratory Data Analysis (EDA) is a fundamental step in the data science process. It involves summarizing the main characteristics of a dataset using both visual and quantitative methods. The goal of EDA is to gain insights, detect patterns, identify anomalies, and understand relationships between variables before applying machine learning models.

In this guide, we will perform a comprehensive EDA on the Titanic dataset using Python.

Dataset Used: Titanic (from Seaborn)

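We will use the Titanic dataset from Seaborn's collection of example datasets. Note that sns.load_dataset downloads the file the first time it is called, so an internet connection is needed for the initial load.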

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = sns.load_dataset('titanic')
df.head()

1. Understanding the Dataset

The first step in EDA is to understand the structure of the dataset, including data types, missing values, and summary statistics.

df.info()
df.describe(include='all')

Observations:

  • Numerical columns: age, fare, sibsp, parch
  • Categorical columns: sex, embarked, class, who
  • Target variable: survived
  • Missing values: present in columns such as age, embarked, and deck
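To put numbers on the missing-value observation, a small helper along these lines can be useful. This is a minimal sketch: missing_summary is a hypothetical helper name, and the toy frame is made up for illustration so the snippet runs without downloading anything. In practice you would pass the Titanic df instead.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most-missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(df) on the real dataset should list the columns named in the observations, along with how much of each is missing.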

2. Univariate Analysis

This involves analyzing the distribution of individual variables.

a. Age Distribution

plt.figure(figsize=(8, 5))
sns.histplot(df['age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

The histogram shows how passenger ages are spread, and the KDE curve makes the skew and the location of the peak easier to see.

b. Passenger Class Distribution

sns.countplot(data=df, x='pclass')
plt.title("Passenger Class Count")
plt.xlabel("Passenger Class")
plt.ylabel("Number of Passengers")
plt.show()

This chart shows how many passengers belonged to each travel class.
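The same information can be read as proportions rather than bar heights. The sketch below uses a made-up pclass column (an assumption for illustration, not real Titanic data); with the actual dataset you would call the same method on df['pclass'].

```python
import pandas as pd

# Toy class labels for illustration only (not real Titanic records)
pclass = pd.Series([3, 3, 1, 2, 3, 1, 3, 2], name="pclass")

# normalize=True turns counts into shares that sum to 1
shares = pclass.value_counts(normalize=True).sort_index()
print(shares)
```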

3. Bivariate Analysis

This step focuses on examining relationships between two variables.

a. Survival Count by Gender

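ax = sns.countplot(data=df, x='sex', hue='survived')
plt.title("Survival Count by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
# Reuse the handles seaborn registered for each hue level so that
# every label stays attached to the matching bar color
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=["Not Survived", "Survived"])
plt.show()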

This visualization reveals gender-based survival differences.
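Counts can be hard to compare when the groups differ in size, so it often helps to look at survival rates as well. A minimal sketch, using made-up rows in place of the real data (with the Titanic frame, the same groupby would be applied to df):

```python
import pandas as pd

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)
```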

b. Fare Distribution by Class

sns.boxplot(data=df, x='pclass', y='fare')
plt.title("Fare Distribution by Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

This boxplot helps detect variability and outliers in fare across classes.
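The quartiles that the boxplot draws can also be tabulated per class. The following is a sketch on made-up fares (an assumption for illustration); on the real data the same call would be df.groupby('pclass')['fare'].describe().

```python
import pandas as pd

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [50.0, 80.0, 200.0, 7.0, 8.0, 9.0],
})

# describe() returns count, mean, std, min, quartiles, and max for each group
fare_stats = toy.groupby("pclass")["fare"].describe()
print(fare_stats)
```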

4. Multivariate Analysis

This step involves analyzing interactions among more than two variables.

a. Pairwise Relationships

sns.pairplot(df[['age', 'fare', 'pclass', 'survived']].dropna(), hue='survived')
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

This plot shows relationships between numerical variables and how they differ by survival status.
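A pivot table is a compact numeric companion to the pair plot when two of the variables are categorical. The sketch below, on made-up rows (an assumption for illustration), computes the survival rate for each combination of sex and passenger class.

```python
import pandas as pd

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 1, 1],
})

# Rows: sex, columns: class, cells: mean of the 0/1 survived flag
rates = toy.pivot_table(values="survived", index="sex", columns="pclass", aggfunc="mean")
print(rates)
```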

b. Correlation Heatmap

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

The heatmap quantifies the linear correlation between numerical variables.
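When one column is the target, it can be easier to read a single ranked column of the correlation matrix than the full heatmap. This is a sketch on made-up rows (an assumption for illustration); with the Titanic frame the same expression would start from df.corr(numeric_only=True).

```python
import pandas as pd

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 0, 1],
    "pclass":   [3, 3, 1, 1, 2, 2],
    "fare":     [7.0, 8.0, 80.0, 60.0, 13.0, 26.0],
    "sex":      ["male", "male", "female", "female", "male", "female"],
})

corr_with_target = (
    toy.corr(numeric_only=True)["survived"]   # text columns such as sex are skipped
    .drop("survived")                         # drop the trivial self-correlation
    .sort_values(key=abs, ascending=False)    # strongest relationships first
)
print(corr_with_target)
```

Keep in mind that this is Pearson correlation, the default for df.corr, so it only captures linear relationships.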

5. Missing Values Visualization

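Visualizing missing data helps identify patterns and plan data cleaning. The examples below use the third-party missingno library, which must be installed separately (for example, with pip install missingno).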

import missingno as msno

msno.matrix(df)
plt.show()

msno.heatmap(df)
plt.show()

The matrix shows where values are missing in each column, row by row. The heatmap shows how strongly the missingness of one column is correlated with the missingness of another.
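One simple way to check whether missingness follows a pattern is to compare the missing rate across groups. The sketch below does this for age by passenger class on made-up rows (an assumption for illustration, not a finding about the real dataset).

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3, 3],
    "age": [38.0, 54.0, np.nan, 22.0, np.nan, np.nan],
})

# isna() gives True/False per row; the group mean is the share of missing ages
age_missing_rate = toy["age"].isna().groupby(toy["pclass"]).mean()
print(age_missing_rate)
```

If the rates differ sharply between groups, dropping rows with missing values may bias the sample toward some groups.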

6. Outlier Detection

Outliers can distort statistical analyses and modeling results.

Box Plot for Fare

sns.boxplot(x=df['fare'])
plt.title("Boxplot for Fare")
plt.show()

This boxplot reveals that a few passengers paid significantly higher fares than the rest. These points fall beyond the whiskers and are treated as outliers.
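The whiskers in a default boxplot extend 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied directly to flag rows. A minimal sketch: iqr_outliers is a hypothetical helper name and the fares are made up for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy fares for illustration only (not real Titanic records)
fares = pd.Series([7.25, 8.05, 7.90, 13.00, 26.55, 10.50, 512.33])

mask = iqr_outliers(fares)
print(fares[mask])
```

On the real data, iqr_outliers(df['fare'].dropna()) would return a mask that can be used to inspect the flagged passengers before deciding whether to keep, cap, or transform those values.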

7. Distribution Check and Normality Test

Checking if variables like age follow a normal distribution can be useful for some machine learning algorithms.

sns.histplot(df['age'].dropna(), kde=True)
plt.title("Age Distribution")
plt.show()

from scipy.stats import skew, kurtosis

print("Skewness:", skew(df['age'].dropna()))
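print("Kurtosis:", kurtosis(df['age'].dropna()))

  • Skewness measures asymmetry. A value greater than 0 indicates right skew; a value less than 0 indicates left skew.
  • Kurtosis measures how heavy the tails of the distribution are. By default, scipy.stats.kurtosis returns excess kurtosis, so a normal distribution scores about 0.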
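Skewness and kurtosis describe the shape, but a formal normality test gives a p-value. The sketch below uses the Shapiro-Wilk test from scipy.stats, which is one common choice and not the only one. The two samples are synthetic and generated only for illustration; with the real data you would pass df['age'].dropna().

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic samples for illustration only (not Titanic data)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=30, scale=10, size=200)
skewed_sample = rng.exponential(scale=30, size=200)

# Null hypothesis: the sample comes from a normal distribution.
# A small p-value (for example below 0.05) is evidence against normality.
_, normal_p = shapiro(normal_sample)
_, skewed_p = shapiro(skewed_sample)

print("p-value, normal sample:", normal_p)
print("p-value, skewed sample:", skewed_p)
```

With large samples, even small departures from normality can produce a low p-value, so it is worth reading the result alongside the histogram rather than on its own.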

Summary of EDA Tasks and Purposes

EDA Task | Purpose
Univariate Analysis | Understand distribution of individual variables
Bivariate Analysis | Identify relationships between two variables
Multivariate Analysis | Analyze interactions among multiple features
Outlier Detection | Detect and handle anomalies that can skew analysis or models
Missing Value Analysis | Visualize and strategize imputation or removal
Distribution Check | Validate assumptions of modeling techniques such as linear regression

Conclusion

EDA serves as the foundation for any data-driven project. It offers an opportunity to develop a deep understanding of the dataset, identify potential issues, and generate hypotheses for further analysis or modeling. Without thorough EDA, the risk of misinterpreting data or building inaccurate models significantly increases.

By combining visualizations with statistical analysis, you can uncover insights that lead to better decision-making and model performance.

If needed, the next logical step would be to preprocess this dataset, handle missing values, engineer new features, and prepare the data for machine learning.

Next Blog: Feature Engineering and Feature Scaling

Purnima