Artificial intelligence April 09, 2025

Feature Engineering and Feature Scaling

Before feeding data into a machine learning or AI model, it's crucial to shape that data into a form the model can learn from effectively. This is where Feature Engineering and Feature Scaling come in — two of the most powerful preprocessing tools in the data science toolbox.

Let’s explore these concepts in detail.

What is Feature Engineering?

Feature Engineering is the process of selecting, transforming, and creating variables (features) that help machine learning models understand the data better. It is more of an art than a science — rooted in domain knowledge, creativity, and logical thinking.

"A model is only as good as the data you feed it."

Poorly designed features will lead to poor performance, no matter how advanced your model is.

Why is Feature Engineering Important?

  • Makes patterns in the data easier for the model to detect.
  • Reduces noise and redundancy.
  • Improves accuracy, recall, precision, and other performance metrics.
  • Helps models converge faster during training.

Common Feature Engineering Techniques

Let’s work through the Titanic dataset to apply the most common techniques.

import seaborn as sns
import pandas as pd

# Load Titanic dataset
df = sns.load_dataset('titanic')
df.head()

1. Handling Missing Values

Real-world data is often incomplete. You can remove the affected rows, fill the gaps with a summary statistic (such as the mean or median), or use more complex imputation techniques.

# Fill missing 'age' with the median (assign back; in-place fillna on a selected column is unreliable in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())

# Drop rows where 'embarked' is missing
df = df.dropna(subset=['embarked'])
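As a small self-contained sketch of median imputation, the toy values below stand in for the 'age' column (they are illustrative, not taken from the dataset). The extra age_missing flag is an optional assumption on our part: it records which rows were filled so a model can still see that the value was originally absent.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic 'age' column (illustrative values)
toy = pd.DataFrame({'age': [22.0, np.nan, 35.0, 4.0, np.nan]})

# Record which rows were missing before filling them (optional extra feature)
toy['age_missing'] = toy['age'].isna().astype(int)

# Median of the observed values 4, 22, 35 is 22
age_median = toy['age'].median()
toy['age'] = toy['age'].fillna(age_median)

print(toy)
```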

2. Encoding Categorical Variables

ML models work with numbers, not text. So we must convert categories into numerical format.

a. Label Encoding

# Convert 'sex' column into 0 and 1
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

b. One-Hot Encoding

# Convert 'embarked' into multiple binary columns
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)
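To see what drop_first=True does, here is a minimal sketch on a toy 'embarked' column (the four rows are made up for illustration). With three categories, only two indicator columns are kept; the dropped category (the first in sorted order, 'C') is represented by both indicators being 0.

```python
import pandas as pd

# Toy column with the three embarkation codes
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

# dtype=int gives 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=['embarked'], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['embarked_Q', 'embarked_S']
print(encoded)
```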

3. Creating New Features

We can extract new, meaningful information from existing data.

a. Family Size

Combining siblings/spouses (sibsp) and parents/children (parch), plus one for the passenger themselves:

df['family_size'] = df['sibsp'] + df['parch'] + 1

b. Is Child?

Classifying passengers as children based on age:

df['is_child'] = (df['age'] < 18).astype(int)
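A quick sketch of both derived features on three made-up passengers, so the arithmetic is easy to check by hand:

```python
import pandas as pd

# Three illustrative passengers (values are made up)
toy = pd.DataFrame({
    'sibsp': [1, 0, 3],
    'parch': [0, 0, 2],
    'age': [22.0, 35.0, 4.0],
})

# Relatives aboard plus the passenger themselves
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1

# Vectorized comparison: True/False converted to 1/0
toy['is_child'] = (toy['age'] < 18).astype(int)

print(toy)
```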

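c. Title Extraction (from name)

Note: seaborn's copy of the Titanic data has no name, ticket, or cabin columns. The examples that use them assume a fuller version of the dataset (such as Kaggle's) loaded into df with lowercase column names.

# Raw string so the regex escape \. reaches the regex engine unchanged
df['title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)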

Such features often reflect social status and can significantly affect model accuracy.
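The sketch below runs the same pattern on a few strings written in the "Surname, Title. Given names" layout used by fuller versions of the dataset. The strings are illustrative inputs, and the layout itself is an assumption about your copy of the data.

```python
import pandas as pd

# Illustrative names in "Surname, Title. Given names" form
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley',
])

# A space, then letters, then a literal dot: captures the title
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```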

4. Binning (Discretization)

Binning transforms continuous numerical variables into categorical bins. This can help reduce the impact of outliers and reveal non-linear relationships.

a. Age Binning

df['age_bin'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100], 
                       labels=['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior'])

This converts age into distinct life-stage categories.
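One detail worth checking is where the bin edges fall. In this sketch (toy ages, same bins and labels as above), pd.cut uses right-inclusive intervals by default, so 12 lands in 'Child' and 18 in 'Teen'. A value of exactly 0 would fall outside the first bin unless include_lowest=True is passed.

```python
import pandas as pd

ages = pd.Series([5, 12, 13, 18, 30, 70])

# Intervals are (0, 12], (12, 18], (18, 35], (35, 60], (60, 100]
age_bin = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior'],
)

print(age_bin.astype(str).tolist())
```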

5. Interaction Features

Sometimes the combination of two or more features captures information that individual features miss.

a. Age * Pclass

df['age_class'] = df['age'] * df['pclass']

Since first class is coded as pclass = 1, low values of this feature pick out younger passengers in higher classes, a group that could have had better survival chances.

6. Frequency Encoding

Instead of one-hot encoding, you can encode categories based on their frequency in the dataset.

a. Encoding Ticket Frequencies

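# Count how often each ticket appears (needs a 'ticket' column, which seaborn's copy of the data lacks)
ticket_freq = df['ticket'].value_counts()
df['ticket_freq'] = df['ticket'].map(ticket_freq)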

Passengers with the same ticket number might have been traveling together. This feature may capture group survival patterns.
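Here is the same idea on a toy ticket column (the ticket strings are invented for illustration), which makes the mapping from category to count easy to follow:

```python
import pandas as pd

# Invented ticket identifiers; 'A1' is shared by three passengers
toy = pd.DataFrame({'ticket': ['A1', 'A1', 'B2', 'C3', 'A1']})

# value_counts() returns a Series indexed by ticket, so map() can look counts up
ticket_freq = toy['ticket'].value_counts()
toy['ticket_freq'] = toy['ticket'].map(ticket_freq)

print(toy)
```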

7. Mean/Target Encoding

Assign a value to a category based on the mean of the target variable (like survival rate) within that category.

Use with caution: this technique can cause data leakage if not done properly (should be applied within cross-validation folds).

Example (not applied directly):

# Naive version: each row's own target leaks into its encoding (also needs a 'cabin' column)
df['cabin_survival_mean'] = df.groupby('cabin')['survived'].transform('mean')
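One common way to limit that leakage is out-of-fold encoding: each row is encoded with category means computed only from the other folds. The following is a minimal sketch, not a definitive implementation. The helper name out_of_fold_target_encode, the toy 'deck' column, and the fallback to the fold's overall mean for unseen categories are all our own assumptions.

```python
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(frame, cat_col, target_col, n_splits=5, seed=0):
    """Encode cat_col with target means learned only from the other folds."""
    encoded = pd.Series(float('nan'), index=frame.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(frame):
        fit_part = frame.iloc[fit_idx]
        category_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to the folds' overall mean
        fallback = fit_part[target_col].mean()
        fold_values = frame.iloc[enc_idx][cat_col].map(category_means).fillna(fallback)
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Toy data: with n_splits equal to the row count, this is leave-one-out encoding
toy = pd.DataFrame({
    'deck': ['A', 'A', 'A', 'B', 'B', 'B'],
    'survived': [1, 1, 0, 0, 0, 1],
})
toy['deck_te'] = out_of_fold_target_encode(toy, 'deck', 'survived', n_splits=6)
print(toy)
```

In the toy output, the first row (deck 'A', survived 1) is encoded as 0.5, the mean of the other two 'A' rows, rather than the leaky in-sample mean of about 0.67.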

8. Text Features (from Names or Tickets)

You’ve extracted titles from names. Let’s go further.

a. Name Length

# .str.len() is vectorized and leaves missing names as NaN
df['name_length'] = df['name'].str.len()

Longer or more formal names could indicate social class or family status.

b. Ticket Prefix

# Strip digits, '.', and '/' from the ticket, then trim whitespace (needs a 'ticket' column)
df['ticket_prefix'] = df['ticket'].str.replace(r'[\d./]', '', regex=True).str.strip()

Ticket prefixes can reflect the company or group, offering potential insight into class and survival.
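A short sketch of prefix extraction on sample ticket strings (written in the style of the fuller dataset, used here only as illustrative inputs). Purely numeric tickets end up with an empty prefix; replacing that with the placeholder 'NONE' is our own assumption, not something the dataset prescribes.

```python
import pandas as pd

tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', 'STON/O2. 3101282'])

# Remove digits, dots, and slashes, then trim leftover whitespace
prefix = tickets.str.replace(r'[\d./]', '', regex=True).str.strip()

# Purely numeric tickets have no prefix; label them explicitly (placeholder is an assumption)
prefix = prefix.replace('', 'NONE')

print(prefix.tolist())  # ['A', 'PC', 'NONE', 'STONO']
```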

9. Date and Time Features (for time-based datasets)

When working with datetime fields, extract:

  • Day of week
  • Month
  • Hour
  • Is weekend?
  • Time difference between events (e.g., order → delivery)

Example:

# Assumes a DataFrame with an 'order_date' column (not the Titanic data)
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_dayofweek'] = df['order_date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['order_is_weekend'] = (df['order_dayofweek'] >= 5).astype(int)

Note: The Titanic dataset doesn’t have datetime fields, but this is common in transactional or time-series data.
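To make the list above concrete, here is a self-contained sketch on a made-up orders table. The column names order_date and delivery_date and the dates themselves are hypothetical.

```python
import pandas as pd

# Hypothetical orders: a Friday, a Saturday, and a Sunday in January 2024
orders = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-01-06', '2024-01-07'],
    'delivery_date': ['2024-01-08', '2024-01-07', '2024-01-12'],
})
orders['order_date'] = pd.to_datetime(orders['order_date'])
orders['delivery_date'] = pd.to_datetime(orders['delivery_date'])

orders['order_dayofweek'] = orders['order_date'].dt.dayofweek  # Monday=0
orders['order_month'] = orders['order_date'].dt.month
orders['order_is_weekend'] = (orders['order_dayofweek'] >= 5).astype(int)

# Time difference between events, in whole days
orders['delivery_days'] = (orders['delivery_date'] - orders['order_date']).dt.days

print(orders)
```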

10. Log Transformation

To handle skewed features (like fare), apply log transformation:

import numpy as np
df['log_fare'] = np.log1p(df['fare'])  # log1p avoids log(0)

This compresses the long right tail, which reduces the impact of extreme values and makes the distribution closer to symmetric.
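A small sketch with illustrative fare values shows the compression and confirms that the transform is reversible with np.expm1, which is handy if you ever log-transform a target and need predictions back on the original scale:

```python
import numpy as np

# Illustrative fares, including a zero and one very large value
fares = np.array([0.0, 7.25, 26.0, 512.33])

log_fares = np.log1p(fares)      # log(1 + x), defined at x = 0
recovered = np.expm1(log_fares)  # inverse transform: exp(y) - 1

print(log_fares.round(3))
print(recovered.round(2))
```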

11. Polynomial Features

Used to capture non-linear relationships between features.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'fare']])

12. Scaling and Normalization (Covered Next)

Scaling numerical features ensures that models aren’t biased toward variables with larger magnitude.

Feature Engineering Summary Table

| Technique | Description | Example |
|---|---|---|
| Binning | Convert continuous to categorical | Age → Age Group |
| Interaction Features | Combine features to reveal hidden patterns | Age × Pclass |
| Frequency Encoding | Encode categories based on frequency | Ticket frequency |
| Mean/Target Encoding | Encode category by target mean | Cabin → Mean survival rate |
| Text Features | Use string metrics as features | Name length, ticket prefix |
| Log Transformation | Compress skewed distributions | Fare → log(Fare + 1) |
| Polynomial Features | Create higher-order interactions | Age², Age × Fare |

 

What is Feature Scaling?

Feature scaling brings all numeric variables into a comparable range so that no single feature dominates the model due to its scale.

Why Scale Features?

Models that rely on distances, gradient-based optimization, or regularization are sensitive to the magnitude of features, for example:

  • Linear Regression
  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Support Vector Machines (SVM)
  • Neural Networks

Tree-based models such as decision trees and random forests split on one feature at a time, so they do not require scaling.
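The effect on a distance-based model such as KNN can be sketched with three made-up passengers described by (sibsp, fare). The ranges used for rescaling (0 to 8 for sibsp, 0 to 512 for fare) are assumed for illustration. Before scaling, fare dominates the Euclidean distance; after scaling, the nearest neighbour of a changes.

```python
import numpy as np

# Columns: sibsp, fare (made-up passengers)
a = np.array([0.0, 10.0])
b = np.array([5.0, 12.0])   # very different family situation, similar fare
c = np.array([0.0, 30.0])   # same family situation, somewhat higher fare

raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Divide each feature by its assumed range so both lie roughly in [0, 1]
ranges = np.array([8.0, 512.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```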

Popular Feature Scaling Techniques

1. Standardization (Z-score Normalization)

Transforms features to have zero mean and unit variance.

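Formula: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the feature's mean and $\sigma$ its standard deviation.

from sklearn.preprocessing import StandardScaler

# numeric_features: the numeric columns to scale, e.g. df[['age', 'fare']]
scaler = StandardScaler()
standardized = scaler.fit_transform(numeric_features)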

Use When:

  • Data is roughly normally distributed
  • Algorithms are sensitive to feature scale or train with gradient-based optimization (Logistic Regression, SVM, Neural Networks)
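In practice the scaler's statistics should come from the training data only and then be reused on new data. The sketch below, with made-up (age, fare) rows, shows that pattern and checks the result against the z-score computed by hand. Note that StandardScaler uses the population standard deviation, which matches NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, fare) rows
train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
test = np.array([[30.0, 8.05]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from train
test_scaled = scaler.transform(test)        # reuse them; do not refit on test

# Same result by hand: z = (x - mean) / std
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.round(3))
print(test_scaled.round(3))
```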

2. Min-Max Scaling

Scales features to a fixed range [0, 1].

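Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

from sklearn.preprocessing import MinMaxScaler

# numeric_features: the numeric columns to scale, e.g. df[['age', 'fare']]
scaler = MinMaxScaler()
normalized = scaler.fit_transform(numeric_features)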

Use When:

  • Data does not follow a Gaussian distribution
  • You need values in a bounded range (for example, inputs to a neural network)
  • You are working with image data (pixel values)
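A sketch with made-up (age, fare) rows that checks the scaler against the formula by hand. It also shows a caveat: the [0, 1] guarantee only holds for the data the scaler was fitted on, so a new value above the training maximum maps to something greater than 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up (age, fare) rows
train = np.array([[22.0, 7.25], [38.0, 71.25], [26.0, 7.90]])
test = np.array([[40.0, 100.0]])  # both values exceed the training maxima

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Same result by hand: (x - min) / (max - min), per column
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)

print(train_scaled.round(3))
print(test_scaled.round(3))
```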

3. Robust Scaling

Scales using median and IQR (interquartile range).
Useful when data contains outliers.

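Formula: $x' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}$, where $\mathrm{IQR}(x) = Q_3 - Q_1$.

from sklearn.preprocessing import RobustScaler

# numeric_features: the numeric columns to scale, e.g. df[['age', 'fare']]
scaler = RobustScaler()
robust_scaled = scaler.fit_transform(numeric_features)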

Use When:

  • Data has outliers (Titanic fare is a great example)
  • You want scaling that is not sensitive to outliers
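To illustrate the difference, this sketch scales five made-up fares containing one extreme value. Min-max scaling squeezes the four ordinary fares into a sliver near 0, while robust scaling keeps them spread out around 0 and leaves the outlier as a large value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up fares: four ordinary values and one extreme one
fares = np.array([7.0, 8.0, 10.0, 13.0, 500.0]).reshape(-1, 1)

robust = RobustScaler().fit_transform(fares).ravel()
minmax = MinMaxScaler().fit_transform(fares).ravel()

# By hand: median = 10, Q1 = 8, Q3 = 13, so IQR = 5
manual = (fares.ravel() - 10.0) / 5.0

print("robust :", robust.round(3))
print("min-max:", minmax.round(3))
```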


 

Summary of Key Learnings

| Concept | Goal |
|---|---|
| Feature Engineering | Improve model’s ability to learn patterns from the data |
| Missing Value Treatment | Handle incomplete information |
| Encoding | Convert categorical variables into numeric formats |
| Feature Creation | Extract meaningful insights and representations |
| Feature Scaling | Normalize numerical data to a common scale |



 

Purnima