# feature-engineering-handbook
**Repository Path**: bocinfor/feature-engineering-handbook
## Basic Information
- **Project Name**: feature-engineering-handbook
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2024-01-16
- **Last Updated**: 2025-08-25
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Feature-Engineering-Handbook
============
Welcome! This repo provides an interactive, complete, and practical feature engineering tutorial in Jupyter Notebooks. It contains three parts: [Data Preprocessing](1.%20Data%20Preprocessing.ipynb), [Feature Selection](2.%20Feature%20Selection.ipynb) and [Dimension Reduction](3.%20Dimension%20Reduction.ipynb). Each part is demonstrated in its own notebook. Since some feature selection algorithms, such as Simulated Annealing and the Genetic Algorithm, lack complete implementations in Python, we also provide corresponding Python scripts ([Simulated Annealing](SA.py), [Genetic Algorithm](GA.py)) and cover them in the tutorial for your reference.
Brief Introduction
------------
- [Notebook One](1.%20Data%20Preprocessing.ipynb) covers data preprocessing on static continuous features based on [scikit-learn](https://scikit-learn.org/stable/), on static categorical features based on [Category Encoders](https://contrib.scikit-learn.org/categorical-encoding/), and on time series features based on [Featuretools](https://www.featuretools.com/).
- [Notebook Two](2.%20Feature%20Selection.ipynb) covers feature selection, including univariate filter methods based on [scikit-learn](https://scikit-learn.org/stable/), multivariate filter methods based on [scikit-feature](http://featureselection.asu.edu/), deterministic wrapper methods based on [scikit-learn](https://scikit-learn.org/stable/), randomized wrapper methods based on our own implementations in Python scripts, and embedded methods based on [scikit-learn](https://scikit-learn.org/stable/).
- [Notebook Three](3.%20Dimension%20Reduction.ipynb) covers supervised and unsupervised dimension reduction based on [scikit-learn](https://scikit-learn.org/stable/).
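As a quick taste of the preprocessing notebook, the sketch below applies three of the techniques listed there (standard scaling, binning, and one-hot encoding) with scikit-learn. The toy data and column layout are illustrative assumptions, not taken from the notebooks:

```python
# Minimal preprocessing sketch with scikit-learn.
# The toy arrays below are illustrative, not the tutorial's datasets.
import numpy as np
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, OneHotEncoder

X_num = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0],
                  [4.0, 500.0]])

# Standard scaling (1.1.2.1): zero mean, unit variance per column
scaled = StandardScaler().fit_transform(X_num)

# Binning (1.1.1.2): discretize each column into 2 ordinal bins
binned = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(X_num)

# One-hot encoding (1.2.2) for a categorical column
X_cat = np.array([["red"], ["green"], ["red"]])
onehot = OneHotEncoder().fit_transform(X_cat).toarray()
```

The notebooks go well beyond this, e.g. power transforms, multivariate imputation, and the target-based encoders from Category Encoders.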
Table of Contents
------------
- 1 Data Preprocessing
- 1.1 Static Continuous Variables
- 1.1.1 Discretization
- 1.1.1.1 Binarization
- 1.1.1.2 Binning
- 1.1.2 Scaling
    - 1.1.2.1 Standard Scaling (Z-score standardization)
- 1.1.2.2 MinMaxScaler (Scale to range)
    - 1.1.2.3 RobustScaler (Outlier-robust scaling)
- 1.1.2.4 Power Transform (Non-linear transformation)
- 1.1.3 Normalization
- 1.1.4 Imputation of missing values
- 1.1.4.1 Univariate feature imputation
- 1.1.4.2 Multivariate feature imputation
- 1.1.4.3 Marking imputed values
- 1.1.5 Feature Transformation
- 1.1.5.1 Polynomial Transformation
- 1.1.5.2 Custom Transformation
- 1.2 Static Categorical Variables
- 1.2.1 Ordinal Encoding
- 1.2.2 One-hot Encoding
- 1.2.3 Hashing Encoding
- 1.2.4 Helmert Coding
- 1.2.5 Sum (Deviation) Coding
- 1.2.6 Target Encoding
- 1.2.7 M-estimate Encoding
- 1.2.8 James-Stein Encoder
- 1.2.9 Weight of Evidence Encoder
- 1.2.10 Leave One Out Encoder
- 1.2.11 Catboost Encoder
- 1.3 Time Series Variables
- 1.3.1 Time Series Categorical Features
- 1.3.2 Time Series Continuous Features
- 1.3.3 Implementation
- 1.3.3.1 Create EntitySet
- 1.3.3.2 Set up cut-time
- 1.3.3.3 Auto Feature Engineering
- 2 Feature Selection
- 2.1 Filter Methods
- 2.1.1 Univariate Filter Methods
- 2.1.1.1 Variance Threshold
- 2.1.1.2 Pearson Correlation (regression problem)
- 2.1.1.3 Distance Correlation (regression problem)
- 2.1.1.4 F-Score (regression problem)
- 2.1.1.5 Mutual Information (regression problem)
- 2.1.1.6 Chi-squared Statistics (classification problem)
- 2.1.1.7 F-Score (classification problem)
- 2.1.1.8 Mutual Information (classification problem)
- 2.1.2 Multivariate Filter Methods
- 2.1.2.1 Max-Relevance Min-Redundancy (mRMR)
- 2.1.2.2 Correlation-based Feature Selection (CFS)
- 2.1.2.3 Fast Correlation-based Filter (FCBF)
- 2.1.2.4 ReliefF
- 2.1.2.5 Spectral Feature Selection (SPEC)
- 2.2 Wrapper Methods
- 2.2.1 Deterministic Algorithms
    - 2.2.1.1 Recursive Feature Elimination (RFE)
- 2.2.2 Randomized Algorithms
- 2.2.2.1 Simulated Annealing (SA)
- 2.2.2.2 Genetic Algorithm (GA)
- 2.3 Embedded Methods
  - 2.3.1 Regularization Based Methods
- 2.3.1.1 Lasso Regression (Linear Regression with L1 Norm)
- 2.3.1.2 Logistic Regression (with L1 Norm)
    - 2.3.1.3 LinearSVR / LinearSVC
- 2.3.2 Tree Based Methods
- 3 Dimension Reduction
- 3.1 Unsupervised Methods
- 3.1.1 PCA (Principal Components Analysis)
- 3.2 Supervised Methods
- 3.2.1 LDA (Linear Discriminant Analysis)
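To illustrate how the selection and reduction sections fit together, here is a minimal scikit-learn sketch of a univariate filter (2.1.1), PCA (3.1.1), and LDA (3.2.1). The dataset (iris) and the choice of `k` and `n_components` are assumptions for the example only:

```python
# Minimal feature selection + dimension reduction sketch with scikit-learn.
# Dataset (iris) and hyperparameters are illustrative, not from the notebooks.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Filter (2.1.1.1 + 2.1.1.7): drop near-constant features,
# then keep the top-2 features by ANOVA F-score
X_f = VarianceThreshold(threshold=0.1).fit_transform(X)
X_k = SelectKBest(f_classif, k=2).fit_transform(X_f, y)

# Unsupervised reduction (3.1.1): PCA projects onto the
# directions of maximal variance, ignoring the labels
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised reduction (3.2.1): LDA uses the labels to
# find directions that maximize class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```

The key contrast the notebooks draw: filters score features independently of any model, while PCA and LDA construct new features, with LDA additionally exploiting label information.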
Reference
------------
References have been included in each Jupyter Notebook.
Author
------------
[**@Yingxiang Chen**](https://github.com/YC-Coder-Chen)
[**@Zihan Yang**](https://github.com/echoyang48)
Contact
------------
**If there are any mistakes, please feel free to reach out and correct us!**
Yingxiang Chen E-mail: chenyingxiang3526@gmail.com
Zihan Yang E-mail: echoyang48@gmail.com