3. Data Pre-processing Tasks Using Python with Data Reduction Techniques
The volume of data in datasets is growing rapidly. This can cause problems, because the important features of a dataset may be buried under useless data. This is why data pre-processing is crucial for any dataset. In this section, we discuss different data reduction methods that remove unnecessary data from our dataset and make it more efficient for our model to work with.
The scikit-learn documentation lists several feature selection methods. Here, we apply different feature selection methods to the same dataset to compare their performance.
The dataset used for data reduction is the ‘Iris’ dataset, available in the sklearn.datasets module.
The data has four features. To test the effectiveness of different feature selection methods, we add some noise features to the dataset.
The dataset now has 8 features: 4 are important and the other 4 are noise.
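The setup described above can be sketched as follows. This is a minimal sketch, not necessarily the exact code used for this post: the noise distribution (uniform values between 0 and 0.1), the seed, and the variable names `noise` and `X_noisy` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris data: 150 samples, 4 features
iris = load_iris()
X, y = iris.data, iris.target

# Assumption: the noise features are small uniform random values.
# A fixed seed keeps the example reproducible.
rng = np.random.RandomState(42)
noise = rng.uniform(0, 0.1, size=(X.shape[0], 4))

# Stack the 4 noise columns to the right of the 4 original columns
X_noisy = np.hstack((X, noise))

print("Original shape:", X.shape)
print("Shape with noise features:", X_noisy.shape)
```

Columns 0 to 3 of `X_noisy` are the original features and columns 4 to 7 are noise, which makes it easy to check later which columns a selection method keeps.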
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.
For a lot of machine learning applications, it helps to be able to visualize your data. Visualizing 2- or 3-dimensional data is not that challenging. However, even the Iris dataset used in this part of the tutorial is 4-dimensional. You can use PCA to reduce that 4-dimensional data to 2 or 3 dimensions so that you can plot it and hopefully understand the data better. We are going to apply PCA to the original 4-feature data.
PCA Projection to 2D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data which is 4 dimensional into 2 dimensions. The new components are just the two main dimensions of variation.
We then concatenate the DataFrames along axis=1. resultant_Df is the final DataFrame before plotting the data.
Now, let's visualize the DataFrame:
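A minimal sketch of the projection and concatenation steps might look like this. Standardizing the features with `StandardScaler` before PCA is an assumption here (it is common practice, but the post does not state it), and the names `principal_df`, `target_df`, and the column labels are illustrative; only `resultant_Df` comes from the text.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Assumption: scale each feature to zero mean and unit variance first,
# so that no single feature dominates the principal components
X_scaled = StandardScaler().fit_transform(iris.data)

# Project the 4-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

principal_df = pd.DataFrame(
    components,
    columns=["principal component 1", "principal component 2"],
)
target_df = pd.DataFrame({"target": iris.target_names[iris.target]})

# Concatenate side by side (axis=1) to get the final DataFrame
resultant_Df = pd.concat([principal_df, target_df], axis=1)

print(resultant_Df.head())
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute shows how much of the total variance each component retains, which is a quick way to judge how much information the projection loses.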
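One way to draw the 2D projection is a scatter plot with one color per species. This sketch assumes matplotlib is installed and recomputes the projection so that it runs on its own; the scaling step, figure size, and labels are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
components = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(8, 8))

# Draw one scatter series per species so each gets its own color
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(components[mask, 0], components[mask, 1], label=name)

ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("2-component PCA of the Iris data")
ax.legend()
ax.grid(True)
plt.show()
```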
PCA Projection to 3D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is 4-dimensional, into 3 dimensions. The new components are just the three main dimensions of variation.
Now let's visualize the data in a 3D graph:
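The 3D projection follows the same pattern as the 2D one, with `n_components=3`. As before, this is a sketch: the scaling step and the names `principal_df_3d` and `resultant_Df_3d` are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Keep the first 3 principal components instead of 2
pca_3d = PCA(n_components=3)
components_3d = pca_3d.fit_transform(X_scaled)

principal_df_3d = pd.DataFrame(
    components_3d,
    columns=[
        "principal component 1",
        "principal component 2",
        "principal component 3",
    ],
)
target_df = pd.DataFrame({"target": iris.target_names[iris.target]})
resultant_Df_3d = pd.concat([principal_df_3d, target_df], axis=1)

print(resultant_Df_3d.head())
print("Explained variance ratio:", pca_3d.explained_variance_ratio_)
```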
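A 3D scatter plot of the three components could be drawn as below. This sketch assumes matplotlib 3.2 or later (for `projection="3d"` without extra imports) and recomputes the projection so that it runs on its own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
components_3d = PCA(n_components=3).fit_transform(X_scaled)

fig = plt.figure(figsize=(8, 8))
ax3d = fig.add_subplot(projection="3d")

# One scatter series per species
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax3d.scatter(
        components_3d[mask, 0],
        components_3d[mask, 1],
        components_3d[mask, 2],
        label=name,
    )

ax3d.set_xlabel("Principal Component 1")
ax3d.set_ylabel("Principal Component 2")
ax3d.set_zlabel("Principal Component 3")
ax3d.set_title("3-component PCA of the Iris data")
ax3d.legend()
plt.show()
```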
Variance Threshold
Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so our data isn’t affected here.
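A short sketch of this step, using scikit-learn's `VarianceThreshold` with its default threshold of 0. The noisy dataset is rebuilt here under the same assumption as before (uniform noise between 0 and 0.1), so the exact variances of the noise columns are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: 4 uniform noise features appended to the 4 original ones
rng = np.random.RandomState(42)
X_noisy = np.hstack((X, rng.uniform(0, 0.1, size=(X.shape[0], 4))))

# Default threshold=0.0 removes only features that are constant
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X_noisy)

print("Per-feature variances:", selector.variances_.round(4))
print("Shape before:", X_noisy.shape, "after:", X_reduced.shape)
```

Because every column, including the noise, has non-zero variance, all 8 features pass through unchanged. Raising `threshold` would start removing low-variance columns, but variance depends on the scale of each feature, so the value has to be chosen with care.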
Univariate Feature Selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable to see whether there is a statistically significant relationship between them. When we analyze the relationship between one feature and the target variable, we ignore the other features, which is why the approach is called ‘univariate’. Each feature gets its own test score.
Finally, all the test scores are compared, and the features with the top scores are selected.
The first test we use is the F-test, also known as analysis of variance (ANOVA).
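A minimal sketch of selection with the ANOVA F-test, using scikit-learn's `SelectKBest` with `f_classif`. Keeping `k=4` features and the way the noise columns are generated are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: 4 uniform noise features appended to the 4 original ones
rng = np.random.RandomState(42)
X_noisy = np.hstack((X, rng.uniform(0, 0.1, size=(X.shape[0], 4))))

# Score every feature with the ANOVA F-test and keep the 4 best
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X_noisy, y)

print("F-scores:", selector.scores_.round(2))
print("Selected mask:", selector.get_support())
```

`get_support()` returns a boolean mask over the 8 columns; a `True` entry means the column was kept.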
The chi-squared score can be used to select the features with the highest values of the chi-squared test statistic relative to the classes. The data must contain only non-negative features, such as booleans or frequencies (e.g., term counts in document classification).
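The chi-squared test can be plugged into `SelectKBest` in the same way, as sketched below. All Iris measurements are non-negative, and the noise is assumed to be drawn from a uniform distribution between 0 and 0.1, so the non-negativity requirement holds; `k=4` is again an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: non-negative uniform noise, so chi2 can be applied
rng = np.random.RandomState(42)
X_noisy = np.hstack((X, rng.uniform(0, 0.1, size=(X.shape[0], 4))))

# Score every feature with the chi-squared statistic and keep the 4 best
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X_noisy, y)

print("Chi-squared scores:", selector.scores_.round(3))
print("Selected mask:", selector.get_support())
```

If the noise contained negative values (for example, Gaussian noise centered at zero), `chi2` would raise an error, so the data would need to be shifted or rescaled first.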
The next score estimates the mutual information between each feature and a discrete target variable.
Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency.
The function relies on nonparametric methods based on entropy estimation from k-nearest neighbor distances.
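A sketch of the mutual information scores, using scikit-learn's `mutual_info_classif`. The function is called directly here so that `random_state` can be fixed for reproducibility; the noise generation and the choice to keep the top 4 features are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: 4 uniform noise features appended to the 4 original ones
rng = np.random.RandomState(42)
X_noisy = np.hstack((X, rng.uniform(0, 0.1, size=(X.shape[0], 4))))

# Estimate MI between each feature and the class label.
# The estimator is randomized, so fix random_state for repeatable scores.
mi_scores = mutual_info_classif(X_noisy, y, random_state=0)

# Indices of the 4 features with the highest MI, best first
top4 = np.argsort(mi_scores)[::-1][:4]

print("MI scores:", mi_scores.round(3))
print("Top 4 feature indices:", top4)
```

Because the estimate is based on nearest neighbors, the scores of the noise columns are typically close to zero but not always exactly zero.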
Recursive Feature Elimination
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. RFE requires the number of features to keep to be specified; however, it is often not known in advance how many features are valid.
Here, only the original columns are marked True, and all of the added noise columns are marked False.
In this blog, we have seen how to apply different feature selection methods to the same data and compared their performance.
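A minimal sketch of RFE with scikit-learn. The post does not say which model was fitted, so the `LogisticRegression` estimator, `n_features_to_select=4`, and the noise generation are all assumptions; the resulting mask depends on the estimator chosen.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: 4 uniform noise features appended to the 4 original ones
rng = np.random.RandomState(42)
X_noisy = np.hstack((X, rng.uniform(0, 0.1, size=(X.shape[0], 4))))

# Assumption: a logistic regression model supplies the feature weights
estimator = LogisticRegression(max_iter=1000)

# Drop one feature per iteration (step=1) until 4 remain
rfe = RFE(estimator, n_features_to_select=4, step=1)
rfe.fit(X_noisy, y)

print("Selected mask:", rfe.support_)
print("Ranking (1 = selected):", rfe.ranking_)
```

`support_` is a boolean mask over the 8 columns, and `ranking_` gives the order of elimination, with 1 for the features that were kept. When the right number of features is not known in advance, scikit-learn's `RFECV` can choose it by cross-validation.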