Practical:3 Data Preprocessing Using Data Reduction Techniques In Python

Datasets nowadays are very detailed; including more features makes the model more complex, and the model may overfit the data. Some features can be noise and potentially damage the model. By removing those unimportant features, the model may generalize better.


We will apply several feature selection methods to the same dataset and compare their performance. The scikit-learn documentation describes each of these methods in detail.

Dataset Used

The dataset used for carrying out data reduction is the ‘Iris’ dataset, available in the sklearn.datasets module.

Loading Dataset

The Iris dataset is one of the datasets scikit-learn ships with, so it does not require downloading any file from an external website. The code below loads the Iris dataset. The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.
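A minimal sketch of this loading step, assuming scikit-learn and NumPy; the number of noise features (20) and the random seed are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the four original Iris features and the class labels
X, y = load_iris(return_X_y=True)

# Append 20 random Gaussian noise features (count and seed are arbitrary)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])

print(X_noisy.shape)  # (150, 24): 4 real features + 20 noise features
```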



Variance Threshold

Using a variance threshold removes all zero-variance features, i.e., features that have the same value in every sample.
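Continuing from the loading snippet above, a sketch of this step; note that the Gaussian noise features have nonzero variance, so the default threshold of 0.0 leaves them in place:

```python
from sklearn.feature_selection import VarianceThreshold

# threshold=0.0 by default: only features that are constant
# across all samples are dropped
selector = VarianceThreshold()
X_var = selector.fit_transform(X_noisy)

print(X_var.shape)  # unchanged here, since no feature has zero variance
```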


Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not.
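In scikit-learn, the ANOVA F-value between each feature and the class is computed by f_classif, usually combined with SelectKBest. A sketch, continuing from the noisy Iris data above (k=4 is an assumption matching the number of real features):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 4 features with the highest ANOVA F-scores
anova_selector = SelectKBest(score_func=f_classif, k=4)
X_anova = anova_selector.fit_transform(X_noisy, y)

# Ideally this prints indices 0-3, i.e. the original Iris columns
print(anova_selector.get_support(indices=True))
```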


Chi-Squared Test

The chi-squared test computes statistics between each non-negative feature and the class, selecting the features that are most likely to be dependent on the class label.
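Because chi2 requires non-negative inputs and the added Gaussian noise can be negative, this sketch rescales the features to [0, 1] first; the MinMaxScaler step is an assumption, not taken from the original post:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 only accepts non-negative features, so rescale everything to [0, 1]
X_nonneg = MinMaxScaler().fit_transform(X_noisy)

chi2_selector = SelectKBest(score_func=chi2, k=4)
X_chi2 = chi2_selector.fit_transform(X_nonneg, y)

print(chi2_selector.get_support(indices=True))
```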


Recursive Feature Elimination (RFE)

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
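A sketch of RFE on the same noisy data; the choice of logistic regression as the underlying estimator and n_features_to_select=4 are assumptions:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Fit the model repeatedly, dropping the weakest feature each round,
# until only 4 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
X_rfe = rfe.fit_transform(X_noisy, y)

print(rfe.get_support(indices=True))
```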


Principal Component Analysis (PCA)

We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm, but a more common approach is to reduce the dimensionality of the input using Principal Component Analysis (PCA). PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.

For a lot of machine learning applications, it helps to be able to visualize your data. Visualizing two- or three-dimensional data is not that challenging, but the Iris dataset is four-dimensional. We will use PCA to reduce that four-dimensional data to two or three dimensions so that you can plot it and, hopefully, understand the data better.

So, now let’s execute PCA for visualization on the Iris dataset.
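PCA is sensitive to the scale of the features, so a common first step, assumed here rather than stated in the original, is to standardize the four original Iris features before projecting:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize each feature to zero mean and unit variance before PCA
X_std = StandardScaler().fit_transform(X)
```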


PCA Projection to 2D

The original data has four columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is four-dimensional, into two dimensions. The new components are simply the two main dimensions of variation.
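A sketch of the 2D projection and scatter plot, continuing from the standardized data above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Project the standardized 4D data onto the 2 main directions of variation
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_std)

# Plot the two principal components, one color per Iris class
for label in np.unique(y):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='class')
plt.show()

# Fraction of the total variance captured by each component
print(pca_2d.explained_variance_ratio_)
```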



PCA Projection to 3D

The original data has four columns (sepal length, sepal width, petal length, and petal width). This section projects the original data, which is four-dimensional, into three dimensions. The new components are the three main dimensions of variation.
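Similarly, a sketch of the 3D projection using matplotlib’s 3D axes, again continuing from the standardized data:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Project the standardized 4D data onto 3 components
pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(X_std)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for label in np.unique(y):
    ax.scatter(X_3d[y == label, 0], X_3d[y == label, 1],
               X_3d[y == label, 2], label=label)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.legend(title='class')
plt.show()
```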



Thank you:) 
