Data Preprocessing using scikit-learn | Dataset: Student Performance

August 19, 2021

What is Data preprocessing?

Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. It was traditionally used as a preliminary step for a data mining process. More recently, these techniques have evolved for training machine learning and AI models and for running inferences against them. Also, these techniques can be used in combination with a variety of data sources, including data stored in files or databases, or being emitted by streaming data systems.

Data preprocessing transforms the data into a format that can be processed more easily and effectively for the user's purpose.

What is scikit-learn?

Scikit-learn is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

In this practical, we will take one dataset and perform the following tasks:

Standardization
Normalization
Encoding
Discretization
Imputation of missing values

The dataset is: https://www.kaggle.com/allexanderspb/studentsperformance

Standardization

Data standardization is the process of bringing data into a uniform format that allows analysts and others to research, analyze, and utilize the data. In statistics, standardization refers to putting different variables on the same scale so that scores from different types of variables can be compared. For example, say you need to compare the performance of two students, one who received 75 out of 100 and the other who received 42 out of 50. Once both scores are put on the same scale, for example in Microsoft Excel, the 42 turns out to be the better result (42 out of 50 is 84%, against 75%), even though it is the lower number.
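Here is a minimal sketch of standardization in scikit-learn, assuming the usual z-score approach with StandardScaler (subtract the column mean, divide by the column standard deviation). The scores below are made-up values for illustration, not rows from the real dataset, and the column meaning (math score, reading score) is my assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up scores for illustration; columns are assumed to be
# "math score" and "reading score".
scores = np.array([
    [72.0, 72.0],
    [69.0, 90.0],
    [90.0, 95.0],
    [47.0, 57.0],
    [76.0, 78.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(scores)

# After scaling, every column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Because each column ends up with mean 0 and standard deviation 1, a value of +1.0 means "one standard deviation above the average" in either column, which is what makes the columns comparable.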

Normalization

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed only when features have different ranges.

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
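A small sketch of scaling to the 0.0 to 1.0 range, assuming min-max scaling with scikit-learn's MinMaxScaler. The two columns are invented values with different ranges, chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values: the first column ranges over 40..100,
# the second over 50..100.
scores = np.array([
    [40.0, 50.0],
    [70.0, 75.0],
    [100.0, 100.0],
])

scaler = MinMaxScaler(feature_range=(0, 1))
normalized = scaler.fit_transform(scores)

# Each column is mapped as (x - min) / (max - min).
print(normalized.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

To get the -1.0 to 1.0 range instead, pass feature_range=(-1, 1).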

Encoding

Label Encoding

Label Encoding refers to converting labels into numeric form so that they become machine-readable. Machine learning algorithms can then decide in a better way how those labels should be handled. It is an important preprocessing step for structured datasets in supervised learning.
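As a quick sketch, label encoding with scikit-learn's LabelEncoder could look like this. The "gender" values are sample labels I typed in by hand for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-typed sample labels, for illustration only.
gender = ["female", "male", "female", "male", "female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# Classes are sorted alphabetically and numbered from 0.
print(encoder.classes_.tolist())  # ['female', 'male']
print(codes.tolist())             # [0, 1, 0, 1, 0]

# The mapping can be reversed to recover the original labels.
print(encoder.inverse_transform([0, 1]).tolist())  # ['female', 'male']
```

Note that LabelEncoder is meant for encoding the target column; for input features, scikit-learn also offers OrdinalEncoder, which works on several columns at once.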

One-Hot Encoding

One-Hot Encoding is another popular technique for treating categorical variables. It creates additional features based on the number of unique values in the categorical feature: every unique value becomes a new feature. In this encoding technique, each category is represented as a one-hot vector, that is, a vector with a 1 in the position of that category and 0 everywhere else.
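A minimal sketch with scikit-learn's OneHotEncoder follows. The group names are illustrative values that I assume resemble the dataset's "race/ethnicity" column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative category values; one column, four rows.
groups = np.array([["group A"], ["group B"], ["group C"], ["group A"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a
# dense array so it is easy to read.
one_hot = encoder.fit_transform(groups).toarray()

print(encoder.categories_[0].tolist())  # ['group A', 'group B', 'group C']
print(one_hot.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

Three unique values produce three new columns, and each row has exactly one 1.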

Discretization

Data discretization is the process of converting continuous data into discrete buckets by grouping values together. By doing this, we can limit the number of possible states. Basically, we convert numerical features into categorical columns.

There are three types of discretization available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform.
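The three types can be sketched with scikit-learn's KBinsDiscretizer, where the strategy parameter selects "uniform", "quantile" or "kmeans". The scores are made-up values, and the choice of 3 bins is my assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up math scores, one column.
math_scores = np.array([[0.0], [20.0], [35.0], [50.0], [65.0], [80.0], [100.0]])

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) for each value.
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    bins = discretizer.fit_transform(math_scores)
    results[strategy] = bins.ravel().astype(int).tolist()
    print(strategy, results[strategy])

# "uniform" splits 0..100 into three equal-width bins, so it gives
# [0, 0, 1, 1, 1, 2, 2] for the values above.
```

Uniform uses equal-width bins, quantile tries to put the same number of values in each bin, and kmeans places the bin edges between cluster centers found by 1-D k-means.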

Missing values

Handling missing values is an important task that every data scientist has to do. We can handle missing values in two ways:

Remove the rows that have missing values.
Fill in the missing values using some strategy, for example with an imputer.

We can remove the missing values when the ratio of the number of missing values to the total number of values is low. In that situation, we can drop them using dropna() in pandas. If the ratio is high, we have to impute the values instead.
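Both options can be sketched as follows, assuming pandas for dropping rows and scikit-learn's SimpleImputer with the mean strategy for imputation. The small table is invented for the example, and the column names are my assumption of how the dataset labels its score columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented table with two missing cells.
df = pd.DataFrame({
    "math score": [72.0, np.nan, 90.0, 47.0],
    "reading score": [72.0, 90.0, np.nan, 57.0],
})

# Ratio of missing values to the total number of values: 2 / 8 = 0.25
missing_ratio = df.isna().sum().sum() / df.size
print(missing_ratio)

# Option 1: remove every row that contains a missing value.
dropped = df.dropna()
print(len(dropped))  # 2 rows remain

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

SimpleImputer also accepts strategy="median", "most_frequent" or "constant", so the same code works for other filling strategies.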

You can understand all of these concepts through an example that I have worked out on the Student Performance dataset. You can see it in the following code.

Click Here : https://github.com/Riya2112/DS_Practicals/tree/main/Practical-2
