Practical:6 Data Preprocessing with Orange Tool

A guide to data preprocessing using Orange and how to use Python in Orange.

This blog will help you understand how to perform data preprocessing using Orange, use the Orange library in Python, and integrate Python scripts in Orange.

What is Data Preprocessing?

When we talk about data, we usually think of large datasets with a huge number of rows and columns. While that is a likely scenario, it is not always the case: data can come in many different forms, such as structured tables, images, audio files, and videos.

Machines don’t understand free text, image, or video data as it is; they understand 1s and 0s. So it won’t be enough to put on a slideshow of all our images and expect our machine learning model to get trained just by that!

In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.

Data discretization 

Continuous features in the data can be discretized using a uniform discretization method. Discretization considers only continuous features and replaces them in the new data set with corresponding categorical features.
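To make the idea concrete, here is a minimal plain-Python sketch of equal-width (uniform) discretization. It is not Orange's own implementation; the function name equal_width_bins, the number of bins, and the sample values are all made up for illustration.

```python
def equal_width_bins(values, n_bins=3):
    """Map each number to the index of its equal-width interval (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # A constant feature has a single value, so everything lands in bin 0.
        return [0 for _ in values]
    # min(...) keeps the largest value in the last bin instead of one past the end.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous feature (illustrative values only).
measurements = [1.0, 2.0, 4.0, 7.0, 10.0]
bins = equal_width_bins(measurements, n_bins=3)
print(bins)  # [0, 0, 1, 2, 2]
```

Each bin index plays the role of a category, so the continuous column has become a categorical one with three possible values.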




Continuization

Continuization refers to the transformation of discrete (binary or multinomial) variables into continuous ones. The class described below operates on the entire domain; documentation on Orange.core.transformvalue.rst explains how to treat each variable separately.

    # Python script for continuization
    import Orange

    titanic = Orange.data.Table("titanic")
    continuizer = Orange.preprocess.Continuize()
    titanic1 = continuizer(titanic)

    print("Before Continuization : ", titanic.domain)
    print("After Continuization : ", titanic1.domain)

    # Data of row 15 before and after continuization
    print("15th row data before : ", titanic[15])
    print("15th row data after : ", titanic1[15])
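One common way to turn a multinomial variable into continuous ones is to create a 0/1 indicator column per category. The plain-Python sketch below illustrates that idea only; it is an assumption about the general technique, not a description of exactly what Orange does internally, and the helper one_hot and the sample values are hypothetical.

```python
def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Hypothetical discrete feature (illustrative values only).
status = ["crew", "first", "crew", "third"]
categories, encoded = one_hot(status)

print(categories)  # ['crew', 'first', 'third']
for original, row in zip(status, encoded):
    print(original, row)
```

After this step every column holds numbers, which is why the domain printed after continuization has more, but purely continuous, variables.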




Normalization

In data preprocessing, normalization means rescaling feature values. It should not be confused with database normalization, which is the process of organizing tables and their relationships to eliminate redundancy and inconsistent dependency.

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when we are dealing with attributes on different scales; otherwise, an attribute with a large range of values can dilute the effect of an equally important attribute with a smaller range.
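As a rough illustration of the scaling described above, here is a plain-Python sketch of min-max normalization. It does not use Orange; the function name min_max_scale and the sample values are made up for this example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to the lower bound.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


# Hypothetical attribute (illustrative values only).
ages = [20, 30, 40, 60]
unit_range = min_max_scale(ages)                # scaled to 0.0 .. 1.0
signed_range = min_max_scale(ages, -1.0, 1.0)   # scaled to -1.0 .. 1.0

print(unit_range)    # [0.0, 0.25, 0.5, 1.0]
print(signed_range)  # [-1.0, -0.5, 0.0, 1.0]
```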







Random sampling data 

Random sampling is done by constructing a vector of subset indices (e.g. a table of 0s and 1s), one for each instance, and then passing the vector to the table’s Orange.data.Table.select method.

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. The Randomize preprocessor from the Orange library is used to perform randomization.
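The two ideas above, selecting rows through a 0/1 index vector and shuffling rows, can be sketched in plain Python as follows. This uses only the standard library rather than Orange; the helper sample_indices, the 50% fraction, the seed, and the rows are hypothetical.

```python
import random


def sample_indices(n_rows, fraction, seed=0):
    """Build a 0/1 vector with one entry per row; 1 marks a selected row."""
    rng = random.Random(seed)
    n_selected = round(n_rows * fraction)
    chosen = set(rng.sample(range(n_rows), n_selected))
    return [1 if i in chosen else 0 for i in range(n_rows)]


# Hypothetical table rows (illustrative values only).
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]

# Random sampling: keep the rows whose index entry is 1.
mask = sample_indices(len(rows), fraction=0.5)
subset = [row for row, keep in zip(rows, mask) if keep == 1]

# Randomization: same rows, shuffled order, returned as a new list.
shuffled = random.Random(0).sample(rows, len(rows))

print(mask)
print(subset)
print(shuffled)
```

Fixing the seed makes the sample repeatable, which helps when comparing results between runs.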



Guide To Use Python Scripts In Orange

Python Script is this mysterious widget most people do not know how to use, even those versed in Python. Python Script is the widget that supplements Orange functionalities with (almost) everything that Python can offer.

We will try to replicate the work for Discretization using Python Script. As shown below, we will create two paths in the workflow; one will use the Discretize Widget and then give output; meanwhile, the other path will go to Python Script, where we will write the logic related to discretization.
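A script along the following lines could go into the Python Script widget. Treat it as a sketch under assumptions: the helper name discretize_table and the choice of 4 equal-width bins are made up here and may not match the settings of the Discretize widget in the workflow. The widget supplies the incoming table as in_data and sends whatever is assigned to out_data.

```python
def discretize_table(data, bins=4):
    """Return a new table whose continuous features are cut into equal-width bins."""
    # Imported inside the function so the helper can be defined even where
    # Orange is not installed; inside the widget Orange is always available.
    from Orange.preprocess import Discretize
    from Orange.preprocess.discretize import EqualWidth

    # bins=4 is an assumption for illustration, not a value from the workflow.
    return Discretize(method=EqualWidth(n=bins))(data)


# In the Python Script widget, `in_data` holds the table on the input channel.
# Outside the widget the name does not exist, so fall back to None.
in_data = globals().get("in_data")
out_data = discretize_table(in_data) if in_data is not None else None
```

Connecting the widget's output to a Data Table widget then lets you compare the result with the output of the Discretize widget.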



Discretization using Python Script

Now, two different tables will be generated — one for each path. ‘Data Table’ has output after passing through the Discretize widget, and ‘Data Table (2)’ has the output after passing through the Python Script Widget.





As you can see above, both data tables are similar. This shows that in Orange we can combine script programming with visual programming.

Thank you:)




