Web Scraping Using Python

Web Scraping

Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.
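
One common way to find out whether a site minds automated access is its robots.txt file. The sketch below checks such rules with Python's standard urllib.robotparser module. The rules, the example.com addresses, and the user agent name "my-scraper" are made-up placeholders so the example runs offline; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used here so the example runs offline.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/wiki/Data_science"))
print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/private/page"))
```

For a real site, you would instead call set_url() with the address of the site's robots.txt file and then read(), which downloads and parses it.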

Reasons for Web Scraping

Instead of checking a job site every day, you can use Python to automate the repetitive parts of your job search. Automated web scraping can speed up data collection: you write your code once, and it gets the information you want many times and from many pages.

In contrast, when you try to get the information you want manually, you might spend a lot of time clicking, scrolling, and searching, especially if you need large amounts of data from websites that are regularly updated with new content. Manual web scraping can take a lot of time and repetition.

There’s so much information on the Web, and new information is constantly added. You’ll probably be interested in at least some of that data, and much of it is just out there for the taking. Whether you’re actually on the job hunt or you want to download all the lyrics of your favorite artist, automated web scraping can help you accomplish your goals.

Challenges of Web Scraping

The Web has grown organically out of many sources. It combines many different technologies, styles, and personalities, and it continues to grow to this day. In other words, the Web is a hot mess! Because of this, you’ll run into some challenges when scraping the Web:

Variety: Every website is different. While you’ll encounter general structures that repeat themselves, each website is unique and will need personal treatment if you want to extract the relevant information.

Durability: Websites constantly change. Say you’ve built a shiny new web scraper that automatically cherry-picks what you want from your resource of interest. The first time you run your script, it works flawlessly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!

Unstable scripts are a realistic scenario, as many websites are in active development. Once the site’s structure has changed, your scraper might not be able to navigate the sitemap correctly or find the relevant information. The good news is that many changes to websites are small and incremental, so you’ll likely be able to update your scraper with only minimal adjustments.
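
One way to make such breakage easier to diagnose is to check the scraped result and fail with a clear message instead of a long traceback. This is only a minimal sketch: the names LayoutChangedError and require_results are hypothetical helpers invented for this example, and the sample list stands in for real scraped data.

```python
class LayoutChangedError(RuntimeError):
    """Raised when an expected piece of the page is missing."""


def require_results(items, description):
    """Return items unchanged, or raise a clear error if nothing was found."""
    if not items:
        raise LayoutChangedError(
            f"No {description} found. The page layout may have changed."
        )
    return items


# Placeholder data standing in for what a scraper extracted.
headings = require_results(["Data science"], "headings")
print(headings)
```

When the check fails, the message points straight at the part of the scraper that needs adjusting.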

Steps for Scraping Using Python

1. Find the URL you want to scrape

Here, I am scraping the following Wikipedia page on data science:

https://en.m.wikipedia.org/wiki/Data_science

2. Inspect the page

Right-click on the webpage and click Inspect to see what the site is made of: its source code, the images and CSS that form its design, the fonts and icons it uses, the JavaScript code that powers animations, and much more.

3. Find the data you want to extract

Here, I am extracting the heading found inside a div with a particular class.

4. Code:

Create a Python file.

(You can create a Python file in any Python editor or environment, such as PyCharm, Jupyter Notebook, Google Colaboratory, or VS Code.)

Here, I am using Google Colaboratory.
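
The code for this step is not reproduced in the text, so here is a minimal sketch of how the page could be downloaded. It assumes only Python's standard library (urllib.request); the function names and the User-Agent value are placeholders chosen for this example.

```python
from urllib.request import Request, urlopen

URL = "https://en.m.wikipedia.org/wiki/Data_science"


def build_request(url):
    """Create a request that identifies the scraper through a User-Agent header."""
    # The User-Agent value is a made-up placeholder; replace it with your own.
    return Request(url, headers={"User-Agent": "my-scraper/0.1 (learning project)"})


def fetch_html(url, timeout=10):
    """Download the page and return its HTML as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs an Internet connection):
# html = fetch_html(URL)
# print(html[:200])
```

The download is wrapped in a function so that nothing touches the network until you call fetch_html yourself.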

5. Run the code and extract the data
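
As a rough illustration of the extraction, the sketch below collects heading text from inside a div with a given class, using the standard library's html.parser. The class name "content" and the sample HTML are invented for the example; on a real page you would use the class you found while inspecting it.

```python
from html.parser import HTMLParser


class HeadingInDivParser(HTMLParser):
    """Collect the text of h1-h6 headings located inside a div with a given class."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.div_depth = 0       # greater than 0 while inside the target div
        self.in_heading = False
        self.parts = []          # text pieces of the heading being read
        self.headings = []       # finished headings

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count the target div and every div nested inside it.
            if self.div_depth > 0 or self.div_class in classes:
                self.div_depth += 1
        elif tag in self.HEADING_TAGS and self.div_depth > 0:
            self.in_heading = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag in self.HEADING_TAGS and self.in_heading:
            self.in_heading = False
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self.parts).split())
            if text:
                self.headings.append(text)

    def handle_data(self, data):
        if self.in_heading:
            self.parts.append(data)


def extract_headings(html, div_class):
    """Return the headings found inside any div carrying div_class."""
    parser = HeadingInDivParser(div_class)
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up HTML so the example runs without downloading anything.
SAMPLE_HTML = """
<div class="intro"><h1>Ignored heading</h1></div>
<div class="content main">
  <div><h1>Data <span>science</span></h1></div>
  <h2>Foundations</h2>
</div>
<h2>Outside any target div</h2>
"""

print(extract_headings(SAMPLE_HTML, "content"))  # ['Data science', 'Foundations']
```

Many tutorials use a third-party parser such as Beautiful Soup for this step; the standard library version is shown only to keep the sketch free of extra installs.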

6. Save the data in the required format

Here, I saved it as a data.csv file.
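
A minimal sketch of the saving step, assuming the scraped result is a list of heading strings. The function name save_headings and the column name "heading" are choices made for this example; only the file name data.csv comes from the step above.

```python
import csv


def save_headings(headings, path="data.csv"):
    """Write one heading per row to a CSV file, below a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["heading"])
        writer.writerows([h] for h in headings)
    return path
```

Calling save_headings(["Data science"]) would create data.csv in the current folder, with the header row followed by one row per heading.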

That covers the basics of web scraping using Python.

Thank You😊

