How to do web scraping in Python - Web scraping tutorial!


What is Web scraping?

Web scraping is a method used to extract large amounts of data from websites when no API is available. But what exactly is web scraping? Let's begin by understanding that.

Suppose you have a website, or a link to a website, that contains some information that is updated regularly. You wish to store that information, either locally or in a database, so that you can perform some manipulation on it.

Before scraping the data, we usually check whether the website has an open API that provides it, so that we can request the data from a URL endpoint. But sometimes a website either doesn't have a web API or doesn't expose the data you want through its API.

In that case, you fetch the entire HTML content of the page as a string and parse it. You then extract the information you need from it, and you can store it or manipulate it according to your needs. That is what is known as web scraping.
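To make the idea concrete, here is a minimal sketch that uses only Python's standard library. The HTML string and the `TitleExtractor` class are made up for illustration; in a real scraper the string would be the downloaded page.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep only the text that appears between <title> and </title>.
        if self.in_title:
            self.title_parts.append(data)


# Hypothetical page content standing in for a downloaded page.
html_text = "<html><head><title>Example Shop</title></head><body><p>Hello</p></body></html>"

extractor = TitleExtractor()
extractor.feed(html_text)
page_title = "".join(extractor.title_parts).strip()
print(page_title)
```

Libraries such as Beautiful Soup wrap this kind of parsing in a much friendlier interface, which is why the rest of this tutorial uses one.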


Web scraping in Python & Python web scraping tools

Photo by Chris Ried on Unsplash

First of all, Python is a very quick language to write a web scraper in. Python has a lot of packages for web scraping. One of the packages that we can use is Beautiful Soup.

The current version of Beautiful Soup at the time of writing is 4.9. Another package is Scrapy. Scrapy is less for scraping a single page and more for building web spiders, or web crawlers.

Scrapy could be considered an entire framework in itself. It allows you to create spiders by inheriting from its classes and then telling them how to parse the pages. This is useful when you're creating a web crawler and you wish to crawl multiple pages.

But if you wish to create a web scraper that gets information from a single page and runs several times a day, for example against a stock market website that is updated hourly, then in my opinion you can use Beautiful Soup.

Scrapy is also very good, and it is more powerful than Beautiful Soup in the sense that it is used to create web spiders, web scrapers, or web crawlers, whatever you call them.


Web scraping tutorial

In this web scraping tutorial, we will learn how to do web scraping with a simple example. We'll be using the Beautiful Soup library because it's very easy to get started with. So let us take a look at what we can do with Beautiful Soup.
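As a first taste, here is a minimal sketch that parses the same made-up HTML twice, once from a string in memory and once from a file stored on disk. It assumes the beautifulsoup4 package is already installed (installation is covered in the steps that follow) and uses Python's built-in 'html.parser', so nothing else is needed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
sample_html = "<html><head><title>Offline Page</title></head><body><h1>Hi</h1></body></html>"

# Case 1: parse HTML that is already in memory as a string.
soup_from_string = BeautifulSoup(sample_html, "html.parser")

# Case 2: parse an HTML file stored on disk by passing an open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / "page.html"
    page_path.write_text(sample_html, encoding="utf-8")
    with open(page_path, encoding="utf-8") as html_file:
        soup_from_file = BeautifulSoup(html_file, "html.parser")

title_text = soup_from_string.title.get_text()
heading_text = soup_from_file.h1.get_text()
print(title_text)
print(heading_text)
```

Either way, the result is the same kind of object, so the code that extracts data does not need to care where the HTML came from.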
With Beautiful Soup, we can parse either an HTML page stored offline or an HTML page from a website. Parsing an HTML page from a web address requires you to first get the content of the page as a string. Once the content has been returned as a string, you can parse it with Beautiful Soup.

Before we get started with Beautiful Soup, you need to install all the dependencies in your project.

1. Install Beautiful Soup version 4 using pip, the Python package manager. Open CMD and type the command:
pip install beautifulsoup4
2. Another thing that we need to think about is parsers. A parser is the package that the Beautiful Soup library uses to read the HTML content and create an HTML tree out of it.
3. To see the structure of an HTML page, open any page in a browser, right-click on it, and then click Inspect. HTML is a tree of nodes nested inside one another in a hierarchy.

4. To understand these relationships between the parts of a page, we use parsers. Beautiful Soup supports many parsers. It supports 'html.parser', which is part of Python's standard library, so you don't have to install anything to use it.

5. Instead of the HTML parser that is built into Python, you can also use lxml's HTML parser, which we will be using in this tutorial because it's very fast and very lenient.

6. Install lxml's HTML parser along with requests. Requests is a package used to send requests to a website or a web server and get the response as a string.

pip install lxml requests
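The tree of nested nodes that a parser builds can be explored directly. The sketch below uses made-up markup and the built-in 'html.parser' so that it runs with only beautifulsoup4 installed; passing 'lxml' instead should work the same way for this input once lxml is installed.

```python
from bs4 import BeautifulSoup

# Made-up markup with a few levels of nesting.
html_text = """
<html>
  <body>
    <div id="menu">
      <ul>
        <li>Home</li>
        <li>About</li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html_text, "html.parser")

first_item = soup.find("li")
# Walk upwards from the first <li> to see which tags contain it.
ancestors = [parent.name for parent in first_item.parents]
# Collect the text of every <li> in the document.
item_texts = [li.get_text() for li in soup.find_all("li")]

print(ancestors)
print(item_texts)
```

The list of ancestors shows the hierarchy from the inside out, which is the same nesting you see in the browser's Inspect panel.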
7. Open Python IDLE or any code editor such as VS Code or PyCharm, and create a Python file.

8. First, import the Beautiful Soup and requests libraries.

import requests, bs4
9. Now store the response that we get from the URL (use your own website, or first get the owner's permission before using another website).

NOTE: Don't use my URL. You are not permitted to scrape my website.

res1 = requests.get('https://www.thetechreview.xyz/')
10. Create a BeautifulSoup object to extract data from the page. Pass the text of the response as the first parameter and specify the HTML parser as the second (as I said earlier, I'll use lxml for parsing).

soup1 = bs4.BeautifulSoup(res1.text, 'lxml')
11. Now create another object to extract the data from a certain HTML tag.


title1 = soup1.select('title')

12. Now print the text of the first element of the resulting list using the getText() method of the Beautiful Soup library.

print(title1[0].getText())
output:
The Tech Review
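13. The whole code will look like this:
import requests, bs4

# Download the page (only scrape sites you have permission to scrape)
res1 = requests.get('https://www.thetechreview.xyz/')
# Parse the HTML text with lxml's parser
soup1 = bs4.BeautifulSoup(res1.text, 'lxml')
# select() returns a list of all matching tags
title1 = soup1.select('title')
print(title1[0].getText())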

14. For more information, documentation, and further web scraping tutorials, visit the official Beautiful Soup website.
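The steps above can also be wrapped in small functions so that a failed download or a page without a title does not crash the script. This is only a sketch under a few assumptions: the function names `extract_title` and `fetch_title` and the 10-second timeout are choices made for illustration, and the offline check at the bottom uses made-up HTML with Python's built-in parser.

```python
import bs4
import requests


def extract_title(html_text, parser="lxml"):
    """Return the text of the first <title> tag, or None if there is none."""
    soup = bs4.BeautifulSoup(html_text, parser)
    matches = soup.select("title")
    if not matches:
        return None
    return matches[0].getText().strip()


def fetch_title(url, timeout=10):
    """Download a page and return its title; raises on HTTP or network errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return extract_title(response.text)


# Offline check with made-up HTML, using Python's built-in parser.
sample_title = extract_title(
    "<html><head><title> Demo Page </title></head></html>", parser="html.parser"
)
empty_title = extract_title("<p>no title here</p>", parser="html.parser")
print(sample_title, empty_title)
```

Calling `fetch_title()` with the URL of a site you are allowed to scrape then performs the download and the parsing in one go.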


Is web scraping legal?

Web scraping is used when we want to extract data from a website. Sometimes we don't have an API to get the required data. In that situation, we use web scraping to get the data or information.

It's a kind of grey area: web scraping is not actually legal on some websites. You have to check the website owner's permissions, policies, terms & conditions, and so on.
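One technical signal that can be checked from code is a site's robots.txt file, which lists the paths the owner does not want automated clients to visit. It is not a substitute for reading the site's terms or asking permission, only one more thing to look at. The sketch below uses Python's standard library; the robots.txt content and the "my-scraper" user agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one is served at the site's /robots.txt path.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(robots_lines)

# "my-scraper" is a placeholder user agent name.
allowed_home = robots.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = robots.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_home, allowed_private)
```

For a live site, `set_url()` followed by `read()` on the same object downloads the real file instead of parsing a hard-coded list.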

Some companies will hire you to do web scraping on their own websites, or you may get freelancing projects to do web scraping for interested companies.

Google has also removed some apps from the Google Play Store because they were scraping websites without the website owners being aware of it. So, is web scraping legal? In some cases it's legal, and in others it's not!

