For this blog, I created a sample HTML file to scrape, and I will use it throughout this post on web scraping. We will use Beautiful Soup to parse the HTML and the requests package to fetch pages from the web.

If you use a plain Python installation, you have to install the packages below yourself. If you use Anaconda, these libraries usually come preinstalled.

  1. pip install beautifulsoup4 -> Pulls data out of HTML and XML files.
  2. pip install lxml (or pip install html5lib) -> These are HTML parsers; different parsers can build different trees from the same markup, especially when it is malformed.
  3. pip install requests -> Fetches pages from the web.

Beautiful Soup supports the HTML parser included in Python's standard library, and it also supports several third-party Python parsers. One of them is lxml, which is the parser we use in this tutorial.

In [1]:

from bs4 import BeautifulSoup
import requests
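If lxml is not installed, asking Beautiful Soup for it raises an error. The sketch below is one possible way to fall back to the parser that ships with Python. The helper name make_soup and the demo markup are made up for illustration; they are not part of the sample file used in this post.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser is part of the standard library.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical one-line document, just to show the helper working.
soup = make_soup("<p id='demo'>Hello, parser</p>")
print(soup.p.string)
```

Either parser gives the same result for well-formed markup like this; the differences show up mostly on broken HTML.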

In [2]:

## Open the sample HTML file in read mode and parse it with lxml.
with open('web_scraping_sample.html', 'r') as html_file:
    html = BeautifulSoup(html_file, 'lxml')
    ## Print the whole parsed document.
    print(html)




Scraping Website


Blog Platform of Robofied

At this platform you will find amazing blogs in a well-defined way related to machine learning.

Blogs Platform

Hiring Platform of Robofied

At this platform you can apply for jobs in different domains.

Hiring Platform

Web Platform of Robofied

At this platform you will find amazing tutorials of web development.

Web development Playform
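The printed document above shows only the text, so here is a rough sketch of what such a sample file could look like and how to parse it. The markup is an assumption reconstructed from the outputs in this post: the ids "jobs" and "web", the h1 tag and the exact title text are guesses, and only id="ml" is confirmed later on. It uses html.parser so it runs without lxml, and writes to a temporary folder so it does not overwrite your own file.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical reconstruction of the sample page (not the author's exact file).
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample website for web scraping</title></head>
<body>
<h1>Scraping Website</h1>
<div>
  <h2>Blog Platform of Robofied</h2>
  <p id="ml">Blogs related to machine learning.</p>
  <a href="http://blog.robofied.com/">Blogs Platform</a>
</div>
<div>
  <h2>Hiring Platform of Robofied</h2>
  <p id="jobs">Apply for jobs in different domains.</p>
  <a href="http://hiring.robofied.com/">Hiring Platform</a>
</div>
<div>
  <h2>Web Platform of Robofied</h2>
  <p id="web">Tutorials of web development.</p>
  <a href="Home">Web development Platform</a>
</div>
</body>
</html>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "web_scraping_sample.html")
    # Write the sample page to disk, then read it back like the cell above.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(SAMPLE_HTML)
    with open(path, "r", encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

print(soup.title.string)
print(len(soup.find_all("div")))
```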

 

 

How to navigate the tree of an HTML file

In [3]:

## Get the head element of the document.
print(html.head)
print()


## To get only the text of the title, chain the tags with "." and read .string
print(html.head.title.string)
print()

## To access the parent tag, use .parent
print('Parent of title tag is {}'.format(html.title.parent.name))




Sample website for we scraping

Parent of title tag is head
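Besides moving down with "." and up with .parent, you can also move sideways and list direct children. Here is a small sketch of that on a made-up snippet (the markup is an assumption for illustration, not the sample file).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags so that
# .children yields only tags and no stray newline strings.
markup = (
    "<html><head><title>Demo page</title></head>"
    "<body><div><h2>Heading</h2><p>Text</p></div></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

title = soup.head.title
print(title.name)         # name of the tag itself
print(title.string)       # text inside the tag
print(title.parent.name)  # name of the enclosing tag

# .children iterates over the direct children of a tag.
child_names = [child.name for child in soup.div.children]
print(child_names)

# find_next_sibling moves sideways to the next matching tag at the same level.
next_para = soup.div.h2.find_next_sibling("p")
print(next_para.string)
```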

Finding a particular tag

In [4]:

## html.div returns only the first div.
print(html.div)
print()

## find_all returns a list of every div in the document.
print(html.find_all('div'))

 

Blog Platform of Robofied

 

At this platform you will find amazing blogs in a well-defined way related to machine learning.

Blogs Platform

[

 

Blog Platform of Robofied

 

At this platform you will find amazing blogs in a well-defined way related to machine learning.

Blogs Platform

,

 

Hiring Platform of Robofied

 

At this platform you can apply for jobs in different domains.

Hiring Platform

,

 

Web Platform of Robofied

 

At this platform you will find amazing tutorials of web development.

Web development Playform

]
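To make the difference between the two calls concrete, here is a short sketch on made-up markup. The class name "card" is an assumption for illustration and does not come from the sample file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three divs, two of which share a class.
markup = (
    "<div class='card'>one</div>"
    "<div class='card'>two</div>"
    "<div>three</div>"
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("div")                     # first match, same as soup.div
all_divs = soup.find_all("div")              # every match, in document order
cards = soup.find_all("div", class_="card")  # filter by CSS class
first_two = soup.find_all("div", limit=2)    # stop after two matches
missing = soup.find("table")                 # no match gives None, not an error

print(first.string, len(all_divs), len(cards), len(first_two), missing)
```

Because find returns None when nothing matches, it is worth checking the result before reading attributes from it.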

In [5]:

## Get all the paragraphs. Notice that each paragraph has a different id.
print(html.find_all('p'))
print()

## To extract a paragraph by id:
## this returns the paragraph which has id="ml"
print(html.find('p', id='ml'))

[

At this platform you will find amazing blogs in a well-defined way related to machine learning.

,

At this platform you can apply for jobs in different domains.

,

At this platform you will find amazing tutorials of web development.

]

At this platform you will find amazing blogs in a well-defined way related to machine learning.
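A small sketch of a few more ways to work with ids, assuming paragraphs like the ones above. Only id="ml" appears in this post; the id "jobs" and the paragraph without an id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraphs; the third one has no id on purpose.
markup = '<p id="ml">ML blogs</p><p id="jobs">Jobs</p><p>No id here</p>'
soup = BeautifulSoup(markup, "html.parser")

# Exact match on the id attribute.
ml_para = soup.find("p", id="ml")
print(ml_para.string)

# id=True keeps only the tags that have an id attribute at all.
with_id = soup.find_all("p", id=True)
print(len(with_id))

# .get returns None instead of raising when the attribute is absent.
ids = [p.get("id") for p in soup.find_all("p")]
print(ids)
```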

 

In [6]:

## Fetch all the links present in the current webpage.

for link in html.find_all('a'):
    
    ## Print the URL stored in the href attribute of each link
    print(link.get('href'))

http://blog.robofied.com/
http://hiring.robofied.com/
Home
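Notice that the last link is relative ("Home") rather than a full URL. One way to handle this, sketched below, is to resolve each href against the address of the page with urljoin from the standard library. The base URL here is an assumption; for a local file you would pick whatever address the page is meant to be served from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the links were found on.
BASE_URL = "http://blog.robofied.com/"

# Hypothetical anchors mirroring the three hrefs printed above,
# plus one anchor without an href to show that it gets skipped.
markup = (
    '<a href="http://blog.robofied.com/">Blogs</a>'
    '<a href="http://hiring.robofied.com/">Hiring</a>'
    '<a href="Home">Web</a>'
    "<a>no href</a>"
)
soup = BeautifulSoup(markup, "html.parser")

links = []
for anchor in soup.find_all("a", href=True):  # skip anchors without href
    # Absolute URLs pass through unchanged; relative ones are resolved.
    links.append(urljoin(BASE_URL, anchor["href"]))

print(links)
```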

Websites generally follow a repeating pattern, such as multiple divs that each contain a subheading and one or two paragraphs, though the details vary. When a page follows such a regular structure, we can loop over the repeated elements to extract the information. Let's see how.

Scraping a page

In [7]:

div = html.find_all('div')

## Each div contains a heading, then a paragraph, and finally a link. We will rely on this structure
## to extract the info.
print(div)

[

 

Blog Platform of Robofied

 

At this platform you will find amazing blogs in a well-defined way related to machine learning.

Blogs Platform

,

 

Hiring Platform of Robofied

 

At this platform you can apply for jobs in different domains.

Hiring Platform

,

 

Web Platform of Robofied

 

At this platform you will find amazing tutorials of web development.

Web development Playform

]

In [8]:

for platforms in html.find_all('div'):
    
    ## We have already seen the structure of each div, so we can access its content directly.
    
    ## Extract the h2 text
    print(platforms.h2.text)
    
    ## Extract the paragraph text
    print(platforms.p.text)
    
    ## Print the whole link tag (use platforms.a.get('href') for just the URL)
    print(platforms.a)
    
    ## For new line
    print()

Blog Platform of Robofied
At this platform you will find amazing blogs in a well-defined way related to machine learning.
Blogs Platform

Hiring Platform of Robofied
At this platform you can apply for jobs in different domains.
Hiring Platform

Web Platform of Robofied
At this platform you will find amazing tutorials of web development.
Web development Playform
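The loop above assumes every div has an h2, a p and an a tag; if one is missing, platforms.h2.text raises an AttributeError. Below is a sketch of a more defensive version that collects the results into a list of dictionaries. The function name extract_platforms, the dictionary keys and the demo markup are all made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_platforms(soup):
    """Return one dict per div, using None for any missing piece."""
    records = []
    for div in soup.find_all("div"):
        heading = div.find("h2")
        para = div.find("p")
        link = div.find("a")
        records.append({
            "title": heading.get_text(strip=True) if heading else None,
            "description": para.get_text(strip=True) if para else None,
            "url": link.get("href") if link else None,
        })
    return records


# Hypothetical markup: the second div has no link on purpose.
markup = """
<div><h2>Blog Platform</h2><p>Blogs on ML.</p><a href="http://blog.robofied.com/">Blogs</a></div>
<div><h2>Hiring Platform</h2><p>Apply for jobs.</p></div>
"""
records = extract_platforms(BeautifulSoup(markup, "html.parser"))
for record in records:
    print(record)
```

A list of dictionaries like this is easy to write to CSV or JSON afterwards.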

If you want to fetch data from a website hosted on a server, use the requests package:

In [9]:

## Fetching the data from the web
robofied_source = requests.get('http://blog.robofied.com/')
## Printing the response object shows its HTTP status code
print(robofied_source)

 

In [10]:

## If you want to see the text
print(robofied_source.text)
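Once you have the text of the response, you can hand it to Beautiful Soup just like the local file. The sketch below adds a timeout and a status check, which is a common precaution rather than something this post requires. The helper names fetch_html and page_title are made up, the 10-second timeout is an arbitrary choice, and the network call is left commented out so the example also runs offline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def page_title(html_text):
    """Return the text of the title tag, or None if there is none."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.title.string if soup.title else None


# Usage against the live site (needs network access):
# html_text = fetch_html("http://blog.robofied.com/")
# print(page_title(html_text))

# Offline demonstration on a hypothetical page.
print(page_title("<html><head><title>Offline demo</title></head></html>"))
```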