How to clean web scraping data using python beautifulsoup

The Data output from web scraping isnt always clean and ready to be used. In this tutorial we will show you out tips and tricks using Python

By Admin @August, 17 2022

how to clean web scraping data using python beautifulsoup

CLEANING OF SCRAPED HTML DATA

Web scraping is an automated method used to extract large amounts of data from websites. The data scraping usually is unstructured. So, cleaning this scraped data is necessary to convert the unstructured data into structured form. From another perspective, spending time cleaning up messy data can fill the large gaps that your processor will experience when waiting for it to be downloaded from its host.

This section covers several useful clean-up operations of scraped data of CSV files using Python programming

Clean HTML
Strip white space
Converting data type of columns
Converting Boolean values: ‘Yes’ -> True
Converting dates to machine-readable formats: “24 June 2004” -> “2004-06-24”
Inconsistencies
Empty cells
Changing reviews from text to numbers
Remove duplicates
Changing the case of text
Converting Unix date/time to date
Separating text into different columns
Changing currencies

1. Cleaning HTML tags

When you get data from a website or from an HTML code. Then you may need to remove tags from the scraped data. Here are the steps you need to follow to do it using python.

Python provides various libraries to perform different operations. The BeautifulSoup library is used generally to deal with HTML pages in python. The beautifulSoup library for versions after python 3 is renamed as beautifulsoup4. Usually the web scrapped data will be placed in an excel sheet or CSV files. To handle those CSV files using python programming a library names pandas are being provided.

In this article, we provided a solution to remove HTML tags without using BeautifulSoup. Let us see what to do to remove the HTML tags from the scraped data. Let us consider sample data as follows which is saved in . CSV format:

If you have data similar to the above figure shown and you are willing to remove those HTML tags and get the data. The following code helps you out.

import re
import csv

#Open and Read the content of the CSV file 
f=open("Book1.csv",'r')
csv_f = csv.reader(f)

f1=open("textfile.txt",'w')

#Identify all the regular expression such as  <*> and substitute with empty space. 
def cleanhtml(raw_html): 
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)   
#Write data into text file 
f1.write(str(cleantext))
  f1.write("\n")
  print(cleantext) 

for row in csv_f:
    cleanhtml(str(row))
f.close()
f1.close()

The re library is used to define the removal of regular expression from the scraped data.
CSV is used to read data and write data into CSV and text files respectively.

After removing the tags the output text file will be as follows:

2. Strip White Spaces

When you get data into any spreadsheet from a website. You are likely to get extra spaces along with important data. There can be leading and trailing spaces, several blanks between words, and thousand separators for numbers.

Replace() function along with regular expression is used to strip the trailing and leading spaces present in the data.

Let the CSV file have the following data.

The following python code is used to remove spaces.

import numpy as np
import pandas as pd
import re


#Load the content of CSV file into python code
def read_csv_regex(data, date_columns=[]):
    df = pd.read_csv(data, quotechar='"', parse_dates=date_columns)


# remove front and ending blank spaces
    df = df.replace({"^\s*|\s*$": ""}, regex=True)


# if there remained only empty string "", change to Nan
    df = df.replace({"": np.nan})
    print(df)
    df.to_csv("removedSpaces.csv")
    return df

columns=["Name","Grade","Percentage","city"]
read_csv_regex("removespaces.csv",columns)

The CSV sheet after removing the trailing and leading spaces will be as follows.

3. Converting a number to number type

Numbers that are stored as text can cause unexpected results. So to convert the numbers that are stored in text format (as String) python provides a function to_numeric(). This function is used to convert the values of columns with numerical text to numerical values.

Let the sample spreadsheet with columns consisting of numerical values in text format be as follows.

The python code to load this CSV file into the python file and convert the numbers in text format to numerical type and store the new sheet is as below.

import pandas as pd

#load the CSV file into python code
df = pd.read_csv(r"Numbertype.csv")
print(df.dtypes) 
#convert to numerical value
df['Price'] = pd.to_numeric(df['Price'])
#write the converted data into CSV file
df.to_csv("NumbertypeConvert.csv")

The resulting CSV file with updated data along with updated data type will be as shown.

4. Converting Boolean values Yes->TRUE and No-> False

If you would like to change yes to TRUE and no to FALSE for the data present in the sheet then the “map()” function can be used. The same is applicable if you want to change Boolean TRUE or FALSE into 1 or 0.

Let us consider same data as follows:

The python code to convert the yes/no to Boolean expression is:

Import pandas “import pandas as pd”

Load the CSV file with the above sheet into the python file

df=pd.read_csv(r"filepath\filename.csv")

Converting Boolean expression

“df['Employement'] = df['Employement'].map({'yes':True ,'no':False})

The updated data frame will be as follows and it is loaded into another CSV file

df.to_csv("yesnoconverted.csv")

The updated CSV content will be as follows

5. Converting dates to machine-readable formats

For converting date details present in the sheet into machine understandable format python library “pandas” provides a function “to_datetime()”. This function is used to convert the date present in the column into machine-readable format.

To implement the date conversion, let us consider a sample spreadsheet with the Name and Date of Birth as follows:

The step-by-step python code to convert the date into machine-readable format is as follows:

Import pandas “import pandas as pd”

Load the CSV file “df=pd.read_csv(filepath\filename.csv)”

Write the expression with to_datetime() function to convert into machine readable format “df['DATE OF Birth'] = pd.to_datetime(df['DATE OF Birth'])”

Load the data “df.to_csv("Formatteddate.csv")”

The final CSV file will be as follows:

6. Finding and correcting inconsistencies

Removing a few columns from the sheet

When dealing with spreadsheets if you want to delete any of the columns temporarily then drop() function of data frames helps out.

Let us consider a CSV file with four columns and you need only the first and last column to use.

The following code helps out

import pandas as pd
df=pd.read_csv(r'C:\Users\91986\Desktop\Fiver\Python scrapping data\excel sheet\removingcolumn.csv')
to_drop=['Marks', 'Percentage']
df.drop(to_drop, inplace=True, axis=1)
print(df)

The resulting sheet has only two columns

7. Removing Empty cells in CSV file

When you are using a sheet that consist of data, if there are any empty cells and you are trying to ready empty content and perform computations it would lead to errors. So it is necessary to remove empty cells present, they should be removed.

Let us consider an example CSV file with the following data

The dropna() function helps to remove empty cells present if any in between the data.

The following python code helps in removing the empty cells:

Import pandas “import pandas as pd”

Load the CSV file “df=pd.read_csv(r’filepath\filename.csv’)”

Write the dropna() function “df.dropna(inplace = True)”

Write the updated data to another CSV file “df.to_csv("emptycellsremoved.csv")”

The emptycellsremoved.csv file data will be as follows

8. Changing reviews

The reviews given in text format can be converted into numbered format reviews using map() function similar to converting yes/no to Boolean conversion.

Let the sample CSV file by:

The following python code is used to change text reviews into numbered reviews.

Import the pandas library “import pandas as pd”

Load the CSV sheet “df=pd.read_csv(r’filepath\filename.csv’)”

Change the reviews from text to numbered format

“df['Reviews'] = df['Reviews'].map(

{'NEUTRAL':2 ,'NOT SATISFIED':1,'SATISFIED':3})”

Write the data into another CSV file

“df.to_csv("numberreviews.csv")”

The resulting numberreviews.csv content will be as follows

9. Removing Duplicates

Python provides a function to remove duplicate entries if any are present in the sheet that you are using.

drop_duplicates() function is useful in identifying the duplicate entries and removing them based on the condition mentioned inside the brackets.

“data.drop_duplicates(keep=False,inplace=True)” – is used to remove the duplicates if all the columns have the same data matching with another row.

“data.drop_duplicates(subset ="Attribute/column name",keep = False, inplace = True)” – is used to remove duplicates by considering only the columns mentioned in the subset field.

Let us consider a dataset with the following data:

The following python code helps in removing duplicates.

Import pandas and load the sheet

“import pandas as pd”

“df=pd.read_csv(r’filepath\filename.CSV)”

Write the expression to remove duplicates

“data.drop_duplicates(keep=False,inplace=True)”

Upload the data to a new sheet.

“data.to_csv("removedduplicates.csv")”

The removedduplicates.csv file will be as shown below

10. Changing the case of text

There are inbuilt functions available in python to change the case of the text as per the requirement. The functions available are:

Series.str.lower()- To convert all the entries of a column to lowercase

Series.str.upper()-To convert all the entries of a column to upper case

Series.str.title() - To convert the first letter to upper case and all the other letters to lowercase for all the entries of a column mentioned.

Let us take an example . CSV spreadsheet.

Python programming involves the following steps

Import pandas and load the datasheet

“import pandas as pd”

“df=pd.read_csv(r’filepath\filename.csv’)”

Convert the “Name” and “Place” to lowercase

data["Place"]= data["Place"].str.lower()

data["Name"]=data["Name"].str.lower()

Convert the “Name” and “Place” to upper case

data["Place"]= data["Place"].str.upper()

data["Name"]=data["Name"].str.upper()

Convert only the first letter to upper case and the rest of the letters to lowercase

data["Place"]= data["Place"].str.title()

data["Name"]=data["Name"].str.title()

11. Convert Unix time to date

The Unix timestamp tracks time as a running count of seconds. The count begins at the "Unix Epoch" on January 1st, 1970, so a Unix timestamp is simply the total seconds between any given date and the Unix Epoch. Since a day contains 86400 seconds (24 hours x 60 minutes x 60 seconds), conversion to Excel time can be done by dividing days by 86400 and adding the date value for January 1st, 1970.

Apart from these calculations python provides a function to_datetime() to directly convert the Unix date into actual date time format.

Let us consider certain Unix time values to verify the above-mentioned function

The python code to convert the Unix date data present in the CSV spreadsheet into normal date time format requires:

Importing pandas and loading the CSV file

“import pandas as pd”

“df=pd.read_csv(r’filepath\filename.csv’)”

Write the expression to convert the Unix time to date time format

date['datetime'] = pd.to_datetime(date['unix'], unit='s')

The best approach to web scraping is with the use of already made scrapers like the Google maps scraper and Zillow homes extractor. The pros of using such ready-made web scrapers are that they make the work a lot easier and faster than the alternative.

The final converted time sheet will be as shown below

12. Separating Text into different columns

When data is scrapped from a webpage it will be in the form of paragraphs sometimes. Then there will be a need to split the text into different columns based on the requirement. The data can be split in python using a general str.split() function which is used to split the content based on the character mentioned inside the split function.

Usually there will be a requirement to split based on the space “ “ . Let us see an example where data from one column is spitted and placed in more than one column.

The sample sheet data is as follows:

Following are the steps to be followed

Import pandas and load the CSV file

“import pandas as pd”

“df=pd.read_csv(r’filepath\filename.csv’)”

Write an expression with split() function to split the data in the column

data[['First','Last']] = data.Name.apply(

lambda x: pd.Series(str(x).split(" ")))

Export the data into new spreadsheet

data.to_csv("splitteddata.csv")

The final sheet with the first name and last name as separate columns is as follows

13. Changing Currencies

While working with sheets and web scraped data, you may need to change the currency symbol or sometimes remove the currency symbol in the column to make the computations easier.

The replace() function of python helps in changing currencies or even removing currency symbols.

Let us take the sample sheet as shown

The following python code helps out in converting currency

Import pandas and load the sheet into python environment

Write the replacement expression to convert the currency from dollar($) to rupee(₹) . clean_currency() function is defined to check whether the dollar symbol is present or not and then if present it will be replaced.

Converted currency format

WEBAUTOMATION.IO PRE-DEFINED EXTRACTORS

We aim to make the process of extracting web data quick and efficient so you can focus your resources on what's truly important, using the data to achieve your business goals. In our marketplace, you can choose from hundreds of pre-defined extractors (PDEs) for the world's biggest websites. These pre-built data extractors turn almost any website into a spreadsheet or API with just a few clicks. The best part? We build and maintain them for you so the data is always in a structured form. .

Related Blog Post

How to Clean Web Scraped data using excel/google sheets