How to Read HTML Tables Into Pandas
Introduction
The pandas read_html() function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site's HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it. In this article, I will discuss how to use pandas read_html()
to read and clean several Wikipedia HTML tables so that you can use them for further numeric analysis.
Basic Usage
For the first example, we will try to parse the table from the Politics section of the Minnesota wiki page.
The basic usage of pandas read_html()
is pretty simple and works well on many Wikipedia pages since the tables are not complicated. To get started, I am including some extra imports we will use for data cleaning in the more complicated examples:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')
The unique point here is that table_MN is a list of all the tables on the page:
print(f'Total tables: {len(table_MN)}')
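If you are not sure which of the parsed tables is the one you want, one quick option (just a throwaway sketch, not something we need later) is to loop over the list and print the shape and first few columns of each table:

# Sketch: survey the parsed tables to see which one looks right
for i, tbl in enumerate(table_MN):
    print(i, tbl.shape, list(tbl.columns[:3]))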
With 38 tables, it can be challenging to find the one you need. To make the table selection easier, use the match
parameter to select a subset of tables. We can use the caption "Election results from statewide races" to select the table:
table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota',
                        match='Election results from statewide races')
len(table_MN)
df = table_MN[0]
df.head()
  | Year | Office | GOP | DFL | Others
---|---|---|---|---|---
0 | 2018 | Governor | 42.4% | 53.9% | 3.7%
1 | 2018 | Senator | 36.2% | 60.3% | 3.4%
2 | 2018 | Senator | 42.4% | 53.0% | 4.6%
3 | 2016 | President | 44.9% | 46.4% | 8.6%
4 | 2014 | Governor | 44.5% | 50.1% | 5.4%
Pandas makes it easy to read in the table and also handles the year column that spans multiple rows. This is an example where it is easier to use pandas than to try to scrape it all yourself.
Overall, this looks OK until we look at the data types with df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    24 non-null     int64
 1   Office  24 non-null     object
 2   GOP     24 non-null     object
 3   DFL     24 non-null     object
 4   Others  24 non-null     object
dtypes: int64(1), object(4)
memory usage: 1.1+ KB
We need to convert the GOP, DFL and Others columns to numeric values if we want to do any analysis.
If we try:
df['GOP'].astype('float')
We get an error:
ValueError: could not convert string to float: '42.4%'
The most likely culprit is the %. We can get rid of it using the pandas replace()
function. I covered this in some detail in a previous article.
df['GOP'].replace({'%': ''}, regex=True).astype('float')
Which looks good:
0     42.4
1     36.2
2     42.4
3     44.9
     <...>
21    63.3
22    49.1
23    31.9
Name: GOP, dtype: float64
Note that I had to use the regex=True
parameter for this to work since the %
is part of the string and not the full string value.
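To illustrate the difference, here is a small throwaway example (not part of the cleanup itself) comparing the two calls on a tiny Series:

# Sketch: without regex=True, replace() only matches whole cell values
s = pd.Series(['42.4%', '36.2%'])
print(s.replace({'%': ''}))               # unchanged - '%' never equals a full cell value
print(s.replace({'%': ''}, regex=True))   # the '%' is stripped from each string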
Now, we can call replace on all the %
values and convert to numbers using pd.to_numeric()
and apply():
df = df.replace({'%': ''}, regex=True)
df[['GOP', 'DFL', 'Others']] = df[['GOP', 'DFL', 'Others']].apply(pd.to_numeric)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    24 non-null     int64
 1   Office  24 non-null     object
 2   GOP     24 non-null     float64
 3   DFL     24 non-null     float64
 4   Others  24 non-null     float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.1+ KB
  | Year | Office | GOP | DFL | Others
---|---|---|---|---|---
0 | 2018 | Governor | 42.4 | 53.9 | 3.7
1 | 2018 | Senator | 36.2 | 60.3 | 3.4
2 | 2018 | Senator | 42.4 | 53.0 | 4.6
3 | 2016 | President | 44.9 | 46.4 | 8.6
4 | 2014 | Governor | 44.5 | 50.1 | 5.4
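Now that the columns are numeric, a quick sanity check is possible. For example, this optional snippet (just an illustration, using the cleaned df from above) averages the results by office:

# Sketch: simple numeric analysis now that the columns are floats
df.groupby('Office')[['GOP', 'DFL', 'Others']].mean().round(1)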
This basic process works well. The next example is a little trickier.
More Advanced Data Cleaning
The previous example showed the basic concepts. Frequently more cleaning is needed. Here is an example that was a little trickier. This example continues to use Wikipedia but the concepts apply to any site that has data in an HTML table.
What if we wanted to parse the US GDP table shown below?
This one was a little harder to use match to get only one table but matching on 'Nominal GDP' gets the table we want as the first one in the list.
table_GDP = pd.read_html('https://en.wikipedia.org/wiki/Economy_of_the_United_States',
                         match='Nominal GDP')
df_GDP = table_GDP[0]
df_GDP.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 #   Column                                            Non-Null Count  Dtype
---  ------                                            --------------  -----
 0   Year                                              41 non-null     object
 1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
 2   GDP per capita(in US-Dollar)                      41 non-null     int64
 3   GDP growth(real)                                  41 non-null     object
 4   Inflation rate(in percent)                        41 non-null     object
 5   Unemployment (in percent)                         41 non-null     object
 6   Budget balance(in % of GDP)[107]                  41 non-null     object
 7   Government debt held by public(in % of GDP)[108]  41 non-null     object
 8   Current account balance(in % of GDP)              41 non-null     object
dtypes: float64(1), int64(1), object(7)
memory usage: 3.0+ KB
Not surprisingly we have some cleanup to do. We can try to remove the %
like we did last time:
df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')
Unfortunately we get this error:
ValueError: could not convert string to float: '−5.9\xa0'
The issue here is that we have a hidden character, xa0,
that is causing some errors. This is a "non-breaking Latin1 (ISO 8859-1) space".
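If you want to see exactly which hidden characters are in a value, repr() and unicodedata can help. Here is a small diagnostic sketch using a made-up sample value:

# Sketch: inspect a suspicious value for hidden characters
import unicodedata

value = '−5.9\xa0'    # hypothetical raw cell value
print(repr(value))    # the \xa0 shows up in the repr
print([unicodedata.name(ch) for ch in value])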
One option I played around with was directly removing the value using replace.
It worked but I worried about whether or not it would break with other characters in the future.
After going down the unicode rabbit hole, I decided to use normalize
to clean this value. I encourage you to read this article for more details on the rationale for my approach.
I also have found issues with extra spaces getting into the data in some of the other tables. I built a small function to clean all the text values. I hope others will find this helpful:
from unicodedata import normalize

def clean_normalize_whitespace(x):
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x
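As a quick check (just an illustration), the function turns a value like the one from the error above into something much easier to work with:

# Sketch: NFKC normalization folds the non-breaking space into a regular space,
# and strip() then removes it
clean_normalize_whitespace('−5.9\xa0')   # returns '−5.9' - the \xa0 is gone, the unicode minus remains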
I can run this function on the entire DataFrame using applymap:
df_GDP = df_GDP.applymap(clean_normalize_whitespace)
applymap performance
Be cautious about using applymap. This function is very slow so you should be judicious in using it.
The applymap function is a very inefficient pandas function. You should not use it very often but in this case, the DataFrame is small and cleaning like this is tricky so I think it is a useful trade-off.
One thing that applymap
misses is the columns. Let's look at one column in more detail:
'Government debt held by public(in\xa0% of GDP)[108]'
We have that dreaded xa0%
in the column names. There are a couple of ways we could go about cleaning the columns but I'm going to use clean_normalize_whitespace()
on the columns by converting them to a series and using apply
to run the function. Future versions of pandas may make this a little easier.
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
df_GDP.columns[7]
'Government debt held by public(in % of GDP)[108]'
Now we have some of the hidden characters cleaned out. What next?
Let's try it out again:
df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')
ValueError: could not convert string to float: '−5.9 '
This one is really tricky. If you look really closely, you might be able to tell that the −
looks a little different than the -.
It's hard to see but there is actually a difference between the unicode dash and minus. Ugh.
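A quick way to confirm the two characters really are different (again, just a diagnostic sketch) is to ask unicodedata for their code points and names:

# Sketch: the unicode minus sign is not the ASCII hyphen-minus
import unicodedata

for ch in ['−', '-']:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x2212 MINUS SIGN
# 0x2d HYPHEN-MINUS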
Fortunately, we can use replace
to clean that up too:
df_GDP['GDP growth(real)'].replace({'%': '', '−': '-'}, regex=True).astype('float')
0    -5.9
1     2.2
2     3.0
3     2.3
4     1.7
     <...>
38   -1.8
39    2.6
40   -0.2
Name: GDP growth(real), dtype: float64
One other column we need to look at is the Year
column. For 2020, it contains "2020 (est)" which we want to get rid of, and then we can convert the column to an int. I can add to the dictionary but have to escape the parentheses since they are special characters in a regular expression:
df_GDP['Year'].replace({'%': '', '−': '-', '\(est\)': ''}, regex=True).astype('int')
0     2020
1     2019
2     2018
3     2017
4     2016
     <...>
40    1980
Name: Year, dtype: int64
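As a side note, if you would rather not escape the parentheses by hand, re.escape() can build the pattern for you. This is just an alternative sketch, not the approach used in the rest of the article:

# Sketch: let re.escape handle the special characters in the pattern
import re

df_GDP['Year'].replace({'%': '', '−': '-', re.escape('(est)'): ''},
                       regex=True).astype('int')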
Before we wrap it up and assign these values back to our DataFrame, there is one other detail to discuss. Some of these columns should be integers and some are floats. If we use pd.to_numeric()
we don't have that much flexibility. Using astype()
we can control the numeric type but we don't want to have to manually type this for each column.
The astype()
function can take a dictionary of column names and data types. This is really useful and I did not know this until I wrote this article. Here is how we can define the column data type mapping:
col_type = {
    'Year': 'int',
    'Nominal GDP(in bil. US-Dollar)': 'float',
    'GDP per capita(in US-Dollar)': 'int',
    'GDP growth(real)': 'float',
    'Inflation rate(in percent)': 'float',
    'Unemployment (in percent)': 'float',
    'Budget balance(in % of GDP)[107]': 'float',
    'Government debt held by public(in % of GDP)[108]': 'float',
    'Current account balance(in % of GDP)': 'float'
}
Here's a quick hint. Typing this dictionary is slow. Use this shortcut to build up a dictionary of the columns with float
as the default value:
dict.fromkeys(df_GDP.columns, 'float')
{'Year': 'float',
 'Nominal GDP(in bil. US-Dollar)': 'float',
 'GDP per capita(in US-Dollar)': 'float',
 'GDP growth(real)': 'float',
 'Inflation rate(in percent)': 'float',
 'Unemployment (in percent)': 'float',
 'Budget balance(in % of GDP)[107]': 'float',
 'Government debt held by public(in % of GDP)[108]': 'float',
 'Current account balance(in % of GDP)': 'float'}
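From there you only need to override the handful of integer columns. This optional follow-up (assuming the column names shown above) produces the same mapping as the hand-typed col_type dictionary:

# Sketch: start with float for everything, then override the int columns
col_type = dict.fromkeys(df_GDP.columns, 'float')
col_type.update({'Year': 'int', 'GDP per capita(in US-Dollar)': 'int'})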
I also created a single dictionary with the values to replace:
clean_dict = {'%': '', '−': '-', '\(est\)': ''}
Now we can call replace on this DataFrame, convert to the desired type and get our clean numeric values:
df_GDP = df_GDP.replace(clean_dict, regex=True).replace({'-n/a ': np.nan}).astype(col_type)
df_GDP.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 #   Column                                            Non-Null Count  Dtype
---  ------                                            --------------  -----
 0   Year                                              41 non-null     int64
 1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
 2   GDP per capita(in US-Dollar)                      41 non-null     int64
 3   GDP growth(real)                                  41 non-null     float64
 4   Inflation rate(in percent)                        41 non-null     float64
 5   Unemployment (in percent)                         41 non-null     float64
 6   Budget balance(in % of GDP)[107]                  40 non-null     float64
 7   Government debt held by public(in % of GDP)[108]  41 non-null     float64
 8   Current account balance(in % of GDP)              40 non-null     float64
dtypes: float64(7), int64(2)
memory usage: 3.0 KB
Which looks like this now:
  | Year | Nominal GDP(in bil. US-Dollar) | GDP per capita(in US-Dollar) | GDP growth(real) | Inflation rate(in percent) | Unemployment (in percent) | Budget balance(in % of GDP)[107] | Government debt held by public(in % of GDP)[108] | Current account balance(in % of GDP)
---|---|---|---|---|---|---|---|---|---
0 | 2020 | 20234.0 | 57589 | -5.9 | 0.62 | 11.1 | NaN | 79.9 | NaN
1 | 2019 | 21439.0 | 64674 | 2.2 | 1.80 | 3.5 | -4.6 | 78.9 | -2.5
2 | 2018 | 20580.2 | 62869 | 3.0 | 2.40 | 3.9 | -3.8 | 77.8 | -2.4
3 | 2017 | 19519.4 | 60000 | 2.3 | 2.10 | 4.4 | -3.4 | 76.1 | -2.3
4 | 2016 | 18715.0 | 57878 | 1.7 | 1.30 | 4.9 | -3.1 | 76.4 | -2.3
Just to show it works, we can plot the data as well:
plt.style.use('seaborn-whitegrid')
df_GDP.plot.line(x='Year', y=['Inflation rate(in percent)', 'Unemployment (in percent)'])
If you are closely following along, you may have noticed the use of a chained replace
call:
.replace({'-n/a ': np.nan})
The reason I put that in there is that I could not figure out how to get the n/a
cleaned using the first dictionary replace.
I think the issue is that I could not predict the order in which this data would get cleaned so I decided to execute the replace in two stages.
I'm confident that if there is a better way someone will point it out in the comments.
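One possible alternative I did not pursue (sketched here for illustration, assuming a df_GDP that has been cleaned with applymap but not yet converted with astype()) is to skip the second replace and let pd.to_numeric() coerce anything unparseable, such as the leftover n/a strings, into NaN:

# Sketch: coerce unparseable strings (like 'n/a') to NaN instead of replacing them
# Year is excluded so it can still be converted to int separately
cleaned = df_GDP.replace(clean_dict, regex=True)
numeric_cols = [c for c in cleaned.columns if c != 'Year']
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors='coerce')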
Full Solution
Here is a compact example of everything we have done. Hopefully this is useful to others that try to ingest data from HTML tables and use them in a pandas DataFrame:
import pandas as pd
import numpy as np
from unicodedata import normalize


def clean_normalize_whitespace(x):
    """ Normalize unicode characters and strip trailing spaces
    """
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x


# Read in the Wikipedia page and get the DataFrame
table_GDP = pd.read_html(
    'https://en.wikipedia.org/wiki/Economy_of_the_United_States',
    match='Nominal GDP')
df_GDP = table_GDP[0]

# Clean up the DataFrame and Columns
df_GDP = df_GDP.applymap(clean_normalize_whitespace)
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)

# Determine numeric types for each column
col_type = {
    'Year': 'int',
    'Nominal GDP(in bil. US-Dollar)': 'float',
    'GDP per capita(in US-Dollar)': 'int',
    'GDP growth(real)': 'float',
    'Inflation rate(in percent)': 'float',
    'Unemployment (in percent)': 'float',
    'Budget balance(in % of GDP)[107]': 'float',
    'Government debt held by public(in % of GDP)[108]': 'float',
    'Current account balance(in % of GDP)': 'float'
}

# Values to replace
clean_dict = {'%': '', '−': '-', '\(est\)': ''}

# Replace values and convert to numeric values
df_GDP = df_GDP.replace(clean_dict, regex=True).replace({
    '-n/a ': np.nan
}).astype(col_type)
Summary
The pandas read_html()
function is useful for quickly parsing HTML tables in pages - especially in Wikipedia pages. By the nature of HTML, the data is frequently not going to be as clean as you might need and cleaning up all the stray unicode characters can be time consuming. This article showed several techniques you can use to clean the data and convert it to the proper numeric format. If you find yourself needing to scrape some Wikipedia or other HTML tables, these tips should save you some time.
If this is helpful to you or you have other tips, feel free to let me know in the comments.
Source: https://pbpython.com/pandas-html-table.html