How to Read Html File Into Pandas by Divisions

Introduction

The pandas read_html() function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site's HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it. In this article, I will discuss how to use pandas read_html() to read and clean several Wikipedia HTML tables so that you can use them for further numeric analysis.

Basic Usage

For the first example, we will try to parse this table from the Politics section on the Minnesota wiki page.

MN Voting History

The basic usage of pandas read_html is pretty simple and works well on many Wikipedia pages since the tables are not complicated. To get started, I am including some extra imports we will use for data cleaning in the more complicated examples:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from unicodedata import normalize

    table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')

The unique point here is that table_MN is a list of all the tables on the page:

    print(f'Total tables: {len(table_MN)}')

With 38 tables, it can be challenging to find the one you need. To make the table selection easier, use the match parameter to select a subset of tables. We can use the caption "Election results from statewide races" to select the table:

    table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match='Election results from statewide races')
    len(table_MN)

    df = table_MN[0]
    df.head()

       Year     Office    GOP    DFL Others
    0  2018   Governor  42.4%  53.9%   3.7%
    1  2018    Senator  36.2%  60.3%   3.4%
    2  2018    Senator  42.4%  53.0%   4.6%
    3  2016  President  44.9%  46.4%   8.6%
    4  2014   Governor  44.5%  50.1%   5.4%

Pandas makes it easy to read in the table and also handles the year column that spans multiple rows. This is an example where it is easier to use pandas than to try to scrape it all yourself.

Overall, this looks ok until we look at the data types with df.info():

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 24 entries, 0 to 23
    Data columns (total 5 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Year    24 non-null     int64
     1   Office  24 non-null     object
     2   GOP     24 non-null     object
     3   DFL     24 non-null     object
     4   Others  24 non-null     object
    dtypes: int64(1), object(4)
    memory usage: 1.1+ KB

We need to convert the GOP, DFL and Others columns to numeric values if we want to do any analysis.

If we try:

    df['GOP'].astype('float')

We get an error:

    ValueError: could not convert string to float: '42.4%'

The most likely culprit is the %. We can get rid of it using the pandas replace() function. I covered this in some detail in a previous article.

    df['GOP'].replace({'%': ''}, regex=True).astype('float')

Which looks good:

    0     42.4
    1     36.2
    2     42.4
    3     44.9
         <...>
    21    63.3
    22    49.1
    23    31.9
    Name: GOP, dtype: float64

Note that I had to use the regex=True parameter for this to work since the % is a part of the string and not the full string value.
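To see why regex=True matters, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the Wikipedia table). Without regex=True, replace only swaps cells whose entire value equals the key. With it, the key is treated as a pattern and matched inside each string.

```python
import pandas as pd

# Illustrative values only; the last cell is exactly '%'
s = pd.Series(['42.4%', '36.2%', '%'])

# Default behavior: only a cell whose whole value is '%' is replaced
exact = s.replace({'%': ''})

# regex=True: '%' is treated as a pattern and removed wherever it occurs
partial = s.replace({'%': ''}, regex=True)

print(exact.tolist())
print(partial.tolist())
```

In the first result the percent signs inside '42.4%' and '36.2%' survive, which is why the astype('float') call would still fail without regex=True.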

Now, we can call replace on all the % values and convert to numbers using pd.to_numeric() and apply():

    df = df.replace({'%': ''}, regex=True)
    df[['GOP', 'DFL', 'Others']] = df[['GOP', 'DFL', 'Others']].apply(pd.to_numeric)
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 24 entries, 0 to 23
    Data columns (total 5 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Year    24 non-null     int64
     1   Office  24 non-null     object
     2   GOP     24 non-null     float64
     3   DFL     24 non-null     float64
     4   Others  24 non-null     float64
    dtypes: float64(3), int64(1), object(1)
    memory usage: 1.1+ KB

       Year     Office   GOP   DFL  Others
    0  2018   Governor  42.4  53.9     3.7
    1  2018    Senator  36.2  60.3     3.4
    2  2018    Senator  42.4  53.0     4.6
    3  2016  President  44.9  46.4     8.6
    4  2014   Governor  44.5  50.1     5.4
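If a column might contain values that are not numbers at all, pd.to_numeric() accepts errors='coerce', which turns anything it cannot parse into NaN instead of raising. The sketch below uses a small hand-made DataFrame. The 'n/a' cell is an assumption for illustration; the Minnesota table does not need it.

```python
import pandas as pd

# Hand-made sample; the 'n/a' cell is hypothetical
df = pd.DataFrame({
    'Year': [2018, 2016],
    'GOP': ['42.4%', 'n/a'],
    'DFL': ['53.9%', '46.4%'],
})

cols = ['GOP', 'DFL']

# Strip the percent signs first, then convert each column to numbers.
# Extra keyword arguments to apply() are passed through to pd.to_numeric.
stripped = df[cols].replace({'%': ''}, regex=True)
df[cols] = stripped.apply(pd.to_numeric, errors='coerce')

print(df.dtypes)
print(df)
```

The trade-off is that coercion hides bad values silently, so it is worth checking how many NaN values appear afterwards.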

This basic process works well. The next example is a little trickier.

More Advanced Data Cleaning

The previous example showed the basic concepts. Frequently more cleaning is needed. Here is an example that was a little trickier. This example continues to use Wikipedia but the concepts apply to any site that has data in an HTML table.

What if we wanted to parse the US GDP table shown below?

US GDP Table

This one was a little harder to use match to get only one table but matching on 'Nominal GDP' gets the table we want as the first one in the list.

    table_GDP = pd.read_html('https://en.wikipedia.org/wiki/Economy_of_the_United_States', match='Nominal GDP')
    df_GDP = table_GDP[0]
    df_GDP.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 41 entries, 0 to 40
    Data columns (total 9 columns):
     #   Column                                            Non-Null Count  Dtype
    ---  ------                                            --------------  -----
     0   Year                                              41 non-null     object
     1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
     2   GDP per capita(in US-Dollar)                      41 non-null     int64
     3   GDP growth(real)                                  41 non-null     object
     4   Inflation rate(in percent)                        41 non-null     object
     5   Unemployment (in percent)                         41 non-null     object
     6   Budget balance(in % of GDP)[107]                  41 non-null     object
     7   Government debt held by public(in % of GDP)[108]  41 non-null     object
     8   Current account balance(in % of GDP)              41 non-null     object
    dtypes: float64(1), int64(1), object(7)
    memory usage: 3.0+ KB

Not surprisingly we have some cleanup to do. We can try to remove the % like we did last time:

    df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')

Unfortunately we get this error:

    ValueError: could not convert string to float: '−5.9\xa0'

The issue here is that we have a hidden character, \xa0, that is causing some errors. This is a "non-breaking Latin1 (ISO 8859-1) space".

One option I played around with was directly removing the value using replace. It worked but I worried about whether or not it would break with other characters in the future.

After going down the unicode rabbit hole, I decided to use normalize to clean this value. I encourage you to read this article for more details on the rationale for my approach.

I have also found issues with extra spaces getting into the data in some of the other tables. I built a small function to clean all the text values. I hope others will find this helpful:

    from unicodedata import normalize

    def clean_normalize_whitespace(x):
        if isinstance(x, str):
            return normalize('NFKC', x).strip()
        else:
            return x
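To make the effect concrete, here is a small self-contained sketch that runs the same function on one string shaped like the problem cell (the sample string is my reconstruction for illustration). NFKC normalization maps the non-breaking space \xa0 to a regular space, and strip() removes whitespace at either end. Non-string values pass through unchanged.

```python
from unicodedata import normalize

def clean_normalize_whitespace(x):
    """Normalize unicode characters and strip leading/trailing whitespace."""
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    return x

# Sample cell: unicode minus sign, number, non-breaking space, percent sign
raw = '\u22125.9\xa0%'
cleaned = clean_normalize_whitespace(raw)

print(repr(raw))       # repr() makes the hidden \xa0 visible
print(repr(cleaned))   # the \xa0 is now an ordinary space
print(clean_normalize_whitespace(42))  # non-strings are returned as-is
```

Note that the unicode minus sign is not changed by NFKC, so it still has to be handled separately.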

I can run this function on the entire DataFrame using applymap :

    df_GDP = df_GDP.applymap(clean_normalize_whitespace)

applymap performance

Be cautious about using applymap. This function is very slow so you should be judicious in using it.

The applymap function is a very inefficient pandas function. You should not use it very often but in this case, the DataFrame is small and cleaning like this is tricky so I think it is a useful trade-off.
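If applymap turns out to be too slow on a larger table, one alternative is to touch only the text columns with the vectorized string accessor. This is a sketch on a toy DataFrame, not a step the walkthrough requires; the column names and the text_cols list are assumptions for illustration. Also note that in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which takes the same function.

```python
import pandas as pd

# Toy data with non-breaking spaces and stray padding (hypothetical values)
df = pd.DataFrame({
    'Year': ['2020\xa0(est)', ' 2019 '],
    'GDP growth': ['\u22125.9\xa0%', '2.2\xa0%'],
    'GDP per capita': [57589, 64674],   # already numeric, left untouched
})

# Only the columns known to hold text are cleaned
text_cols = ['Year', 'GDP growth']
for col in text_cols:
    # Series.str.normalize applies unicode normalization to every string
    df[col] = df[col].str.normalize('NFKC').str.strip()

print(df)
```

Skipping the numeric columns avoids calling a Python function on cells that cannot contain hidden characters in the first place.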

One thing that applymap misses is the columns. Let's look at one column in more detail:

    'Government debt held by public(in\xa0% of GDP)[108]'

We have that dreaded \xa0% in the column names. There are a couple of ways we could go about cleaning the columns but I'm going to use clean_normalize_whitespace() on the columns by converting the columns to a series and using apply to run the function. Future versions of pandas may make this a little easier.

    df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
    df_GDP.columns[7]

    'Government debt held by public(in % of GDP)[108]'
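Another way to clean the column names, shown here as a sketch rather than the approach used in this article, is DataFrame.rename, which accepts a function and applies it to every column label. The empty DataFrame below is only a stand-in for the real table.

```python
from unicodedata import normalize
import pandas as pd

def clean_normalize_whitespace(x):
    """Normalize unicode characters and strip leading/trailing whitespace."""
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    return x

# Stand-in frame with one column name containing a non-breaking space
df = pd.DataFrame(columns=[
    'Year',
    'Government debt held by public(in\xa0% of GDP)[108]',
])

# rename() calls the function once per column label
df = df.rename(columns=clean_normalize_whitespace)

print(list(df.columns))
```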

Now we have some of the hidden characters cleaned out. What next?

Let's try it out again:

    df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')

    ValueError: could not convert string to float: '−5.9 '

This one is really tricky. If you look really closely, you might be able to tell that the − looks a little different than the - . It's hard to see but there is actually a difference between the unicode dash and minus. Ugh.
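Since the two characters are nearly identical on screen, one way to tell them apart is to print their code points and official unicode names. This is a minimal sketch using only the standard library; the helper parses_as_float is a name I made up for the demonstration.

```python
import unicodedata

minus = '\u2212'   # the character that appears in the Wikipedia table
hyphen = '-'       # the ASCII character that float() understands

for ch in (minus, hyphen):
    # Show the code point in U+XXXX form next to the unicode name
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

def parses_as_float(text):
    """Return True if float() accepts the text, False otherwise."""
    try:
        float(text)
        return True
    except ValueError:
        return False

print(parses_as_float(hyphen + '5.9'))
print(parses_as_float(minus + '5.9'))
```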

Fortunately, we can use replace to clean that up too:

    df_GDP['GDP growth(real)'].replace({'%': '', '−': '-'}, regex=True).astype('float')

    0    -5.9
    1     2.2
    2     3.0
    3     2.3
    4     1.7
         <...>
    38   -1.8
    39    2.6
    40   -0.2
    Name: GDP growth(real), dtype: float64

One other column we need to look at is the Year column. For 2020, it contains "2020 (est)" which we want to get rid of. Then convert the column to an int. I can add to the dictionary but have to escape the parentheses since they are special characters in a regular expression:

    df_GDP['Year'].replace({'%': '', '−': '-', r'\(est\)': ''}, regex=True).astype('int')

    0     2020
    1     2019
    2     2018
    3     2017
    4     2016
         <...>
    40    1980
    Name: Year, dtype: int64
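If you would rather not escape special characters by hand, the standard library's re.escape() can build the pattern from the literal text. The sketch below uses a short made-up Series in place of the real Year column and adds an explicit strip() so the trailing space left behind by the replacement does not matter.

```python
import re
import pandas as pd

# Stand-in for the Year column
years = pd.Series(['2020 (est)', '2019', '2018'])

# re.escape adds the backslashes needed to match the text literally
pattern = re.escape('(est)')

cleaned = (
    years.replace({pattern: ''}, regex=True)
         .str.strip()
         .astype('int')
)

print(pattern)
print(cleaned.tolist())
```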

Before we wrap it up and assign these values back to our DataFrame, there is one other item to discuss. Some of these columns should be integers and some are floats. If we use pd.to_numeric() we don't have that much flexibility. Using astype() we can control the numeric type but we don't want to have to manually type this for each column.

The astype() function can take a dictionary of column names and data types. This is really useful and I did not know this until I wrote this article. Here is how we can define the column data type mapping:

    col_type = {
        'Year': 'int',
        'Nominal GDP(in bil. US-Dollar)': 'float',
        'GDP per capita(in US-Dollar)': 'int',
        'GDP growth(real)': 'float',
        'Inflation rate(in percent)': 'float',
        'Unemployment (in percent)': 'float',
        'Budget balance(in % of GDP)[107]': 'float',
        'Government debt held by public(in % of GDP)[108]': 'float',
        'Current account balance(in % of GDP)': 'float'
    }

Here's a quick hint. Typing this dictionary is slow. Use this shortcut to build up a dictionary of the columns with float as the default value:

    dict.fromkeys(df_GDP.columns, 'float')

    {'Year': 'float',
     'Nominal GDP(in bil. US-Dollar)': 'float',
     'GDP per capita(in US-Dollar)': 'float',
     'GDP growth(real)': 'float',
     'Inflation rate(in percent)': 'float',
     'Unemployment (in percent)': 'float',
     'Budget balance(in % of GDP)[107]': 'float',
     'Government debt held by public(in % of GDP)[108]': 'float',
     'Current account balance(in % of GDP)': 'float'}
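Putting the shortcut and astype() together, a possible workflow is to start every column at float and then override only the integer columns. This sketch uses a tiny stand-in DataFrame with shortened, hypothetical column names rather than the full GDP table.

```python
import pandas as pd

# Stand-in data: every value starts out as a string
df = pd.DataFrame({
    'Year': ['2020', '2019'],
    'GDP per capita': ['57589', '64674'],
    'GDP growth': ['-5.9', '2.2'],
})

# Default every column to float, then override the integer columns
col_type = dict.fromkeys(df.columns, 'float')
col_type.update({'Year': 'int', 'GDP per capita': 'int'})

# astype() accepts the column -> dtype mapping directly
typed = df.astype(col_type)

print(typed.dtypes)
```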

I also created a single dictionary with the values to replace:

    clean_dict = {'%': '', '−': '-', r'\(est\)': ''}

Now we can call replace on this DataFrame, convert to the desired type and get our clean numeric values:

    df_GDP = df_GDP.replace(clean_dict, regex=True).replace({'-n/a ': np.nan}).astype(col_type)
    df_GDP.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 41 entries, 0 to 40
    Data columns (total 9 columns):
     #   Column                                            Non-Null Count  Dtype
    ---  ------                                            --------------  -----
     0   Year                                              41 non-null     int64
     1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
     2   GDP per capita(in US-Dollar)                      41 non-null     int64
     3   GDP growth(real)                                  41 non-null     float64
     4   Inflation rate(in percent)                        41 non-null     float64
     5   Unemployment (in percent)                         41 non-null     float64
     6   Budget balance(in % of GDP)[107]                  40 non-null     float64
     7   Government debt held by public(in % of GDP)[108]  41 non-null     float64
     8   Current account balance(in % of GDP)              40 non-null     float64
    dtypes: float64(7), int64(2)
    memory usage: 3.0 KB

Which looks like this now:

      | Year | Nominal GDP(in bil. US-Dollar) | GDP per capita(in US-Dollar) | GDP growth(real) | Inflation rate(in percent) | Unemployment (in percent) | Budget balance(in % of GDP)[107] | Government debt held by public(in % of GDP)[108] | Current account balance(in % of GDP)
    0 | 2020 | 20234.0 | 57589 | -5.9 | 0.62 | 11.1 | NaN  | 79.9 | NaN
    1 | 2019 | 21439.0 | 64674 |  2.2 | 1.80 |  3.5 | -4.6 | 78.9 | -2.5
    2 | 2018 | 20580.2 | 62869 |  3.0 | 2.40 |  3.9 | -3.8 | 77.8 | -2.4
    3 | 2017 | 19519.4 | 60000 |  2.3 | 2.10 |  4.4 | -3.4 | 76.1 | -2.3
    4 | 2016 | 18715.0 | 57878 |  1.7 | 1.30 |  4.9 | -3.1 | 76.4 | -2.3

Just to prove it works, we can plot the data too:

    # On matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
    plt.style.use('seaborn-whitegrid')
    df_GDP.plot.line(x='Year', y=['Inflation rate(in percent)', 'Unemployment (in percent)'])

US GDP Chart

If you are closely following along, you may have noticed the use of a chained replace call:

    .replace({'-n/a ': np.nan})

The reason I put that in there is that I could not figure out how to get the n/a cleaned using the first dictionary replace. I think the issue is that I could not predict the order in which this data would get cleaned so I decided to execute the replace in two stages.
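For comparison, here is a sketch of the two-stage replace next to one possible alternative: letting pd.to_numeric() with errors='coerce' turn whatever is left over into NaN. The sample strings are my guess at what the raw cells look like after whitespace normalization, so treat them as assumptions.

```python
import numpy as np
import pandas as pd

# Assumed cell contents after whitespace normalization (unicode minus included)
s = pd.Series(['\u22124.6 %', '\u2212n/a %', '3.0 %'])

# Stage 1: regex replacements inside each string
step1 = s.replace({'%': '', '\u2212': '-'}, regex=True)

# Stage 2 (as in the article): exact-match replace of the leftover marker
two_stage = step1.replace({'-n/a ': np.nan}).astype('float')

# Alternative: strip whitespace and coerce anything non-numeric to NaN
coerced = pd.to_numeric(step1.str.strip(), errors='coerce')

print(two_stage.tolist())
print(coerced.tolist())
```

The coerce route always yields floats, so integer columns would still need a separate astype() call.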

I'm confident that if there is a better way someone will point it out in the comments.

Full Solution

Here is a compact example of everything we have done. Hopefully this is useful to others that try to ingest data from HTML tables and use them in a pandas DataFrame:

    import pandas as pd
    import numpy as np
    from unicodedata import normalize


    def clean_normalize_whitespace(x):
        """ Normalize unicode characters and strip trailing spaces
        """
        if isinstance(x, str):
            return normalize('NFKC', x).strip()
        else:
            return x


    # Read in the Wikipedia page and get the DataFrame
    table_GDP = pd.read_html(
        'https://en.wikipedia.org/wiki/Economy_of_the_United_States',
        match='Nominal GDP')
    df_GDP = table_GDP[0]

    # Clean up the DataFrame and Columns
    # (applymap is deprecated in pandas 2.1+; DataFrame.map is the replacement)
    df_GDP = df_GDP.applymap(clean_normalize_whitespace)
    df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)

    # Determine numeric types for each column
    col_type = {
        'Year': 'int',
        'Nominal GDP(in bil. US-Dollar)': 'float',
        'GDP per capita(in US-Dollar)': 'int',
        'GDP growth(real)': 'float',
        'Inflation rate(in percent)': 'float',
        'Unemployment (in percent)': 'float',
        'Budget balance(in % of GDP)[107]': 'float',
        'Government debt held by public(in % of GDP)[108]': 'float',
        'Current account balance(in % of GDP)': 'float'
    }

    # Values to replace
    clean_dict = {'%': '', '−': '-', r'\(est\)': ''}

    # Replace values and convert to numeric values
    df_GDP = df_GDP.replace(clean_dict, regex=True).replace({
        '-n/a ': np.nan
    }).astype(col_type)

Summary

The pandas read_html() function is useful for quickly parsing HTML tables in pages - especially in Wikipedia pages. By the nature of HTML, the data is frequently not going to be as clean as you might need and cleaning up all the stray unicode characters can be time consuming. This article showed several techniques you can use to clean the data and convert it to the proper numeric format. If you find yourself needing to scrape some Wikipedia or other HTML tables, these tips should save you some time.

If this is helpful to you or you have other tips, feel free to let me know in the comments.


Source: https://pbpython.com/pandas-html-table.html
