Building a Python Web Scraping Project with AWS DocumentDB and Cloud9


In today’s data-driven world, web scraping has become an essential technique for extracting valuable information from websites. In this blog post, we will explore a Python-based web scraping project that collects baseball player data from an ESPN MLB stats website. We’ll also dive into how we leveraged AWS DocumentDB for data storage and utilized Cloud9 as our integrated development environment.

Project Overview

The task was to select two related data sources: one had to be a public API, and the other could be any data source of my choice, such as another public API, a web scraping target, or a static file. The chosen data sources needed to provide relevant information for solving a problem.

As a sports fan, I’d love to stay up to date with the latest action, but unfortunately I’m too busy to watch sports all day, so I usually end up checking scores and highlights online the next day. I thought this project could provide a helpful solution that I would actually want to use.

My goal was to create a web scraping application that retrieves player names, summaries, and statistics from the ESPN MLB stats website. The application is built in Python and uses the popular BeautifulSoup library for parsing HTML. I used the Cloud9 IDE provided by Amazon Web Services (AWS) for development and testing.

Data Collection

The project consists of two main functions: scrape_player_names() and scrape_player_stats(). The scrape_player_names() function scrapes the website to extract player names and then hits the Wikipedia API to retrieve a summary for each player. The scrape_player_stats() function loops through each player, scrapes that player’s row of stats, and adds them to the player’s dictionary. Both functions use the BeautifulSoup library to parse the HTML and retrieve the desired data. For a more performant solution I would combine these functions, but I liked keeping them separate for testing and readability. See the full project here.
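
The methods below live on a scraper class whose constructor and page-fetching helper aren’t shown in this post. Here’s a minimal sketch of what that setup might look like, assuming wk is the wikipedia package, scrape_and_parse_html() fetches the stats page with requests and parses it with BeautifulSoup, and the class name, constructor arguments, and connection string are placeholders:

    import datetime
    import logging

    import requests
    import wikipedia as wk  # assumed alias for the wikipedia package behind wk.summary()
    from bs4 import BeautifulSoup
    from pymongo import MongoClient


    class PlayerScraper:  # placeholder class name
        def __init__(self, stats_url, connection_string):
            # URL of the ESPN MLB stats page to scrape
            self.stats_url = stats_url
            self.parsed_html = None
            # DocumentDB (MongoDB-compatible) collection that will hold the player documents
            self.collection = MongoClient(connection_string)["mlb"]["players"]

        def scrape_and_parse_html(self):
            # Fetch and parse the stats page once; reuse the cached soup on later calls
            if self.parsed_html is None:
                response = requests.get(self.stats_url, timeout=30)
                response.raise_for_status()
                self.parsed_html = BeautifulSoup(response.text, "html.parser")

With that setup in place, the two scraping methods look like this: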

    def scrape_player_names(self):
        # Ensure the parsed HTML exists and declare the list that will hold the player records
        self.scrape_and_parse_html()
        player_names = []

        # Extract the table rows that contain player names
        name_rows = self.parsed_html.select("#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > table > tbody > tr")

        for row in name_rows:
            name = row.select('a.AnchorLink')[0].text
            # Team abbreviation is parsed here but not currently stored on the player record
            team = row.select('span.athleteCell__teamAbbrev')[0].text

            try:
                # Look up a short Wikipedia summary for the player
                summary = wk.summary(f"{name}, baseball player")
                player_names.append({
                    "name": name,
                    "summary": summary
                })
            except Exception as e:
                player_names.append({
                    "name": name,
                    "summary": 'Sorry - no summary is available.'
                })
                print(f"Unable to retrieve summary for {name}: {e}")

        return player_names
        
    def scrape_player_stats(self, player_names):
        current_date = datetime.datetime.now()

        # Column headers from the scrollable half of the stats table
        stat_categories = self.parsed_html.select("#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > div > div.Table__Scroller > table > thead > tr > th.Table__TH")

        # Loop through the players in ranked order
        for index, player in enumerate(player_names):
            # Grab the row of stat cells that corresponds to this player
            stat_cells = self.parsed_html.select(f"#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > div > div.Table__Scroller > table > tbody > tr:nth-child({index + 1}) > td")

            # Add each stat to the player's dict, keyed by its column header
            for stat_index, category in enumerate(stat_categories):
                try:
                    player_names[index][category.text] = stat_cells[stat_index].text
                except Exception as e:
                    print(f"Error scraping player stats: {str(e)}")
                    logging.error('Error scraping player stats', exc_info=True)
                    continue

            # Stamp the record with the scrape date and the player's rank
            player_names[index]['date'] = current_date.strftime('%Y-%m-%d %H:%M:%S')
            player_names[index]['rank'] = index + 1

            try:
                # Add the completed player document to the DB
                result = self.collection.insert_one(player_names[index])
                print(f"{index}. RESULT: {result} end \n")
            except Exception as e:
                print(f'{index}. Failure adding result to database: {e}. See player: {player}')
                continue

        return player_names
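
Tying it together, a daily run is just a matter of calling the two methods in order. A minimal sketch, reusing the placeholder class and connection string from the setup above (the stats URL here is illustrative as well):

    if __name__ == "__main__":
        # Placeholder URL and connection string - substitute your own values
        scraper = PlayerScraper(
            stats_url="https://www.espn.com/mlb/stats/player",
            connection_string=(
                "mongodb://user:password@docdb-endpoint:27017/"
                "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
            ),
        )
        players = scraper.scrape_player_names()
        scraper.scrape_player_stats(players)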

Data Storage with AWS DocumentDB

To store the collected player data, I opted for AWS DocumentDB, a fully managed, MongoDB-compatible document database service. DocumentDB offers the scalability, flexibility, and reliability the application needs. I established a connection to DocumentDB using the pymongo library, which allows seamless interaction with the database.
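
A minimal connection sketch looks something like the following; the cluster endpoint, credentials, and database/collection names are placeholders, and DocumentDB connections require TLS with the Amazon RDS CA bundle (global-bundle.pem) and retryWrites=false:

    import os

    from pymongo import MongoClient

    # Placeholder endpoint and credentials - supply your own cluster details
    connection_string = (
        f"mongodb://{os.environ['DOCDB_USER']}:{os.environ['DOCDB_PASSWORD']}"
        "@my-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017/"
        "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
    )

    client = MongoClient(connection_string)
    collection = client["mlb"]["players"]

    # Quick sanity check that the connection and collection are reachable
    print(collection.estimated_document_count())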

Utilizing Cloud9 IDE

Cloud9 provided a robust and user-friendly IDE for development. It offers a pre-configured environment with the necessary tools and dependencies readily available. I used Cloud9’s integrated terminal to create a cron job that runs the app every 24 hours.
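
For example, a crontab entry along these lines (added with crontab -e in the Cloud9 terminal) runs the scraper once a day at 6:00 AM; the Python path, script path, and log file are placeholders:

    # m h  dom mon dow  command
    0 6 * * * /usr/bin/python3 /home/ec2-user/environment/scraper/main.py >> /home/ec2-user/environment/scraper/cron.log 2>&1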

Conclusion

Building a Python web scraping project with AWS DocumentDB and Cloud9 provided a robust and scalable solution for collecting and storing baseball player data. By leveraging the power of web scraping techniques, the flexibility of AWS DocumentDB, and the convenience of the Cloud9 IDE, I was able to develop and deploy a reliable application. This project showcases the seamless integration of various AWS services, empowering developers to build efficient and scalable web scraping applications.

Whether you’re a sports enthusiast, data analyst, or developer looking to harness the power of web scraping, this project serves as a solid foundation. By following the steps outlined here, along with this guide on connecting your AWS environment, you can create your own data collection application and leverage AWS services to enhance its functionality and scalability. Check out the GitHub link here, and happy scraping!