In today’s data-driven world, web scraping has become an essential technique for extracting valuable information from websites. In this blog post, we will explore a Python-based web scraping project that collects baseball player data from an ESPN MLB stats website. We’ll also dive into how we leveraged AWS DocumentDB for data storage and utilized Cloud9 as our integrated development environment.
Project Overview
The task was to select two related data sources that I found interesting: one had to be a public API, and the other could be any data source of my choice, such as another public API, a web scraping target, or a static file. The chosen sources needed to provide relevant information for solving a problem.
As a sports fan, I would love to stay up to date with the latest action, but unfortunately I’m too busy to watch sports all day, so I usually end up checking scores and highlights online the next day. I thought this project could provide a helpful solution that I would actually want to use.
My goal was to create a web scraping application that retrieves player names, summaries, and statistics from the ESPN MLB stats website. The application is built using Python and employs the popular BeautifulSoup library for parsing HTML. We used the Cloud9 IDE provided by Amazon Web Services (AWS) for development and testing.
Data Collection
The project consists of two main functions: scrape_player_names() and scrape_player_stats(). The scrape_player_names() function scrapes the website to extract player names and then hits the Wikipedia API to retrieve each player’s summary. The scrape_player_stats() function loops through each player, scrapes that player’s row of stats, and adds each stat to the player’s dictionary. Both functions rely on the BeautifulSoup library to parse the HTML and retrieve the desired data. A more performant solution would combine them, but I liked keeping them separate for testing and readability. See the full project here.
def scrape_player_names(self):
    # ensure the parsed html exists and declare the list that will hold all player names
    self.scrape_and_parse_html()
    player_names = []

    # Extract each player's row from the stats table
    name_rows = self.parsed_html.select("#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > table > tbody > tr")

    for row in name_rows:
        name = row.select('a.AnchorLink')[0].text
        team = row.select('span.athleteCell__teamAbbrev')[0].text  # team abbreviation is also available per row
        try:
            # wk is the wikipedia package, imported at module level
            summary = wk.summary(f"{name}, baseball player")  # Search Summary
            player_names.append({
                "name": name,
                "summary": summary
            })
        except Exception as e:
            # Fall back to a placeholder summary so the player is still recorded
            player_names.append({
                "name": name,
                "summary": 'Sorry - no summary is available.'
            })
            print(f"Unable to retrieve summary for {name}: {e}")
            continue
    return player_names
def scrape_player_stats(self, player_names):
    current_date = datetime.datetime.now()
    stat_categories = self.parsed_html.select("#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > div > div.Table__Scroller > table > thead > tr > th.Table__TH")

    # first loop thru the player names, stamping each one with a rank and date
    for index, player in enumerate(player_names):
        player_names[index]['date'] = current_date.strftime('%Y-%m-%d %H:%M:%S')
        player_names[index]['rank'] = index + 1

        # then loop thru the row of stats, adding each one to the player's object
        for stat_index, category in enumerate(stat_categories):
            try:
                # stat_cells holds every cell in this player's row
                stat_cells = self.parsed_html.select(f"#fittPageContainer > div:nth-child(3) > div > div > section > div > div:nth-child(4) > div.ResponsiveTable.ResponsiveTable--fixed-left.mt4.Table2__title--remove-capitalization > div > div > div.Table__Scroller > table > tbody > tr:nth-child({index + 1}) > td")
                player_names[index][category.text] = stat_cells[stat_index].text
            except Exception as e:
                print(f"Error scraping player stats: {str(e)}")
                logging.error('Error scraping player stats', exc_info=True)
                continue

        try:
            # add the finished dict to the DB
            result = self.collection.insert_one(player_names[index])
            print(f"{index}. RESULT: {result} end \n")
        except Exception as e:
            print(f'{index}. Failure adding result to database: {e}. See player: {player}')
            continue
    return player_names
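The scrape_and_parse_html() helper called at the top of scrape_player_names() isn’t shown in this excerpt. A minimal sketch, assuming the class keeps the ESPN stats URL on a self.url attribute (an assumption) and caches the parsed soup on self.parsed_html as the methods above expect, could look like this:

import requests
from bs4 import BeautifulSoup

def scrape_and_parse_html(self):
    # Fetch the stats page once and cache the parsed soup on the instance
    if getattr(self, "parsed_html", None) is not None:
        return
    response = requests.get(self.url, timeout=30)  # self.url: the ESPN MLB stats page (assumption)
    response.raise_for_status()
    self.parsed_html = BeautifulSoup(response.text, "html.parser")

From there, the entry point simply chains the two methods; the class name below is hypothetical:

scraper = MLBStatsScraper()
players = scraper.scrape_player_names()
scraper.scrape_player_stats(players)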
Data Storage with AWS DocumentDB
To store the collected player data, I opted for AWS DocumentDB, a fully managed MongoDB-compatible document database service. DocumentDB offers the scalability, flexibility, and reliability required for our web scraping application. We established a connection to DocumentDB using the pymongo library, allowing seamless interaction with the database.
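For reference, here is a minimal connection sketch with pymongo, assuming the cluster endpoint and credentials live in environment variables and Amazon’s CA bundle (global-bundle.pem) sits next to the script; the database and collection names are placeholders:

import os
import pymongo

client = pymongo.MongoClient(
    host=os.environ["DOCDB_ENDPOINT"],   # e.g. your-cluster.cluster-xxxx.us-east-1.docdb.amazonaws.com
    port=27017,
    username=os.environ["DOCDB_USER"],
    password=os.environ["DOCDB_PASSWORD"],
    tls=True,
    tlsCAFile="global-bundle.pem",       # Amazon-provided CA bundle, downloaded separately
    retryWrites=False,                   # DocumentDB does not support retryable writes
)

db = client["mlb"]                       # placeholder database name
collection = db["player_stats"]          # placeholder collection name

This collection handle is what the scraper uses when scrape_player_stats() calls self.collection.insert_one() for each player.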
Utilizing Cloud9 IDE
Cloud9 provided a robust and user-friendly IDE for our development needs. It offered a pre-configured environment with all the necessary tools and dependencies readily available. We leveraged Cloud9’s integrated terminal and created a cron job to run the app every 24 hours.
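A crontab entry along these lines would kick off the scraper once a day; the schedule, script path, and log file below are assumptions, so adjust them to your own Cloud9 environment:

# run the scraper every day at 6:00 AM (paths are hypothetical)
0 6 * * * cd /home/ec2-user/environment/mlb-scraper && python3 app.py >> scraper.log 2>&1

Adding this line via crontab -e on the Cloud9 instance keeps the data refreshing without any manual intervention.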
Conclusion
Building a Python web scraping project with AWS DocumentDB and Cloud9 provided a robust and scalable solution for collecting and storing baseball player data. By combining web scraping techniques, the flexibility of AWS DocumentDB, and the convenience of the Cloud9 IDE, we were able to develop and deploy a reliable application. This project showcases the seamless integration of various AWS services, empowering developers to build efficient and scalable web scraping applications.
Whether you’re a sports enthusiast, data analyst, or developer looking to harness the power of web scraping, this project serves as a solid foundation. By following the steps outlined here, along with this guide on connecting your AWS environment, you can create your own data collection application and leverage AWS services to enhance its functionality and scalability. Check out the GitHub repo here, and happy scraping!