Tag: BeautifulSoup

Web Scraping of O’Reilly Software Architecture Conference 2020 New York Using Python

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Software Architecture Conference covers the full range of topics in the software architecture discipline, including leadership and business skills, product management, and domain-driven design. This web scraping script automatically traverses the entire proceedings page, collects all links to the PDF and PPTX documents, and downloads those documents as part of the scraping process. The script was run in the Google Colaboratory environment and can be adapted to run in any Python environment by removing the Colab-specific configuration.
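The link-collection step described above might look roughly like the following. This is a minimal sketch, not the script from the post: the function name collect_document_links, the sample HTML, and the example.com URLs are all made up for illustration, and the real page markup may need different handling.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# File extensions treated as downloadable documents.
DOC_EXTENSIONS = (".pdf", ".pptx")


def collect_document_links(html, base_url):
    """Return absolute, de-duplicated links to PDF/PPTX files found in html."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL.
        absolute = urljoin(base_url, anchor["href"])
        # Compare only the path so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(DOC_EXTENSIONS) and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical sample markup standing in for a proceedings page.
sample_html = """
<ul>
  <li><a href="/files/keynote.pdf">Keynote</a></li>
  <li><a href="https://example.com/slides/talk.PPTX?dl=1">Talk</a></li>
  <li><a href="/schedule">Schedule</a></li>
  <li><a href="/files/keynote.pdf">Keynote (again)</a></li>
</ul>
"""

found = collect_document_links(sample_html, "https://example.com/proceedings")
for link in found:
    print(link)
```

Matching on the URL path rather than the raw href keeps links with query strings or upper-case extensions from being missed.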

Starting URLs: https://conferences.oreilly.com/software-architecture/sa-ny/public/schedule/proceedings

The source code and HTML output are available on GitHub.

Web Scraping of Data.gov Dataset Catalog Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the BeautifulSoup module.

INTRODUCTION: Data.gov is a government data repository website managed and hosted by the U.S. General Services Administration. The purpose of this exercise is to practice web scraping by gathering the dataset entries from Data.gov’s web pages. This iteration of the script automatically traverses the web pages to capture all dataset entries and store all captured information in a JSON output file.
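As a rough illustration of capturing dataset entries and serializing them to JSON, consider the sketch below. The markup it parses (a div with class dataset-content holding an h3 link and an optional notes div) is an assumption made for the example and may not match the live catalog pages; the function name parse_dataset_entries and the sample pages are likewise hypothetical.

```python
import json

from bs4 import BeautifulSoup


def parse_dataset_entries(html):
    """Extract title, URL, and description from each assumed dataset block."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    # Assumed structure: one <div class="dataset-content"> per dataset.
    for block in soup.find_all("div", class_="dataset-content"):
        heading = block.find("h3")
        link = heading.find("a") if heading else None
        if link is None:
            continue  # Skip blocks without a title link.
        notes = block.find("div", class_="notes")
        entries.append({
            "title": link.get_text(strip=True),
            "url": link.get("href", ""),
            "description": notes.get_text(strip=True) if notes else "",
        })
    return entries


# Hypothetical HTML for two catalog pages, standing in for fetched responses.
pages = [
    '<div class="dataset-content"><h3><a href="/dataset/a">Dataset A</a></h3>'
    '<div class="notes">First entry</div></div>',
    '<div class="dataset-content"><h3><a href="/dataset/b">Dataset B</a></h3></div>',
]

# Accumulate entries across pages, then serialize them as one JSON document.
all_entries = [entry for page in pages for entry in parse_dataset_entries(page)]
output_json = json.dumps(all_entries, indent=2)
print(output_json)
```

Writing output_json to disk (for example with pathlib.Path.write_text) would produce the JSON output file the post mentions.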

Starting URLs: https://catalog.data.gov/dataset

The source code and HTML output are available on GitHub.

Web Scraping of AWS re:Invent 2019 Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: AWS re:Invent is a learning conference featuring keynote announcements, training and certification opportunities, a partner expo, and access to more than 2,500 technical sessions. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads those documents as part of the scraping process.
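The download step could be sketched as follows. This version uses only the standard library (urllib.request) as a stand-in; the original script may well use a different HTTP client. The helper names local_filename and download_document, the downloads directory, and the fallback name download.bin are assumptions for illustration.

```python
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def local_filename(url):
    """Derive a local file name from the last path segment of a URL."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    # Fall back to a placeholder when the URL path has no file name.
    return name or "download.bin"


def download_document(url, dest_dir="downloads"):
    """Stream one document to dest_dir and return the path written."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / local_filename(url)
    # Copy in chunks so large slide decks are not held fully in memory.
    with urlopen(url, timeout=30) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target


# The file name logic can be checked without touching the network.
print(local_filename("https://example.com/slides/My%20Talk.pptx?dl=1"))
```

Calling download_document on each collected link would fetch the files; pausing briefly between requests is a common courtesy to the host.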

Starting URLs: https://aws.amazon.com/events/events-content/?awsf.filter-series=event-series%23reinvent

The source code and HTML output are available on GitHub.

Web Scraping of NeurIPS Conference 2019 Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing systems, including research on their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field that benefits from a combined view of biological, physical, mathematical, and computational sciences. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads those documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

The source code and HTML output are available on GitHub.

Web Scraping of O’Reilly Software Architecture Conference 2019 Berlin Using Python

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Software Architecture Conference covers the full range of topics in the software architecture discipline, including leadership and business skills, product management, and domain-driven design. This web scraping script automatically traverses the entire proceedings page, collects all links to the PDF and PPTX documents, and downloads those documents as part of the scraping process. The script was run in the Google Colaboratory environment and can be adapted to run in any Python environment by removing the Colab-specific configuration.

Starting URLs: https://conferences.oreilly.com/software-architecture/sa-eu-2019/public/schedule/proceedings

The source code and HTML output are available on GitHub.