Project #1: Bookstore Web Scraper

You need actual data to work with, and web scraping is one of the ways you can collect it.

In this post, I write about a Python web scraper I coded that collects data from a fictional bookstore.

Resources used

For this project, I used the following Python libraries:

  • Beautiful Soup 4
  • Pandas
  • Requests

I followed Corey Schafer's Python web scraping tutorial, but I tweaked the code a little; my code can be found on GitHub.

This is what the output looks like.

books.png
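
Here is a minimal sketch of how such a scraper fits together. I'm assuming the fictional bookstore is books.toscrape.com, a site built specifically for scraping practice; my actual code on GitHub differs in the details, but the approach is the same.

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://books.toscrape.com/"

# Download the page and parse the HTML into a document tree
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# On this site, each book sits inside an <article class="product_pod"> tag
books = []
for article in soup.find_all("article", class_="product_pod"):
    title = article.h3.a["title"]                          # full title lives in the link's title attribute
    price = article.find("p", class_="price_color").text   # e.g. "£51.77"
    rating = article.find("p", class_="star-rating")["class"][1]  # e.g. "Three"
    books.append({"title": title, "price": price, "rating": rating})

# Collect the rows into a DataFrame for easy inspection and export
df = pd.DataFrame(books)
print(df.head())

Keeping each book as a small dictionary makes it trivial to hand the rows to Pandas at the end.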

Knowledge gained

In the course of coding this web scraper, I learned that:

  • Web scrapers are used to extract data from websites, while web crawlers are used to discover and follow links across the web.

  • Some websites are unfriendly to web scrapers. To check a site's policy, append "/robots.txt" to its base URL; the file lists which paths automated clients are asked not to access. For example, to check IMDB, a popular movie rating and review website:

imdb.com/robots.txt

The result:

#robots.txt for https://www.imdb.com properties
User-agent: *
Disallow: /OnThisDay
Disallow: /ads/
Disallow: /ap/
Disallow: /mymovies/
Disallow: /r/
Disallow: /register
Disallow: /registration/
Disallow: /search/name-text
Disallow: /search/title-text
Disallow: /find
Disallow: /find$
Disallow: /find/
Disallow: /tvschedule
Disallow: /updates
Disallow: /watch/_ajax/option
Disallow: /_json/video/mon
Disallow: /_json/getAdsForMediaViewer/
Disallow: /list/ls*/_ajax
Disallow: /*/*/rg*/mediaviewer/rm*/tr
Disallow: /*/rg*/mediaviewer/rm*/tr
Disallow: /*/mediaviewer/*/tr
Disallow: /title/tt*/mediaviewer/rm*/tr
Disallow: /name/nm*/mediaviewer/rm*/tr
Disallow: /gallery/rg*/mediaviewer/rm*/tr
Disallow: /tr/
Disallow: /title/tt*/watchoptions
Disallow: /search/title/?title_type=feature,tv_movie,tv_miniseries,documentary,short,video,tv_short&release_date=,2020-12-31&lists=%21ls538187658,%21ls539867036,%21ls538186228&view=simple&sort=num_votes,asc&aft

User-agent: Baiduspider
Disallow: /list/*
Disallow: /user/*

Yeah, IMDB clearly doesn't want scrapers on a lot of its pages. I suggest reading this article on Scrape.do to learn how best to go about web scraping.
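
If you'd rather not read robots.txt by hand, Python's standard library can run the check for you. Here's a small sketch using urllib.robotparser; the IMDB paths below are just illustrative, and some sites may refuse requests that use Python's default user agent.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.imdb.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch(user_agent, url) returns True if the rules allow that agent to fetch the URL
print(rp.can_fetch("*", "https://www.imdb.com/title/tt0111161/"))  # not listed above, so allowed
print(rp.can_fetch("*", "https://www.imdb.com/tvschedule"))        # explicitly disallowed above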

  • Static sites are easier to scrape than dynamic websites, because dynamic websites often load their content with JavaScript after the initial page load, so the data you want may not be in the raw HTML you download.

  • Some big websites offer official APIs instead of scraping; for example, Twitter's API can be accessed with Tweepy, a Python library that wraps it.

  • Beautiful Soup and Python's requests library can help you build a web scraper for a static site in minutes.

  • HTML parsers turn a web page's raw HTML into a document tree. For web scraping, this makes it much easier to reach the elements you want on the page (there's a small example right after this list).
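
To make that last point concrete, here is a tiny example of Beautiful Soup building a document tree from a made-up HTML snippet and pulling elements out of it.

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Deals</h1>
    <ul>
      <li class="book">A Light in the Attic</li>
      <li class="book">Tipping the Velvet</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")  # build the document tree

print(soup.h1.text)                        # "Deals" -- direct tag access
for li in soup.find_all("li", class_="book"):
    print(li.text)                         # each <li> is a node in the tree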

Web scraping is a fine way to collect data, especially since the scraped data can be exported to various formats with the right tools.
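
For instance, once the scraped rows are in a Pandas DataFrame (as in the sketch further up), exporting them is one line per format. The file names below are just placeholders.

import pandas as pd

df = pd.DataFrame([
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three"},
    {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One"},
])

df.to_csv("books.csv", index=False)          # spreadsheet-friendly CSV
df.to_json("books.json", orient="records")   # JSON, one object per book
# df.to_excel("books.xlsx", index=False)     # Excel, if openpyxl is installed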

Have you built a web scraper before? Are my takeaways similar to yours?