Architecture

The architecture and data flow of the ProBoards Forum Scraper are presented at a high level in the figure below.

_images/pb_scraper_diagram.png

Database class

The proboards_scraper.database.Database class serves as an interface for the SQLite database. It provides a number of convenient methods for querying the database and inserting items into the database (only insert methods are shown in the figure).

For example, proboards_scraper.database.Database.insert_board() accepts a dictionary of parameters describing a board, which it uses to instantiate a proboards_scraper.database.Board object. It then queries the database to determine whether the board already exists. If it doesn’t, the new record is inserted. Either way, the object is returned to the caller. This allows interacting with the database without worrying about low-level SQLAlchemy implementation details (or even lower-level SQL statements).
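The query-then-insert pattern described above can be sketched as follows. This is a minimal illustration only: it uses the standard-library sqlite3 module rather than SQLAlchemy, and the function name insert_board_sketch, the board table, and its id and name columns are assumptions made for the example, not the project's actual schema.

```python
import sqlite3


def insert_board_sketch(conn: sqlite3.Connection, board: dict) -> dict:
    """Insert the board unless a row with the same id exists; return the stored row."""
    row = conn.execute(
        "SELECT id, name FROM board WHERE id = ?", (board["id"],)
    ).fetchone()
    if row is None:
        # The board is not in the database yet, so add it.
        conn.execute(
            "INSERT INTO board (id, name) VALUES (?, ?)",
            (board["id"], board["name"]),
        )
        conn.commit()
        row = (board["id"], board["name"])
    return {"id": row[0], "name": row[1]}


# Hypothetical schema, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE board (id INTEGER PRIMARY KEY, name TEXT)")

first = insert_board_sketch(conn, {"id": 1, "name": "General"})
second = insert_board_sketch(conn, {"id": 1, "name": "General"})  # no duplicate row
```

Calling the function twice with the same board leaves a single row in the table, which is what lets the scraper be re-run without creating duplicates.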

ScraperManager class

The proboards_scraper.ScraperManager class contains asynchronous methods to grab the HTML page source of a URL (proboards_scraper.ScraperManager.get_source()) and download an image (proboards_scraper.ScraperManager.download_image()). It also contains an asynchronous method, proboards_scraper.ScraperManager.run(), that pops items from the user queue and the content queue and inserts them into the database.

Why encapsulate these methods in a class instead of allowing them to be standalone functions?

For starters, incorporating them into the ScraperManager class enables us to keep track of the number of HTTP requests made and add delays between HTTP requests to avoid request throttling by the server.
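As a rough illustration of why a class helps here, the sketch below keeps a request counter and enforces a pause between requests. It is not the project's implementation: the class name ThrottledClient, the delay parameter, and the placeholder _fetch method (which returns a dummy string instead of issuing an HTTP request) are all assumptions.

```python
import asyncio


class ThrottledClient:
    """Hypothetical stand-in for the request bookkeeping described above."""

    def __init__(self, delay: float = 0.01) -> None:
        self.delay = delay           # seconds to wait between requests
        self.request_count = 0       # total number of requests made so far
        self._lock = asyncio.Lock()  # serializes requests so the delay is honored

    async def get_source(self, url: str) -> str:
        async with self._lock:
            if self.request_count > 0:
                # Pause before every request except the first.
                await asyncio.sleep(self.delay)
            self.request_count += 1
            return await self._fetch(url)

    async def _fetch(self, url: str) -> str:
        # Placeholder: a real client would issue an HTTP request here.
        return f"<html>{url}</html>"


async def demo():
    # Create the client inside the running event loop.
    client = ThrottledClient(delay=0.01)
    pages = await asyncio.gather(
        *(client.get_source(f"/thread/{i}") for i in range(3))
    )
    return client.request_count, pages
```

Because the counter and the lock live on one shared instance, every caller is throttled together, which standalone functions could only achieve with module-level state.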

The ScraperManager.run() method also ensures that all users are processed (via the user queue) and added to the database before all other site content, since most other site content references users in some way. For instance, a board might have moderators (i.e., users), a thread is started by a user, posts are made/edited by users, and polls are voted in by users. Each of these database tables contains a reference (or references) to objects in the users table; therefore, the users need to exist first before we can populate those other tables.

Finally, having a ScraperManager class allows us to store the aiohttp session, selenium driver session, and a reference to the Database class instance in a single place. This way, we only need to pass around the ScraperManager class instance instead of these other objects, and let it determine which object should be used for a given task and how to use it. For example, the caller doesn’t need to worry about which Database insert method to use. It only needs to put a dictionary containing the necessary database object parameters into the queue. The run() method inspects it and determines which Database method is needed to insert it into the database.
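The ordering and dispatch behavior of run() might look roughly like the sketch below. The item layout (a dictionary with "type" and "object" keys), the use of None as an end-of-queue sentinel, and the RecordingDatabase class are assumptions introduced for illustration; the real queue format may differ.

```python
import asyncio


class RecordingDatabase:
    """Hypothetical stand-in for Database that records each insert call."""

    def __init__(self) -> None:
        self.calls = []

    def insert_user(self, params: dict) -> None:
        self.calls.append(("user", params["name"]))

    def insert_board(self, params: dict) -> None:
        self.calls.append(("board", params["name"]))

    def insert_thread(self, params: dict) -> None:
        self.calls.append(("thread", params["title"]))


async def run(db, user_queue: asyncio.Queue, content_queue: asyncio.Queue) -> None:
    # Process every user first; None is an assumed end-of-queue sentinel.
    while True:
        user = await user_queue.get()
        if user is None:
            break
        db.insert_user(user)

    # Only then process other content, choosing the insert method by item type.
    handlers = {"board": db.insert_board, "thread": db.insert_thread}
    while True:
        item = await content_queue.get()
        if item is None:
            break
        handlers[item["type"]](item["object"])


async def demo():
    db = RecordingDatabase()
    user_queue, content_queue = asyncio.Queue(), asyncio.Queue()
    # Content is queued before users here, yet users are still inserted first.
    await content_queue.put({"type": "board", "object": {"name": "General"}})
    await content_queue.put({"type": "thread", "object": {"title": "Welcome"}})
    await content_queue.put(None)
    await user_queue.put({"name": "alice"})
    await user_queue.put({"name": "bob"})
    await user_queue.put(None)
    await run(db, user_queue, content_queue)
    return db.calls
```

The caller only ever puts dictionaries on a queue; the choice of insert method stays in one place.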

The ScraperManager class also contains two methods that break from the content queue and run() pattern: proboards_scraper.ScraperManager.insert_guest() and proboards_scraper.ScraperManager.insert_image(). The reason for this will be explained below.

Scraper module

The proboards_scraper.scraper module contains several asynchronous functions that scrape the site by calling the relevant ScraperManager methods and parsing/processing the HTML page source. There is a dedicated function for scraping all users, proboards_scraper.scraper.scrape_users(), and there are other functions for grabbing all other site content. proboards_scraper.scraper.scrape_forum() grabs all shoutbox posts and post smileys (via functions not shown in the figure above), then calls proboards_scraper.scraper.scrape_board() on every board on the main page. scrape_board() recursively scrapes any sub-boards, as well as all threads belonging to the board, via proboards_scraper.scraper.scrape_thread(). That function, in turn, scrapes a thread (including its poll, if one is associated with the thread, along with all of the poll’s options and voters) and the thread’s posts.

In the figure, the arrows pointing to/from the dashed line representing the scraper module represent the data flow for each of these functions. In other words, each function gets the page source (via proboards_scraper.ScraperManager.get_source()), parses it for relevant information, and adds the appropriate item(s) to the appropriate queue.

Each function can be called individually, even if some of them are recursive. For example, scrape_thread can be called with a single thread’s URL; it doesn’t need to be recursively called by scrape_board.
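The recursive structure, and the fact that scrape_thread can be called on its own, can be illustrated with the toy sketch below. The in-memory FORUM dictionary replaces real page fetching and parsing, the URLs are made up, and the order in which sub-boards and threads are visited is an assumption rather than the module's documented behavior.

```python
import asyncio

# Hypothetical forum layout standing in for fetched and parsed pages.
FORUM = {
    "/board/1": {"sub_boards": ["/board/2"], "threads": ["/thread/10"]},
    "/board/2": {"sub_boards": [], "threads": ["/thread/20", "/thread/21"]},
}


async def scrape_thread(url: str, queue: asyncio.Queue) -> None:
    # A real scraper would fetch and parse the thread page here.
    await queue.put({"type": "thread", "url": url})


async def scrape_board(url: str, queue: asyncio.Queue) -> None:
    board = FORUM[url]
    await queue.put({"type": "board", "url": url})
    for sub_url in board["sub_boards"]:
        await scrape_board(sub_url, queue)  # recurse into each sub-board
    for thread_url in board["threads"]:
        await scrape_thread(thread_url, queue)


async def demo():
    queue = asyncio.Queue()
    await scrape_board("/board/1", queue)    # recursive entry point
    await scrape_thread("/thread/99", queue)  # a direct, standalone call
    urls = []
    while not queue.empty():
        urls.append(queue.get_nowait()["url"])
    return urls
```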

Guests

Guests can be considered a special case of user. Guests are users who aren’t registered on the site (or may be formerly registered users who have been deleted). There’s no user profile associated with a guest, but there can be posts made by or threads started by guests. Because guests have no profile, they can’t be scraped from the forum’s members page alongside registered users before all other site content is scraped, as mentioned above.

In other words, guests can be encountered at any time while scraping boards, threads, posts, etc. To account for this, the ScraperManager class has a method specifically for querying and inserting guests into the database, bypassing the async content queue. If, for instance, a post made by a guest is encountered by proboards_scraper.scraper.scrape_thread() while scraping a thread, proboards_scraper.ScraperManager.insert_guest() is called with the guest’s username. If a guest with that username already exists in the database, their id is retrieved and returned; if the guest does not already exist, they’re inserted into the database and assigned an id, which is then returned. scrape_thread can then proceed, assigning the post to the correct user id (from the User table; see proboards_scraper.database.User).

Since guests aren’t registered and don’t have an actual user id on the forum, we assign them negative user ids for the purpose of the database. The first guest encountered is assigned -1, the next -2, and so on.
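The id assignment scheme can be sketched in a few lines. Here a plain dictionary stands in for the guest rows of the User table, and the function name insert_guest_sketch is hypothetical; the real method works against the database.

```python
def insert_guest_sketch(guests: dict, name: str) -> int:
    """Return the guest's id, assigning the next negative id if the guest is new."""
    if name in guests:
        return guests[name]
    # Existing ids are -1, -2, ...; the next id is one below the current minimum.
    guest_id = min(guests.values(), default=0) - 1
    guests[name] = guest_id
    return guest_id


guests = {}  # stand-in for guest rows in the User table
first_id = insert_guest_sketch(guests, "visitor")
second_id = insert_guest_sketch(guests, "anonymous")
repeat_id = insert_guest_sketch(guests, "visitor")  # already known, id is reused
```

Negative ids cannot collide with the ids of registered users, assuming the forum only issues positive user ids.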

Images

Image metadata is stored in the database Image table (see proboards_scraper.database.Image). Images are unique in that an image item in the database may also have an image file, saved on disk, associated with it. The location of the file (if any) on disk is stored in the filename attribute of the Image object. To facilitate scraping, the ScraperManager class has two methods, proboards_scraper.ScraperManager.download_image() and proboards_scraper.ScraperManager.insert_image(), that can be called to download an image from a URL and insert it into the database, respectively.

This is mainly useful for scraping user profiles. A user’s avatar is part of their profile. While scraping a profile, the avatar is downloaded by calling the aforementioned ScraperManager.download_image method, and information about the file (like its path on disk, its MD5 hash, and its filesize) is returned. This information is used to construct an Image object and insert it into the database via ScraperManager.insert_image, which returns the id of the image. This id can be linked to an avatar (see proboards_scraper.database.Avatar) and user when they’re added to the content queue per the normal workflow.
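The kind of file information mentioned above (path, MD5 hash, and file size) can be computed with the standard library, as in the sketch below. The function name describe_image and the dictionary keys are assumptions for illustration, not the return format of ScraperManager.download_image.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_image(path: Path) -> dict:
    """Collect metadata about an image file that is already saved on disk."""
    data = path.read_bytes()
    return {
        "filename": path.name,                     # location of the file on disk
        "md5_hash": hashlib.md5(data).hexdigest(),  # content fingerprint
        "size": len(data),                          # file size in bytes
    }


# Demonstration with a small dummy file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = Path(tmp_dir) / "avatar.png"
    image_path.write_bytes(b"not really a PNG")
    info = describe_image(image_path)
```

Storing a hash alongside the path makes it possible to recognize the same image when it is referenced from more than one URL.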