proboards_scraper.scraper

async proboards_scraper.scraper.scrape_board(url, manager)

Scrape a board, including all sub-boards (recursively) and all threads, and add them to the content queue for insertion into the database.

Parameters

url (str) – Board page URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_forum(url, manager)

Recursively scrape the site beginning at the homepage (main forum page), including all categories, boards, smileys, and the shoutbox. These items are added to the ScraperManager content queue for insertion into the database.

Parameters

url (str) – Forum homepage URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Note

This function does NOT scrape user profiles. User profiles must be scraped in a separate async task via scrape_users().

Return type

None
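
Because user profiles are handled by a separate coroutine, a caller would typically run scrape_forum() and scrape_users() side by side. The following is a minimal sketch of that pattern using asyncio.gather; it is not part of the library. The helper name scrape_site is hypothetical, and the two scraper coroutines are passed in as arguments so the sketch makes no assumptions about how a ScraperManager is constructed or how it drains its queues.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Type of a scraper coroutine function taking (url, manager).
Scraper = Callable[[str, Any], Awaitable[None]]


async def scrape_site(
    scrape_forum: Scraper,
    scrape_users: Scraper,
    base_url: str,
    manager: Any,
) -> None:
    """Run the forum scrape and the user scrape as concurrent tasks.

    Hypothetical helper: `scrape_forum` and `scrape_users` are expected to be
    the coroutine functions documented on this page, and `manager` an
    already-constructed ScraperManager.
    """
    base_url = base_url.rstrip("/")
    await asyncio.gather(
        # Homepage URL, e.g. https://yoursite.proboards.com/
        scrape_forum(base_url + "/", manager),
        # Main members page, e.g. https://yoursite.proboards.com/members
        scrape_users(base_url + "/members", manager),
    )
```

Passing the coroutine functions in explicitly also makes the helper easy to exercise with stand-in coroutines before pointing it at a live forum.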

async proboards_scraper.scraper.scrape_poll(thread_id, poll_container, voters_container, manager)

Helper function for scrape_thread() that parses poll HTML and adds the poll, its options, its voters, and related metadata to the ScraperManager content queue for insertion into the database.

Parameters

thread_id (int) – Thread ID of the thread to which this poll belongs. Since any given thread can have, at most, one poll, a thread ID can be used to uniquely identify a corresponding poll.
poll_container (bs4.element.Tag) – BeautifulSoup HTML container for the poll.
voters_container (bs4.element.Tag) – BeautifulSoup HTML container for poll voters.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_shoutbox(shoutbox_container, manager)

Scrape the shoutbox on the home page and add all shoutbox posts to the content queue for insertion into the database.

Parameters

shoutbox_container (bs4.element.Tag) – BeautifulSoup HTML corresponding to the shoutbox.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_smileys(smiley_menu, manager)

Helper function for scrape_forum() that grabs all smileys available in the post editor form, downloading the images and adding them to the content queue for insertion into the database. The description for each smiley, which is represented as an image in the Image table in the database, is the word “smiley” followed by the emoticon it represents, e.g., “smiley :)”.

Parameters

smiley_menu (bs4.element.Tag) – BeautifulSoup HTML source corresponding to the smiley menu from a post editor form.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_thread(url, manager)

Scrape all pages of a thread, including poll (if any) and all posts, and add them to the content queue for insertion into the database.

Parameters

url (str) – Thread URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_user(url, manager)

Scrape a user profile and add the user to the ScraperManager’s user queue (from which the user will be inserted into the database). Also download the user’s avatar and insert the image into the database.

Parameters

url (str) – User profile page URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

async proboards_scraper.scraper.scrape_users(url, manager)

Asynchronously iterate over all user profile pages and add them to the ScraperManager user queue for insertion into the database.

Parameters

url (str) – Main members page URL, e.g., https://yoursite.proboards.com/members.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.

Return type

None

proboards_scraper.scraper.split_url(url)

Given a forum page URL such as https://yoursite.proboards.com/board/3/board-name, return the base URL (https://yoursite.proboards.com) and the resource path component (board/3/board-name).

Site/page URLs take the following forms:

Homepage: https://yoursite.proboards.com/
Board: https://yoursite.proboards.com/board/3/board-name
Thread: https://yoursite.proboards.com/thread/123/thread-title
Users: https://yoursite.proboards.com/members
User: https://yoursite.proboards.com/user/10

Parameters

url (str) – URL to a forum page.

Return type

Tuple[str, str]

Returns

(base_url, path)

The base URL and the resource path component. The path is None if url is just the base/homepage URL.
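
The behavior described above can be illustrated with a short sketch built on the standard library's urllib.parse.urlsplit. This is not the library's implementation, which may differ (for example, in how it treats trailing slashes or query strings); the function name split_url_sketch is hypothetical and is used only to avoid confusion with the real split_url.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_url_sketch(url: str) -> Tuple[str, Optional[str]]:
    """Split a forum page URL into (base_url, path).

    Hypothetical stand-in for split_url(); `path` is None when the URL
    is just the base/homepage URL.
    """
    parts = urlsplit(url)
    # Scheme plus host, e.g. https://yoursite.proboards.com
    base_url = f"{parts.scheme}://{parts.netloc}"
    # Drop leading/trailing slashes so the homepage yields an empty string.
    path = parts.path.strip("/")
    return base_url, (path or None)


print(split_url_sketch("https://yoursite.proboards.com/board/3/board-name"))
# ('https://yoursite.proboards.com', 'board/3/board-name')
print(split_url_sketch("https://yoursite.proboards.com/"))
# ('https://yoursite.proboards.com', None)
```

Note that this sketch annotates the second element as Optional[str] to reflect the None case for the homepage URL.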