proboards_scraper.scraper
- async proboards_scraper.scraper.scrape_board(url, manager)
Scrape a board, including all sub-boards (recursively) and all threads, and add them to the content queue for insertion into the database.
- Parameters
url (str) – Board page URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
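The recursive traversal described for scrape_board() can be illustrated with a minimal sketch. This is not the library's implementation: the in-memory BOARDS layout, the scrape_board_sketch() and collect() helpers, and the plain asyncio.Queue standing in for the ScraperManager content queue are all assumptions made for illustration, and no HTTP requests are issued.

```python
import asyncio
from typing import Dict, List, Tuple

# Hypothetical in-memory forum layout (assumption): each board lists its
# sub-boards and threads. A real scraper would discover these from HTML.
BOARDS: Dict[str, Dict[str, List[str]]] = {
    "board/1": {"sub_boards": ["board/2"], "threads": ["thread/10"]},
    "board/2": {"sub_boards": [], "threads": ["thread/20", "thread/21"]},
}


async def scrape_board_sketch(path: str, queue: asyncio.Queue) -> None:
    """Queue a board and its threads, then recurse into each sub-board."""
    board = BOARDS[path]
    await queue.put(("board", path))
    for thread in board["threads"]:
        await queue.put(("thread", thread))
    for sub_board in board["sub_boards"]:
        await scrape_board_sketch(sub_board, queue)


async def collect(root: str) -> List[Tuple[str, str]]:
    """Run the traversal and drain the stand-in content queue into a list."""
    queue: asyncio.Queue = asyncio.Queue()
    await scrape_board_sketch(root, queue)
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


items = asyncio.run(collect("board/1"))
```

In this sketch a consumer would normally read from the queue concurrently and insert each item into the database; draining it into a list only keeps the example self-contained.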
- async proboards_scraper.scraper.scrape_forum(url, manager)
Recursively scrape the site beginning at the homepage (main forum page), including all categories, boards, smileys, and the shoutbox. These items are added to the ScraperManager content queue for insertion into the database.
- Parameters
url (str) – Forum homepage URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
Note
This function does NOT scrape user profiles. User profiles must be scraped in a separate async task via scrape_users().
- Return type
None
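Because user profiles are scraped in a separate async task, the two scrapers would typically be scheduled side by side. The following is a minimal sketch of that pattern only: scrape_forum_stub() and scrape_users_stub() are hypothetical stand-ins for scrape_forum() and scrape_users(), and a shared list replaces the ScraperManager so the example runs without the library or network access.

```python
import asyncio
from typing import List, Tuple


async def scrape_forum_stub(url: str, results: List[Tuple[str, str]]) -> None:
    # Hypothetical stand-in for scrape_forum(url, manager).
    await asyncio.sleep(0)  # yield control, as real network I/O would
    results.append(("forum", url))


async def scrape_users_stub(url: str, results: List[Tuple[str, str]]) -> None:
    # Hypothetical stand-in for scrape_users(url, manager).
    await asyncio.sleep(0)
    results.append(("users", url))


async def main() -> List[Tuple[str, str]]:
    results: List[Tuple[str, str]] = []
    base_url = "https://yoursite.proboards.com"
    # Run the forum scrape and the user scrape as separate concurrent tasks.
    await asyncio.gather(
        scrape_forum_stub(f"{base_url}/", results),
        scrape_users_stub(f"{base_url}/members", results),
    )
    return results


results = asyncio.run(main())
```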
- async proboards_scraper.scraper.scrape_poll(thread_id, poll_container, voters_container, manager)
Helper function for scrape_thread() that parses poll HTML and adds the poll, poll options, poll voters, and related metadata to the ScraperManager content queue for insertion into the database.
- Parameters
thread_id (int) – Thread ID of the thread to which this poll belongs. Since any given thread can have, at most, one poll, a thread ID can be used to uniquely identify a corresponding poll.
poll_container (bs4.element.Tag) – BeautifulSoup HTML container for the poll.
voters_container (bs4.element.Tag) – BeautifulSoup HTML container for poll voters.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- async proboards_scraper.scraper.scrape_shoutbox(shoutbox_container, manager)
Scrape the shoutbox on the home page and add all shoutbox posts to the content queue for insertion into the database.
- Parameters
shoutbox_container (bs4.element.Tag) – BeautifulSoup HTML corresponding to the shoutbox.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- async proboards_scraper.scraper.scrape_smileys(smiley_menu, manager)
Helper function for scrape_forum() that grabs all smileys available in the post editor form, downloading the images and adding them to the content queue for insertion into the database. Each smiley is stored as an image in the Image table in the database; its description is the word “smiley” followed by the emoticon it represents, e.g., “smiley :)”.
- Parameters
smiley_menu (bs4.element.Tag) – BeautifulSoup HTML source corresponding to the smiley menu from a post editor form.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- async proboards_scraper.scraper.scrape_thread(url, manager)
Scrape all pages of a thread, including poll (if any) and all posts, and add them to the content queue for insertion into the database.
- Parameters
url (str) – Thread URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- async proboards_scraper.scraper.scrape_user(url, manager)
Scrape a user profile and add the user to the ScraperManager’s user queue (from which the user will be inserted into the database), as well as download the user’s avatar and insert the image into the database.
- Parameters
url (str) – User profile page URL.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- async proboards_scraper.scraper.scrape_users(url, manager)
Asynchronously iterate over all user profile pages and add them to the ScraperManager user queue for insertion into the database.
- Parameters
url (str) – Main members page URL, e.g., https://yoursite.proboards.com/members.
manager (proboards_scraper.ScraperManager) – ScraperManager instance.
- Return type
None
- proboards_scraper.scraper.split_url(url)
Given a forum page URL such as https://yoursite.proboards.com/board/3/board-name, return the base URL (https://yoursite.proboards.com) and the resource path component (board/3/board-name).
Site/page URLs take the following forms:
Homepage: https://yoursite.proboards.com/
Thread: https://yoursite.proboards.com/thread/123/thread-title
- Parameters
url (str) – URL to a forum page.
- Return type
Tuple[str, str]
- Returns
(base_url, path)
The base URL and the resource path URL component (or None if url is just the base/homepage URL).
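The behavior described above can be approximated with the standard library. The following is a minimal sketch, not the library's actual implementation: split_url_sketch() is a hypothetical name, and the use of urllib.parse.urlsplit is an assumption about one reasonable way to perform the split.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_url_sketch(url: str) -> Tuple[str, Optional[str]]:
    """Illustrative stand-in for split_url(); returns (base_url, path)."""
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}"
    # Strip surrounding slashes so the homepage URL yields an empty path.
    path = parts.path.strip("/")
    # An empty path means url was just the base/homepage URL.
    return base_url, path or None


board = split_url_sketch("https://yoursite.proboards.com/board/3/board-name")
home = split_url_sketch("https://yoursite.proboards.com/")
```

Under these assumptions, board holds the base URL paired with board/3/board-name, while home pairs the base URL with None.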