Python API¶
Subpackages¶
Module contents¶
- class proboards_scraper.ScraperManager(db, client_session, content_queue=None, driver=None, image_dir=None, user_queue=None, request_threshold=15, short_delay_time=1.5, long_delay_time=20.0)[source]¶
Bases:
object
This class has three purposes:
1. to store references to objects that will be used in the process of scraping,
2. to serve as an abstraction layer between the scraper functionality and the database, and
3. to handle HTTP requests (adding delays between requests as needed to avoid throttling) and process the queues (popping items from the queues in the necessary order and inserting them into the database).
- Parameters
  - db (proboards_scraper.database.Database) – Database handle.
  - client_session (aiohttp.ClientSession) – aiohttp session.
  - content_queue (Optional[asyncio.Queue]) – Queue to which all content (excluding users) should be added for insertion into the database.
  - driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver.
  - image_dir (Optional[pathlib.Path]) – Directory to which downloaded images should be saved.
  - user_queue (Optional[asyncio.Queue]) – Queue to which users should be added for insertion into the database.
  - request_threshold (int) – After every request_threshold calls to ScraperManager.get_source(), wait long_delay_time seconds before continuing. This is to prevent request throttling due to a large number of consecutive requests.
  - short_delay_time (float) – Number of seconds to wait after each call to ScraperManager.get_source() (to help prevent request throttling).
  - long_delay_time (float) – See request_threshold.
- async download_image(url)[source]¶
Download an image to image_dir.
- Parameters
  - url (str) – URL of the image to be downloaded.
- Return type
  dict
- Returns
  Image download status and metadata; see proboards_scraper.download_image().
- async get_source(url)[source]¶
Wrapper around proboards_scraper.get_source() with an added short delay via a call to time.sleep() before each request, and a longer delay after every self.request_threshold calls to ScraperManager.get_source(). This rate limiting is performed to help avoid request throttling by the server, which may result from a large number of requests in a short period of time.
- Parameters
  - url (str) – URL whose page source to retrieve.
- Return type
  bs4.BeautifulSoup
- Returns
  BeautifulSoup page source object.
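The delay scheme described above can be sketched as a standalone helper. This is an illustration of the rate-limiting logic, not the library's actual implementation; the RateLimiter class and its method names are assumptions, though the default values mirror ScraperManager's.

```python
import time

# Hypothetical sketch of ScraperManager's rate limiting: a short sleep
# before every request, and a longer sleep after every
# request_threshold-th request.
class RateLimiter:
    def __init__(self, request_threshold=15, short_delay=1.5, long_delay=20.0):
        self.request_threshold = request_threshold
        self.short_delay = short_delay
        self.long_delay = long_delay
        self._request_count = 0

    def wait(self):
        """Sleep before a request; return the delay applied (seconds)."""
        self._request_count += 1
        if self._request_count % self.request_threshold == 0:
            delay = self.long_delay
        else:
            delay = self.short_delay
        time.sleep(delay)
        return delay
```

With the defaults, 15 consecutive requests would incur fourteen 1.5-second delays and one 20-second delay.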
- insert_guest(name)[source]¶
Insert a guest user into the database.
- Parameters
  - name (str) – The guest’s username.
- Return type
  int
- Returns
  The user ID of the guest returned by proboards_scraper.database.Database.insert_guest().
- insert_image(image)[source]¶
Insert an image entry into the database.
- Parameters
  - image (dict) – A dict representing the image entry.
- Return type
  int
- Returns
  The image ID of the image returned by proboards_scraper.database.Database.insert_image().
- async run()[source]¶
Run the scraper, first processing the user queue and then processing the content queue, calling the appropriate database insert/query methods as needed, and closing the Selenium and aiohttp sessions upon completion.
Because all content (threads, posts, etc.) is associated with users, the content queue is not processed until all users have been added from the user queue (the end of which is marked by a sentinel value). Guest users are an exception, since they are not present in the site’s member list; instead, guests are added/queried as they are encountered by calling
ScraperManager.insert_guest().
- Return type
  None
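The ordering described above (drain the user queue to its sentinel before touching the content queue) can be sketched with plain asyncio queues. The item format, sentinel choice (None), and function names here are illustrative assumptions, not the library's actual internals.

```python
import asyncio

# Minimal sketch of run()'s processing order: users first, until a
# sentinel (None here) marks the end of the user queue, then content.
async def process_queues(user_queue, content_queue):
    processed = []
    while True:
        user = await user_queue.get()
        if user is None:  # sentinel: no more users
            break
        processed.append(("user", user))
    while not content_queue.empty():
        processed.append(("content", content_queue.get_nowait()))
    return processed

async def demo():
    user_q, content_q = asyncio.Queue(), asyncio.Queue()
    await content_q.put("post-1")   # content enqueued first...
    await user_q.put("alice")
    await user_q.put(None)          # ...but users are processed first
    return await process_queues(user_q, content_q)

print(asyncio.run(demo()))  # → [('user', 'alice'), ('content', 'post-1')]
```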
- async proboards_scraper.download_image(url, session, dst_dir)[source]¶
Attempt to download the image at url to the directory specified by dst_dir. The downloaded file is named after its MD5 hash to ensure uniqueness. If a file already exists on disk (i.e., has been previously downloaded), it is not rewritten.
- Parameters
  - url (str) – Image URL.
  - session (aiohttp.ClientSession) – aiohttp session.
  - dst_dir (pathlib.Path) – Directory to which the image should be downloaded.
- Return type
  dict
- Returns
  A dict containing information on the download attempt and, if the download was successful, image metadata:

  ```
  {
      "status": {
          "get": HTTP response code,
          "exists": whether the image already exists on disk (bool),
          "valid": whether the file is a valid image file,
      },
      "image": {
          "url": image download URL,
          "filename": downloaded image filename,
          "md5_hash": file MD5 hash,
          "size": filesize on disk,
      },
  }
  ```
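The MD5-based naming scheme described above can be sketched with the standard library alone. The helper names (hashed_path, save_if_new) and the extension handling are assumptions for illustration; only the idea (name the file after the hash of its bytes, skip the write if it already exists) comes from the documentation.

```python
import hashlib
from pathlib import Path

# Hypothetical sketch of download_image's naming scheme: identical
# bytes always map to the same filename, so re-downloads are skipped.
def hashed_path(data: bytes, url: str, dst_dir: Path) -> Path:
    md5 = hashlib.md5(data).hexdigest()
    ext = Path(url).suffix  # keep the original extension, e.g. ".png"
    return dst_dir / f"{md5}{ext}"

def save_if_new(data: bytes, url: str, dst_dir: Path) -> dict:
    path = hashed_path(data, url, dst_dir)
    exists = path.is_file()
    if not exists:
        path.write_bytes(data)  # only write files not seen before
    return {"filename": path.name, "exists": exists, "size": len(data)}
```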
- proboards_scraper.get_chrome_driver()[source]¶
Returns an instance of a Selenium Chrome driver with the headless option set to True.
- Return type
  selenium.webdriver.Chrome
- Returns
  Headless Chrome driver.
- proboards_scraper.get_login_cookies(home_url, username, password, driver=None, page_load_wait=1)[source]¶
Logs in to a Proboards account using Selenium and returns the cookies from the authenticated login session.
- Parameters
  - home_url (str) – URL for the Proboards forum homepage.
  - username (str) – Login username.
  - password (str) – Login password.
  - driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver (optional).
  - page_load_wait (int) – Time (in seconds) to wait to allow the page to load.
- Return type
  List[dict]
- Returns
  A list of dicts, where each dict corresponds to a cookie, from the Selenium Chrome driver.
- proboards_scraper.get_login_session(cookies)[source]¶
Get an authenticated aiohttp session using the cookies provided.

This is achieved by converting cookies from a Selenium driver session to http module Morsels (see http.cookies.Morsel), which can be added to the aiohttp session’s cookie jar.
- Parameters
  - cookies (List[dict]) – A list of dicts as returned by get_login_cookies(), i.e., from a Selenium driver session.
- Return type
  aiohttp.ClientSession
- Returns
  An aiohttp session with the given cookies in its cookie jar.
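The cookie conversion described above can be sketched with http.cookies from the standard library. The input dict's keys mirror what Selenium's get_cookies() returns; the helper name and attribute handling are illustrative assumptions, not the library's exact conversion code.

```python
from http.cookies import Morsel

# Sketch of converting a Selenium-style cookie dict into a Morsel
# suitable for an aiohttp cookie jar.
def cookie_to_morsel(cookie: dict) -> Morsel:
    morsel = Morsel()
    # set() takes the key, the value, and the coded (output) value.
    morsel.set(cookie["name"], cookie["value"], cookie["value"])
    # Copy over the attributes Morsel understands, when present.
    for attr in ("domain", "path", "secure", "expires"):
        if cookie.get(attr):
            morsel[attr] = cookie[attr]
    return morsel
```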
- async proboards_scraper.get_source(url, session)[source]¶
Get page source of a URL.
- Parameters
  - url (str) – URL to visit.
  - session (aiohttp.ClientSession) – aiohttp session.
- Return type
  bs4.BeautifulSoup
- Returns
  Page source.
- proboards_scraper.run_scraper(url, dst_dir='site', username=None, password=None, skip_users=False, no_delay=False)[source]¶
Main function that runs the scraper and calls the appropriate async functions/methods. This is the only function that needs to be called to actually run the scraper (with all the default settings).
- Parameters
  - url (str) – URL of the page to scrape.
    If the URL is that of the forum homepage (e.g., https://yoursite.proboards.com/), the entire site (including users, shoutbox, category/board/thread/post content, etc.) will be scraped.
    If it is the URL for the members page (e.g., https://yoursite.proboards.com/members), only the users will be scraped.
    If it is the URL for a specific user profile (e.g., https://yoursite.proboards.com/user/10), only that particular user will be scraped.
    If it is the URL for a board (e.g., https://yoursite.proboards.com/board/3/board-name), only that particular board and its threads/posts/sub-boards will be scraped.
    If it is the URL for a thread (e.g., https://yoursite.proboards.com/thread/1234/thread-title), only that particular thread and its posts will be scraped.
  - dst_dir (pathlib.Path) – Directory in which to place the resulting files. The database file is written to <dst_dir>/forum.db and image files are saved to <dst_dir>/images.
  - username (Optional[str]) – Username for login.
  - password (Optional[str]) – Password for login.
  - skip_users (bool) – Skip scraping/adding users from the forum members page (only applies if the forum homepage is provided for url).
  - no_delay (bool) – Do not add a delay between subsequent requests (see ScraperManager for more information). Note that this may result in request throttling.
- Return type
  None
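The URL dispatch described above can be sketched as a small classifier. This is a hypothetical illustration of how the scrape target could be inferred from the URL path; the function and category names are assumptions, and run_scraper performs this dispatch internally.

```python
from urllib.parse import urlparse

# Sketch of classifying a ProBoards URL by its path to decide what
# run_scraper would target.
def classify_url(url: str) -> str:
    path = urlparse(url).path.strip("/")
    if not path:
        return "site"     # homepage: scrape everything
    first = path.split("/")[0]
    if path == "members":
        return "members"  # members page: users only
    if first == "user":
        return "user"     # single user profile
    if first == "board":
        return "board"    # board + its threads/posts/sub-boards
    if first == "thread":
        return "thread"   # thread + its posts
    return "unknown"
```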