Python API

Module contents

class proboards_scraper.ScraperManager(db, client_session, content_queue=None, driver=None, image_dir=None, user_queue=None, request_threshold=15, short_delay_time=1.5, long_delay_time=20.0)[source]

Bases: object

This class has three purposes:

  • to store references to objects that will be used in the process of scraping,

  • to serve as an abstraction layer between the scraper functionality and the database, and

  • to handle HTTP requests (adding delays between requests as needed to avoid throttling) and process the queues (popping items from the queues in the necessary order and inserting them into the database).

Parameters
  • db (proboards_scraper.database.Database) – Database handle.

  • client_session (aiohttp.ClientSession) – aiohttp session.

  • content_queue (Optional[asyncio.Queue]) – Queue to which all content (excluding users) should be added for insertion into the database.

  • driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver.

  • image_dir (Optional[pathlib.Path]) – Directory to which downloaded images should be saved.

  • user_queue (Optional[asyncio.Queue]) – Queue to which users should be added for insertion into the database.

  • request_threshold (int) – After every request_threshold calls to ScraperManager.get_source(), wait long_delay_time seconds before continuing. This is to prevent request throttling due to a large number of consecutive requests.

  • short_delay_time (float) – Number of seconds to wait after each call to ScraperManager.get_source() (to help prevent request throttling).

  • long_delay_time (float) – See request_threshold.

async download_image(url)[source]

Download an image to image_dir.

Parameters

url (str) – URL of the image to be downloaded.

Return type

dict

Returns

Image download status and metadata; see proboards_scraper.download_image().

async get_source(url)[source]

Wrapper around proboards_scraper.get_source() that adds a short delay (via a call to time.sleep()) before each request, and a longer delay after every self.request_threshold calls to ScraperManager.get_source(). This rate limiting helps avoid request throttling by the server, which can result from a large number of requests in a short period of time.
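The delay scheme can be sketched as follows. This is a standalone illustration, not the actual implementation: the parameter names mirror the ScraperManager constructor above, and the class name `RateLimiter` is hypothetical.

```python
import time


class RateLimiter:
    """Sketch of ScraperManager's delay scheme: a short sleep before every
    request, plus a longer sleep after every `request_threshold` requests."""

    def __init__(self, request_threshold=15, short_delay_time=1.5,
                 long_delay_time=20.0):
        self.request_threshold = request_threshold
        self.short_delay_time = short_delay_time
        self.long_delay_time = long_delay_time
        self.request_count = 0

    def wait(self):
        # Short delay before each request.
        time.sleep(self.short_delay_time)
        self.request_count += 1

        # Longer pause after every `request_threshold` requests.
        if self.request_count % self.request_threshold == 0:
            time.sleep(self.long_delay_time)
```

A scraper would call `wait()` immediately before each HTTP request.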

Parameters

url (str) – URL whose page source to retrieve.

Return type

bs4.BeautifulSoup

Returns

BeautifulSoup page source object.

insert_guest(name)[source]

Insert a guest user into the database.

Parameters

name (str) – The guest’s username.

Return type

int

Returns

The user ID of the guest returned by proboards_scraper.database.Database.insert_guest().

insert_image(image)[source]

Insert an image entry into the database.

Parameters

image (dict) – A dict representing the image entry.

Return type

int

Returns

The image ID of the image returned by proboards_scraper.database.Database.insert_image().

async run()[source]

Run the scraper: first process the user queue, then the content queue, calling the appropriate database insert/query methods as needed, and close the Selenium and aiohttp sessions upon completion.

Because all content (threads, posts, etc.) is associated with users, the content queue is not processed until all users have been added from the user queue (the end of which is marked by a sentinel value). Guest users are an exception, since they are not present in the site’s member list; instead, guests are added/queried as they are encountered by calling ScraperManager.insert_guest().
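The two-phase, sentinel-terminated queue processing described above can be sketched like this (a simplified stand-in: `handle_user` and `handle_content` represent the database insert calls, and the use of `None` as the sentinel is an assumption):

```python
import asyncio

SENTINEL = None  # marks the end of the user queue


async def process_queues(user_queue, content_queue, handle_user, handle_content):
    """Drain the user queue until the sentinel is seen, then drain the
    content queue; content is never processed before all users exist."""
    while True:
        user = await user_queue.get()
        if user is SENTINEL:
            break
        handle_user(user)

    while not content_queue.empty():
        handle_content(content_queue.get_nowait())
```

Because the first loop only exits on the sentinel, every non-guest user is guaranteed to be in the database before any thread or post that references one is inserted.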

Return type

None

async proboards_scraper.download_image(url, session, dst_dir)[source]

Attempt to download the image at url to the directory specified by dst_dir. The downloaded file is named after its MD5 hash to ensure uniqueness. If a file with that name already exists on disk (i.e., the image was previously downloaded), it is not written again.

Parameters
  • url (str) – Image URL.

  • session (aiohttp.ClientSession) – aiohttp session.

  • dst_dir (pathlib.Path) – Directory to which the image should be downloaded.

Return type

dict

Returns

A dict containing information on the download attempt and, if download was successful, image metadata:

{
    "status": {
        "get": HTTP response code,
        "exists": whether the image already exists on disk (bool),
        "valid": whether the file is a valid image file,
    },
    "image": {
        "url": image download URL,
        "filename": downloaded image filename,
        "md5_hash": file MD5 hash,
        "size": filesize on disk,
    },
}
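The hash-based naming and deduplication step can be sketched as follows. Networking is omitted (`data` stands in for the downloaded bytes), and the helper name and the subset of metadata returned are assumptions for illustration.

```python
import hashlib
import pathlib


def save_image_bytes(data: bytes, dst_dir: pathlib.Path,
                     suffix: str = ".jpg") -> dict:
    """Name the file after its MD5 hash; skip the write if it exists."""
    md5_hash = hashlib.md5(data).hexdigest()
    dst_dir.mkdir(parents=True, exist_ok=True)
    path = dst_dir / f"{md5_hash}{suffix}"

    # A file named after the same hash means the same bytes were
    # already downloaded, so there is nothing to write.
    exists = path.is_file()
    if not exists:
        path.write_bytes(data)

    return {
        "filename": path.name,
        "md5_hash": md5_hash,
        "size": path.stat().st_size,
        "exists": exists,
    }
```

Two downloads of identical bytes thus map to one file on disk, with the second call reporting `"exists": True`.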

proboards_scraper.get_chrome_driver()[source]

Returns an instance of a Selenium Chrome driver with the headless option set to True.

Return type

selenium.webdriver.Chrome

Returns

Headless Chrome driver.

proboards_scraper.get_login_cookies(home_url, username, password, driver=None, page_load_wait=1)[source]

Logs in to a Proboards account using Selenium and returns the cookies from the authenticated login session.

Parameters
  • home_url (str) – URL for the Proboards forum homepage.

  • username (str) – Login username.

  • password (str) – Login password.

  • driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver (optional).

  • page_load_wait (int) – Time (in seconds) to wait to allow the page to load.

Return type

List[dict]

Returns

A list of dicts, where each dict corresponds to a cookie, from the Selenium Chrome driver.

proboards_scraper.get_login_session(cookies)[source]

Get an authenticated aiohttp session using the cookies provided.

This is achieved by converting cookies from a Selenium driver session to http module Morsels (see http.cookies.Morsel), which can be added to the aiohttp session’s cookie jar.

Parameters

cookies (List[dict]) – A list of dicts as returned by get_login_cookies(), i.e., from a Selenium driver session.

Return type

aiohttp.ClientSession

Returns

An aiohttp session with the given cookies in its cookie jar.
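The cookie conversion described above can be sketched with the standard library alone (the aiohttp session itself is omitted here, and the Selenium cookie dict fields shown are assumptions based on the shape of get_login_cookies() output):

```python
from http.cookies import Morsel


def selenium_cookies_to_morsels(cookies):
    """Convert Selenium-style cookie dicts to http.cookies.Morsel objects,
    keyed by cookie name, suitable for adding to a session cookie jar."""
    morsels = {}
    for cookie in cookies:
        morsel = Morsel()
        # set(key, value, coded_value): the raw and coded values are
        # identical here since the cookies are already decoded.
        morsel.set(cookie["name"], cookie["value"], cookie["value"])
        morsel["domain"] = cookie.get("domain", "")
        morsel["path"] = cookie.get("path", "/")
        morsels[cookie["name"]] = morsel
    return morsels
```

The resulting name-to-Morsel mapping can then be passed to the aiohttp session's cookie jar so subsequent requests are authenticated.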

async proboards_scraper.get_source(url, session)[source]

Get page source of a URL.

Parameters
  • url (str) – URL to visit.

  • session (aiohttp.ClientSession) – aiohttp session.

Return type

bs4.BeautifulSoup

Returns

Page source.

proboards_scraper.run_scraper(url, dst_dir='site', username=None, password=None, skip_users=False, no_delay=False)[source]

Main function that runs the scraper and calls the appropriate async functions/methods. This is the only function that needs to be called to actually run the scraper (with all the default settings).

Parameters
  • url (str) –

URL of the page to scrape.

    • If the URL is that of the forum homepage (e.g., https://yoursite.proboards.com/), the entire site (including users, shoutbox, category/board/thread/post content, etc.) will be scraped.

    • If it is the URL for the members page (e.g., https://yoursite.proboards.com/members), only the users will be scraped.

    • If it is the URL for a specific user profile (e.g., https://yoursite.proboards.com/user/10), only that particular user will be scraped.

    • If it is the URL for a board (e.g., https://yoursite.proboards.com/board/3/board-name), only that particular board and its threads/posts/sub-boards will be scraped.

    • If it is the URL for a thread (e.g., https://yoursite.proboards.com/thread/1234/thread-title) only that particular thread and its posts will be scraped.

  • dst_dir (pathlib.Path) – Directory in which to place the resulting files. The database file is written to <dst_dir>/forum.db and image files are saved to <dst_dir>/images.

  • username (Optional[str]) – Username for login.

  • password (Optional[str]) – Password for login.

  • skip_users (bool) – Skip scraping/adding users from the forum members page (only applies if the forum homepage is provided for url).

  • no_delay (bool) – Do not add a delay between subsequent requests (see ScraperManager for more information). Note that this may result in request throttling.

Return type

None
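The URL-based dispatch described for the url parameter might be implemented along these lines. This is a classification sketch only, not the library's actual logic; the regex patterns and scope names are illustrative, derived from the example URLs above.

```python
import re

# Map URL patterns (as described for the `url` parameter) to the
# scope of the scrape they trigger.
_ROUTES = [
    (re.compile(r"/members/?$"), "members"),
    (re.compile(r"/user/\d+/?$"), "user"),
    (re.compile(r"/board/\d+(/[\w-]+)?/?$"), "board"),
    (re.compile(r"/thread/\d+(/[\w-]+)?/?$"), "thread"),
]


def classify_url(url: str) -> str:
    """Return which kind of scrape a URL triggers; anything that does not
    match a more specific pattern is treated as the forum homepage."""
    for pattern, kind in _ROUTES:
        if pattern.search(url):
            return kind
    return "home"
```

A dispatcher would then route "home" to the full-site scrape, "members" to the user scrape, and so on.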