Python API¶
Subpackages¶
Module contents¶
- class proboards_scraper.ScraperManager(db, client_session, content_queue=None, driver=None, image_dir=None, user_queue=None, request_threshold=15, short_delay_time=1.5, long_delay_time=20.0)[source]¶
- Bases: object

  This class has three purposes: 1) to store references to objects that will be used in the process of scraping, 2) to serve as an abstraction layer between the scraper functionality and the database, and 3) to handle HTTP requests (adding delays between requests as needed to avoid throttling) and process the queues (popping items from the queues in the necessary order and inserting them into the database).

- Parameters
- db (proboards_scraper.database.Database) – Database handle.
- client_session (aiohttp.ClientSession) – aiohttp session.
- content_queue (Optional[asyncio.Queue]) – Queue to which all content (excluding users) should be added for insertion into the database.
- driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver.
- image_dir (Optional[pathlib.Path]) – Directory to which downloaded images should be saved.
- user_queue (Optional[asyncio.Queue]) – Queue to which users should be added for insertion into the database.
- request_threshold (int) – After every request_threshold calls to ScraperManager.get_source(), wait long_delay_time seconds before continuing. This is to prevent request throttling due to a large number of consecutive requests.
- short_delay_time (float) – Number of seconds to wait after each call to ScraperManager.get_source() (to help prevent request throttling).
- long_delay_time (float) – See request_threshold.
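A minimal construction sketch, assuming proboards_scraper.database.Database can be opened from the path of the database file (check the database subpackage for its actual constructor signature); the queue arguments are left at their defaults here:

```python
import pathlib

import aiohttp

import proboards_scraper
from proboards_scraper.database import Database


async def make_manager() -> proboards_scraper.ScraperManager:
    # Assumption: Database is constructed from the path of the SQLite file;
    # see the database subpackage for the actual signature.
    db = Database(pathlib.Path("site/forum.db"))

    # aiohttp sessions should be created inside a running event loop.
    session = aiohttp.ClientSession()
    driver = proboards_scraper.get_chrome_driver()

    return proboards_scraper.ScraperManager(
        db,
        session,
        driver=driver,
        image_dir=pathlib.Path("site/images"),
    )
```

Later examples in this section refer to an instance created this way as manager; construction and ScraperManager.run() should happen inside the same event loop so that run() can close the session and driver cleanly.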
 
 - async download_image(url)[source]¶
- Download an image to image_dir.

- Parameters
- url (str) – URL of the image to be downloaded.
- Return type
- dict
- Returns
- Image download status and metadata; see proboards_scraper.download_image().
 
 - async get_source(url)[source]¶
- Wrapper around proboards_scraper.get_source() with an added short delay via a call to time.sleep() before each request, and a longer delay after every self.request_threshold calls to ScraperManager.get_source(). This rate-limiting is performed to help avoid request throttling by the server, which may result from a large number of requests in a short period of time.

- Parameters
- url (str) – URL whose page source should be retrieved.
- Return type
- bs4.BeautifulSoup
- Returns
- BeautifulSoup page source object. 
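For instance, inside a coroutine and given a ScraperManager instance (here named manager, as in the construction sketch above), a page might be fetched like this; the delays are applied automatically:

```python
async def get_member_page_title(manager: "proboards_scraper.ScraperManager"):
    # short_delay_time / long_delay_time are handled inside get_source();
    # the caller just awaits the parsed page.
    soup = await manager.get_source("https://yoursite.proboards.com/members")
    return soup.title.string if soup.title else None
```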
 
 - insert_guest(name)[source]¶
- Insert a guest user into the database.

- Parameters
- name (str) – The guest’s username.
- Return type
- int
- Returns
- The user ID of the guest returned by proboards_scraper.database.Database.insert_guest().
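A short usage sketch (manager is a ScraperManager instance as constructed above; the guest name is hypothetical):

```python
# Insert (or look up) a guest user by name and get back their user ID.
guest_id = manager.insert_guest("Forum Guest")
```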
 
 - insert_image(image)[source]¶
- Insert an image entry into the database.

- Parameters
- image (dict) – A dict representing the image entry.
- Return type
- int
- Returns
- The ID of the image, as returned by proboards_scraper.database.Database.insert_image().
 
 - async run()[source]¶
- Run the scraper, first processing the user queue and then processing the content queue, calling the appropriate database insert/query methods as needed, and closing the Selenium and aiohttp sessions upon completion.

  Because all content (threads, posts, etc.) is associated with users, the content queue is not processed until all users have been added from the user queue (the end of which is marked by a sentinel value). Guest users are an exception, since they are not present in the site’s member list; instead, guests are added/queried as they are encountered by calling ScraperManager.insert_guest().

- Return type
- None
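A sketch of driving run() directly; in typical use, run_scraper() (documented below) takes care of this, and the coroutines that populate the queues are internal to the package and omitted here:

```python
import asyncio


async def scrape(manager):
    # Users are drained from manager.user_queue first (up to the sentinel),
    # then content from manager.content_queue; run() closes the Selenium
    # driver and aiohttp session when both queues are exhausted.
    await manager.run()


# asyncio.run(scrape(manager))  # with `manager` built as in the sketch above
```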
 
 
- async proboards_scraper.download_image(url, session, dst_dir)[source]¶
- Attempt to download the image at url to the directory specified by dst_dir. The downloaded file is named after its MD5 hash to ensure uniqueness. If a file already exists on disk (i.e., has been previously downloaded), it is not rewritten.

- Parameters
- url (str) – Image URL.
- session (aiohttp.ClientSession) – aiohttp session.
- dst_dir (pathlib.Path) – Directory to which the image should be downloaded.
 
- Return type
- dict
- Returns
- A dict containing information on the download attempt and, if the download was successful, image metadata:

      {
          "status": {
              "get": HTTP response code,
              "exists": whether the image already exists on disk (bool),
              "valid": whether the file is a valid image file,
          },
          "image": {
              "url": image download URL,
              "filename": downloaded image filename,
              "md5_hash": file MD5 hash,
              "size": filesize on disk,
          },
      }
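A hedged example of calling download_image() directly (hypothetical image URL; during normal scraping this is done via ScraperManager.download_image()):

```python
import asyncio
import pathlib

import aiohttp

import proboards_scraper


async def grab(url: str, dst_dir: pathlib.Path) -> dict:
    session = aiohttp.ClientSession()
    try:
        result = await proboards_scraper.download_image(url, session, dst_dir)
    finally:
        await session.close()

    # The "image" metadata is only meaningful when the download succeeded.
    if result["status"].get("valid"):
        image = result["image"]
        print(f"{image['filename']} ({image['size']} bytes)")
    return result


asyncio.run(grab("https://example.com/avatar.png", pathlib.Path("site/images")))
```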
 
- proboards_scraper.get_chrome_driver()[source]¶
- Returns an instance of a Selenium Chrome driver with the headless option set to True.

- Return type
- selenium.webdriver.Chrome
- Returns
- Headless Chrome driver. 
 
- proboards_scraper.get_login_cookies(home_url, username, password, driver=None, page_load_wait=1)[source]¶
- Logs in to a Proboards account using Selenium and returns the cookies from the authenticated login session.

- Parameters
- home_url (str) – URL for the Proboards forum homepage.
- username (str) – Login username.
- password (str) – Login password.
- driver (Optional[selenium.webdriver.Chrome]) – Selenium Chrome driver (optional).
- page_load_wait (int) – Time (in seconds) to wait to allow the page to load.
 
- Return type
- List[dict]
- Returns
- A list of dicts, where each dict corresponds to a cookie, from the Selenium Chrome driver. 
 
- proboards_scraper.get_login_session(cookies)[source]¶
- Get an authenticated aiohttp session using the cookies provided.

  This is achieved by converting cookies from a Selenium driver session to http module Morsels (see http.cookies.Morsel), which can be added to the aiohttp session’s cookie jar.

- Parameters
- cookies (List[dict]) – A list of dicts as returned by get_login_cookies(), i.e., from a Selenium driver session.
- Return type
- aiohttp.ClientSession
- Returns
- An aiohttp session with the given cookies in its cookie jar.
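The login helpers compose as follows (URL and credentials are placeholders); the resulting session can then be passed to get_source() or to a ScraperManager:

```python
import proboards_scraper

driver = proboards_scraper.get_chrome_driver()

# Log in via Selenium and collect the session cookies.
cookies = proboards_scraper.get_login_cookies(
    "https://yoursite.proboards.com/",  # forum homepage (placeholder)
    "my_username",                      # placeholder credentials
    "my_password",
    driver=driver,
)

# Transfer the cookies to an authenticated aiohttp session.
session = proboards_scraper.get_login_session(cookies)
```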
 
- async proboards_scraper.get_source(url, session)[source]¶
- Get page source of a URL.

- Parameters
- url (str) – URL to visit.
- session (aiohttp.ClientSession) – aiohttp session.
 
- Return type
- bs4.BeautifulSoup
- Returns
- Page source. 
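A minimal sketch of fetching and parsing a page with get_source() (unauthenticated session, placeholder URL):

```python
import asyncio

import aiohttp

import proboards_scraper


async def fetch_title(url: str):
    session = aiohttp.ClientSession()
    try:
        # Returns a bs4.BeautifulSoup object, so the usual BeautifulSoup
        # API (find, find_all, etc.) applies.
        soup = await proboards_scraper.get_source(url, session)
    finally:
        await session.close()
    return soup.title.string if soup.title else None


print(asyncio.run(fetch_title("https://yoursite.proboards.com/")))
```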
 
- proboards_scraper.run_scraper(url, dst_dir='site', username=None, password=None, skip_users=False, no_delay=False)[source]¶
- Main function that runs the scraper and calls the appropriate async functions/methods. This is the only function that needs to be called to actually run the scraper (with all the default settings).

- Parameters
- url (str) – URL of the page to scrape.
- If the URL is that of the forum homepage (e.g., https://yoursite.proboards.com/), the entire site (including users, shoutbox, category/board/thread/post content, etc.) will be scraped.
- If it is the URL for the members page (e.g., https://yoursite.proboards.com/members), only the users will be scraped. 
- If it is the URL for a specific user profile (e.g., https://yoursite.proboards.com/user/10), only that particular user will be scraped. 
- If it is the URL for a board (e.g., https://yoursite.proboards.com/board/3/board-name), only that particular board and its threads/posts/sub-boards will be scraped. 
- If it is the URL for a thread (e.g., https://yoursite.proboards.com/thread/1234/thread-title), only that particular thread and its posts will be scraped. 
 
- dst_dir (pathlib.Path) – Directory in which to place the resulting files. The database file is written to <dst_dir>/forum.db and image files are saved to <dst_dir>/images.
- username (Optional[str]) – Username for login.
- password (Optional[str]) – Password for login.
- skip_users (bool) – Skip scraping/adding users from the forum members page (only applies if the forum homepage is provided for url).
- no_delay (bool) – Do not add a delay between subsequent requests (see ScraperManager for more information). Note that this may result in request throttling.
 
- Return type
- None
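For example, scraping an entire forum into a scraped/ directory might look like this (placeholder URL and credentials; username and password are optional per the signature above):

```python
import proboards_scraper

# Scrape the whole forum: the database is written to scraped/forum.db and
# downloaded images are saved to scraped/images.
proboards_scraper.run_scraper(
    "https://yoursite.proboards.com/",
    dst_dir="scraped",
    username="my_username",
    password="my_password",
)
```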