Why?
Web scrapping has made my life SO MUCH EASIER. Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned. This makes it extremely difficult if not impossible to reformat information into a desirable format. Over a few years, I've found several (nearly) failproof techniques to help me out, and now I'd like to pass them on.
I'm going to walk you through the process of converting a web-only book to a pdf. Feel free to replicate/modify this for use in other circumstances! If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!
Reproducibility/Applicability?
The example I'm outlining is from a website which provides study guides for a monthly fee (to protect their security I'm excluding specific URL's). Despite the potential lack of reproducibility, this guide should stay quite useful, as I'm outlining several flaws/hickups that come up for any similar project along the way!
Mistakes to Make?
I've made several mistakes when trying to web scrape for limited access information. Each mistake consumed large amounts of time and energy, so here they are:
- Using AutoHotKey or similar to directly affect the mouse/keyboard
- This seems effective, however, it is extremely dodgy and isn't reproducible
- Load all pages and then export a HAR file
- HAR files don't contain any actual data
- HAR files take ages to load in any text editor
- Attempt to use GET/HEAD requests
- Majority of pages will randomly assign tokens and other authorization approaches which are incredibly hard to reverse engineer
- Code can never be reproduced/a different approach will be needed
Slow Progress
It seems easy/quick to write a 300 line short script for web scrapping these websites, but they are always more difficult than that. Here are potential hurdles with solutions:
- Browser profile used by Selenium changing
- Programmatically find the profile
- Not knowing how long to wait for a link to load
- Detect when the link isn't equal to the current one
- Or use browser JavaScript (where possible, described more below)
- Needing to find information about the current web page's content
- Look at potential JavaScript functions and URL's
- Restarting a long script when it fails
- Reduce the number of lookups for files
- Copy files to predictable locations
- Before beginning doing anything complex check these files
- Not knowing what a long script is up to
- Print any necessary output (only for that which takes considerable time and doesn't have another metric)
Code
Preperation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from PIL import Image
from natsort import natsorted
import time
import re
import os
import shutil
import img2pdf
import hashlib
driver = webdriver.Firefox()
cacheLocation = driver.capabilities['moz:profile'] + '/cache2/entries/'
originalPath = os.getcwd()
baseURL = 'https://edunlimited.com'
Loading Book
driver.get(loginURL)
driver.get(bookURL)
wait.until(lambda driver: driver.current_url != loginURL)
Get Metadata
Quite often it is possible to find JavaScript functions which are used to provide useful information. There are a few ways you may go about doing this:
- View the page's HTML source (right-click 'View Page Source')
- Use the web console
bookTitle = driver.execute_script('return app.book')
bookPages = driver.execute_script('return app.pageTotal')
bookID = driver.execute_script('return app.book_id')
Organize Files
Scripts often don't perform as expected, and can sometimes take long periods of time to complete. Therefore it's quite liberating to preserve progress throughout the script's iterations. One good method to achieve this is keeping organized!
if not os.path.exists(bookTitle):
os.mkdir(bookTitle)
if len(os.listdir(bookTitle)) == 0:
start = 0
else:
start = int(natsorted(os.listdir(bookTitle), reverse=True)[0].replace('.jpg', ''))
driver.execute_script('app.gotoPage(' + str(start) + ')')
os.chdir(bookTitle)
Loop Through the Book
Images are always stored in the cache, so when all else fails, just use this to your advantage!
This isn't easy though, first of we need to load the page and then we need to somehow recover it!
To make sure we always load the entire page, there are two safety measures in place:
- Waiting for the current page to load before moving to the next
- Reloading the page if it fails to load
Getting these two to work requires functions which guarantee completion (JavaScript or browser responses), and fail-safe waiting timespans. Safe time spans are trial and error, but they usually seem to work best between 0.5 to 5 seconds.
Recovering specific data directly from the hard drive's cache is a relatively obscure topic. The key is to first locate a download link (normally easy as it doesn't have to work). Then run SHA1, Hex Digest and a capitalizing function on the URL, which produces the final filename (it isn't just one of the above security algorithms, as older sources lead you to believe, but both).
On a final note, make sure to clean your data (removing the alpha channel from PNG images here) now instead of afterwards, as it reduces the number of loops used in the code!
for currentPage in range(start, bookPages - 1):
# Books take variable amounts of loading time
while driver.execute_script('return app.loading') == True:
time.sleep(0.5)
# The service is sometimes unpredictable
# So pages may fail to load
while (driver.execute_script('return app.pageImg') == '/pagetemp.jpg'):
driver.execute_script('app.loadPage()')
time.sleep(4)
location = driver.execute_script('return app.pageImg')
# Cache is temporary
pageURL = baseURL + location
fileName = hashlib.sha1((":" + pageURL).encode('utf-8')).hexdigest().upper()
Image.open(cacheLocation + fileName).convert('RGB').save(str(currentPage) + '.jpg')
driver.execute_script('app.nextPage()')
Convert to PDF
We can finally get that one convenient PDF file
finalPath = originalPath + '/' + bookTitle + '.pdf'
# Combine into a single PDF
with open(finalPath, 'wb') as f:
f.write(img2pdf.convert([i for i in natsorted(os.listdir('.')) if i.endswith(".jpg")]))
Remove Excess Images
os.chdir(originalPath)
shutil.rmtree(bookTitle)
Cover image sourced from here
Thanks for READING!
This is basically the first code-centric post I've made on my blog, so I hope it has been useful!
--- Until next time, I'm signing out