Using Real Browsers
The technique shown in the previous chapter involved using requests to download HTML and BeautifulSoup to parse it. You'll find that this technique frequently fails, for one of two reasons:
- Many websites load content using JavaScript after the initial HTML loads. requests can only fetch that initial HTML and isn't able to execute JavaScript.
- The website you're trying to scrape might have measures in place to block bots.
One method to (sometimes) get around these issues is to automate a real browser like Chrome or Firefox. This means that rather than just downloading the HTML content of a site, you actually open Chrome or Firefox on your computer and control it with a script. This tends to be a bit slower than downloading and parsing HTML, but gives you the ability to scrape otherwise-difficult sites, and allows you to do fun things like take screenshots and fill out forms.
The guide covers two browser automation libraries: selenium and puppeteer.
Selenium
Selenium is a library that allows you to control Chrome, Firefox and Safari from a variety of programming languages.
To install:
pip3 install selenium
You also need to install a "driver" for the browser you intend to automate. This is a kind of bridge application that allows selenium to communicate with a given browser.
For Chrome (on Mac):
brew install --cask chromedriver
For Firefox:
brew install geckodriver
Scraping with selenium is very similar to BeautifulSoup: you create a webdriver object, visit a website, and then use CSS selectors to extract text and attributes from HTML elements. Note that the method to query by selector is find_elements_by_css_selector for multiple elements, and find_element_by_css_selector for just one element. To extract attributes like href or src, use the get_attribute method.
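For instance, a quick sketch of attribute extraction (assuming a driver that's already open on a page containing links) might look like this:
# assuming `driver` is already on a page with links,
# select every anchor element and print its href attribute
links = driver.find_elements_by_css_selector("a")
for link in links:
    print(link.get_attribute("href"))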
Here's a simple example script that gets product names from Alibaba, searching for the phrase "labor camp":
# import the selenium webdriver
from selenium import webdriver
# open chrome
driver = webdriver.Chrome()
# visit alibaba
driver.get("https://www.alibaba.com/products/labor_camp.html?IndexArea=product_en&page=1")
# select h4 elements and print their text
items = driver.find_elements_by_css_selector("h4")
for item in items:
    print(item.text)
# quit the browser when you're done
driver.quit()
Please note that you must explicitly tell selenium to close the browser when you are done!
Headless Mode
Selenium can also operate in "headless" mode, which means that the browser runs without a graphical interface and never appears on your screen. I find that I prefer this mode once I've debugged my code and everything is working as intended. To use headless mode, you must instantiate your webdriver object with some additional options:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# create an options object and enable headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
# pass the options in when opening chrome
driver = webdriver.Chrome(options=chrome_options)
Screenshots
Selenium allows you to change the window size of the browser and take screenshots:
# take a square screenshot
driver.set_window_size(1000, 1000)
driver.save_screenshot("screenshot.png")
Forms
You can also fill out forms, which can allow you to log in to websites.
# type "karl" into a "username" input
user = driver.find_element_by_css_selector(".username")
user.send_keys("karl")
# type "capit@l" into a password field
passw = driver.find_element_by_css_selector(".password")
passw.send_keys("capit@l")
# click on the submit button
submit = driver.find_element_by_css_selector(".submit")
submit.click()
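As an aside: if a form has no clickable submit button, you can often submit it by sending an Enter keypress instead. A sketch using selenium's Keys helper, reusing the passw element from above:
from selenium.webdriver.common.keys import Keys
# press Enter in the password field to submit the form
passw.send_keys(Keys.RETURN)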
JavaScript
Finally, selenium can execute JavaScript code.
# javascript to replace all headlines (h1 through h5) with "Lol"
script = """
let headlines = document.querySelectorAll("h1, h2, h3, h4, h5");
for (let h of headlines) {
    h.textContent = "Lol";
}
"""
# execute the script
driver.execute_script(script)
# save a screenshot
driver.save_screenshot("lol.png")
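execute_script can also return values from the page back to your python script, which is handy for grabbing data that's easier to compute in JavaScript. A small sketch:
# javascript that returns a value back to python
title = driver.execute_script("return document.title")
print(title)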
Selenium Examples
- Alibaba Scraper: scrapes Alibaba for a search query, downloads product images, and saves product information into a json file called product.json
- Fox News Lol: replaces headlines with "Lol" on foxnews.com and saves a screenshot
- Breitbart Comments: scrapes Breitbart for headlines and sorts them by total user comments
Puppeteer
Puppeteer is a Node.js library written by Google for automating Chrome. There is also an (unofficial) python version of it called pyppeteer.
Since the rest of this guide uses python, I'll cover pyppeteer here (although you are most welcome to use the original Node.js version if you prefer).
Scraping with pyppeteer involves a slightly different workflow than selenium. Notably, it makes use of asynchronous function calls to load pages, and requires that you execute JavaScript functions to extract content.
Install pyppeteer with pip:
pip3 install pyppeteer
Here's a basic example that scrapes product names from Alibaba matching the search term "labor camp":
import asyncio
from pyppeteer import launch

async def main():
    # create a new browser object and open a blank page
    browser = await launch()
    page = await browser.newPage()
    # visit a url
    url = 'https://www.alibaba.com/products/labor_camp.html?IndexArea=product_en&page=1'
    await page.goto(url)
    # querySelectorAll() selects elements matching a css query
    items = await page.querySelectorAll(".organic-gallery-offer-outter")
    # loop over elements
    for product in items:
        # find h4 tags inside item listings
        name_element = await product.querySelector("h4")
        # extract the text content
        name = await page.evaluate('(element) => element.textContent', name_element)
        print(name)
    # close the browser
    await browser.close()

# run the main function
asyncio.get_event_loop().run_until_complete(main())
Note that to actually extract text or attributes from elements you must use the page.evaluate function to execute JavaScript inside the browser, passing in elements that you have selected using querySelector or querySelectorAll. In the example above, '(element) => element.textContent' is a JavaScript function that returns the textContent attribute (the text) of the passed element, in this case name_element.
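The same evaluate pattern works for attributes. As a rough sketch, assuming you've already selected a link element inside main() (link_element is a hypothetical name):
# extract the href attribute of a previously-selected element
href = await page.evaluate('(element) => element.getAttribute("href")', link_element)
print(href)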
requests_html
requests_html is a convenient library that combines requests, pyquery and pyppeteer. It provides less control than just using pyppeteer directly, but is extremely convenient for certain use cases.
(Examples to come...)
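In the meantime, here's a minimal sketch of the typical requests_html workflow, using a placeholder URL and selector:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com")
# render() runs the page's javascript in a headless browser
# (it downloads chromium the first time you call it)
r.html.render()
# find() takes a css selector and returns matching elements
for item in r.html.find("h4"):
    print(item.text)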