Using Real Browsers
The technique shown in the previous chapter involved using requests to download HTML and BeautifulSoup to parse it. You'll find that this technique frequently fails, for one of two reasons:
- The website you're trying to scrape might load much of its content dynamically with JavaScript, so the HTML you download doesn't contain the material you're after.
- The website you're trying to scrape might have some measures in place to block bots.
One method to (sometimes) get around these issues is to automate a real browser like Chrome or Firefox. This means that rather than just downloading the HTML content of a site, you actually open Chrome or Firefox on your computer and control it with a script. This tends to be a bit slower than downloading and parsing HTML, but gives you the ability to scrape otherwise-difficult sites, and allows you to do fun things like take screenshots and fill out forms.
Selenium is a library that allows you to control Chrome, Firefox and Safari from a variety of programming languages. Install it with pip:
pip3 install selenium
You also need to install a "driver" for the browser you intend to automate. This is a kind of bridge application that allows selenium to communicate with a given browser.
For Chrome (on Mac):
brew install --cask chromedriver
For Firefox (on Mac):
brew install geckodriver
Scraping with selenium is very similar to working with BeautifulSoup: you create a webdriver object, visit a website, and then use CSS selectors to extract text and attributes from HTML elements. Note that the method to query by selector is find_elements(By.CSS_SELECTOR, ...) for multiple elements, and find_element(By.CSS_SELECTOR, ...) for just one element (older versions of selenium spelled these find_elements_by_css_selector and find_element_by_css_selector). To extract attributes like src, use the get_attribute method.
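For example, here's a minimal sketch that prints the src attribute of every image on a page (example.com is just a placeholder URL):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")

# select every img element and print its src attribute
images = driver.find_elements(By.CSS_SELECTOR, "img")
for image in images:
    print(image.get_attribute("src"))

driver.close()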
Here's a simple example script that gets product names from Alibaba, searching for the phrase "labor camp":
# import the selenium webdriver and the By selector helper
from selenium import webdriver
from selenium.webdriver.common.by import By

# open chrome
driver = webdriver.Chrome()

# visit alibaba
driver.get("https://www.alibaba.com/products/labor_camp.html?IndexArea=product_en&page=1")

# select h4 elements
items = driver.find_elements(By.CSS_SELECTOR, "h4")
for i in items:
    print(i.text)

driver.close()
Please note that you must explicitly tell selenium to close the browser when you are done!
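If your script crashes before it reaches driver.close(), the browser stays open. Here's a minimal sketch of one way to guarantee cleanup with try/finally (example.com is again just a placeholder); quit() is like close() but also shuts down the driver session itself:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://www.example.com")
    print(driver.title)
finally:
    # this runs even if the scrape above throws an error
    driver.quit()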
Selenium can also operate in "headless" mode, which means that the browser will run without a graphical interface and never appear on your screen. I find that I prefer this mode once I've debugged my code and everything is working as intended. To use headless mode, you must instantiate your webdriver object with some additional parameters:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
Selenium allows you to change the window size of the browser and take screenshots:
# take a square screenshot
driver.set_window_size(1000, 1000)
driver.save_screenshot("screenshot.png")
You can also fill out forms, which can allow you to log in to websites.
# type "karl" into a "username" input user = driver.find_element_by_css_selector(".username") user.send_keys("karl") # type "capit@l" into a password field passw = driver.find_element_by_css_selector(".password") passw.send_keys("capit@l") # click on the submit button submit = driver.find_element_by_css_selector(".submit") submit.click()
Here are a few complete example scripts:
- Alibaba Scraper: scrapes Alibaba for a search query, downloads product images, and saves product information into a json file
- Fox News Lol: replaces headlines with "Lol" on foxnews.com and saves a screenshot (a rough sketch of this one follows the list)
- Breitbart Comments: scrapes Breitbart for headlines and sorts them by total user comments
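Here's a rough sketch of what the Fox News Lol script might look like. The ".title" selector is an assumption, so inspect the page to find the elements that actually hold headlines:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.foxnews.com")

# replace the text of every headline with "Lol"
# (".title" is a guess at the headline selector)
headlines = driver.find_elements(By.CSS_SELECTOR, ".title")
for headline in headlines:
    driver.execute_script("arguments[0].textContent = 'Lol';", headline)

# save the evidence
driver.save_screenshot("lol.png")
driver.close()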
Puppeteer
Puppeteer is another browser-automation tool, built for controlling Chrome from Node.js; pyppeteer is an unofficial Python port of it. Since the rest of this guide uses python, I'll cover pyppeteer here (although you are most welcome to use the original nodejs version if you prefer).
Install pyppeteer with pip:
pip3 install pyppeteer
Here's a basic example that scrapes product names from Alibaba matching the search term "labor camp":
import asyncio
from pyppeteer import launch

async def main():
    # create a new browser object and open a blank page
    browser = await launch()
    page = await browser.newPage()

    # visit a url
    url = 'https://www.alibaba.com/products/labor_camp.html?IndexArea=product_en&page=1'
    await page.goto(url)

    # querySelectorAll() selects elements matching a css query
    items = await page.querySelectorAll(".organic-gallery-offer-outter")

    # loop over elements
    for product in items:
        # find h4 tags inside item listings
        name_element = await product.querySelector("h4")

        # extract the text content
        name = await page.evaluate('(element) => element.textContent', name_element)
        print(name)

    # close the browser
    await browser.close()

# run the main function
asyncio.get_event_loop().run_until_complete(main())
Note that to actually extract text or attributes from elements, you must use the evaluate method in combination with the elements returned by querySelector or querySelectorAll. In the example above, page.evaluate takes a JavaScript function (written as a string) that returns the textContent attribute (the text) of the passed element, in this case the h4 element holding the product name.
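The same pattern works for attributes. Continuing inside the loop from the example above (the "a" selector is just an illustration), you might do something like this:

# grab the first link inside each product listing
link_element = await product.querySelector("a")
# run a JavaScript function that returns the element's href attribute
href = await page.evaluate('(element) => element.getAttribute("href")', link_element)
print(href)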
requests_html
requests_html is a convenient library that combines requests with pyppeteer: you download pages with a requests-style interface, and can optionally render their JavaScript in a headless browser. It provides less control than just using pyppeteer directly, but is extremely convenient for certain use cases.
(Examples to come...)
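In the meantime, here's a minimal sketch of what requests_html looks like in practice, reusing the Alibaba search from above. Note that render() downloads Chromium the first time it runs:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.alibaba.com/products/labor_camp.html?IndexArea=product_en&page=1")

# execute the page's JavaScript in a headless browser (powered by pyppeteer)
r.html.render()

# find() takes a css selector, much like the examples above
for item in r.html.find("h4"):
    print(item.text)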