Web scraping with a headless browser in Python

Python
Web scraping
JavaScript
Author

Catherine Moez

Published

June 28, 2023

The Headless Browser

While R works fairly well for scraping basic HTML (packages rvest, httr, etc.), you may run into issues when trying to scrape material from websites that render their content with JavaScript.

One approach is to drive a ‘headless browser’ that navigates through sites as if a real human were operating it.

I am working with Selenium on a Windows computer; some steps may differ on a Mac.


First installation:

The first time you run this, you will need to download some packages for Python.

  1. Make sure “pip” is installed (you can check from your computer’s command shell with pip --version).

  2. Install the ‘selenium’ module via the command shell (see https://selenium-python.readthedocs.io/installation.html).

e.g., on Windows I would open the command shell and run:

cd C:\Users\katie\AppData\Local\Programs\Python\Python39\Scripts

To set the directory, and

pip install selenium

To install Selenium.

  3. Go back to the computer’s command shell and install PhantomJS: https://phantomjs.org/download.html (download from the web, then run the installation).

This creates a test .js script, ‘hellophantom.js’, in the same folder. Close and re-open the command shell, cd back into that folder, and then run the script with:

phantomjs hellophantom.js
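The code in the next section also imports the chromedriver_binary package, which downloads a ChromeDriver executable and adds it to the PATH so Selenium can drive Chrome. If you take that route (this is an assumption about your setup, and you may need the release matching your installed Chrome version), it installs the same way:

pip install chromedriver-binary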


Running the browser in Python:

Then switch to a Python notebook.

For the basics of opening the browser window:

from selenium import webdriver 
import chromedriver_binary # adds a matching chromedriver to the PATH
driver = webdriver.Chrome() # open the browser window. Important: leave it open.
import pandas as pd # optional -- for storing any scraped results in dataframes
from selenium.webdriver.common.by import By # for selecting buttons and other elements
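As written, webdriver.Chrome() opens a visible browser window, which is useful for watching the scrape as it runs. If you would rather run Chrome truly headless (no visible window), a minimal sketch using Chrome options looks like this (the --headless=new flag applies to recent Chrome/Selenium versions; older setups use plain --headless):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new") # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)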

Navigate the ‘driver’ to a page:

myurl = "https://www.cbc.ca/search" 
driver.get(myurl)

Use the browser to find text fields and input text:

# input text to text field
field1 = driver.find_element(By.ID, "gn-search") # 'field1' can be any name of your choice 
field1.send_keys("we charity") # input text with send_keys()
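To actually submit the search, one option is to send an Enter keypress to the same field (this assumes the site submits the form on Enter, which is not guaranteed for every page):

from selenium.webdriver.common.keys import Keys
field1.send_keys(Keys.RETURN) # press Enter to submit the search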

Use the browser to find and click on buttons:

#navigate to a new page: 
myurl = "https://www.proquest.com/canadiannews/results/43229EBC6BBF4EB0PQ/1?accountid=14771" 
driver.get(myurl)
#Click the 'accept all' button, based on its ID in the HTML:
acceptbutton = driver.find_element(By.ID, "onetrust-accept-btn-handler")
acceptbutton.click()
# navigate to a different page based on link text
# (Make sure the browser window is large enough to fully show the link text)
button2 = driver.find_element(By.LINK_TEXT, "Go to start page") 
button2.click()

This is essentially it. You can search around to find how to select other items, like checkboxes or radio buttons (see below for how to get the full list of options).

To get a page element’s ID, XPath, HTML tag, or attribute, you can use the Inspector Gadget extension for Chrome, right-click an element and choose ‘Inspect’ to open the browser’s developer tools, or right-click and ‘View page source’ to see the full HTML.

Capitalization must match exactly when selecting a web element (an element on the page). Selenium also periodically renames functions between versions, so if older code examples are not working, try swapping “.” for “_” or trying different capitalization.
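For example, older tutorials often use the find_element_by_* helpers, which newer Selenium releases have deprecated and removed; the same lookup is now written with By:

# old style (Selenium 3 and earlier):
# element = driver.find_element_by_id("gn-search")
# current style (Selenium 4):
element = driver.find_element(By.ID, "gn-search")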


To see all options for navigating the driver:

print(dir(driver))

get() (navigate to a page) and send_keys() (input text into an element) are my most frequently used options.

To see all the button/element selection options, check:

print(dir(By))

For By, “ID”, “CLASS_NAME”, “XPATH”, “LINK_TEXT”, and “PARTIAL_LINK_TEXT” are my most frequently used options here.
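For illustration, a few of these in use (the class name, XPath, and link text here are hypothetical placeholders, not taken from a real page):

el1 = driver.find_element(By.CLASS_NAME, "search-button") # by CSS class
el2 = driver.find_element(By.XPATH, "//button[@type='submit']") # by an XPath expression
el3 = driver.find_element(By.PARTIAL_LINK_TEXT, "Next") # a link whose text contains 'Next'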


Scraping a list of links

You will probably need to create a custom loop, or several (nested or sequential), to navigate through longer lists of individual page links, such as those found on several pages of search results.

The basic code once you are on a page is:

all_links_on_pg = driver.find_elements(By.TAG_NAME, "a") # grab all links (html tag 'a')
link_hrefs = [x.get_attribute("href") for x in all_links_on_pg] # extract the urls
link_hrefs2 = ['False' if v is None else v for v in link_hrefs] # replace None entries with the string 'False'

Or, looped (you will first need a list of search results pages’ URLs; a placeholder example is included below):

# searchpg_urls is a list of strings with search page URLs (how to build it is not shown), e.g.:
searchpg_urls = ["https://nationalpost.com/news/","https://www.theglobeandmail.com/"]
linklist123 = [] # blank list to save results from loop

for x in searchpg_urls:
    driver.get(x)
    #all_links_on_pg = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "a"))) 
    # Optional waiting by seconds or until element is fully loaded
    all_links_on_pg = driver.find_elements(By.TAG_NAME, "a")
    link_hrefs = [link.get_attribute("href") for link in all_links_on_pg] # extract urls
    link_hrefs2 = ['False' if v is None else v for v in link_hrefs] # replace None entries
    linklist123.append(link_hrefs2) # append this page's links to the running list
# You can then subset your full list of links, using grepl-style filtering (shown below)

# helper: np.vectorize(int) coerces each element of an array to a plain integer
import numpy as np
fn2 = np.vectorize(int)

# flatten the nested list of links from all search pages:
link_hrefs = [item for sublist in linklist123 for item in sublist]

# Keep only URLs containing "fulltext"
link_grepls = ["fulltext" in x for x in link_hrefs] # True/False for each url
link_nos = np.where(np.array(link_grepls) == True) # indices where True
link_ints = [fn2(x) for x in link_nos] # coerce each index array to integers
link_ints_auto = link_ints[0].tolist() # to a flat list of indices
doclinks = [link_hrefs[i] for i in link_ints_auto] # subset the urls

# require 'docview' in the link url
link_grepls = ["docview" in x for x in doclinks]
link_nos = np.where(np.array(link_grepls) == True) # indices where True
link_ints_auto = link_nos[0].tolist() # to a list of integers
doclinks2 = [doclinks[i] for i in link_ints_auto] # subset the urls

# remove links containing 'fulltextPDF'
link_grepls = ["fulltextPDF" in x for x in doclinks2]
link_nos = np.where(np.array(link_grepls) == False) # this time keep the FALSE ones
link_ints_auto = link_nos[0].tolist() # to a list of integers
doclinks1 = [doclinks2[i] for i in link_ints_auto] # final list of article urls
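If you prefer plain Python over the NumPy/grepl-style indexing above, the same three filtering steps can be written as list comprehensions, which should produce the same doclinks1:

doclinks = [x for x in link_hrefs if "fulltext" in x] # keep 'fulltext' links
doclinks2 = [x for x in doclinks if "docview" in x] # require 'docview'
doclinks1 = [x for x in doclinks2 if "fulltextPDF" not in x] # drop the PDF-viewer links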

Optionally, you can save your URLs to file now:

linkdf = pd.DataFrame(doclinks1) # one-column dataframe of urls
linkdf.to_csv("currentlinks.csv", encoding='utf-8')

If you already have the URLs in a file, you can re-open it with:

import csv

with open('currentlinks.csv', newline='') as f:
    reader = csv.reader(f)
    doclinks1 = [tuple(row) for row in reader] # each row becomes a (row index, url) tuple

doclist = [x[1] for x in doclinks1] # keep the second element of each row (the url)
doclist = doclist[1:] # drop the header row
doclinks1 = doclist
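Alternatively, since the file was written with pandas, you could read it back the same way (this assumes the default to_csv layout used above, with the row index as the first column):

linkdf = pd.read_csv("currentlinks.csv", index_col=0) # re-open the saved urls
doclinks1 = linkdf.iloc[:, 0].tolist() # the single data column, back to a plain list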


Scraping text off of each webpage:

# "doclinks1" is a list of urls to be scraped (list of string entries)

# Loop through doc links and save the full text (tag 'body' in this instance)
rows = [] # collect (url, text) tuples here

for x in doclinks1: # list of urls
    driver.get(x) # navigate to the url in the browser
    el = driver.find_element(By.TAG_NAME, 'body') # select the page body (all visible text)
    pagetext = el.text # save it as a python string
    doclink = str(x) # the url, in string format
    rows.append((doclink, pagetext)) # this document's url and text

df = pd.DataFrame(rows, columns=["url", "text"]) # build the dataframe in one step
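If some pages load slowly or occasionally fail, a hedged variation of the same loop adds a short pause and skips URLs that raise an error:

import time

rows = []
for x in doclinks1:
    try:
        driver.get(x)
        time.sleep(2) # brief pause so dynamic content can finish loading
        pagetext = driver.find_element(By.TAG_NAME, 'body').text
        rows.append((str(x), pagetext))
    except Exception as e:
        print("Skipping", x, ":", e) # note the failure and move on
df = pd.DataFrame(rows, columns=["url", "text"])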

To save these results to file:

df.to_csv("articles1.csv", encoding = 'utf-8')
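When you are completely finished scraping, you can close the browser and end the Selenium session with:

driver.quit() # close the browser window(s) and end the session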

That’s it!



Some more complications:

To input a username and password:

You can type them in as strings:

uoft_user = "username123" 
uoft_pw = "password1234"

And input when relevant:

user1 = driver.find_element(By.ID, "username") 
pwfield = driver.find_element(By.ID, "password") 
login1 = driver.find_element(By.CLASS_NAME, "btn") 
user1.send_keys(uoft_user) #defined at start # fill in the text box
pwfield.send_keys(uoft_pw) # fill in the text box
login1.click() # click on the button

Or, for more security, save the username and password to your computer’s environment variables (or a .env file), and then load them in Python with:

import os
from dotenv import load_dotenv
load_dotenv() # loads any variables defined in a .env file in the working directory
uoft_user = os.environ.get('uoft_usernm') # its name in the computer's environment variables
uoft_pw = os.environ.get('uoft_pw')

You can then input the strings in the exact same way:

user1 = driver.find_element(By.ID, "username") 
pwfield = driver.find_element(By.ID, "password") 
login1 = driver.find_element(By.CLASS_NAME, "btn") 
user1.send_keys(uoft_user) #defined at start # fill in the text box
pwfield.send_keys(uoft_pw) # fill in the text box
login1.click() # click on the button


To make the web driver wait for an element to fully load, or for a time-based wait:

from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

and, e.g.:

# Navigate to the search page
driver.get("https://www.proquest.com/canadiannews/advanced")
elem = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "SourceType_Newspapers"))
) # wait up to 30 seconds for the element to be present on the page
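For a simple time-based wait (the second case mentioned above), you can also pause with Python's time.sleep, or set an implicit wait on the driver:

import time
time.sleep(5) # pause the script for 5 seconds
driver.implicitly_wait(10) # or: have the driver retry element lookups for up to 10 seconds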


To select a checkbox/radio button:

fulltextbx = driver.find_element(By.ID, "fullTextLimit") 
fulltextbx.click()

To select a checkbox, if a condition is met (if box is not already checked):

fulltextbx = driver.find_element(By.ID, "fullTextLimit") 
if not fulltextbx.is_selected(): # only click if the box is not already checked
    fulltextbx.click()
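Dropdown menus (HTML select elements) have their own helper class in Selenium; a small sketch (the element ID and option text here are hypothetical placeholders):

from selenium.webdriver.support.ui import Select

dropdown = Select(driver.find_element(By.ID, "sortType")) # hypothetical dropdown id
dropdown.select_by_visible_text("Most recent first") # pick an option by its visible label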

Hope this helps!