from selenium import webdriver
import chromedriver_binary
from selenium.webdriver.common.by import By # for button/element selection
import pandas as pd # optional -- for storing any scraped results to dataframes

driver = webdriver.Chrome() # open the automated browser window. Important: leave it open.
While R works fairly well for scraping basic HTML (packages rvest, httr, etc.), you may run into issues when trying to scrape material from websites that rely on JavaScript.
One approach is to create a 'headless browser' that navigates through sites as if a real human were operating it.
I am working with Selenium on a Windows computer; some steps may differ on Mac.
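A side note before the installation steps: webdriver.Chrome() as used above opens a visible browser window. If you want Chrome to run truly headless (no window at all), you can pass it an options object first. A minimal sketch, assuming a reasonably recent Chrome and Selenium 4:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")  # newer Chrome headless mode; older versions use "--headless"
driver = webdriver.Chrome(options=opts)  # browser runs with no visible window
```

Everything else in this post works the same either way; a visible window is actually handy while debugging, since you can watch the driver click around.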
First installation:
The first time you run this, you will need to install some packages for Python.
Make sure "pip" is installed (via the computer's command shell).
Install the 'selenium' module via the command shell: https://selenium-python.readthedocs.io/installation.html.
e.g., on Windows I would open the command shell and run:
cd C:\Users\katie\AppData\Local\Programs\Python\Python39\Scripts
to set the directory, and
pip install selenium
to install Selenium. Since the script above also imports chromedriver_binary, you will likely need pip install chromedriver-binary as well (it supplies the ChromeDriver executable; pick the version that matches your installed Chrome).
- Go back to the computer's command shell and install PhantomJS: https://phantomjs.org/download.html (download from the web and then run the installation).
The installation creates a test .js script, 'hellophantom.js', in the same folder. Close and re-open the command shell, reset the directory (cd), and then run the test script:
phantomjs hellophantom.js
(Note: PhantomJS development has been suspended, and recent Selenium versions no longer support it, which is why the examples here drive Chrome instead.)
Running the browser in Python:
Then switch to a Python notebook. The basics of opening the browser window are in the setup code at the top of this post.
Navigate the 'driver' to a page:
myurl = "https://www.cbc.ca/search"
driver.get(myurl)
Use the browser to find text fields and input text:
# input text to a text field
field1 = driver.find_element(By.ID, "gn-search") # 'field1' can be any name of your choice
field1.send_keys("we charity") # input text with send_keys()
Use the browser to find and click on buttons:
# navigate to a new page:
myurl = "https://www.proquest.com/canadiannews/results/43229EBC6BBF4EB0PQ/1?accountid=14771"
driver.get(myurl)
# Click the 'accept all' button, based on its ID in the HTML:
acceptbutton = driver.find_element(By.ID, "onetrust-accept-btn-handler")
acceptbutton.click()
# navigate to a different page based on link text
# (Make sure the browser window is large enough to fully show the link text)
button2 = driver.find_element(By.LINK_TEXT, "Go to start page")
button2.click()
This is essentially it. You can search around to find how to select other items, like check boxes and radio buttons. (See below for how to get the full list of options.)
To get a page element's ID, XPath, HTML tag, or attribute, you can use the Inspector Gadget extension for Chrome, or right-click and 'View page source' to see the full HTML.
Capitalization must match exactly when selecting a web element (an element on the page). Function names also seem to change between Python/Selenium versions without the changes being well documented, so if older code solutions are not working, try substituting "." for "_" or different capitalization options.
To see all options for navigating the driver:
print(dir(driver))
get() (navigate to a page) and send_keys() (input text to an element) are my most frequently used options.
To see all the button/element selection options, check:
print(dir(By))
Of these, "ID", "CLASS_NAME", "XPATH", "LINK_TEXT", and "PARTIAL_LINK_TEXT" are my most frequently used options.
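Since XPath selection is listed above but not demonstrated elsewhere in this post, here is what it looks like. The XPath string below is a hypothetical example; copy the real one from your browser's inspector (right-click an element, Inspect, then Copy > Copy XPath):

```python
# select a link by XPath (hypothetical path -- substitute the one copied from the inspector)
button3 = driver.find_element(By.XPATH, "//div[@id='results']//a[1]")
button3.click()
```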
Scraping a list of links
You will probably need to create a custom loop, or several (nested or sequential), to navigate through longer lists of individual page links, such as those found on several pages of search results.
The basic code once you are on a page is:
all_links_on_pg = driver.find_elements(By.TAG_NAME, "a") # grab all links (html tag 'a')
link_hrefs = [x.get_attribute("href") for x in all_links_on_pg] # urls
link_hrefs2 = ['False' if v is None else v for v in link_hrefs] # replace None's with a string
Or, looped (you will first need a list of search results pages' URLs, not shown):
# searchpg_urls is a list of strings with search page URLs, e.g.:
searchpg_urls = ["https://nationalpost.com/news/", "https://www.theglobeandmail.com/"]
linklist123 = [] # blank list to save results from loop

for x in searchpg_urls:
    driver.get(x)
    # Optional: wait by seconds, or until an element is fully loaded, e.g.
    # WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "a")))
    all_links_on_pg = driver.find_elements(By.TAG_NAME, "a")
    link_hrefs = [x.get_attribute("href") for x in all_links_on_pg] # urls
    link_hrefs2 = ['False' if v is None else v for v in link_hrefs] # replace None's
    linklist123.append(link_hrefs2) # append to list
# You can then do any subsetting on your full list of links, using grepl-type commands

import numpy as np
fn2 = np.vectorize(int) # custom function: coerce each element to integer

# the original list:
link_hrefs = [item for sublist in linklist123 for item in sublist] # unlist linklist123

# Keep only URLs containing "fulltext"
link_grepls = ["fulltext" in x for x in link_hrefs]
link_nos = np.where(np.array(link_grepls) == True) # which ones = T
link_ints = [fn2(x) for x in link_nos] # np.vectorize each to int
link_ints_auto = link_ints[0].tolist() # to list of integers
doclinks = [link_hrefs[i] for i in link_ints_auto]

# require 'docview' in the link url
link_grepls = ["docview" in x for x in doclinks]
link_nos = np.where(np.array(link_grepls) == True) # which = T (array)
link_ints_auto = link_nos[0].tolist() # to list of integers
doclinks2 = [doclinks[i] for i in link_ints_auto]

# remove links containing 'fulltextPDF'
link_grepls = ["fulltextPDF" in x for x in doclinks2]
link_nos = np.where(np.array(link_grepls) == False) # now keep FALSE
link_ints_auto = link_nos[0].tolist() # to list of integers
doclinks1 = [doclinks2[i] for i in link_ints_auto]
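If you prefer to skip the NumPy detour, the same three grepl-style filters can be written as plain list comprehensions with the same result. A small self-contained sketch, using made-up URLs in place of the scraped link list:

```python
# Sample input standing in for the scraped link list (hypothetical URLs)
link_hrefs = [
    "https://example.com/docview/111/fulltext",
    "https://example.com/docview/222/fulltextPDF",
    "https://example.com/about",
]

# Keep only URLs containing "fulltext"
doclinks = [x for x in link_hrefs if "fulltext" in x]
# require 'docview' in the link url
doclinks2 = [x for x in doclinks if "docview" in x]
# remove links containing 'fulltextPDF'
doclinks1 = [x for x in doclinks2 if "fulltextPDF" not in x]

print(doclinks1)  # only the first sample link survives all three filters
```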
Optionally, you can save your URLs to file now:
linkdf = pd.DataFrame(doclinks1) # 1 column
linkdf.to_csv("currentlinks.csv", encoding='utf-8')
If you already have the URLs in a file, you can re-open it with:
import csv
with open('currentlinks.csv', newline='') as f:
    reader = csv.reader(f)
    doclinks1 = [tuple(row) for row in reader] # tuples; as lists this would be nested

doclist = [x[1] for x in doclinks1] # 2nd element of each tuple (the url column) to a list
doclist = doclist[1:] # cut out the header row
doclinks1 = doclist
Scraping text off of each webpage:
# "doclinks1" is a list of urls to be scraped (list of string entries)
# Loop through doc links and save the full text (tag 'body' in this instance)
tuple1 = [("url1", "text1")] # create a blank row
df = pd.DataFrame(tuple1) # create a blank df to append results to

for x in doclinks1: # list of urls
    driver.get(x) # navigate to the url in the browser
    el = driver.find_element(By.TAG_NAME, 'body') # select all of the page's text
    pagetext = el.text # save it as a python object
    doclink = str(x) # to string format
    doc_tuple = [(doclink, pagetext)] # this document's url and text
    df = pd.concat([df, pd.DataFrame(doc_tuple)]) # append to the dataframe
    # (df.append() was removed in pandas 2.0; pd.concat() is the replacement)
To save these results to file:
df.to_csv("articles1.csv", encoding='utf-8')
That’s it!
Some more complications:
To input a username and password:
You can type them as strings:
uoft_user = "username123"
uoft_pw = "password1234"
And input them when relevant:
user1 = driver.find_element(By.ID, "username")
pwfield = driver.find_element(By.ID, "password")
login1 = driver.find_element(By.CLASS_NAME, "btn")
user1.send_keys(uoft_user) # fill in the text box (uoft_user defined above)
pwfield.send_keys(uoft_pw) # fill in the text box
login1.click() # click on the button
Or, for more security, save the username and password to your computer's environment variables (or a .env file, via the python-dotenv package), and then load them in Python with:
import os
from dotenv import load_dotenv
load_dotenv()
uoft_user = os.environ.get('uoft_usernm') # its name in the computer's environment variables
uoft_pw = os.environ.get('uoft_pw')
You can then input the strings in the exact same way:
user1 = driver.find_element(By.ID, "username")
pwfield = driver.find_element(By.ID, "password")
login1 = driver.find_element(By.CLASS_NAME, "btn")
user1.send_keys(uoft_user) # fill in the text box
pwfield.send_keys(uoft_pw) # fill in the text box
login1.click() # click on the button
To make the web driver wait for an element to fully load, or for a time-based wait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
and, e.g.:
# Navigate to the search page
driver.get("https://www.proquest.com/canadiannews/advanced")
# wait up to 30 seconds, or until the element is loaded
elem = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "SourceType_Newspapers"))
)
To select a checkbox/radio button:
fulltextbx = driver.find_element(By.ID, "fullTextLimit")
fulltextbx.click()
To select a checkbox only if a condition is met (here, if the box is not already checked):
fulltextbx = driver.find_element(By.ID, "fullTextLimit")
if not fulltextbx.is_selected():
    fulltextbx.click()
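One last housekeeping step that is easy to forget: when you are finished scraping, close the browser and end the driver session, otherwise orphaned browser and driver processes can pile up in the background:

```python
driver.quit()  # closes every window opened by this driver and ends the session
```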
Hope this helps!