Install the prerequisites
sudo apt update
sudo apt install python3 python3-pip python3-venv firefox
On Ubuntu 22.04+ the firefox apt package installs the Snap version, which works with Selenium but has some sandbox quirks. If you hit "profile error" messages, install the Mozilla PPA variant or use the Flatpak (see my Firefox via Flatpak note).
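If you want the deb build instead of the Snap, the commonly documented route is to add the Mozilla Team PPA (sudo add-apt-repository ppa:mozillateam/ppa) and then pin it above Ubuntu's snap-transition package with an apt preferences entry, something like the sketch below (check the PPA's current docs before relying on it):

```
# /etc/apt/preferences.d/mozilla-firefox
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
```

With that pin in place, sudo apt install firefox should pull the PPA's deb build rather than the Snap shim.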
Set up a virtualenv
Keep the project isolated from system Python:
mkdir -p ~/selenium-scraper && cd ~/selenium-scraper
python3 -m venv .venv
source .venv/bin/activate
pip install selenium pandas
Install geckodriver
Modern Selenium (4.6+) can auto-manage the driver if selenium-manager detects a compatible Firefox — you can skip this step and it'll "just work." If it doesn't (common on unusual distros or locked-down CI), install manually:
GECKO_VER=$(curl -s https://api.github.com/repos/mozilla/geckodriver/releases/latest | grep -Po '"tag_name":\s*"\K[^"]*')
curl -L -o /tmp/gecko.tar.gz "https://github.com/mozilla/geckodriver/releases/download/${GECKO_VER}/geckodriver-${GECKO_VER}-linux64.tar.gz"
mkdir -p ./drivers
tar -xzf /tmp/gecko.tar.gz -C ./drivers/
./drivers/geckodriver --version
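The grep -Po pipeline above depends on the exact JSON layout of GitHub's response; if you'd rather not rely on that, the tag can be extracted with Python's json module instead. A small sketch (latest_tag is my name, not part of any tool; the sample string mimics the shape of the /releases/latest response):

```python
import json

def latest_tag(api_response: str) -> str:
    """Extract the release tag (e.g. "v0.34.0") from the body of
    GitHub's /releases/latest API response."""
    return json.loads(api_response)["tag_name"]

# Example with a trimmed-down response body:
sample = '{"tag_name": "v0.34.0", "name": "0.34.0"}'
print(latest_tag(sample))  # v0.34.0
```

You could feed it the same curl output as above, e.g. curl -s ... | python3 -c 'import sys, json; print(json.load(sys.stdin)["tag_name"])'.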
A working example
The original version of this tutorial used Selenium 3's find_element_by_id(...) API. That API was removed in Selenium 4. Here's the equivalent modern script:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time
# Options
opts = Options()
opts.add_argument("--headless") # no GUI; remove while debugging
opts.set_preference("browser.download.folderList", 2)
opts.set_preference("browser.download.dir", "/home/amir/downloads")
opts.set_preference("browser.helperApps.neverAsk.saveToDisk",
                    "application/octet-stream,application/pdf,application/zip")
# If you installed geckodriver manually:
service = Service(executable_path="./drivers/geckodriver")
driver = webdriver.Firefox(service=service, options=opts)
# Or let selenium-manager find it:
# driver = webdriver.Firefox(options=opts)
wait = WebDriverWait(driver, 15)
try:
    # Log in
    driver.get("https://example.com/login")
    wait.until(EC.presence_of_element_located((By.ID, "username")))
    driver.find_element(By.ID, "username").send_keys("your-username")
    driver.find_element(By.ID, "password").send_keys("your-password")
    driver.find_element(By.ID, "rememberme").click()
    driver.find_element(By.NAME, "login").click()

    # Wait for login to complete
    wait.until(EC.url_contains("/dashboard"))

    # Paginate through downloads
    driver.get("https://example.com/downloads")
    while True:
        # Grab all download buttons on this page
        downloads = driver.find_elements(By.CSS_SELECTOR, "form.download-single-form")
        for form in downloads:
            form.submit()
            time.sleep(2)  # be polite; don't hammer

        # Click "Next" if it exists; else stop
        try:
            next_btn = driver.find_element(By.CSS_SELECTOR, "a.next[value='next']")
            next_btn.click()
            wait.until(EC.staleness_of(next_btn))
        except NoSuchElementException:
            break
finally:
    driver.quit()
Key things that changed between Selenium 3 and 4
- find_element_by_id("x") → find_element(By.ID, "x"). Every find_element_by_* method is gone.
- executable_path is now passed through a Service object rather than directly to the webdriver.Firefox(...) constructor.
- WebDriver-level waits (WebDriverWait + expected_conditions) are now the idiomatic way to handle timing — avoid time.sleep() for anything that depends on the page state.
- Selenium Manager (4.6+) can download the right driver automatically when Firefox and Python Selenium versions don't match exactly. Let it, when you can.
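Under the hood, WebDriverWait is essentially a poll loop: call a condition repeatedly until it returns something truthy or a deadline passes. A stripped-down, stdlib-only sketch of that idea (wait_until is my name; the real class adds machinery such as ignored-exception lists):

```python
import time

def wait_until(condition, timeout=15.0, poll=0.5):
    """Poll `condition` (a zero-arg callable) until it returns a truthy
    value, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)

# Toy usage: a "condition" that succeeds on the third poll.
attempts = {"n": 0}
def ready():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until(ready, timeout=5, poll=0.01))  # True
```

This is why explicit waits beat time.sleep(): the loop returns as soon as the condition holds, instead of always burning the full interval.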
Making the scraper robust
The two things that take scrapers from "works on my box" to "works overnight unattended":
Retries and timeouts everywhere. Wrap every page load in a try/except for TimeoutException, retry once, and only fail on the second attempt. Network flakiness is the #1 reason long scrapers die.
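That retry-once pattern is worth putting in one small helper so it isn't copy-pasted around every driver.get(). A stdlib-only sketch (with_retry and retry_on are my names; in the real scraper you'd pass selenium's TimeoutException as retry_on and wrap the page load in a lambda):

```python
import time

def with_retry(action, retries=1, retry_on=(TimeoutError,), backoff=2.0):
    """Run `action` (a zero-arg callable); on one of the `retry_on`
    exceptions, sleep `backoff` seconds and retry up to `retries` more
    times, then re-raise."""
    for attempt in range(retries + 1):
        try:
            return action()
        except retry_on:
            if attempt == retries:
                raise  # final attempt failed: give up
            time.sleep(backoff)

# Toy usage: an action that fails once, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("first attempt times out")
    return "ok"

print(with_retry(flaky, backoff=0.01))  # ok
```

In the scraper that would look like with_retry(lambda: driver.get(url), retry_on=(TimeoutException,)).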
Explicit rate limiting. Don't hit a site as fast as Selenium will let you. One request every 2–5 seconds is generally polite and avoids triggering anti-bot mitigations. If a site has a robots.txt or terms that forbid scraping, respect them — both for ethics and because getting your IP banned halfway through a dataset is deeply annoying.
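The time.sleep(2) in the example script works, but a limiter that tracks when the last request happened is easier to tune and harder to forget on one code path. A minimal sketch (RateLimiter is my name, not a Selenium API):

```python
import time

class RateLimiter:
    """Block so that successive .wait() calls are at least
    `min_interval` seconds apart."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        remaining = self._last + self.min_interval - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Toy usage: two .wait() calls end up at least 0.1s apart.
limiter = RateLimiter(min_interval=0.1)
start = time.monotonic()
limiter.wait()  # first call returns immediately
limiter.wait()  # second call sleeps out the remaining interval
print(time.monotonic() - start >= 0.1)  # True
```

Call limiter.wait() right before each driver.get() or form.submit(); unlike a bare sleep, it only pauses for whatever part of the interval hasn't already elapsed.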
In 2026, if you're starting a new browser-automation project and Selenium isn't already a requirement, Playwright is generally the better choice: faster, better auto-waiting, a nicer API, built-in tracing, and support for Chromium, Firefox, and WebKit from a single package. Selenium is still the right tool when you need W3C WebDriver compatibility (e.g. running against BrowserStack) or when you're extending an existing Selenium codebase.