Install the prerequisites

sudo apt update
sudo apt install python3 python3-pip python3-venv firefox

On Ubuntu 22.04+ the firefox apt package installs the Snap version, which works with Selenium but has some sandbox quirks. If you hit "profile error" messages, install the Mozilla PPA variant or use the Flatpak (see my Firefox via Flatpak note).
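If you go the PPA route, the commonly documented recipe looks like the following. The pin file name is arbitrary, and you should double-check the priority value and package pattern against the PPA page for your Ubuntu release:

```shell
# Add the Mozilla team PPA, which ships a classic deb build of Firefox
sudo add-apt-repository -y ppa:mozillateam/ppa

# Pin the PPA build above the Ubuntu archive's snap-transition package,
# so "apt install firefox" pulls the deb instead of the Snap shim
sudo tee /etc/apt/preferences.d/mozilla-firefox <<'EOF'
Package: firefox*
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
EOF

sudo apt update && sudo apt install firefox
```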

Set up a virtualenv

Keep the project isolated from system Python:

mkdir -p ~/selenium-scraper && cd ~/selenium-scraper
python3 -m venv .venv
source .venv/bin/activate
pip install selenium pandas

Install geckodriver

Modern Selenium (4.6+) bundles Selenium Manager, which can locate your Firefox install and fetch a matching geckodriver automatically — you can skip this step and it'll "just work." If it doesn't (common on unusual distros or locked-down CI), install manually:

GECKO_VER=$(curl -s https://api.github.com/repos/mozilla/geckodriver/releases/latest | grep -Po '"tag_name":\s*"\K[^"]*')
curl -L -o /tmp/gecko.tar.gz "https://github.com/mozilla/geckodriver/releases/download/${GECKO_VER}/geckodriver-${GECKO_VER}-linux64.tar.gz"
mkdir -p ./drivers
tar -xzf /tmp/gecko.tar.gz -C ./drivers/
./drivers/geckodriver --version

A working example

The original version of this tutorial used Selenium 3's find_element_by_id(...) API. That API was deprecated in Selenium 4.0 and removed in 4.3. Here's the equivalent modern script:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time

# Options
opts = Options()
opts.add_argument("--headless")                  # no GUI; remove while debugging
opts.set_preference("browser.download.folderList", 2)
opts.set_preference("browser.download.dir", "/home/amir/downloads")
opts.set_preference("browser.helperApps.neverAsk.saveToDisk",
                    "application/octet-stream,application/pdf,application/zip")

# If you installed geckodriver manually:
service = Service(executable_path="./drivers/geckodriver")
driver = webdriver.Firefox(service=service, options=opts)

# Or let selenium-manager find it:
# driver = webdriver.Firefox(options=opts)

wait = WebDriverWait(driver, 15)

try:
    # Log in
    driver.get("https://example.com/login")
    wait.until(EC.presence_of_element_located((By.ID, "username")))
    driver.find_element(By.ID, "username").send_keys("your-username")
    driver.find_element(By.ID, "password").send_keys("your-password")
    driver.find_element(By.ID, "rememberme").click()
    driver.find_element(By.NAME, "login").click()

    # Wait for login to complete
    wait.until(EC.url_contains("/dashboard"))

    # Paginate through downloads
    driver.get("https://example.com/downloads")
    while True:
        # Grab all download buttons on this page
        downloads = driver.find_elements(By.CSS_SELECTOR, "form.download-single-form")
        for form in downloads:
            form.submit()
            time.sleep(2)  # be polite; don't hammer

        # Click "Next" if it exists; else stop
        try:
            next_btn = driver.find_element(By.CSS_SELECTOR, "a.next[value='next']")
            next_btn.click()
            wait.until(EC.staleness_of(next_btn))
        except NoSuchElementException:
            break
finally:
    driver.quit()

Key things that changed between Selenium 3 and 4

  • find_element_by_id("x") → find_element(By.ID, "x"). Every find_element_by_* method is gone.
  • executable_path is now passed through a Service object rather than directly to the webdriver.Firefox(...) constructor.
  • WebDriver-level waits (WebDriverWait + expected_conditions) are now the idiomatic way to handle timing — avoid time.sleep() for anything that depends on the page state.
  • Selenium Manager (4.6+) can locate or download a driver that matches your installed Firefox automatically. Let it, when you can.

Making the scraper robust

The two things that take scrapers from "works on my box" to "works overnight unattended":

Retries and timeouts everywhere. Wrap every page load in a try/except for TimeoutException, retry once, and only fail on the second attempt. Network flakiness is the #1 reason long scrapers die.
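A minimal sketch of that retry pattern — the helper name, attempt count, and delay are my own choices, not part of the original script; in the scraper you'd wrap each page load with it:

```python
import time

def with_retry(action, attempts=2, delay=5.0, retry_on=(Exception,)):
    """Run action(); on a matching exception, wait and retry.

    In the scraper you'd call it like:
        with_retry(lambda: driver.get(url), retry_on=(TimeoutException,))
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise          # the second failure is a real failure
            time.sleep(delay)  # give the network a moment before retrying
```

The `retry_on` tuple matters: you want to retry TimeoutException and similar transient failures, not logic bugs, which should crash loudly.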

Explicit rate limiting. Don't hit a site as fast as Selenium will let you. One request every 2–5 seconds is generally polite and avoids triggering anti-bot mitigations. If a site has a robots.txt or terms that forbid scraping, respect them — both for ethics and because getting your IP banned halfway through a dataset is deeply annoying.
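Both pieces are easy to sketch in plain Python. The class name and the 2-second default are mine; `robotparser` is in the standard library, and here it parses robots.txt lines you've already fetched rather than fetching them itself:

```python
import time
from urllib import robotparser

class RateLimiter:
    """Ensure at least `interval` seconds between successive requests."""
    def __init__(self, interval=2.0):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval hasn't already elapsed
        remaining = self.interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

def allowed_by_robots(robots_lines, url, agent="*"):
    """Check a URL against already-fetched robots.txt lines."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```

In the scraping loop you'd call `limiter.wait()` immediately before each `driver.get(...)` or form submit, and check `allowed_by_robots(...)` once per path prefix before crawling it.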

Consider Playwright for new projects

In 2026, if you're starting a new browser-automation project and Selenium isn't already a requirement, Playwright is generally better: faster, better auto-waiting, a nicer API, built-in tracing, and support for Chromium, Firefox, and WebKit from a single package. Selenium is still the right choice when you need W3C WebDriver compatibility (e.g. running against BrowserStack) or when you're extending an existing Selenium codebase.