Getting Started

This tutorial will guide you through setting up Onion Peeler and executing your first crawl.

By the end of this tutorial, you will have a working environment and your first set of extracted data from a dark web index.

Prerequisites

Before we begin, ensure you have the following installed on your machine:

  • Python 3.12+: The core language used by the framework.
  • uv: A fast Python package installer and resolver. Install uv here.
  • Tor Browser / Tor Service: Required for accessing .onion domains.
  • Make: (Optional but recommended) For running automation commands.
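As a quick sanity check, the snippet below looks for each tool on your PATH. This is a sketch, not part of the project: the `tor` binary may be absent if you use the Tor Browser bundle rather than a system Tor service.

```shell
# Check that each prerequisite is reachable on PATH. "tor" may be missing if
# you rely on the Tor Browser bundle instead of a standalone Tor service.
missing=""
for tool in python3 uv tor make; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then
    echo "all prerequisites found"
else
    echo "missing:$missing"
fi
```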

Step 1: Clone the Repository

First, download the source code to your local machine:

git clone https://github.com/DominickFoti/Dark-Web-Scraping.git
cd Dark-Web-Scraping

Step 2: Install Dependencies

We use uv to manage a virtual environment and project dependencies. Run the following command to sync the project:

make build

Under the Hood

This command executes uv sync, which creates a .venv directory and installs all requirements listed in pyproject.toml with precise versions from uv.lock.
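Roughly, the Makefile target wraps the command below. This sketch is guarded so it only attempts the sync when run from the repo root with `uv` installed:

```shell
# Equivalent of `make build`: sync the locked dependency set. Guarded so it
# is a no-op outside the repo or when uv is not installed.
uv_present=$(command -v uv >/dev/null 2>&1 && echo yes || echo no)
if [ "$uv_present" = "yes" ] && [ -f pyproject.toml ]; then
    uv sync    # creates .venv/ and installs exact versions from uv.lock
else
    echo "run this from the repo root with uv on PATH (uv present: $uv_present)"
fi
```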

Step 3: Configure Your Environment

Onion Peeler uses environment variables for sensitive configuration and feature toggling. Create a .env file from the provided example:

cp .env.example .env

Open .env in your preferred editor. Ensure the following values are set (you can leave the VPN keys blank for now if you are just testing local Tor access):

TOR_CONTROL_PASSWORD=secret_tor_password
TOR_ROTATION_INTERVAL=600
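For reference, a simple `KEY=value` file like this can be loaded into an interactive shell as shown below. This is only a demonstration using a temporary copy of the example values; the framework reads `.env` on its own.

```shell
# Demonstrate the .env format: write the example values to a temp file and
# source it with auto-export enabled, so each line becomes an env var.
tmp_env=$(mktemp)
cat > "$tmp_env" <<'EOF'
TOR_CONTROL_PASSWORD=secret_tor_password
TOR_ROTATION_INTERVAL=600
EOF
set -a            # export every assignment made while sourcing
. "$tmp_env"
set +a
rm -f "$tmp_env"
echo "rotation interval: ${TOR_ROTATION_INTERVAL}s"   # prints "rotation interval: 600s"
```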

Follow the instructions here to set up Mullvad VPN access.

Step 4: Verify Tor Connectivity

Before crawling, let's verify that the scraper can communicate through the Tor network. Onion Peeler is configured to use Tor as a proxy for .onion requests.

Run the following test command:

make test-tor

If successful, you should see a message confirming you are using Tor (usually "Congratulations. This browser is configured to use Tor.").
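If you prefer a manual check, check.torproject.org also exposes a JSON API. The commented `curl` line below assumes Tor is listening on its default SOCKS port (9050); the rest of the snippet shows how to read the response, using a canned sample so it runs offline:

```shell
# Live check (uncomment; assumes Tor on the default SOCKS port 9050):
#   curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
# The API returns JSON like the sample below; "IsTor" is the field to read.
sample='{"IsTor":true,"IP":"198.51.100.7"}'
tor_ok=$(printf '%s' "$sample" | grep -q '"IsTor":true' && echo yes || echo no)
echo "routed through Tor: $tor_ok"
```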

Step 5: Execute Your First Crawl

Now we are ready to scrape! We will use the daunt spider, which is dynamically generated from config/sites/daunt.toml.

Run the crawl and save the results to a JSON file:

make crawl-daunt

Alternatively, invoke Scrapy through uv directly:

uv run scrapy crawl daunt -o daunt_output.json

Or activate the virtual environment first and call Scrapy yourself:

source .venv/bin/activate
scrapy crawl daunt -o daunt_output.json

What just happened?

The framework read the TOML configuration, identified the extraction selectors for the "daunt" site, spawned a Scrapy spider, routed requests through Tor, and extracted the data into daunt_output.json.
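To see where those selectors come from, open config/sites/daunt.toml. The exact schema is defined by the repo, but a site config of this kind typically looks something like the hypothetical sketch below. Every field name here is an assumption for illustration; check an existing config file for the real keys.

```toml
# Hypothetical shape only; field names are illustrative, not the real schema.
name = "daunt"
start_urls = ["http://example.onion"]

[selectors]
title = "a.site-link::text"
url = "a.site-link::attr(href)"
```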

Step 6: Inspect the Results

Open daunt_output.json. You should see a list of extracted items similar to this:

[
  {
    "title": "Example Service Name",
    "url": "http://exampleonionlink.onion"
  },
  ...
]
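A quick way to count the extracted items from the shell is sketched below. It runs against a one-item sample file so the snippet is self-contained; point it at daunt_output.json after a real crawl.

```shell
# Count occurrences of "title" as a rough item count. Uses a throwaway sample
# file; substitute daunt_output.json for $sample_file after a real crawl.
sample_file=$(mktemp)
printf '%s' '[{"title":"Example Service Name","url":"http://exampleonionlink.onion"}]' > "$sample_file"
count=$(grep -o '"title"' "$sample_file" | wc -l | tr -d '[:space:]')
echo "items extracted: $count"   # prints "items extracted: 1"
rm -f "$sample_file"
```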

Next Steps

Congratulations! You've successfully completed your first crawl.