Getting Started

This tutorial will guide you through setting up Onion Peeler and executing your first crawl.

By the end of this tutorial, you will have a working environment and your first set of extracted data from a dark web index.

Prerequisites

Before we begin, ensure you have the following installed on your machine:

  • Python 3.12+: The core language used by the framework.
  • uv: A fast Python package installer and resolver. Install uv here.
  • Tor Browser / Tor Service: Required for accessing .onion domains.
  • Make: (Optional but recommended) For running automation commands.
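As a quick sanity check, the snippet below looks for each tool on your PATH. This is a sketch, not part of the project: the `tor` binary may be absent if you use the Tor Browser bundle rather than a system Tor service.

```shell
# Check that each prerequisite is reachable on PATH. "tor" may be missing if
# you rely on the Tor Browser bundle instead of a standalone Tor service.
missing=""
for tool in python3 uv tor make; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then
    echo "all prerequisites found"
else
    echo "missing:$missing"
fi
```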

Step 1: Clone the Repository

First, download the source code to your local machine:

git clone https://github.com/DominickFoti/Dark-Web-Scraping.git
cd Dark-Web-Scraping

Step 2: Install Dependencies

We use uv to manage a virtual environment and project dependencies. Run the following command to sync the project:

make build

Under the Hood

This command executes uv sync, which creates a .venv directory and installs all requirements listed in pyproject.toml with precise versions from uv.lock.
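Roughly, the Makefile target wraps the command below. This sketch is guarded so it only attempts the sync when run from the repo root with `uv` installed:

```shell
# Equivalent of `make build`: sync the locked dependency set. Guarded so it
# is a no-op outside the repo or when uv is not installed.
uv_present=$(command -v uv >/dev/null 2>&1 && echo yes || echo no)
if [ "$uv_present" = "yes" ] && [ -f pyproject.toml ]; then
    uv sync    # creates .venv/ and installs exact versions from uv.lock
else
    echo "run this from the repo root with uv on PATH (uv present: $uv_present)"
fi
```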

Step 3: Configure Your Environment

Onion Peeler uses environment variables for sensitive configuration and feature toggling. Create a .env file from the provided example:

cp .env.example .env

Open .env in your preferred editor. Ensure the following values are set (you can leave the VPN keys blank for now if you are just testing local Tor access):

TOR_CONTROL_PASSWORD=secret_tor_password
TOR_ROTATION_INTERVAL=600
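For reference, a simple `KEY=value` file like this can be loaded into an interactive shell as shown below. This is only a demonstration using a temporary copy of the example values; the framework reads `.env` on its own.

```shell
# Demonstrate the .env format: write the example values to a temp file and
# source it with auto-export enabled, so each line becomes an env var.
tmp_env=$(mktemp)
cat > "$tmp_env" <<'EOF'
TOR_CONTROL_PASSWORD=secret_tor_password
TOR_ROTATION_INTERVAL=600
EOF
set -a            # export every assignment made while sourcing
. "$tmp_env"
set +a
rm -f "$tmp_env"
echo "rotation interval: ${TOR_ROTATION_INTERVAL}s"   # prints "rotation interval: 600s"
```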

Follow the instructions here to set up Mullvad VPN access.

Step 4: Verify Tor Connectivity

Before crawling, let's verify that the scraper can communicate through the Tor network. Onion Peeler is configured to use Tor as a proxy for .onion requests.

Run the following test command:

make test-tor

If successful, you should see a message confirming you are using Tor (usually "Congratulations. This browser is configured to use Tor.").
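If you prefer a manual check, check.torproject.org also exposes a JSON API. The commented `curl` line below assumes Tor is listening on its default SOCKS port (9050); the rest of the snippet shows how to read the response, using a canned sample so it runs offline:

```shell
# Live check (uncomment; assumes Tor on the default SOCKS port 9050):
#   curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
# The API returns JSON like the sample below; "IsTor" is the field to read.
sample='{"IsTor":true,"IP":"198.51.100.7"}'
tor_ok=$(printf '%s' "$sample" | grep -q '"IsTor":true' && echo yes || echo no)
echo "routed through Tor: $tor_ok"
```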

Step 5: Execute Your First Crawl

Now we are ready to scrape! We will use the daunt spider, which is dynamically generated from config/sites/daunt.toml.

Run the crawl and save the results to a JSON file:

make crawl-daunt

Alternatively, invoke Scrapy through uv directly:

uv run scrapy crawl daunt -o daunt_output.json

Or activate the virtual environment first and call Scrapy yourself:

source .venv/bin/activate
scrapy crawl daunt -o daunt_output.json

What just happened?

The framework read the TOML configuration, identified the extraction selectors for the "daunt" site, spawned a Scrapy spider, routed requests through Tor, and extracted the data into daunt_output.json.
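To see where those selectors come from, open config/sites/daunt.toml. The exact schema is defined by the repo, but a site config of this kind typically looks something like the hypothetical sketch below. Every field name here is an assumption for illustration; check an existing config file for the real keys.

```toml
# Hypothetical shape only; field names are illustrative, not the real schema.
name = "daunt"
start_urls = ["http://example.onion"]

[selectors]
title = "a.site-link::text"
url = "a.site-link::attr(href)"
```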

Step 6: Inspect the Results

Open daunt_output.json. You should see a list of extracted items similar to this:

[
  {
    "title": "Example Service Name",
    "url": "http://exampleonionlink.onion"
  },
  ...
]
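A quick way to count the extracted items from the shell is sketched below. It runs against a one-item sample file so the snippet is self-contained; point it at daunt_output.json after a real crawl.

```shell
# Count occurrences of "title" as a rough item count. Uses a throwaway sample
# file; substitute daunt_output.json for $sample_file after a real crawl.
sample_file=$(mktemp)
printf '%s' '[{"title":"Example Service Name","url":"http://exampleonionlink.onion"}]' > "$sample_file"
count=$(grep -o '"title"' "$sample_file" | wc -l | tr -d '[:space:]')
echo "items extracted: $count"   # prints "items extracted: 1"
rm -f "$sample_file"
```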

Next Steps

Congratulations! You've successfully completed your first crawl.