Docker Infrastructure Guide

Onion Peeler leverages Docker to provide a secure, isolated environment for scraping. This guide explains how the multi-container architecture works.

Prerequisites

Container Architecture

The deploy/docker-compose.yml defines three specialized services. This system uses the Sidecar Pattern, where containers share the same network stack.

| Service | Container Name | Role |
|---------|----------------|------|
| dw_vpn | dw_vpn | The "gateway". Runs Gluetun with Mullvad VPN. |
| tor | tor | Provides a Tor SOCKS/HTTP proxy. Routes all its traffic through dw_vpn. |
| scraper | scraper | The application container. Inherits the network of dw_vpn. |

1. dw_vpn (Gluetun)

The central security hub. It establishes a WireGuard connection to Mullvad. The cap_add: [NET_ADMIN] capability lets it configure the interfaces, routes, and firewall rules of its own network namespace, which the other two containers share.

2. tor

This container runs the Tor client that provides the SOCKS/HTTP proxy (it does not act as a relay for other users). Because it uses network_mode: "service:dw_vpn", its own connection to the Tor entry nodes is masked by the VPN.

3. scraper

This container houses the Python code. It also runs in the VPN's network namespace, so any "clearweb" request (for example, to a check-IP service) reports the VPN's IP, not yours.
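One way to confirm this is to compare the address a check-IP service reports from inside the scraper container with your host's known public address. The sketch below is illustrative and not part of Onion Peeler: the function names are hypothetical, and the api.ipify.org endpoint is an assumption (any service that echoes the caller's IP as plain text would do).

```python
import ipaddress
from urllib.request import urlopen

# Assumption: any service that returns the caller's IP as plain text works here.
CHECK_IP_URL = "https://api.ipify.org"


def fetch_public_ip(url: str = CHECK_IP_URL, timeout: float = 10.0) -> str:
    """Return the public IP address that the remote service sees for this caller."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def is_masked(reported_ip: str, host_ip: str) -> bool:
    """Return True when the reported exit address differs from the host's real one."""
    # Parsing normalises the text form and raises ValueError on malformed input.
    return ipaddress.ip_address(reported_ip) != ipaddress.ip_address(host_ip)
```

Run inside the scraper container, is_masked(fetch_public_ip(), "<your host IP>") should return True while the VPN tunnel is up.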


How it works: Proxy Middleware

Even though the scraper container is inside the VPN network, it still needs to know when to use the tor container's proxy.

  1. Clearweb Requests: Scrapy sends these directly. They exit through the dw_vpn gateway automatically.
  2. Onion Requests (.onion): The ProxyMiddleware inside Onion Peeler detects the .onion suffix and routes the request to localhost:9050 (the Tor SOCKS proxy).

Because the containers share a network namespace, localhost inside the scraper is the same loopback interface that the tor container listens on, so the Tor proxy ports are reachable there directly.
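The routing decision described above can be sketched as follows. This is a minimal illustration, not the project's actual ProxyMiddleware: the names proxy_for and OnionProxySketch are hypothetical, and how the proxy address is handed to the downloader (URL scheme, meta key) is an assumption, since stock Scrapy only handles HTTP proxies through request.meta["proxy"].

```python
from urllib.parse import urlsplit

# Address taken from this guide: Tor's SOCKS listener in the shared namespace.
TOR_PROXY = "localhost:9050"


def proxy_for(url: str):
    """Return the Tor proxy address for .onion hosts, or None for clearweb URLs."""
    # urlsplit lowercases the hostname; it is None for URLs without a host.
    host = urlsplit(url).hostname or ""
    return TOR_PROXY if host.endswith(".onion") else None


class OnionProxySketch:
    """Hypothetical downloader middleware showing the shape of the check."""

    def process_request(self, request, spider):
        proxy = proxy_for(request.url)
        if proxy is not None:
            # Assumption: the real middleware may use a different key or URL scheme.
            request.meta["proxy"] = proxy
        # Returning None tells Scrapy to continue processing the request.
        return None
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives such as a clearweb URL that merely contains ".onion" in its path or query.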


Running CLI Commands with Docker

The most effective way to use the Scrapy CLI in the containerized environment is through docker compose run --rm scraper. This ensures the entire network stack (VPN + Tor) is active before the command executes.

# Using the Makefile targets

# List all spiders
make list

# Run a specific crawl
make crawl site=daunt

# Using docker compose directly

# List all spiders
docker compose run --rm scraper list

# Run a specific crawl
docker compose run --rm scraper crawl daunt

Networking Visualization

graph LR
    subgraph Docker_Network
        VPN[dw_vpn / Mullvad]
        Tor[tor container]
        Scraper[scraper container]
    end

    Scraper -- Clearweb --> VPN
    Scraper -- .onion --> Tor
    Tor --> VPN
    VPN -- Internet --> WWW((World Wide Web))
    VPN -- Tor Entry Nodes --> DarkNet((Tor Network))

To monitor the health of this chain:

make status
make logs
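As a complement to the make targets, a quick probe from inside the scraper container can show whether the Tor proxy is accepting connections. This is a hedged sketch: port_is_open is a hypothetical helper, and the localhost:9050 address is the one this guide gives for the Tor SOCKS proxy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Address from this guide: the Tor SOCKS proxy in the shared namespace.
    print("tor proxy reachable:", port_is_open("localhost", 9050))
```

An open port only shows that Tor is listening. It does not prove that a circuit has been built or that the VPN tunnel is up, so the container logs remain the authoritative source.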