# System Architecture
The Onion Peeler is built on top of Scrapy and web-poet and is designed to be highly maintainable, performant, and automated. It represents a significant departure from traditional headless-browser scrapers, focusing instead on a modular, configuration-driven approach that minimizes the custom code needed to add new target sites.
## Architectural Philosophy
The core philosophy is to decouple Extraction Logic from Crawl Mechanics. By using configuration files (TOML) to define how a site should be scraped, we can scale support for new dark web forums without modifying the application's core Python code.
### Core Design Goals
- Maintainability: Centralized configuration and comprehensive documentation reduce debugging time.
- Performance: Leveraging Scrapy's asynchronous engine significantly reduces CPU and memory footprint compared to headless browsers.
- Automation: Robust handling of connection timeouts, retries, and proxy rotation (Tor/VPN) minimizes manual intervention.
## High-Level Component Map
The application is divided into several logical layers, each with a specific responsibility.
```mermaid
graph TD
    subgraph User_Space
        ScrapyCLI[Standard CLI\nscrapy crawl ...]
        Configs[config/sites/ directory]
    end

    subgraph Config_Layer [src/onion_peeler/settings]
        Loader[loader.py]
        Models[models.py]
        Configs -->|Read TOMLs| Loader
        Loader -->|Validates| Models
        Models -->|Returns| SiteConfigObj[SiteConfig Object]
    end

    subgraph Scrapy_Engine [Scrapy Core]
        ScrapyCLI -->|Start Process| Crawler
        Crawler[Crawler Process] -->|Spawns| Spider[ConfigDrivenSpider]
        SiteConfigObj -->|Injected| Spider

        subgraph Middleware_Layer [src/onion_peeler/middlewares]
            Spider -->|Requests| Routing["ProxyMiddleware (proxy.py)"]
            Routing -->|Checks URL| Decision{Is .onion?}
            Decision -- Yes --> TorProxy[Tor SOCKS Proxy]
            Decision -- No --> VpnProxy[VPN Gateway]
        end
    end

    subgraph Logic_Layer [src/onion_peeler/pages]
        Response -->|Injected via web-poet| Factory[factory.py]
        Factory -->|Instantiates| PageObj[ConfigurableWebPage]
        PageObj -->|Extracts| Items[Item Objects]
    end

    subgraph Data_Layer [src/onion_peeler/pipelines]
        Items -->|Validate| Val[validation.py]
        Val -->|Enrich| Enrich[enrichment.py]
        Enrich -->|Save| Storage[storage.py]
    end

    TorProxy -->|Tor Network| DarkWeb((Dark Web))
    VpnProxy -->|VPN Tunnel| ClearWeb((Clear Web))
    DarkWeb -->|HTML| Response
    ClearWeb -->|HTML| Response
```
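The routing decision in the middleware layer is simple: `.onion` hosts go through the Tor SOCKS proxy, everything else through the VPN gateway. A minimal sketch, assuming illustrative proxy endpoints (the real values come from the project's settings, and the real class lives in `src/onion_peeler/middlewares/proxy.py`):

```python
from urllib.parse import urlparse

# Illustrative endpoints; the actual addresses are configured per deployment.
TOR_PROXY = "socks5://127.0.0.1:9050"
VPN_PROXY = "http://10.8.0.1:8118"

def choose_proxy(url: str) -> str:
    """Route .onion hosts through Tor; everything else through the VPN."""
    host = urlparse(url).hostname or ""
    return TOR_PROXY if host.endswith(".onion") else VPN_PROXY

class ProxyMiddleware:
    """Sketch of a Scrapy downloader middleware applying the routing rule."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = choose_proxy(request.url)
        return None  # continue down the middleware chain
```

Keeping the decision in one `choose_proxy` function makes it trivial to unit-test the routing rule without spinning up a crawl.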
## The "Peeling" Process (Sequence)
This diagram details the lifecycle of a single URL request, highlighting the interaction between the Spider, Middleware, and Page Objects.
```mermaid
sequenceDiagram
    participant S as ConfigDrivenSpider
    participant M as Proxy Middleware
    participant N as Network (Tor/VPN)
    participant F as Page Factory
    participant P as Page Object (Configurable)
    participant Pipe as Pipeline

    Note over S: 1. Start from SiteConfig
    S->>M: Yield Request (URL)
    M->>N: Forward Request (proxied)
    N-->>M: Return HTML Response
    M-->>S: Return Response

    Note over S, F: 2. web-poet injection
    S->>F: Request Page Object for Response
    F->>P: Instantiate Page (w/ Config)

    rect rgb(240, 239, 237)
        Note over P: 3. Extraction Phase
        P->>P: Apply TOML Selectors
    end

    P-->>S: Return Item (e.g., LinkItem)
    S->>Pipe: Yield Item
    Pipe->>Pipe: validation.py
    Pipe->>Pipe: storage.py (JSON/CSV)

    Note over P: 4. Pagination Phase
    P-->>S: Return Next Page Links
    S->>S: Schedule Next Requests
```
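The extract-then-paginate loop (steps 3–4 above) can be sketched in plain Python. The method names `to_item()`/`next_links()` and the deduplication via a `seen` set are illustrative of the flow, not the project's exact API:

```python
def peel(page, seen):
    """One crawl step: extract items, then schedule unvisited pagination
    links. `page` is any object exposing to_item() and next_links()."""
    items = list(page.to_item())                                  # 3. extraction
    new_requests = [u for u in page.next_links() if u not in seen]  # 4. pagination
    seen.update(new_requests)
    return items, new_requests

class FakePage:
    """Stand-in page object, for illustration only."""
    def to_item(self):
        return [{"title": "Thread A", "url": "/t/1"}]
    def next_links(self):
        return ["/index?page=2", "/index?page=1"]

seen = {"/index?page=1"}          # already visited
items, nxt = peel(FakePage(), seen)
```

Items are emitted before follow-up requests are scheduled, so data is never lost if a crawl is interrupted mid-pagination.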
## Page Object Class Hierarchy
We use a hierarchical approach to extraction. The ConfigurableWebPage handles standard TOML-based extraction, while site-specific subclasses can be created for complex logic.
```mermaid
classDiagram
    class ConfigurableWebPage {
        +SiteConfig config
        +to_item() List[Item]
    }
    class SiteSpecificPage {
        +custom_parsing_logic()
    }
    class LinkItem {
        +title: str
        +url: str
    }
    ConfigurableWebPage <|-- SiteSpecificPage : inherits
    ConfigurableWebPage ..> LinkItem : produces
```
- `ConfigurableWebPage`: The default handler. It reads the `selectors` block from the TOML and uses it to map HTML elements to Pydantic models.
- `SiteSpecificPage`: (Optional) A Python class created in `src/onion_peeler/pages/sites/` to handle sites that require custom logic (e.g., decryption or complex date parsing).
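A dependency-free sketch of this hierarchy, under the assumption that a plain dict can stand in for pre-resolved selector results (the real classes build on web-poet page objects and run CSS/XPath selectors against HTML):

```python
class ConfigurableWebPage:
    """Generic extractor: maps configured selectors onto item fields."""

    def __init__(self, config, resolved):
        self.config = config      # e.g. {"selectors": {"title": "h1::text"}}
        self.resolved = resolved  # selector -> extracted value (illustrative)

    def to_item(self):
        return [{field: self.resolved.get(sel)
                 for field, sel in self.config["selectors"].items()}]

class SiteSpecificPage(ConfigurableWebPage):
    """Optional subclass for sites needing custom logic beyond TOML selectors."""

    def to_item(self):
        items = super().to_item()
        for item in items:
            # Example of site-specific post-processing: normalize whitespace.
            item["title"] = (item["title"] or "").strip()
        return items

page = SiteSpecificPage(
    config={"selectors": {"title": "h1::text", "url": "a::attr(href)"}},
    resolved={"h1::text": "  Thread A  ", "a::attr(href)": "/t/1"},
)
```

Because the subclass only overrides `to_item()`, the generic TOML-driven mapping stays in one place and site quirks remain isolated to their own modules.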
## Data Pipelines
Once an item is extracted, it passes through a series of pipelines:
- Validation: Ensures the extracted data meets the required schema (via Pydantic).
- Enrichment: Adds metadata such as `scraped_at` timestamps or source site identifiers.
- Storage: Persists the data to the configured output format (JSON, CSV, or Database).
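The three stages chain together as item → validate → enrich → store. A minimal sketch of that flow, with a plain list standing in for the JSON/CSV/database sink and simple field checks standing in for the project's Pydantic validation:

```python
from datetime import datetime, timezone

def validate(item: dict) -> dict:
    """validation.py: reject items missing required fields
    (the project does this with Pydantic models)."""
    if not item.get("title") or not item.get("url"):
        raise ValueError(f"invalid item: {item!r}")
    return item

def enrich(item: dict, site_name: str) -> dict:
    """enrichment.py: attach scraped_at and the source site identifier."""
    item["scraped_at"] = datetime.now(timezone.utc).isoformat()
    item["source_site"] = site_name
    return item

def store(item: dict, sink: list) -> dict:
    """storage.py: append to the configured sink (list as a stand-in)."""
    sink.append(item)
    return item

sink: list = []
store(enrich(validate({"title": "Thread A", "url": "/t/1"}), "example_forum"), sink)
```

Ordering matters: validation runs first so that malformed items are dropped before any enrichment or storage cost is paid.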