Intro

While the original scraper was a solid MVP, the growing scope of this project requires a more robust and maintainable solution. The legacy scraper was built on top of Selenium scripts and suffered from high resource usage, excessive manual intervention, stability problems, poor documentation, and difficult feature additions.

Refactor Goals

  • Improve code maintainability
    • Comprehensive documentation
    • Decrease debugging time
    • Consistent build & run patterns following Scrapy and Python best practices
  • Improve performance & concurrency
    • Optimize request delays to avoid detection
    • Reduce memory & CPU footprint
    • Support multiple forums at once
  • Increase automation
    • Reduce the human interaction needed wherever possible
    • Handle connection timeouts and failures with retry logic (see the settings sketch below)
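
A minimal sketch of Scrapy settings that would serve these goals; the values are illustrative assumptions, not tuned numbers:

# Illustrative Scrapy settings (values are assumptions, not tuned numbers).
RETRY_ENABLED = True
RETRY_TIMES = 3                        # re-issue each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

DOWNLOAD_DELAY = 2.0                   # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter delays to look less robotic

AUTOTHROTTLE_ENABLED = True            # adapt delays to observed server latency
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

CONCURRENT_REQUESTS = 8                # crawl several forums in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # stay gentle with any single forum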

Proposed architecture

The restructure moves from a procedural script to a data-driven, modular architecture, leveraging Scrapy and web-poet to create a production-grade web scraper that supports multiple forums through configuration files.

Tip

This modular approach allows us to scale support for new dark web forums simply by adding new configuration files, without rewriting the core scraping logic.
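
For illustration, a new forum could be described entirely in TOML and loaded through the proposed settings/loader.py; the field names below are hypothetical placeholders, not a finalized schema:

# Hypothetical sketch of the config-driven flow (field names are assumptions).
import tomllib
from dataclasses import dataclass, field

EXAMPLE_SITE_TOML = """
# config/sites/example_forum.toml
name = "example_forum"
start_url = "http://example.onion"
network = "tor"                        # routed through the Tor proxy middleware

[selectors]
thread_links = "a.thread-title::attr(href)"
next_page = "a.pagination-next::attr(href)"
"""

@dataclass
class SiteConfig:
    """Validated view of one site's TOML file (see settings/models.py)."""
    name: str
    start_url: str
    network: str = "tor"
    selectors: dict = field(default_factory=dict)

def load_site_config(raw: str) -> SiteConfig:
    return SiteConfig(**tomllib.loads(raw))

config = load_site_config(EXAMPLE_SITE_TOML)  # -> SiteConfig object for the spider

With a shape like this, the spider never hard-codes selectors; supporting a new forum means adding one file under config/sites/.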

Project Structure

onion-peeler/
|-- README.md
|-- pyproject.toml                 # Modern Python packaging
|-- scrapy.cfg                     # For scrapy cli support
|-- .env                           # Local secrets and config values
|-- docs/                          # Documentation
|-- config/                        # Configuration files (not in package)
|   |-- base.toml                  # Global scraper settings
|   |-- accounts/                  # Authentication credentials (gitignored)
|   |   |-- dread_accounts.json
|   |   `-- daunt_accounts.json
|   |-- sites/                     # Site-specific configs
|   |   `-- template.toml          # Template for new sites
|   `-- tor/                       # Tor network configuration
|       `-- torrc                  # Tor config for docker container
|-- logs/                          # Application logs (gitignored)
|   |-- scraper.log
|   |-- errors.log
|   `-- auth.log
|-- src/
|   `-- onion_peeler/              # Main package
|       |-- __init__.py
|       |-- __main__.py            # Entry point
|       |-- cli.py                 # CLI commands
|       |-- spiders/               # Scrapy spiders
|       |   |-- base.py            # ConfigDrivenSpider
|       |   `-- sites/             # Site-specific spiders (if needed)
|       |-- pages/                 # web-poet page objects
|       |   |-- base.py            # Base page object classes
|       |   |-- extraction.py      # link extraction behaviors
|       |   |-- auth.py            # Authentication page objects
|       |   |-- factory.py         # Page object factory
|       |   |-- registry.py        # Page object registry
|       |   `-- sites/             # Site-specific overrides
|       |-- items/                 # Scrapy items (data models)
|       |   |-- base.py            # BaseItem with common fields
|       |   |-- thread.py          # ThreadItem
|       |   `-- post.py            # PostItem
|       |-- middlewares/           # Request/Response middlewares
|       |   |-- tor.py             # Tor/VPN routing
|       |   |-- sessions.py        # Session management
|       |   `-- antibot.py         # Header spoofing
|       |-- pipelines/             # Data pipelines
|       |   |-- validation.py      # Data validation
|       |   |-- cleanser.py        # Data cleansing (see architecture diagram)
|       |   |-- enrichment.py      # Add metadata
|       |   `-- storage.py         # Save to files/DB
|       |-- settings/              # Configuration management
|       |   |-- base.py            # Scrapy settings
|       |   |-- loader.py          # TOML loader
|       |   `-- models.py          # Config data models
|       `-- utils/                 # Utility functions
|-- tests/                         # Test suite
|-- deploy/                        # Deployment files
|-- deprecated/                    # Legacy files from the old main and development branches
`-- .github/                       # CI/CD
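
To show how the two entry points fit together, here is a minimal sketch of __main__.py, assuming a spider named config_driven and a load_site_config helper like the one sketched above; it illustrates the intended wiring rather than the final CLI:

# Hypothetical sketch of src/onion_peeler/__main__.py (names are assumptions).
import sys
from pathlib import Path

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from onion_peeler.settings.loader import load_site_config  # hypothetical helper

def main() -> None:
    site = sys.argv[1]                     # e.g. `python -m onion_peeler dread`
    config = load_site_config(Path(f"config/sites/{site}.toml").read_text())
    process = CrawlerProcess(get_project_settings())  # resolved via scrapy.cfg
    process.crawl("config_driven", config=config)     # ConfigDrivenSpider by name
    process.start()                        # blocks until the crawl finishes

if __name__ == "__main__":
    main()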

Architecture

graph TD
    subgraph User_Space
        CustomCLI[Custom CLI\npython -m onion_peeler]
        ScrapyCLI[Standard CLI\nscrapy crawl ...]
        Configs[config/ directory]
    end

    subgraph Entry_Points
        CustomCLI -->|1. Parse Args| Main[__main__.py]
        ScrapyCLI -.->|1. Read Project| ScrapyCfg[scrapy.cfg]
        ScrapyCfg -.->|2. Load Settings| ScrapySettings[src/onion_peeler/settings/base.py]
    end

    subgraph Config_Layer [src/onion_peeler/settings]
        Main -->|Explicit Load| Loader[loader.py]
        ScrapySettings -.->|Implicit/Lazy Load| Loader
        Configs -.->|Read TOMLs| Loader
        Loader -->|Validates| Models[models.py]
        Models -->|Returns| SiteConfigObj[SiteConfig Object]
    end

    subgraph Scrapy_Engine [Scrapy Core]
        Main -->|3a. Start Process| Crawler
        ScrapyCLI -->|3b. Start Process| Crawler

        Crawler[Crawler Process] -->|Spawns| Spider[ConfigDrivenSpider]
        SiteConfigObj -->|Injected| Spider
        Spider -->|Requests| Middlewares

        subgraph Middleware_Layer [src/onion_peeler/middlewares]
            Middlewares -->|Intercept| Routing["ProxyMiddleware (tor.py)"]
            Routing -->|Checks URL| Decision{Is .onion?}

            Decision -- Yes --> TorProxy[Tor Access Proxy]
            Decision -- No --> VpnProxy[VPN Access Proxy]

            TorProxy -->|Mask| AntiBot[antibot.py]
            VpnProxy -->|Mask| AntiBot
        end
    end

    subgraph Logic_Layer [src/onion_peeler/pages]
        Response -->|Injected via web-poet| Factory[factory.py]
        Factory -->|Instantiates| PageObj[ConfigurableWebPage]
        PageObj -->|Uses| Behaviors[extraction.py]
        PageObj -->|Extracts| Items[Item Objects]
    end

    subgraph Data_Layer [src/onion_peeler/pipelines]
        Items -->|Validate| Val[validation.py]
        Val -->|Cleanse| Clean[cleanser.py]
        Clean -->|Enrich| Enrich[enrichment.py]
        Enrich -->|Save| Storage[storage.py]
    end

    TorProxy -->|Tor Network| DarkWeb((Dark Web))
    VpnProxy -->|VPN Tunnel| ClearWeb((Clear Web))

    DarkWeb -->|HTML| Response
    ClearWeb -->|HTML| Response

    Spider -->|Yields Response| Factory
    PageObj -->|Yields Items| Spider
    Spider -->|Yields Items| Items
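
The routing decision at the heart of the middleware layer is compact in code; below is a minimal sketch of the tor.py proxy middleware, with the proxy addresses as placeholder assumptions:

# Hypothetical sketch of middlewares/tor.py (proxy addresses are assumptions).
from urllib.parse import urlparse

TOR_PROXY = "http://127.0.0.1:8118"    # HTTP proxy in front of the Tor SOCKS port
VPN_PROXY = "http://127.0.0.1:3128"    # clear-web traffic through the VPN proxy

class ProxyRoutingMiddleware:
    """Route .onion requests through Tor and everything else through the VPN."""

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ""
        # Scrapy's built-in HttpProxyMiddleware honors the `proxy` meta key.
        request.meta["proxy"] = TOR_PROXY if host.endswith(".onion") else VPN_PROXY
        return None                    # continue normal downloader processing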

Request/Response Sequence (The "Peeling" Process)

This sequence diagram details the lifecycle of a single URL request, highlighting the interaction between the Spider, Middleware, and Page Objects.

sequenceDiagram
    participant S as ConfigDrivenSpider
    participant M as Middleware (Tor/Antibot)
    participant N as Tor Network
    participant F as Page Factory
    participant P as Page Object (Configurable)
    participant Pipe as Pipeline

    Note over S: 1. Start from SiteConfig
    S->>M: Yield Request (URL)
    M->>N: Forward Request (proxied)
    N-->>M: Return HTML Response
    M-->>S: Return Response

    Note over S, F: 2. web-poet injection
    S->>F: Request Page Object for Response
    F->>F: Match URL to Registry
    F->>P: Instantiate Page (w/ Config)

    rect rgb(255,238,136)
        Note over P: 3. Extraction Phase
        P->>P: Run extraction.py behaviors
        P->>P: Validate extracted fields
    end

    P-->>S: Return Item (e.g., ThreadItem)
    S->>Pipe: Yield Item

    Pipe->>Pipe: validation.py
    Pipe->>Pipe: storage.py (DB/JSON)

    rect rgb(255,238,136)
        Note over P: 4. Pagination Phase
        P-->>S: Return Next Page Links
        S->>S: Schedule Next Requests
    end
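
In spider code, steps 1-4 collapse into a short callback. The sketch below assumes scrapy-poet's dependency injection plus hypothetical to_item() and next_page_links() methods on the page object:

# Hypothetical sketch of spiders/base.py; the page-object method names and the
# injected `config` argument are assumptions taken from the diagrams above.
import scrapy

from onion_peeler.pages.base import ConfigurableWebPage

class ConfigDrivenSpider(scrapy.Spider):
    name = "config_driven"

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config           # a SiteConfig from settings/loader.py

    def start_requests(self):
        # 1. Start from SiteConfig.
        yield scrapy.Request(self.config.start_url, callback=self.parse_page)

    def parse_page(self, response, page: ConfigurableWebPage):
        # 2-3. scrapy-poet matches the URL in the registry and injects the
        # page object, which runs the configured extraction behaviors.
        yield page.to_item()           # e.g. a ThreadItem, sent to the pipelines
        # 4. Pagination phase: schedule the next requests.
        for url in page.next_page_links():
            yield response.follow(url, callback=self.parse_page)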

Page Object Class Hierarchy

This diagram shows how the modules in the pages/ directory map to the Python classes used by the spiders.

classDiagram
    class ConfigurableWebPage {
        +SiteConfig config
        +to_item()
    }

    class ExtractionBehavior {
        <<Interface>>
        +extract_links()
        +extract_pagination()
    }

    class AuthPage {
        +login()
        +handle_captcha()
    }

    class DreadPage {
        +custom_date_parsing()
    }

    class ThreadItem {
        +title
        +author
        +content
    }

    %% Relationships
    ConfigurableWebPage ..|> ExtractionBehavior : implements (extraction.py)
    ConfigurableWebPage <|-- AuthPage : inherits
    ConfigurableWebPage <|-- DreadPage : inherits

    DreadPage ..> ThreadItem : produces

    note for DreadPage "Located in src/.../pages/sites/dread.py"
    note for ConfigurableWebPage "Located in src/.../pages/base.py"
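
Translated to Python, the hierarchy might look like the sketch below; the method bodies are placeholders, and in the real project these classes would build on web-poet's page-object base classes:

# Hypothetical Python rendering of the class diagram above.
from abc import ABC, abstractmethod

class ExtractionBehavior(ABC):
    """Interface realized by the behaviors in pages/extraction.py."""

    @abstractmethod
    def extract_links(self): ...

    @abstractmethod
    def extract_pagination(self): ...

class ConfigurableWebPage(ExtractionBehavior):
    """Generic, config-driven page object (pages/base.py)."""

    def __init__(self, config):
        self.config = config           # the validated SiteConfig

    def to_item(self):
        """Build a Scrapy item from the configured selectors."""
        raise NotImplementedError

    def extract_links(self): ...       # delegate to extraction.py behaviors
    def extract_pagination(self): ...

class AuthPage(ConfigurableWebPage):
    """Login and captcha handling (pages/auth.py)."""

    def login(self): ...
    def handle_captcha(self): ...

class DreadPage(ConfigurableWebPage):
    """Site-specific overrides (pages/sites/dread.py); produces ThreadItem."""

    def custom_date_parsing(self, raw): ...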