Intro

While the original scraper was a solid MVP, the growing scope of this project requires a more robust and maintainable solution. The legacy scraper was built on top of Selenium scripts and suffered from high resource usage, excessive manual intervention, stability problems, poor documentation, and difficult feature additions.

Refactor Goals

  • Improve code maintainability
    • Comprehensive documentation
    • Decrease debugging time
    • Consistent build & run patterns following Scrapy and Python best practices
  • Improve performance & concurrency
    • Optimize request delays to avoid detection
    • Reduce memory & CPU footprint
    • Support multiple forums at once
  • Increase automation
    • Reduce the human interaction needed wherever possible
    • Handle connection timeouts and failures with retry logic (see the settings sketch below)
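
A minimal sketch of Scrapy settings that would serve these goals; the values are illustrative assumptions, not tuned numbers:

# Illustrative Scrapy settings (values are assumptions, not tuned numbers).
RETRY_ENABLED = True
RETRY_TIMES = 3                        # re-issue each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

DOWNLOAD_DELAY = 2.0                   # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter delays to look less robotic

AUTOTHROTTLE_ENABLED = True            # adapt delays to observed server latency
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

CONCURRENT_REQUESTS = 8                # crawl several forums in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # stay gentle with any single forum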

Proposed architecture

The restructure moves from a procedural script to a data-driven, modular architecture, leveraging Scrapy and web-poet to create a production-grade web scraper that supports multiple forums through configuration files.

Tip

This modular approach allows us to scale support for new dark web forums simply by adding new configuration files, without rewriting the core scraping logic.
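
For illustration, a new forum could be described entirely in TOML and loaded through the proposed settings/loader.py; the field names below are hypothetical placeholders, not a finalized schema:

# Hypothetical sketch of the config-driven flow (field names are assumptions).
import tomllib
from dataclasses import dataclass, field

EXAMPLE_SITE_TOML = """
# config/sites/example_forum.toml
name = "example_forum"
start_url = "http://example.onion"
network = "tor"                        # routed through the Tor proxy middleware

[selectors]
thread_links = "a.thread-title::attr(href)"
next_page = "a.pagination-next::attr(href)"
"""

@dataclass
class SiteConfig:
    """Validated view of one site's TOML file (see settings/models.py)."""
    name: str
    start_url: str
    network: str = "tor"
    selectors: dict = field(default_factory=dict)

def load_site_config(raw: str) -> SiteConfig:
    return SiteConfig(**tomllib.loads(raw))

config = load_site_config(EXAMPLE_SITE_TOML)  # -> SiteConfig object for the spider

With a shape like this, the spider never hard-codes selectors; supporting a new forum means adding one file under config/sites/.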

Project Structure

onion-peeler/
|-- README.md
|-- pyproject.toml                 # Modern Python packaging
|-- scrapy.cfg                     # For scrapy cli support
|-- .env                           # Local secrets and config values
|-- docs/                          # Documentation
|-- config/                        # Configuration files (not in package)
|   |-- base.toml                  # Global scraper settings
|   |-- accounts/                  # Authentication credentials (gitignored)
|   |   |-- dread_accounts.json
|   |   `-- daunt_accounts.json
|   |-- sites/                     # Site-specific configs
|   |   `-- template.toml          # Template for new sites
|   `-- tor/                       # Tor network configuration
|       `-- torrc                  # Tor config for docker container
|-- logs/                          # Application logs (gitignored)
|   |-- scraper.log
|   |-- errors.log
|   `-- auth.log
|-- src/
|   `-- onion_peeler/              # Main package
|       |-- __init__.py
|       |-- __main__.py            # Entry point
|       |-- cli.py                 # CLI commands
|       |-- spiders/               # Scrapy spiders
|       |   |-- base.py            # ConfigDrivenSpider
|       |   `-- sites/             # Site-specific spiders (if needed)
|       |-- pages/                 # web-poet page objects
|       |   |-- base.py            # Base page object classes
|       |   |-- extraction.py      # link extraction behaviors
|       |   |-- auth.py            # Authentication page objects
|       |   |-- factory.py         # Page object factory
|       |   |-- registry.py        # Page object registry
|       |   `-- sites/             # Site-specific overrides
|       |-- items/                 # Scrapy items (data models)
|       |   |-- base.py            # BaseItem with common fields
|       |   |-- thread.py          # ThreadItem
|       |   `-- post.py            # PostItem
|       |-- middlewares/           # Request/Response middlewares
|       |   |-- tor.py             # Tor/VPN routing
|       |   |-- sessions.py        # Session management
|       |   `-- antibot.py         # Header spoofing
|       |-- pipelines/             # Data pipelines
|       |   |-- validation.py      # Data validation
|       |   |-- cleanser.py        # Data cleansing (see architecture diagram)
|       |   |-- enrichment.py      # Add metadata
|       |   `-- storage.py         # Save to files/DB
|       |-- settings/              # Configuration management
|       |   |-- base.py            # Scrapy settings
|       |   |-- loader.py          # TOML loader
|       |   `-- models.py          # Config data models
|       `-- utils/                 # Utility functions
|-- tests/                         # Test suite
|-- deploy/                        # Deployment files
|-- deprecated/                    # Legacy files from the old main and development branches
`-- .github/                       # CI/CD
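
To show how the two entry points fit together, here is a minimal sketch of __main__.py, assuming a spider named config_driven and a load_site_config helper like the one sketched above; it illustrates the intended wiring rather than the final CLI:

# Hypothetical sketch of src/onion_peeler/__main__.py (names are assumptions).
import sys
from pathlib import Path

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from onion_peeler.settings.loader import load_site_config  # hypothetical helper

def main() -> None:
    site = sys.argv[1]                     # e.g. `python -m onion_peeler dread`
    config = load_site_config(Path(f"config/sites/{site}.toml").read_text())
    process = CrawlerProcess(get_project_settings())  # resolved via scrapy.cfg
    process.crawl("config_driven", config=config)     # ConfigDrivenSpider by name
    process.start()                        # blocks until the crawl finishes

if __name__ == "__main__":
    main()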

Architecture

graph TD
    subgraph User_Space
        CustomCLI[Custom CLI\npython -m onion_peeler]
        ScrapyCLI[Standard CLI\nscrapy crawl ...]
        Configs[config/ directory]
    end

    subgraph Entry_Points
        CustomCLI -->|1. Parse Args| Main[__main__.py]
        ScrapyCLI -.->|1. Read Project| ScrapyCfg[scrapy.cfg]
        ScrapyCfg -.->|2. Load Settings| ScrapySettings[src/onion_peeler/settings/base.py]
    end

    subgraph Config_Layer [src/onion_peeler/settings]
        Main -->|Explicit Load| Loader[loader.py]
        ScrapySettings -.->|Implicit/Lazy Load| Loader
        Configs -.->|Read TOMLs| Loader
        Loader -->|Validates| Models[models.py]
        Models -->|Returns| SiteConfigObj[SiteConfig Object]
    end

    subgraph Scrapy_Engine [Scrapy Core]
        Main -->|3a. Start Process| Crawler
        ScrapyCLI -->|3b. Start Process| Crawler

        Crawler[Crawler Process] -->|Spawns| Spider[ConfigDrivenSpider]
        SiteConfigObj -->|Injected| Spider
        Spider -->|Requests| Middlewares

        subgraph Middleware_Layer [src/onion_peeler/middlewares]
            Middlewares -->|Intercept| Routing["ProxyMiddleware (tor.py)"]
            Routing -->|Checks URL| Decision{Is .onion?}

            Decision -- Yes --> TorProxy[Tor Access Proxy]
            Decision -- No --> VpnProxy[VPN Access Proxy]

            TorProxy -->|Mask| AntiBot[antibot.py]
            VpnProxy -->|Mask| AntiBot
        end
    end

    subgraph Logic_Layer [src/onion_peeler/pages]
        Response -->|Injected via web-poet| Factory[factory.py]
        Factory -->|Instantiates| PageObj[ConfigurableWebPage]
        PageObj -->|Uses| Behaviors[extraction.py]
        PageObj -->|Extracts| Items[Item Objects]
    end

    subgraph Data_Layer [src/onion_peeler/pipelines]
        Items -->|Validate| Val[validation.py]
        Val -->|Cleanse| Clean[cleanser.py]
        Clean -->|Enrich| Enrich[enrichment.py]
        Enrich -->|Save| Storage[storage.py]
    end

    TorProxy -->|Tor Network| DarkWeb((Dark Web))
    VpnProxy -->|VPN Tunnel| ClearWeb((Clear Web))

    DarkWeb -->|HTML| Response
    ClearWeb -->|HTML| Response

    Spider -->|Yields Response| Factory
    PageObj -->|Yields Items| Spider
    Spider -->|Yields Items| Items
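
The routing decision at the heart of the middleware layer is compact in code; below is a minimal sketch of the tor.py proxy middleware, with the proxy addresses as placeholder assumptions:

# Hypothetical sketch of middlewares/tor.py (proxy addresses are assumptions).
from urllib.parse import urlparse

TOR_PROXY = "http://127.0.0.1:8118"    # HTTP proxy in front of the Tor SOCKS port
VPN_PROXY = "http://127.0.0.1:3128"    # clear-web traffic through the VPN proxy

class ProxyRoutingMiddleware:
    """Route .onion requests through Tor and everything else through the VPN."""

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ""
        # Scrapy's built-in HttpProxyMiddleware honors the `proxy` meta key.
        request.meta["proxy"] = TOR_PROXY if host.endswith(".onion") else VPN_PROXY
        return None                    # continue normal downloader processing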

Request/Response Sequence (The "Peeling" Process)

This sequence diagram details the lifecycle of a single URL request, highlighting the interaction between the Spider, Middleware, and Page Objects.

sequenceDiagram
    participant S as ConfigDrivenSpider
    participant M as Middleware (Tor/Antibot)
    participant N as Tor Network
    participant F as Page Factory
    participant P as Page Object (Configurable)
    participant Pipe as Pipeline

    Note over S: 1. Start from SiteConfig
    S->>M: Yield Request (URL)
    M->>N: Forward Request (proxied)
    N-->>M: Return HTML Response
    M-->>S: Return Response

    Note over S, F: 2. web-poet injection
    S->>F: Request Page Object for Response
    F->>F: Match URL to Registry
    F->>P: Instantiate Page (w/ Config)

    rect rgb(255,238,136)
        Note over P: 3. Extraction Phase
        P->>P: Run extraction.py behaviors
        P->>P: Validate extracted fields
    end

    P-->>S: Return Item (e.g., ThreadItem)
    S->>Pipe: Yield Item

    Pipe->>Pipe: validation.py
    Pipe->>Pipe: storage.py (DB/JSON)

    rect rgb(255,238,136)
        Note over P: 4. Pagination Phase
        P-->>S: Return Next Page Links
        S->>S: Schedule Next Requests
    end
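
In spider code, steps 1-4 collapse into a short callback. The sketch below assumes scrapy-poet's dependency injection plus hypothetical to_item() and next_page_links() methods on the page object:

# Hypothetical sketch of spiders/base.py; the page-object method names and the
# injected `config` argument are assumptions taken from the diagrams above.
import scrapy

from onion_peeler.pages.base import ConfigurableWebPage

class ConfigDrivenSpider(scrapy.Spider):
    name = "config_driven"

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config           # a SiteConfig from settings/loader.py

    def start_requests(self):
        # 1. Start from SiteConfig.
        yield scrapy.Request(self.config.start_url, callback=self.parse_page)

    def parse_page(self, response, page: ConfigurableWebPage):
        # 2-3. scrapy-poet matches the URL in the registry and injects the
        # page object, which runs the configured extraction behaviors.
        yield page.to_item()           # e.g. a ThreadItem, sent to the pipelines
        # 4. Pagination phase: schedule the next requests.
        for url in page.next_page_links():
            yield response.follow(url, callback=self.parse_page)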

Page Object Class Hierarchy

This diagram shows how the modules in the pages/ directory map to the Python classes used by the spiders.

classDiagram
    class ConfigurableWebPage {
        +SiteConfig config
        +to_item()
    }

    class ExtractionBehavior {
        <<Interface>>
        +extract_links()
        +extract_pagination()
    }

    class AuthPage {
        +login()
        +handle_captcha()
    }

    class DreadPage {
        +custom_date_parsing()
    }

    class ThreadItem {
        +title
        +author
        +content
    }

    %% Relationships
    ConfigurableWebPage ..|> ExtractionBehavior : implements (extraction.py)
    ConfigurableWebPage <|-- AuthPage : inherits
    ConfigurableWebPage <|-- DreadPage : inherits

    DreadPage ..> ThreadItem : produces

    note for DreadPage "Located in src/.../pages/sites/dread.py"
    note for ConfigurableWebPage "Located in src/.../pages/base.py"
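
Translated to Python, the hierarchy might look like the sketch below; the method bodies are placeholders, and in the real project these classes would build on web-poet's page-object base classes:

# Hypothetical Python rendering of the class diagram above.
from abc import ABC, abstractmethod

class ExtractionBehavior(ABC):
    """Interface realized by the behaviors in pages/extraction.py."""

    @abstractmethod
    def extract_links(self): ...

    @abstractmethod
    def extract_pagination(self): ...

class ConfigurableWebPage(ExtractionBehavior):
    """Generic, config-driven page object (pages/base.py)."""

    def __init__(self, config):
        self.config = config           # the validated SiteConfig

    def to_item(self):
        """Build a Scrapy item from the configured selectors."""
        raise NotImplementedError

    def extract_links(self): ...       # delegate to extraction.py behaviors
    def extract_pagination(self): ...

class AuthPage(ConfigurableWebPage):
    """Login and captcha handling (pages/auth.py)."""

    def login(self): ...
    def handle_captcha(self): ...

class DreadPage(ConfigurableWebPage):
    """Site-specific overrides (pages/sites/dread.py); produces ThreadItem."""

    def custom_date_parsing(self, raw): ...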