# System Architecture
The Onion Peeler is built on top of Scrapy and web-poet and is designed to be highly maintainable, performant, and automated. It represents a significant departure from traditional headless-browser scrapers, focusing instead on a modular, configuration-driven approach that minimizes the custom code needed to add new target sites.
## Architectural Philosophy
The core philosophy is to decouple Extraction Logic from Crawl Mechanics. By using configuration files (TOML) to define how a site should be scraped, we can scale support for new dark web forums without modifying the application's core Python code.
### Core Design Goals
- Maintainability: Centralized configuration and comprehensive documentation reduce debugging time.
- Performance: Leveraging Scrapy's asynchronous engine significantly reduces CPU and memory footprint compared to headless browsers.
- Automation: Robust handling of connection timeouts, retries, and proxy rotation (Tor/VPN) minimizes manual intervention.
## High-Level Component Map
The application is divided into several logical layers, each with a specific responsibility.
```mermaid
graph TD
    subgraph User_Space
        ScrapyCLI[Standard CLI\nscrapy crawl ...]
        Configs[config/sites/ directory]
    end

    subgraph Config_Layer [src/onion_peeler/settings]
        Loader[loader.py]
        Models[models.py]
        Configs -->|Read TOMLs| Loader
        Loader -->|Validates| Models
        Models -->|Returns| SiteConfigObj[SiteConfig Object]
    end

    subgraph Scrapy_Engine [Scrapy Core]
        ScrapyCLI -->|Start Process| Crawler
        Crawler[Crawler Process] -->|Spawns| Spider[ConfigDrivenSpider]
        SiteConfigObj -->|Injected| Spider

        subgraph Middleware_Layer [src/onion_peeler/middlewares]
            Spider -->|Requests| Routing["ProxyMiddleware (proxy.py)"]
            Routing -->|Checks URL| Decision{Is .onion?}
            Decision -- Yes --> TorProxy[Tor SOCKS Proxy]
            Decision -- No --> VpnProxy[VPN Gateway]
        end
    end

    subgraph Logic_Layer [src/onion_peeler/pages]
        Response -->|Injected via web-poet| Factory[factory.py]
        Factory -->|Instantiates| PageObj[ConfigurableWebPage]
        PageObj -->|Extracts| Items[Item Objects]
    end

    subgraph Data_Layer [src/onion_peeler/pipelines]
        Items -->|Validate| Val[validation.py]
        Val -->|Enrich| Enrich[enrichment.py]
        Enrich -->|Save| Storage[storage.py]
    end

    TorProxy -->|Tor Network| DarkWeb((Dark Web))
    VpnProxy -->|VPN Tunnel| ClearWeb((Clear Web))
    DarkWeb -->|HTML| Response
    ClearWeb -->|HTML| Response
```
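The routing decision in the middleware layer is simple: `.onion` hosts go through the Tor SOCKS proxy, everything else through the VPN gateway. A minimal sketch, assuming illustrative proxy endpoints (the real values come from the project's settings, and the real class lives in `src/onion_peeler/middlewares/proxy.py`):

```python
from urllib.parse import urlparse

# Illustrative endpoints; the actual addresses are configured per deployment.
TOR_PROXY = "socks5://127.0.0.1:9050"
VPN_PROXY = "http://10.8.0.1:8118"

def choose_proxy(url: str) -> str:
    """Route .onion hosts through Tor; everything else through the VPN."""
    host = urlparse(url).hostname or ""
    return TOR_PROXY if host.endswith(".onion") else VPN_PROXY

class ProxyMiddleware:
    """Sketch of a Scrapy downloader middleware applying the routing rule."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = choose_proxy(request.url)
        return None  # continue down the middleware chain
```

Keeping the decision in one `choose_proxy` function makes it trivial to unit-test the routing rule without spinning up a crawl.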
## The "Peeling" Process (Sequence)
This diagram details the lifecycle of a single URL request, highlighting the interaction between the Spider, Middleware, and Page Objects.
```mermaid
sequenceDiagram
    participant S as ConfigDrivenSpider
    participant M as Proxy Middleware
    participant N as Network (Tor/VPN)
    participant F as Page Factory
    participant P as Page Object (Configurable)
    participant Pipe as Pipeline

    Note over S: 1. Start from SiteConfig
    S->>M: Yield Request (URL)
    M->>N: Forward Request (proxied)
    N-->>M: Return HTML Response
    M-->>S: Return Response

    Note over S, F: 2. web-poet injection
    S->>F: Request Page Object for Response
    F->>P: Instantiate Page (w/ Config)

    rect rgb(240, 239, 237)
        Note over P: 3. Extraction Phase
        P->>P: Apply TOML Selectors
    end

    P-->>S: Return Item (e.g., LinkItem)
    S->>Pipe: Yield Item
    Pipe->>Pipe: validation.py
    Pipe->>Pipe: storage.py (JSON/CSV)

    Note over P: 4. Pagination Phase
    P-->>S: Return Next Page Links
    S->>S: Schedule Next Requests
```
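The extract-then-paginate loop (steps 3–4 above) can be sketched in plain Python. The method names `to_item()`/`next_links()` and the deduplication via a `seen` set are illustrative of the flow, not the project's exact API:

```python
def peel(page, seen):
    """One crawl step: extract items, then schedule unvisited pagination
    links. `page` is any object exposing to_item() and next_links()."""
    items = list(page.to_item())                                  # 3. extraction
    new_requests = [u for u in page.next_links() if u not in seen]  # 4. pagination
    seen.update(new_requests)
    return items, new_requests

class FakePage:
    """Stand-in page object, for illustration only."""
    def to_item(self):
        return [{"title": "Thread A", "url": "/t/1"}]
    def next_links(self):
        return ["/index?page=2", "/index?page=1"]

seen = {"/index?page=1"}          # already visited
items, nxt = peel(FakePage(), seen)
```

Items are emitted before follow-up requests are scheduled, so data is never lost if a crawl is interrupted mid-pagination.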
## Page Object Class Hierarchy
We use a hierarchical approach to extraction. The ConfigurableWebPage handles standard TOML-based extraction, while site-specific subclasses can be created for complex logic.
```mermaid
classDiagram
    class ConfigurableWebPage {
        +SiteConfig config
        +to_item() List[Item]
    }
    class SiteSpecificPage {
        +custom_parsing_logic()
    }
    class LinkItem {
        +title: str
        +url: str
    }
    ConfigurableWebPage <|-- SiteSpecificPage : inherits
    ConfigurableWebPage ..> LinkItem : produces
```
- `ConfigurableWebPage`: The default handler. It reads the `selectors` block from the TOML and uses it to map HTML elements to Pydantic models.
- `SiteSpecificPage`: (Optional) A Python class created in `src/onion_peeler/pages/sites/` to handle sites that require custom logic (e.g., decryption or complex date parsing).
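A dependency-free sketch of this hierarchy, under the assumption that a plain dict can stand in for pre-resolved selector results (the real classes build on web-poet page objects and run CSS/XPath selectors against HTML):

```python
class ConfigurableWebPage:
    """Generic extractor: maps configured selectors onto item fields."""

    def __init__(self, config, resolved):
        self.config = config      # e.g. {"selectors": {"title": "h1::text"}}
        self.resolved = resolved  # selector -> extracted value (illustrative)

    def to_item(self):
        return [{field: self.resolved.get(sel)
                 for field, sel in self.config["selectors"].items()}]

class SiteSpecificPage(ConfigurableWebPage):
    """Optional subclass for sites needing custom logic beyond TOML selectors."""

    def to_item(self):
        items = super().to_item()
        for item in items:
            # Example of site-specific post-processing: normalize whitespace.
            item["title"] = (item["title"] or "").strip()
        return items

page = SiteSpecificPage(
    config={"selectors": {"title": "h1::text", "url": "a::attr(href)"}},
    resolved={"h1::text": "  Thread A  ", "a::attr(href)": "/t/1"},
)
```

Because the subclass only overrides `to_item()`, the generic TOML-driven mapping stays in one place and site quirks remain isolated to their own modules.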
## Data Pipelines
Once an item is extracted, it passes through a series of pipelines:
- Validation: Ensures the extracted data meets the required schema (via Pydantic).
- Enrichment: Adds metadata such as `scraped_at` timestamps or source site identifiers.
- Storage: Persists the data to the configured output format (JSON, CSV, or Database).
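The three stages chain together as item → validate → enrich → store. A minimal sketch of that flow, with a plain list standing in for the JSON/CSV/database sink and simple field checks standing in for the project's Pydantic validation:

```python
from datetime import datetime, timezone

def validate(item: dict) -> dict:
    """validation.py: reject items missing required fields
    (the project does this with Pydantic models)."""
    if not item.get("title") or not item.get("url"):
        raise ValueError(f"invalid item: {item!r}")
    return item

def enrich(item: dict, site_name: str) -> dict:
    """enrichment.py: attach scraped_at and the source site identifier."""
    item["scraped_at"] = datetime.now(timezone.utc).isoformat()
    item["source_site"] = site_name
    return item

def store(item: dict, sink: list) -> dict:
    """storage.py: append to the configured sink (list as a stand-in)."""
    sink.append(item)
    return item

sink: list = []
store(enrich(validate({"title": "Thread A", "url": "/t/1"}), "example_forum"), sink)
```

Ordering matters: validation runs first so that malformed items are dropped before any enrichment or storage cost is paid.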