Intro¶
While the original scraper was a solid MVP, the growing scope of this project requires a more robust and maintainable solution. The legacy scraper was built on top of Selenium scripts and suffers from high resource usage, excessive manual intervention, stability problems, poor documentation, and difficulty adding new features.
Refactor Goals¶
- Improve code maintainability
    - Comprehensive documentation
    - Decrease debugging time
    - Consistent build & run patterns following Scrapy and Python best practices
- Improve performance & concurrency
    - Optimize request delays to avoid detection
    - Reduce memory & CPU footprint
    - Support multiple forums at once
- Increase automation
    - Reduce the human interaction needed wherever possible
    - Handle connection timeouts and failures through retry logic
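Several of these goals (retries, tuned delays, reduced footprint) map directly onto Scrapy's built-in settings. A sketch of the relevant knobs; the values here are illustrative starting points, not tuned recommendations:

```python
# Illustrative Scrapy settings for the throttling/retry goals above.
# The setting names are standard Scrapy; the values are assumptions.
SETTINGS = {
    # Retry transient failures instead of requiring manual restarts
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,
    "RETRY_HTTP_CODES": [500, 502, 503, 504, 522, 524, 408, 429],
    # AutoThrottle adapts delays to observed latency, helping avoid detection
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    "DOWNLOAD_DELAY": 2.0,
    # Limit per-domain concurrency to keep the resource footprint modest
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    # Tor circuits are slow; allow generous timeouts before a retry fires
    "DOWNLOAD_TIMEOUT": 180,
}
```

These would live in `settings/base.py` and be merged with per-site overrides at load time.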
Proposed architecture¶
The restructure moves from a procedural script to a data-driven, modular architecture, leveraging Scrapy and web-poet to build a production-grade web scraper that supports multiple forums through configuration files.
Tip
This modular approach allows us to scale support for new dark web forums simply by adding new configuration files, without rewriting the core scraping logic.
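As a sketch of what such a configuration file might contain (the field names below are illustrative assumptions; the authoritative schema would live in `settings/models.py`):

```toml
# Hypothetical config/sites/example.toml
[site]
name = "example_forum"
start_urls = ["http://example.onion/forum"]

[selectors]
thread_links = "a.thread-title::attr(href)"
next_page = "a.pagination-next::attr(href)"

[auth]
required = true
accounts_file = "config/accounts/example_accounts.json"
```

Onboarding a new forum then means copying `template.toml`, filling in selectors, and (if needed) adding a small site-specific page object override.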
Project Structure¶
```
onion-peeler/
|-- README.md
|-- pyproject.toml # Modern Python packaging
|-- scrapy.cfg # For scrapy cli support
|-- .env # Local secrets and config values
|-- docs/ # Documentation
|-- config/ # Configuration files (not in package)
| |-- base.toml # Global scraper settings
| |-- accounts/ # Authentication credentials (gitignored)
| | |-- dread_accounts.json
| | `-- daunt_accounts.json
| |-- sites/ # Site-specific configs
| | `-- template.toml # Template for new sites
| `-- tor/ # Tor configuration
| `-- torrc # Tor config for docker container
|-- logs/ # Application logs (gitignored)
| |-- scraper.log
| |-- errors.log
| `-- auth.log
|-- src/
| `-- onion_peeler/ # Main package
| |-- __init__.py
| |-- __main__.py # Entry point
| |-- cli.py # CLI commands
| |-- spiders/ # Scrapy spiders
| | |-- base.py # ConfigDrivenSpider
| | `-- sites/ # Site-specific spiders (if needed)
| |-- pages/ # web-poet page objects
| | |-- base.py # Base page object classes
| | |-- extraction.py # Link extraction behaviors
| | |-- auth.py # Authentication page objects
| | |-- factory.py # Page object factory
| | |-- registry.py # Page object registry
| | `-- sites/ # Site-specific overrides
| |-- items/ # Scrapy items (data models)
| | |-- base.py # BaseItem with common fields
| | |-- thread.py # ThreadItem
| | `-- post.py # PostItem
| |-- middlewares/ # Request/Response middlewares
| | |-- tor.py # Tor/VPN routing
| | |-- sessions.py # Session management
| | `-- antibot.py # Header spoofing
| |-- pipelines/ # Data pipelines
| | |-- validation.py # Data validation
| | |-- enrichment.py # Add metadata
| | `-- storage.py # Save to files/DB
| |-- settings/ # Configuration management
| | |-- base.py # Scrapy settings
| | |-- loader.py # TOML loader
| | `-- models.py # Config data models
| `-- utils/ # Utility functions
|-- tests/ # Test suite
|-- deploy/ # Deployment files
|-- deprecated/ # All files from the old main and Development branches
`-- .github/ # CI/CD
```
Architecture¶
```mermaid
graph TD
    subgraph User_Space
        CustomCLI[Custom CLI\npython -m onion_peeler]
        ScrapyCLI[Standard CLI\nscrapy crawl ...]
        Configs[config/ directory]
    end
    subgraph Entry_Points
        CustomCLI -->|1. Parse Args| Main[__main__.py]
        ScrapyCLI -.->|1. Read Project| ScrapyCfg[scrapy.cfg]
        ScrapyCfg -.->|2. Load Settings| ScrapySettings[src/onion_peeler/settings/base.py]
    end
    subgraph Config_Layer [src/onion_peeler/settings]
        Main -->|Explicit Load| Loader[loader.py]
        ScrapySettings -.->|Implicit/Lazy Load| Loader
        Configs -.->|Read TOMLs| Loader
        Loader -->|Validates| Models[models.py]
        Models -->|Returns| SiteConfigObj[SiteConfig Object]
    end
    subgraph Scrapy_Engine [Scrapy Core]
        Main -->|3a. Start Process| Crawler
        ScrapyCLI -->|3b. Start Process| Crawler
        Crawler[Crawler Process] -->|Spawns| Spider[ConfigDrivenSpider]
        SiteConfigObj -->|Injected| Spider
        Spider -->|Requests| Middlewares
        subgraph Middleware_Layer [src/onion_peeler/middlewares]
            Middlewares -->|Intercept| Routing["ProxyMiddleware (tor.py)"]
            Routing -->|Checks URL| Decision{Is .onion?}
            Decision -- Yes --> TorProxy[Tor Access Proxy]
            Decision -- No --> VpnProxy[VPN Access Proxy]
            TorProxy -->|Mask| AntiBot[antibot.py]
            VpnProxy -->|Mask| AntiBot
        end
    end
    subgraph Logic_Layer [src/onion_peeler/pages]
        Response -->|Injected via web-poet| Factory[factory.py]
        Factory -->|Instantiates| PageObj[ConfigurableWebPage]
        PageObj -->|Uses| Behaviors[extraction.py]
        PageObj -->|Extracts| Items[Item Objects]
    end
    subgraph Data_Layer [src/onion_peeler/pipelines]
        Items -->|Validate| Val[validation.py]
        Val -->|Enrich| Enrich[enrichment.py]
        Enrich -->|Save| Storage[storage.py]
    end
    TorProxy -->|Tor Network| DarkWeb((Dark Web))
    VpnProxy -->|VPN Tunnel| ClearWeb((Clear Web))
    DarkWeb -->|HTML| Response
    ClearWeb -->|HTML| Response
    Spider -->|Yields Response| Factory
    PageObj -->|Yields Items| Spider
    Spider -->|Yields Items| Items
```
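The "Is .onion?" decision node in the middleware layer reduces to a hostname check. A minimal sketch (the proxy endpoints are assumptions; real values would come from `.env`); in Scrapy this would typically run inside a downloader middleware that sets `request.meta["proxy"]`:

```python
from urllib.parse import urlsplit

# Proxy endpoints are illustrative; real values would come from .env/config.
TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h resolves hostnames via Tor
VPN_PROXY = "http://127.0.0.1:8118"     # hypothetical clear-web proxy


def choose_proxy(url: str) -> str:
    """Route .onion URLs through Tor and everything else through the VPN
    proxy, mirroring the Decision node in the diagram above."""
    host = urlsplit(url).hostname or ""
    return TOR_PROXY if host.endswith(".onion") else VPN_PROXY
```

Using the `socks5h` scheme (rather than `socks5`) matters for Tor: DNS resolution happens inside the Tor network, so `.onion` hostnames never leak to the local resolver.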
Request/Response Sequence (The "Peeling" Process)¶
This sequence diagram details the lifecycle of a single URL request, highlighting the interaction between the Spider, Middleware, and Page Objects.
```mermaid
sequenceDiagram
    participant S as ConfigDrivenSpider
    participant M as Middleware (Tor/Antibot)
    participant N as Tor Network
    participant F as Page Factory
    participant P as Page Object (Configurable)
    participant Pipe as Pipeline
    Note over S: 1. Start from SiteConfig
    S->>M: Yield Request (URL)
    M->>N: Forward Request (proxied)
    N-->>M: Return HTML Response
    M-->>S: Return Response
    Note over S, F: 2. web-poet injection
    S->>F: Request Page Object for Response
    F->>F: Match URL to Registry
    F->>P: Instantiate Page (w/ Config)
    rect rgb(255,238,136)
        Note over P: 3. Extraction Phase
        P->>P: Run extraction.py behaviors
        P->>P: Apply item validation
    end
    P-->>S: Return Item (e.g., ThreadItem)
    S->>Pipe: Yield Item
    Pipe->>Pipe: validation.py
    Pipe->>Pipe: storage.py (DB/JSON)
    rect rgb(255,238,136)
        Note over P: 4. Pagination Phase
        P-->>S: Return Next Page Links
        S->>S: Schedule Next Requests
    end
```
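Step 2 above ("Match URL to Registry") can be approximated with a small pattern registry. This is a simplified stand-in for what `pages/registry.py` and `pages/factory.py` might do (class names beyond the diagrams and the glob patterns are hypothetical):

```python
from fnmatch import fnmatch


class BasePage:
    """Stand-in for the config-driven ConfigurableWebPage."""
    def __init__(self, html: str, config: dict):
        self.html, self.config = html, config


class DreadPage(BasePage):
    """Stand-in for a site-specific override (pages/sites/)."""


# Ordered registry: first matching pattern wins, generic fallback last.
_REGISTRY: list[tuple[str, type[BasePage]]] = [
    ("*.onion/d/*", DreadPage),  # hypothetical Dread URL pattern
    ("*", BasePage),             # config-driven fallback for everything else
]


def page_for(url: str, html: str, config: dict) -> BasePage:
    """Return the first registered page object whose pattern matches the URL."""
    for pattern, cls in _REGISTRY:
        if fnmatch(url, pattern):
            return cls(html, config)
    raise LookupError(url)
```

In the real project, web-poet's override/injection machinery would replace this hand-rolled lookup, but the ordering principle (specific override before generic fallback) is the same.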
Page Object Class Hierarchy¶
This diagram shows how modules in the pages/ directory map to the Python classes used by the spiders.
```mermaid
classDiagram
    class ConfigurableWebPage {
        +SiteConfig config
        +to_item()
    }
    class ExtractionBehavior {
        <<Interface>>
        +extract_links()
        +extract_pagination()
    }
    class AuthPage {
        +login()
        +handle_captcha()
    }
    class DreadPage {
        +custom_date_parsing()
    }
    class ThreadItem {
        +title
        +author
        +content
    }
    %% Relationships
    ConfigurableWebPage ..|> ExtractionBehavior : uses (extraction.py)
    ConfigurableWebPage <|-- AuthPage : inherits
    ConfigurableWebPage <|-- DreadPage : inherits
    DreadPage ..> ThreadItem : produces
    note for DreadPage "Located in src/.../pages/sites/dread.py"
    note for ConfigurableWebPage "Located in src/.../pages/base.py"
```
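Scrapy supports plain dataclasses as items, so the `ThreadItem` node above could be realized in `items/` as a small dataclass hierarchy. A sketch under that assumption (`source_site`, `scraped_at`, and `to_record` are illustrative names, not confirmed by the design):

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class BaseItem:
    """Common fields from items/base.py; field names are illustrative."""
    source_site: str
    scraped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ThreadItem(BaseItem):
    """Mirrors the ThreadItem node in the class diagram above."""
    title: str = ""
    author: str = ""
    content: str = ""


def to_record(item: BaseItem) -> dict:
    """Flatten an item into a plain dict for the storage pipeline."""
    return asdict(item)
```

Keeping items as dataclasses (rather than `scrapy.Item`) keeps the data models importable and testable without a running crawler, which serves the maintainability goals above.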