Creating a Custom Site Configuration¶
This tutorial walks you through the process of adding a new website/forum to the scraper using a configuration file. We will move from inspecting raw HTML to building a functional, paginated scraper without writing a single line of Python.
The "No-Code" Philosophy¶
Onion Peeler uses a Data-Driven Architecture. Instead of writing code for each website, you define a "Map" (the TOML file) that tells the Scrapy engine:
1. Where to go (Base URL).
2. What to look for (Selectors).
3. How to clean the data (Processors).
4. Where to go next (Pagination).
Step 1: Inspect the Target with Scrapy Shell¶
Before writing a config, inspect the site's HTML structure with the scrapy shell command. This lets you test your assumptions in real time.
Assume we are targeting a directory at http://example.onion.
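To open an interactive shell against the target (routing through whatever proxy settings your project defines), the invocation looks like:

```
scrapy shell http://example.onion
```

Note that requests to .onion addresses will only succeed if the proxy chain described later in this tutorial is running.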
Identifying the "Container"¶
Most sites list items (threads, links, mirrors) in a repeating pattern. You must find the Container—the smallest HTML block that holds exactly one item.
# If every mirror is in a <div class="mirror-entry">
containers = response.css("div.mirror-entry")
print(f"Found {len(containers)} items")
Testing Relative Selectors¶
Once you have the container, all field selectors (title, url, date) must be relative to it.
item = containers[0]
# Use relative CSS
item.css("h3.title::text").get()
# Or relative XPath (starts with ./)
item.xpath("./span[@class='url']/text()").get()
Step 2: Anatomy of a Site Config¶
Create a new file: config/sites/example.toml. Below is the breakdown of a standard configuration.
1. The [site] Block (Metadata)¶
This identifies the spider within the application.
[site]
name = "example" # The ID used in 'scrapy crawl example'
base_url = "http://example.onion" # The entry point
allowed_domains = ["example.onion"] # Security boundary
2. The [site.items] Block (Schema)¶
This enables specific data models. Onion Peeler supports types like link, post, and thread.
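For a directory of mirror links, you would enable only the link model. The snippet below is one plausible shape; check the Config Reference for the exact keys your version accepts:

```toml
[site.items]
# Hypothetical syntax: enable only the 'link' data model
link = true
```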
3. The [site.selectors] Block (The Map)¶
This is the heart of the configuration.
[site.selectors.link]
# The 'Row' or 'Block' identified in Step 1
container = "div.mirror-entry"
# 'Cells' or 'Fields' relative to the container
title = "h3.title::text"
url = "a.link-out::attr(href)"
Selector Engine Auto-Detection
Onion Peeler assumes CSS by default. However, if your selector string starts with xpath:, /, .//, or (, it automatically switches to the XPath engine.
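For example, these two title selectors would match the same element, the first through the CSS engine and the second through XPath (auto-detected from the .// prefix):

```toml
# CSS (default engine)
title = "h3.title::text"
# Equivalent XPath, auto-detected from the .// prefix:
# title = ".//h3[@class='title']/text()"
```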
Step 3: Data Cleansing (Processors)¶
Dark web HTML is often messy, full of extra tabs, newlines, or "noise." Processors are transformation functions applied to the extracted text.
Common processors include strip, lower, and upper.
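Conceptually, each processor is just a string transformation applied in order to the extracted value. A minimal sketch (not Onion Peeler's actual internals) might look like:

```python
# Hypothetical sketch of processor chaining; the names mirror the
# processors mentioned above, not Onion Peeler's real implementation.
PROCESSORS = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
}

def apply_processors(value, names):
    """Run each named processor over the value, in order."""
    for name in names:
        value = PROCESSORS[name](value)
    return value

raw = "\t  Example Mirror \n"
clean = apply_processors(raw, ["strip", "lower"])
print(clean)  # "example mirror"
```

Order matters: strip before lower is equivalent here, but chains involving truncation or splitting generally are not commutative.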
Step 4: Verification with scrapy parse¶
Before committing to a long-running crawl, use the parse command to verify your "Map."
Proxy Requirement
Ensure your Docker stack (VPN + Tor) is running, as the application routes all traffic through the proxy chain to protect your identity.
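Assuming the spider ID from Step 2, a one-page verification run looks like this; scrapy parse fetches the URL, passes the response through your spider's callback, and prints the scraped items:

```
scrapy parse --spider=example http://example.onion
```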
What to look for:
- Success: A list of items with the correct titles and URLs appears.
- Empty List: Your container selector is likely wrong.
- Partial Data: Your field selectors (e.g., title) aren't matching anything within the container.
Step 5: Automated Traversal (Pagination)¶
To scrape more than just the first page, we define how the application finds the "Next" button.
- Find the selector in the shell:
response.css("a.next-page::attr(href)").get()
- Add the block to example.toml:
[site.pagination]
enabled = true
type = "next_button"
selector = "a.next-page::attr(href)"
max_pages = 5 # Safety limit to prevent infinite loops
Summary¶
By creating this TOML file, you've programmed the Onion Peeler engine to handle a new target without modifying the core application.
- Container: Defines the "Rows."
- Selectors: Define the "Columns."
- Processors: Define the "Cleanliness."
- Pagination: Defines the "Depth."
Run your first crawl¶
Also check out:¶
- Config Reference - A breakdown of all available site configuration options.