Creating a Custom Site Configuration

This tutorial walks you through the process of adding a new website/forum to the scraper using a configuration file. We will move from inspecting raw HTML to building a functional, paginated scraper without writing a single line of Python.

The "No-Code" Philosophy

Onion Peeler uses a Data-Driven Architecture. Instead of writing code for each website, you define a "Map" (the TOML file) that tells the Scrapy engine:

  1. Where to go (Base URL).
  2. What to look for (Selectors).
  3. How to clean the data (Processors).
  4. Where to go next (Pagination).


Step 1: Inspect the Target with Scrapy Shell

Before writing a config, we can explore the site's HTML structure with the scrapy shell command. This lets us test our assumptions in real time.

Assume we are targeting a directory at http://example.onion.

uv run scrapy shell "http://example.onion"

Identifying the "Container"

Most sites list items (threads, links, mirrors) in a repeating pattern. You must find the Container—the smallest HTML block that holds exactly one item.

# If every mirror is in a <div class="mirror-entry">
containers = response.css("div.mirror-entry")
print(f"Found {len(containers)} items")

Testing Relative Selectors

Once you have the container, all field selectors (title, url, date) must be relative to it.

item = containers[0]
# Use relative CSS
item.css("h3.title::text").get()
# Or relative XPath (starts with ./)
item.xpath("./span[@class='url']/text()").get()
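The relative-vs-absolute distinction can also be sanity-checked offline. The sketch below uses the standard library's xml.etree.ElementTree (which understands a small XPath subset) rather than Scrapy's selector engine, purely to illustrate why field queries must be scoped to one container; the markup is hypothetical:

```python
# Illustration only: stdlib ElementTree, not Scrapy's selectors.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="mirror-entry">
    <h3 class="title">Mirror One</h3>
    <span class="url">http://one.onion</span>
  </div>
  <div class="mirror-entry">
    <h3 class="title">Mirror Two</h3>
    <span class="url">http://two.onion</span>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# The repeating container block from Step 1.
containers = root.findall(".//div[@class='mirror-entry']")

# A relative query ('./') is scoped to one container, so each item
# yields its own title/url instead of always matching the first on the page.
for item in containers:
    title = item.find("./h3[@class='title']").text
    url = item.find("./span[@class='url']").text
```

An absolute query inside the loop (e.g. searching from `root` each time) would return "Mirror One" for every item, which is the most common cause of duplicated field values.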

Step 2: Anatomy of a Site Config

Create a new file: config/sites/example.toml. Below is the breakdown of a standard configuration.

1. The [site] Block (Metadata)

This identifies the spider within the application.

[site]
name = "example"                  # The ID used in 'scrapy crawl example'
base_url = "http://example.onion" # The entry point
allowed_domains = ["example.onion"] # Security boundary

2. The [site.items] Block (Schema)

This enables specific data models. Onion Peeler supports types like link, post, and thread.

[site.items.link]
enabled = true  # Tells the engine we are looking for LinkItem objects
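Multiple item types can be enabled side by side. For instance, assuming the target also hosts forum threads (a hypothetical addition for this site):

```toml
[site.items.link]
enabled = true

[site.items.thread]
enabled = true  # Also emit ThreadItem objects from the same crawl
```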

3. The [site.selectors] Block (The Map)

This is the heart of the configuration.

[site.selectors.link]
# The 'Row' or 'Block' identified in Step 1
container = "div.mirror-entry" 

# 'Cells' or 'Fields' relative to the container
title = "h3.title::text"
url = "a.link-out::attr(href)"

Selector Engine Auto-Detection

Onion Peeler assumes CSS by default. However, if your string starts with xpath:, /, .//, or (, it automatically switches to the XPath engine.
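For example, these two container values should match the same elements; the second is picked up by the XPath engine because it starts with .//:

```toml
[site.selectors.link]
container = "div.mirror-entry"                # CSS (the default)
# container = ".//div[@class='mirror-entry']" # equivalent XPath form
```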


Step 3: Data Cleansing (Processors)

Dark web HTML is often messy: extra tabs, newlines, and other "noise." Processors are transformation functions applied to the extracted text.

[site.selectors.link.processors]
title = ["strip"] # Removes leading/trailing whitespace

Common processors include strip, lower, and upper.
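Conceptually, a processor list is applied left to right to each extracted string. The sketch below is a hypothetical illustration of that chaining, not Onion Peeler's actual implementation:

```python
# Hypothetical processor registry; names mirror the config values.
PROCESSORS = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
}

def apply_processors(value: str, names: list[str]) -> str:
    """Apply each named processor to the value, in order."""
    for name in names:
        value = PROCESSORS[name](value)
    return value

# title = ["strip", "lower"] in the TOML would behave like:
cleaned = apply_processors("\n\t  Hidden Wiki Mirror  ", ["strip", "lower"])
# cleaned == "hidden wiki mirror"
```

Order matters: `["strip", "lower"]` trims first and then lowercases, which is why the config takes a list rather than a set.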


Step 4: Verification with scrapy parse

Before committing to a long-running crawl, use the parse command to verify your "Map."

Proxy Requirement

Ensure your Docker stack (VPN + Tor) is running, as the application routes all traffic through the proxy chain to protect your identity.

uv run scrapy parse "http://example.onion"

What to look for:

  • Success: A list of items with the correct titles and URLs appears.
  • Empty list: Your container selector is likely wrong.
  • Partial data: Your field selectors (e.g., title) aren't matching anything within the container.


Step 5: Automated Traversal (Pagination)

To scrape more than just the first page, we define how the application finds the "Next" button.

  1. Find the selector in the shell: response.css("a.next-page::attr(href)").get()
  2. Add the block to example.toml:
[site.pagination]
enabled = true
type = "next_button"
selector = "a.next-page::attr(href)"
max_pages = 5 # Safety limit to prevent infinite loops
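To see why the max_pages safety limit matters, here is a toy simulation (not the engine's source) of the next_button strategy on a site whose last page links back to the first, forming a crawler trap:

```python
# Toy model: next_links maps each page URL to the URL its
# "a.next-page" link points at (None if absent).
def follow_next(next_links: dict, start: str, max_pages: int) -> list:
    """Walk 'next' links until none remains or max_pages is reached."""
    visited = []
    url = start
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        url = next_links.get(url)
    return visited

# Three pages whose last "next" link cycles back to page one:
links = {"/p1": "/p2", "/p2": "/p3", "/p3": "/p1"}
pages = follow_next(links, "/p1", max_pages=5)
# Stops after 5 pages even though the links cycle forever.
```

Without the cap, the walk would never terminate; real crawlers also deduplicate visited URLs, but a hard page limit is the simplest guard.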

Summary

By creating this TOML file, you've programmed the Onion Peeler engine to handle a new target without modifying the core application.

  • Container: Defines the "Rows."
  • Selectors: Define the "Columns."
  • Processors: Define the "Cleanliness."
  • Pagination: Defines the "Depth."
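Assembled, the complete config/sites/example.toml from this tutorial looks like this:

```toml
[site]
name = "example"
base_url = "http://example.onion"
allowed_domains = ["example.onion"]

[site.items.link]
enabled = true

[site.selectors.link]
container = "div.mirror-entry"
title = "h3.title::text"
url = "a.link-out::attr(href)"

[site.selectors.link.processors]
title = ["strip"]

[site.pagination]
enabled = true
type = "next_button"
selector = "a.next-page::attr(href)"
max_pages = 5
```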

Run your first crawl

uv run scrapy crawl example -o data/example_output.json
