Onion Peeler Documentation

Site Configuration Reference (TOML)

This document defines the structure and available fields for site configuration files in `config/sites/*.toml`.

Site Metadata (`[site]`)

| Field | Type | Description |
| --- | --- | --- |
| `name` | `string` | The unique name of the spider (used for `scrapy crawl`). |
| `base_url` | `string` | The starting URL for the spider. |
| `allowed_domains` | `list` | Domains the spider is allowed to crawl. |

Item Extraction (`[site.items.<type>]`)

Enable or disable specific item types (e.g., `link`, `thread`, `post`).

| Field | Type | Description |
| --- | --- | --- |
| `enabled` | `boolean` | Whether this item type should be extracted. |

Selectors (`[site.selectors.<type>]`)

Map HTML elements to data fields using CSS or XPath.

| Field | Type | Description |
| --- | --- | --- |
| `container` | `string` | The selector that encompasses each extracted item. |
| `field_name` | `string` | The selector for a specific field (e.g., `title`, `url`), relative to the container. |

Selector Engine Behavior

CSS (default): A selector string is treated as CSS unless it matches one of the XPath prefixes.
XPath: The engine switches to XPath automatically if the string starts with `xpath:`, `/`, `.//`, or `(`.
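
The prefix rule can be sketched in a few lines of Python. This is an illustration of the rule as documented here, not the project's actual code: the function name `detect_engine` is hypothetical, and stripping the `xpath:` marker before evaluation is an assumption.

```python
# Prefixes that this reference lists as triggering XPath mode.
XPATH_PREFIXES = ("xpath:", "/", ".//", "(")


def detect_engine(selector: str) -> tuple[str, str]:
    """Return (engine, expression) for a selector string from a site config."""
    if selector.startswith("xpath:"):
        # Assumption: the explicit marker is removed before evaluation.
        return "xpath", selector[len("xpath:"):]
    if selector.startswith(XPATH_PREFIXES):
        return "xpath", selector
    # Everything else falls through to the default engine.
    return "css", selector


print(detect_engine("a.link::attr(href)"))
print(detect_engine(".//a/@href"))
print(detect_engine("xpath:ancestor::div//text()"))
```

One consequence of the rule is that a relative XPath expression that starts with an axis name, such as `ancestor::div`, would be read as CSS unless it carries the explicit `xpath:` prefix.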

Processors (`[site.selectors.<type>.processors]`)

Apply post-processing functions to extracted field data.

| Field | Type | Description |
| --- | --- | --- |
| `field_name` | `list` | A list of processor names (e.g., `["strip"]`). |
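
As a rough sketch of how a list of processor names might be applied to an extracted value, consider the following. The registry `PROCESSORS` and the function `apply_processors` are hypothetical names; only the `strip` processor is mentioned in this reference, and applying processors in list order is an assumption.

```python
from typing import Callable

# Hypothetical registry mapping processor names to callables.
# Only "strip" appears in this reference; how the project registers
# processors internally is an assumption.
PROCESSORS: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
}


def apply_processors(value: str, names: list[str]) -> str:
    """Run the named processors over value, in list order."""
    for name in names:
        try:
            func = PROCESSORS[name]
        except KeyError:
            # Fail loudly on a name the registry does not know.
            raise ValueError(f"unknown processor: {name!r}") from None
        value = func(value)
    return value


print(repr(apply_processors("  Example Service \n", ["strip"])))  # 'Example Service'
```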

Pagination (`[site.pagination]`)

Configure automatic traversal of multiple pages.

| Field | Type | Description |
| --- | --- | --- |
| `enabled` | `boolean` | Whether pagination is active. |
| `type` | `string` | Currently supports `next_button`. |
| `selector` | `string` | The selector for the "Next" page link. |
| `max_pages` | `int` | The maximum number of pages to crawl. |
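
The interaction between these fields can be illustrated with a small decision function. This is a sketch under stated assumptions, not the spider's real logic: `next_page_url` is a hypothetical helper, `next_href` stands for whatever the `selector` extracted from the current page, and counting the first page towards `max_pages` is an assumption.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_url(pagination: dict, current_url: str,
                  next_href: Optional[str], pages_crawled: int) -> Optional[str]:
    """Return the absolute URL of the next page, or None to stop."""
    if not pagination.get("enabled", False):
        return None
    if pagination.get("type") != "next_button":
        return None
    if next_href is None:
        # The "Next" selector matched nothing: last page reached.
        return None
    # Assumption: max_pages counts every page fetched, including the first.
    if pages_crawled >= pagination["max_pages"]:
        return None
    # Resolve relative links against the page they were found on.
    return urljoin(current_url, next_href)


cfg = {"enabled": True, "type": "next_button",
       "selector": "a.next::attr(href)", "max_pages": 50}
print(next_page_url(cfg, "https://example.com/list", "/page/2", pages_crawled=1))
print(next_page_url(cfg, "https://example.com/page/50", "/page/51", pages_crawled=50))
```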

Example Configuration

```toml
[site]
name = "daunt"
base_url = "https://daunt.link"
allowed_domains = ["daunt.link"]

[site.items.link]
enabled = true

[site.selectors.link]
container = "div.mirror-list .mirror"
title = "xpath:ancestor::div[contains(@class,'service-item')]//a[@class='service-name']//text()"
url = "a.link::attr(href)"

[site.selectors.link.processors]
title = ["strip"]

[site.pagination]
enabled = true
type = "next_button"
selector = "a.next::attr(href)"
max_pages = 50
```