Site Configuration Reference (TOML)

This document defines the structure and available fields for site configuration files in config/sites/*.toml.

Site Metadata ([site])

Field            Type             Description
name             string           Unique name of the spider, used as the argument to scrapy crawl.
base_url         string           The URL where the spider starts crawling.
allowed_domains  list of strings  Domains the spider is allowed to crawl.

Item Extraction ([site.items.<type>])

Enable or disable specific item types (e.g., link, thread, post).

Field    Type     Description
enabled  boolean  Whether this item type should be extracted.
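
A small sketch of how a loader could decide which item types to extract, given an already-parsed configuration dictionary. The function name enabled_item_types is hypothetical, and treating a missing enabled key as false is an assumption, not documented behavior.

```python
def enabled_item_types(config: dict) -> list[str]:
    """Hypothetical helper: list the item types whose enabled flag is true."""
    items = config.get("site", {}).get("items", {})
    # Assumption: an item type with no "enabled" key counts as disabled.
    return [name for name, options in items.items() if options.get("enabled", False)]


# Parsed form of [site.items.link], [site.items.thread] and [site.items.post].
config = {
    "site": {
        "items": {
            "link": {"enabled": True},
            "thread": {"enabled": False},
            "post": {},
        }
    }
}
print(enabled_item_types(config))
```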

Selectors ([site.selectors.<type>])

Map HTML elements to data fields using CSS or XPath selectors.

Field         Type    Description
container     string  The selector that matches the element wrapping each extracted item.
<field_name>  string  The selector for a specific field, evaluated relative to the container. Replace <field_name> with the field's key (e.g., title, url).

Selector Engine Behavior

  • CSS (default): Any selector that does not match one of the XPath prefixes below is treated as CSS.
  • XPath: A selector is treated as XPath if it starts with xpath:, /, .//, or (.
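
The prefix rules above can be sketched as a small classifier. The function name split_selector is hypothetical, and two details are assumptions rather than documented behavior: that the xpath: marker is removed before the expression is evaluated, and that surrounding whitespace is ignored.

```python
# Prefixes that switch the engine to XPath without the explicit "xpath:" marker.
IMPLICIT_XPATH_PREFIXES = ("/", ".//", "(")
EXPLICIT_MARKER = "xpath:"


def split_selector(selector: str) -> tuple[str, str]:
    """Return (engine, expression) according to the prefix rules above."""
    text = selector.strip()  # assumption: surrounding whitespace is not significant
    if text.startswith(EXPLICIT_MARKER):
        # Assumption: the marker itself is not part of the XPath expression.
        return "xpath", text[len(EXPLICIT_MARKER):]
    if text.startswith(IMPLICIT_XPATH_PREFIXES):
        return "xpath", text
    return "css", text


for example in ("a.link::attr(href)", "xpath:ancestor::div//text()", "(//a)[1]"):
    print(example, "->", split_selector(example))
```

Under these rules, a relative expression such as ancestor::div or ./a matches none of the implicit prefixes, so it needs the explicit xpath: marker to avoid being read as CSS.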

Processors ([site.selectors.<type>.processors])

Apply post-processing functions to extracted field data.

Field         Type             Description
<field_name>  list of strings  Processor names to apply to that field's extracted value (e.g., ["strip"]). Replace <field_name> with the field's key.
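
A minimal sketch of how a processor list could be applied to an extracted value. Only strip is named in this reference; the lower processor, the PROCESSORS registry, and the apply_processors function are hypothetical, and applying processors in list order is an assumption.

```python
from typing import Callable

# Registry mapping processor names to functions.
# "strip" comes from this reference; "lower" is a hypothetical extra for illustration.
PROCESSORS: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
}


def apply_processors(value: str, names: list[str]) -> str:
    """Apply each named processor in turn (assumed order: as listed in the config)."""
    for name in names:
        if name not in PROCESSORS:
            raise ValueError(f"unknown processor: {name!r}")
        value = PROCESSORS[name](value)
    return value


# Equivalent of: title = ["strip"]
print(repr(apply_processors("  Example Service \n", ["strip"])))
```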

Pagination ([site.pagination])

Configure automatic traversal of multiple pages.

Field      Type     Description
enabled    boolean  Whether pagination is active.
type       string   The pagination strategy. The only supported value is currently next_button.
selector   string   The selector for the "Next" page link.
max_pages  integer  The maximum number of pages to crawl.
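
The next_button strategy can be sketched without any network access. Here find_next is a hypothetical stand-in for applying the configured selector to a fetched page and returning the next URL (or None when there is no "Next" link). Two points are assumptions: that the start page counts toward max_pages, and that only the start page is visited when pagination is disabled.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    start_url: str,
    find_next: Callable[[str], Optional[str]],
    pagination: dict,
) -> Iterator[str]:
    """Yield page URLs by following the "Next" link up to the configured limit."""
    # Assumption: with pagination disabled, only the start page is visited.
    limit = pagination.get("max_pages", 1) if pagination.get("enabled") else 1
    url: Optional[str] = start_url
    for _ in range(limit):
        if url is None:  # no "Next" link on the previous page
            break
        yield url
        url = find_next(url)


# Fake three-page site: each key maps to the URL behind its "Next" link.
FAKE_SITE = {"/page/1": "/page/2", "/page/2": "/page/3", "/page/3": None}

settings = {"enabled": True, "type": "next_button", "max_pages": 2}
print(list(iter_pages("/page/1", FAKE_SITE.get, settings)))
```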

Example Configuration

[site]
name = "daunt"
base_url = "https://daunt.link"
allowed_domains = ["daunt.link"]

[site.items.link]
enabled = true

[site.selectors.link]
container = "div.mirror-list .mirror"
title = "xpath:ancestor::div[contains(@class,'service-item')]//a[@class='service-name']//text()"
url = "a.link::attr(href)"

[site.selectors.link.processors]
title = ["strip"]

[site.pagination]
enabled = true
type = "next_button"
selector = "a.next::attr(href)"
max_pages = 50