Site Configuration Reference (TOML)
This document defines the structure and available fields for site configuration files in config/sites/*.toml.
Site Metadata ([site])
| Field | Type | Description |
|---|---|---|
| name | string | The unique name of the spider (used for scrapy crawl). |
| base_url | string | The starting URL for the spider. |
| allowed_domains | list | Domains the spider is allowed to crawl. |
Item Extraction ([site.items.<type>])
Enable or disable specific item types (e.g., link, thread, post).
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Whether this item type should be extracted. |
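
One way a loader could decide which item types to extract is sketched below. It works on the dictionary a TOML parser would return; the enabled_item_types helper is hypothetical, and treating a missing enabled key as disabled is an assumption rather than documented behavior.

```python
# Parsed form of the [site.items.<type>] tables, as a TOML parser would return them.
config = {
    "site": {
        "items": {
            "link": {"enabled": True},
            "thread": {"enabled": False},
            "post": {},  # no "enabled" key
        }
    }
}


def enabled_item_types(config: dict) -> list[str]:
    """Return item types whose table sets enabled = true (hypothetical helper).

    Assumption: a missing "enabled" key is treated as disabled.
    """
    items = config.get("site", {}).get("items", {})
    return [name for name, opts in items.items() if opts.get("enabled", False)]


print(enabled_item_types(config))  # → ['link']
```
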
Selectors ([site.selectors.<type>])
Map HTML elements to data fields using CSS or XPath.
| Field | Type | Description |
|---|---|---|
| container | string | The selector that encompasses each extracted item. |
| field_name | string | The selector for a specific field (e.g., title, url), relative to the container. |
Selector Engine Behavior
- CSS (default): A selector string is interpreted as CSS unless it matches one of the XPath prefixes below.
- XPath: The engine switches to XPath automatically if the string starts with "xpath:", "/", ".//", or "(".
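
The prefix rule above can be sketched as a small function. The detect_engine name is hypothetical, and stripping the explicit xpath: marker before evaluation is an assumption; the reference only states which prefixes trigger XPath.

```python
# Prefixes that switch the engine to XPath, per the rule above.
XPATH_PREFIXES = ("xpath:", "/", ".//", "(")


def detect_engine(selector: str) -> tuple[str, str]:
    """Return (engine, expression) for a selector string.

    Assumption: the explicit "xpath:" marker is removed before evaluation;
    the other prefixes are part of the XPath expression and are kept.
    """
    if selector.startswith("xpath:"):
        return "xpath", selector[len("xpath:"):]
    if selector.startswith(XPATH_PREFIXES):
        return "xpath", selector
    return "css", selector


for example in ("a.link::attr(href)", "//div/a/@href", "xpath:ancestor::div"):
    print(example, "->", detect_engine(example))
```

Note that "/" also covers expressions beginning with "//", while ".//" needs its own entry because it starts with a dot, which would otherwise read as a CSS class selector.
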
Processors ([site.selectors.<type>.processors])
Apply post-processing functions to extracted field data.
| Field | Type | Description |
|---|---|---|
| field_name | list | A list of processor names (e.g., ["strip"]). |
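
A rough sketch of how named processors might be applied to an extracted item follows. The PROCESSORS registry and apply_processors helper are hypothetical; only "strip" is named in this reference, and running processors in list order is an assumption.

```python
from typing import Callable

# Hypothetical registry; only "strip" is named in this reference.
PROCESSORS: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
}


def apply_processors(value: str, names: list[str]) -> str:
    """Run the named processors over a value, in list order (assumed)."""
    for name in names:
        try:
            processor = PROCESSORS[name]
        except KeyError:
            raise ValueError(f"unknown processor: {name!r}") from None
        value = processor(value)
    return value


# Parsed form of [site.selectors.link.processors].
processors = {"title": ["strip"]}
item = {"title": "  Example Service \n", "url": "https://example.com"}

# Fields without a processor list are passed through unchanged.
cleaned = {
    field: apply_processors(value, processors.get(field, []))
    for field, value in item.items()
}
print(cleaned)
```
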
Pagination ([site.pagination])
Configure automatic traversal of multiple pages.
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Whether pagination is active. |
| type | string | Currently supports next_button. |
| selector | string | The selector for the "Next" page link. |
| max_pages | int | The maximum number of pages to crawl. |
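
To illustrate how next_button pagination and max_pages could interact, here is a simplified sketch that follows "Next" links through an in-memory map instead of fetching real pages. The crawl_pages helper is hypothetical; counting the first page toward max_pages, and visiting only the first page when pagination is disabled, are both assumptions.

```python
from typing import Callable, Iterator, Optional


def crawl_pages(
    start_url: str,
    next_url_of: Callable[[str], Optional[str]],
    max_pages: int,
    enabled: bool = True,
) -> Iterator[str]:
    """Yield page URLs, following the "Next" link until it is missing
    or max_pages is reached.

    Assumptions: the first page counts toward max_pages, and with
    pagination disabled only the first page is visited.
    """
    url: Optional[str] = start_url
    visited = 0
    limit = max_pages if enabled else 1
    while url is not None and visited < limit:
        yield url
        visited += 1
        url = next_url_of(url)


# Stand-in for applying the configured selector to a fetched page.
LINKS = {"/p1": "/p2", "/p2": "/p3", "/p3": "/p4", "/p4": None}

print(list(crawl_pages("/p1", LINKS.get, max_pages=3)))  # → ['/p1', '/p2', '/p3']
```
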
Example Configuration

    [site]
    name = "daunt"
    base_url = "https://daunt.link"
    allowed_domains = ["daunt.link"]

    [site.items.link]
    enabled = true

    [site.selectors.link]
    container = "div.mirror-list .mirror"
    title = "xpath:ancestor::div[contains(@class,'service-item')]//a[@class='service-name']//text()"
    url = "a.link::attr(href)"

    [site.selectors.link.processors]
    title = ["strip"]

    [site.pagination]
    enabled = true
    type = "next_button"
    selector = "a.next::attr(href)"
    max_pages = 50