Show HN: wxpath – Declarative web crawling in XPath
(github.com) | 57 points by rodricios | 6 days ago
wxpath is a declarative web crawler: crawling and scraping are expressed directly in XPath.
Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:
import wxpath
# Crawl, extract fields, build a Wikipedia knowledge graph
path_expr = """
url('https://en.wikipedia.org/wiki/Expression_language')
  ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
  /map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'url': string(base-uri(.)),
    'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
    'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
  }
"""
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)
The key addition is a `url(...)` operator that fetches a page and returns its HTML for further XPath processing, and `///url(...)` for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions); a single-page sketch follows the feature list below.

Features:
- Async/concurrent crawling with streaming results
- Scrapy-inspired auto-throttle and polite crawling
- Hook system for custom processing
- CLI for quick experiments
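In the simplest case, `url(...)` with no `///url(...)` step fetches only the seed page, and the rest of the expression is ordinary XPath over that one document. A minimal sketch (the HN selector is an assumption about its current markup, and omitting max_depth assumes it is optional when there is nothing to traverse):

import wxpath

# Single-page extraction: url(...) fetches one document; with no
# ///url(...) step, only the seed URL is requested.
path_expr = """
url('https://news.ycombinator.com')
//span[@class='titleline']/a/@href ! string(.)
"""

for href in wxpath.wxpath_async_blocking_iter(path_expr):
    print(href)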
Another example: paginating through HN comment pages (via the "follow=" argument) and extracting data:
url('https://news.ycombinator.com',
    follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
//tr[@class='athing']
/map {
  'text': .//div[@class='comment']//text(),
  'user': .//a[@class='hnuser']/@href,
  'parent_post': .//span[@class='onstory']/a/@href
}
Limitations: HTTP-only (no JS rendering yet), no crawl persistence.
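Until persistence lands, the streaming iterator makes an ad-hoc substitute easy: write each item to disk as it arrives. A sketch, assuming the yielded items are JSON-serializable:

import json
import wxpath

path_expr = "url('https://example.com')//title ! string(.)"

# Stream results to newline-delimited JSON as they arrive, so a long
# crawl's output survives an interruption up to that point.
with open("items.jsonl", "w") as out:
    for item in wxpath.wxpath_async_blocking_iter(path_expr):
        out.write(json.dumps(item) + "\n")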
Both limitations are on the roadmap if there's interest.

GitHub: https://github.com/rodricios/wxpath
PyPI: pip install wxpath
I'd love feedback on the expression syntax and any use cases this might unlock.
Thanks!
It's impressive that wxpath implements the DSL as an extension of XPath syntax. I hadn't quite thought of it that way.
I routinely used a mix of XPath and arbitrary code for Web scraping (as implied in the intro to "https://docs.racket-lang.org/html-parsing/").
Then I made some DSLs to express common scraping patterns more concisely and declaratively, but they ended up with a Lisp-y syntax, not looking like XPath.