# @have/spider: Web Crawling and Content Extraction
Web crawling and content parsing tools for extracting structured data from websites.
## Overview

The `@have/spider` package provides tools for crawling websites and extracting clean, structured content:
- 🕷️ Web Crawling: Intelligent website crawling and navigation
- 📄 Content Extraction: Clean text and structured data extraction
- 🎯 Selector Engine: CSS and XPath selector support
- 🚦 Rate Limiting: Respectful crawling with built-in delays
- 🔄 Retry Logic: Automatic retry with exponential backoff (a configuration sketch for rate limiting and retries follows this list)
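Rate limiting and retry behavior would typically be tuned when constructing the scraper. The option names below (`rateLimitMs`, `maxRetries`, `backoffFactor`) are illustrative assumptions rather than confirmed `@have/spider` API; check the package's type definitions for the actual shape:

```typescript
import { WebScraperTool } from '@have/spider';

// Sketch only: these option names are hypothetical placeholders for
// whatever constructor options @have/spider actually exposes.
const politeScraper = new WebScraperTool({
  rateLimitMs: 1000,  // wait at least one second between requests
  maxRetries: 3,      // retry a failed request up to three times
  backoffFactor: 2,   // double the delay after each failed attempt
});
```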
## Quick Start

```typescript
import { WebScraperTool } from '@have/spider';

const scraper = new WebScraperTool();

// Extract the readable content from a URL
const content = await scraper.extractContent('https://example.com');
console.log(content.title);
console.log(content.text);

// Extract specific elements with a CSS selector
const headlines = await scraper.extractElements('https://news.site.com', 'h2.headline');
```
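The return shape of `extractElements` isn't shown above. Assuming it resolves to an array of matched elements that each carry a `text` field (an assumption on our part, not documented behavior), the results could be listed like so:

```typescript
// Assumes each matched element exposes a `text` property; verify
// against the package's actual return type before relying on this.
for (const headline of headlines) {
  console.log(headline.text);
}
```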
## Content Extraction

```typescript
// Basic content extraction
const result = await scraper.extractContent(url);
// Returns: { title, text, links, images, metadata }

// Custom extraction with CSS selectors mapped to result fields
const customData = await scraper.extract(url, {
  title: 'h1',
  price: '.price',
  description: '.product-description',
  images: 'img[src]'
});

// Extract multiple pages in one call
const urls = ['https://site.com/page1', 'https://site.com/page2'];
const results = await scraper.extractMultiple(urls);
```
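When crawling many pages, some fetches will fail. Whether `extractMultiple` rejects on the first error or returns partial results isn't specified above, so a defensive wrapper built on `extractContent` and `Promise.allSettled` (our own sketch, not a documented `@have/spider` API) keeps successes and failures separate:

```typescript
import { WebScraperTool } from '@have/spider';

// Fetch each URL independently and record per-URL success or failure.
// Illustration only: this wrapper is layered on extractContent and is
// not part of the @have/spider package itself.
async function extractAllSettled(scraper: WebScraperTool, urls: string[]) {
  const settled = await Promise.allSettled(
    urls.map((url) => scraper.extractContent(url))
  );
  return settled.map((result, i) => ({
    url: urls[i],
    content: result.status === 'fulfilled' ? result.value : undefined,
    error: result.status === 'rejected' ? String(result.reason) : undefined,
  }));
}
```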
## Integration with Content Module

```typescript
import { Content } from '@have/content';
import { WebScraperTool } from '@have/spider';

async function scrapeToContent(url: string): Promise<Content> {
  const scraper = new WebScraperTool();
  const scraped = await scraper.extractContent(url);

  // Map the scraped page onto a Content record
  const content = new Content({
    title: scraped.title,
    body: scraped.text,
    url,
    source: 'web_scraping',
    status: 'published'
  });

  await content.save();
  return content;
}
```
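Calling the helper is then a one-liner. The URL below and the `title` accessor on the saved `Content` instance are illustrative assumptions, not confirmed `@have/content` API:

```typescript
// Hypothetical usage; confirm the Content instance's field accessors.
const article = await scrapeToContent('https://example.com/post');
console.log('Saved content with title:', article.title);
```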
Full documentation coming soon...