@happyvertical/documents

License: MIT

Document processing with hierarchical structure. Currently supports PDF documents with text extraction, automatic document management system detection (WordPress Download Manager, CivicWeb, DocuShare), and file caching. Uses @happyvertical/spider for web page analysis and @happyvertical/pdf for PDF text extraction.

Installation

npm install @happyvertical/documents
# or
pnpm add @happyvertical/documents

Published to GitHub Packages (npm.pkg.github.com). Requires @happyvertical/files, @happyvertical/pdf, @happyvertical/spider, and @happyvertical/utils as workspace dependencies.

Quick Start

import { fetchDocument } from '@happyvertical/documents';

// Process a local PDF
const doc = await fetchDocument('file:///path/to/report.pdf');

for (const part of doc.parts) {
  console.log(part.title);
  console.log(part.content);
}

// Fetch a remote PDF (auto-detected from URL extension)
const remote = await fetchDocument('https://example.com/report.pdf');
console.log(remote.parts[0].content);

Usage

Document Management System Detection

When fetching web URLs, the package uses @happyvertical/spider to detect document management systems and extract direct PDF links:

// WordPress Download Manager URL — spider detects the PDF link automatically
const doc = await fetchDocument(
  'https://example.com/download/meeting-minutes/',
  { scraper: 'basic', spider: 'dom' }
);

Override MIME Type

const doc = await fetchDocument('https://example.com/download?id=123', {
  type: 'application/pdf',
});

Cache Control

const doc = await fetchDocument('https://example.com/report.pdf', {
  cacheDir: './my-cache',
  cache: true,
  cacheExpiry: 600_000, // 10 minutes
});
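The cacheExpiry option is an age window in milliseconds: a cached download is reused only while it is younger than the window. A minimal sketch of that semantics follows; isCacheFresh is an illustrative helper, not part of the @happyvertical/documents API.

```typescript
// Illustrative helper: a cached file is reused only while its age
// (now - time it was written) is under the expiry window.
function isCacheFresh(
  cachedAtMs: number,      // when the file was written to cacheDir
  nowMs: number,           // current time
  cacheExpiryMs = 300_000, // package default: 5 minutes
): boolean {
  return nowMs - cachedAtMs < cacheExpiryMs;
}

// A 10-minute-old entry is stale under the default 5-minute window,
// but fresh once cacheExpiry is raised past its age.
console.log(isCacheFresh(0, 600_000));          // false
console.log(isCacheFresh(0, 600_000, 900_000)); // true
```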

API Reference

fetchDocument(url, options?)

Main factory function. Detects document format, selects the appropriate processor, and returns structured content.

  • url string — Document URL or file path (file://, http://, https://)
  • options FetchDocumentOptions — See below
  • Returns Promise<Document>
  • Throws if no processor is available for the detected MIME type
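The detect-and-dispatch flow described above can be sketched as follows. detectMimeType, selectProcessor, and the processor registry here are illustrative stand-ins, not the package's actual internals.

```typescript
// Sketch of fetchDocument's detect-and-dispatch behavior:
// detect a MIME type (or accept the `type` override), then look up
// a processor for it, throwing when none is registered.
const processors = new Map<string, string>([
  ['application/pdf', 'PDFProcessor'],
]);

function detectMimeType(url: string, override?: string): string {
  if (override) return override; // options.type wins over detection
  if (new URL(url).pathname.toLowerCase().endsWith('.pdf')) {
    return 'application/pdf';
  }
  return 'application/octet-stream';
}

function selectProcessor(url: string, override?: string): string {
  const type = detectMimeType(url, override);
  const processor = processors.get(type);
  if (!processor) {
    // mirrors the documented behavior: throw when no processor
    // supports the detected MIME type
    throw new Error(`No processor available for MIME type: ${type}`);
  }
  return processor;
}
```

This is why the `type: 'application/pdf'` override shown earlier matters for URLs such as /download?id=123, whose extension gives detection nothing to go on.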

FetchDocumentOptions

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | auto-detected | Override MIME type detection |
| extractImages | boolean | true | Extract images from document (stub; currently returns []) |
| runOcr | boolean | true for PDFs | Run OCR on extracted images (stub) |
| cacheDir | string | OS temp dir | Directory for caching downloaded files |
| cache | boolean | true | Enable/disable spider fetch caching |
| cacheExpiry | number | 300000 | Cache expiry in milliseconds |
| scraper | 'basic' \| 'crawlee' | 'basic' | Scraper type for content extraction |
| spider | 'simple' \| 'dom' \| 'crawlee' | 'dom' | Spider adapter for fetching web pages |
| headers | Record<string, string> | (none) | Custom HTTP headers for spider requests |
| timeout | number | 30000 | Request timeout in milliseconds |
| maxDuration | number | (none) | Max scraping time in milliseconds |
| maxInteractions | number | (none) | Max interactions for advanced scrapers |

Document (class)

Base document handler. Manages downloading, caching, and local file path resolution. Used internally by processors; can also be used directly via Document.create(url, options).

PDFProcessor

Implements DocumentProcessor. Extracts text from PDF files, validates PDF headers (detects HTML cache poisoning), and caches processed results.
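The header validation guards against a cache entry that claims to be a PDF but actually holds an HTML error page. A minimal sketch of such a check, assuming the standard "%PDF-" magic bytes; looksLikePdf is illustrative, not the package's actual function.

```typescript
// A well-formed PDF begins with the magic bytes "%PDF-"; a poisoned
// cache entry holding HTML typically begins with "<". Checking the
// first bytes catches the mismatch before parsing.
function looksLikePdf(firstBytes: Uint8Array): boolean {
  const magic = '%PDF-';
  if (firstBytes.length < magic.length) return false;
  for (let i = 0; i < magic.length; i++) {
    if (firstBytes[i] !== magic.charCodeAt(i)) return false;
  }
  return true;
}

const encoder = new TextEncoder();
console.log(looksLikePdf(encoder.encode('%PDF-1.7\n')));      // true
console.log(looksLikePdf(encoder.encode('<!DOCTYPE html>'))); // false
```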

getTitleFromUrl(url, defaultTitle?)

Extracts a human-readable title from a URL by parsing the filename, removing extensions, and decoding URL-encoded characters.
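The described behavior (take the filename, drop the extension, decode percent-encoding) can be sketched as below. This is an illustrative reimplementation; the package's actual getTitleFromUrl may handle edge cases differently, and the 'Untitled' fallback is an assumption.

```typescript
// Illustrative sketch: filename from the URL path, extension stripped,
// percent-encoding decoded; falls back to a default when no filename exists.
function titleFromUrl(url: string, defaultTitle = 'Untitled'): string {
  const pathname = new URL(url).pathname;
  const filename = pathname.split('/').filter(Boolean).pop();
  if (!filename) return defaultTitle;
  const withoutExt = filename.replace(/\.[^.]+$/, ''); // strip ".pdf" etc.
  return decodeURIComponent(withoutExt) || defaultTitle;
}

console.log(titleFromUrl('https://example.com/docs/annual%20report.pdf'));
// "annual report"
```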

Types

interface Document {
  url: string;
  type: string;
  parts: DocumentPart[];
  metadata?: Record<string, any>;
}

interface DocumentPart {
  id: string;
  title: string;
  content: string;
  type: 'text' | 'html' | 'markdown';
  images?: DocumentImage[];
  metadata?: Record<string, any>;
  parts?: DocumentPart[];
}

interface DocumentImage {
  id: string;
  url: string;
  localPath?: string;
  altText?: string;
  ocrText?: string;
  position?: number;
  metadata?: { width?: number; height?: number; format?: string };
}

interface DocumentProcessor {
  process(url: string, options?: FetchDocumentOptions): Promise<Document>;
  supports(type: string): boolean;
}
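A custom processor only has to satisfy the DocumentProcessor contract. The sketch below shows a minimal plain-text processor; it fabricates its content, renames Document to ProcessedDocument to avoid clashing with the DOM's Document type, and says nothing about registration, which this README does not describe.

```typescript
// Minimal interfaces restated so the example is self-contained;
// `ProcessedDocument` stands in for the package's `Document` type.
interface ProcessedDocumentPart {
  id: string;
  title: string;
  content: string;
  type: 'text' | 'html' | 'markdown';
}

interface ProcessedDocument {
  url: string;
  type: string;
  parts: ProcessedDocumentPart[];
}

interface DocumentProcessor {
  process(url: string): Promise<ProcessedDocument>;
  supports(type: string): boolean;
}

// Hypothetical processor: claims text/plain and returns one part.
// A real processor would download or read the file before building parts.
class PlainTextProcessor implements DocumentProcessor {
  supports(type: string): boolean {
    return type === 'text/plain';
  }

  async process(url: string): Promise<ProcessedDocument> {
    return {
      url,
      type: 'text/plain',
      parts: [{ id: 'part-1', title: 'Body', content: '...', type: 'text' }],
    };
  }
}
```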

License

MIT