@happyvertical/documents

Document processing with hierarchical structure. Currently supports PDF documents with text extraction, automatic document management system detection (WordPress Download Manager, CivicWeb, DocuShare), and file caching. Uses @happyvertical/spider for web page analysis and @happyvertical/pdf for PDF text extraction.

Installation

npm install @happyvertical/documents
# or
pnpm add @happyvertical/documents

Published to GitHub Packages (npm.pkg.github.com). Requires @happyvertical/files, @happyvertical/pdf, @happyvertical/spider, and @happyvertical/utils as workspace dependencies.

Quick Start

import { fetchDocument } from '@happyvertical/documents';

// Process a local PDF
const doc = await fetchDocument('file:///path/to/report.pdf');

for (const part of doc.parts) {
  console.log(part.title);
  console.log(part.content);
}

// Fetch a remote PDF (auto-detected from URL extension)
const remote = await fetchDocument('https://example.com/report.pdf');
console.log(remote.parts[0].content);

Usage

Document Management System Detection

When fetching web URLs, the package uses @happyvertical/spider to detect document management systems and extract direct PDF links:

// WordPress Download Manager URL — spider detects the PDF link automatically
const doc = await fetchDocument(
  'https://example.com/download/meeting-minutes/',
  { scraper: 'basic', spider: 'dom' }
);

Override MIME Type

const doc = await fetchDocument('https://example.com/download?id=123', {
  type: 'application/pdf',
});

Cache Control

const doc = await fetchDocument('https://example.com/report.pdf', {
  cacheDir: './my-cache',
  cache: true,
  cacheExpiry: 600_000, // 10 minutes
});

API Reference

`fetchDocument(url, options?)`

Main factory function. Detects document format, selects the appropriate processor, and returns structured content.

url string — Document URL or file path (file://, http://, https://)
options FetchDocumentOptions — See below
Returns Promise<Document>
Throws if no processor is available for the detected MIME type

`FetchDocumentOptions`

Option	Type	Default	Description
`type`	`string`	auto-detected	Override MIME type detection
`extractImages`	`boolean`	`true`	Extract images from document (stub — currently returns `[]`)
`runOcr`	`boolean`	`true` for PDFs	Run OCR on extracted images (stub)
`cacheDir`	`string`	OS temp dir	Directory for caching downloaded files
`cache`	`boolean`	`true`	Enable/disable spider fetch caching
`cacheExpiry`	`number`	`300000`	Cache expiry in milliseconds
`scraper`	`'basic' \| 'crawlee'`	`'basic'`	Scraper type for content extraction
`spider`	`'simple' \| 'dom' \| 'crawlee'`	`'dom'`	Spider adapter for fetching web pages
`headers`	`Record<string, string>`	—	Custom HTTP headers for spider requests
`timeout`	`number`	`30000`	Request timeout in milliseconds
`maxDuration`	`number`	—	Max scraping time in milliseconds
`maxInteractions`	`number`	—	Max interactions for advanced scrapers

`Document` (class)

Base document handler. Manages downloading, caching, and local file path resolution. Used internally by processors; can also be used directly via Document.create(url, options).

`PDFProcessor`

Implements DocumentProcessor. Extracts text from PDF files, validates PDF headers (detects HTML cache poisoning), and caches processed results.

`getTitleFromUrl(url, defaultTitle?)`

Extracts a human-readable title from a URL by parsing the filename, removing extensions, and decoding URL-encoded characters.

Types

interface Document {
  url: string;
  type: string;
  parts: DocumentPart[];
  metadata?: Record<string, any>;
}

interface DocumentPart {
  id: string;
  title: string;
  content: string;
  type: 'text' | 'html' | 'markdown';
  images?: DocumentImage[];
  metadata?: Record<string, any>;
  parts?: DocumentPart[];
}

interface DocumentImage {
  id: string;
  url: string;
  localPath?: string;
  altText?: string;
  ocrText?: string;
  position?: number;
  metadata?: { width?: number; height?: number; format?: string };
}

interface DocumentProcessor {
  process(url: string, options?: FetchDocumentOptions): Promise<Document>;
  supports(type: string): boolean;
}

License

MIT

Installation​

Quick Start​

Usage​

Document Management System Detection​

Override MIME Type​

Cache Control​

API Reference​

fetchDocument(url, options?)​

FetchDocumentOptions​

Document (class)​

PDFProcessor​

getTitleFromUrl(url, defaultTitle?)​

Types​

License​