document_classifier

Document type classification with a fallback strategy.

Classification strategy (MVP - no LLM fallback):

1. Filename pattern (cheap, fast)
2. First page content analysis (medium cost)
3. Default to generic (no LLM fallback for MVP)
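The ordering above can be sketched as a chain that returns the first non-None result and otherwise falls back to 'generic'. This is a minimal illustration, not the module's actual code: `classify_with_fallback`, the `Step` alias, and the default confidence and method values are all assumptions, since the source does not state what the default result contains.

```python
from typing import Callable, List, Optional

# A step is any zero-argument callable returning a result dict or None.
Step = Callable[[], Optional[dict]]


def classify_with_fallback(steps: List[Step]) -> dict:
    """Run steps from cheapest to most expensive; return the first match."""
    for step in steps:
        result = step()
        if result is not None:
            return result
    # Assumed placeholder values: the source does not document the
    # confidence or method reported for the generic default.
    return {
        "document_type": "generic",
        "classification_confidence": 0.0,
        "classification_method": "default",
    }


# Hypothetical steps: the filename check finds nothing, the content check matches.
steps = [
    lambda: None,
    lambda: {
        "document_type": "acfr",
        "classification_confidence": 0.7,
        "classification_method": "content_analysis",
    },
]
result = classify_with_fallback(steps)
```

Keeping each step behind a callable means a later step (such as an LLM call) could be appended to the list without changing the loop.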

Classes

`DocumentClassifier`

Classifies documents by type using a fallback strategy (MVP - no LLM fallback).

Constructor:

def __init__(self)

Methods

`classify`

def classify(self, filename: str, first_page_text: str | None = None, document_id: str | None = None) -> dict

Classify document type using a fallback strategy.

Args:

filename: Document filename
first_page_text: Optional first page text (for content analysis)
document_id: Optional document ID (for LLM classification if needed)

Returns: Dictionary with keys:

document_type: str ('official_statement', 'acfr', 'annual_report', 'generic', 'unknown')
classification_confidence: float (0.0-1.0)
classification_method: str ('filename_pattern', 'content_analysis', 'llm_classification')
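The result shape listed above could be expressed as a typed dictionary, which lets callers check a result before using it. The names `ClassificationResult`, `DocumentType`, `Method`, and `is_valid_result` are hypothetical and not part of the documented API; only the keys and allowed values come from the list above.

```python
from typing import Literal, TypedDict, get_args

# Allowed values copied from the documented return keys.
DocumentType = Literal[
    "official_statement", "acfr", "annual_report", "generic", "unknown"
]
Method = Literal["filename_pattern", "content_analysis", "llm_classification"]


class ClassificationResult(TypedDict):
    document_type: DocumentType
    classification_confidence: float
    classification_method: Method


def is_valid_result(result: dict) -> bool:
    """Check that a result dict uses documented values and a 0.0-1.0 confidence."""
    confidence = result.get("classification_confidence")
    return (
        result.get("document_type") in get_args(DocumentType)
        and result.get("classification_method") in get_args(Method)
        and isinstance(confidence, (int, float))
        and 0.0 <= confidence <= 1.0
    )


example: ClassificationResult = {
    "document_type": "official_statement",
    "classification_confidence": 0.9,
    "classification_method": "filename_pattern",
}
```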

`classify_by_filename`

def classify_by_filename(self, filename: str) -> dict | None

Classify a document by filename pattern.

Args: filename: Document filename

Returns: Classification result dict or None if no match
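One way filename matching might work is a short list of case-insensitive regular expressions tried in order. The patterns, the fixed confidence of 0.9, and the function name `classify_by_filename_sketch` are illustrative assumptions; the source does not list the real patterns or scores.

```python
import re
from typing import Optional

# Hypothetical patterns; the real classifier's patterns are not documented.
# Lookarounds are used instead of \b because "_" counts as a word character.
FILENAME_PATTERNS = [
    (re.compile(r"official[\s_-]*statement", re.IGNORECASE), "official_statement"),
    (re.compile(r"(?<![a-z])acfr(?![a-z])", re.IGNORECASE), "acfr"),
    (re.compile(r"annual[\s_-]*report", re.IGNORECASE), "annual_report"),
]


def classify_by_filename_sketch(filename: str) -> Optional[dict]:
    """Return a result dict for the first matching pattern, else None."""
    for pattern, document_type in FILENAME_PATTERNS:
        if pattern.search(filename):
            return {
                "document_type": document_type,
                "classification_confidence": 0.9,  # assumed value
                "classification_method": "filename_pattern",
            }
    return None
```

Returning None rather than a low-confidence guess lets the caller move on to the next strategy.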

`classify_by_content`

def classify_by_content(self, first_page_text: str) -> dict | None

Classify a document by first page content analysis.

Args: first_page_text: First page text content

Returns: Classification result dict or None if no match
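Content analysis could be as simple as looking for characteristic title phrases in the normalized first-page text. The phrase table, the confidence of 0.7, and the name `classify_by_content_sketch` are assumptions made for illustration; the actual analysis may be more involved.

```python
from typing import Optional

# Hypothetical title phrases per document type (checked in insertion order).
CONTENT_PHRASES = {
    "official_statement": ("official statement",),
    "acfr": ("annual comprehensive financial report",),
    "annual_report": ("annual report",),
}


def classify_by_content_sketch(first_page_text: str) -> Optional[dict]:
    """Match known phrases against lowercased, whitespace-collapsed text."""
    # Collapse line breaks and repeated spaces so titles split across lines match.
    normalized = " ".join(first_page_text.lower().split())
    for document_type, phrases in CONTENT_PHRASES.items():
        if any(phrase in normalized for phrase in phrases):
            return {
                "document_type": document_type,
                "classification_confidence": 0.7,  # assumed value
                "classification_method": "content_analysis",
            }
    return None
```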
