document_classifier

Document type classification with a fallback strategy.

Classification strategy (MVP - no LLM fallback):

  1. Filename pattern (cheap, fast)
  2. First page content analysis (medium cost)
  3. Default to generic (no LLM fallback for MVP)
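The three steps above can be sketched as a simple chain in which each step either returns a result or defers to the next one. This is a minimal illustration, not the module's actual implementation: the function name `classify_with_fallback`, the `Step` alias, and the default result's confidence (`0.0`) and method (`'default'`) are assumptions made for the example.

```python
from typing import Callable, Optional

# Each step returns a result dict on a match, or None to defer to the next step.
Step = Callable[[str], Optional[dict]]


def classify_with_fallback(
    filename: str,
    first_page_text: Optional[str],
    by_filename: Step,
    by_content: Step,
) -> dict:
    """Hypothetical sketch of the fallback chain (no LLM step)."""
    # Step 1: filename pattern (cheap, fast).
    result = by_filename(filename)
    if result is not None:
        return result

    # Step 2: first page content analysis, only when text was supplied.
    if first_page_text:
        result = by_content(first_page_text)
        if result is not None:
            return result

    # Step 3: default to generic. The confidence and method values below are
    # assumed placeholders, not documented behaviour.
    return {
        "document_type": "generic",
        "classification_confidence": 0.0,
        "classification_method": "default",
    }
```

Ordering the steps from cheapest to most expensive means the first page text is only inspected when the filename gives no answer.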

Classes

DocumentClassifier

Classifies documents by type using a fallback strategy (MVP - no LLM fallback).

Constructor:

def __init__(self)

Methods

classify

def classify(self, filename: str, first_page_text: str | None = None, document_id: str | None = None) -> dict

Classify document type using a fallback strategy.

Args:

  • filename: Document filename
  • first_page_text: Optional first page text (for content analysis)
  • document_id: Optional document ID (for LLM classification if needed; the MVP has no LLM fallback)

Returns: Dictionary with keys:

  • document_type: str ('official_statement', 'acfr', 'annual_report', 'generic', 'unknown')
  • classification_confidence: float (0.0-1.0)
  • classification_method: str ('filename_pattern', 'content_analysis', 'llm_classification')
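A caller may want to sanity-check the returned dictionary against the keys and values listed above. The helper below is a hypothetical sketch, not part of the module: the name `check_classification_result` is an assumption, and it only checks that the method is a string because the default step's method value is not documented.

```python
# Document types listed in the Returns section above.
DOCUMENT_TYPES = {"official_statement", "acfr", "annual_report", "generic", "unknown"}
REQUIRED_KEYS = {"document_type", "classification_confidence", "classification_method"}


def check_classification_result(result: dict) -> list[str]:
    """Return a list of problems found in a result dict (empty if none)."""
    missing = REQUIRED_KEYS - set(result)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if result["document_type"] not in DOCUMENT_TYPES:
        problems.append(f"unexpected document_type: {result['document_type']!r}")

    confidence = result["classification_confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence!r}")

    if not isinstance(result["classification_method"], str):
        problems.append("classification_method is not a string")

    return problems
```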

classify_by_filename

def classify_by_filename(self, filename: str) -> dict | None

Classify document by filename pattern.

Args:

  • filename: Document filename

Returns: Classification result dict or None if no match
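Filename matching of this kind is often done with a short table of regular expressions. The sketch below illustrates the idea only: the patterns, the `0.9` confidence, and the name `classify_by_filename_sketch` are assumptions for the example, since the page does not document the patterns the classifier actually uses.

```python
import re
from typing import Optional

# Hypothetical patterns; the real classifier's patterns are not documented here.
FILENAME_PATTERNS = [
    ("official_statement", re.compile(r"official[\s_-]*statement", re.IGNORECASE)),
    ("acfr", re.compile(r"acfr", re.IGNORECASE)),
    ("annual_report", re.compile(r"annual[\s_-]*report", re.IGNORECASE)),
]


def classify_by_filename_sketch(filename: str) -> Optional[dict]:
    """Return a result dict for the first matching pattern, else None."""
    for document_type, pattern in FILENAME_PATTERNS:
        if pattern.search(filename):
            return {
                "document_type": document_type,
                "classification_confidence": 0.9,  # assumed value
                "classification_method": "filename_pattern",
            }
    return None
```

Returning None rather than a 'generic' result lets the caller decide whether to try the next step in the fallback chain.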

classify_by_content

def classify_by_content(self, first_page_text: str) -> dict | None

Classify document by first page content analysis.

Args:

  • first_page_text: First page text content

Returns: Classification result dict or None if no match
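One simple form of first page content analysis is to look for characteristic title phrases in the normalized text. The sketch below is illustrative only: the phrases, the `0.7` confidence, and the name `classify_by_content_sketch` are assumptions, not the classifier's documented behaviour.

```python
from typing import Optional

# Hypothetical title phrases, checked in order.
CONTENT_PHRASES = [
    ("acfr", "annual comprehensive financial report"),
    ("official_statement", "official statement"),
    ("annual_report", "annual report"),
]


def classify_by_content_sketch(first_page_text: str) -> Optional[dict]:
    """Return a result dict for the first phrase found, else None."""
    # Lowercase and collapse whitespace so titles split across lines still match.
    normalized = " ".join(first_page_text.lower().split())
    for document_type, phrase in CONTENT_PHRASES:
        if phrase in normalized:
            return {
                "document_type": document_type,
                "classification_confidence": 0.7,  # assumed value
                "classification_method": "content_analysis",
            }
    return None
```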