document_classifier
Document type classification with a fallback strategy.
Classification strategy, tried in order (MVP - no LLM fallback):
- Filename pattern (cheap, fast)
- First page content analysis (medium cost)
- Default to generic when neither step matches (no LLM fallback for MVP)
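The strategy above amounts to a first-match chain: each step either returns a result or passes to the next one. The following is a minimal sketch of that idea, not the module's actual implementation. The names `run_fallback_chain` and `GENERIC_DEFAULT` are hypothetical, and so are the default's confidence value and its `"default"` method label.

```python
from typing import Callable, Optional

# A step returns a result dict on a match, or None to pass to the next step.
Step = Callable[[], Optional[dict]]


def run_fallback_chain(steps: list[Step], default: dict) -> dict:
    """Return the first non-None step result, or `default` if every step passes."""
    for step in steps:
        result = step()
        if result is not None:
            return result
    return default


# Hypothetical default result; the confidence and method label are assumptions.
GENERIC_DEFAULT = {
    "document_type": "generic",
    "classification_confidence": 0.0,
    "classification_method": "default",
}
```

Ordering the steps from cheapest to most expensive means the content analysis only runs when the filename gives no answer.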
Classes
DocumentClassifier
Classifies documents by type using a fallback strategy (MVP - no LLM fallback).
Constructor:
def __init__(self)
Methods
classify
def classify(self, filename: str, first_page_text: str | None = None, document_id: str | None = None) -> dict
Classify document type using a fallback strategy.
Args:
- filename: Document filename
- first_page_text: Optional first page text (for content analysis)
- document_id: Optional document ID (for LLM classification if needed)
Returns: Dictionary with keys:
- document_type: str ('official_statement', 'acfr', 'annual_report', 'generic', 'unknown')
- classification_confidence: float (0.0-1.0)
- classification_method: str ('filename_pattern', 'content_analysis', 'llm_classification')
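The documented return shape can be written down as a typed dictionary, which makes it easier for callers to check results. This is a sketch only: `ClassificationResult`, `DOCUMENT_TYPES`, and `is_valid_result` are hypothetical names that are not part of the module. The allowed values are copied from the list above.

```python
from typing import TypedDict


class ClassificationResult(TypedDict):
    document_type: str
    classification_confidence: float
    classification_method: str


# Values taken from the documented return description.
DOCUMENT_TYPES = {"official_statement", "acfr", "annual_report", "generic", "unknown"}


def is_valid_result(result: dict) -> bool:
    """Check that a result has a known type, a confidence in [0.0, 1.0], and a method string."""
    confidence = result.get("classification_confidence")
    return (
        result.get("document_type") in DOCUMENT_TYPES
        and isinstance(confidence, (int, float))
        and 0.0 <= confidence <= 1.0
        and isinstance(result.get("classification_method"), str)
    )
```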
classify_by_filename
def classify_by_filename(self, filename: str) -> dict | None
Classify document by filename pattern.
Args: filename: Document filename
Returns: Classification result dict or None if no match
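One plausible way to implement filename matching is an ordered list of regular expressions. The patterns and the confidence value below are illustrative assumptions, not the patterns the module actually uses. The function name `classify_by_filename_sketch` is hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns, checked in order; the real module's patterns may differ.
FILENAME_PATTERNS = [
    (re.compile(r"official[\s_-]*statement", re.IGNORECASE), "official_statement"),
    (re.compile(r"acfr", re.IGNORECASE), "acfr"),
    (re.compile(r"annual[\s_-]*report", re.IGNORECASE), "annual_report"),
]


def classify_by_filename_sketch(filename: str) -> Optional[dict]:
    """Return a result dict for the first matching pattern, or None if none match."""
    for pattern, doc_type in FILENAME_PATTERNS:
        if pattern.search(filename):
            return {
                "document_type": doc_type,
                "classification_confidence": 0.9,  # assumed value
                "classification_method": "filename_pattern",
            }
    return None
```

Returning None instead of a generic result lets the caller decide whether to try the next step.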
classify_by_content
def classify_by_content(self, first_page_text: str) -> dict | None
Classify document by first page content analysis.
Args: first_page_text: First page text content
Returns: Classification result dict or None if no match
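Content analysis could be as simple as searching the first page for characteristic phrases. The sketch below assumes that approach. The phrase table, the confidence value, and the name `classify_by_content_sketch` are all hypothetical, and the module may use a different technique.

```python
from typing import Optional

# Hypothetical phrase table (lowercase); checked in insertion order.
CONTENT_PHRASES = {
    "official_statement": ["official statement"],
    "acfr": ["annual comprehensive financial report"],
    "annual_report": ["annual report"],
}


def classify_by_content_sketch(first_page_text: str) -> Optional[dict]:
    """Return a result dict if a known phrase appears on the first page, else None."""
    # Lowercase and collapse whitespace so line breaks inside a title still match.
    text = " ".join(first_page_text.lower().split())
    for doc_type, phrases in CONTENT_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return {
                "document_type": doc_type,
                "classification_confidence": 0.7,  # assumed value
                "classification_method": "content_analysis",
            }
    return None
```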