docling_handler
Document extraction handler using IBM Granite Docling.
Orchestrates the full document processing pipeline:
- Download PDF from object storage
- Extract content via Docling
- Chunk with structure awareness
- Generate embeddings
- Store in PostgreSQL
- Update PostgreSQL JSONB status
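The six steps above can be read as an ordered chain of stages. The following is only a minimal sketch of that ordering, not the handler's actual code: `run_pipeline`, the stage functions, the context keys, and the `"completed"` status value are all hypothetical stand-ins, and each stage is a stub that fakes its work in memory.

```python
from typing import Any, Callable

# Hypothetical stage type: takes a context dict, returns an updated copy.
Stage = Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(
    context: dict[str, Any], stages: list[tuple[str, Stage]]
) -> tuple[dict[str, Any], list[str]]:
    """Run stages in order and return the final context plus completed stage names."""
    completed: list[str] = []
    for name, stage in stages:
        context = stage(context)
        completed.append(name)
    return context, completed


# Stub stages (assumptions for illustration; the real ones talk to storage,
# Docling, the embedding service, and PostgreSQL).
def download(ctx): return {**ctx, "pdf_bytes": b"%PDF-stub"}
def extract(ctx): return {**ctx, "text": "Title\n\nBody paragraph."}
def chunk(ctx): return {**ctx, "chunks": [p for p in ctx["text"].split("\n\n") if p]}
def embed(ctx): return {**ctx, "embeddings": [[float(len(c))] for c in ctx["chunks"]]}
def store(ctx): return {**ctx, "stored": len(ctx["chunks"])}
def update_status(ctx): return {**ctx, "status": "completed"}


STAGES = [
    ("download", download),
    ("extract", extract),
    ("chunk", chunk),
    ("embed", embed),
    ("store", store),
    ("update_status", update_status),
]

final_ctx, done = run_pipeline({"document_id": "doc-1"}, STAGES)
```

Keeping the list of completed stage names separate from the context makes it straightforward to report which step a failure occurred in, which is one plausible use of the status update step.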
Classes
HardLimitExceededError
Raised when document exceeds hard processing limits.
ConversionResult
Result of converting DocTags to output formats.
DoclingHandler
Handles document extraction via IBM Granite Docling.
Constructor:
def __init__(self, postgres: PostgresClient, storage: StorageClient, valkey: ValkeyClient, embedding: GraniteEmbeddingClient, official_docling: OfficialDoclingClient | None = None, hybrid_chunker: DoclingHybridChunker | None = None) -> None
Methods
process_document
def process_document(self, message_data: dict[str, Any]) -> None
Process a document extraction request.
Args: message_data: Kafka message data with the following keys:
- document_id: str (required)
- organization_id: str (required)
- storage_path: str (required) - relative path in object storage bucket
- user_id: str (optional)
- metadata: dict (optional)
- content_type: str (optional)
- file_size_bytes: int (optional)
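As an illustration of the message shape described above, here is a sample payload and a small check for the three required keys. This is a sketch under assumptions: `missing_required_fields` is a hypothetical helper, not part of the module, and every field value is made up for the example.

```python
from typing import Any

# Required keys as listed in the process_document docstring.
REQUIRED_FIELDS = ("document_id", "organization_id", "storage_path")


def missing_required_fields(message_data: dict[str, Any]) -> list[str]:
    """Return the required keys that are absent or not non-empty strings."""
    return [
        key
        for key in REQUIRED_FIELDS
        if not isinstance(message_data.get(key), str) or not message_data[key]
    ]


# Example payload; all values are invented for illustration.
example_message: dict[str, Any] = {
    "document_id": "doc-123",
    "organization_id": "org-456",
    "storage_path": "uploads/report.pdf",  # relative path in the bucket
    "user_id": "user-789",
    "metadata": {"source": "upload"},
    "content_type": "application/pdf",
    "file_size_bytes": 1048576,
}

problems = missing_required_fields(example_message)
```

Whether the real handler rejects, logs, or skips a message with missing fields is not stated in this reference, so the helper only reports which keys are missing.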
Functions
convert_doctags_to_formats
def convert_doctags_to_formats(doctags_list: list[str], image_bytes_list: list[bytes], document_name: str = 'Document') -> ConversionResult
Convert DocTags to Markdown and HTML formats.
TODO: Implement proper conversion using DoclingDocument.load_from_doctags(). For now, this is a simple fallback that joins doctags as text.
Args:
- doctags_list: List of DocTags strings (one per page)
- image_bytes_list: List of image bytes (one per page)
- document_name: Name of the document
Returns: ConversionResult with markdown, html, success, and error fields
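The "join doctags as text" fallback described above might look roughly like the sketch below. It is not the module's implementation: the function is renamed `convert_doctags_fallback` to make that clear, the field types on `ConversionResult` are assumed from the field names in the Returns line, and both the page-count check and the HTML wrapping are illustrative choices.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class ConversionResult:
    # Field names come from the reference; the types are assumptions.
    markdown: str
    html: str
    success: bool
    error: Optional[str] = None


def convert_doctags_fallback(
    doctags_list: list[str],
    image_bytes_list: list[bytes],
    document_name: str = "Document",
) -> ConversionResult:
    """Join per-page DocTags strings as plain text (no real DocTags parsing)."""
    # Assumed sanity check: one image per page of DocTags.
    if len(doctags_list) != len(image_bytes_list):
        return ConversionResult(
            markdown="",
            html="",
            success=False,
            error=(
                f"page count mismatch: {len(doctags_list)} doctags, "
                f"{len(image_bytes_list)} images"
            ),
        )
    text = "\n\n".join(doctags_list)
    markdown = f"# {document_name}\n\n{text}"
    # Escape the raw tags so they display as text instead of being parsed as HTML.
    html = f"<h1>{escape(document_name)}</h1>\n<pre>{escape(text)}</pre>"
    return ConversionResult(markdown=markdown, html=html, success=True)


result = convert_doctags_fallback(["<doctag>a</doctag>", "b"], [b"", b""], "Report")
```

Returning a result object with `success` and `error` instead of raising lets the caller decide how to record a failed conversion.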