docling_handler

Document extraction handler using IBM Granite Docling.

Orchestrates the full document processing pipeline:

  1. Download PDF from object storage
  2. Extract content via Docling
  3. Chunk with structure awareness
  4. Generate embeddings
  5. Store in PostgreSQL
  6. Update PostgreSQL JSONB status
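
The six steps above can be read as a straight-line sequence. The following is a minimal sketch of that ordering only, not the handler's actual code: `run_pipeline`, every callable parameter, and the `"completed"` status value are hypothetical stand-ins introduced for illustration.

```python
from typing import Callable


def run_pipeline(
    storage_path: str,
    download: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
    store: Callable[[list[str], list[list[float]]], None],
    set_status: Callable[[str], None],
) -> int:
    """Run the six stages in order. All callables are hypothetical stand-ins."""
    pdf_bytes = download(storage_path)  # 1. download PDF from object storage
    text = extract(pdf_bytes)           # 2. extract content
    chunks = chunk(text)                # 3. chunk the extracted content
    vectors = embed(chunks)             # 4. generate one embedding per chunk
    store(chunks, vectors)              # 5. persist chunks and embeddings
    set_status("completed")             # 6. record status (value is assumed)
    return len(chunks)


# Usage with trivial in-memory stand-ins instead of real clients.
statuses: list[str] = []
stored: list[tuple[str, list[float]]] = []

chunk_count = run_pipeline(
    "uploads/report.pdf",
    download=lambda path: b"Intro. Body.",
    extract=lambda pdf: pdf.decode("utf-8"),
    chunk=lambda text: [part.strip() for part in text.split(".") if part.strip()],
    embed=lambda chunks: [[float(len(c))] for c in chunks],
    store=lambda chunks, vectors: stored.extend(zip(chunks, vectors)),
    set_status=statuses.append,
)
```

Passing each stage in as a callable keeps the sketch runnable without PostgreSQL, object storage, or Docling; the real handler receives its collaborators through its constructor instead.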

Classes

HardLimitExceededError

Raised when a document exceeds the hard processing limits.

ConversionResult

Result of converting DocTags to output formats.

DoclingHandler

Handles document extraction via IBM Granite Docling.

Constructor:

def __init__(self, postgres: PostgresClient, storage: StorageClient, valkey: ValkeyClient, embedding: GraniteEmbeddingClient, official_docling: OfficialDoclingClient | None = None, hybrid_chunker: DoclingHybridChunker | None = None) -> None

Methods

process_document

def process_document(self, message_data: dict[str, Any]) -> None

Process a document extraction request.

Args:

message_data: Kafka message data with the following fields:

  • document_id: str (required)
  • organization_id: str (required)
  • storage_path: str (required) - relative path in object storage bucket
  • user_id: str (optional)
  • metadata: dict (optional)
  • content_type: str (optional)
  • file_size_bytes: int (optional)
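
As an illustration of the payload shape listed above, here is a hedged sketch of a message together with a check for the three required fields. The helper `missing_required_fields`, the constant `REQUIRED_FIELDS`, and all example values are assumptions made for this example; the handler's own validation may differ.

```python
from typing import Any

# The three fields the documentation marks as required.
REQUIRED_FIELDS = ("document_id", "organization_id", "storage_path")


def missing_required_fields(message_data: dict[str, Any]) -> list[str]:
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not message_data.get(name)]


# Illustrative payload; every value here is made up.
example_message: dict[str, Any] = {
    "document_id": "doc-123",
    "organization_id": "org-456",
    "storage_path": "uploads/report.pdf",  # relative path in the bucket
    "user_id": "user-789",                 # optional
    "metadata": {"source": "upload"},      # optional
    "content_type": "application/pdf",     # optional
    "file_size_bytes": 1048576,            # optional
}

problems = missing_required_fields(example_message)
```

Running a check like this before calling `process_document` lets a consumer reject malformed messages early instead of failing partway through the pipeline.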

Functions

convert_doctags_to_formats

def convert_doctags_to_formats(doctags_list: list[str], image_bytes_list: list[bytes], document_name: str = 'Document') -> ConversionResult

Convert DocTags to Markdown and HTML formats.

TODO: Implement proper conversion using DoclingDocument.load_from_doctags(). For now, this is a simple fallback that joins the DocTags strings as plain text.

Args:

  • doctags_list: List of DocTags strings (one per page)
  • image_bytes_list: List of image bytes (one per page)
  • document_name: Name of the document

Returns: ConversionResult with markdown, html, success, and error fields
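
To make the described fallback concrete, the sketch below joins the per-page DocTags strings as plain text and wraps them in Markdown and HTML. It is an approximation under stated assumptions, not the module's source: the field types on `ConversionResult`, the function name `convert_doctags_fallback`, and the exact output layout (heading plus `<pre>` block) are all invented here. Only the four field names come from the documentation.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class ConversionResult:
    """Assumed shape: only the four field names are documented."""
    markdown: str = ""
    html: str = ""
    success: bool = False
    error: Optional[str] = None


def convert_doctags_fallback(
    doctags_list: list[str],
    image_bytes_list: list[bytes],
    document_name: str = "Document",
) -> ConversionResult:
    """Join DocTags pages as plain text (hypothetical fallback layout)."""
    # image_bytes_list is accepted for signature parity but unused here;
    # a plain-text join has no need for the page images.
    try:
        body = "\n\n".join(doctags_list)
        markdown = f"# {document_name}\n\n{body}"
        # Escape the raw tags so they render as text rather than markup.
        html_out = f"<h1>{escape(document_name)}</h1>\n<pre>{escape(body)}</pre>"
        return ConversionResult(markdown=markdown, html=html_out, success=True)
    except Exception as exc:  # report failure through the result object
        return ConversionResult(success=False, error=str(exc))


result = convert_doctags_fallback(["<doctag>a</doctag>", "b"], [b"", b""], "T")
```

Returning errors inside the result object, rather than raising, matches the documented `success` and `error` fields and lets callers continue with a degraded result.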