
hybrid_chunker

Docling HybridChunker integration for tokenization-aware chunking.

Uses Docling's official HybridChunker which provides:

  • Hierarchical chunking based on document structure
  • Tokenization-aware splitting and merging
  • Context enrichment with header hierarchy
  • Provenance tracking from DoclingDocument
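The tokenization-aware splitting and merging listed above can be illustrated with a minimal sketch. It is not Docling's implementation: `count_tokens` is a hypothetical whitespace tokenizer standing in for the embedding model's tokenizer, and the merge step here greedily joins any neighbors that fit the budget, whereas a real chunker would also respect document structure.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def split_oversized(text: str, max_tokens: int) -> List[str]:
    # Cut a long section into consecutive windows of at most max_tokens words.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def split_and_merge(sections: List[str], max_tokens: int) -> List[str]:
    # Pass 1: split every section that exceeds the token budget.
    pieces: List[str] = []
    for section in sections:
        if count_tokens(section) > max_tokens:
            pieces.extend(split_oversized(section, max_tokens))
        elif section.strip():
            pieces.append(section)

    # Pass 2: merge a piece into its predecessor when the pair still fits.
    merged: List[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


result = split_and_merge(["a b", "c", "d e f g h i j"], max_tokens=4)
print(result)  # ['a b c', 'd e f g', 'h i j']
```

Running the split pass first guarantees that every piece fits the budget before any merge is attempted, so the merge pass never has to undo its own work.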

Classes

HybridChunk

A chunk from HybridChunker with metadata.

DoclingHybridChunker

Wrapper around Docling's HybridChunker for tokenization-aware chunking.

This chunker works directly on DoclingDocument objects, providing:

  • Accurate tokenization using the embedding model's tokenizer
  • Automatic splitting of oversized chunks
  • Automatic merging of undersized chunks
  • Context enrichment with header hierarchy
  • Provenance tracking (page numbers, bounding boxes)
  • Furniture filtering (excludes page headers, page footers, and footnotes for cleaner RAG results)
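As a rough illustration of two of the features above, furniture filtering and context enrichment, the sketch below works on a simplified stand-in for document items. The `Item` class, the label strings in `FURNITURE_LABELS`, and both helper functions are assumptions made for this example; they are not the wrapper's or Docling's actual types.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative label names (assumption) for content treated as furniture.
FURNITURE_LABELS = {"page_header", "page_footer", "footnote"}


@dataclass
class Item:
    label: str                      # kind of content, e.g. "text" or "page_header"
    text: str                       # raw text of the item
    headings: List[str] = field(default_factory=list)  # enclosing section headings


def body_items(items: List[Item]) -> List[Item]:
    # Keep only body content by dropping anything labeled as furniture.
    return [item for item in items if item.label not in FURNITURE_LABELS]


def contextualize(item: Item) -> str:
    # Prefix the text with its heading hierarchy, outermost heading first.
    return "\n".join([*item.headings, item.text])


items = [
    Item("page_header", "ACME Annual Report"),
    Item("text", "Revenue grew.", headings=["Results", "Q3"]),
    Item("footnote", "1. Unaudited."),
]
enriched = [contextualize(item) for item in body_items(items)]
print(enriched)  # ['Results\nQ3\nRevenue grew.']
```

Prefixing headings gives the embedding model the section context that a short passage would otherwise lack, while dropping furniture keeps repeated page text out of the index.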

Constructor:

def __init__(self, embed_model_id: str | None = None, chunk_size: int | None = None, exclude_furniture: bool = True) -> None

Methods

chunk_document

def chunk_document(self, doc: Any, page_numbers: list[int] | None = None) -> list[HybridChunk]

Chunk a DoclingDocument using HybridChunker.

Processes only body content. Furniture (page headers, page footers, footnotes) is excluded so that it does not end up in the chunks, which gives cleaner RAG results.

Args:

  • doc: DoclingDocument instance from DocumentConverter
  • page_numbers: Optional list of page numbers (deprecated; page numbers are extracted from chunk metadata instead)

Returns: List of HybridChunk objects with metadata (body content only)

Raises: Exception: If chunking fails
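Because chunk_document raises a generic Exception on failure, a caller may want to guard the call. The sketch below shows one possible pattern. It relies only on the chunk_document(doc) signature documented above; `chunk_or_empty` and `FakeChunker` are hypothetical names introduced so the example runs without Docling installed.

```python
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def chunk_or_empty(chunker: Any, doc: Any) -> List[Any]:
    """Call chunker.chunk_document(doc); log and return [] if it raises."""
    try:
        # page_numbers is omitted because the reference marks it as deprecated.
        return chunker.chunk_document(doc)
    except Exception:
        logger.exception("Chunking failed; returning no chunks")
        return []


class FakeChunker:
    """Hypothetical stand-in that mimics the documented method signature."""

    def chunk_document(self, doc: Any, page_numbers: Any = None) -> List[str]:
        if doc is None:
            raise ValueError("no document")
        return [f"chunk of {doc}"]


chunks = chunk_or_empty(FakeChunker(), "report.pdf")
failed = chunk_or_empty(FakeChunker(), None)
print(chunks, failed)  # ['chunk of report.pdf'] []
```

Returning an empty list keeps a batch job running past one bad document. In a pipeline where a silent gap would be worse than a crash, re-raising after logging is the safer choice.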