hybrid_chunker
Docling HybridChunker integration for tokenization-aware chunking.
Uses Docling's official HybridChunker, which provides:
- Hierarchical chunking based on document structure
- Tokenization-aware splitting and merging
- Context enrichment with header hierarchy
- Provenance tracking from DoclingDocument
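As a rough illustration of context enrichment with header hierarchy, the sketch below prepends a chunk's heading path to its text before embedding. It is a plain-Python stand-in, not Docling's implementation. The helper name `enrich_with_headings` and the newline separator are assumptions made for this example.

```python
def enrich_with_headings(headings: list[str], text: str) -> str:
    """Prefix chunk text with its heading path (hypothetical helper)."""
    # Drop empty headings so the result has no blank lines at the top.
    parts = [h.strip() for h in headings if h.strip()]
    parts.append(text.strip())
    return "\n".join(parts)


enriched = enrich_with_headings(
    ["User Guide", "Installation"],
    "Run the installer and restart the service.",
)
print(enriched)
```

Carrying the heading path inside the chunk text lets a short passage keep its meaning when it is embedded or retrieved on its own.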
Classes
HybridChunk
A chunk from HybridChunker with metadata.
DoclingHybridChunker
Wrapper around Docling's HybridChunker for tokenization-aware chunking.
This chunker works directly on DoclingDocument objects, providing:
- Accurate tokenization using the embedding model's tokenizer
- Automatic splitting of oversized chunks
- Automatic merging of undersized chunks
- Context enrichment with header hierarchy
- Provenance tracking (page numbers, bounding boxes)
- Furniture filtering (excludes headers, footers, footnotes for cleaner RAG)
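The splitting and merging behavior listed above can be pictured with a minimal sketch. It assumes a whitespace tokenizer in place of the embedding model's tokenizer, and it ignores headings when merging, so Docling's actual rules may differ. All function names here are hypothetical.

```python
def split_oversized(tokens: list[str], max_tokens: int) -> list[list[str]]:
    """Cut a token list into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def merge_undersized(pieces: list[list[str]], max_tokens: int) -> list[list[str]]:
    """Greedily merge neighbors while the combined piece fits the budget."""
    merged: list[list[str]] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(piece) <= max_tokens:
            merged[-1] = merged[-1] + piece
        else:
            merged.append(list(piece))
    return merged


def chunk_texts(texts: list[str], max_tokens: int) -> list[str]:
    """Split oversized texts, then merge small neighbors (toy tokenizer)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    pieces: list[list[str]] = []
    for text in texts:
        tokens = text.split()  # stand-in for a real model tokenizer
        if tokens:
            pieces.extend(split_oversized(tokens, max_tokens))
    return [" ".join(p) for p in merge_undersized(pieces, max_tokens)]


chunks = chunk_texts(["a b", "c", "d e f g h i j"], max_tokens=4)
print(chunks)  # ['a b c', 'd e f g', 'h i j']
```

Counting tokens with the same tokenizer the embedding model uses is what keeps each chunk within the model's input limit. A whitespace split only approximates that count.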
Constructor:
def __init__(self, embed_model_id: str | None = None, chunk_size: int | None = None, exclude_furniture: bool = True) -> None
Methods
chunk_document
def chunk_document(self, doc: Any, page_numbers: list[int] | None = None) -> list[HybridChunk]
Chunk a DoclingDocument using HybridChunker.
Processes body content only, excluding furniture (headers, footers, footnotes). Running headers and footers repeat across pages and add noise to retrieval, so leaving furniture out gives cleaner RAG results.
Args:
  doc: DoclingDocument instance from DocumentConverter
  page_numbers: Optional list of page numbers (DEPRECATED: page numbers are extracted from chunk metadata)
Returns: List of HybridChunk objects with metadata (body content only)
Raises: Exception: If chunking fails