text_chunker
Simple text chunker for text-only extraction mode.
Uses character-based estimation for token counting (approximately 4 characters per token).
Classes
TextChunker
Simple chunker for plain text extraction (text-only mode).
Constructor:
def __init__(self, chunk_size: int = 400, chunk_overlap: int = 50) -> None
Methods
chunk_text
def chunk_text(self, text: str, page_numbers: list[int] | None = None, section_name: str | None = None) -> list[dict[str, Any]]
Split text into smaller chunks, with consecutive chunks overlapping.
Args:
- text: Full text to chunk
- page_numbers: List of page numbers this text spans (optional)
- section_name: Name of the section (optional)
Returns: List of chunk dictionaries with keys:
- chunk_index: int (0-indexed)
- text: str
- token_count: int (estimated)
- page_numbers: list[int] (optional)
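The overlapping-window idea behind chunk_text can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name chunk_text_sketch and the constant CHARS_PER_TOKEN are hypothetical, and the sketch assumes that chunk_size and chunk_overlap are measured in estimated tokens and that chunks are cut at fixed character offsets with no regard for word or sentence boundaries.

```python
from typing import Any, Optional

CHARS_PER_TOKEN = 4  # assumption: the 4-characters-per-token estimate described above


def chunk_text_sketch(
    text: str,
    chunk_size: int = 400,
    chunk_overlap: int = 50,
    page_numbers: Optional[list[int]] = None,
) -> list[dict[str, Any]]:
    """Split text into overlapping character windows sized in estimated tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    window = chunk_size * CHARS_PER_TOKEN                  # characters per chunk
    step = (chunk_size - chunk_overlap) * CHARS_PER_TOKEN  # advance between chunk starts

    chunks: list[dict[str, Any]] = []
    start = 0
    while start < len(text):
        piece = text[start:start + window]
        chunk: dict[str, Any] = {
            "chunk_index": len(chunks),  # 0-indexed
            "text": piece,
            "token_count": len(piece) // CHARS_PER_TOKEN,  # estimated
        }
        if page_numbers is not None:
            chunk["page_numbers"] = page_numbers
        chunks.append(chunk)
        if start + window >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Under these assumptions the defaults give windows of 1600 characters that advance by 1400 characters, so adjacent chunks share 200 characters (about 50 estimated tokens).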
Functions
estimate_tokens
def estimate_tokens(text: str) -> int
Estimate token count from text length.
Uses a conservative estimate of 4 characters per token. The result is approximate but sufficient for chunking purposes.
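A character-count estimate of this kind can be written in one line. The sketch below uses the hypothetical name estimate_tokens_sketch and assumes integer division (rounding down); the module's own function may round differently.

```python
def estimate_tokens_sketch(text: str) -> int:
    """Approximate the token count as one token per 4 characters."""
    # Assumption: round down. Strings shorter than 4 characters count as 0 tokens.
    return len(text) // 4


print(estimate_tokens_sketch("hello world"))  # 11 characters -> 2
```

The estimate ignores the tokenizer and the language of the text, so it is best treated as a sizing heuristic rather than an exact count.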