text_chunker

Simple text chunker for text-only extraction mode.

Uses character-based estimation for token counting (approximately 4 characters per token).

Classes

TextChunker

Simple chunker for plain text extraction (text-only mode).

Constructor:

def __init__(self, chunk_size: int = 400, chunk_overlap: int = 50) -> None

Methods

chunk_text

def chunk_text(self, text: str, page_numbers: list[int] | None = None, section_name: str | None = None) -> list[dict[str, Any]]

Chunk text into smaller pieces with overlap.

Args:

  • text: Full text to chunk
  • page_numbers: List of page numbers this text spans (optional)
  • section_name: Name of section (optional)

Returns: List of chunk dictionaries with keys:

  • chunk_index: int (0-indexed)
  • text: str
  • token_count: int (estimated)
  • page_numbers: list[int] (optional)
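
The overlap behaviour and the return shape described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the function name chunk_text_sketch is hypothetical, and it assumes that chunk_size and chunk_overlap are measured in estimated tokens and converted to characters at 4 characters per token. The real TextChunker may split on different boundaries (for example words or sentences).

```python
from __future__ import annotations

from typing import Any

CHARS_PER_TOKEN = 4  # assumption, taken from the module description


def chunk_text_sketch(
    text: str,
    chunk_size: int = 400,
    chunk_overlap: int = 50,
    page_numbers: list[int] | None = None,
) -> list[dict[str, Any]]:
    """Hypothetical sliding-window chunker producing the documented keys."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    window = chunk_size * CHARS_PER_TOKEN                  # characters per chunk
    step = (chunk_size - chunk_overlap) * CHARS_PER_TOKEN  # advance per chunk

    chunks: list[dict[str, Any]] = []
    start = 0
    while start < len(text):
        piece = text[start:start + window]
        chunk: dict[str, Any] = {
            "chunk_index": len(chunks),  # 0-indexed
            "text": piece,
            "token_count": len(piece) // CHARS_PER_TOKEN,  # estimated
        }
        if page_numbers is not None:
            chunk["page_numbers"] = list(page_numbers)
        chunks.append(chunk)
        if start + window >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Under these assumptions, consecutive chunks share chunk_overlap tokens' worth of characters (200 characters with the defaults), and the final chunk may be shorter than the others.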

Functions

estimate_tokens

def estimate_tokens(text: str) -> int

Estimate token count from text length.

Uses a conservative estimate of 4 characters per token. The result is approximate but sufficient for chunking purposes.
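
As a rough illustration of the 4-characters-per-token heuristic, the sketch below uses a hypothetical stand-in named estimate_tokens_sketch. It assumes floor division; the actual estimate_tokens may round differently or handle short strings in its own way.

```python
def estimate_tokens_sketch(text: str) -> int:
    """Hypothetical stand-in: roughly one token per 4 characters (floor division assumed)."""
    return len(text) // 4


# A 1,600-character string is estimated at 400 tokens under this assumption.
print(estimate_tokens_sketch("a" * 1600))
```

Because the estimate depends only on string length, it costs no tokenizer call, at the price of drifting from real token counts for text such as code or non-English scripts.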