pdf_handler

PDF text extraction handler supporting both native and scanned PDFs.

Handles three scenarios:

Classes

Handler for extracting text from PDF documents.

def extract_text(self, pdf_bytes: bytes) -> list[dict[str, Any]]

Extract text from PDF (native or scanned).

Handles three scenarios:

Args: pdf_bytes: PDF file content as bytes

Returns: List of page dictionaries with keys:

def get_full_text(self, pdf_bytes: bytes) -> str

Extract full text from PDF as a single string.

Args: pdf_bytes: PDF file content as bytes

Returns: Full text content with page breaks marked
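
The relationship between the two methods can be illustrated with a minimal sketch. This is not the module's implementation: the class name `StubPDFHandler`, the page dictionary keys `"page"` and `"text"`, and the page-break marker string are all hypothetical, since this reference does not show the real class name, keys, or marker. The stub `extract_text` returns fixed pages instead of parsing a PDF, so the example runs without any PDF library.

```python
from typing import Any

# Assumed marker; the real handler's page-break marker is not documented here.
PAGE_BREAK = "\n\n--- page break ---\n\n"


class StubPDFHandler:
    """Hypothetical stand-in for the handler class described above."""

    def extract_text(self, pdf_bytes: bytes) -> list[dict[str, Any]]:
        # Stub: a real handler would parse pdf_bytes (native or scanned).
        # The keys "page" and "text" are assumptions for illustration only.
        return [
            {"page": 1, "text": "First page."},
            {"page": 2, "text": "Second page."},
        ]

    def get_full_text(self, pdf_bytes: bytes) -> str:
        # Join the per-page text into one string, marking each page break.
        pages = self.extract_text(pdf_bytes)
        return PAGE_BREAK.join(page["text"] for page in pages)


handler = StubPDFHandler()
full_text = handler.get_full_text(b"%PDF-1.7 placeholder bytes")
print(full_text)
```

With a real handler, `pdf_bytes` would typically come from reading a file in binary mode, for example `open(path, "rb").read()`.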