pdf_handler

PDF text extraction handler supporting both native and scanned PDFs.

Handles three scenarios:

  1. Scanned PDFs (image-based) -> OCR extraction
  2. Native PDFs with valid text -> Native extraction (PyPDF)
  3. Native PDFs with garbled text (font encoding issues) -> OCR fallback per-page
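The three scenarios above can be read as a per-page routing decision. The sketch below is a minimal illustration of that idea, not the handler's actual logic: the source does not say how garbled text is detected, so `looks_garbled`, `plan_extraction`, and the `min_readable_ratio` threshold are hypothetical names and values chosen for the example. Only the method labels ("native", "ocr", "ocr_fallback") come from this page.

```python
def looks_garbled(text: str, min_readable_ratio: float = 0.7) -> bool:
    """Hypothetical heuristic: flag text with too few readable characters.

    Assumption: font-encoding problems show up as control characters,
    replacement characters, or symbol soup rather than letters and digits.
    """
    stripped = text.strip()
    if not stripped:
        # Nothing was extracted natively, so the page needs OCR.
        return True
    readable = sum(
        1
        for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-"
    )
    return readable / len(stripped) < min_readable_ratio


def plan_extraction(native_page_texts: list[str]) -> list[str]:
    """Map each page's natively extracted text to an extraction method."""
    # Scenario 1: no page yields native text, so treat the PDF as scanned.
    if all(not t.strip() for t in native_page_texts):
        return ["ocr"] * len(native_page_texts)
    # Scenarios 2 and 3: keep good pages native, OCR only the garbled ones.
    return [
        "ocr_fallback" if looks_garbled(t) else "native"
        for t in native_page_texts
    ]
```

Deciding per page rather than per document means a single badly encoded page does not force OCR on pages whose native text is already usable.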

Classes

PDFHandler

Handler for extracting text from PDF documents.

Methods

extract_text

def extract_text(self, pdf_bytes: bytes) -> list[dict[str, Any]]

Extract text from a PDF (native or scanned).

Handles three scenarios:

  1. Fully scanned PDFs -> OCR for all pages
  2. Native PDFs with good text -> Native extraction
  3. Native PDFs with garbled text -> Per-page OCR fallback

Args: pdf_bytes: PDF file content as bytes

Returns: A list of dictionaries, one per page, with these keys:

  • page_number: int (1-indexed)
  • text: str
  • confidence: float (0-100 for OCR-extracted pages; 100.0 for natively extracted pages)
  • extraction_method: str ("native", "ocr", or "ocr_fallback")
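As an illustration of how the returned page dictionaries might be consumed, the sketch below flags OCR pages whose confidence falls under a threshold. It runs on hand-written sample data shaped like the return value described above; `low_confidence_pages`, the threshold of 80.0, and the sample values are assumptions made for the example, not part of `pdf_handler`.

```python
from typing import Any


def low_confidence_pages(
    pages: list[dict[str, Any]], threshold: float = 80.0
) -> list[int]:
    """Return the 1-indexed numbers of OCR pages below the threshold."""
    return [
        page["page_number"]
        for page in pages
        # Native pages always report 100.0, so only OCR results matter here.
        if page["extraction_method"] != "native"
        and page["confidence"] < threshold
    ]


# Sample data in the documented shape (values are made up).
sample_pages = [
    {"page_number": 1, "text": "Intro", "confidence": 100.0,
     "extraction_method": "native"},
    {"page_number": 2, "text": "Sc4nned", "confidence": 62.5,
     "extraction_method": "ocr_fallback"},
    {"page_number": 3, "text": "Clean scan", "confidence": 91.0,
     "extraction_method": "ocr"},
]

print(low_confidence_pages(sample_pages))  # prints [2]
```

In real use, `sample_pages` would be replaced by the list returned from `extract_text`.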

get_full_text

def get_full_text(self, pdf_bytes: bytes) -> str

Extract the full text of a PDF as a single string.

Args: pdf_bytes: PDF file content as bytes

Returns: The text of all pages as one string, with page breaks marked
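One way to picture "page breaks marked" is a join over per-page dictionaries with a marker line before each page. This is only a sketch: the source does not specify the marker format, so the `--- Page N ---` line and the `join_pages` helper are hypothetical.

```python
from typing import Any


def join_pages(
    pages: list[dict[str, Any]], marker: str = "--- Page {n} ---"
) -> str:
    """Concatenate page texts in page order, each preceded by a marker line.

    Assumption: the marker format is invented for illustration only.
    """
    parts: list[str] = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        parts.append(marker.format(n=page["page_number"]))
        parts.append(page["text"])
    return "\n".join(parts)
```

Keeping an explicit marker per page lets downstream code split the string back into pages if needed.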