ocr

OCR detection and processing for scanned PDFs.

Functions

def is_garbled_text(text: str, threshold: float = 0.3) -> bool

Detect if text is garbled due to PDF font encoding issues.

Some PDFs use custom fonts whose character codes don't map to Unicode. Extracted text then comes out as strings like "H<9" instead of "the", or "5B8" instead of "and".

Detection heuristics:

Args:
    text: Extracted text to analyze
    threshold: Ratio threshold for garbled detection (default: 0.3)

Returns: True if text appears garbled, False if normal
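The heuristics this function applies are not listed here, so the following is only a minimal sketch of one plausible ratio-based check, not the module's actual logic. The names garbled_ratio and looks_garbled, and the definition of a "word-like" token, are assumptions made for illustration.

```python
import re

# Assumption: a token is "word-like" if, once surrounding punctuation is
# stripped, it is purely alphabetic (inner apostrophes or hyphens allowed)
# or purely numeric. Anything else counts toward the garbled ratio.
_WORD = re.compile(r"[^\W\d_]+(?:['\-][^\W\d_]+)*")
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
_EDGE_PUNCTUATION = ".,;:!?\"'()[]{}"


def garbled_ratio(text: str) -> float:
    """Return the fraction of tokens that do not look like words or numbers."""
    tokens = [t.strip(_EDGE_PUNCTUATION) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    odd = sum(
        1 for t in tokens
        if not (_WORD.fullmatch(t) or _NUMBER.fullmatch(t))
    )
    return odd / len(tokens)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical stand-in for is_garbled_text using the ratio above."""
    return garbled_ratio(text) > threshold
```

With this sketch, tokens such as "H<9" and "5B8" mix letters with digits or symbols and push the ratio up, while ordinary prose stays near zero.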

def is_scanned_pdf(pdf_bytes: bytes, sample_pages: int = 3) -> bool

Detect if a PDF is scanned (image-based) or native text.

This function samples sample_pages pages and checks whether they contain extractable text. If none of them do, the PDF is likely scanned.

Args:
    pdf_bytes: PDF file content as bytes
    sample_pages: Number of pages to sample (default: 3)

Returns: True if PDF appears to be scanned, False if native text
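The sampling decision can be sketched independently of any PDF library. The example below assumes the per-page text has already been extracted by whatever library the module uses; looks_scanned and its page_texts parameter are hypothetical names, and treating an empty document as "not scanned" is an assumption.

```python
from itertools import islice
from typing import Iterable


def looks_scanned(page_texts: Iterable[str], sample_pages: int = 3) -> bool:
    """Return True if none of the sampled pages contain extractable text."""
    # Only look at the first `sample_pages` entries; the rest are never read.
    sampled = list(islice(page_texts, sample_pages))
    if not sampled:
        # Assumption: with no pages to inspect, nothing indicates a scan.
        return False
    # Whitespace-only pages are treated the same as empty pages.
    return all(not text.strip() for text in sampled)
```

Because only the sampled pages are inspected, a document whose first pages are images but whose later pages carry text would still be reported as scanned under this sketch.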

def extract_text_with_ocr(pdf_bytes: bytes) -> list[dict[str, Any]]

Extract text from scanned PDF using Tesseract OCR.

Args:
    pdf_bytes: PDF file content as bytes

Returns:
    List of page dictionaries with keys:
    - page_number: int (1-indexed)
    - text: str
    - confidence: float (OCR confidence score, 0-100)
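As an illustration of consuming the documented return shape, the sketch below flags pages with a low confidence score and joins page text in order. The helper names, the 60.0 cutoff, and the sample data are assumptions for illustration, not part of the module.

```python
from typing import Any


def low_confidence_pages(
    pages: list[dict[str, Any]], min_confidence: float = 60.0
) -> list[int]:
    """Return the 1-indexed page numbers whose confidence is below the cutoff."""
    return [p["page_number"] for p in pages if p["confidence"] < min_confidence]


def join_page_text(pages: list[dict[str, Any]]) -> str:
    """Concatenate page text in page order, separated by blank lines."""
    ordered = sorted(pages, key=lambda p: p["page_number"])
    return "\n\n".join(p["text"] for p in ordered)


# Hand-written sample in the documented shape (not real OCR output).
sample = [
    {"page_number": 2, "text": "Second page", "confidence": 41.5},
    {"page_number": 1, "text": "First page", "confidence": 93.0},
]
```

Pages returned by low_confidence_pages could be queued for manual review or re-run at a higher resolution, depending on the pipeline.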

def extract_text_with_ocr_for_pages(pdf_bytes: bytes, page_numbers: list[int]) -> list[dict[str, Any]]

Extract text via OCR for specific pages only.

Used as a fallback when native text extraction produces garbled text due to font encoding issues. Only the specified pages are processed, for efficiency.

Uses pdf2image's first_page/last_page to convert page ranges efficiently, avoiding conversion of pages that don't need OCR.

Args:
    pdf_bytes: PDF file content as bytes
    page_numbers: List of page numbers (1-indexed) to process with OCR

Returns:
    List of page dictionaries with keys:
    - page_number: int (1-indexed)
    - text: str
    - confidence: float (OCR confidence score, 0-100)
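One way to use first_page/last_page efficiently is to collapse the requested page numbers into contiguous ranges and convert each range once. The sketch below shows only that grouping step; group_into_ranges is a hypothetical helper, and whether the module groups pages this way is an assumption.

```python
def group_into_ranges(page_numbers: list[int]) -> list[tuple[int, int]]:
    """Collapse 1-indexed page numbers into sorted, inclusive (start, end) ranges."""
    ranges: list[tuple[int, int]] = []
    # Sorting and de-duplicating first makes adjacent pages appear consecutively.
    for page in sorted(set(page_numbers)):
        if ranges and page == ranges[-1][1] + 1:
            # Page directly follows the current range: extend it.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap found: start a new single-page range.
            ranges.append((page, page))
    return ranges


# Each (start, end) pair could then be passed to pdf2image, for example:
#   convert_from_bytes(pdf_bytes, first_page=start, last_page=end)
```

For pages [7, 2, 3, 4, 9, 8] this yields two ranges, (2, 4) and (7, 9), so pages 5 and 6 are never rendered to images.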