
ocr

OCR detection and processing for scanned PDFs.

Functions

is_garbled_text

def is_garbled_text(text: str, threshold: float = 0.3) -> bool

Detect if text is garbled due to PDF font encoding issues.

Some PDFs use custom fonts whose character codes don't map to Unicode. Extraction then produces text such as "H<9" instead of "the" or "5B8" instead of "and".

Detection heuristics:

  1. High ratio of non-alphabetic characters in "word" positions
  2. Presence of known garbled patterns
  3. Low ratio of common English words

Args:

  - text: Extracted text to analyze
  - threshold: Ratio threshold for garbled detection (default: 0.3)

Returns: True if text appears garbled, False if normal
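The heuristics above can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: `looks_garbled`, `COMMON_WORDS`, and `MIN_COMMON_RATIO` are hypothetical names, the word list is a small stand-in, and heuristic 2 (known garbled patterns) is omitted because those patterns are not documented here.

```python
import string

# Hypothetical stand-in word list; the real function's list is not documented.
COMMON_WORDS = {
    "the", "and", "of", "to", "a", "in", "is", "that",
    "for", "it", "on", "with", "as", "was", "be",
}

# Hypothetical cutoff for the common-word signal (heuristic 3).
MIN_COMMON_RATIO = 0.1


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Return True if most word positions hold non-alphabetic tokens
    and very few tokens are common English words."""
    tokens = text.split()
    if not tokens:
        return False  # nothing to judge

    # Drop leading/trailing punctuation so "mat." still counts as a word.
    cores = [tok.strip(string.punctuation) for tok in tokens]

    # Heuristic 1: ratio of tokens that are not purely alphabetic.
    non_alpha_ratio = sum(1 for c in cores if not c.isalpha()) / len(tokens)

    # Heuristic 3: ratio of tokens that are common English words.
    common_ratio = sum(1 for c in cores if c.lower() in COMMON_WORDS) / len(tokens)

    return non_alpha_ratio > threshold and common_ratio < MIN_COMMON_RATIO


print(looks_garbled("H<9 5B8 H<9 7A;"))
print(looks_garbled("The cat and the dog sat on the mat."))
```

Requiring both signals keeps number-heavy but legitimate text (tables, dates) from being flagged on the non-alphabetic ratio alone.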

is_scanned_pdf

def is_scanned_pdf(pdf_bytes: bytes, sample_pages: int = 3) -> bool

Detect if a PDF is scanned (image-based) or native text.

This function samples a few pages and checks if they contain extractable text. If no text is found, it's likely a scanned PDF.

Args:

  - pdf_bytes: PDF file content as bytes
  - sample_pages: Number of pages to sample (default: 3)

Returns: True if PDF appears to be scanned, False if native text
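The sampling check can be illustrated without committing to a particular PDF library. The sketch below assumes the per-page native text has already been extracted by some other means. `looks_scanned` and `min_chars` are hypothetical names, and sampling the first pages is an assumption, since the page above does not say which pages are sampled.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    sample_pages: int = 3,
    min_chars: int = 1,  # hypothetical: minimum non-whitespace-trimmed length
) -> bool:
    """Return True if none of the sampled pages yields extractable text.

    page_texts holds the natively extracted text of each page, in page order.
    """
    sample = page_texts[:sample_pages]  # assumption: sample the first pages
    if not sample:
        return False  # no pages at all: nothing suggests a scan

    # Whitespace-only output is treated the same as no text.
    return not any(len(text.strip()) >= min_chars for text in sample)


print(looks_scanned(["", "  \n", ""]))
print(looks_scanned(["", "Chapter 1", ""]))
```

Raising `min_chars` would make the check tolerant of scanned PDFs that carry a few stray text objects such as a stamped page number.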

extract_text_with_ocr

def extract_text_with_ocr(pdf_bytes: bytes) -> list[dict[str, Any]]

Extract text from scanned PDF using Tesseract OCR.

Args:

  - pdf_bytes: PDF file content as bytes

Returns: List of page dictionaries with keys:

  - page_number: int (1-indexed)
  - text: str
  - confidence: float (OCR confidence score, 0-100)
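How the per-page `confidence` value is derived is not stated above. One plausible approach, sketched here under that caveat, is to average Tesseract's word-level confidences. The input is assumed to be shaped like the dictionary that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns: parallel `text` and `conf` lists, with `conf` set to -1 for boxes that are not words. `page_result` is a hypothetical helper name.

```python
from typing import Any


def page_result(page_number: int, ocr_data: dict[str, list]) -> dict[str, Any]:
    """Build one page dictionary from word-level OCR output.

    Assumption: confidence is the mean of the word confidences; the
    documented function may aggregate differently.
    """
    words: list[str] = []
    confidences: list[float] = []

    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some versions report conf as strings
        if conf < 0 or not str(word).strip():
            continue  # skip layout boxes and empty tokens
        words.append(str(word))
        confidences.append(conf)

    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "page_number": page_number,  # 1-indexed, as in the return format above
        "text": " ".join(words),     # simplification: line breaks are not kept
        "confidence": mean_conf,
    }


sample = {"text": ["", "Hello", "world", " "], "conf": ["-1", 96, 90.0, 95]}
print(page_result(1, sample))
```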

extract_text_with_ocr_for_pages

def extract_text_with_ocr_for_pages(pdf_bytes: bytes, page_numbers: list[int]) -> list[dict[str, Any]]

Extract text via OCR for specific pages only.

Used as a fallback when native text extraction produces garbled text because of font encoding issues. Only the specified pages are processed, for efficiency.

Uses pdf2image's first_page/last_page to convert page ranges efficiently, avoiding conversion of pages that don't need OCR.

Args:

  - pdf_bytes: PDF file content as bytes
  - page_numbers: List of page numbers (1-indexed) to process with OCR

Returns: List of page dictionaries with keys:

  - page_number: int (1-indexed)
  - text: str
  - confidence: float (OCR confidence score, 0-100)
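To convert only the pages that need OCR, the requested page numbers can first be grouped into contiguous runs, so each run maps to a single first_page/last_page conversion. The sketch below shows one way to do that grouping. `contiguous_ranges` is a hypothetical helper, not necessarily how the function is implemented.

```python
def contiguous_ranges(page_numbers: list[int]) -> list[tuple[int, int]]:
    """Group 1-indexed page numbers into sorted (first_page, last_page) runs.

    Duplicates are ignored and input order does not matter.
    """
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(page_numbers)):
        if ranges and page == ranges[-1][1] + 1:
            # Page extends the current run by one.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap found: start a new run.
            ranges.append((page, page))
    return ranges


print(contiguous_ranges([7, 2, 3, 4, 9, 8, 3]))
```

Each resulting pair could then be passed to pdf2image, for example `convert_from_bytes(pdf_bytes, first_page=first, last_page=last)`, so pages outside the runs are never rasterized.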