ocr
OCR detection and processing for scanned PDFs.
Functions
is_garbled_text
def is_garbled_text(text: str, threshold: float = 0.3) -> bool
Detect if text is garbled due to PDF font encoding issues.
Some PDFs use custom fonts whose character codes don't map to Unicode. Extracting text from them yields output like "H<9" instead of "the" or "5B8" instead of "and".
Detection heuristics:
- High ratio of non-alphabetic characters in "word" positions
- Presence of known garbled patterns
- Low ratio of common English words
Args:
- text: Extracted text to analyze
- threshold: Ratio threshold for garbled detection (default: 0.3)
Returns: True if text appears garbled, False if normal
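The first heuristic in the list above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the name `looks_garbled`, the punctuation set, and the exact way the ratio is compared against `threshold` are assumptions, and the known-pattern and common-word checks are left out.

```python
# Hypothetical sketch of the "non-alphabetic tokens" heuristic only.
PUNCTUATION = ".,;:!?\"'()[]"


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Return True when too many whitespace-separated tokens are not plain words."""
    tokens = text.split()
    if not tokens:
        # Assumption: empty text is treated as not garbled.
        return False
    # Strip surrounding punctuation so that "dog." still counts as a word.
    stripped = [token.strip(PUNCTUATION) for token in tokens]
    non_alpha = sum(1 for token in stripped if token and not token.isalpha())
    return non_alpha / len(tokens) > threshold


print(looks_garbled("H<9 5B8 C: H<9 G@5"))
print(looks_garbled("The quick brown fox jumps over the lazy dog."))
```

A ratio-based check like this is cheap, but it would also flag pages that are mostly numbers or formulas, which is presumably why the documented function combines several signals.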
is_scanned_pdf
def is_scanned_pdf(pdf_bytes: bytes, sample_pages: int = 3) -> bool
Detect whether a PDF is scanned (image-based) or contains native text.
This function samples a few pages and checks if they contain extractable text. If no text is found, it's likely a scanned PDF.
Args:
- pdf_bytes: PDF file content as bytes
- sample_pages: Number of pages to sample (default: 3)
Returns: True if PDF appears to be scanned, False if native text
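The sampling decision described above might look roughly like the sketch below. It is deliberately decoupled from any PDF library: it assumes the per-page text has already been extracted into a list of strings, and the name `sampled_pages_lack_text`, the choice to sample the first pages, and the handling of an empty document are all assumptions rather than documented behavior.

```python
def sampled_pages_lack_text(page_texts: list[str], sample_pages: int = 3) -> bool:
    """Return True when none of the sampled pages contains extractable text."""
    # Assumption: the sample is simply the first `sample_pages` pages.
    sample = page_texts[:sample_pages]
    if not sample:
        # Assumption: a document with no pages is not reported as scanned.
        return False
    # Whitespace-only output counts as "no text".
    return all(not text.strip() for text in sample)


print(sampled_pages_lack_text(["", "  \n", ""]))
print(sampled_pages_lack_text(["", "Chapter 1", ""]))
```

Sampling keeps the check fast on long documents, at the cost of misclassifying a PDF whose first few pages are scanned covers followed by native text.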
extract_text_with_ocr
def extract_text_with_ocr(pdf_bytes: bytes) -> list[dict[str, Any]]
Extract text from scanned PDF using Tesseract OCR.
Args:
- pdf_bytes: PDF file content as bytes

Returns: List of page dictionaries with keys:
- page_number: int (1-indexed)
- text: str
- confidence: float (OCR confidence score, 0-100)
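As an illustration of consuming this return value, the sketch below picks out pages whose OCR confidence falls under a cutoff. The helper name `low_confidence_pages`, the cutoff of 60, and the sample page data are hypothetical; only the dictionary keys come from the description above.

```python
from typing import Any


def low_confidence_pages(
    pages: list[dict[str, Any]], min_confidence: float = 60.0
) -> list[int]:
    """Return the 1-indexed page numbers whose confidence is below the cutoff."""
    return [
        page["page_number"]
        for page in pages
        if page["confidence"] < min_confidence
    ]


# Made-up data in the documented shape, for illustration only.
sample_pages = [
    {"page_number": 1, "text": "Invoice 2024", "confidence": 91.5},
    {"page_number": 2, "text": "lnv0ice 2O24", "confidence": 42.0},
]
print(low_confidence_pages(sample_pages))
```

A caller could use such a list to flag pages for manual review instead of trusting their text.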
extract_text_with_ocr_for_pages
def extract_text_with_ocr_for_pages(pdf_bytes: bytes, page_numbers: list[int]) -> list[dict[str, Any]]
Extract text via OCR for specific pages only.
Used as a fallback when native text extraction produces garbled text because of font encoding issues. Only the specified pages are processed, for efficiency.
Uses pdf2image's first_page/last_page to convert page ranges efficiently, avoiding conversion of pages that don't need OCR.
Args:
- pdf_bytes: PDF file content as bytes
- page_numbers: List of page numbers (1-indexed) to process with OCR

Returns: List of page dictionaries with keys:
- page_number: int (1-indexed)
- text: str
- confidence: float (OCR confidence score, 0-100)
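One plausible way to turn `page_numbers` into the page ranges mentioned above is to merge adjacent pages into (first, last) pairs. The sketch below shows only that grouping step; the helper name `contiguous_ranges` and the decision to sort and de-duplicate the input are assumptions, not the module's documented behavior.

```python
def contiguous_ranges(page_numbers: list[int]) -> list[tuple[int, int]]:
    """Group 1-indexed page numbers into inclusive (first, last) ranges."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(page_numbers)):
        if ranges and page == ranges[-1][1] + 1:
            # Page directly follows the current range, so extend it.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap found (or first page), so start a new range.
            ranges.append((page, page))
    return ranges


print(contiguous_ranges([7, 2, 3, 4, 9, 3]))  # [(2, 4), (7, 7), (9, 9)]
```

Each pair could then be passed as `first_page` and `last_page` to pdf2image's `convert_from_bytes`, so that only the pages needing OCR are rasterized.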