How to Extract Text from Scanned PDFs

Scanned PDFs are just images, so you can't copy text from them. Learn how OCR extracts text from scanned documents quickly and accurately to make them searchable.

PDF Smaller Team
8 min read
Tags: ocr, scanned-pdf, extract-text, pdf-ocr

You open a scanned PDF, try to select some text, and... nothing happens. You can't copy it, can't search it, can't do anything with it. The words are right there on the screen, but your computer treats them as a picture of words, not actual text.

This is the #1 frustration with scanned documents. But there's a straightforward fix: OCR.

Why You Can't Copy Text from Scanned PDFs

When you scan a paper document, the scanner takes a photograph of each page. That photograph gets saved inside a PDF file. But it's still just an image: a grid of colored pixels.

Your computer doesn't "see" the letters. It sees patterns of light and dark. That's why:

  • Ctrl+F doesn't work: No text to search
  • Copy-paste fails: There's nothing to select
  • Screen readers can't read it: The document is inaccessible
  • File size is huge: Images take far more space than text

What Is OCR?

OCR (Optical Character Recognition) is technology that looks at images of text and converts them into actual, selectable, searchable text.

It works like this:

  1. OCR examines the image pixel by pixel
  2. Identifies shapes that look like letters and numbers
  3. Determines which character each shape represents
  4. Outputs real text you can copy, search, and edit
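
Steps 2 and 3 can be illustrated with a toy sketch. The Python below compares tiny bitmaps against hand-made 3x5 templates and picks the closest one. The templates and the names `pixel_distance`, `recognize_glyph`, and `recognize_line` are invented for this illustration; production OCR engines rely on trained models rather than a lookup like this.

```python
# Toy illustration of "identify the shape, decide which character it is".
# The 3x5 templates below are made up for this sketch.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize_glyph(glyph):
    """Return the template character whose bitmap differs least from the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


def recognize_line(glyphs):
    """Recognize a sequence of glyph bitmaps and join the results into text."""
    return "".join(recognize_glyph(g) for g in glyphs)


# A "T" with one pixel missing, as a stand-in for scan noise.
noisy_t = ["###", ".#.", ".#.", "...", ".#."]
print(recognize_line([TEMPLATES["L"], TEMPLATES["O"], noisy_t]))  # LOT
```

The noisy "T" is still recognized because it differs from the "T" template by one pixel and from every other template by more. The same idea explains why blur and smudges hurt accuracy: they push a shape closer to the wrong character.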

Modern OCR is remarkably accurate: typically 95-99% for clean scans of printed text.
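
To get a feel for what those percentages mean in practice, here is a small arithmetic sketch. It assumes the accuracy figure is measured per character and that a page holds about 2,000 characters; both are assumptions made for illustration, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(chars_per_page, accuracy):
    """Estimate how many characters per page are likely to be wrong."""
    return round(chars_per_page * (1 - accuracy))


CHARS_PER_PAGE = 2000  # assumed page density, not a measured value

for accuracy in (0.99, 0.95):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} wrong characters per page")
```

Under these assumptions, 99% accuracy still leaves roughly 20 characters per page to catch, and 95% leaves about 100, which is why proofreading matters for important documents.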

How to Extract Text from a Scanned PDF

Method 1: Use Our OCR Tool (Recommended)

The fastest way to make a scanned PDF searchable:

  1. Go to our PDF OCR tool
  2. Upload your scanned PDF
  3. The tool processes each page with OCR
  4. Download your searchable PDF

What you get: A PDF that looks identical to the original, but with an invisible text layer underneath. You can select text, search with Ctrl+F, and copy-paste.

Time: 5-30 seconds depending on page count.

Privacy: Everything runs in your browser. Your document never leaves your device.

Method 2: Convert to Word

If you need to edit the text (not just search/copy):

  1. Use our PDF to Word converter
  2. Upload the scanned PDF
  3. Download the editable Word document
  4. Edit as needed

Best for: When you need to modify the content, not just read it.

Method 3: Google Drive (Free Alternative)

Google Drive has built-in OCR:

  1. Upload your scanned PDF to Google Drive
  2. Right-click → Open with → Google Docs
  3. Google automatically runs OCR
  4. Text appears in a Google Doc

Pros: Free, decent accuracy

Cons: Formatting gets destroyed. You get raw text, not a nicely formatted document. Also, your file goes to Google's servers.

Method 4: Adobe Acrobat

If you have Acrobat Pro:

  1. Open the scanned PDF
  2. Tools → Enhance Scans → Recognize Text
  3. Choose language and output settings
  4. Run OCR

Pros: Excellent accuracy, preserves formatting

Cons: Requires expensive subscription

What Affects OCR Accuracy?

Not all scans are created equal. Here's what impacts how well OCR works:

Scan Quality

Quality          | Expected Accuracy | Notes
300 DPI, clean   | 98-99%            | Ideal for OCR
200 DPI, clean   | 95-98%            | Good enough
150 DPI or less  | 85-95%            | Accuracy drops
Blurry/skewed    | 70-85%            | May need manual correction
Poor photocopy   | 60-80%            | OCR struggles significantly

Rule of thumb: scan at 300 DPI. It gives the best balance between OCR accuracy and file size.
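
The effect of DPI is easier to see as arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) and uses the standard 72 points per inch. The function names are hypothetical, and the font size is the nominal size, so actual letter shapes are somewhat smaller than the figure it returns.

```python
POINTS_PER_INCH = 72  # typographic points in one inch


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page at a given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(font_pt, dpi):
    """Nominal height in pixels of text set at font_pt points."""
    return font_pt / POINTS_PER_INCH * dpi


print(page_pixels(8.5, 11, 300))  # (2550, 3300)
for dpi in (300, 200, 150):
    print(f"{dpi} DPI: 8pt text is about {text_height_px(8, dpi):.0f} px tall")
```

Halving the resolution halves the number of pixels available to describe each letter, which is why small text suffers first at low DPI.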

Document Characteristics

OCR works best with:

  • Printed text (not handwriting)
  • Standard fonts (Times New Roman, Arial, etc.)
  • Black text on white background
  • Clean, straight pages
  • Common languages (English, Spanish, French, German, etc.)

OCR struggles with:

  • Handwritten text (accuracy drops to 60-80%)
  • Decorative or unusual fonts
  • Colored backgrounds or watermarks
  • Skewed or rotated pages
  • Mixed languages on one page
  • Very small text (under 8pt)

Page Orientation

If pages are skewed (slightly rotated during scanning), OCR accuracy drops. Many OCR tools auto-correct slight skew, but if your pages are significantly rotated, rotate them before running OCR.

Common OCR Use Cases

Digitizing Old Documents

Situation: You have boxes of paper documents that need to be searchable.

Approach:

  1. Scan everything at 300 DPI
  2. Run OCR in batches
  3. File the searchable PDFs in your document management system

Result: Decades of paper documents become instantly searchable.

Making Legal Documents Searchable

Situation: Court filings, contracts, or case files scanned as images.

Approach:

  1. OCR the documents
  2. Use Ctrl+F to find specific clauses, dates, or names
  3. Copy relevant text for briefs or summaries

Time saved: Hours of manual reading replaced by seconds of searching.

Processing Receipts and Invoices

Situation: Scanned receipts for expense reports or tax filings.

Approach:

  1. Scan or photograph receipts
  2. Convert to PDF if needed
  3. Run OCR to extract amounts, dates, and vendor names
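
Once OCR has produced plain text, pulling out the fields in step 3 is ordinary text processing. The sketch below is one possible approach, assuming US-style dollar amounts, MM/DD/YYYY dates, and a vendor name on the first line. The receipt text, the "ACME HARDWARE" vendor, and the `extract_fields` helper are all made up for illustration.

```python
import re

# Assumed formats: "$12.99" or "$1,234.56" for amounts, "03/14/2024" for dates.
AMOUNT_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*\.\d{2})")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_fields(ocr_text):
    """Pull vendor, dates, and amounts out of OCR'd receipt text."""
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(ocr_text)]
    return {
        "vendor": lines[0] if lines else None,  # heuristic: first non-empty line
        "dates": DATE_RE.findall(ocr_text),
        "amounts": amounts,
    }


sample = "ACME HARDWARE\n03/14/2024\nHammer  $12.99\nNails  $4.50\nTOTAL  $17.49\n"
print(extract_fields(sample))
```

Real receipts vary widely in layout, so patterns like these usually need adjusting per vendor, and every extracted amount should still be checked against the original.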

Result: Searchable financial records.

Academic Research

Situation: Older journal articles or books only available as scanned PDFs.

Approach:

  1. Download the scanned PDF
  2. Run OCR
  3. Search for keywords, copy quotes with proper citations

Time saved: Instead of reading 50 pages to find one quote, search in 2 seconds.

Accessibility Compliance

Situation: Your organization needs documents to be accessible to screen readers.

Approach:

  1. OCR all scanned documents
  2. The text layer makes documents screen-reader compatible
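  3. Work toward accessibility requirements (ADA, WCAG, Section 508); the text layer is a necessary first step, though full compliance usually also calls for tagging and a correct reading order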

Result: Inclusive documents that everyone can access.

OCR Tips for Better Results

Before Scanning

  • Use 300 DPI: The sweet spot for file size vs. OCR accuracy
  • Use a flatbed scanner: Better quality than phone cameras for multi-page documents
  • Ensure clean glass: Dust and smudges cause OCR errors
  • Align pages straight: Skew reduces accuracy

Before Running OCR

  • Check page orientation: Rotate any sideways or upside-down pages first
  • Remove blank pages: They waste processing time
  • Crop margins: Large dark borders can confuse OCR
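
Blank pages can be detected automatically before OCR. Here is a minimal sketch, assuming each page is already available as rows of grayscale values (0 for black, 255 for white). The 0.5% cutoff and the function names are assumptions to tune for your scans, not settings from any particular tool.

```python
def dark_fraction(page, threshold=128):
    """Fraction of pixels darker than threshold in a grayscale page."""
    total = sum(len(row) for row in page)
    dark = sum(1 for row in page for px in row if px < threshold)
    return dark / total if total else 0.0


def is_blank(page, max_dark=0.005):
    """Treat a page as blank when almost none of its pixels are dark."""
    return dark_fraction(page) <= max_dark


blank_page = [[255] * 100 for _ in range(100)]
text_page = [[255] * 100 for _ in range(98)] + [[0] * 100 for _ in range(2)]

print(is_blank(blank_page))  # True
print(is_blank(text_page))   # False
```

Keeping a small tolerance rather than requiring zero dark pixels lets the check ignore specks of dust, but a large dark border from the scanner lid would defeat it, which is another reason to crop margins first.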

After OCR

  • Spot-check accuracy: Read a few paragraphs and compare to the original
  • Check numbers carefully: OCR sometimes confuses 0/O, 1/l, 5/S
  • Verify special characters: Symbols like @, #, & can be misread
  • Compress the result: OCR adds a text layer, which slightly increases file size; compression offsets this
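
A spot check can be turned into a number. The sketch below assumes you have retyped a short passage by hand as a reference, and measures character accuracy from the edit distance (insertions, deletions, and substitutions) between that reference and the OCR output. The function names and sample strings are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_text):
    """Share of reference characters the OCR output got right (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1 - edit_distance(reference, ocr_text) / len(reference))


reference = "Invoice total: 1050"
ocr_output = "Invoice tota1: lO5O"
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Checking two or three short passages from different parts of a document gives a rough idea of whether the whole file needs proofreading or a re-scan.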

OCR Accuracy by Content Type

Content                   | Typical Accuracy | Notes
Typed business letters    | 99%+             | Best-case scenario
Book pages                | 97-99%           | Very reliable
Magazine/newspaper        | 95-98%           | Column layouts can cause issues
Tables and spreadsheets   | 90-95%           | Structure may need manual fixing
Forms with checkboxes     | 85-95%           | Checkmarks sometimes misread
Handwritten notes         | 60-80%           | Highly variable
Faded or aged documents   | 70-90%           | Depends on contrast
Receipts (thermal paper)  | 80-90%           | Fading is the main problem

Troubleshooting Common OCR Problems

Problem: OCR Returns Gibberish

Cause: Image is too low quality, heavily compressed, or in a script the OCR engine doesn't support.

Fix:

  • Re-scan at higher DPI if possible
  • Increase image contrast before OCR
  • Make sure you've selected the correct language
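
Increasing contrast can be as simple as stretching the range of gray values so the darkest pixel becomes black and the lightest becomes white. This is a minimal sketch on a flat list of grayscale values; the function names and the threshold of 128 are assumptions, and image editors or OCR tools typically offer more refined versions of the same idea.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values to span the full 0-255 range."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def binarize(pixels, threshold=128):
    """Turn each pixel pure black or pure white around a threshold."""
    return [0 if p < threshold else 255 for p in pixels]


faded = [100, 120, 150]  # faded ink (100, 120) on grayish paper (150)
stretched = stretch_contrast(faded)
print(stretched)            # [0, 102, 255]
print(binarize(stretched))  # [0, 0, 255]
```

After stretching, the faded ink and the paper are far apart in value, so a simple threshold separates them cleanly.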

Problem: Text Is Extracted but Formatting Is Wrong

Cause: OCR reads text in the wrong order (e.g., reading across columns instead of down).

Fix:

  • Use OCR tools that understand document layout
  • For complex layouts, try converting to Word first, then fix formatting
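
The column problem comes down to sort order. The sketch below assumes the OCR step returns each word with the x and y position of its top-left corner and that the page has evenly spaced columns; both are simplifications, and `reading_order` is a hypothetical helper. Layout-aware tools detect columns from the page itself instead of assuming them.

```python
def reading_order(words, page_width, columns=2):
    """Order words column by column, top to bottom within each column.

    words: list of (x, y, text) tuples, with x and y the top-left corner.
    """
    col_width = page_width / columns

    def sort_key(word):
        x, y, _ = word
        col = min(int(x // col_width), columns - 1)
        return (col, y, x)

    return [text for _, _, text in sorted(words, key=sort_key)]


words = [
    (50, 10, "Left-1"), (350, 10, "Right-1"),
    (50, 30, "Left-2"), (350, 30, "Right-2"),
]

# Naive top-to-bottom sort reads straight across both columns.
print([text for _, _, text in sorted(words, key=lambda w: (w[1], w[0]))])
# Column-aware sort finishes the left column before starting the right.
print(reading_order(words, page_width=600))
```

The naive sort interleaves the two columns line by line, which is the jumbled output described above; grouping by column first restores the intended order.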

Problem: Numbers Are Wrong

Cause: OCR commonly confuses similar characters (0/O, 1/l/I, 8/B).

Fix:

  • Always proofread numbers manually
  • For financial documents, double-check every figure
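
Manual proofreading goes faster if suspicious values are flagged first. This sketch marks any token that mixes digits with letters OCR commonly confuses for digits (O, o, l, I, S, B). The `suspicious_tokens` helper and the sample text are illustrative, and the check is a heuristic: it also flags legitimate codes such as "B12", and it misses a number in which every digit was misread.

```python
import re

# Letters commonly mistaken for the digits 0, 1, 5, and 8.
LOOKALIKES = set("OolISB")


def suspicious_tokens(text):
    """Return tokens that contain both digits and digit-lookalike letters."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9.,]+", text):
        has_digit = any(c.isdigit() for c in token)
        has_lookalike = any(c in LOOKALIKES for c in token)
        if has_digit and has_lookalike:
            flagged.append(token)
    return flagged


print(suspicious_tokens("Total: 1O5.00 due 2l March"))  # ['1O5.00', '2l']
```

Treat the output as a list of places to look, not a list of confirmed errors; financial figures still need a check against the original.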

Problem: OCR Is Very Slow

Cause: Large files with many pages, or low-powered device.

Fix:

  • Process in smaller batches (split, OCR, then merge)
  • Close other browser tabs to free up memory
  • Use a desktop/laptop instead of a phone
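
Planning the batches is simple arithmetic. The sketch below only computes the page ranges; the splitting, OCR, and merging themselves are left to whichever tools you use. `page_batches` is a hypothetical helper and the batch size of 20 is an arbitrary example.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, 1-based (first, last) page ranges for each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


for first, last in page_batches(45, 20):
    print(f"OCR pages {first}-{last}")
```

For a 45-page file this gives pages 1-20, 21-40, and 41-45, so the last batch is simply shorter than the others.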

OCR vs. Manual Retyping

When does OCR beat manual data entry?

Factor      | OCR                    | Manual Retyping
Speed       | 1-30 seconds per page  | 5-15 minutes per page
Accuracy    | 95-99% (clean scans)   | 95-99% (human error exists too)
Cost        | Free with our tool     | Your time, or hiring a typist
Formatting  | Mostly preserved       | Requires recreation
Best for    | Any volume             | Very short documents (<1 page)

Bottom line: OCR wins for anything longer than a paragraph.
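
To put the speed comparison in concrete terms, here is a quick calculation for a 100-page document, using the slowest OCR figure (30 seconds per page) against the manual range of 5-15 minutes per page. The page count is an assumed example and `total_minutes` is a hypothetical helper.

```python
def total_minutes(pages, seconds_per_page):
    """Total processing time in minutes for a document."""
    return pages * seconds_per_page / 60


PAGES = 100  # assumed document length

ocr = total_minutes(PAGES, 30)
manual_fast = total_minutes(PAGES, 5 * 60)
manual_slow = total_minutes(PAGES, 15 * 60)

print(f"OCR: {ocr:.0f} minutes")
print(f"Manual: {manual_fast / 60:.1f} to {manual_slow / 60:.1f} hours")
```

Even at its slowest, OCR finishes in under an hour what would take at least a full working day to retype.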

Ready to Extract Text?

Stop squinting at scanned PDFs and manually retyping content. OCR handles it in seconds.

Extract Text with OCR: upload your scanned PDF, get searchable text back. Free, private, no account needed.

Need to edit the extracted text? Convert your scanned PDF directly to Word for full editing capabilities.