6 May 2005

Adobe Acrobat and a thousand words

Filed under: Digital revolution — paulcook @ 3:53 pm

Sometimes a picture really can mean a thousand words, especially if it’s a PDF file that was scanned from a paper document: say, a paper written before the advent of online journals. The problem with these, pretty though they may be, is that you can’t search for words in the document, or copy text without retyping it. Right?

Nope. I discovered yesterday that Adobe Acrobat (the full version, not the Reader), which is available to Caltech students under the campus site license, includes optical character recognition (OCR). Just go to the “Document” menu, where there is an item that reads the document.

The beauty of the system is that it doesn’t replace the scanned images with badly formatted text, but rather it associates the recognised text with the image. This allows you to search or select using the text selection tool, even though it’s still the scanned image you’re viewing. It’s quite a surreal experience.
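
To make the idea concrete, here is a toy model in Python of recognised text attached to a scanned image. It is only a sketch: the `Word` and `ScannedPage` names, the file name, and the coordinates are all made up for illustration, and it says nothing about how Acrobat actually stores the text inside the PDF.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognised word plus where it sits on the scanned image."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ScannedPage:
    """A page image with its recognised words kept alongside it (hypothetical model)."""
    image_name: str
    words: list

    def search(self, term):
        """Return every word containing `term`, ignoring case."""
        needle = term.lower()
        return [w for w in self.words if needle in w.text.lower()]


page = ScannedPage(
    image_name="page1.png",  # made-up file name
    words=[
        Word("Optical", 72, 100, 48, 12),
        Word("character", 124, 100, 60, 12),
        Word("recognition", 188, 100, 72, 12),
    ],
)

hits = page.search("recogn")
for w in hits:
    # The viewer would highlight this rectangle on top of the image.
    print(w.text, (w.x, w.y, w.width, w.height))
```

The image itself is never altered; a search just returns the rectangles to highlight on top of it, which would explain why you still see the scan while selecting text.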

I tested it on one paper, and it works really well. Not only is the accuracy very good, it even recognised that the scanned images were rotated 90 degrees, and in two columns. So when I copied the text from a page, the lines flowed correctly through the first and then the second column.
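
For the curious, the column handling can be imitated with a very naive rule: take everything in the left half of the page top to bottom, then everything in the right half. The sketch below assumes exactly two equal-width columns and already-upright text, and the `reading_order` function is my own invention; Acrobat's real layout analysis is surely more sophisticated.

```python
def reading_order(words, page_width):
    """Order words for a two-column page.

    `words` is a list of (text, x, y) tuples, with y increasing down the page.
    Assumes two columns split at the middle of the page (a simplification).
    """
    middle = page_width / 2
    left = [w for w in words if w[1] < middle]
    right = [w for w in words if w[1] >= middle]

    def by_line(w):
        # Sort top to bottom first, then left to right within a line.
        return (w[2], w[1])

    ordered = sorted(left, key=by_line) + sorted(right, key=by_line)
    return [w[0] for w in ordered]


words = [
    ("first", 50, 100), ("third", 350, 100),
    ("second", 50, 120), ("fourth", 350, 120),
]
ordered = reading_order(words, page_width=600)
print(" ".join(ordered))  # prints "first second third fourth"
```

Reading straight across each line instead would have given "first third second fourth", which is the kind of jumble you get when copied text ignores the columns.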

Since Acrobat can create PDFs from virtually any image format (possibly via printing to the included PDFWriter “virtual printer” that it installs), you can use this for all your character-reading needs, as long as you don’t need detailed formatting to be preserved. Also, though I haven’t tested this, I think resaving the PDF will embed the textual information in the file and make it available even to people using the free Acrobat Reader.


  1. I don’t think the last paragraph is quite right. There is a good read on my website about font embedding in Adobe PDF files with Adobe Acrobat: Font Embed in PDF

    Comment by graphic design forums — 16 Sep 2006 @ 10:29 am

  2. I think you’re talking about embedding font metrics, that is, details of how to show a particular font. I’m talking about embedding the actual text, associated with a PDF containing scanned images (i.e. no actual fonts).

    Comment by paulcook — 16 Sep 2006 @ 10:49 am
