Google Indexing Scanned Documents Via OCR
Until now, if you wanted Google to index your PDF files, you had to create text-based PDFs rather than image-based ones, because Googlebot could not recognize the content of image-based documents. Those days are gone!
According to an official announcement from Google, that is no longer the case.
Here is how Google explained it:
“This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.
While we’ve indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however — it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.”
“To see our new system at work, click on these search queries. Note the document excerpt in the search results, along with the full text presented after the ‘View as HTML’ link:
[repairing aluminum wiring]
[spin lock performance]
[Mumps and Severe Neutropenia]
[Steady success in a volatile world]”
