The simplest OCR options you have here


2014/01/10 plagwitz

(Staff:) Use the departmental scanner, which outputs PDF to a network share (that you can link from your desktop). The PDF is at least searchable.
(Staff & Students:) Using only your desktop, at work or at home:
1. MS-Office
  1. OneNote 2007/2010: paste image, right-click to access context menu, “extract text”. Example (you can see it is quick and simple, but not error-free):
  2. Imaging components: TBA
2. Google Apps can also OCR the files you upload to Google Docs.
  1. You first need to change the default settings. Click the hard-drive icon for file uploads and choose from its context menu: “Settings” / “Convert text from uploaded PDF and image files”.
  2. You may want to upload an entire folder; for that you need to either use Chrome or allow the installation of a Java applet.
  3. You may not want to deal with one Google Doc for each image you upload. So bind your scanned pages into multi-page PDFs (unless your OCR software already allows this; I have been restricted to “Windows Scan and Fax”). ImageMagick’s convert command can do it for free. Note that the maximum upload size in Google Docs is 2 MB, which restricted me to about 10 pages per document. This is strange, since I had scanned to black and white at a very small size, but the PDF size grew, likely because it uses a less efficient encoding (it might be possible to optimize this).
  4. Google Apps uses the same OCR engine as Google Books. Not much formatting is retained (in the examples below, note the line breaks), but that is fine for me, since I am only after large chunks of text for further processing:
  5. I have only tested English (largely current affairs) text, but was impressed with the OCR results.
    1. Also, where the OCR went wrong (2-times 4 per page; also some artifacts, since my scans were not very clean: Google Apps seems to handle dark spots on the page better than lines that were not straightened),
    2. the proofreading suggestions (as usual, right-click to access) are very good (better than MS-Word’s when I downloaded the files).
    3. Sometimes you have to consult the original image, which conveniently gets put above the OCR’ed text.
    4. You can download the results as MS-Word files and, within MS-Word, remove all the scan images by searching for ^g (the Find code for graphics) and replacing with nothing.
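
The binding of scanned pages into multi-page PDFs (item 3 in the Google Apps list above) could be scripted roughly as follows. This is only a minimal sketch, not the procedure I actually used: it assumes ImageMagick is installed and its convert command is on the PATH, and the folder name "scans", the prefix "scan", the .png extension and the helper names chunk_pages and build_commands are all made up for illustration. The group size of 10 pages is taken from the rough limit mentioned above and may need adjusting to stay under the upload limit.

```python
import subprocess
from pathlib import Path


def chunk_pages(pages, pages_per_pdf=10):
    """Split the ordered page list into groups of at most pages_per_pdf."""
    return [pages[i:i + pages_per_pdf] for i in range(0, len(pages), pages_per_pdf)]


def build_commands(pages, out_prefix="scan", pages_per_pdf=10):
    """Return one ImageMagick `convert` command (as an argument list) per output PDF."""
    commands = []
    for n, group in enumerate(chunk_pages(pages, pages_per_pdf), start=1):
        out_name = f"{out_prefix}_{n:02d}.pdf"
        # `convert page1 page2 ... out.pdf` writes all input images into one PDF
        commands.append(["convert", *map(str, group), out_name])
    return commands


if __name__ == "__main__":
    # Hypothetical layout: all scanned pages sit in ./scans, named so they sort in page order
    pages = sorted(Path("scans").glob("*.png"))
    for cmd in build_commands(pages):
        subprocess.run(cmd, check=True)
```

Each resulting PDF (scan_01.pdf, scan_02.pdf, ...) can then be uploaded as one document. On newer ImageMagick installations the command may be called magick instead of convert.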