The simplest OCR options you have here

  1. (Staff:) Using the departmental scanner, which outputs PDF to a network share (that you can link from your desktop). The PDF is at least searchable.
  2. (Staff & Students:) Using only your desktop, at work or at home:
    1. MS-Office
      1. OneNote 2007/2010: paste the image, right-click to access the context menu, then choose “extract text”. This is quick and simple, but not error-free.
      2. Imaging components: TBA
    2. Google Apps can also OCR the files you upload to Google Docs.
      1. You first need to change the default settings. Click the hard-drive icon for file uploads, then in the context menu choose “Settings” / “Convert text from uploaded PDF and image files”.
      2. If you want to upload an entire folder, you need to either use Chrome or allow the installation of a Java applet.
      3. You may not want to deal with one Google Doc for each image you upload. So bind your scanned pages into multi-page PDFs (unless your scanning software already allows this; I have been restricted to “Windows Scan and Fax”). ImageMagick’s convert command can do it for free. Note that the maximum upload size in Google Docs is 2 MB, which restricted me to about 10 pages per document. This is strange, since I had scanned to black and white at a very small size, but the PDF size grew, likely because of a less efficient encoding; it might be possible to optimize this.
      4. Google Apps uses the same OCR engine as Google Books. Not much formatting is retained (note the line breaks in the output), but that is fine for me, since I am only after large chunks of text for further processing.
      5. I have only tested English (largely current affairs) text, but was impressed with the OCR results.
        1. Also, where the OCR went wrong (2 to 4 times per page; also some artifacts, as my scans were not very clean: Google Apps seems to handle dark spots on the page better than unstraightened lines),
        2. the proofreading suggestions (as usual, right-click to access) are very good (better than MS-Word’s when I downloaded the files).
        3. Sometimes you have to consult the original image, which conveniently gets put above the OCR’ed text.
        4. You can download the results as MS-Word files and, within MS-Word, remove all the scan images using ^g (the code for graphics in Find and Replace).
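
The binding step above (scanned pages into multi-page PDFs of about 10 pages each) can be scripted. Below is a minimal Python sketch that only builds the ImageMagick command lines; it does not run them. The file names, the output folder and the helper names `batch_pages` and `build_convert_commands` are my own assumptions for illustration. The optional `-compress Group4` setting is a guess at how the PDF size might be reduced for black-and-white scans; I have not verified that it gets more pages under the 2 MB limit.

```python
from typing import List, Sequence


def batch_pages(pages: Sequence[str], pages_per_pdf: int = 10) -> List[List[str]]:
    """Split page image paths into groups of at most `pages_per_pdf`.

    Pages are sorted by name, so this assumes zero-padded file names
    such as page_001.png (hypothetical naming).
    """
    if pages_per_pdf < 1:
        raise ValueError("pages_per_pdf must be at least 1")
    ordered = sorted(pages)
    return [list(ordered[i:i + pages_per_pdf])
            for i in range(0, len(ordered), pages_per_pdf)]


def build_convert_commands(pages: Sequence[str],
                           out_dir: str = "pdf",
                           pages_per_pdf: int = 10,
                           group4: bool = False) -> List[List[str]]:
    """Build one `convert page1 page2 ... out.pdf` command per batch."""
    commands = []
    for n, batch in enumerate(batch_pages(pages, pages_per_pdf), start=1):
        cmd = ["convert", *batch]
        if group4:
            # Assumption: bilevel (black-and-white) scans; untested on the
            # scans discussed in this post.
            cmd += ["-compress", "Group4"]
        cmd.append(f"{out_dir}/scan_{n:03d}.pdf")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Dry run with made-up file names: print the commands instead of running them.
    example_pages = [f"page_{i:03d}.png" for i in range(1, 24)]
    for command in build_convert_commands(example_pages):
        print(" ".join(command))
```

To actually create the PDFs, each command list could be passed to `subprocess.run(command, check=True)`, provided ImageMagick is installed and the output folder exists. On newer ImageMagick versions the executable may be `magick` rather than `convert`. It is still worth checking each resulting PDF against the 2 MB limit before uploading, since the size depends on the encoding.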