RECMGMT-L Archives

Records Management

RECMGMT-L@LISTSERV.IGGURU.US

From: Josef Elliott
Reply To: Records Management Program
Date: Wed, 9 Apr 2008 11:34:51 +0100
Melissa wrote:

I was asked a trick question today: How accurate is optical character recognition? I am wondering how accurate the OCR software/scanning is that you are running in your records shops. What accuracy percentage are you committing to/promising users?



Before I comment on OCR text accuracy, let me introduce myself.  My name is Josef Elliott and I am with a specialist RIM consultancy in the UK called Oyster IMS, having previously owned a litigation support bureau.  I have been lurking for some time and must say that the knowledge and expertise on show in this Listserv is very impressive.  Thank you all.  I have not, until now, felt that I could add anything meaningful but I do have a practitioner's view on the OCR text accuracy question.  So here goes.

Yes, OCR text accuracy varies from 0 to 100%, but the real question is "What is the text going to be used for?", with the options being some form of auto-indexing (typically forms recognition) or search. Putting auto-indexing to one side (as this is usually a validated and corrected process), I'd like to talk about search. In my view, no one should ever rely on OCR as the only tool for mission-critical (or even mildly important) data retrieval as, even putting aside the accuracy issue, the lack of context will scupper anything other than a simple word search. As we all know, the only way to ensure completeness in search is to use well-structured and consistently applied metadata. OCR text can, however, be a fabulous search asset when used as an additional, almost "bonus" search field alongside such metadata. For example, users can run full-text searches to find examples of document types and dates mentioning certain words or phrases (especially using tools like proximity searches or Boolean operators) and then follow this up with a more precise search using the Document Type and Date metadata fields to find all the documents needed. And, given that OCR text is practically free from both a creation and maintenance point of view, it really is silly NOT to include it as a search field, whatever the accuracy.
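To make the two-step approach concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the Record class, its field names, the sample records and the two search functions are assumptions, not the behaviour of any particular records system. It only shows the idea of using OCR text to discover which document types and dates are relevant, and then using metadata to retrieve the complete set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical record: indexed metadata plus uncorrected OCR text."""
    record_id: str
    doc_type: str
    doc_date: date
    ocr_text: str  # raw OCR output, may contain misreads


def full_text_search(records, phrase):
    """Step 1: simple case-insensitive phrase search over the OCR text."""
    needle = phrase.lower()
    return [r for r in records if needle in r.ocr_text.lower()]


def metadata_search(records, doc_type, start, end):
    """Step 2: precise search on the Document Type and Date fields."""
    return [r for r in records
            if r.doc_type == doc_type and start <= r.doc_date <= end]


# Illustrative data only. R2 contains a typical OCR misread ("purnp" for "pump").
records = [
    Record("R1", "Invoice", date(2007, 3, 1), "Payment due for pump repair"),
    Record("R2", "Invoice", date(2007, 3, 9), "Payment due for purnp repair"),
    Record("R3", "Letter", date(2007, 5, 2), "Thank you for your enquiry"),
]

# Step 1: the full-text search misses R2 because of the misread,
# but it tells us that invoices from March 2007 are what we are after.
hits = full_text_search(records, "pump repair")

# Step 2: the metadata search then returns every invoice in that period.
complete = metadata_search(records, "Invoice",
                           date(2007, 3, 1), date(2007, 3, 31))

print([r.record_id for r in hits])
print([r.record_id for r in complete])
```

The full-text search alone would have silently dropped R2; the metadata follow-up is what gives completeness.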

Notwithstanding the above, I have always found that the hardest part of providing full-text has been managing users' expectations, so it is important to ensure that there is a full understanding of just how good and how bad OCR text can be as a search aid.

Finally, beware of what it says on the tin: in the past vendors have claimed that failing to recognise a character counts as a "true" (I correctly identified that I couldn't read that!) and that only misreading a character counts as a "false". And don't forget that 90% accuracy means that 1 in 10 characters is likely to be wrong which, for words of around five characters, leads to one in every two or three words being wrong.
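For anyone who wants to check the arithmetic, here is a rough Python sketch. It assumes character errors are independent and that a typical word is five characters long; both are simplifications of mine, and the function names and the sample counts are made up for illustration. The second function shows how counting unrecognised characters as "correct" inflates the quoted figure.

```python
def word_error_rate(char_accuracy: float, word_length: int = 5) -> float:
    """Chance that a word has at least one wrong character,
    assuming each character is read independently."""
    return 1.0 - char_accuracy ** word_length


def accuracy_figures(total: int, misread: int, unrecognised: int):
    """Return (vendor_style, usable) accuracy for a batch of characters.

    vendor_style treats unrecognised characters as "true";
    usable only counts characters that were actually read correctly.
    """
    vendor_style = (total - misread) / total
    usable = (total - misread - unrecognised) / total
    return vendor_style, usable


for acc in (0.90, 0.95, 0.99):
    print(f"{acc:.0%} character accuracy -> "
          f"{word_error_rate(acc):.0%} of five-letter words wrong")
# 90% character accuracy -> 41% of five-letter words wrong
# 95% character accuracy -> 23% of five-letter words wrong
# 99% character accuracy -> 5% of five-letter words wrong

# Illustrative counts: 1000 characters, 20 misread, 80 flagged as unreadable.
vendor_style, usable = accuracy_figures(1000, 20, 80)
print(f"vendor-style: {vendor_style:.0%}, usable: {usable:.0%}")
# vendor-style: 98%, usable: 90%
```

At 90% character accuracy about 41% of five-letter words come out wrong, which sits between one in two and one in three.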
 

List archives at http://lists.ufl.edu/archives/recmgmt-l.html