RECMGMT-L Archives

Records Management

RECMGMT-L@LISTSERV.IGGURU.US

Sender: Records Management Program <[log in to unmask]>
Date: Fri, 9 Feb 2007 07:07:27 -0800
Reply-To: Records Management Program <[log in to unmask]>
Subject:
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
In-Reply-To:
Content-Type: text/plain; charset="us-ascii"
From: Linda Buss <[log in to unmask]>
Parts/Attachments: text/plain (129 lines)

Patrick,

You have certainly answered my question.  I have implemented EDMS
(FileNet and OmniRIM) with Kofax Ascent Capture for two employers, and one
of the requirements has been that the software be able to identify
duplicate documents.  My experience has been exactly as you stated: a human
needs to be involved in determining whether a document has already been
captured.

The documents in question are forms submitted by the customer that the
client wants to scan and capture electronically.  This client wants a fully
automated system that will eliminate the need for the scan technician to be
too involved.  Even with OCR'd files there will be differences in file
size, byte content, and formatting, so, as you pointed out, there will be
little benefit in having the software make the determination.

I just wanted to be sure I was not giving my client poor advice.

Thanks for sharing your wisdom.

Linda Buss


-----Original Message-----
From: Records Management Program [mailto:[log in to unmask]] On Behalf
Of Patrick Cunningham
Sent: Friday, February 09, 2007 6:32 AM
To: [log in to unmask]
Subject: Re: [RM] DM software question

--- Linda Buss <[log in to unmask]> wrote:

> I am looking for a document management software that will interface
> with a scanner.  This software must be able to recognize duplicate
> documents.  Has anyone heard of this feature before?

Linda, could you be a little more specific?

To my mind, you're talking about apples and oranges to some extent.

To me, document management software (and I'll readily admit that I am
somewhat "old school") deals primarily with MS Office-like documents
(Word, PowerPoint and so forth). You can store other document types in
the repository, but they are stored as objects (i.e. a static
uneditable file).

If you use a scanner, you will have some capture software that
typically allows you to store the captured images in the repository.

Broadly, an integrated system is what the industry likes to call an
Electronic Document Management System (EDMS) or even an Electronic
Content Management System (ECMS).

Identifying "duplicate documents" is a formidable task. So let me
dissect that for a minute. You have Office documents and you have
scanned documents in the repository. On the Office document side, it is
possible that the system could recognize that an exact duplicate is
being placed into the repository. However, you would have to decide
what parameters constitute an exact duplicate -- and when there are
exceptions. In my mind, an exact duplicate would be a document with the
exact same number of bytes, the same file name, and the same date and
time stamps. Any variance from those parameters would be a version of
the original. So let's take that apart... you create a document called
RMPOLICIES.doc and send it to me. You then put the document into the
EDMS. I open the document and make two changes; I change a colon to a
semicolon and a semicolon to a colon. I save the document and place it
into the EDMS. In theory, the documents are close to identical, but the
time stamps have changed, the metadata about the author should have
changed, and there are two hard-to-see character changes. The system
may compare documents and decide that my document is a new version of
yours, or simply treat it as a unique new document. So let's say that
you sent me the document and immediately put it into the EDMS and I do
the same thing before I even open the document. Now you should have two
exact duplicates in the system -- only the EDMS metadata will be
different. I think that you could have a requirement that would
identify the two latter documents as duplicates, but not the two former
ones. The systems aren't that subjectively smart and writing code to
really "read" every document and recognize "duplicates" (at least to
human eyes) would be challenging.
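
As a minimal sketch of the "exact duplicate" test described above, the
following Python compares byte count, file name, and time stamp. The
SHA-256 digest is an added assumption, not something the message
specifies; it is included because the colon/semicolon swap leaves the
byte count unchanged. The StoredFile record and its field names are
hypothetical and do not correspond to any particular EDMS.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFile:
    # Hypothetical record for a file placed into the repository.
    name: str
    modified: str   # file time stamp as recorded by the source system
    content: bytes


def fingerprint(f: StoredFile) -> tuple:
    # Byte count, file name, and time stamp, plus a SHA-256 digest of the
    # bytes (the digest is an assumption added for this sketch).
    return (len(f.content), f.name, f.modified,
            hashlib.sha256(f.content).hexdigest())


def is_exact_duplicate(a: StoredFile, b: StoredFile) -> bool:
    return fingerprint(a) == fingerprint(b)


original = StoredFile("RMPOLICIES.doc", "2007-02-09T06:00:00",
                      b"Policy: keep; review")
untouched_copy = StoredFile("RMPOLICIES.doc", "2007-02-09T06:00:00",
                            b"Policy: keep; review")
# Colon and semicolon swapped: same byte count, different bytes and time.
edited_copy = StoredFile("RMPOLICIES.doc", "2007-02-09T06:30:00",
                         b"Policy; keep: review")

print(is_exact_duplicate(original, untouched_copy))  # True
print(is_exact_duplicate(original, edited_copy))     # False
```

A test like this only answers the narrow question of byte-for-byte
identity. Whether the edited copy counts as a new version or a new
document remains a policy decision.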

Now let's look at images. Even if you scanned the same piece of paper
twice, it is highly unlikely that you would get two documents with the
same byte count. Remember, imaging a document is just taking a
picture of the document -- the system maps the bits and doesn't
understand the content. Even if you performed OCR (optical character
recognition) on the documents, it would be nearly impossible for a
computer system to recognize that two documents were identical. And if
I put a scanned version of a document into an EDMS where the Word
version exists, I can't imagine the system ever being subjectively
smart enough to recognize that the documents were identical. There are
still processes that the human grey matter is better equipped to deal
with. That said, when it comes to comparing minute detail between
electronic documents, there are many hidden elements that are best left
to a machine to evaluate.
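
One hedged way to combine the two points above is to let the machine
score how similar two OCR text outputs are and leave the final call to
a person. The sketch below uses Python's standard difflib module. The
0.9 threshold, the function names, and the sample text are assumptions
made for illustration; it flags candidates for human review and does
not decide that two scans are duplicates.

```python
import difflib
import re


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not count.
    return re.sub(r"\s+", " ", text.lower()).strip()


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized texts are identical.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def needs_human_review(a: str, b: str, threshold: float = 0.9) -> bool:
    # The threshold is an arbitrary assumption; tune it on real scans.
    return similarity(a, b) >= threshold


scan_1 = "Customer Name: Jane Doe\nAccount 12345"
scan_2 = "Customer  Name: Jane D0e\nAccount 12345"   # OCR read "o" as "0"
other = "completely different form"

print(needs_human_review(scan_1, scan_2))  # True
print(needs_human_review(scan_1, other))   # False
```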

About the only way that you can use the machine to accurately determine
duplicates would be to bar code your documents with a unique
identifier. Problem is, if I put the bar code on a PDF document (for
the sake of this example), mail out a paper copy, then expect it to be
returned and scanned, all I know about the incoming scanned document is
that it has the same bar code as the original PDF. If someone wrote a
comment on the document, crossed out some words, etc., you have no way
for the system to recognize the differences in any meaningful way.
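
The bar code approach amounts to a lookup on a unique identifier. A
minimal sketch in Python, assuming a hypothetical BarcodeRegistry class
and made-up identifier values (capture software would supply the decoded
bar code in practice):

```python
from typing import Dict, Optional


class BarcodeRegistry:
    """Hypothetical index of bar code values to original document IDs."""

    def __init__(self) -> None:
        self._by_barcode: Dict[str, str] = {}

    def register_outgoing(self, barcode: str, doc_id: str) -> None:
        # Record the identifier printed on the form before it is mailed out.
        self._by_barcode[barcode] = doc_id

    def match_incoming(self, barcode: str) -> Optional[str]:
        # Return the original document ID, or None for an unknown bar code.
        return self._by_barcode.get(barcode)


registry = BarcodeRegistry()
registry.register_outgoing("FORM-000123", "original-pdf-42")

print(registry.match_incoming("FORM-000123"))  # original-pdf-42
print(registry.match_incoming("FORM-999999"))  # None
```

The match only establishes that the returned scan carries the same bar
code as the original. It says nothing about handwritten comments or
crossed-out words on the page.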

So after all that, the short answer is that it is highly unlikely that
you can identify duplicates in most EDMS without some human
intervention.

I'm certain that others will enlighten me if I'm a generation behind
current software capabilities. And if I missed what you were really
looking for, please respond and we'll try to help you out.



Patrick Cunningham, CRM
[log in to unmask]

"Those who would give up essential liberty to purchase a little temporary
safety deserve neither liberty nor safety."
Benjamin Franklin, Historical Review of Pennsylvania, 1759

List archives at http://lists.ufl.edu/archives/recmgmt-l.html
Contact [log in to unmask] for assistance
