RECMGMT-L Archives

Records Management

RECMGMT-L@LISTSERV.IGGURU.US

Content-Transfer-Encoding: 7bit
Sender: Records Management Program <[log in to unmask]>
Subject:
From: Margaret Duncan <[log in to unmask]>
Date: Sun, 11 Feb 2007 16:23:03 -0500
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0
Reply-To: Records Management Program <[log in to unmask]>
Parts/Attachments: text/plain (154 lines)

I agree with what Patrick has said.  However, in my previous life in the
legal department at a financial institution, I became aware of software in
the litigation arena that claims to identify duplicates and either remove
them from the review process or flag them during litigation.  How it works
I'm not sure.  It might be possible to design something similar for DM
systems, or to have litigation software that works with the DM.

Meg Duncan, CRM
Milford, MA

----- Original Message ----- 
From: "Linda Buss" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Friday, February 09, 2007 10:07 AM
Subject: Re: DM software question


> Patrick,
>
> You have certainly answered my question.  I have implemented EDMS systems
> (FileNet and OmniRIM) with Kofax Ascent Capture for two employers and one
> of the requirements has been that the software be able to identify
> duplicate documents.  My experience has been exactly as you stated: a
> human needs to be involved in the process of determining if a document
> has already been captured.
>
> The documents in question are forms submitted by the customer that the
> client wants to scan and capture electronically.  This client wants a
> fully automated system that will eliminate the need for the scan
> technician to be too involved.  Even with OCR'd files there will be
> differences in file size, bits/bytes, and formatting, so, as you pointed
> out, there will be little benefit in having the software make the
> determination.
>
> I just wanted to be sure I was not giving my client poor advice.
>
> Thanks for sharing your wisdom.
>
> Linda Buss
>
>
> -----Original Message-----
> From: Records Management Program [mailto:[log in to unmask]] On Behalf
> Of Patrick Cunningham
> Sent: Friday, February 09, 2007 6:32 AM
> To: [log in to unmask]
> Subject: Re: [RM] DM software question
>
> --- Linda Buss <[log in to unmask]> wrote:
>
> > I am looking for a document management software that will interface
> > with a scanner.  This software must be able to recognize duplicate
> > documents.  Has anyone heard of this feature before?
>
> Linda, could you be a little more specific?
>
> To my mind, you're talking about apples and oranges to some extent.
>
> To me, document management software (and I'll readily admit that I am
> somewhat "old school") deals primarily with MS Office-like documents
> (Word, PowerPoint and so forth). You can store other document types in
> the repository, but they are stored as objects (i.e. a static
> uneditable file).
>
> If you use a scanner, you will have some capture software that
> typically may allow you to store the captured images in the repository.
>
> Broadly, an integrated system is what the industry likes to call an
> Electronic Document Management System (EDMS) or even an Electronic
> Content Management System (ECMS).
>
> Identifying "duplicate documents" is a formidable task. So let me
> dissect that for a minute. You have Office documents and you have
> scanned documents in the repository. On the Office document side, it is
> possible that the system could recognize that an exact duplicate is
> being placed into the repository. However, you would have to decide
> what parameters constitute an exact duplicate -- and when there are
> exceptions. In my mind, an exact duplicate would be a document with the
> exact same number of bytes, the same file name, and the same date and
> time stamps. Any variance from those parameters would be a version of
> the original. So let's take that apart... you create a document called
> RMPOLICIES.doc and send it to me. You then put the document into the
> EDMS. I open the document and make two changes; I change a colon to a
> semicolon and a semicolon to a colon. I save the document and place it
> into the EDMS. In theory, the documents are close to identical, but the
> time stamps have changed, the metadata about the author should have
> changed, and there are two hard-to-see character changes. The system
> may compare documents and decide that my document is a new version of
> yours, or simply treat it as a unique new document. So let's say that
> you sent me the document and immediately put it into the EDMS and I do
> the same thing before I even open the document. Now you should have two
> exact duplicates in the system -- only the EDMS metadata will be
> different. I think that you could have a requirement that would
> identify the two latter documents as duplicates, but not the two former
> ones. The systems aren't that subjectively smart and writing code to
> really "read" every document and recognize "duplicates" (at least to
> human eyes) would be challenging.
>
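
A minimal sketch of the "exact duplicate" test discussed above, for readers who want to see it concretely. It compares raw bytes through a SHA-256 digest, which is a technique swapped in here for illustration and not something the thread names. It deliberately ignores file name and time stamps, so it is narrower than the criteria Patrick lists. The function names and sample strings are hypothetical.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def is_exact_duplicate(a: bytes, b: bytes) -> bool:
    # Equal length plus equal digest: byte-for-byte identical
    # for all practical purposes.
    return len(a) == len(b) and content_digest(a) == content_digest(b)


# Hypothetical stand-ins for the contents of RMPOLICIES.doc.
original = b"Records policy: retain for 7 years; then destroy."
untouched_copy = bytes(original)  # filed without being opened
# Colon and semicolon swapped: same byte count, different content.
edited_copy = b"Records policy; retain for 7 years: then destroy."

print(is_exact_duplicate(original, untouched_copy))
print(is_exact_duplicate(original, edited_copy))
```

The edited copy has the same length as the original, so a byte count alone would not separate them; the digest does. Whether the edited copy should count as a "version" or a new document is still a policy decision that the code does not make.
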
> Now let's look at images. Even if you scanned the same piece of paper
> twice, it is highly unlikely that you would get two documents with the
> same byte count. Remember, imaging a document is just taking a
> picture of the document -- the system maps the bits and doesn't
> understand the content. Even if you performed OCR (optical character
> recognition) on the documents, it would be nearly impossible for a
> computer system to recognize that two documents were identical. And if
> I put a scanned version of a document into an EDMS where the Word
> version exists, I can't imagine the system ever being subjectively
> smart enough to recognize that the documents were identical. There are
> still processes that the human grey matter is better equipped to deal
> with. That said, when it comes to comparing minute detail between
> electronic documents, there are many hidden elements that are best left
> to a machine to evaluate.
>
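
As a rough illustration of why OCR output from two scans rarely matches exactly, and of how software might still flag likely duplicates for a person to confirm, the sketch below scores text similarity with Python's standard `difflib.SequenceMatcher`. This is an assumption about one possible approach, not a description of how any EDMS or litigation product works. The sample OCR text, the function names, and the 0.90 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off; a real system would need to tune this.
REVIEW_THRESHOLD = 0.90


def normalize(text: str) -> str:
    # Collapse whitespace and case so layout differences do not count.
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two OCR text strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def needs_human_review(a: str, b: str) -> bool:
    # High similarity only nominates a pair; a person still decides.
    return similarity(a, b) >= REVIEW_THRESHOLD


# Two scans of the same form, with typical OCR misreads in the second.
scan_one = "Customer Name: Jane Doe\nAccount: 12345\nRequest: address change"
scan_two = "Customer Narne: Jane Doe\nAccount: 12345\nRequest:  address chanqe"
unrelated = "Invoice 2007-02 for office supplies, total due 30 days"

print(scan_one == scan_two)
print(needs_human_review(scan_one, scan_two))
print(needs_human_review(scan_one, unrelated))
```

The two scans are not equal as strings, so an exact comparison fails, but the similarity score is high enough to queue them for review. The final call on whether they are duplicates stays with the scan technician.
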
> About the only way that you can use the machine to accurately determine
> duplicates would be to bar code your documents with a unique
> identifier. Problem is, if I put the bar code on a PDF document (for
> the sake of this example), mail out a paper copy, then expect it to be
> returned and scanned, all I know about the incoming scanned document is
> that it has the same bar code as the original PDF. If someone wrote a
> comment on the document, crossed out some words, etc., you have no way
> for the system to recognize the differences in any meaningful way.
>
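
The bar code limitation described above can be sketched as a simple lookup. The registry, function names, and status strings below are hypothetical, and reading the bar code from the scan is assumed to have happened already. The point of the sketch is that a matching identifier says which document came back, not whether anyone marked it up.

```python
import hashlib

# Hypothetical registry: bar code identifier -> digest of the outgoing file.
registry: dict = {}


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def register_outgoing(barcode: str, data: bytes) -> None:
    """Record the document at the moment it is mailed out."""
    registry[barcode] = digest(data)


def check_incoming(barcode: str, data: bytes) -> str:
    """Classify a returned scan by its bar code and raw bytes."""
    stored = registry.get(barcode)
    if stored is None:
        return "unknown-barcode"
    if stored == digest(data):
        return "same-id-identical-bytes"
    # A rescanned paper copy will almost always land here, whether or
    # not anyone wrote on it, so the result only means "look at it".
    return "same-id-needs-review"


register_outgoing("DOC-0001", b"%PDF original form bytes")

print(check_incoming("DOC-0001", b"%PDF original form bytes"))
print(check_incoming("DOC-0001", b"scanned image bytes of returned form"))
print(check_incoming("DOC-9999", b"anything"))
```
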
> So after all that, the short answer is that it is highly unlikely that
> you can identify duplicates in most EDMS, without some human
> intervention.
>
> I'm certain that others will enlighten me if I'm a generation behind
> current software capabilities. And if I missed what you were really
> looking for, please respond and we'll try to help you out.
>
>
>
> Patrick Cunningham, CRM
> [log in to unmask]
>
> "Those who would give up essential liberty to purchase a little temporary
> safety deserve neither liberty nor safety."
> Benjamin Franklin, Historical Review of Pennsylvania, 1759
>
> List archives at http://lists.ufl.edu/archives/recmgmt-l.html
> Contact [log in to unmask] for assistance