Looking for a Service that could extract individual records from a PDF and upload to Document Database system

We have a file share folder, 751 GB in size, of student records that were converted from microfiche film rolls into PDF files. There are many PDF files, and some contain more than a thousand pages of somewhat organized records. We would like a quote from a vendor that could extract each student record into its own PDF file and upload the PDFs into our Document Database System.
——————————
Marissa Lewis
Wicomico County Public Schools
——————————


Marissa,

For clarity, you are not seeking to do OCR/ICR?  You strictly want to break apart compound documents into individual documents based upon a key piece of metadata such as student ID?

Also, what are the target and source systems?

Aria


OCR would be nice, but at a minimum we just need to break the documents into individual documents based on keys such as Name, School, Graduation Year… The Target Source is just a file share that we dumped all these PDFs into. The source system is called LaserVault.

——————————
Marissa Lewis
Wicomico County Public Schools
——————————
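
The key-based split described above can be sketched as follows. This is a minimal illustration, not a vendor's method. It assumes someone (or an OCR pass) has already produced one key per page, and that a page with no legible key belongs to the record before it. `PageKey` and `split_points` are hypothetical names introduced here.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class PageKey(NamedTuple):
    """Hypothetical per-page key, read by a person or an OCR pass."""
    name: str
    school: str
    grad_year: int


def split_points(page_keys: list[Optional[PageKey]]) -> list[tuple[PageKey, int, int]]:
    """Return (key, first_page, last_page) for each run of pages sharing a key.

    Pages are 0-indexed. A page whose key is None (nothing legible) is
    assumed to continue the record that precedes it.
    """
    records: list[tuple[PageKey, int, int]] = []
    for page, key in enumerate(page_keys):
        continues_previous = key is None or (records and records[-1][0] == key)
        if continues_previous:
            if not records:
                raise ValueError("the first page must carry a key")
            prev_key, first, _ = records[-1]
            records[-1] = (prev_key, first, page)  # extend the current record
        else:
            records.append((key, page, page))  # a new key starts a new record
    return records
```

The resulting page ranges could then be handed to whatever PDF tool does the actual cutting. On inconsistent scans the hard part is producing the per-page keys, not this grouping step.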



In addition to Lorne’s questions: how consistent are the files?  Are the documents you want to split on consistent, or has the form/file/ID changed over time?
Thanks,
-Rick
——————————
Richard Molique
ECM Consultant
IQ Business Group, Inc.
804-614-6445
rmolique@…
——————————



The files are not very consistent. The majority of them are scanned images of a physical folder and all of its contents that at one point in time were in a filing cabinet inside a school. The folders go as far back as the 1920s, I believe. When a former student submits a request for their student record, we have someone scroll through thousands of pages to locate their folder. The records are somewhat grouped by School and Year, but not consistently. In addition to student records, there are random records like graduation pamphlets and detention slips. Our goal is to be able to quickly look up a record by Student Name, Graduation Year, School, SSN… I am trying to decide whether to look for a vendor that offers this service or whether we are better off hiring a summer intern to take on the task.

——————————
Marissa Lewis
Wicomico County Public Schools
——————————
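
To make the lookup goal above concrete, here is a minimal sketch of a searchable index, assuming each record has already been split into its own PDF and keyed by name, school, and graduation year. The table layout, `build_index`, and `find_records` are hypothetical and are not part of LaserVault or any product named in this thread. SSN is left out of the sketch because it would need access controls that go beyond a simple example.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory index from (name, school, grad_year, pdf_path) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "name TEXT, school TEXT, grad_year INTEGER, pdf_path TEXT)"
    )
    # Case-insensitive index so 'jane doe' and 'Jane Doe' find the same rows.
    conn.execute("CREATE INDEX idx_name ON records (name COLLATE NOCASE)")
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_records(conn, name, grad_year=None):
    """Return the PDF paths for a name, optionally narrowed by graduation year."""
    sql = "SELECT pdf_path FROM records WHERE name = ? COLLATE NOCASE"
    params = [name]
    if grad_year is not None:
        sql += " AND grad_year = ?"
        params.append(grad_year)
    sql += " ORDER BY pdf_path"
    return [row[0] for row in conn.execute(sql, params)]
```

In practice the document system's own index fields would play this role. The sketch only shows that once the keys exist, the lookup itself is simple.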



Hi Marissa,

Based upon your responses, this is actually a fairly significant and involved project.  If the PDFs were reasonably consistent and you simply needed to break them apart based upon one key, then an approach like a summer student would almost certainly be the most economical and reasonable.  However, since you are looking to capture multiple pieces of metadata (name, school, grad year, etc.) to, I assume, provide them as index values for search, this goes way beyond the scope of any summer student I’ve ever run into!  For those same reasons, OCR/ICR isn’t a nice-to-have; it is mandatory.  And it will take one of the best tools available to achieve reasonable quality.

For this kind of project I would definitely recommend contacting Kofax (Kofax.com), which is indisputably one of the top solutions in the space, to find a partner within reasonable physical distance in Maryland that does doc conversion and has the quality of equipment, training, and experience to work with you on this.

I have not heard of LaserVault until now, but according to their website, the product can do ingestion.  So, once the files are converted, it sounds like all you would need to do is drop them in a ‘watch’ folder and LaserVault should be able to pick them up.  Their website also says it can do some automated routing to appropriate filing locations within its repository, so you should be OK there.

Aria
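
As an illustration of the ‘watch’ folder drop mentioned above, here is a minimal sketch that copies a finished PDF under a temporary name and renames it only once the copy is complete, so a watcher is less likely to pick up a half-written file. How LaserVault actually scans its folder is not stated in this thread. The `.part` suffix, and the idea that the watcher ignores such files, are assumptions, and `drop_into_watch_folder` is a hypothetical helper.

```python
import os
import shutil
from pathlib import Path


def drop_into_watch_folder(pdf: Path, watch_dir: Path) -> Path:
    """Copy pdf into watch_dir under a temporary name, then rename it.

    Assumption: the ingestion watcher ignores files ending in '.part'.
    """
    watch_dir.mkdir(parents=True, exist_ok=True)
    final = watch_dir / pdf.name
    if final.exists():
        # Refuse to overwrite a file that is still waiting to be ingested.
        raise FileExistsError(final)
    partial = watch_dir / (pdf.name + ".part")
    shutil.copy2(pdf, partial)   # the slow step; the file may be incomplete here
    os.replace(partial, final)   # a rename within one directory is a single step
    return final
```

Whether this precaution is needed depends on how the ingestion tool decides that a file is ready, which is worth confirming with the vendor.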

Thank you for this information, this is exactly what I needed to know.
——————————
Marissa Lewis
Wicomico County Public Schools
——————————


Marissa,

The process you need is included in some advanced capture products such as Kofax, Ephesoft, Ityx, etc.  They need to be ‘trained’ to identify where each file begins and ends and to extract data.  Depending on how varied the converted files are, the training process could be simple or very complex and costly.

The key here is to first determine whether each student record within the PDF can be identified through a training process.  If so, the tools suggested by Lorne could work, but you might be better served by breaking this document down into smaller, more identifiable components, which you can test yourself through the batch controls/processing built into Adobe Acrobat Pro.  Buying and training a system for a single 800-ish GB file probably wouldn’t be worth the cost and the extensive time it takes to train the system properly.  If you had multiple sets of large documents in the same condition, these tools would make sense.

I would suggest a technical review of the files first to see whether each ‘document type’ can be properly identified, and then determine what information you want extracted.  THEN you will know whether the tools mentioned above would be of value.  You will need this information to ‘train’ the system with the tools others have mentioned anyway, so you might as well collect it first to get a much better sense of how accurate the separation process will be.
——————————
Robert Blatt, MIT, LIT, CHPA-III
Principal Consultant, Electronic Image Designers (EID).
AIIM Fellow #175
Chair, Trustworthy Storage
Chair, Trustworthy Document Management & Assessment
Chair, ECM Implementation Guidelines
ISO Convenor: 18829, 18759, 22957, 18759
US Delegate to ISO TC/171
TC/171 Liaison Officer to TC46 SC11
TC/171 Liaison Officer to TC/272
——————————
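
One way to get the “sense of how accurate the separation process will be” suggested above is to hand-label where records start in a small sample and compare any tool's output against those labels. This is a minimal sketch under that assumption. `boundary_scores` is a hypothetical helper, and both inputs are lists of page numbers where a new record begins.

```python
def boundary_scores(predicted_starts, labeled_starts):
    """Compare predicted record-start pages with hand-labeled ones.

    Returns (precision, recall):
      precision = share of predicted starts that are real starts
      recall    = share of real starts that were found
    """
    predicted, labeled = set(predicted_starts), set(labeled_starts)
    hits = len(predicted & labeled)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(labeled) if labeled else 0.0
    return precision, recall
```

Low recall means records are being merged together, and low precision means records are being cut apart. Either number, measured on a labeled sample, gives a basis for judging whether a trained capture tool or a manual pass is realistic.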


Bob,

Very true!  It is exactly for those reasons that I suggested finding a reasonably local partner for Kofax to talk through these considerations with Marissa’s organization.  Ephesoft or Ityx, as you suggested, would also work, but Kofax has more partners, so that route might be easier.

Aria
