Document pre-processing ideas
We get files from our brokers (we are an insurance general agency). The files arrive in a variety of formats. We are working with a vendor on the OCR of the carrier forms. However, we’ll need to do some pre-processing first (e.g., handling password-protected PDF files, breaking PDF files bigger than 50 MB into smaller files, etc.).
Obviously we’ll do it manually up front but I’m trying to automate as much as possible.
Does anyone have any ideas or good references they can point me to?
Based on the needs you’ve described, my guess is that you’re going to have to write something custom. I’ve not heard of any software that does those specific types of activities, especially handling the password issue.
That said, there is a project on GitHub called Scantailor that MIGHT provide a head start for a development effort in this vein. You can find info by starting at scantailor.org.
Looking into Scantailor. It will not read PDFs as it’s a post-scan, pre-PDF tool. May still be useful but as we tend to get the files in PDF format it would be an added step to use Scantailor. Just wanted to let you know.
Hi, another option to check out would be AdLib Software. They can definitely handle the ingestion of the PDFs and data extraction. Whether or not their extraction engine can deal with the passwords is something to ask.
I used their software for workflow-embedded PDF gen on a project and was very impressed.
Unfortunately, that’s all I’ve got.
Thanks Lorne. I’ll check them out.
Our OCR vendor suggested ImageMagick. It can handle PDF files, so now I have two more options than I did before!
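For the 50 MB splitting step from the original question, here is a rough sketch of how a custom script might group pages into size-limited chunks. It assumes the open-source pypdf library (not something anyone in this thread has confirmed using), and the function names and the output naming scheme are made up for illustration. Page sizes are estimated by writing each page on its own, which tends to overestimate because shared resources such as fonts get counted once per page.

```python
import io
from typing import List, Sequence

MAX_BYTES = 50 * 1024 * 1024  # the 50 MB limit mentioned in the thread


def plan_chunks(page_sizes: Sequence[int], max_bytes: int = MAX_BYTES) -> List[List[int]]:
    """Greedily group page indexes so each group's estimated size stays
    within max_bytes. A single oversized page gets a chunk of its own."""
    chunks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for index, size in enumerate(page_sizes):
        # Start a new chunk when adding this page would exceed the limit.
        if current and total + size > max_bytes:
            chunks.append(current)
            current, total = [], 0
        current.append(index)
        total += size
    if current:
        chunks.append(current)
    return chunks


def estimate_page_sizes(path: str) -> List[int]:
    """Estimate each page's size by writing it alone to memory.
    Assumption: pypdf is installed (imported lazily here)."""
    from pypdf import PdfReader, PdfWriter

    sizes = []
    for page in PdfReader(path).pages:
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()
        writer.write(buffer)
        sizes.append(len(buffer.getvalue()))
    return sizes


def split_pdf(path: str, out_prefix: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Write one PDF per planned chunk; returns the output paths.
    The "<prefix>_001.pdf" naming is a hypothetical convention."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    outputs = []
    chunks = plan_chunks(estimate_page_sizes(path), max_bytes)
    for number, chunk in enumerate(chunks, start=1):
        writer = PdfWriter()
        for index in chunk:
            writer.add_page(reader.pages[index])
        out_path = f"{out_prefix}_{number:03d}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        outputs.append(out_path)
    return outputs
```

Something like split_pdf("big_submission.pdf", "big_submission_part") would be the entry point. Since the estimates are only approximate, it would be worth checking the real size of each output file before handing it to the OCR vendor.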
WRT passwords, that’ll take some more noodling. A tool, whatever it is, would have to open a PDF file, figure out that it requires a password, open another file or database to get the correct password, apply it to the PDF file, then save the PDF file without a password. I’m not familiar enough with the Adobe PDF API to know whether that is feasible, but I could see it being doable provided we have the password and it’s made available to the tool. Interesting stuff.
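The steps described above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the open-source pypdf library rather than the Adobe API, it assumes the passwords are kept as "filename,password" rows in a CSV-style source, and the function names and status strings are hypothetical.

```python
import csv
import os
from typing import Dict, Iterable, Optional


def load_passwords(lines: Iterable[str]) -> Dict[str, str]:
    """Build a lookup from "filename,password" rows.
    Keys are lower-cased base names so matching ignores case."""
    table = {}
    for row in csv.reader(lines):
        if len(row) >= 2:
            table[row[0].strip().lower()] = row[1]
    return table


def find_password(pdf_path: str, passwords: Dict[str, str]) -> Optional[str]:
    """Return the stored password for this file, or None if there is none."""
    return passwords.get(os.path.basename(pdf_path).lower())


def remove_password(src: str, dst: str, passwords: Dict[str, str]) -> str:
    """Open src, decrypt it if needed, and save an unprotected copy to dst.
    Assumption: pypdf is installed (imported lazily here)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    status = "not-encrypted"
    if reader.is_encrypted:
        password = find_password(src, passwords)
        if password is None:
            return "missing-password"  # route to manual handling
        if not reader.decrypt(password):
            return "wrong-password"  # route to manual handling
        status = "decrypted"

    # A fresh writer with no encryption set produces an unprotected file.
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(dst, "wb") as handle:
        writer.write(handle)
    return status
```

The lookup table could be loaded with open("passwords.csv", newline="") passed to load_passwords, where passwords.csv is a made-up file name. Files that come back as "missing-password" or "wrong-password" would still need someone to chase the broker for the right password.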
Can you provide an update? Were you able to successfully solve this issue? We have a similar PDF issue.
Not there yet, but my team (Enterprise Architect and Network Engineer) has the pre-processing almost done. I don’t have the name of the product or products yet, though (they are keeping it quiet until they resolve it). We’ll need to do some post-processing too, but I don’t think they’ve tackled that one yet.
We can chat once we have it wrapped up. It should be soon. Send me a note at