Digitisation Project Issues


Hi, I recently took on a new role in a public sector organisation and am struggling to find a solution for the deliverables from a digitisation project. I am looking to the broader AIIM Community for some advice:

. Project delivered 32,000 digitised files that are OCR capable.
. Digitised files are exact replications of current files.
. Files are so large that they are unusable.
. Error rates are really low from a Quality Control perspective.

To be useful to the business, the digital PDF files must be split, but this cannot be done by size. There are logical breaks in the files, but these are not standardised.

My question is: Has anyone ever had to manually split very large PDF files to make them usable? The last resort is to rescan the hardcopy files, but I am hoping to avoid this.

Any advice or product guidance would be appreciated.



Have you looked into the reason for the file size? Typically this is caused by high-resolution color images. Perhaps you can reduce the quality with no or minimal loss for the user. I am no tool expert, but I would expect you can achieve this through post-processing.
Kind regards,
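
To illustrate why resolution and colour depth dominate file size, here is a rough sketch of the raw (uncompressed) size of a single scanned page. The A4 page size, the 300 dpi resolution and the function name are assumptions for illustration only; actual PDF sizes will be smaller and depend on the compression the scanning software applied.

```python
MM_PER_INCH = 25.4

def raw_page_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page image."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed example: an A4 page (210 x 297 mm) scanned at 300 dpi.
colour = raw_page_bytes(210, 297, 300, 24)   # 24-bit colour
grey = raw_page_bytes(210, 297, 300, 8)      # 8-bit greyscale
bitonal = raw_page_bytes(210, 297, 300, 1)   # 1-bit black and white

print(colour, grey, bitonal)  # 26099520 8699840 1087480
```

Before compression, the same page is roughly 24 times larger in full colour than in black and white, which is why checking the colour mode of the scans is a sensible first step.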

Perhaps compression may be an option?
Depending on the file contents it may be possible to build business rules to automate the splitting of the documents into logical smaller documents. We do a bit of work within this area particularly for Local Government customers. Happy to have a conversation to discuss further. If you are interested drop me an email.

Redman Solutions

Hi James,
The other reason for large file size can be the scanning software treating documents that are colour washed as an image rather than text when there is a white border to the document. So this might be something to look for in the future.

Many scanning solutions allow files to be split at the scanning stage, and I have noticed that vendors use imported files to demonstrate this. It may be worth having a chat with the scanning software supplier to see if you can use a similar technique to split the scanned files you have into separate documents, rather than keeping a single PDF of the whole file. It will require some manual input but will be easier than rescanning the whole file and using separators.

Hope this helps

Lesley Holmes MA

I’ve split PDF files, although it was based on numbers of pages not content. I can try to look for info in my notes.
But there exists a PDF option that may help. It web optimizes the file so it will start being displayed without requiring the entire file to be downloaded. This will need web server support.
Principal IMERGE Consulting
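
As a rough way to see which of the existing files are already web optimized (the PDF specification calls this "linearized"), the sketch below checks for the linearization marker near the start of the file. It is only a heuristic and the function name is hypothetical; it relies on the fact that a linearized PDF carries a `/Linearized` entry within its first 1024 bytes.

```python
def looks_linearized(pdf_bytes):
    """Heuristic: True if the data starts like a PDF and carries the
    /Linearized marker within the first 1024 bytes."""
    head = pdf_bytes[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head

# Tiny synthetic headers for illustration (not complete PDF files).
optimized = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(optimized))  # True
print(looks_linearized(plain))      # False
```

For a real file, read the first kilobyte with `open(path, "rb").read(1024)` and pass it in.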

Hi James,
I’ve used software that could do this, but not without cost. The main issue will be the rules: if you can define a way of splitting the documents, you can probably get some software to do it. If you can’t clearly define rules for where the breaks are yourself, then you are unlikely to be able to do it automatically.
Personally, I’d be looking to define a regular expression that identifies headings or new sections and split based on that. Based on previous intelligent scanning projects I’ve done, I’d anticipate that there will be some manual work you will have to do, even if you can tackle 80% or 90% of the documents automatically.
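
A minimal sketch of the heading-based idea, assuming the OCR text of each page has already been extracted into a list of strings (for example with a PDF library of your choice). The heading pattern and the function names are hypothetical; the real rules would have to come from looking at the files themselves.

```python
import re

# Hypothetical rule: a page starts a new document when its first
# non-blank line begins with one of these headings.
HEADING = re.compile(r"(MINUTES OF|APPLICATION FORM|LETTER FROM)\b", re.IGNORECASE)

def first_line(text):
    """Return the first non-blank line of a page, or '' for an empty page."""
    stripped = text.strip()
    return stripped.splitlines()[0] if stripped else ""

def plan_splits(page_texts, pattern=HEADING):
    """Return (start, end) page index pairs, end exclusive, one per document."""
    if not page_texts:
        return []
    starts = [i for i, text in enumerate(page_texts)
              if pattern.match(first_line(text))]
    # Pages before the first recognised heading form their own document.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(page_texts)]
    return list(zip(starts, ends))

pages = [
    "MINUTES OF meeting 12 March\nPresent: ...",
    "(continued)",
    "Letter from J. Smith\nDear Sir ...",
    "page two of letter",
    "Application Form 12",
]
print(plan_splits(pages))  # [(0, 2), (2, 4), (4, 5)]
```

The page ranges can then be passed to whatever tool does the actual splitting. Files where the plan produces a single range, or very long ranges, are good candidates for the manual review mentioned above.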

Best of luck, Graham
Plymouth City Council

Somewhere in this company’s list of products you may be able to find a tool(s) that will meet your needs at quite affordable prices.
Hope that helps.


James, unfortunately, I can’t tell you how to fix your current problem. However, going forward the solution is pretty clear. See https://en.wikipedia.org/wiki/Machine-Readable_Documents

Public agencies have a particular obligation to do better in the future.

James, We’ve split files at page breaks. As per my email, contact me and we can talk about more details.

Out of curiosity, how large are they? I’m of a similar mind as Lesley, my guess is that they are color PDFs that perhaps do not need to be color? Your solution also may be as simple as saving them as a Reduced size PDF in Acrobat.
Data Direction inc.

If file compression is an option for you and saving as reduced size in Acrobat is not sufficient, you might also consider the LuraTech PDF Compressor. It uses a clever technique, splitting the images into layers (text and colour), and is definitely much cheaper than rescanning. Feel free to send me some pages and I will send them through my PDF Compressor.

Grünenthal GmbH, Aachen, Germany
