Auto-Categorization Tools

Does anyone have a list of the top 3-5 Auto-Categorization tools? Thanks!


“Top” is probably pretty subjective to most folks. And, obviously, there are a swack of different criteria that each organization will need to consider when deciding which products/vendors fall under such a label. Not the least of these, by any means, is budget.

That said, here are a few of my suggestions:

  • ConceptSearching’s suite (they have multiple tools that can work in concert)

  • ShinyDocs

  • OpenText (of course)

  • AvePoint

  • Alfresco

  • Hyland has one in their OnBase suite

There are others of course, but I would personally rate these as some of the best.

That is a pretty large net…what are your requirements?
What are you categorizing?
What is the volume?
How old is it, or is it all newborn content? (a good place to start in many cases BTW)
Is there business value in the data you need to classify? (if so, now is a good time to think about leveraging it)
Is the content clear of ROT or is that something that needs to be addressed as well?

Those are some questions I can think of off the top of my head and on one cup of coffee but should help in finding the right direction.


IQ Business Group, Inc.

There are 2 approaches to automating the identification of your documents: categorization and classification. Categorization groups similar documents together and assigns metadata to the groups; the files in each group inherit the metadata tagged to that group. Classification tags the characteristics of documents on a file-by-file basis. Categorization and classification vendors are typically repository agnostic and work across multiple heterogeneous repositories. Here are some vendors for categorization and classification:


  • DocAuthority

  • Everteam

  • ConceptSearching

  • BigID


  • ActiveNavigation

  • StoredIQ

  • Titus

  • Varonis

  • Nuix
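To make the distinction between the two approaches concrete, here is a minimal Python sketch. It is illustrative only and does not reflect how any of the vendors above implement their products: the `Document` class, the grouping rule, the metadata values and the SSN-style pattern are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    text: str
    metadata: dict = field(default_factory=dict)


def categorize(docs, group_key, group_metadata):
    """Categorization: group documents, then each file inherits its group's metadata."""
    groups = defaultdict(list)
    for doc in docs:
        groups[group_key(doc)].append(doc)
    for key, members in groups.items():
        for doc in members:
            doc.metadata.update(group_metadata.get(key, {}))
    return dict(groups)


def classify(docs, taggers):
    """Classification: tag each file individually from its own characteristics."""
    for doc in docs:
        for tag, predicate in taggers.items():
            doc.metadata[tag] = predicate(doc)


# Hypothetical sample content.
docs = [
    Document("invoice_001.txt", "Invoice total due: $400"),
    Document("invoice_002.txt", "Invoice for Jane, SSN 123-45-6789"),
    Document("memo.txt", "Lunch on Friday"),
]

# Hypothetical grouping rule and group-level metadata.
groups = categorize(
    docs,
    group_key=lambda d: "invoice" if d.name.startswith("invoice") else "other",
    group_metadata={"invoice": {"record_type": "financial", "retention_years": 7}},
)

# Hypothetical per-file tagger: does this file contain an SSN-like string?
ssn_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
classify(docs, {"contains_ssn": lambda d: ssn_pattern.search(d.text) is not None})

for doc in docs:
    print(doc.name, doc.metadata)
```

Both invoices pick up the same group metadata, while the `contains_ssn` tag is decided separately for every file.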

I work for DocAuthority. Our product is AI based and automates the categorization process. If you want, I would be glad to discuss our product with you.


Alan Weintraub

Can you provide some insight on the error rate for Classification and the error rate for Categorization? I would think you would have some statistics for both New Content Only and also a scenario with Historical Content. I would think most readers would also be interested in stats for scanned documents, if those can be part of a stats matrix.

Some amount of error is likely to be more acceptable in certain document type situations than in others and readers can then be armed to make a good choice of what to put through the engine without further review.

Thank you

Millennia Group LLC

The error rates depend on the type of technology used to classify or categorize the documents.
There are 3 types of technologies typically used in the classification and categorization process:

  • Pattern matching and regular expressions

  • Machine learning

  • Artificial Intelligence

I don’t have any quantitative evidence on the accuracy rates of each of the technologies except for DocAuthority, where I currently work. My experience has been that pattern matching has the lowest accuracy, followed by machine learning, with AI having the highest accuracy. The one caveat is that the accuracy of machine learning increases with the amount of time spent teaching the software the different document types. DocAuthority has stated that our accuracy is 99.99%, which translates to 1 error in 10,000 documents. Hope this helps. Please feel free to contact me at alan.weintraub@… if you want to discuss further.
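For anyone who wants to scale the stated accuracy to their own volume, the arithmetic can be sketched in a few lines of Python. The 99.99% figure is the vendor's stated number from the comment above; the function name and the 2,500,000-document volume are hypothetical and used only for illustration.

```python
from fractions import Fraction


def expected_errors(accuracy_percent: str, volume: int) -> Fraction:
    """Expected number of mis-tagged documents for a given accuracy and volume.

    The accuracy is passed as a string so that Fraction keeps it exact
    (no floating-point rounding on 99.99).
    """
    error_rate = (100 - Fraction(accuracy_percent)) / 100
    return error_rate * volume


# Stated figure: 99.99% accuracy over 10,000 documents.
print(expected_errors("99.99", 10_000))      # 1

# Hypothetical larger repository at the same stated accuracy.
print(expected_errors("99.99", 2_500_000))   # 250
```

The same function can be used to compare how much manual review a lower accuracy would imply for a given document type.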



Thank you Alan,

That is much better than I had anticipated and definitely makes it worthwhile to include consideration of auto-categorization in any project scope.


Millennia Group LLC
