Cloud-focused information governance with machine learning – is there a need?

Hi all, I’d like your guidance on an idea my co-founder and I are working on in the information governance space.
We believe there is a need for a cloud-focused Information Governance SaaS solution that:

Connects to all the relevant data sources in an organization (e.g., SharePoint, Google Apps, file servers, Gmail, Slack, etc.).
Manages information in-place (no need to move the document, email, etc. to a separate system to apply policies).
Uses machine learning to help users auto-declare and auto-classify records (this reduces the manual burden on users and records managers by giving them a first pass with a high degree of accuracy).

Think of this as applying “lite-governance” to cloud data sources, with as much focus on automation as possible.

Are we on the right path? Is this a problem worth tackling?

I appreciate your input. Thanks,
– deepak
PS: We are also looking for deeper conversations with records managers to understand where the market is going. We are offering $25 Amazon gift certificates for such a validation call if you are interested. Please reach out to me one on one for that.

Deepak Balakrishna


Hi, the first thing that occurs to me upon reading your post is that it sounds like semantic modeling/content mining, and there are already tools out there that provide this, although I don’t believe any of them are SaaS yet.
The second thing was around the organizational effort to “train” the ML instance for each organization. Your post indicated a thought towards this being targeted at content stores enterprise-wide, including other cloud stores. That is potentially a LARGE, and likely daunting, undertaking. There is also the extensive human effort for parallel running for whatever period of time a particular organization requires before they “trust” the ML.
Next, of course, is the effort on your side to build, test, and deploy the connectors and authentication mechanisms required to make this viable in many enterprises, not to mention the multiple APIs you would need to work through to create the functional interop. Consider that, just to have a decent go-to-market position to start, you would likely need to integrate with most, if not all, of:

OpenText and/or Documentum and maybe FileNet
Salesforce
SharePoint (on-prem/online)
Exchange (on-prem/online)
Gmail and Google Docs
file shares which could span on-prem and multiple clouds

To do all of the above is a LOT of work/expense, especially if your desired target clients have primary content stores spanning multiple clouds as well as on-prem, because you then have to work through solutions like Azure Stack, OpenStack, etc.
I certainly support the idea of automation. I am just wary of the amount of effort needed to create what you’re describing, and then the educational effort required to convert prospects into buyers.


Hi, thanks for your detailed comments.
Setting aside for a moment the technical feasibility (which I address below), do you believe there is a need in the market for:

a) governance of cloud data sources (while keeping them in-place)
b) an aid to users and records managers (for decision-making in declaration and classification) so that it is not as manual a task as it is now.

On your points below:

1. Connectors: A lot of the data sources provide APIs that we can leverage (and others do use them – see The mechanics of manage-in-place records management tools). We hope to leverage those APIs. We won’t support all of the data sources on day one – we will do a few and add others over time. The architecture will be in place to expand quickly, but we will start with a few to prove out the platform.
2. Organizational effort to “train” the ML instance for each organization: The way we envision this is that an organization will have a few (30-50) documents for each planned class. They feed this “training set” into the system, which will extract and create the rules on behalf of the RM. The RM can accept or change these, so it is an aid to ease the setup. For the end user, this same system can guide them on what is potentially declarable. For each, a defined threshold can be applied to either classify automatically or send the item back to the RM for review. For example, an RM can say “if confidence level is less than 90%, send it to me for review; else classify it”. The system also learns over time by watching what the RM does with the manual classifications, so the number of times the RM needs to be involved should go down.
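To make the threshold idea in point 2 concrete, here is a minimal sketch of the routing step in Python. It is only an illustration under assumptions: the names `Prediction`, `route`, and `split_queue` are hypothetical, and the confidence score is assumed to come from whatever classifier ends up being used. Only the 90% rule itself comes from the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The "90%" from the RM's example rule; assumed to be configurable per organization.
REVIEW_THRESHOLD = 0.90


@dataclass
class Prediction:
    """Hypothetical output of the classifier for one document."""
    document_id: str
    record_class: str
    confidence: float  # 0.0 to 1.0


def route(prediction: Prediction, threshold: float = REVIEW_THRESHOLD) -> str:
    """Apply the rule: below the threshold goes to the RM, otherwise classify."""
    if prediction.confidence < threshold:
        return "rm_review"
    return "auto_classify"


def split_queue(
    predictions: List[Prediction], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (auto-classified, sent to the RM for review)."""
    auto: List[Prediction] = []
    review: List[Prediction] = []
    for p in predictions:
        if route(p, threshold) == "rm_review":
            review.append(p)
        else:
            auto.append(p)
    return auto, review


if __name__ == "__main__":
    batch = [
        Prediction("doc-1", "Contracts", 0.72),
        Prediction("doc-2", "Invoices", 0.95),
    ]
    auto, review = split_queue(batch)
    print("auto-classified:", [p.document_id for p in auto])
    print("for RM review:", [p.document_id for p in review])
```

The RM’s decisions on the review queue would then be the natural input for retraining, which is where the “learns over time” part would come in.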
I’d love to get your thoughts on the above.


Deepak Balakrishna

Certainly there is a need to be able to govern the content in SaaS, PaaS, and even IaaS cloud environments. That is inarguable. There are already numerous products available in the market to do this at a variety of price points and feature sets.
And almost anything that can also aid in classification (correctly!) will be a boon to any RIM staff.
I also understand the mechanics of how to train the ML. I was simply commenting on the size and steepness of the hill you’re proposing to climb from both an internal develop/test/deploy/sell perspective as well as from a client perspective to fully on-board the solution.
Obviously, if you have the talent and the start-up capital available to make it realistic, then it is for sure a worthwhile endeavor.


Deepak – This is indeed a need that companies like Haystac address today. The market space is now defined by both Forrester and Gartner as “File Analytics”.

Haystac Inc.

Hi, thanks for the heads-up! I was not aware of this segment of the market. I’ll check it out more.
Thanks,


Interesting topic – I was just discussing this with our SharePoint guru. We discussed the Records Centre vs. In-Place, and we are leaning towards the Records Centre to cover off our requirements for “records” vs. “non-records” that are housed in-place, then assigning the records series to the “record”, and then auto-classifying and attaching retention requirements.
As a RIM specialist, I feel that we need to declare “records” and manage the information throughout its lifecycle, assigning retention based on the records series. How do you autodeclare a record? I understand autoclassifying based on the record, but autodeclaring is something I don’t comprehend. Please explain – it may just be a communication difference between RIM and IT…

Hi, what we were thinking about doing is guiding the user on what is potentially declarable. Rather than calling it auto-declaration, maybe the better term is “declaration-guidance”. When we have the rules in place for classification, we can use that same information to help guide the user on what we believe is useful to declare. This is a first-pass filter to guide the user – the user will still have to accept the guidance. Of course, the user can always declare records that we have not deemed useful (based on our current knowledge and ruleset). And they can ignore our guidance as well.

Over time, our knowledge and ruleset should get better as the learning algorithm learns from the choices made by the user and the RM, and hence the guidance for declaration should improve. It creates a positive feedback loop.

Let me know if this is clear – and what your thoughts are on it. If possible, I’d like to speak with you one on one on this.
– deepak

Deepak Bal

I totally understand your intent and it would be wonderful (in a perfect world!). Yes, please call me and we can chat… my work phone is 403-781-2823

Gibson Energy
