Disappearing Documents Root Cause?

Posted by Marlita Kahn

Hi,

Do any of you know of a case in which an entire directory, along with its sub-directories and their files, has simply disappeared? What was the story? Did you find a root cause? We recently had such an incident with Perceptive Content, and a smaller, similar incident a couple of years ago, and we are at a dead end in finding a root cause. So we are reaching out to see if others have had a similar experience with a document management system, whether Perceptive Content or another.
Thank you
Marlita Kahn
Service Manager
IST – API
University of California, Berkeley
2195 Hearst Avenue
Berkeley, CA 94720-4876
415-760-5882 (mobile)
marlita@… 


Hyland wasn’t able to determine cause?

Sincerely,


(sent via webmail)



Thank you for the rapid response, Lorne.

No, Hyland was not. They are working with us daily and were on our team when the files disappeared, when we discovered it, and during our efforts to find the root cause. They say they’ve never come across this before.

Marlita Kahn
Service Manager
IST – API
University of California, Berkeley
2195 Hearst Avenue
Berkeley, CA 94720-4876
415-760-5882 (mobile)
marlita@…


Sorry for what is likely a dumb question, but did you/they have a DBA look at the DB itself?

Sincerely,


(sent via webmail)


Yes, the DB is fine. We did extensive comparisons between the DB entries and the OSM files to get a final count and identity of what went missing.
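For illustration only, a DB-to-OSM comparison of that sort might look something like the Python sketch below. The connection DSN, table name, column name, and OSM path are placeholders, not the actual Perceptive Content schema.

import os

import pyodbc  # any DB API module would do; pyodbc is just an example

OSM_ROOT = r"\\nas01\osm"  # hypothetical OSM root

# Placeholder DSN/credentials and a placeholder query -- the real
# Perceptive Content tables and columns will be named differently.
conn = pyodbc.connect("DSN=perceptive;UID=readonly;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT relative_path FROM doc_objects")
db_paths = {row.relative_path.replace("/", "\\").lower() for row in cursor}

# Walk the OSM tree and collect the files actually present on disk.
disk_paths = set()
for dirpath, _dirs, filenames in os.walk(OSM_ROOT):
    for name in filenames:
        rel = os.path.relpath(os.path.join(dirpath, name), OSM_ROOT)
        disk_paths.add(rel.lower())

missing_on_disk = sorted(db_paths - disk_paths)   # DB rows with no backing file
orphans_on_disk = sorted(disk_paths - db_paths)   # files with no DB row

print(len(missing_on_disk), "files referenced in the DB but missing on disk")
print(len(orphans_on_disk), "files on disk with no DB reference")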

Marlita Kahn
Service Manager
IST – API
University of California, Berkeley
2195 Hearst Avenue
Berkeley, CA 94720-4876
415-760-5882 (mobile)
marlita@…


Huh.  I assume you called Rod Serling then?  When does the episode air?
Seriously, in 20+ years in IT, that is the 2nd most bizarre thing I’ve heard of.  Oracle simply doesn’t just lose data.  That statement is roughly the IT equivalent of “the sun is fairly likely to rise in the east tomorrow”.

Sincerely,



 


Yeah, Oracle didn’t lose data. The data is still there.  It was the actual directory and the files that disappeared.



A mystifying aspect is that there should be some sort of checksum balance check between the DB and the directories and files it manages as blobs, and when that check didn’t balance there should be an indication in the log files.  Obviously I’m making an assumption that Hyland followed best practice, since I’m not familiar with that solution.
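For illustration, the kind of checksum reconciliation described above might look like the Python sketch below: it hashes each blob on disk and logs any file that is missing or whose hash no longer matches the value recorded with its DB row. The record format and log file name are assumptions, not anything Perceptive Content actually exposes.

import hashlib
import logging
import os

logging.basicConfig(filename="osm_integrity.log", level=logging.INFO)

def sha256_of(path, chunk_size=1 << 20):
    """Hash a blob in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(records, osm_root):
    """records: iterable of (relative_path, expected_sha256) pulled from the DB."""
    for rel_path, expected in records:
        full = os.path.join(osm_root, rel_path)
        if not os.path.exists(full):
            logging.error("MISSING FILE: %s", rel_path)
        elif sha256_of(full) != expected:
            logging.error("CHECKSUM MISMATCH: %s", rel_path)
        else:
            logging.info("OK: %s", rel_path)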
As to what could actually cause it though?  The only thing I can think of, despite Hyland’s claims, is that there is a bug, triggered only under a very specific and rare set of circumstances, that somehow deleted the directory and file references.
Wayyy back in 2001 I was doing an ERP implementation for a client (Great Plains, before MS bought them) and one night we were doing a moderate upgrade.  The upgrade stalled partway through.  We were able to track down the error code, which referenced an out-of-disk-space problem.  We knew that couldn’t be right, though, as the server had more than sufficient empty space for the application to expand during the upgrade and then shrink back.  Or so we thought.  When we looked in Windows Explorer we saw a directory that had no name.  Literally blank.  Under it was another single, blank directory.  And so on and so on.  We got tired of checking when we got down 70 or 80 levels or so.  When we looked at the properties, the size of EACH of the no-name directories was greater than the entire gross disk size.
Naturally, we got on the horn to MS Support.  Who, oh so helpfully, informed us that it was not possible in Windows to have either a no-name directory or a directory that uses more than the total disk size.  We went down to DOS.  SAME THING.  After hours and hours with level 1, then 2, then 3 support for Windows, SQL Server, and other groups, plus Great Plains (who naturally said GP couldn’t cause such a scenario and blamed MS, who blamed GP), and actually getting an ASSEMBLER programmer remoted in and looking at this thing at the ones-and-zeros level, MS and GP admitted that there was, in fact, a directory with tens of thousands of sub-directories under it that had no identifier of any sort and consistently reported the above-mentioned disk space usage issue.
With ALL of us scratching our heads, we eventually decided that the downtime was costing more than the loss of a day’s data so we restored from backup and started the upgrade again.  Went at least 30% faster and not a blip or beep out of the system until it completed successfully.
Years later when I heard the phrase “ghosts in the machine” I had a direct and personal correlation!

Methinks you may have a similar situation that may never be solved.  Though you could consider calling Will Smith to see if he’s available, LOL!

Sincerely,



Well, we have a script that can do the comparison between the DB and the OSM files/blobs, but it takes a long time to run and we didn’t have it at the time. We are now developing alerts to let us know if a bulk delete occurs. The problem is that we discovered the issue about a month after our backup retention had run out, so we are extending that retention for backups and logs.
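For illustration, a very simple bulk-delete alert of that sort might look like the Python sketch below: it counts files under the OSM root on each run and raises an alarm if the count drops sharply since the previous run. The path, threshold, state file, and alert hook are all placeholders.

import json
import os
import time

OSM_ROOT = r"\\nas01\osm"   # hypothetical OSM location
STATE_FILE = "osm_count.json"
DROP_THRESHOLD = 500        # alert if this many files vanish between runs

def count_files(root):
    return sum(len(files) for _root, _dirs, files in os.walk(root))

def check_once():
    current = count_files(OSM_ROOT)
    previous = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            previous = json.load(f).get("count")
    if previous is not None and previous - current >= DROP_THRESHOLD:
        # Hook a real alert in here (email, pager, monitoring system).
        print("ALERT: OSM file count fell from", previous, "to", current)
    with open(STATE_FILE, "w") as f:
        json.dump({"count": current, "checked_at": time.time()}, f)

if __name__ == "__main__":
    check_once()   # run from a scheduled task every few minutes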
Here’s what we think happened. The last edit date for the OSM was the same time that we were doing a major upgrade. Literally. So people were in the system. One theory is that there was a fat-finger delete – oops.
The situation was complicated by the fact that we had started moving the OSMs almost 2 months earlier to new NAS storage (instead of the legacy app server), but the copy hadn’t completed before we had to pull the trigger on the upgrade because the copy process kept failing. So there was also a utility running around trying to complete the copy. I should double-check whether it was halted for the actual upgrade; I don’t remember.
Ironically enough, the upgrade was remarkably smooth, and for the first time we had Hyland running the actual upgrade (we’d usually done our upgrades without vendor support).

A couple of years ago we had a smaller loss, which we noticed the next day and which might have been triggered by a DFS process we have running to our DR instance (we saw a glitch in that at about the same time as the loss). Again, no one has a real explanation for what caused the loss, nor has anyone ever come across a similar situation.


Marlita;

Are you using replicated services for your storage environment?  I am asking because this symptom is commonly (and usually only) seen in replicated services when the source is corrupted or something goes wrong with the copy to storage location #2, which then ends up overwriting the one good copy.

Bob​

——————————
Robert Blatt, MIT, LIT, CHPA-III
Principal Consultant, Electronic Image Designers (EID).
AIIM Fellow #175
Chair, Trustworthy Storage
Chair, Trustworthy Document Management & Assessment
Chair, ECM Implementation Guidelines
ISO Convenor: 18829, 18759, 22957
US Delegate to ISO TC/171
TC/171 Liaison Officer to TC46 SC11
TC/171 Liaison Officer to TC/272
——————————



Hi Robert. If I understand your question, the answer is yes. At the time we believe the directories and files were deleted, the following were in play:
  • Files were stored on a Windows 8 server, which also hosts the Perceptive Content application, version 7.1.5
  • Our upgrade plan included a revised architecture to move the OSMs to a NAS storage device – architecture approved by Hyland
  • In preparation for the new architecture, we began to copy the files to the NAS almost 2 months in advance of the upgrade. That process choked multiple times, and our team handling it had to try many techniques before finding one that would work. As a result, the NAS was not ready in time for the upgrade, so after the upgrade we had to continue to write to our application server and copy to the NAS.
  • Additionally, we have a DFS service running from the application server to our DR setup elsewhere.

Marlita Kahn
Service Manager
IST – API
University of California, Berkeley
2195 Hearst Avenue
Berkeley, CA 94720-4876

415-760-5882 (cell)
marlita@…



Hi,

Have you considered human error, negligence, or foul play?

——————————
Marie-Andrée Poisson
CGI
——————————



Yes, we have ruled out foul play and have agreed that a “fat finger” mistake could have been the cause.

Marlita Kahn
Service Manager
IST – API
University of California, Berkeley
2195 Hearst Avenue
Berkeley, CA 94720-4876

415-760-5882 (cell)
marlita@…

