How much training does machine learning require before it pays off?
Xenit Solutions NV
I completely agree with Lorne!
Which kind of machine learning are we talking about? What task(s) or process(es) are you trying to apply machine learning to?
Automatically classifying content?
Extracting specific metadata from documents?
Information Management Specialist
Ronny, I’ll speak to our hands-on experiences performing automated text extraction from multiple types of agreements using supervised learning-based SaaS. We have subscribed to one such tool for 1-1/2 years, and we recently completed a 60-day evaluation of another tool. The latter tool allows us to train the tool by uploading samples and highlighting text we want extracted.
Multiple people in our evaluation group found that about 30 samples was the minimum needed to reach at least 95% text extraction success. None of them needed more than 50 samples to reach a success rate that we expect to be time-saving once in production. Extraction misses were a mix of false positives and false negatives. In one of several tests, we saw an overall extraction accuracy of 93.8% across 500 agreements from a specific department.
Texas: Southwest Chapter
Director, ECRM Program, at an upstream and midstream oil and gas company
Dear all, thank you for the replies. Of course, you are right. It depends. But I am trying to get some more accurate estimates (if such a thing exists). In our experience, we took a document base of about 100,000 documents and trained on it with 3 different NLP toolkits (CoreNLP, OpenNLP, spaCy) to see how much precision and recall we could achieve. An initial short training run (a few hundred documents) brought us to roughly 80% for both. We then trained on 3,000 documents (which took 4 person-weeks) with a tool similar to what Ritch Tolbert has described (something we built ourselves). As a result, with CoreNLP, we achieved more than 95% precision and more than 95% recall on an independent test set (so not the same set we used for training :-).
The exercise was set up to extract any privacy information from documents, and we did not use deep learning (neural nets). It leads me to believe that if you want to reach more than 95% precision and recall, you could achieve that with the right training set when training on 1 to 3% of your document collection. I believe that as the corpus grows, your training effort will decrease. We will have more details on this in an upcoming blog post, if you are interested.
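For readers less familiar with the two metrics quoted above, here is a minimal Python sketch of how precision, recall and the "share of the corpus used for training" are computed. The function names and the raw counts are hypothetical and only there for illustration; the 3,000 and 100,000 figures are the ones from the post.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw extraction counts."""
    predicted = true_positives + false_positives   # everything the tool extracted
    actual = true_positives + false_negatives      # everything it should have extracted
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def training_fraction(labelled_docs, corpus_size):
    """Share of the corpus that was labelled for training."""
    return labelled_docs / corpus_size


# Hypothetical counts, NOT taken from the experiment described above.
p, r = precision_recall(true_positives=960, false_positives=40, false_negatives=45)
print(f"precision={p:.3f} recall={r:.3f}")

# 3,000 labelled documents out of roughly 100,000 is 3% of the collection.
print(training_fraction(3000, 100_000))
```

Precision drops with every false positive and recall drops with every false negative, which is why both are usually reported together.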
Xenit Solutions NV
Your question is too broad to give an adequate answer other than “It depends”.
- Are the algorithms appropriate for the domain? For example, a statistical algorithm can detect variables that alter the distribution of numerical factors.
- Are you training a neural network to recognize a “pattern”?
Another type of question would be: is the variability “noise” relative to the inputs, or is it “causal / correlated”?
Can you rephrase your question to be more specific?
Alan Frank, PhD
Business Process Analyst
PhD, CIP, IGP
Please do share that blog post when it’s available!
Although you’ve now narrowed down your particular interest, I’d like to have a crack at answering your original (broad) question, because it’s such a common one…
As everyone else here has said, ‘it depends’, but to add a bit more detail as to what it depends on, the key factors are (1) the problem being addressed, (2) the number of inputs, and (3) the algorithms being used. All 3 of these are inter-related.
(1) There are easy problems and hard problems for machine learning – performing OCR on characters in a single fixed font would be an example of an easy problem, and performing document classification with hundreds of unstructured document types and only subtle differences between them would be an example of a hard problem.
(2) In terms of inputs (technically referred to as the ‘dimensionality’ of the problem), the OCR example might have characters in a 16×16 grid, giving 256 inputs to the machine learning, while the document classification might have every possible word as a separate input, potentially giving 20-30 thousand dimensions.
(3) The appropriate machine learning algorithm is highly dependent on #1 and #2 – machine-learning algorithms build ‘models’ when they’re trained, and more complex problems require more powerful models. In mathematical terms, the power of a model is linked to its non-linearity. Linear models are weaker, but faster to train, and non-linear models are much more powerful, but the number of training samples, and the compute power needed to train them, can be very large. As a specific example, ‘deep learning’, which is all the rage at the moment, is very powerful (i.e. non-linear) but typically has very high training requirements, both in the number of samples needed, and in compute resources. Alongside all of this, greater numbers of inputs often also lead to higher training requirements, although there are specialist techniques to work around this.
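To make the ‘dimensionality’ in point (2) concrete, here is a minimal Python sketch that counts the inputs for both examples: a pixel grid, and a bag-of-words representation where every distinct word is one input. The function names and the two sample sentences are made up for illustration; a real document collection would give a vocabulary in the tens of thousands rather than a handful of words.

```python
def grid_inputs(width, height):
    """Number of inputs when every pixel of a character image is one input."""
    return width * height


def bag_of_words_dimensions(documents):
    """Number of inputs when every distinct word is one input."""
    vocabulary = set()
    for text in documents:
        # Naive tokenisation: lower-case and split on whitespace.
        vocabulary.update(text.lower().split())
    return len(vocabulary)


# Two made-up sentences standing in for a document collection.
docs = [
    "the lessee shall pay rent",
    "the lessor shall maintain the premises",
]

print(grid_inputs(16, 16))            # the 16x16 OCR grid from point (2)
print(bag_of_words_dimensions(docs))  # grows with every new distinct word
```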
The skill of a data scientist is in understanding all of the above (and much more!), firstly to know which is the appropriate technology to use, but then even more importantly to be able to find and implement the ‘sweet spot’ in terms of making the models powerful enough to solve the problem, but efficient enough to solve it with the minimum possible resources!
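As a toy illustration of "powerful enough, but no more": the XOR problem cannot be solved by any purely linear threshold model, yet adding a single non-linear input (the product of the two inputs) is enough. The Python sketch below brute-forces a small grid of weights, so it only demonstrates the effect on that grid; the function name, the grid and the choice of XOR are my own assumptions, not something taken from a particular product.

```python
from itertools import product

# XOR: the output is 1 only when exactly one of the two inputs is 1.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def best_accuracy(featurise, n_features):
    """Best accuracy of a threshold classifier over a small grid of weights."""
    grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    best = 0.0
    for params in product(grid, repeat=n_features + 1):
        *weights, bias = params
        correct = 0
        for inputs, label in XOR:
            features = featurise(inputs)
            score = bias + sum(w * f for w, f in zip(weights, features))
            correct += int((score > 0) == bool(label))
        best = max(best, correct / len(XOR))
    return best


# Linear model: uses the two raw inputs only.
linear = best_accuracy(lambda x: x, 2)

# Slightly more powerful model: one extra non-linear input, x1 * x2.
nonlinear = best_accuracy(lambda x: (x[0], x[1], x[0] * x[1]), 3)

print(linear, nonlinear)  # prints "0.75 1.0"
```

The second model has one more weight to fit, so it costs a little more to train, but it is the smallest step up in power that solves the problem.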
Even once a specific technology has been implemented and optimised for a specific domain, the answer (unfortunately) is still ‘it depends’ – in the field of document classification, for example, you could be classifying a handful of different fixed form types (easy – small number of training samples required) or a very large set of different unstructured legal documents (hard – larger number of samples needed). As an example, here’s the standard guidance we give for an engine I’ve been involved with developing in this area: https://docs.waives.io/docs/preparing-samples (see the section “How Many Samples Do I Need?”)
Also, as you’ve observed, there’s the challenge of knowing how many of the available training samples you should use (assuming you have plenty). Generally speaking, provided that the models are resilient to overtraining (which is what a technique called ‘regularisation’ is for), then more is more in terms of accuracy, but also of course in terms of compute resources. Because of diminishing returns, there’s again a sweet spot where the trade-off between precision-recall and training time is optimised. A really good AI engine may well use a mechanism called ‘cross-validation’ to optimise for this automatically.
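For anyone who hasn't met cross-validation before, the core of it is just a way of splitting the samples so that every sample is used for validation exactly once and for training the rest of the time. Below is a minimal Python sketch of a k-fold split, written from scratch with a hypothetical function name; it is not how any particular engine implements it, and real implementations normally shuffle the samples first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_splits(10, 3):
    print(len(train), validation)
```

A model is trained on each `train` part and scored on the matching `validation` part; averaging the scores gives an estimate of accuracy that can then be compared across training-set sizes or model settings.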
Sorry that such a simple question has such a complex answer, but speaking as someone who has been involved with machine learning for over 25 years in areas as diverse as facial recognition, fingerprint matching, handwriting recognition and document classification, I’ve seen some stuff!
For others reading this, and as I hope you’ll appreciate from the above, I would caution anyone against diving into this area based just on marketing messages or a desire to jump on the bandwagon, and without seeking expert advice.
Co-Founder and Chief Scientist
Thank you, George. I am grateful for your input.
Following up on the discussion...
We posted a blog post on document classification and our findings at https://blog.xenit.eu/blog/document-classification
The vectorization we used includes two popular NLP (Natural Language Processing) approaches:
- Extraction of BoW (bag of words), and
- TFIDF (term frequency-inverse document frequency)
We achieved good results with rather classical techniques. Why not :-).
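As a rough idea of what these two vectorizations do, here is a minimal Python sketch written from scratch. It uses the plain textbook formula (term frequency times log of total documents over documents containing the term); libraries usually apply smoothed or normalised variants, and this is not the code used for the blog post. The function names and the sample documents are made up.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs (naive whitespace tokenisation)."""
    return Counter(text.lower().split())


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    counts = [bag_of_words(doc) for doc in documents]
    n_docs = len(counts)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return vectors


# Made-up sample documents.
docs = ["invoice total due", "invoice payment received", "contract terms"]

print(bag_of_words(docs[0]))
for vector in tf_idf(docs):
    print(vector)
```

Words that occur in many documents (here "invoice") get a lower weight than words that single out one document (here "contract"), which is what makes TF-IDF useful for telling document types apart.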
You can enjoy the video at https://youtu.be/KXtMaNfVh2E
Feel free to leave your comments and findings.
Happy New Year 2020.
Xenit Solutions NV