Kevin Gidney
Kevin, a founder of Seal Software, has held various senior technical positions within Legato, EMC, Kazeon, Iptor and Open Text. His roles have included management, solutions architecture and technical pre-sales, with a background in electronics and computer engineering, applied to both software and hardware solutions.

Is Poor OCR Quality Hindering Your Contract Management Solution?

Kevin Gidney | Mar 20, 2014

Learn How Technology Addresses the Challenges Poor OCR Presents

As I said in my last post, Next Generation Contract Analytics: Putting Power in the Hands of Contract Professionals, poor OCR quality in your contracts cannot simply be overcome.  Let’s take date extraction as an example. A date can be written in a number of formats, for example:

  • 10/01/08
  • 10/01/2008
  • 1st of January 2008
  • 10th of October 2008

Let’s take the first two dates.  What is the actual date?  Is it the 10th of January or the 1st of October? Now add in some OCR errors where the digits of the date are misread, which is an extremely common issue.  What if the above dates were the termination or renewal dates of a contract?  You could have missed the date by almost 9 months!
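To make the ambiguity concrete, here is a minimal Python sketch using only the standard library. The helper name candidate_dates and the two formats it tries are my own assumptions for illustration, not part of any Seal product. It lists every calendar date a numeric string could mean, and shows how a single misread digit either breaks the date or silently changes it.

```python
from datetime import datetime

def candidate_dates(text):
    """Return every calendar date the numeric string could mean (hypothetical helper)."""
    formats = {
        "day/month/year": "%d/%m/%y",
        "month/day/year": "%m/%d/%y",
    }
    results = {}
    for label, fmt in formats.items():
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # the string is not a valid date under this reading
    return results

# The clean date has two valid readings.
readings = candidate_dates("10/01/08")
for label, value in readings.items():
    print(label, value.isoformat())

gap = abs(readings["month/day/year"] - readings["day/month/year"])
print(gap.days, "days between the two readings")

# OCR reads the zero as a letter O: no reading is valid, so at least it can be flagged.
print(candidate_dates("1O/01/08"))

# OCR reads the zero as an 8: still a valid date, but the wrong one, and nothing flags it.
print(candidate_dates("18/01/08"))
```

The last case is the dangerous one. The text still looks like a perfectly good date, so nothing downstream has a reason to question it.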

What Seal Has Learned About OCR Quality and Technology

One of the things we learned over 3 years ago came from installing the Seal Contract Discovery platform for our first customer. The task was to process, extract and review over 40,000 customer contracts dating back over 20 years, with varying degrees of OCR quality. What we learned was that no amount of Machine Learning can account for an OCR error such as the one described above.  While OCR errors within sections of text can be accounted for fairly easily, if all the key features are also of poor OCR quality, no system is able to extract them. Why is this the case?

How Machine Learning/Predictive Coding Addresses Some OCR Errors

Machine Learning systems need to be taught.  They learn from humans, or from a combination of humans and code, marking up text within the sections to be classified and identifying the important features. The system then builds a model based on this.  The model is not based on the actual words in sequence. It is more like a bag of words, with vectors and weighted scores that are based on the occurrence of words within the text and the features that have been manually marked. This model can then account for some OCR errors.  The larger the text, the more OCR errors can be accounted for, as not all of the features will be affected.  This is commonly referred to as Predictive Coding or Machine Learning.
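As a rough illustration of why a larger text tolerates more OCR errors, here is a simplified Python sketch. It uses raw word counts and cosine similarity, with no training and no weighting, so it is far simpler than any real model and is not how Seal's engine works. The sample clause and the OCR mistakes in it are made up for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A longer clause with two OCR-damaged words out of eleven.
clean = "either party may terminate this agreement upon thirty days written notice"
noisy = "either party may terminate this agreernent upon thirty days wrltten notice"
long_score = cosine(bag_of_words(clean), bag_of_words(noisy))

# A two-word fragment with one OCR-damaged word.
short_score = cosine(bag_of_words("thirty days"), bag_of_words("thlrty days"))

print(f"long clause:    {long_score:.2f}")
print(f"short fragment: {short_score:.2f}")
```

The long clause still scores high against its clean version because nine of its eleven words survived. The short fragment loses half its evidence to one error, and a date is an even shorter piece of text than that.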

Other ways to account for OCR errors are n-grams or Latent Semantics. However, these work on text, so they are not a solution for dates or numbers.  Even within a Machine Learning system, date recognition has to be taught, and this is done by providing the system with examples. Many of the frameworks available today already ship with the necessary examples.  For example, the OpenNLP framework for Machine Learning provides free models on its website that can be used to detect dates, people, places etc.  All of these models were trained on marked-up data, and every additional language needs new training and a new model. This can be expensive and time consuming.  Not to mention, it still does not address the core issue of OCR quality for subtle items like dates.
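The n-gram point can be sketched in a few lines of Python. This is an assumed, simplified version of the idea (character trigrams compared by set overlap), with function names of my own choosing. It shows why the approach helps with words but not with dates.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded at both ends."""
    pad = "_" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

# OCR read "m" as "rn". The damaged word is still far closer to
# "termination" than to an unrelated term, so it can be recovered.
print(ngram_similarity("terrnination", "termination"))
print(ngram_similarity("terrnination", "renewal"))

# For a date, the damaged string is also close to the original, but
# both are valid dates, so the overlap cannot say which one is right.
print(ngram_similarity("10/07/08", "10/01/08"))
```

With words, there is a vocabulary to match against, so a near miss can be pulled back to the right term. With dates and numbers, every near miss is itself a plausible value.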

How Seal Contract Analytics Goes Beyond Machine Learning/Predictive Coding

Over the years, we’ve also learned that it’s not just about finding standard clauses.  It’s about finding the non-standard clauses too, and not just in English, but also in other core languages such as Japanese, German or Chinese. This is why we have introduced Seal Contract Analytics.  It’s the first and only solution that is able to learn from customers, reteach the core discovery engine and account for OCR errors by using some of the methods mentioned above.  Back then, and still today, Seal uses the best OCR software engine.

It’s not a “keyword” solution, as incorrectly stated in a solution provider’s recent blog post.  It is a semantic-based engine that also incorporates Natural Language Processing, allowing customers to define the key “features” of clauses to better assist the system in identifying standard and non-standard data.

If you’d like to learn about Latent Semantics and how it deals with OCR errors, please refer to the following whitepaper by Content Analyst, Why LSI? Latent Semantic Indexing Information and Retrieval.  Content Analyst’s API is one of those we embed within our solution, as we believe in using the best available methods to achieve our customers’ goals.