How to Make an Enterprise-Aligned ML Model
There are numerous theories about how human beings learn and many of these theories focus on the ability of children to acquire and master new skills. Inspired by how children excel in learning tasks, Seal has built an AI with the ability to actively learn – to attempt to identify the best, smallest set of examples to acquire new knowledge. In this article, we will show how this process can drive the creation of Machine Learning models uniquely aligned to each enterprise’s needs.
One theory of learning is the Zone of Proximal Development (ZPD). It states that a child can start learning a task from a small set of well-defined facts about a given domain. The learner then moves towards the unknown and unexplored areas through the ZPD. When in this zone, the learner is assisted in acquiring knowledge by a teacher with a higher skill set. This process continues until the teacher is no longer needed and the student is confident enough to expand domain knowledge on their own. This approach allows for proper and focused guidance, which helps keep errors to a minimum.
The ZPD theory provides a good metaphor for some of the AI capabilities implemented in the Seal product. One important capability is the option for users to create their own User-Driven Machine Learning (UDML) models, which train the system’s machine learning functions to identify a particular legal concept. Some background information: most machine learning (ML) systems learn from labeled data. In practice, this means that an expert has divided the positive examples from the negative ones by marking which items in a pool of well-chosen data express the correct concept. The process of labeling data is often time-consuming, tedious and error-prone, and yet indispensable. Therefore, in order to create a model, a user would need to manually label data as well as choose an appropriate set of documents that contain the desired legal language. In addition, the user would need to create a properly balanced dataset, where the ratio of positive to negative examples reflects reality. All of this puts quite a heavy burden on a user, when the only requirement should be legal expertise.
This burden of preparing and labeling data is why Seal decided to make things easier for users by adopting the principles of the ZPD theory. With this approach, one example is enough to help Seal Analytics find similar concepts across contracts. This can be followed by a review process, guided by suggestions, that enables the UDML model to carry out more in-depth extractions. Just like in the ZPD, these review suggestions help the Seal system improve more quickly and with less supervision. In ML terms this is called Active Learning (AL). The challenge is to reduce the number of data points the domain expert must label to teach an accurate model.
While there are many ways to implement AL, the selected capability in Seal leverages the concept of uncertainty sampling. Here the AI learner (algorithm) presents the human teacher with the points it is most uncertain about. These points of uncertainty lie along the borderline which delineates the “true” clauses from all the rest, which makes them the most informative for the system. This might seem counterintuitive; however, the model learns best from examples which are neither obviously right nor clearly wrong. Straightforward examples are not effective, since they are already too easy for the student ML model. In this sense, the training improves by constantly ramping up the difficulty of the learning exercises, which speeds up progress.
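To make the selection step concrete, here is a minimal sketch of uncertainty sampling in Python. It is an illustration only, not Seal's implementation: the function names, the clause identifiers and the scores are hypothetical, and it assumes the model outputs a probability between 0 and 1 that a clause is a positive example.

```python
def uncertainty(prob_positive: float) -> float:
    """Score 1.0 at the decision boundary (p = 0.5) and 0.0 when the model is certain."""
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


def most_uncertain(probabilities: dict[str, float], k: int = 1) -> list[str]:
    """Return the k item ids whose predicted probability is closest to 0.5."""
    ranked = sorted(
        probabilities,
        key=lambda item_id: uncertainty(probabilities[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model scores for four candidate clauses.
scores = {"clause_a": 0.97, "clause_b": 0.52, "clause_c": 0.08, "clause_d": 0.41}

# The borderline clauses are suggested for review; the obvious ones are skipped.
print(most_uncertain(scores, k=2))  # ['clause_b', 'clause_d']
```

The clauses scored 0.97 and 0.08 are never shown to the expert, because the model is already confident about them and a label would teach it very little.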
The pictures below illustrate the concept of AL with uncertainty sampling:
The grey point is a suggestion which, once the expert has accepted it as a true/correct example, gives a more precise classification of the positive and negative examples. By iteratively selecting grey points along the borderline, the expert only reviews examples that matter for the learning process. This process continues until the model has learned a fine-grained borderline, as shown in the last picture. It can now be released to solve the task it was trained for.
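The iterative process can be sketched as a toy loop in Python. This is a simplified illustration under stated assumptions, not the Seal product's algorithm: the examples are points on a line, the "model" is a single threshold, and the `oracle` function stands in for the human expert. All names and the true boundary of 0.63 are hypothetical.

```python
def fit_threshold(labeled: dict[float, bool]) -> float:
    """Place the boundary midway between the highest negative and the lowest positive."""
    highest_negative = max(x for x, is_positive in labeled.items() if not is_positive)
    lowest_positive = min(x for x, is_positive in labeled.items() if is_positive)
    return (highest_negative + lowest_positive) / 2.0


def active_learning(pool, oracle, seed, budget):
    """Repeatedly ask the oracle about the unlabeled point nearest the current boundary."""
    labeled = {x: oracle(x) for x in seed}  # seed needs one positive and one negative
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(budget):
        if not unlabeled:
            break
        boundary = fit_threshold(labeled)
        # The point closest to the boundary is the one the model is least sure about.
        query = min(unlabeled, key=lambda x: abs(x - boundary))
        unlabeled.remove(query)
        labeled[query] = oracle(query)  # the expert accepts or rejects the suggestion
    return fit_threshold(labeled), len(labeled)


pool = [i / 100 for i in range(101)]  # 101 candidate examples


def oracle(x: float) -> bool:
    """Stand-in for the expert; the true boundary is hidden from the learner."""
    return x >= 0.63


boundary, n_labels = active_learning(pool, oracle, seed=[0.0, 1.0], budget=8)
print(boundary, n_labels)
```

In this toy setting the learner labels 10 of the 101 points, and the learned boundary ends up between 0.62 and 0.63, the two pool points on either side of the true one. Real clause classifiers are far higher-dimensional, so the savings will differ, but the principle of reviewing only borderline examples is the same.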
A long-standing barrier to better AI has been the need for high-quality examples of the concepts to be learned. By relying on AL, the efforts of the domain expert for data gathering, labeling and training are dramatically reduced. The reduction of the time-to-value for enterprises produces a leaner, smarter AI that can empower the enterprise team with more answers and insights than they could find on their own.