We say that DiligenceEngine finds 90% or more of the instances of nearly every contract provision it covers. How do we know this? Hint: it's not an assumption based on our knowing that we have good people with serious training who work really hard. We know our system's accuracy through testing. This post describes how.
If you are interested in this subject, it might be worthwhile to start with our earlier post explaining what we mean by "90% accurate" (short answer: recall).
How we test our accuracy
The quick answer is that we measure how well our system finds provisions in real contracts. This may sound easy, but it is not: there are a number of ways to generate misleading accuracy results in this process. Read on if you are curious about the details.
In the course of building the DiligenceEngine due diligence contract review automation system, we reviewed and annotated a large and diverse pool of agreements. These contracts range from supply agreements, to employment contracts, to investment management agreements, to leases, to franchise agreements, to licenses, to many other types; while many are US-law-governed, a significant portion are not. All of these contracts were reviewed word for word by experienced lawyers, and a number were reviewed multiple times. We now have a large pool of annotated contracts, and we test our accuracy by running the system on a large set of them that were not used to train the system. The resulting recall numbers are what we are referring to when we say our system finds 90% or more of nearly every provision covered.
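As a rough illustration of what a recall number on held-out contracts means, here is a minimal Python sketch. The contract identifiers, clause numbers, and the `recall` helper are all hypothetical and exist only for this example; this is not DiligenceEngine's actual test harness.

```python
def recall(expected, found):
    """Fraction of annotated provision instances that the system also found."""
    expected = set(expected)
    if not expected:
        raise ValueError("no annotated instances to measure against")
    return len(expected & set(found)) / len(expected)


# Hypothetical lawyer annotations on contracts that were NOT used for training,
# written as (contract id, clause number) pairs.
annotated = {("lease-07", 12), ("lease-07", 31), ("supply-02", 4), ("license-11", 9)}

# Hypothetical system output on those same held-out contracts.
system_hits = {("lease-07", 12), ("supply-02", 4), ("license-11", 9), ("license-11", 15)}

held_out_recall = recall(annotated, system_hits)
print(f"recall = {held_out_recall:.2f}")  # 3 of the 4 annotated instances were found
```

Note that the extra hit on ("license-11", 15) does not lower recall. Recall only asks how many of the real instances were found, not how many of the hits were correct.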
How to cheat at accuracy testing
One way to measure accuracy would be to test on the same documents used to build provision models. However, this is analogous to giving students the answers to a test beforehand. You can know that high-scoring students are good at memorizing, but you cannot know whether they have really learned the material. Computers are particularly good at memorizing, so you should never test on the training data to determine the accuracy of a learned model. The one exception is when the question you are trying to answer is whether a system can find instances it has already seen, which might be the case for an automated contract review system intended to work only on known documents such as company forms.
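The memorization problem can be made concrete with a deliberately silly toy model. The sketch below is an assumption-laden illustration, not a real provision model: `MemorizingModel` and the sample clauses are invented for this example, and the "model" simply remembers the exact text it was trained on.

```python
class MemorizingModel:
    """Toy 'model' that only recognizes clause text it saw during training."""

    def __init__(self):
        self.seen = set()

    def train(self, clauses):
        self.seen.update(clauses)

    def finds(self, clause):
        return clause in self.seen


def recall_on(model, clauses):
    """Share of the given provision clauses that the model finds."""
    return sum(model.finds(c) for c in clauses) / len(clauses)


# Hypothetical assignment clauses used for training.
training_clauses = [
    "This Agreement may not be assigned without prior written consent.",
    "Neither party may assign its rights hereunder.",
    "Tenant shall not assign this Lease.",
]
# Hypothetical assignment clauses the model has never seen.
unseen_clauses = [
    "No transfer of this Agreement is permitted absent Licensor approval.",
    "The Supplier shall not novate or assign this Contract.",
]

model = MemorizingModel()
model.train(training_clauses)

train_recall = recall_on(model, training_clauses)  # perfect score on memorized text
unseen_recall = recall_on(model, unseen_clauses)   # nothing found on new wording
print(train_recall, unseen_recall)
```

Scored on its own training data, this model looks perfect. Scored on clauses it has not seen, it finds nothing, which is the result that matters when agreements arrive in unfamiliar forms.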
This requirement to test on "unseen" data is particularly difficult to meet for systems that use manual rules (i.e., human-created rules, such as those built using Boolean search strings). With a system based on manual rules, the only way to obtain truly unbiased accuracy results is to keep the testing documents secret from the people building the rules. This requires a great deal of discipline; it is very tempting to look at where rules are going wrong. When testing a model built with machine learning, on the other hand, it is easier to make sure the computer does not improperly peek at the test questions!
Another potential pitfall comes from testing on a fixed set of testing data. It might be tempting to set aside a portion (e.g., 20%) of total training data to be used as testing data. Testing on a small and static test set raises the risk of biasing models to perform well on that set; final model accuracies may reflect accuracy on the test set and not reality. To avoid this, the test set should be rotated through the data, so that each portion takes a turn being held out while the models are built on the rest. The technical term for this technique is cross-validation (often called k-fold cross-validation).
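For readers who want to see the rotation spelled out, here is a minimal sketch of k-fold splitting in plain Python. The `k_fold_splits` helper and the contract names are hypothetical, written for this example; libraries such as scikit-learn offer their own versions of the same idea.

```python
def k_fold_splits(items, k):
    """Yield k (train, test) pairs in which every item is held out exactly once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    for i in range(k):
        test = items[i::k]  # every k-th item, starting at position i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# Hypothetical pool of ten annotated contracts.
contracts = [f"contract-{n}" for n in range(10)]

folds = list(k_fold_splits(contracts, 5))
for train, test in folds:
    # In a real run: build the provision model on `train`, then score it on `test`.
    print(f"train on {len(train)} contracts, test on {test}")
```

Averaging the scores from all of the folds gives an accuracy estimate that does not depend on one lucky or unlucky test set. In practice it probably also makes sense to shuffle the pool first and to split by whole contract, so that clauses from one agreement never appear on both the training and the testing side.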
A final thing to watch out for is a lack of diversity in the training data. No clever accuracy testing technique can make up for training data that is itself not a good reflection of reality.
These observations apply to situations (like legal due diligence) where the agreements for review can be of unfamiliar form, and do not necessarily apply where provisions are known in advance. For example, if testing whether a system can successfully extract provisions from a form agreement, the form agreement itself might be used in training and would definitely be used in testing. While our system should perform well on form agreements, our goal is to help users find the provisions they seek in whatever agreements they happen to have at hand. This is a different problem that requires different testing.
All this said, the test we care about most is whether our system finds provisions for our users. Try it out for yourself and see how it does!