What is coverage?
Coverage is a term frequently used in Machine Learning and relates to how well a model 'covers' the data it's used to analyse. In Re:infer, this relates to the proportion of verbatims in the dataset that have informative label predictions, and is presented in Validation as a percentage score.
'Informative labels' are those labels that Re:infer understands to be useful as standalone labels, by looking at how frequently they're assigned with other labels. Labels that are always assigned with another label, e.g. parent labels that are never assigned on their own or 'Urgent' if it's always assigned with another label, are down-weighted when the score is calculated.
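To make the definition concrete, here is a minimal sketch of coverage as a simple proportion. It is an illustration only and not Re:infer's actual calculation: it treats 'informative' as a fixed yes/no set of labels rather than applying any down-weighting, and the `coverage` function, the label names, the confidence scores and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List, Set


def coverage(
    predictions: List[Dict[str, float]],
    informative: Set[str],
    threshold: float = 0.5,  # hypothetical confidence cut-off
) -> float:
    """Fraction of verbatims with at least one informative label
    predicted at or above the threshold."""
    if not predictions:
        return 0.0
    covered = sum(
        1
        for labels in predictions
        if any(
            name in informative and confidence >= threshold
            for name, confidence in labels.items()
        )
    )
    return covered / len(predictions)


# Hypothetical predictions: one dict of label -> confidence per verbatim.
predictions = [
    {"Margin Call": 0.92, "Urgent": 0.80},  # covered by an informative label
    {"Urgent": 0.75},                       # only a non-informative label
    {"Trade Query": 0.30},                  # informative, but below threshold
    {"Trade Query": 0.66},                  # covered
]
informative = {"Margin Call", "Trade Query"}

score = coverage(predictions, informative)
print(f"{score:.1%}")  # 50.0%
```

In this toy example, the verbatim that only has 'Urgent' predicted does not count towards coverage, which mirrors the idea that labels which only ever accompany other labels contribute little on their own.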
The visual below gives an indication of what low coverage versus high coverage would look like across an entire dataset. Imagine the shaded circles are verbatims that have informative label predictions.
As a metric, coverage is a very helpful way of understanding if you've captured all of the different potential concepts in your dataset, and whether you've provided enough varied training examples for them so that Re:infer can effectively predict them.
In almost all cases, the higher a model's coverage is, the better it performs, but coverage should not be considered in isolation when checking model performance.
It is also very important that the labels in the taxonomy are healthy, meaning that they have high average precision and no other performance warnings, and that the training data is a balanced representation of the dataset as a whole.
If your labels are unhealthy or the training data is not representative of the dataset, then the coverage of your model that Re:infer calculates will be unreliable.
Your model having high coverage is particularly important if you are using it to drive automated processes.
Coverage in context
So let's consider how this applies in a typical use case - automatically routing different requests received by a shared email inbox in a business.
For a model designed to help automatically route different requests, low coverage would mean that lots of requests were inaccurately routed, or sent for manual review as the model could not identify them.
If there are 10 key processes managed by the team working in the mailbox, but the taxonomy only effectively captures 7 of those, this would likely lead to a poor coverage score. Alternatively, the model may incorrectly predict the remaining 3 processes as one of the 7 it does know, because it has not been taught what they actually are. This would lead to an artificially high coverage score.
During the automation the remaining three processes would likely be missed, sent for manual review, or falsely classified as a different process and routed to the wrong place.
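The routing behaviour described above can be sketched as follows. This is a hypothetical downstream automation, not part of Re:infer: the `ROUTES` mapping, the queue names, the `route` function and the 0.8 confidence threshold are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical mapping from taxonomy labels to destination queues.
ROUTES = {
    "Address Change": "ops-team",
    "Payment Query": "finance-team",
}


def route(prediction: Dict[str, float], threshold: float = 0.8) -> str:
    """Pick the destination for one email from its label predictions.

    Falls back to manual review when no routable label is predicted
    with enough confidence.
    """
    candidates = {
        label: confidence
        for label, confidence in prediction.items()
        if label in ROUTES and confidence >= threshold
    }
    if not candidates:
        return "manual-review"
    best_label = max(candidates, key=candidates.get)
    return ROUTES[best_label]


print(route({"Payment Query": 0.91}))  # finance-team
print(route({"Payment Query": 0.40}))  # manual-review
print(route({}))                       # manual-review
```

With low coverage, more emails fall through to the manual review branch. If a process is missing from the taxonomy, its emails either end up there too or are confidently matched to the wrong label and sent to the wrong queue.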
It's also important that for each of the processes, there are sufficient varied training examples that capture the varied expressions used when discussing the process. If there are not, each label will have lower recall and the model's coverage will be low.
The visual below demonstrates how this example might look in practice - we have multiple clients sending multiple request types through email. Each client may write the same request type in a different way:
There are three different model scenarios shown, each with a different potential impact on coverage:
- Not all request types (i.e. concepts) covered by the taxonomy - Likely to have low coverage
- All request types covered, but some labels are unhealthy and may not have enough examples - Coverage may be low and is likely to be unreliable
How to check your model's coverage
In Validation in Re:infer, the Model Rating functionality provides a detailed breakdown of model performance across the most important contributing factors, including coverage.
Users are provided with a percentage score showing the proportion of verbatims that are predicted to have informative labels (see explanation above). This score in turn determines a rating and a contribution to the overall model rating.
Re:infer also provides the proportion of verbatims that have at least one label predicted, but this is not what determines the score and the rating for this factor.
In the example below we can see that:
- The coverage for this model is 85.5%
- This translates to an assessed rating of 'Good' for this factor
- This also translates to a positive contribution to the overall model rating of +3
Example breakdown of coverage shown in Validation
It's important to remember that coverage as a performance measure should always be understood in conjunction with the health of the labels in the taxonomy and the balance of the dataset. Before you consider the coverage score provided by Re:infer to be reliable, you should also ensure that you have at least a 'Good' rating for the other factors that contribute to the Model Rating.
How high should a model's coverage be?
At a minimum, your coverage should be high enough that Re:infer provides a 'Good' rating. If it reaches this stage your model is performing well, and you can choose to optimise it further based on the requirements of your use case.
Just like humans, machine learning models do have their limits, so you should not expect to achieve 100% coverage. You may get close to this, but only if your data is extremely simple and repetitive, or you label a very high proportion of the dataset.
Reaching an 'Excellent' rating can potentially be challenging depending on the data, and is not always necessary depending on the use case.
For automation focused use cases, having high coverage is particularly important, so it is likely in these instances you will want to optimise coverage as much as possible and achieve an 'Excellent' rating.
For analytics focused use cases, having very high coverage is slightly less critical. It is often sufficient for models to be able to identify the most frequently occurring concepts, as these provide the greatest insights and potential opportunities for change and improvements.
It's worth noting that an analytics focused model will typically sacrifice some accuracy in order to broadly capture a very wide range of concepts in its taxonomy.
How do you increase a model's coverage?
As with the other contributing factors, Re:infer provides useful recommended 'next best actions' that you can take to improve the coverage of your model (see images above and below).
The recommended actions act as links that take you directly to the suggested training mode.
In the case of coverage, Re:infer will most often recommend training in 'Low Confidence' mode, as this mode allows users to label the least 'covered' of all the verbatims in the dataset. For more detail on using this training mode see here.
After completing a reasonable amount of training in the recommended training mode (how much will depend on how low the coverage is to start with), users can check back in to see how coverage has improved once the model has retrained and Validation has updated. If required, they can repeat the process to continue to improve their scores.
When coverage is very low, typically early on in the training process, Re:infer may also recommend other training modes it thinks would be beneficial, such as labelling clusters of similar verbatims in Discover (as shown in example below).
Example factor card showing low coverage for a model and recommended actions
Coverage check for specific labels
Re:infer is able to intelligently identify which labels in a taxonomy are 'informative' when assessing a model's coverage and presents this in the Model Rating (as discussed above).
If, however, you want to assess the proportion of a dataset that is likely to be covered by predictions for a specific set of labels in your taxonomy, you can do this in the Reports page.
To do so, you can select all of the labels in question in the label filter bar in Reports (as shown below).
Labels selected in Reports filter bar
The verbatim count at the top of the page in Reports updates based on filters applied. When you select labels from the label filter, the count updates to show the number of verbatims that are likely to have at least one of the selected labels predicted.
Verbatim count in Reports
In this example dataset of emails solely relating to a margin call process in a bank (which contains 260,000 emails), you can see that it is likely that 237,551 verbatims out of the 260,000 will have at least one of the selected labels predicted. This indicates good coverage of approximately 91.4%.
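The percentage in this example is simply the filtered verbatim count divided by the total. As a quick check of the arithmetic (the variable names are illustrative only; the two counts come from the example above):

```python
total_verbatims = 260_000     # all emails in the example dataset
matching_verbatims = 237_551  # count shown in Reports with the labels selected

label_coverage = matching_verbatims / total_verbatims
print(f"{label_coverage:.1%}")  # 91.4%
```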
This should not be your only check to test the overall coverage of your model. The most reliable measure of coverage for the whole dataset is presented in the Model Rating (as discussed above), and should be considered in conjunction with the overall performance of the labels in the taxonomy.