User permissions required: 'View Sources' AND 'View Labels'
What's in this article?
- Understanding overall model performance
- Understanding label performance
- Individual Label performance
- Improving overall model and individual label performance
- Understanding the label threshold slider
Understanding overall model performance
How well a machine learning model performs overall is determined by several contributing factors, which need to be considered in combination. It's not sufficient to consider only whether a model's labels have high average precision, or only how much of a dataset is covered by predictions.
To be confident that a model is a true representation of a dataset, with accurate, reliable label predictions and high coverage, we assess 4 main factors:
- All labels - this factor assesses the average performance of all labels in the taxonomy
- Underperforming labels - this factor assesses the performance of the 10% of labels that have the most significant performance issues or lowest overall health
- Coverage - this factor assesses the proportion of verbatims that are predicted to have at least one informative label
- Balance - this factor assesses whether the reviewed data is an effective, balanced representation of the whole dataset
The platform combines its assessment of these 4 factors into a single, easy-to-understand proprietary 'Model Rating'. This rating is the best measure of a model's performance, as it accounts for all of the most important factors that a label model should be assessed on, weighted by their relative importance.
The Model Rating is a score from 0 to 100, which equates to a rating of 'Poor' (0-49), 'Average' (50-69), 'Good' (70-89) or 'Excellent' (90-100).
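The score bands above can be written down as a small lookup. This is only an illustrative sketch of the published bands; the function name `model_rating_band` is hypothetical and is not part of the platform or its API.

```python
def model_rating_band(score: int) -> str:
    """Map a Model Rating score (0-100) to its qualitative band.

    The band boundaries come from the article; the function itself is
    a hypothetical helper, not a platform API.
    """
    if not 0 <= score <= 100:
        raise ValueError("Model Rating must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Average"
    return "Poor"


print(model_rating_band(72))  # Good
```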
Validation also provides a breakdown of each of these factors. Each factor is assigned its own qualitative rating and is shown alongside the metrics that contribute towards that rating. These are discussed in more detail below.
Each factor can have either a positive or negative contribution towards the Model Rating, depending on its performance. It's very possible for models with 'Poor' ratings across each factor to have an overall rating of zero, particularly when it's very early on in the model training process.
Example Model Rating in Validation on a demo dataset
Please Note: The model rating is specific to labels and does not take into account the performance of entities. Validation does, however, also provide detailed metrics on the performance of entity predictions (see here).
The 'Model Version' dropdown, located above the model rating, lets you see all validation scores across past model versions on a given dataset. You can also prioritise or 'star' individual ones so that they appear at the top of the list in future. This tool can be useful for tracking and comparing progress as you build out your model.
The dropdown ensures that you don't need to pin a model version to see validation scores for it. Pinning a model version (see here) only needs to be used for model versions that you want to be able to call downstream via the API (e.g. for automated routing).
The model version dropdown
The Factors tab in Validation (as shown above) provides a breakdown of each of the key factors that are considered when calculating the Model Rating.
Each factor card shows:
- The rating for the factor and a qualitative description of it
- The contributors, which are the different components that contribute to the rating for that factor
- Recommended actions, listed in order of priority, that can help to improve the rating of the factor
All labels
- This factor assesses the average performance of all labels in the taxonomy
- It primarily relies on the Mean Average Precision (MAP) across all labels, weighted by how frequently they're assigned
- It also takes into account when there are labels in the taxonomy that have red or amber performance warnings
- The recommended actions for this factor will typically be designed to increase the average precision of specific labels, or to remove their performance warnings
Underperforming labels
- This factor assesses the performance of the 10% of the labels in the taxonomy that have the most significant performance issues or lowest overall label 'health' (if no warnings are returned)
- If even the bottom 10% of your labels still perform very well, this is a great sign that your model is healthy
- This relies on the MAP of the lowest performing labels, as well as whether these labels have any significant performance warnings (i.e. red or amber warnings)
- The recommended actions for this factor will typically be designed to increase the average precision of specific underperforming labels and remove any other individual performance warnings
Coverage
- This factor assesses the proportion of verbatims that are predicted to have at least one informative label (see here for more detail)
- It's very important that a well-performing model 'covers' as much of the dataset as possible with informative label predictions
- The recommended actions for this factor are designed to help increase the coverage of the dataset. This will most often be training in 'Low Confidence' mode
- Understanding and increasing the coverage of a dataset is covered in detail here
Balance
- This factor assesses whether the reviewed data is an effective, balanced representation of the whole dataset, and is designed to detect potential labelling bias (see here for more detail)
- It relies on a 'similarity score', which results from comparing the reviewed data against the unreviewed data in the dataset, as well as whether enough 'Shuffle' mode has been used when labelling the data
- It's important that the reviewed data in a dataset is as similar as possible to the unreviewed data, so that the platform can reliably identify as much of it as possible
- The recommended actions for this factor are designed to reduce any potential labelling bias and increase the similarity score of the reviewed data
- Understanding and improving the balance of the dataset is covered in detail here
Please Note: The platform will always try to surface the recommended next best actions that could improve a factor, even if the factor is rated 'Good' or 'Excellent'. This is in case you want to continue to optimise the performance of that factor.
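To make the Coverage factor more concrete, the sketch below computes the proportion of verbatims with at least one informative predicted label. It is an illustration only: the function name `coverage`, the example data and the 'Thank you' label are hypothetical, and which labels count as 'informative' is an assumption supplied by the caller rather than the platform's own definition.

```python
from typing import Iterable, Set


def coverage(predicted_labels: Iterable[Set[str]], informative: Set[str]) -> float:
    """Proportion of verbatims predicted to have at least one informative label.

    `predicted_labels` holds one set of predicted label names per verbatim.
    `informative` is the caller's own choice of informative labels (an assumption).
    """
    verbatims = list(predicted_labels)
    if not verbatims:
        return 0.0
    covered = sum(1 for labels in verbatims if labels & informative)
    return covered / len(verbatims)


# Hypothetical predictions for four verbatims
predictions = [
    {"Quote > Motor"},
    {"Thank you"},                   # only an uninformative label
    set(),                           # no label predicted at all
    {"Quote > Motor", "Thank you"},
]
print(coverage(predictions, informative={"Quote > Motor"}))  # 0.5
```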
The Metrics tab of the Validation page shows some average label performance statistics, as well as a chart showing the average precision of each label versus their training set size. The chart also flags labels that have amber or red performance warnings.
Example Metrics tab in Validation for a demo dataset
The label performance statistics shown are:
Essentially, the higher your MAP score, the better the model is performing overall in terms of the accuracy of its predictions, but this will vary between datasets depending on the type of data and the objectives you have centred your taxonomy around.
MAP is not a good measure of coverage or balance, however, and should not be solely relied on to determine whether your model is fit for purpose. For more detail on understanding and checking your model's coverage, see here, and for checking and improving your model's balance, see here.
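For readers unfamiliar with MAP, the sketch below shows one common way to compute average precision for a single label and then combine per-label values into a mean weighted by how often each label is assigned. This is an assumption about the general technique, not the platform's exact calculation, and the function names and example numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(scored: List[Tuple[float, bool]]) -> float:
    """Average precision for one label.

    `scored` holds (confidence, label_actually_applies) pairs. Examples are
    ranked by confidence, and precision is averaged over the ranks at which
    a true example appears.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def weighted_map(ap_by_label: Dict[str, float], counts: Dict[str, int]) -> float:
    """Mean of per-label average precision, weighted by assignment counts."""
    total = sum(counts[label] for label in ap_by_label)
    if total == 0:
        return 0.0
    return sum(ap_by_label[label] * counts[label] for label in ap_by_label) / total


# Hypothetical data: one label's ranked predictions, then two labels combined
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)])
overall = weighted_map({"A": 1.0, "B": 0.5}, {"A": 30, "B": 10})
print(round(ap, 3), overall)  # 0.833 0.875
```

Weighting by assignment count means a frequently assigned label influences the result more than a rare one, which is why a handful of rare, weak labels can be hidden by a high overall figure.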
Understanding label performance
The label performance chart shown in the Metrics tab of the Validation page (see above) gives an immediate visual indication of how each individual label is performing.
For a label to appear on this chart, it must have at least 20 pinned examples present in the training set used by the platform during validation. To ensure that this happens, users should make sure they provide a minimum of 25 (often more) pinned examples per label.
Each label will be plotted as one of three colours, based on the model's understanding of how the label is performing. Below, we explain what these mean:
Label performance indicators:
- Those labels plotted as blue on the chart have a satisfactory performance level. This is based on numerous contributing factors, including number and variety of examples, labelling bias and average precision for that label
- Labels plotted as amber have slightly less than satisfactory performance. They may have relatively low average precision, not quite enough training examples, or could have been labelled in a biased manner. These labels require a bit of training / correction to improve their performance or to reduce bias in the way that they were trained
- Labels plotted as red are poorly performing labels. They may have very low average precision, not enough training examples, or are highly biased in the way that they were labelled. These labels may require considerably more training / correction to bring their performance up to a satisfactory level or to significantly reduce the bias in the way they were trained
Please Note: you will see the amber and red performance indicators appear in the label filter bars in Explore, Reports and Validation. This helps to quickly notify you which labels need some help, and also which labels' predictions should not be relied upon (without some work to improve them) when using the analytics features.
Individual Label performance
Users can select individual labels from the label filter bar (or by clicking the label's plot on the 'All labels' chart) in order to see the label's performance statistics. The specific label view will also show any performance warnings and recommended next best action suggestions to help improve its performance.
The label view shows the average precision for the label, as well as its precision and recall at a given confidence threshold, which users can adjust to see how these values change (see the last section of this article for detail).
Validation page with a specific label selected
Improving overall model and individual label performance
If your model or a specific label has a performance warning, the platform recommends the next best action that it thinks will help address that warning. This will either be via one of the factor cards (as shown below) or when you select a specific label from the taxonomy or the 'All Labels' chart (as shown above).
Example Factor card showing recommended actions
For some labels that have very low average precision, it might not always be clear why they are performing badly. Possible reasons why a label may have low average precision and how to improve it are discussed in the following article here. The platform will always provide what it thinks are the best corrective actions to improve the label, listed in order of priority.
Outside the Validation page, the red and amber performance warnings still appear in the taxonomy filter in Explore and Reports, so they can be a helpful prompt to check Validation for the recommended corrective actions.
The next best action suggestions act as links that take you directly to the training view that the platform suggests in order to improve the label's performance or reduce its bias. The suggestions are ordered by priority, with the highest priority action to improve the label listed first.
These suggestions are the most important tool to help you understand the performance of your model, and should regularly be used as guides when trying to improve the performance of a label.
The next article covers in detail the reasons why a label may have low average precision (here). The later articles in this section of the Knowledge Base also cover the other training modes that the platform may suggest using to improve your model's performance, which are 'Check label', 'Missed label', and 'Rebalance'.
Please Note: If you have an older model version selected, you will still see the warnings and suggested actions, but if you hover over the information icon with your mouse, a pop-up will warn you that these specific stats cannot be changed as there is a newer model version (which you can work on).
Understanding the label threshold slider
Please Note: Adjusting the confidence threshold slider does not actually impact the model; it just gives you a visual representation of how predictions for that label would be impacted if a certain confidence threshold were selected. This can help you select a suitable threshold for streams or for using predictions in downstream automation.
To adjust the confidence threshold for a label, users can type a % figure into the box, or just drag the slider along the scale.
Label specific Validation stats
Notice how in the image above, the confidence threshold for the ‘Quote > Motor’ label is set at 68.7%, giving a precision of 100% and recall of 85%.
Then, in the image below, the confidence threshold has been adjusted using the slider to 17%, which brings the precision down to 84%, but increases the recall to 100%.
Label specific Validation charts
To better understand the relationship between precision and recall, see here, and to better understand how you would select thresholds to be used in automations, see here.
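The trade-off the slider visualises can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `precision_recall_at` and the example data are hypothetical, and returning a precision of 1.0 when nothing is predicted is a convention chosen here, not necessarily what the platform displays.

```python
from typing import List, Tuple


def precision_recall_at(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall for one label at a given confidence threshold.

    `scored` holds (confidence, label_actually_applies) pairs. A prediction
    counts as made when its confidence is at or above the threshold.
    """
    predicted = [is_positive for confidence, is_positive in scored if confidence >= threshold]
    true_positives = sum(predicted)
    total_positives = sum(is_positive for _, is_positive in scored)
    # Convention (assumption): precision is 1.0 when no predictions are made
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical reviewed examples for a single label
examples = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]

high = precision_recall_at(examples, 0.7)  # precision 1.0, recall about 0.67
low = precision_recall_at(examples, 0.3)   # precision 0.75, recall 1.0
print(high, low)
```

As in the screenshots described above, lowering the threshold admits more predictions, so recall rises while precision tends to fall.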