Understanding and improving model performance

User permissions required: 'View Sources' AND 'View Labels'

What's in this article?

Understanding overall model performance
Factors
Understanding label performance
Individual Label performance
Improving overall model and individual label performance
Understanding the label threshold slider

Understanding overall model performance

How well a machine learning model performs overall is determined by a number of contributing factors, which need to be considered in combination with each other. It's not sufficient to only consider whether a model's labels have high average precision, or to just look at how much of a dataset is covered by predictions on its own.

To be confident that a model is a true representation of a dataset, with accurate, reliable label predictions and high coverage, we assess 4 main factors (and here's how):

All labels - this factor assesses the average performance of all labels in the taxonomy
Underperforming labels - this factor assesses the performance of the 10% of labels that have the most significant performance issues or lowest overall health
Coverage - this factor assesses the proportion of verbatims that are predicted to have at least one informative label
Balance - this factor assesses whether the reviewed data is an effective, balanced representative of the whole dataset

The platform combines its' assessment of these 4 factors into a single, easy to understand proprietary 'Model Rating'. This rating is the best measure of a model's performance, as it accounts for all of the most important factors that a label model should be assessed on, weighted by their relative importance.

The Model Rating is a score from 0 to 100, which equates to a rating of 'Poor' (0-49), 'Average' (50-69), 'Good' (70-89) or 'Excellent' (90-100).

Validation also provides a breakdown of each of these factors - each factor is assigned its own qualitative rating, as well as displaying the metrics that contribute towards that rating. These are discussed in more detail below.

Each factor can have either a positive or negative contribution towards the Model Rating, depending on its performance. It's very possible for models with 'Poor' ratings across each factor to have an overall rating of zero, particularly when it's very early on in the model training process.

Example Model Rating in Validation on a demo dataset

Please Note: The model rating is specific to labels and does not take into account the performance of entities. Validation does, however, also provide detailed metrics on the performance of entity predictions (see here).

The 'Model Version' dropdown, located above the model rating, lets you see all validation scores across past model versions on a given dataset. You can also prioritise or 'star' individual ones so that they appear at the top of the list in future. This tool can be useful for tracking and comparing progress as you build out your model.

The dropdown ensures that you don't need to pin a model version to see validation scores for it. Pinning a model version (see here) only needs to be used for model versions that you want to be able to call downstream via the API (e.g. for automated routing).

The model version dropdown

Factors

The Factors tab in Validation (as shown above) provides a breakdown of each of the key factors that are considered when calculating the Model Rating.

Each factor card shows:

The rating for the factor and a qualitative description of it
The contributors , which are the different components that contribute to the rating for that factor
Recommended actions, listed in order of priority, that can help to improve the rating of the factor

All labels

This factor assesses the average performance of all labels in the taxonomy,
It primarily relies on the Mean Average Precision (MAP) across all labels, weighted by how frequently they're assigned
It also takes into account when there are labels in the taxonomy that have red or amber performance warnings
The recommended actions for this factor will typically be designed to increase the average precision of specific labels, or to remove their performance warnings

Underperforming labels

This factor assesses the performance of the 10% of the labels in the taxonomy that have the most significant performance issues or lowest overall label 'health' (if no warnings are returned)
If even the bottom 10% of your labels still perform very well, this is a great sign that your model is healthy
This relies on the MAP of the lowest performing labels, as well as whether these labels have any significant performance warnings (i.e. red or amber warnings)
The recommended actions for this factor will typically be designed to increase the average precision of specific underperforming labels and remove any other individual performance warnings

Coverage

This factor assesses the proportion of verbatims that are predicted to have at least one informative label (see here for more detail)
It's very important that a well performing model 'covers' as much of the dataset as possible with informative label predictions
The recommended actions for this factor are designed to help increase the coverage of the dataset. This will most often be training in 'Low Confidence' mode
Understanding and increasing the coverage of a dataset is covered in detail here

Balance

This factor assesses whether the reviewed data is an effective, balanced representation of the whole dataset, and is designed to detect potential labelling bias (see here for more detail)
It relies on a 'similarity score', which results from comparing the reviewed data against the unreviewed data in the dataset, as well as whether enough 'Shuffle' mode has been used when labelling the data
It's important that the reviewed data in a dataset is as similar as possible to the unreviewed data, so that the platform can reliably identify as much of it as possible
The recommended actions for this factor are designed to reduce any potential labelling bias and increase the similarity score of the reviewed data
Understanding and improving the balance of the dataset is covered in detail here

Please Note: The platform will always try to surface the recommended next best actions that could improve a factor, even if the factor is rated 'Good' or 'Excellent'. This is in case you want to continue to optimise the performance of that factor.

Metrics

The Metrics tab of the Validation page shows some average label performance statistics, as well as a chart showing the average precision of each label versus their training set size. The chart also flags labels that have amber or red performance warnings.

Example Metrics tab in Validation for a demo dataset

The label performance statistics shown are:

Essentially, the higher your MAP score, the better the model is performing overall when it comes to the accuracy of it's predictions, but this will vary between datasets depending on the type of data, and the objectives you have centred your taxonomy around.

MAP is not a good measure of coverage or balance, however, and should not be solely relied on to determine whether your model is fit for purpose. For more detail on understanding and checking your model's coverage, see here, and for checking and improving your model's balance, see here.

Understanding label performance

The label performance chart shown in the Metrics tab of the Validation page (see above) gives an immediate visual indication of how each individual label is performing.

For a label to appear on this chart, it must have at least 20 pinned examples present in the training set used by the platform during validation. To ensure that this happens, users should make sure they provide a minimum of 25 (often more) pinned examples per label.

Each label will be plotted as one of three colours, based on the model's understanding of how the label is performing. Below, we explain what these mean:

Label performance indicators

Label performance indicators:

Those labels plotted as blue on the chart have a satisfactory performance level. This is based on numerous contributing factors, including number and variety of examples, labelling bias and average precision for that label

Labels plotted as amber have slightly less than satisfactory performance. They may have relatively low average precision, not quite enough training examples, or could have been labelled in a biased manner. These labels require a bit of training / correction to improve their performance or to reduce bias in the way that they were trained

Labels plotted as red are poorly performing labels. They may have very low average precision, not enough training examples, or are highly biased in the way that they were labelled. These labels may require considerably more training / correction to bring their performance up to a satisfactory level or to significantly reduce the bias in the way they were trained

Please Note: you will see the amber and red performance indicators appear in the label filter bars in Explore, Reports and Validation. This helps to quickly notify you which labels need some help, and also which labels' predictions should not be relied upon (without some work to improve them) when using the analytics features.

Individual Label performance

Users can select individual labels from the label filter bar (or by clicking the label's plot on the 'All labels' chart) in order to see the label's performance statistics. The specific label view will also show any performance warnings and recommended next best action suggestions to help improve its performance.

The label view will show the average precision for the label, as well as its precision vs. recall based on a given confidence threshold that users can adjust themselves to see how it changes (see the last section of this article for detail).

Validation page with a specific label selected

Improving overall model and individual label performance

If your model or a specific label has a performance warning, the platform recommends the next best action that it thinks will help address that warning. This will either be via one of the factor cards (as shown below) or when you select a specific label from the taxonomy or the 'All Labels' chart (as shown above).

Example Factor card showing recommended actions

For some labels that have very low average precision, it might not always be clear why they are performing badly. Possible reasons why a label may have low average precision and how to improve it are discussed in the following article here. The platform will always provide what it thinks are the best corrective actions to improve the label, listed in order of priority.

When you aren't in the Validation page, the red and amber performance warnings still appear in the taxonomy filter in Explore and Reports, so these can be a helpful indicator to check Validation to see the recommended corrective actions.

The next best actions suggestions act as links that you can click to take you direct to the training view that the platform suggests in order to improve the label's performance or reduce its bias. The suggestions are intelligently ordered with the highest priority action to improve the label listed first.

These are the most important tool to help you understand the performance of your model, and should regularly be used as guides when trying to improve the performance of a label.

The next article covers in detail the reasons why a label may have low average precision (here). The later articles in this section of the Knowledge Base also cover the other training modes that the platform may suggest using to improve your model's performance, which are 'Check label', 'Missed label', and 'Rebalance'.

Please Note: If you have an older model version selected, you will still see the warnings and suggested actions, but if you hover over the information icon with your mouse, a pop-up will warn you that these specific stats cannot be changed as there is a newer model version (which you can work on).

Understanding the label threshold slider

Please Note: adjusting the confidence threshold slider does not actually impact the model, it just gives you a visual representation of how predictions would be impacted for that label if a certain confidence threshold was selected. This can help you select a suitable threshold for streams or using predictions for downstream automation.

To adjust the confidence threshold for a label, users can type a % figure into the box, or just drag the slider along the scale.

Label specific Validation stats

Notice how in the image above, the confidence threshold for the ‘Quote > Motor’ label is set at 68.7%, giving a precision of 100% and recall of 85%.

Then, in the image below, the confidence threshold has been adjusted using the slider to 17%, which brings the precision down to 84%, but increases the recall to 100%.

Label specific Validation charts

To better understand the relationship between precision and recall, see here, and to better understand how you would select thresholds to be used in automations, see here.

Previous: How does Validation work? | Next: Why might a label have low average precision?

Model Training & Maintenance

Getting Started

Manage Accounts & Access

Model Training & Maintenance

Using Analytics & Monitoring

Automations & Communications Mining

Technical Support

FAQs & More

Model Training & Maintenance

Getting Started

Manage Accounts & Access

Using Analytics & Monitoring

Automations & Communications Mining

Technical Support

FAQs & More

Understanding and improving model performance

Understanding overall model performance

Factors

Understanding label performance

Individual Label performance

Improving overall model and individual label performance

Understanding the label threshold slider

Sections