User permissions required: 'View Sources' AND 'View Labels'
Overall model performance
The main Validation page shows you overall performance statistics for your model, as well as a chart plotting the average precision of each label against its training set size.
The two key measures of overall performance are:
Essentially, the higher your MAP score, the better the model is performing overall in terms of the accuracy of its predictions. However, this will vary between datasets depending on the type of data and the objectives you have centred your taxonomy around.
You can use the MAP score to compare the performance of the model against previous versions of the same model, but as a direct comparison with the model associated with a different dataset, it may not tell the whole story.
MAP is not a good measure of coverage, however, and should not be solely relied on to determine whether your model is fit for purpose. For more detail on understanding and checking your model's coverage, see here.
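To make the MAP score discussed above more concrete, here is a minimal sketch of how a mean average precision figure can be computed from ranked predictions. It uses the standard textbook definition of average precision; the source does not state exactly how the platform calculates its score, so treat this as an illustration only. The function names, label names and example confidences are all hypothetical.

```python
from typing import Dict, List, Tuple

# Each example is (model confidence for the label, whether the label truly applies).
Scored = List[Tuple[float, bool]]


def average_precision(scored: Scored) -> float:
    """Standard average precision for one label (assumed definition)."""
    # Rank examples from most to least confident.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each correct example.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_label: Dict[str, Scored]) -> float:
    """Mean of the per-label average precision scores."""
    scores = [average_precision(examples) for examples in per_label.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Invented data for two hypothetical labels.
example = {
    "Label A": [(0.9, True), (0.8, False), (0.7, True)],
    "Label B": [(0.95, True), (0.6, True), (0.2, False)],
}
map_score = mean_average_precision(example)
print(f"MAP: {map_score:.3f}")
```

Because the score is an average over labels, a high MAP can hide one or two weak labels, which is one reason to also review per-label performance.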
The main Validation page also lets you drill down into the individual performance of each label.
Main Validation page
Understanding label performance
The 'All labels' performance chart shown in the main Validation page (see above) gives an immediate visual indication of how each label is performing.
For a label to appear on this chart, it must have at least 20 pinned examples present in the training set used by the platform during validation. To ensure that this happens, users should provide a minimum of 25 (often more) pinned examples per label.
Each label will be plotted as one of three colours, based on the model's understanding of how the label is performing. Below, we explain what these mean:
Label performance indicators:
- Those labels plotted as green on the chart have a satisfactory performance level (based on numerous contributing factors, including number and variety of examples, as well as average precision for that label)
- Labels plotted as amber have slightly less than satisfactory performance or are potentially quite biased, and require a bit of training / correction to improve their performance or to reduce the bias in the way they were trained
- Labels plotted as red are poorly performing labels or potentially highly biased and require considerably more training / correction to bring their performance up to a satisfactory level or to significantly reduce the bias in the way they were trained
Please Note: You will also see the amber and red performance indicators appear in the label filter bars in Explore, Reports and Validation. This quickly shows you which labels need some help, and also which labels' predictions should not be relied upon (without some work to improve them) when using the analytics features.
Individual Label performance:
Users can select individual labels from the label filter bar (or by clicking the label's plot on the 'All labels' chart) in order to see the label's performance statistics, as well as any performance warnings and next best action suggestions to help improve it.
Validation page with a specific Label selected
Improving label performance
Model warnings and next best actions:
The coloured label indicators will help you quickly identify labels that need work to improve them. By clicking on a specific label, you will be shown the performance statistics for that label, as well as 'Model Warnings' and 'Next Best Actions' (as shown above).
If you have the current model version selected (at the top of the Validation page), the model warnings section will display warnings for labels that are not performing well, and the next best actions section will provide suggestions on how to improve the performance of the label.
The model warnings you will see generally refer to either the average precision of the label, or whether it was trained in a biased way (using only 'Search', 'Cluster' or 'Label' when training).
The next best action suggestions act as links that take you directly to the training view that Re:infer suggests in order to improve the label's performance or reduce its bias. The suggestions are intelligently ordered, with the highest priority action to improve the label listed first.
These are the most important tools to help you understand the performance of your model, and should regularly be used as guides when trying to improve the performance of a label.
Please Note: If you have an older model version selected, you will still see the warnings and suggested actions, but if you hover over the information icon with your mouse, a pop-up will warn you that these specific stats cannot be changed as there is a newer model version (which you can work on).
Why is my label not performing well?
Outlined below are some of the main reasons why a label may have a low average precision, each with a suggested solution to improve it:
1. The training set size may be too small
- If the training set size is quite small, it may be that you just need to provide more training examples for the model
- Continue training the label using the methods outlined in the Explore phase
2. The label may have been applied inconsistently or incorrectly to some of the verbatims
- It can often be the case that a user’s definition of a label changes over time, and older reviewed verbatims with that label may need revisiting to see if the label still applies
- Alternatively, if there are multiple users training a dataset, they could have different interpretations of what each label means, and send mixed signals to the model
- To determine whether this is the case, users can use the Teach function to go through the reviewed verbatims for the label, and see where a label has been applied inconsistently or incorrectly
- Users can then correct any errors and update labels to ensure consistency
- Going forward, if there are multiple users training a dataset, they should ensure that they are fully aligned on how they define the intents or concepts covered by each label
3. The intent or concept that the label is intended to capture may be vague or very broad and hard to distinguish from other labels
- If a label is used to capture a very broad or vague intent or concept, it can be hard for the model to identify why that label should apply to a verbatim – it may then try to apply it to far too many verbatims
- Try not to be too generic when creating a label; it needs to be identifiable and distinguishable from other labels
4. Alternatively, the intent or concept could be very specific or have too many layers in its hierarchy
- Trying to be too specific or adding many layers to a label’s hierarchy can make it too difficult for the model to detect, or distinguish it from previous layers
- The level of specificity for a label should match the content of the verbatims. If it is too specific to realistically distinguish from other similar labels in the hierarchy, the model may get confused
- In most cases, it is best practice to have three layers or fewer in a label’s hierarchy, i.e. [Root label] > [Connecting label] > [Leaf label]
5. There may be two or more labels in the taxonomy that heavily overlap, and the model struggles to distinguish between them
- If you have two labels that are very similar and hard to distinguish from one another, it can confuse the model, as it won’t know which of the two labels applies
- In these instances, consider merging the labels
- Alternatively, go through the reviewed verbatims for each and make sure that the concepts are applied consistently and are distinct from one another
6. The verbatims with that label applied may mostly be very similar or identical, and the model struggles to detect different ways of expressing the same intent or concept
- You should ensure that for every label you provide the model with multiple training examples that include various different ways of expressing the intent or concept that the label is intended to capture
7. The intent or concept captured by that label is not semantically inferable from the text of the verbatim or its supporting metadata
- It is common for users to label a verbatim based on their own business knowledge of the context or process that would follow, and not on the actual text or metadata of the verbatim
- For example, an SME user may know that because the communication has come from a certain individual, it must be about a certain topic, even though nothing else in the text or metadata clearly indicates that the label should apply
- In this instance, users should only apply the label if the model would be able to detect it from the text or metadata, without this inside knowledge
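As a small illustration of the hierarchy guidance in point 4 above, the sketch below flags label names that have more than three levels. It assumes label names are written with ' > ' between levels, as in 'Margin Call > Failing Trade'; the helper names and the four-level example label are hypothetical and are not part of the platform.

```python
from typing import Iterable, List

MAX_LEVELS = 3  # suggested best-practice limit from point 4 above


def hierarchy_levels(label_name: str) -> List[str]:
    """Split a label name such as 'Root > Connecting > Leaf' into its levels."""
    return [part.strip() for part in label_name.split(">") if part.strip()]


def labels_deeper_than(label_names: Iterable[str], max_levels: int = MAX_LEVELS) -> List[str]:
    """Return the label names whose hierarchy has more than max_levels levels."""
    return [name for name in label_names if len(hierarchy_levels(name)) > max_levels]


# Hypothetical taxonomy used only for this example.
taxonomy = [
    "Margin Call > Failing Trade",
    "Request > Statement > Monthly > Duplicate Copy",
]
too_deep = labels_deeper_than(taxonomy)
print(too_deep)
```

A label flagged this way is not necessarily wrong, but it is worth checking whether the model can realistically distinguish its deepest level from the levels above it.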
Understanding the label threshold slider
Please Note: Adjusting the confidence threshold slider does not actually impact the model. It just gives you a visual representation of how predictions would be impacted for that label if a certain confidence threshold were selected. This can help you select a suitable threshold for triggers or for using predictions in downstream automation.
To adjust the confidence threshold for a label, users can type a % figure into the box, or just drag the slider along the scale.
Label specific Validation charts
Notice how in Figure X above, the confidence threshold for the ‘Margin Call > Failing Trade’ label is set at 86.8%, giving a precision of 100% and recall of 50%.
Then, in Figure Y below, the confidence threshold has been adjusted using the slider to 39.6%, which brings the precision down to 90%, but increases the recall to 90%.
Label specific Validation charts
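The trade-off described for Figure X and Figure Y can be reproduced with a short sketch that computes precision and recall at a chosen confidence threshold. The confidence values below are invented so that the results match the percentages quoted above; they are not taken from the platform, and the function name is hypothetical.

```python
from typing import List, Tuple

# Each example is (model confidence for the label, whether the label truly applies).
Scored = List[Tuple[float, bool]]


def precision_recall_at(threshold: float, scored: Scored) -> Tuple[float, float]:
    """Precision and recall if predictions at or above threshold are accepted."""
    accepted = [has_label for confidence, has_label in scored if confidence >= threshold]
    true_positives = sum(accepted)
    total_positives = sum(has_label for _, has_label in scored)
    # Returning 0.0 when nothing is accepted is a convention chosen for this sketch.
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Invented validation examples: 10 where the label applies, 3 where it does not.
examples: Scored = [
    (0.99, True), (0.97, True), (0.95, True), (0.92, True), (0.90, True),
    (0.80, True), (0.70, True), (0.60, True), (0.50, True), (0.30, True),
    (0.45, False), (0.20, False), (0.10, False),
]

for threshold in (0.868, 0.396):
    precision, recall = precision_recall_at(threshold, examples)
    print(f"threshold={threshold:.1%} precision={precision:.0%} recall={recall:.0%}")
```

Lowering the threshold accepts more predictions, so recall rises while precision tends to fall. As with the slider itself, nothing about the model changes.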