User permissions required: 'View Sources' AND 'View Labels'
The main Validation page shows overall performance statistics for your model, along with a chart plotting the average precision of each label against its training set size.
The two key measures of overall performance are:
In general, the higher your MAP score, the better the model is performing overall. What counts as a good score varies between datasets, depending on the type of data and the objectives you have centred your taxonomy around.
You can use the MAP score to compare the model's performance against previous versions of the same model. As a direct comparison with the model associated with a different dataset, however, it may not tell the whole story.
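As a rough illustration of what a MAP score summarises, the sketch below assumes MAP is the mean of each label's average precision, and that average precision is computed by ranking predictions by confidence. The function names, label names and data are hypothetical, and the platform's own calculation may differ in its details.

```python
from typing import Dict, List, Tuple

# Each item is (model confidence, whether the label truly applies).
Scored = List[Tuple[float, bool]]


def average_precision(scored: Scored) -> float:
    """Average of the precision values measured at each correct prediction,
    walking down the list from highest to lowest confidence."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_label: Dict[str, Scored]) -> float:
    """Unweighted mean of the per-label average precision values."""
    if not per_label:
        return 0.0
    return sum(average_precision(s) for s in per_label.values()) / len(per_label)


# Hypothetical validation results for two labels.
example = {
    "Label A": [(0.9, True), (0.8, False), (0.7, True)],
    "Label B": [(0.9, True), (0.6, True)],
}
print(round(mean_average_precision(example), 3))
```

Because the score is an average over labels, a change in the taxonomy or in the underlying data shifts what the number means, which is why comparisons across different datasets need care.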
The main Validation page also lets you drill down into the individual performance of each label.
Main Validation page
The 'All labels' performance chart shown in the main Validation page (see above) gives an immediate visual indication of how each label is performing.
For a label to appear on this chart, it must have at least 20 pinned examples in the training set used by the platform during validation. To ensure this, users should provide a minimum of 25 (often more) pinned examples per label.
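If you keep your own record of pinned examples outside the platform, a check like the following can flag labels that fall short of the recommended 25. This is only a sketch: the input format, function name and label names are assumptions made for illustration, not a platform export or API.

```python
from collections import Counter
from typing import Dict, Iterable

# Figures taken from the guidance above.
MIN_PINNED_FOR_CHART = 20   # needed in the training set used during validation
RECOMMENDED_PINNED = 25     # recommended minimum to provide per label


def pinned_shortfall(pinned_labels: Iterable[str],
                     target: int = RECOMMENDED_PINNED) -> Dict[str, int]:
    """Map each label below `target` to the number of extra pinned examples it needs.

    `pinned_labels` is assumed to contain one entry per pinned example.
    """
    counts = Counter(pinned_labels)
    return {label: target - n for label, n in counts.items() if n < target}


# Hypothetical counts: 30, 22 and 8 pinned examples.
pinned = ["Urgent"] * 30 + ["Complaint"] * 22 + ["Refund"] * 8
print(pinned_shortfall(pinned))
```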
Each label will be plotted as one of three colours, based on the model's understanding of how the label is performing. Below, we explain what these mean:
Label performance indicators
Label performance indicators:
- Those labels plotted as green on the chart have a satisfactory performance level (based on numerous contributing factors, including number and variety of examples, as well as average precision for that label)
- Labels plotted as amber have slightly less than satisfactory performance, and require a bit of training / correction to improve their performance
- Labels plotted as red are poorly performing labels and require considerably more training / correction to bring their performance up to a satisfactory level
Please Note: you will also see the amber and red performance indicators appear in the label filter bars in Explore, Reports and Validation. This quickly shows you which labels need some help, and also which labels' predictions should not be relied upon (without some work to improve them) when using the analytics features.
Individual Label performance:
Users can select individual labels from the label filter bar (or by clicking the label's plot on the 'All labels' chart) in order to see performance statistics specific to that label.
Validation page with a specific Label selected
Below the summary stats for the Label, you will see a 'Model Warnings' section.
If you have the current Model version selected (at the top of the Validation page), this section will display a warning for Labels that are not performing well, as well as suggestions on how to improve the performance of that Label.
If you have an older model version selected, you will still see the warnings and suggested actions, but if you hover over the information icon with your mouse, a pop-up will warn you that these specific stats cannot be changed as there is a newer model version (which you can work on).
To understand more about improving label performance, click here.
Label threshold slider
To adjust the confidence threshold for a label, users can type a % figure into the box, or just drag the slider along the scale.
Label specific Validation charts
Notice how in Figure X above, the confidence threshold for the ‘Margin Call > Failing Trade’ label is set at 86.8%, giving a precision of 100% and recall of 50%.
Then, in Figure Y below, the confidence threshold has been adjusted using the slider to 39.6%, which brings the precision down to 90%, but increases the recall to 90%.
Label specific Validation charts
To better understand the relationship between precision and recall in Re:infer, see here.
Please Note: adjusting the confidence threshold slider does not actually change the model; it only gives you a visual representation of how predictions for that label would be affected if that confidence threshold were selected. This can help you select a suitable threshold for triggers or for using predictions in downstream automation.
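The trade-off the slider visualises can be sketched in a few lines. The example below assumes a prediction counts as made when its confidence is at or above the threshold. The data is invented, chosen only so that the two thresholds mentioned above give the same precision and recall figures, and the function name is hypothetical.

```python
from typing import List, Tuple

# Each item is (model confidence, whether the label truly applies).
Scored = List[Tuple[float, bool]]


def precision_recall_at(threshold: float, scored: Scored) -> Tuple[float, float]:
    """Precision and recall if the label is predicted whenever confidence >= threshold.

    Precision is reported as 0.0 when nothing is predicted; conventions vary.
    """
    predicted = [relevant for confidence, relevant in scored if confidence >= threshold]
    true_positives = sum(predicted)
    total_relevant = sum(relevant for _, relevant in scored)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented validation examples for one label: 10 where it applies, 3 where it does not.
examples: Scored = (
    [(c, True) for c in (0.95, 0.93, 0.91, 0.90, 0.88, 0.80, 0.70, 0.60, 0.50, 0.30)]
    + [(c, False) for c in (0.45, 0.20, 0.10)]
)

for threshold in (0.868, 0.396):
    precision, recall = precision_recall_at(threshold, examples)
    print(f"threshold {threshold:.1%}: precision {precision:.0%}, recall {recall:.0%}")
```

Lowering the threshold lets more predictions through, so recall rises while precision can fall. Nothing about the model itself changes, only where the cut-off is drawn.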