Table of Contents
- What is the objective of training a model?
- Why can I not see anything in Discover if I've just uploaded data into the platform?
- How much historical data do I need to train a model?
- Do I need to save my model every time I make a change?
- How do I know what the performance of the model is?
- Why are there only 30 clusters available and can we set them individually?
- How many verbatims are in each cluster?
- What do precision and recall mean?
- Can I return to an earlier version of my model?
- Can I change the name of a label later on?
- How do I find out the number of verbatims I have labelled?
- One of my labels is performing poorly, what can I do to improve it?
- What does the red circle next to my label indicate? How do I get rid of it?
- Should I avoid labelling empty/uninformative verbatims?
Why can I not see anything in Discover if I've just uploaded data into the platform?
As soon as data is uploaded to the platform, Re:infer begins a process called unsupervised learning, whereby it groups verbatims into clusters of similar semantic intent. This process can take up to a couple of hours, depending on the size of the dataset, and clusters will appear once it is complete.
Please Note: Discover requires a minimum of 2048 verbatims in the dataset in order to create clusters.
How much historical data do I need to train a model?
To be able to train a model, you need a minimum amount of existing historical data. This is used as training data to provide Re:infer with the necessary information to confidently predict each of the relevant concepts for your analysis and/or automation.
The recommendation for any use case is a minimum of 12 months of historical data, in order to properly capture any seasonality or irregularity in the data (e.g. month-end processes and busy seasons).
If you do not have 12 months' worth of data, your use case may still be viable and you should speak to a member of the Re:infer team for advice.
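If you want a quick check of how much history an export covers before uploading it, the sketch below counts the whole months between the earliest and latest verbatim timestamps. It is only an illustration: `months_covered` and `RECOMMENDED_MONTHS` are hypothetical names, not part of the platform, and the code assumes you already have the timestamps as Python `datetime` objects.

```python
from datetime import datetime

# The 12-month recommendation from the guidance above (hypothetical constant name).
RECOMMENDED_MONTHS = 12


def months_covered(timestamps):
    """Return the number of whole months between the earliest and latest timestamp."""
    first, last = min(timestamps), max(timestamps)
    months = (last.year - first.year) * 12 + (last.month - first.month)
    # Do not count the final month if it has not been completed yet.
    if last.day < first.day:
        months -= 1
    return months


# Example timestamps (made up for illustration).
sample = [datetime(2022, 3, 1), datetime(2022, 11, 20), datetime(2023, 2, 28)]
covered = months_covered(sample)
print(covered, "months covered; meets recommendation:", covered >= RECOMMENDED_MONTHS)
```

A result below 12 does not rule the use case out; as noted above, it is a prompt to ask the Re:infer team for advice.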
Do I need to save my model every time I make a change?
No, you do not need to save your model after making changes. Every time you train the platform on your data (i.e. label any verbatims), a new model version is created for your dataset. Performance statistics for older model versions can be viewed in Validation.
How do I know what the performance of the model is?
Please check the Validation page in the platform, which reports various performance measures and provides a holistic model health rating. This page updates after every training event and it can be used to identify areas where the model may need more training examples or some label corrections in order to ensure consistency.
Please see our Knowledge Base for information on how to use the Validation page, and for full explanations of model performance and how to improve it.
Why are there only 30 clusters available and can we set them individually?
The clusters are a helpful way to quickly build up your taxonomy, but users will spend most of their time training in Explore rather than Discover.
If users spend too much time labelling via clusters, there’s a risk of overfitting the model to look for verbatims that only fit these clusters when making predictions. The more varied examples there are for each label, the better the model will be at finding the different ways of expressing the same intent or concept. This is one of the main reasons why we only show 30 clusters at a time.
Discover does retrain, however, once enough training has been completed or a significant volume of data has been added to the platform (see here). When it retrains, it takes into account the existing training to date and tries to present new clusters that are not well covered by the current taxonomy.
For more information on Discover, see here.
How many verbatims are in each cluster?
There are 30 clusters in total, each containing 12 verbatims. In the platform, you can set the number of verbatims shown per page, in increments between 6 and 12. We recommend labelling 6 at a time to reduce the risk of partially labelling any verbatims.
Please see our training best practice guide for further helpful tips.
What do precision and recall mean?
Precision and recall are metrics used to measure the performance of a machine learning model. A detailed description of each can be found under the Using Validation section of our how-to guides.
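As a rough illustration, the sketch below computes both metrics for a single label using their standard textbook definitions: precision is the share of verbatims predicted to have the label that really have it, and recall is the share of verbatims that really have the label that were predicted. The function name and the example data are made up for this illustration, and the figures shown on the Validation page may be calculated differently.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for one label.

    predicted, actual: equal-length lists of booleans, one entry per verbatim,
    saying whether the label was predicted / truly applies.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(p and a for p, a in pairs)       # predicted and correct
    false_pos = sum(p and not a for p, a in pairs)  # predicted but wrong
    false_neg = sum(a and not p for p, a in pairs)  # missed by the model
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Six example verbatims: 4 predictions, 3 of them correct, and 2 true examples missed.
predicted = [True, True, True, True, False, False]
actual = [True, True, True, False, True, True]
print(precision_recall(predicted, actual))  # (0.75, 0.6)
```

In this example the label is right 3 times out of 4 when it is predicted (precision 0.75), but it finds only 3 of the 5 verbatims that should carry it (recall 0.6).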
Can I return to an earlier version of my model?
You can access the validation overview of earlier models by hovering over ‘Model Version’ in the top left corner of the Validation page. This can be helpful for tracking and comparing progress as you train your model. However, you cannot return your model to an earlier version. If you are planning to make significant changes to your model, you should make a copy of your taxonomy before doing so. You can access more information on how to copy your taxonomy here.
You can pin a version of a model if you intend to use it to request predictions via the API. You can learn more about using model versions here.
Can I change the name of a label later on?
Yes, it’s really easy to do. You can go into the settings for each label and rename it at any point. You can see how to do it here.
How do I find out the number of verbatims I have labelled?
Information about your dataset, including how many verbatims have been labelled, is displayed in the Datasets Settings page. To see how to access it, click here.
One of my labels is performing poorly, what can I do to improve it?
If you can see in the Validation page that your label is performing poorly, there are various ways to improve its performance. See here to understand more.
What does the red circle next to my label indicate? How do I get rid of it?
The small red dial next to each label indicates that more examples are needed for Re:infer to accurately estimate the label's performance. The dial starts to disappear as you provide more training examples and will disappear completely once you reach 25 examples. After this, Re:infer will be able to effectively evaluate the performance of a given label and may return a performance warning if the label is not healthy.
You can read more about label warnings and how to avoid them here.
Should I avoid labelling empty/uninformative verbatims?
Re:infer is able to learn from empty and uninformative verbatims as long as they are labelled correctly. However, it is worth noting that labels for uninformative verbatims will likely need a significant number of training examples, and should be loosely grouped by concept, to ensure the best performance.