Before you begin model training, it's important to understand how to approach creating your taxonomy, including how to name and structure your labels and what those labels should actually capture. This article covers each of these topics, beginning with naming labels.
The following article discusses why it's so important to structure your taxonomy properly to meet your business objectives.
Deciding names for your labels can seem a daunting or time-consuming prospect - but it doesn't need to be.
For a start, Re:infer doesn't care what a label's name is: to our models, the label name itself is just a number. What's important is that the label name serves a business purpose and is a useful description of the specific concept it is intended to capture.
You can rename labels at any time, and add levels of hierarchy when needed, so don't spend too much time thinking of the perfect name when first building your model.
Part of naming a label is determining its hierarchy within your taxonomy. Labels can have multiple levels of hierarchy, simply separated by '>', to capture when a label concept is a sub-set of a broader parent concept.
So potential label structures can look like this (also see example in image below):
- [Parent Label]
- [Parent Label] > [Child Label]
- [Parent Label] > [Branch Label] > [Child Label]
You can add more than three levels of hierarchy, but we don't recommend doing this often, as it becomes increasingly complex to train. It might be required in certain cases, but it shouldn't be considered best practice.
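As a rough illustration of the naming convention above, the sketch below splits a label name on '>' to recover its levels of hierarchy and the broader labels it sits under. The helper names `split_label` and `ancestors` are hypothetical and exist only for this example; they are not part of Re:infer.

```python
def split_label(name: str) -> list[str]:
    """Split a label name such as 'A > B > C' into its hierarchy levels."""
    return [part.strip() for part in name.split(">")]


def ancestors(name: str) -> list[str]:
    """Return the full names of every label above `name`, broadest first."""
    levels = split_label(name)
    return [" > ".join(levels[:i]) for i in range(1, len(levels))]


label = "Parent Label > Branch Label > Child Label"
print(split_label(label))  # ['Parent Label', 'Branch Label', 'Child Label']
print(ancestors(label))    # ['Parent Label', 'Parent Label > Branch Label']
```

A label with no '>' in its name is a parent label, so `ancestors` returns an empty list for it.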
Conceptually, it's important that every label nested under another label is a subset of the label above it. This nesting (level of hierarchy) is created using a '>' when typing out the label name.
The visual below illustrates this point using a Venn diagram:
Illustration of how label hierarchies work conceptually
Again, you can add additional levels of hierarchy by renaming your labels later on in the model training process.
How do these hierarchies work in practice?
To understand this, let's take 'Child Label X' from the visual above as an example.
When Re:infer predicts that 'Child Label X' applies to a verbatim, it is also predicting that 'Branch Label C' and 'Parent Label 1' apply at the same time. This is because 'Child Label X' is a subset of these.
As each level of hierarchy adds an increasing level of specificity, however, Re:infer can be more confident that a parent or branch label applies than a more specific child label. This means Re:infer can assign different probabilities to different label predictions within the same hierarchy.
So for a particular verbatim, Re:infer could be:
- 99% confident that 'Parent Label 1' applies
- 88% confident that 'Branch Label C' applies
- 75% confident that 'Child Label X' applies.
It's worth noting that if a child label is predicted for a verbatim, Re:infer should always predict the parent label (and branch label where applicable) with at least the same confidence as the child label, if not greater.
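The relationship between these confidences can be sketched as a simple check. This is only an illustration: the dictionary of full label names to confidence scores is an assumed shape chosen for the example, not the format Re:infer returns, and `is_consistent` is a hypothetical helper.

```python
def is_consistent(confidences: dict[str, float]) -> bool:
    """Check that every label's ancestors are at least as confident as it is."""
    for name, score in confidences.items():
        levels = [part.strip() for part in name.split(">")]
        # Walk up the hierarchy: 'A > B > C' has ancestors 'A' and 'A > B'.
        for i in range(1, len(levels)):
            ancestor = " > ".join(levels[:i])
            if ancestor in confidences and confidences[ancestor] < score:
                return False
    return True


# The example figures from the list above, written as probabilities.
predictions = {
    "Parent Label 1": 0.99,
    "Parent Label 1 > Branch Label C": 0.88,
    "Parent Label 1 > Branch Label C > Child Label X": 0.75,
}
print(is_consistent(predictions))  # True
```

If the child label were given a higher confidence than its branch or parent label, the check would return False.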
The fact that Re:infer predicts each label independently is one of the key reasons why parent labels should capture genuine topics or concepts, rather than abstract ones.
For example, a parent label such as 'Process', used to group together child labels that relate to specific processes, is a bad choice for a parent label. 'Process' itself is an abstract concept and is not something that Re:infer will predict well on its own. In a business context, the name of the specific process something relates to (and which can be identified from the text of a verbatim) would be a useful parent label. Useful branch and child labels could then be relevant sub-processes of the main parent process.
How do I choose which concepts should be parent labels and which should be child labels?
You may sometimes need to make difficult choices regarding the structure of your taxonomy. For instance, it could be difficult choosing whether a label should be a parent label or a child label, as it could logically be a broad parent category with its own sub-categories, or be a specific sub-category of another broader parent category.
For example, imagine a dataset composed of hotel reviews. Many reviews could discuss the pricing of different aspects of the holiday and the hotel: the restaurant, the bar, the rooms, the activities, etc.
You could logically have 'Pricing' as a parent label, and each specific aspect of pricing (e.g. restaurant) as child labels.
You could also, however, have parent labels relating to the specific aspects like 'Restaurant' and 'Rooms' and have 'Pricing' as a child label under each.
So which should you choose?
It's helpful to consider a few things when deciding:
- Will there potentially be a significant number of other concepts you'd like to capture related to this broader topic? If yes, it's likely it should be a parent label.
- What's the most important thing to track from an MI or reporting perspective? Considering our example, is it more helpful to be able to clearly see in Re:infer's analytics exactly how many people are talking about pricing and its sub-categories? Or is it more helpful to see overall statistics on the feedback on Rooms, the Restaurant(s), activities, etc., with pricing being one aspect of those?
There is not necessarily a right or wrong answer in these situations - it comes down to what's more important for you and your business.
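To make the reporting trade-off concrete, the sketch below counts verbatims per parent label under both structures. The reviews and the 'Service' and 'Cleanliness' themes are made up purely for illustration, and `parent_counts` is a hypothetical helper rather than anything Re:infer provides.

```python
from collections import Counter

# Made-up reviews, each described by the hotel aspect and the theme discussed.
reviews = [
    ("Restaurant", "Pricing"),
    ("Restaurant", "Service"),
    ("Rooms", "Pricing"),
    ("Rooms", "Cleanliness"),
    ("Bar", "Pricing"),
]


def parent_counts(labels):
    """Count how many labels fall under each top-level (parent) label."""
    return Counter(label.split(">")[0].strip() for label in labels)


# Option 1: the theme is the parent label, e.g. 'Pricing > Restaurant'.
theme_first = [f"{theme} > {aspect}" for aspect, theme in reviews]
# Option 2: the aspect is the parent label, e.g. 'Restaurant > Pricing'.
aspect_first = [f"{aspect} > {theme}" for aspect, theme in reviews]

print(parent_counts(theme_first))
print(parent_counts(aspect_first))
```

With this sample data, the first structure shows at a glance that three reviews mention pricing, while the second shows that the restaurant and the rooms each received two reviews. Both are built from the same verbatims, so the choice depends on which view matters more for your reporting.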
What can I actually capture with a label?
So far we've discussed how to name labels and structure them in hierarchies, but you may still be wondering what exactly should a label capture?
It's important to remember that Re:infer is a natural language processing tool. The platform reads and interprets each verbatim that is assigned a label, and begins to form an understanding of how to identify that label concept based predominantly on the text within it.
As you add more varied and consistent examples for each label, Re:infer improves its understanding of that label concept. Once a label is performing well, however, adding more examples brings diminishing returns. Accepting lots of high-confidence predictions for a label does not give Re:infer new information and is a practice that should be avoided.
As Re:infer is using the language of the verbatim to understand and identify what constitutes a label concept, the label needs to be clearly identifiable from the text (i.e. the language) of the verbatims that it's applied to. For an email verbatim, this includes the subject AND the body of the email.
See this example email below, which has the label 'Margin Call > Full Agreement' applied to it. This label is clearly inferable from the subject and the body of the email.
Example email verbatim highlighting the text that Re:infer takes into account when making predictions
Whilst Re:infer is able to take into account certain metadata properties when training, particularly things like NPS score (for customer feedback datasets) that can help it understand sentiment, it is the text of the verbatim that is by far the most important data considered by Re:infer's models.
Please Note: Re:infer does not consider the specific sender or recipient address of an email, so this should not be used whatsoever when determining what label should be applied to an email verbatim.
What this all means is that it's important that each label is specific in what it's trying to capture, otherwise Re:infer will not be able to identify trends and patterns in the language that help it predict a label concept.
Why should I try to avoid using very broad labels?
Extremely broad labels such as 'General Query', or 'Everything else' can be very unhelpful if used to group together lots of different distinct topics and there's no clear pattern or commonality between the examples provided to Re:infer.
If Re:infer is to predict a label well, it needs multiple similar examples of the different expressions of each concept captured by a label. Extremely broad labels therefore require very large numbers of examples to be predicted well.
It's usually much better practice to split broad labels out into distinct labels - even if you have 'Everything Else > [Various child labels]'.
If Re:infer is better able to identify a child label because it's more specific and clearly identifiable (compared to a very broad parent category), it can actually significantly boost its ability to predict its parent label too.