When building a taxonomy in a dataset, the intended use case will typically fall into one of two categories: analytics and monitoring, or automation.


Before building a taxonomy for either purpose, it's worth understanding how the two kinds of dataset differ:


Analytics and monitoring focused datasets

  • The objective of an analytics and monitoring dataset is usually to gain a detailed understanding of the various processes, issues, and sentiments within a data source or collection of sources
  • These datasets provide initial insights once the model is trained, and an ongoing ability to monitor changes and trends within the dataset over time
  • Given their purpose, they typically have detailed, extensive taxonomies with lots of labels, though they usually have fewer pinned examples per label than automation focused datasets
  • As they are intended to capture more specific labels across an entire dataset, they typically sacrifice some accuracy in their predictions in order to achieve detailed coverage across a broad range of topics


Automation focused datasets

  • These typically have very small taxonomies with lots of pinned examples for every label
  • Lots of examples are needed per label to catch the various edge cases in the dataset and to ensure precision and recall are as high as required
  • For each label involved in an automation, aim to maximise both precision and recall (depending on the use case, you might optimise one slightly over the other). It is not typically possible for both precision and recall to reach 100%, as there will almost always be some exceptions
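
To make the precision and recall trade-off concrete, here is a minimal Python sketch of how the two metrics are calculated for a single label. Precision is the share of the label's predictions that were correct, and recall is the share of the label's true occurrences that were found. The function name and the counts used are hypothetical and for illustration only; they are not taken from the platform or from any real dataset.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one label from its prediction counts."""
    predicted = true_positives + false_positives  # everything the model labelled
    actual = true_positives + false_negatives     # everything that should be labelled
    # Guard against division by zero when a label has no predictions or no occurrences.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for a single label in an automation-focused dataset.
precision, recall = precision_recall(
    true_positives=90, false_positives=10, false_negatives=30
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints "precision=0.90, recall=0.75"
```

In this example the label is right 90% of the time when it is predicted, but it misses a quarter of the messages it should apply to. Whether that balance is acceptable depends on the use case, as noted above.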


