PLEASE NOTE: UiPath Communications Mining's Knowledge Base has been fully migrated to UiPath Docs. Please refer to the equivalent articles in UiPath Docs for up-to-date guidance, as this site is no longer updated or maintained.

Multilingual sources and datasets

Communications Mining supports multilingual sources and datasets. This means that models can understand sources containing several supported languages without having to translate them.

 

The languages currently in 'General Availability' for multilingual sources and datasets are:

 

  • English
  • French
  • German
  • Spanish
  • Italian
  • Portuguese
  • Dutch 

 

And we'll be expanding this list over time!

 

What this means in practice is that if users work and do business in several languages that are supported by Communications Mining, they can train models on verbatims in those languages, rather than translating everything into a single language.


A large number of additional languages are supported 'In Preview' (listed at the bottom of this article), meaning that we will fine-tune them over time as our customers and partners begin to use them. Many of these languages are expected to perform strongly regardless, requiring little to no fine-tuning by our teams to achieve high performance.

 


Important considerations when looking to use multilingual sources and datasets:


  • If a dataset is multilingual, users will not be able to see translations of any verbatims (as provided for translated datasets), so they will need to understand all of the languages in the dataset to train their model effectively.
  • Understanding multiple languages is a more complex machine learning problem than understanding a single language, so these datasets may experience a slight drop in performance compared to single-language datasets.
  • The platform can only understand text written in one of the supported languages listed above. If other languages are present in the dataset, applying the same labels to those verbatims as to verbatims in supported languages will confuse the platform. It is better to give these verbatims their own labels that capture the language itself, bearing in mind that the platform will not be able to interpret the content of the unsupported language.
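
The last consideration above can be sketched as a small pre-labelling check. This is an illustration only and not part of the Communications Mining platform or its API: the function name `labelling_advice`, the `Language: <name>` label format, and the abbreviated 'In Preview' set are all hypothetical, and the sketch assumes the language of each verbatim is already known.

```python
# Hypothetical helper for illustration; not part of Communications Mining or its API.

# General Availability languages, as listed in this article.
GA_LANGUAGES = {
    "English", "French", "German", "Spanish", "Italian", "Portuguese", "Dutch",
}

# A short excerpt of the 'In Preview' list, for illustration only.
PREVIEW_LANGUAGES_EXCERPT = {"Danish", "Japanese", "Polish", "Swedish", "Welsh"}


def labelling_advice(language: str) -> str:
    """Suggest how to treat a verbatim written in `language`."""
    name = language.strip().title()
    if name in GA_LANGUAGES:
        return "label with the normal taxonomy"
    if name in PREVIEW_LANGUAGES_EXCERPT:
        return "label with the normal taxonomy (In Preview language)"
    # Unsupported language: keep it out of the normal labels and
    # capture the language itself as its own label instead.
    return f"apply a language-specific label, e.g. 'Language: {name}'"
```

For example, `labelling_advice("french")` suggests using the normal taxonomy, while a language outside both sets is routed to its own language-specific label, so it does not dilute the labels used on supported languages.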

 


How do you create multilingual sources and datasets?

 

For both sources and datasets, the language family is selected at creation and cannot be changed afterwards.


Simply select 'multilingual' from the language family dropdown on the create source or create dataset modal (it's typically the last setting to select).

 

Please Note: Multilingual datasets can contain sources of any language family that the platform supports.


For more detail on creating a source in the UI, see here.


For more detail on creating a dataset, see here.


General Availability Languages

  • English
  • Dutch
  • French
  • German
  • Italian
  • Portuguese
  • Spanish

 

Supported languages 'In Preview'

 

  • Afrikaans
  • Albanian
  • Amharic
  • Arabic
  • Armenian
  • Assamese
  • Azerbaijani
  • Basque
  • Belarusian
  • Bengali
  • Bengali (Romanized)
  • Bosnian
  • Breton
  • Bulgarian
  • Burmese
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Esperanto
  • Estonian
  • Filipino
  • Finnish
  • Galician
  • Georgian
  • Greek
  • Gujarati
  • Hausa
  • Hebrew
  • Hindi
  • Hindi (Romanized)
  • Hungarian
  • Icelandic
  • Indonesian
  • Irish
  • Japanese
  • Javanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Kurdish (Kurmanji)
  • Kyrgyz
  • Lao
  • Latin
  • Latvian
  • Lithuanian
  • Macedonian
  • Malagasy
  • Malay
  • Malayalam
  • Marathi
  • Mongolian
  • Nepali
  • Norwegian
  • Oriya
  • Oromo
  • Pashto
  • Persian
  • Polish
  • Punjabi
  • Romanian
  • Russian
  • Sanskrit
  • Scottish Gaelic
  • Serbian
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Somali
  • Sundanese
  • Swahili
  • Swedish
  • Swiss German
  • Tamil
  • Tamil (Romanized)
  • Telugu
  • Telugu (Romanized)
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Urdu (Romanized)
  • Uyghur
  • Uzbek
  • Vietnamese
  • Welsh
  • Western Frisian
  • Xhosa
  • Yiddish

 

