Preparing Data for .CSV Upload

User permissions required: 'Sources admin' AND 'Edit verbatims'


You can find instructions on uploading data from a .csv here, along with common error messages you may encounter in the platform.  

Before uploading data into Communications Mining, there are a few factors to consider when preparing it for ingestion by the platform.

Important Note: Please ensure you are uploading a .csv file, and not an Excel file. 

If you have been opening the .csv in Excel and making changes, this can introduce formatting problems that cause errors at the point of upload. To avoid this, please make any updates in the .csv directly.
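
As a quick sanity check, the first bytes of a file can show whether it is really an Excel workbook that was renamed to .csv. The sketch below is illustrative and not part of the platform; the function name `looks_like_excel` is hypothetical. It relies on two well-known file signatures: .xlsx files are ZIP archives, and legacy .xls files are OLE compound documents.

```python
# File signatures ("magic bytes") at the start of Excel workbooks.
XLSX_MAGIC = b"PK\x03\x04"  # ZIP archive header (also matches any other ZIP file)
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE compound document header


def looks_like_excel(first_bytes):
    """Return True if the leading bytes match an Excel workbook signature."""
    return first_bytes.startswith(XLSX_MAGIC) or first_bytes.startswith(XLS_MAGIC)


# In-memory samples standing in for the first bytes of real files.
csv_head = b"id,message\n1,I would like to renew my policy\n"
xlsx_head = b"PK\x03\x04\x14\x00\x06\x00"

print(looks_like_excel(csv_head))
print(looks_like_excel(xlsx_head))
```

For a file on disk, the same check could be run on the result of `open(path, "rb").read(8)`.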

Additionally, please check for the following issues before uploading your .csv into the platform. They can cause upload errors, or data quality problems that degrade model performance:




Duplicate rows 

Having the same data repeated multiple times across the data extract
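
Duplicate rows can be found before upload with a short script. The following is a minimal sketch using Python's standard `csv` module; the column names `id` and `message` and the function name `find_duplicate_rows` are assumptions made for illustration, not names required by the platform.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_text):
    """Return {row_tuple: count} for data rows that appear more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row) for row in reader if row)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical extract with one repeated row.
sample = (
    "id,message\n"
    "1,I would like to renew my policy\n"
    "2,Please cancel my policy\n"
    "1,I would like to renew my policy\n"
)

duplicates = find_duplicate_rows(sample)
for row, n in duplicates.items():
    print(f"{n} copies of {row}")
```

This only catches rows that are identical in every field. Rows that repeat the same text under a different ID would need a comparison on the text column alone.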

Mismatched headers

Having headers aligned to the wrong data fields

Hanging rows or columns

Not having all the data contained in sequential rows

Having all comments in rows 1 to 10,000, but also having a cell containing data in row 19,999.
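
One way to spot a hanging row is to look for data that reappears after a gap of empty rows. The sketch below assumes that a "hanging" row is any non-empty row that follows at least one empty row; the function name `find_hanging_rows` and the sample columns are hypothetical.

```python
import csv
import io


def find_hanging_rows(csv_text):
    """Return line numbers of non-empty rows that follow an empty row."""
    reader = csv.reader(io.StringIO(csv_text))
    hanging = []
    seen_gap = False
    for row in reader:
        # A row counts as empty if every cell is blank or whitespace.
        is_empty = not any(cell.strip() for cell in row)
        if is_empty:
            seen_gap = True
        elif seen_gap:
            # line_num counts physical lines read so far (1-based).
            hanging.append(reader.line_num)
    return hanging


# Hypothetical extract: two empty rows, then a stray cell of data.
sample = (
    "id,message\n"
    "1,Renew my policy\n"
    "2,Cancel my policy\n"
    ",\n"
    ",\n"
    ",late comment\n"
)

print(find_hanging_rows(sample))
```

Hanging columns could be checked in a similar way by comparing the number of fields in each row against the number of headers.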

Inconsistent date formatting

Different rows with inconsistent date formats

Having some comments in US date format and others in EU date format in the same dataset, which causes problems when the dates are normalized downstream.
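
Mixed date conventions can sometimes be detected from the values themselves, because a component greater than 12 cannot be a month. The following sketch assumes slash-separated dates with a four-digit year; the function name `classify_date` and the labels it returns are illustrative. Dates such as 01/02/2023 fit both conventions and are reported as ambiguous rather than guessed.

```python
import re

# Matches dates such as 1/31/2023 or 31/01/2023 (slash-separated, 4-digit year).
SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def classify_date(value):
    """Classify a date string as 'US', 'EU', 'ambiguous' or 'unknown'."""
    match = SLASH_DATE.match(value.strip())
    if not match:
        return "unknown"
    first, second = int(match.group(1)), int(match.group(2))
    if not (1 <= first <= 31 and 1 <= second <= 31):
        return "unknown"
    if first > 12 and second > 12:
        return "unknown"  # neither component can be a month
    if first > 12:
        return "EU"  # day first, e.g. 31/01/2023
    if second > 12:
        return "US"  # month first, e.g. 01/31/2023
    return "ambiguous"  # e.g. 01/02/2023 fits both conventions


dates = ["01/31/2023", "31/01/2023", "01/02/2023", "2023-01-31"]
labels = [classify_date(d) for d in dates]
mixed = {"US", "EU"} <= set(labels)

print(labels)
print("Mixed US and EU formats:", mixed)
```

If both labels appear in one column, the dates are best converted to a single format at the source before upload.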

Incoherent sentences

These are sentences that contain an assortment of words without a clear syntactic or semantic structure.


'The user is requesting a new portable 28442 298 ticket to be creaportableted' 

Inconsistent spacing

When there is an irregular number of spaces between words.

'The policy    is set to     renew' instead of 'The policy is set to renew'
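
Irregular spacing is one of the easier issues to repair automatically. A minimal sketch, assuming that only runs of spaces and tabs should be collapsed and that line breaks inside a message should be kept; the function name `normalize_spacing` is hypothetical.

```python
import re


def normalize_spacing(text):
    """Collapse runs of spaces or tabs into a single space and trim the ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


cleaned = normalize_spacing("The policy    is set to     renew")
print(cleaned)
```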

Breaks in words

When there are breaks in the middle of a word where there should be none.

'The po licy is set. to renew' instead of 'The policy is set to renew'

Erroneous character encoding

When text data is not properly encoded, resulting in garbled or unreadable characters.

'ThÇ åpp is gré¶t' instead of 'The app is great.'
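
Some encoding problems can be caught by checking that every line of the file decodes cleanly. The sketch below assumes the file is meant to be UTF-8, which is an assumption and not a documented requirement here; the function name `find_suspect_lines` is hypothetical. It flags lines with invalid bytes and lines that contain the Unicode replacement character (U+FFFD), which usually indicates an earlier lossy conversion. It cannot detect text that was garbled but re-saved as valid UTF-8, such as the example above.

```python
def find_suspect_lines(raw_bytes, encoding="utf-8"):
    """Return 1-based numbers of lines that fail to decode or contain U+FFFD."""
    suspect = []
    for number, line in enumerate(raw_bytes.splitlines(), start=1):
        try:
            text = line.decode(encoding)
        except UnicodeDecodeError:
            suspect.append(number)  # bytes are not valid in this encoding
            continue
        if "\ufffd" in text:
            suspect.append(number)  # replacement character from a lossy conversion
    return suspect


# Hypothetical file contents: line 3 has a Latin-1 byte, line 4 has U+FFFD.
sample = (
    b"id,message\n"
    b"1,The app is great\n"
    b"2,The app is gr\xe9at\n"
    + "3,The app is gr\ufffdat\n".encode("utf-8")
)

print(find_suspect_lines(sample))
```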

Blank comments

Communications without any content included in the subject/body
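
Blank comments can be listed before upload so they can be removed from the extract. This is a sketch only: the column names `subject` and `body` and the function name `find_blank_comments` are assumptions for illustration and should be replaced with the text columns in your own file.

```python
import csv
import io


def find_blank_comments(csv_text, text_columns=("subject", "body")):
    """Return line numbers of rows whose text columns are all empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blank = []
    for row in reader:
        # Treat missing, empty and whitespace-only values as blank.
        if all(not (row.get(col) or "").strip() for col in text_columns):
            blank.append(reader.line_num)
    return blank


# Hypothetical extract: rows on lines 3 and 4 have no subject or body.
sample = (
    "id,subject,body\n"
    "1,Renewal,I would like to renew my policy\n"
    "2,,\n"
    "3,,   \n"
)

print(find_blank_comments(sample))
```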

Comments with lots of typos

Text data containing lots of errors in spelling

Headers / footers

When there are headers or footers included

Spam warnings, virus scan warnings, etc.

Metadata included in the subject/body instead of as a metadata property

When metadata is included in the subject or body

'[01/01/2023] I would like to renew my policy' as the body of a message, instead of 'I would like to renew my policy' as the message with 01/01/2023 as the date included in the metadata.
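
Where metadata follows a predictable pattern, it can be moved out of the text with a regular expression. The sketch below handles only the bracketed-date pattern from the example above and assumes a day/month/year order; the function name `split_leading_date` and the `date_format` parameter are hypothetical.

```python
import re
from datetime import date, datetime

# Matches a leading bracketed date such as "[01/01/2023] ".
LEADING_DATE = re.compile(r"^\[(\d{2}/\d{2}/\d{4})\]\s*")


def split_leading_date(body, date_format="%d/%m/%Y"):
    """Return (body_without_date, date) or (body, None) if no valid date leads."""
    match = LEADING_DATE.match(body)
    if not match:
        return body, None
    try:
        parsed = datetime.strptime(match.group(1), date_format).date()
    except ValueError:
        return body, None  # looks like a date but is not valid in this format
    return body[match.end():], parsed


message, message_date = split_leading_date(
    "[01/01/2023] I would like to renew my policy"
)
print(message)
print(message_date)
```

The extracted date can then be written to its own column so that it is uploaded as a metadata property rather than as part of the message text.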

Multiple messages combined into one message

When multiple messages that should have been broken out into separate messages in a thread are combined into a single communication.

