Data annotation takes 20 times as much work as the engineering required to train a model. Pre-processing the data can cut redundant manual work from an annotation process that is already labor-intensive and expensive. A veteran of managing the complicated training data process will share how to prepare your data properly and head off headaches, from annotators tagging duplicate documents to machine learning models being confused by essentially identical characters encoded in different ways. The encoding problem is especially acute for Chinese, Japanese, Korean, and Arabic-script languages, but it also affects European languages with accented characters.
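
As a rough illustration of the two headaches named above, the sketch below normalizes Unicode text and then drops exact duplicates before documents reach annotators. It is not the speaker's method: the choice of NFKC normalization, the whitespace collapsing, and the helper names `normalize_text` and `dedupe` are all assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Fold visually equivalent encodings into one form (assumed: NFKC).

    NFKC composes base letters with combining accents and maps
    compatibility variants such as full-width Latin letters and
    half-width katakana to their standard code points.
    """
    return unicodedata.normalize("NFKC", text)


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first copy of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        # Collapse runs of whitespace so spacing differences do not
        # hide an otherwise identical document (an assumption here).
        canonical = " ".join(normalize_text(doc).split())
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    "caf\u00e9 menu",    # precomposed e-acute
    "cafe\u0301  menu",  # e + combining acute accent, extra space
    "\uff21\uff22\uff23",  # full-width "ABC"
    "ABC",
    "other",
]
unique_docs = dedupe(docs)  # three documents remain
```

NFKC is lossy (for example, it erases the distinction between full-width and ordinary letters), so whether it suits a given corpus is a judgment call. Hashing also catches only documents that are identical after normalization; near-duplicates would need a different technique.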