June 3, 2020

Data validation for NLP applications with topic models

In a recent article, we saw how to implement a basic validation pipeline for text data. Once a machine learning model has been deployed, its behavior must be monitored: predictive performance is expected to degrade over time as the environment changes. This is known as concept drift, which occurs when the distributions of the input features shift away from the distribution the model was originally trained on.

Machine learning pipeline with validation.

Read more
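One common way to monitor for the drift described above (a sketch, not the article's pipeline) is a two-sample test comparing a feature's training distribution against its live distribution; here using SciPy's Kolmogorov-Smirnov test, with the significance level `alpha` chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_feature, live_feature, alpha=0.05):
    """Flag drift when a two-sample KS test finds the live
    distribution significantly different from training."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=1000)    # training-time feature values
shifted = rng.normal(0.5, 1.0, size=1000)  # live values with a mean shift

assert not detect_drift(train, train)   # same distribution: no drift
assert detect_drift(train, shifted)     # shifted distribution: drift
```

In practice such a test would run per feature on a schedule, with thresholds tuned to tolerate benign fluctuation.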

January 30, 2020

Data validation for NLP machine learning applications

An important part of machine learning applications is making sure that there is no data degradation while a model is in production. Downstream data processing sometimes changes, and machine learning models are prone to silent failure as a result, so data validation is a crucial step of every production machine learning pipeline. This is relatively easy for well-specified tabular data, but in the case of NLP it's much harder to write down assumptions about the data and enforce them. Read more
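To illustrate the kind of assumptions one can still enforce on raw text (a minimal sketch; the function name and thresholds are hypothetical, not the article's code):

```python
def validate_text_batch(texts, min_len=1, max_len=10_000, max_empty_frac=0.01):
    """Basic sanity checks on a batch of raw documents before inference."""
    assert len(texts) > 0, "empty batch"
    assert all(isinstance(t, str) for t in texts), "non-string input"
    # tolerate a small fraction of empty documents
    empty = sum(1 for t in texts if not t.strip())
    assert empty / len(texts) <= max_empty_frac, "too many empty documents"
    # non-empty documents must fall within expected length bounds
    assert all(min_len <= len(t) <= max_len for t in texts if t.strip()), \
        "document length out of range"
    return True

assert validate_text_batch(["A valid document.", "Another one."])
```

Checks like these catch gross pipeline breakage (wrong types, truncation, empty payloads), but not subtler distributional shifts in the text itself.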

January 21, 2020

Find label issues with confident learning for NLP

In every machine learning project, the training data is the most valuable part of your system, and in many real-world projects the largest gains in performance come from improving training data quality. Training data is often hard to acquire, and since datasets can be large, their quality can be hard to check. In this article I introduce you to a method to find potentially erroneously labeled examples in your training data. Read more
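The core idea of confident learning can be sketched in a few lines (a simplified NumPy illustration, not the article's implementation): estimate a per-class confidence threshold from out-of-sample predicted probabilities, then flag examples whose given label disagrees with the class the model is confident about.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Flag likely label errors: an example is suspect when some class
    other than its given label clears that class's self-confidence
    threshold (the mean predicted probability over examples carrying
    that label)."""
    labels = np.asarray(labels)
    n, k = pred_probs.shape
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    issues = []
    for i in range(n):
        confident = [j for j in range(k) if pred_probs[i, j] >= thresholds[j]]
        if confident:
            suggested = max(confident, key=lambda j: pred_probs[i, j])
            if suggested != labels[i]:
                issues.append(i)
    return issues

labels = [0, 0, 1, 1]
pred_probs = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8],
                       [0.9, 0.1]])  # labeled 1, but confidently predicted 0
assert find_label_issues(labels, pred_probs) == [3]
```

For real use, `pred_probs` must come from cross-validated (out-of-sample) predictions, otherwise the model's memorization of noisy labels hides the very errors you want to find.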


© depends-on-the-definition 2017-2020