August 23, 2020

Data augmentation with transformer models for named entity recognition

Language model based pre-trained models such as BERT have provided significant gains across different NLP tasks. For many NLP tasks, labeled training data is scarce and acquiring them is a expensive and demanding task. Data augmentation can help increasing the data efficiency by artificially perturbing the labeled training samples to increase the absolute number of available data points. In NLP this is commonly achieved by replacing words by synonyms based on dictionaries or translating to a different language and back1. Read more

July 7, 2020

How to approach almost any real-world NLP problem

This time, I’m going to talk about how to approach general NLP problems. But we’re not going to look at the standard tips which are tosed around on the internet, for example on platforms like kaggle. Instead we will focus on how to approach NLP problems in the real world. A lot of the things mentioned here do also apply to machine learning projects in general. But here we will look at everything from the perspective of natural language processing and some of the problems that arise there. Read more

June 3, 2020

Data validation for NLP applications with topic models

In a recent article, we saw how to implement a basic validation pipeline for text data. Once a machine learning model has been deployed its behavior must be monitored. The predictive performance is expected to degrade over time as the environment changes. This is known as concept drift, occurs when the distributions of the input features shift away from the distribution upon which the model was originally trained. Machine Learning pipeline with validation. Read more

May 20, 2020

Latent Dirichlet allocation from scratch

Today, I’m going to talk about topic models in NLP. Specifically we will see how the Latent Dirichlet Allocation model works and we will implement it from scratch in numpy. What is a topic model? Assume we are given a large collections of documents. Each of these documents can contain text of one or more topics. The goal of a topic model is to infer the topic distribution of each of the documents. Read more

January 30, 2020

Data validation for NLP machine learning applications

An important part of machine learning applications, is making sure that there is no data degeneration while a model is in production. Sometimes downstream data processing changes and machine learning models are very prone to silent failure due to this. So data validation is a crucial step of every production machine learning pipeline. The case is relatively easy in the case of well-specified tabular data. But in the case of NLP it’s much harder to write down assumptions about the data and enforce them. Read more

January 21, 2020

Find label issues with confident learning for NLP

In every machine learning project, the training data is the most valuable part of your system. In many real-world machine learning projects the largest gains in performance come from improving training data quality. Training data is often hard to aquire and since the data can be large, quality can be hard to check. In this article I introduce you to a method to find potentially errorously labeled examples in your training data. Read more

December 28, 2019

How explainable AI fails and what to do about it

This article heavily relys on "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead" by Cynthia Rudin and finally on some of my personal experiences. I will mainly focus on technical issues and leave out most of the governance and ethics related issues that derive from these. More about this can be found in Cynthia’s paper. Nowadays, there are a lot of people talking about and advertising the methods of “Explainable AI” (XAI). Read more

December 10, 2019

How the LIME algorithm fails

You maybe know the LIME algorithm from some of my earlier blog posts. It can be quite useful to “debug” data sets and understand machine learning models better. But LIME is fooled very easily.

November 24, 2019

Cluster discovery in german recipes

If you are dealing with a large collections of documents, you will often find yourself in the situation where you are looking for some structure and understanding what is contained in the documents. Here I’ll show you a convenient method for discovering and understanding clusters of text documents. The method also works well for non-text features, where you can use it to understand the importance of certain features for the cluster. Read more

October 15, 2019

That was PyCon DE & PyData Berlin 2019

Last weekendPyCon DE and PyData Berlin joined in Berlin for a great conference event that I was lucky to attend. The speaker line-up was great and often it was hard to choose which talk or tutorial to attend. I will give you an overview of the talks I liked and the respective material. Day 1 Apache Airflow for beginners by Varya Karpenko // material Apache Airflow is an open source project that allows you programmatically create, schedule and monitor sequences of tasks. Read more

August 5, 2019

Model uncertainty in deep learning with Monte Carlo dropout in keras

Deep learning models have shown amazing performance in a lot of fields such as autonomous driving, manufacturing, and medicine, to name a few. However, these are fields in which representing model uncertainty is of crucial importance. The standard deep learning tools for regression and classification do not capture model uncertainty. In classification, predictive probabilities obtained at the end of the pipeline (the softmax output) are often erroneously interpreted as model confidence. Read more

Privacy Imprint

© depends-on-the-definition 2017-2020