Posts

February 19, 2023

Causal graphs and the back-door criterion - A practical test on deconfounding

I recently read up on causal inference, and since I didn’t have a real use case for it yet, I played around with some data and some causal graphs. In this article, I looked at some causal graphs from Chapter 4 of “The Book of Why” by Judea Pearl and Dana Mackenzie and created simulated data based on them.
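
As a rough illustration of what deconfounding with the back-door criterion looks like on simulated data, here is a minimal sketch. The graph (a single binary confounder `z` that influences both the treatment `x` and the outcome `y`), the probabilities, and the helper `mean_y` are all hypothetical choices for this example. They are not taken from the book or from the article.

```python
import random

random.seed(0)
n = 100_000

# Hypothetical graph: z -> x, z -> y, x -> y (z is a confounder).
rows = []
for _ in range(n):
    z = random.random() < 0.5                    # confounder
    x = random.random() < (0.8 if z else 0.2)    # treatment depends on z
    p_y = 0.1 + 0.2 * x + 0.5 * z                # true effect of x on y is +0.2
    y = random.random() < p_y
    rows.append((z, x, y))


def mean_y(rows, x, z=None):
    """Empirical P(y=1 | x) or, if z is given, P(y=1 | x, z)."""
    selected = [r[2] for r in rows if r[1] == x and (z is None or r[0] == z)]
    return sum(selected) / len(selected)


# Naive estimate: compare treated and untreated directly (confounded by z).
naive = mean_y(rows, True) - mean_y(rows, False)

# Back-door adjustment: average the stratum-wise effect over the distribution of z.
p_z = {z: sum(1 for r in rows if r[0] == z) / n for z in (True, False)}
adjusted = sum(
    p_z[z] * (mean_y(rows, True, z) - mean_y(rows, False, z))
    for z in (True, False)
)

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

Under these assumptions the naive difference is inflated by the confounder (about 0.5 in expectation), while the adjusted estimate should land close to the true effect of 0.2.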

#XAI | #Algorithms | #from-scratch

July 19, 2022

How to calculate Shapley values from scratch

The Shapley value is a popular and useful method for explaining machine learning models. The Shapley value of a feature is the average marginal contribution of that feature’s value to the prediction, taken over all possible feature coalitions. In this article I’ll show you how to compute Shapley values from scratch.
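
To make the definition above concrete, here is a minimal sketch of an exact Shapley computation in plain Python. The toy `model`, the `instance`, and the `baseline` are hypothetical, and replacing absent features with a fixed baseline value is an assumption made for this example. The article itself may handle missing features differently.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a coalition value function with n_features players."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition S of this size: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when added to coalition S.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one interaction term.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 3.0]   # the prediction we want to explain
baseline = [0.0, 0.0, 0.0]   # assumed reference values for "absent" features


def value_fn(coalition):
    # Features in the coalition take the instance value, all others the baseline.
    x = [instance[j] if j in coalition else baseline[j] for j in range(len(instance))]
    return model(x)


phi = shapley_values(value_fn, len(instance))
print(phi)  # approximately [3.5, 2.0, 1.5]
```

The values sum to `model(instance) - model(baseline)`, and the interaction term `x[0] * x[2]` is split evenly between features 0 and 2. The loop visits every coalition, so the cost grows exponentially with the number of features.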

#NLP | #shorts

May 12, 2022

How to add new tokens to huggingface transformers vocabulary

In this short article, you’ll learn how to add new tokens to the vocabulary of a huggingface transformer model.

#engineering | #shorts

April 20, 2022

How to test error messages with pytest

In this short article, you will learn how and when to test the error message of an exception with pytest.

#NLP | #Algorithms

May 24, 2021

Learning unsupervised embeddings for textual similarity with transformers

In this article, we look at SimCSE, a simple contrastive sentence embedding framework that can produce superior sentence embeddings from either unlabeled or labeled data. The idea behind unsupervised SimCSE is to simply predict the input sentence itself, with only dropout used as noise.

#NLP | #engineering

September 25, 2020

The missing guide on data preparation for language modeling

Language models have gained popularity in NLP in recent years. Sometimes you might have enough data and want to train a language model like BERT or RoBERTa from scratch. While there are many tutorials about tokenization and about how to train the model, there is not much information about how to load the data into the model. This guide aims to close this gap.

#Named entity recognition | #NLP | #machine learning

August 23, 2020

Data augmentation with transformer models for named entity recognition

In this article we sample from pre-trained transformers to augment small, labeled text datasets for named entity recognition.

#NLP | #machine learning

July 7, 2020

How to approach almost any real-world NLP problem

This time, I’m going to talk about how to approach general NLP problems. But we’re not going to look at the standard tips that are tossed around on the internet, for example on platforms like Kaggle.

#data quality | #engineering | #NLP | #machine learning

June 3, 2020

Data validation for NLP applications with topic models

In a recent article, we saw how to implement a basic validation pipeline for text data. Once a machine learning model has been deployed, its behavior must be monitored. The predictive performance is expected to degrade over time as the environment changes.

#NLP | #Algorithms | #from-scratch

May 20, 2020

Latent Dirichlet allocation from scratch

Today, I’m going to talk about topic models in NLP. Specifically, we will see how the Latent Dirichlet Allocation model works, and we will implement it from scratch in NumPy. What is a topic model? Assume we are given a large collection of documents.

#data quality | #engineering | #NLP | #machine learning

January 30, 2020

Data validation for NLP machine learning applications

An important part of machine learning applications is making sure that there is no data degeneration while a model is in production. Sometimes downstream data processing changes, and machine learning models are very prone to silent failure because of this.