September 25, 2020

The missing guide on data preparation for language modeling

Language models have gained popularity in NLP in recent years. Sometimes you might have enough data and want to train a language model like BERT or RoBERTa from scratch. While there are many tutorials on tokenization and on how to train the model, there is not much information about how to load the data into the model. This guide aims to close this gap.

May 20, 2020

Latent Dirichlet allocation from scratch

Today, I’m going to talk about topic models in NLP. Specifically, we will see how the Latent Dirichlet Allocation model works and implement it from scratch in NumPy. What is a topic model? Assume we are given a large collection of documents.

December 10, 2019

How the LIME algorithm fails

You may know the LIME algorithm from some of my earlier blog posts. It can be quite useful for “debugging” data sets and understanding machine learning models better. But LIME is fooled very easily.

November 24, 2019

Cluster discovery in German recipes

If you are dealing with a large collection of documents, you will often find yourself looking for some structure and trying to understand what the documents contain. Here I’ll show you a convenient method for discovering and understanding clusters of text documents.


© depends-on-the-definition 2017-2020