# Latent Dirichlet allocation from scratch

Today, I’m going to talk about topic models in NLP. Specifically we will see how the Latent Dirichlet Allocation model works and we will implement it from scratch in numpy. What is a topic model? Assume we are given a large collections of documents. Each of these documents can contain text of one or more topics. The goal of a topic model is to infer the topic distribution of each of the documents.

# Find label issues with confident learning for NLP

In every machine learning project, the training data is the most valuable part of your system. In many real-world machine learning projects the largest gains in performance come from improving training data quality. Training data is often hard to aquire and since the data can be large, quality can be hard to check. In this article I introduce you to a method to find potentially errorously labeled examples in your training data.

# How the LIME algorithm fails

You maybe know the LIME algorithm from some of my earlier blog posts. It can be quite useful to “debug” data sets and understand machine learning models better. But LIME is fooled very easily.

# Cluster discovery in german recipes

If you are dealing with a large collections of documents, you will often find yourself in the situation where you are looking for some structure and understanding what is contained in the documents. Here I’ll show you a convenient method for discovering and understanding clusters of text documents. The method also works well for non-text features, where you can use it to understand the importance of certain features for the cluster.

# Model uncertainty in deep learning with Monte Carlo dropout in keras

Deep learning models have shown amazing performance in a lot of fields such as autonomous driving, manufacturing, and medicine, to name a few. However, these are fields in which representing model uncertainty is of crucial importance. The standard deep learning tools for regression and classification do not capture model uncertainty. In classification, predictive probabilities obtained at the end of the pipeline (the softmax output) are often erroneously interpreted as model confidence.

# Introduction to n-gram language models

You might have heard, that neural language models power a lot of the recent advances in natural language processing. Namely large models like Bert and GPT-2. But there is a fairly old approach to language modeling that is quite successful in a way. I always wanted to play with the, so called n-gram language models. So here’s a post about them. What are n-gram language models? Models that assign probabilities to sequences of words are called language models or LMs.

# How to use magnitude with keras

This time we have a look into the magnitude library, a feature-packed Python package and vector storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner developed by Plasticity. We want to utilize the embeddings magnitude provides and use them in keras. Vector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications.

# Understanding text data with topic models

This is the first post of my series about understanding text datasets. A lot of the current NLP progress is made in predictive performance. But in practice, you often want and need to know, what is going on in your dataset. You may have labels that are generated from external sources and you have to understand how they relate to your text samples. You need to understand potential sources of leakage.

# Debugging black-box text classifiers with LIME

Often in text classification, we use so called black-box classifiers. By black-box classifiers I mean a classification system where the internal workings are completely hidden from you. A famous example are deep neural nets, in text classification often recurrent or convolutional neural nets. But also linear models with a bag of words representation can be considered black-box classifiers, because nobody can fully make sense of thousands of features contributing to a prediction.

# Explain neural networks with keras and eli5

In this post, I’m going to show you how you can use a neural network from keras with the LIME algorithm implemented in the eli5 TextExplainer class. For this we will write a scikit-learn compatible wrapper for a keras bidirectional LSTM model. The wrapper will also handle the tokenization and the storage of the vocabulary.

# Getting started with Multivariate Adaptive Regression Splines

In this post we will introduce multivariate adaptive regression splines model (MARS) using python. This is a regression model that can be seen as a non-parametric extension of the standard linear model. The multivariate adaptive regression splines model MARS builds a model of the from $$f(x) = \sum_{i=0}^k c_i B_i(x_i),$$ where $x$ is a sample vector, $B_i$ is a function from a set of basis functions (later called terms) and $c_i$ the associated coefficient.