August 23, 2020

Data augmentation with transformer models for named entity recognition

Pre-trained language models such as BERT have provided significant gains across different NLP tasks. For many NLP tasks, labeled training data is scarce and acquiring it is an expensive and demanding task. Data augmentation can help increase data efficiency by artificially perturbing the labeled training samples to increase the absolute number of available data points. In NLP this is commonly achieved by replacing words with synonyms based on dictionaries, or by translating to a different language and back. Read more
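As a minimal sketch of the dictionary-based synonym replacement mentioned above (the toy `SYNONYMS` table and the `augment` helper are illustrative, not code from the post):

```python
import random

# Toy synonym dictionary; a real setup might use e.g. WordNet instead (assumption).
SYNONYMS = {
    "scarce": ["rare", "limited"],
    "expensive": ["costly"],
}

def augment(tokens, synonyms, p=0.5, seed=0):
    """Replace each token that has a known synonym with probability p."""
    rng = random.Random(seed)
    return [
        rng.choice(synonyms[tok]) if tok in synonyms and rng.random() < p else tok
        for tok in tokens
    ]

sentence = "labeled data is scarce and expensive".split()
augmented = augment(sentence, SYNONYMS, p=1.0)
```

For a token-level task like NER, replacements must preserve the per-token labels, so single-word substitutions are the safest perturbation.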

July 7, 2020

How to approach almost any real-world NLP problem

This time, I’m going to talk about how to approach general NLP problems. But we’re not going to look at the standard tips that are tossed around on the internet, for example on platforms like Kaggle. Instead, we will focus on how to approach NLP problems in the real world. A lot of what is mentioned here also applies to machine learning projects in general, but we will look at everything from the perspective of natural language processing and some of the problems that arise there. Read more

June 3, 2020

Data validation for NLP applications with topic models

In a recent article, we saw how to implement a basic validation pipeline for text data. Once a machine learning model has been deployed, its behavior must be monitored. The predictive performance is expected to degrade over time as the environment changes. This is known as concept drift: it occurs when the distributions of the input features shift away from the distribution on which the model was originally trained. (Figure: machine learning pipeline with validation.) Read more
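One hedged way to quantify such a shift is to compare the topic distribution of incoming data against the training-time distribution, for example with a KL divergence (the helper and the alert threshold below are illustrative assumptions, not the article's code):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions, e.g. topic mixtures."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Topic mixture seen at training time vs. in live traffic (made-up numbers).
train_topics = [0.5, 0.3, 0.2]
live_topics = [0.2, 0.3, 0.5]
drift_score = kl_divergence(train_topics, live_topics)
# Alert when drift_score exceeds a threshold tuned on held-out data (assumption).
```

A score near zero means the live topic mixture still matches training; a large score is a drift signal worth investigating.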

January 30, 2020

Data validation for NLP machine learning applications

An important part of machine learning applications is making sure that there is no data degeneration while a model is in production. Sometimes downstream data processing changes, and machine learning models are very prone to silent failure because of this. So data validation is a crucial step of every production machine learning pipeline. The case is relatively easy for well-specified tabular data. But in the case of NLP, it’s much harder to write down assumptions about the data and enforce them. Read more
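To illustrate one assumption that can still be enforced for text, here is a sketch that rejects batches whose out-of-vocabulary rate drifts above a threshold (the function name and the threshold are made up for this example):

```python
def batch_is_valid(texts, reference_vocab, max_oov_rate=0.3):
    """Reject a batch whose out-of-vocabulary token rate exceeds the threshold."""
    tokens = [w for t in texts for w in t.lower().split()]
    if not tokens:
        return False  # an empty batch is itself a degeneration signal
    oov_rate = sum(w not in reference_vocab for w in tokens) / len(tokens)
    return oov_rate <= max_oov_rate

vocab = {"the", "model", "is", "in", "production"}
ok = batch_is_valid(["the model is in production"], vocab)
bad = batch_is_valid(["xqz frm glb vector"], vocab)
```

A sudden jump in the OOV rate often points exactly at the silent upstream change described above, e.g. a broken tokenizer or a new data source.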

January 21, 2020

Find label issues with confident learning for NLP

In every machine learning project, the training data is the most valuable part of your system. In many real-world machine learning projects, the largest gains in performance come from improving training data quality. Training data is often hard to acquire, and since datasets can be large, quality can be hard to check. In this article I introduce you to a method for finding potentially erroneously labeled examples in your training data. Read more
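A simplified sketch of the confident-learning idea: flag an example when its out-of-sample predicted probability under the assigned label falls below that class's average self-confidence while another class looks more likely. This is a toy version with made-up numbers; the real method (e.g. the cleanlab library) is more careful.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    # Per-class self-confidence threshold: mean predicted probability of
    # class j over the examples actually labeled j.
    n_classes = pred_probs.shape[1]
    thresholds = np.array([
        pred_probs[labels == j, j].mean() for j in range(n_classes)
    ])
    # Flag an example if its probability under the assigned label falls below
    # that label's threshold while the model prefers a different class.
    best = pred_probs.argmax(axis=1)
    suspicious = (
        (pred_probs[np.arange(len(labels)), labels] < thresholds[labels])
        & (best != labels)
    )
    return np.flatnonzero(suspicious)

labels = np.array([0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],  # labeled 0, but the model is confident it is class 1
    [0.1, 0.9],
    [0.8, 0.2],  # labeled 1, but the model is confident it is class 0
])
issues = find_label_issues(labels, pred_probs)
```

The two deliberately mislabeled rows are the ones flagged for review.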

November 24, 2019

Cluster discovery in german recipes

If you are dealing with a large collection of documents, you will often find yourself in the situation where you are looking for structure and an understanding of what is contained in the documents. Here I’ll show you a convenient method for discovering and understanding clusters of text documents. The method also works well for non-text features, where you can use it to understand the importance of certain features for a cluster. Read more
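As a rough sketch of what "understanding a cluster" can mean, here is a toy helper that surfaces words over-represented in one cluster relative to the whole corpus (the scoring rule and the mini recipe corpus are illustrative assumptions):

```python
from collections import Counter

def distinctive_words(cluster_docs, all_docs, top_n=3):
    """Words whose relative frequency in the cluster exceeds the corpus-wide one."""
    cluster = Counter(w for d in cluster_docs for w in d.split())
    overall = Counter(w for d in all_docs for w in d.split())
    n_c = sum(cluster.values())
    n_a = sum(overall.values())
    score = {w: cluster[w] / n_c - overall[w] / n_a for w in cluster}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_n]]

docs = [
    "flour sugar egg",
    "flour butter sugar",
    "beef onion salt",
    "beef pepper salt",
]
baking_terms = distinctive_words(docs[:2], docs)
```

For the first two (baking-like) recipes, the over-represented words are exactly the ones that characterize the cluster.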

April 14, 2019

Introduction to entity embeddings with neural networks

Since a lot of people recently asked me how neural networks learn embeddings for categorical variables, for example words, I’m going to write about it today. You all might have heard about methods like word2vec for creating dense vector representations of words in an unsupervised way. With these vectors you would initialize the first layer of a neural net for arbitrary NLP tasks and maybe fine-tune them. But the use of embeddings goes far beyond that! Read more
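The core idea can be shown without any framework: an embedding layer is just a trainable lookup table with one dense row per category, indexed by the category's integer id (the vocabulary and dimensions below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"berlin": 0, "paris": 1, "tokyo": 2}
embedding_dim = 4
# The "layer" is a matrix: row i is the learned vector for category id i.
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(words):
    """Map categorical values to their dense vectors via row lookup."""
    ids = np.array([vocab[w] for w in words])
    return embeddings[ids]

batch = embed(["paris", "tokyo"])
```

During training, gradients flow into exactly the rows that were looked up, which is how the table gets learned.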

February 12, 2019

How to use magnitude with keras

This time we take a look at the magnitude library, a feature-packed Python package and vector-storage file format developed by Plasticity for using vector embeddings in machine learning models in a fast, efficient, and simple manner. We want to take the embeddings magnitude provides and use them in keras. Vector space embedding models have become increasingly common in machine learning and have traditionally been popular for natural language processing applications. Read more

March 16, 2018

Guide to word vectors with gensim and keras

Word vectors. Today I’ll tell you what word vectors are, how you create them in Python, and finally how you can use them with neural networks in keras. For a long time, NLP methods have used a vector space model to represent words. Commonly, one-hot encoded vectors are used. This traditional, so-called Bag of Words approach is pretty successful for a lot of tasks. Recently, new methods for representing words in a vector space have been proposed and have yielded big improvements in a lot of different NLP tasks. Read more
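For contrast with dense word vectors, the traditional Bag of Words representation mentioned above can be sketched in a few lines (a toy implementation for illustration; libraries like scikit-learn provide a production version):

```python
def bag_of_words(docs):
    """Build a vocabulary and one count vector per document."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for w in doc.split():
            vec[index[w]] += 1  # one dimension per vocabulary word
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the dog sat"])
```

Each document becomes a sparse vector as long as the vocabulary, which is exactly the limitation dense word vectors address.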

January 27, 2018

Detecting Network Attacks with Isolation Forests

In this post, I will show you how to use the isolation forest algorithm to detect attacks on computer networks in Python. The term isolation means separating an instance from the rest of the instances. Since anomalies are ‘few and different’, they are more susceptible to isolation. In a data-induced random tree, partitioning of instances is repeated recursively until all instances are isolated. This random partitioning produces noticeably shorter paths for anomalies: the fewer instances of anomalies result in a smaller number of partitions, and hence shorter paths in the tree structure, and instances with distinguishable attribute values are more likely to be separated in early partitioning. Read more
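The shorter-path intuition can be demonstrated with a bare-bones random partitioning sketch (a toy stand-in for the real algorithm; in practice you would use scikit-learn's `IsolationForest`):

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Number of random axis-aligned splits needed to isolate `point`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(x[dim] for x in data)
    hi = max(x[dim] for x in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the instances on the same side of the split as `point`.
    same_side = [x for x in data if (x[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, trials=200, seed=0):
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
data = cluster + [(8.0, 8.0)]  # one obvious anomaly far from the cluster
```

Averaged over many random trees, the far-away point isolates in markedly fewer splits than a point inside the cluster, and that gap is precisely what the isolation forest anomaly score is built from.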

December 23, 2017

A strong and simple baseline to classify toxic comments on wikipedia with keras

This time we’re going to discuss a current machine learning competition on Kaggle. In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits. I will show you how to create a strong baseline using Python and keras. Read more
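As a taste of the preprocessing such a baseline needs, here is a hedged sketch of mapping comments to fixed-length integer-id sequences, the input format a keras `Embedding` layer expects (the vocabulary, `maxlen`, and function name are illustrative):

```python
def texts_to_padded_ids(texts, vocab, maxlen=6, oov_id=0):
    """Tokenize, map tokens to integer ids, and post-pad/truncate to maxlen."""
    seqs = [[vocab.get(w, oov_id) for w in t.lower().split()] for t in texts]
    return [(s + [0] * maxlen)[:maxlen] for s in seqs]

vocab = {"you": 1, "are": 2, "great": 3, "this": 4, "is": 5}
padded = texts_to_padded_ids(
    ["You are great", "this is a longer toxic comment here"], vocab
)
```

Short comments are zero-padded and long ones truncated, so every example in a batch has the same shape.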


© depends-on-the-definition 2017-2020