# The missing guide on data preparation for language modeling

Language models have gained popularity in NLP in recent years. Models trained on large corpora of text are often adapted to a custom dataset by resuming training on the new data. Sometimes you have enough data and want to train a language model like BERT or RoBERTa from scratch. Python libraries like Hugging Face transformers make this quite easy. While there are many tutorials about tokenization and about training the model, there is little information about how to load the data into the model. This guide aims to close that gap.

## Best practices from research

To understand what we want to do, let's look at two popular research papers on language modelling with transformer models.

### BERT

In the paper, the authors first define how they structure their text:

> Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence.

To generate each training input sequence, they sample a span of text from the corpus, which they refer to as a “sentence” even though it is typically much longer than a single sentence. They also recommend “to use a document-level corpus rather than a shuffled sentence-level corpus (such as the Billion Word Benchmark [Chelba et al., 2013]) in order to extract long contiguous sequences”.
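This span sampling can be illustrated with a small sketch. The function name and the whitespace "tokenizer" below are my own toy choices; a real pipeline operates on subword token ids from a document-level corpus.

```python
import random

def sample_span(document_tokens, max_len, rng):
    """Sample one contiguous span of up to max_len tokens from a document.

    Toy illustration of BERT-style span sampling: the span is a slice of
    contiguous text, not a linguistic sentence.
    """
    if len(document_tokens) <= max_len:
        return list(document_tokens)
    start = rng.randrange(len(document_tokens) - max_len + 1)
    return document_tokens[start : start + max_len]

doc = "a document level corpus yields long contiguous spans of text".split()
span = sample_span(doc, max_len=4, rng=random.Random(0))
print(span)  # a 4-token contiguous slice of the document
```

The key point is that each training example is a contiguous slice of one document, which is only possible if document boundaries survive preprocessing.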

### RoBERTa

First, they notice:

> We find that using individual sentences hurts performance on downstream tasks.

They run extensive experiments with different approaches, and the best approach is described as follows:

> Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents.

These findings suggest that

• full 512-token text segments should be used.
• restricting sampling to one document at a time makes little difference compared to sampling contiguous text across document boundaries.
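The packing scheme quoted above can be sketched in a few lines. This is my own toy rendition over whitespace tokens, not RoBERTa's actual preprocessing code; a real implementation works on subword token ids and reserves room for the sequence-level special tokens.

```python
def pack_full_sentences(documents, max_tokens, sep="</s>"):
    """Greedily pack sentences into inputs of at most max_tokens tokens,
    crossing document boundaries with an extra separator token.

    documents: list of documents, each a list of sentence strings.
    Returns a list of token lists (toy whitespace tokens).
    """
    inputs, current = [], []
    for d, doc in enumerate(documents):
        for sent in doc:
            tokens = sent.split()
            # start a new input if this sentence would overflow the current one
            if current and len(current) + len(tokens) > max_tokens:
                inputs.append(current)
                current = []
            current.extend(tokens)
        # document boundary: insert a separator if more documents follow
        if d < len(documents) - 1 and current:
            current.append(sep)
    if current:
        inputs.append(current)
    return inputs

docs = [["a b c", "d e"], ["f g h i"]]
print(pack_full_sentences(docs, max_tokens=6))
# [['a', 'b', 'c', 'd', 'e', '</s>'], ['f', 'g', 'h', 'i']]
```

Note how the first input mixes sentences from both sides of a document boundary, joined by the separator, instead of padding out each document separately.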

Now let's find out how to achieve this with the current Hugging Face transformers library by looking at the source code. The language modeling example offers two configurations: the default, which uses the TextDataset class for data ingestion, and the --line_by_line flag, which uses the LineByLineTextDataset class.

• TextDataset: reads the full input text, tokenizes it, and cuts it into block_size chunks, then adds the special tokens (e.g. `<s>`/`</s>` or `[CLS]`/`[SEP]`, depending on the tokenizer):

```python
for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
    self.examples.append(
        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
    )
```
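The chunking loop can be reproduced in isolation to see exactly what it keeps and drops. Integer ids stand in for real tokenizer output here, and the special-token step is omitted:

```python
# Toy reproduction of the TextDataset chunking loop with fake token ids.
tokenized_text = list(range(10))  # pretend these are 10 token ids
block_size = 4

examples = []
for i in range(0, len(tokenized_text) - block_size + 1, block_size):
    examples.append(tokenized_text[i : i + block_size])

print(examples)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Note that the trailing tokens 8 and 9 are dropped: only the final partial block is lost, so on a large corpus the waste is negligible.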

• LineByLineTextDataset: reads each line separately, tokenizes it, and truncates each line to block_size, then adds the special tokens:

```python
with open(file_path, encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
```
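A small sketch makes the failure mode of line-by-line ingestion concrete. Using a toy whitespace "tokenizer" in place of a real one: every line becomes exactly one example, and anything on a long line beyond block_size tokens is silently discarded.

```python
# Sketch of line-by-line truncation with a toy whitespace tokenizer.
block_size = 4
lines = [
    "short line",
    "this line is much longer than the block size and gets cut",
]

examples = [line.split()[:block_size] for line in lines]
for ex in examples:
    print(len(ex), ex)
```

The second line keeps only 4 of its 12 tokens, while the first example wastes half of its capacity with padding. This is why your lines must already be close to block_size tokens for this mode to work well.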


## Conclusions

• don’t think in terms of sentences.
• use the default TextDataset, because --line_by_line will throw away a lot of data if not used correctly.
• if you do use --line_by_line, be aware of what it does and structure your data accordingly.
• things are constantly changing, and other libraries might implement different approaches.

That’s it. Let me know what you think and if this guide matches your experience. Cheers.