Training Data
As of spaCy v3.0, the main method for training a custom NLP pipeline is the spacy train command. This command requires data in spaCy's binary format, which is serialized with the .spacy extension. spaCy provides conversion utilities for the .conllu, .iob, and .json formats (the .json format is the one used by spaCy v2).
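For example, assuming a train.conllu file and a corpus output directory (both placeholder names), the built-in CLI can convert the data and then train from the resulting .spacy files; the converter is auto-detected from the file extension, and a dev set converted the same way plus a config.cfg are assumed:
python -m spacy convert ./train.conllu ./corpus
python -m spacy train config.cfg --output ./output --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy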
The clu-spacy library also provides utility methods for converting data and for checking the tokenization of the custom data against a blank spaCy language pipeline. Running the convert-data command on custom CoNLL and IOB data produces match files and performs some retokenization. Run the following command from within the data directory:
docker ...
When the data is not in CoNLL format, the user must build the DocBin object manually. This can be accomplished with a small amount of preprocessing code, for example (adapted from the spaCy documentation, with annotations added):
import spacy
# The DocBin object is used for serializing data for training.
from spacy.tokens import DocBin
# A blank Language object is a pipeline with no components.
nlp = spacy.blank("en")
training_data = [
("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    # Create a Doc for the text; tokenization is defined by the base language ("en")
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    # Attach the entity spans to the document and add it to the DocBin
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
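To sanity-check the serialized data, the DocBin can be loaded back from disk and the stored documents inspected. A minimal sketch, assuming the train.spacy file produced above; any blank pipeline for the same language supplies the vocab:
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en")
# Load the serialized DocBin and reconstruct the Doc objects
doc_bin = DocBin().from_disk("./train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])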