Training

The following methods for training a spaCy pipeline are based on the spaCy documentation.

Training from command line

A spaCy pipeline can be trained via the command line using a config.cfg file. The config.cfg can be generated in one of two ways:

init method: python -m spacy init config config.cfg [options]
manually: You can use spaCy's quickstart and init fill-config

After the appropriate config.cfg is generated (see here for an example config), run the following:

python -m spacy train config.cfg --output ./output/path [options]

If the paths to the train and dev data are not included in the config.cfg, inclue the tags --paths.train ./train --paths.dev ./dev.

This will automatically launch a training sequence for the specified number of epochs and settings in the config. The output directory will be populated with a best and last model.

Component Options

You can use the included spaCy factory components or add custom components with nlp.add_pipe(). When using the factory components the user can specify which model to use. These models are Thinc models used by spaCy. Custom components can also make use of these models or can include custom coded models. This code can be attached to the trained pipeline when packaged.

Built-in components are (spaCy):

DependencyParser
EntityLinker
EntityRecognizer
Morphologizer
SentenceRecognizer
SpanCategorizer
Tagger
TextCategorizer
Tok2Vec

These components can be trained with custom architecture or using the default architectures by spaCy. When using the default architectures, using the config quickstart will autofill for this option.

Config options

For further details on the options available to the config.cfg see the spaCy docs.

Training from script

Language.to_disk() after Language.add_pipe(), create_pipe() deprecated.

Adding custom components

https://spacy.io/api/language#component

@Language.component("my_component")
def my_component(doc):
    # do soemthing to doc
    return doc

The new my_component can now be added to the pipeline via Language.add_pipe("my_component"). (Language.create_pipe() is now deprecated for user use).

Saving and loading the pipeline

spacy train

Specify output directory, run spacy package /path/to/pipeline /path/to/output.

Language.to_disK()

After modifying Language object, run Language.to_disk("/path/to/pipeline") then spacy package /pipeline /output.