Training

The following methods for training a spaCy pipeline are based on the spaCy documentation.

Training from command line

A spaCy pipeline can be trained via the command line using a config.cfg file. The config.cfg can be generated in one of two ways:

  • init method: python -m spacy init config config.cfg [options]
  • manually: build a base config with spaCy's quickstart, then complete it with python -m spacy init fill-config base_config.cfg config.cfg

After the appropriate config.cfg is generated (see here for an example config), run the following:

python -m spacy train config.cfg --output ./output/path [options]

If the paths to the train and dev data are not included in the config.cfg, include the overrides --paths.train ./train --paths.dev ./dev.

This launches a training run using the number of epochs and the other settings specified in the config. The output directory will be populated with the best-scoring pipeline and the last saved pipeline (model-best and model-last).
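The same run can also be started from Python instead of the shell. The following is a minimal sketch, assuming spaCy v3.2 or later (where spacy.cli.train.train can be imported). The wrapper name run_training and the file names ./train.spacy and ./dev.spacy are hypothetical placeholders.

```python
from spacy.cli.train import train


def run_training(config_path="config.cfg", output_path="./output/path"):
    """Start a training run comparable to `python -m spacy train`."""
    # The overrides dict mirrors the --paths.train / --paths.dev CLI overrides.
    # Both .spacy file names are placeholders, not files shipped with this project.
    train(
        config_path,
        output_path,
        overrides={
            "paths.train": "./train.spacy",
            "paths.dev": "./dev.spacy",
        },
    )
```

Calling run_training() only works once config.cfg and both data files exist on disk.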

Component Options

You can use the included spaCy factory components or add custom components with nlp.add_pipe(). When using the factory components the user can specify which model to use. These models are Thinc models used by spaCy. Custom components can also make use of these models or can include custom coded models. This code can be attached to the trained pipeline when packaged.

spaCy's built-in components include:

  • DependencyParser
  • EntityLinker
  • EntityRecognizer
  • Morphologizer
  • SentenceRecognizer
  • SpanCategorizer
  • Tagger
  • TextCategorizer
  • Tok2Vec

These components can be trained with a custom architecture or with spaCy's default architectures. The config quickstart fills in the default architectures automatically.
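To check which default architecture a factory component receives, the component's filled-in config can be inspected after it has been added. A small sketch, assuming spaCy v3; the blank English pipeline and the "tagger" component are only illustrative choices.

```python
import spacy

# Start from an empty English pipeline and add a built-in factory component
nlp = spacy.blank("en")
nlp.add_pipe("tagger")

# The config used to create the component includes the model settings
tagger_config = nlp.get_pipe_config("tagger")
print(tagger_config["model"]["@architectures"])
```

A custom architecture would be selected by changing the corresponding [components.tagger.model] block in the config.cfg.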

Config options

For further details on the options available in the config.cfg, see the spaCy docs.

Training from script

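When building a pipeline in a script, add components with Language.add_pipe() and then save the result with Language.to_disk(). Language.create_pipe() is deprecated for direct use.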
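A script-based training loop could look roughly like the sketch below. It assumes spaCy v3 and uses a text classifier with four invented toy sentences; the data, the labels, the number of epochs, and the batch size are all illustrative and not taken from this project.

```python
import random
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Toy data, invented for illustration only
TRAIN_DATA = [
    ("I loved this film", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("What a great day", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hated this film", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("What a terrible day", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

nlp = spacy.blank("en")
nlp.add_pipe("textcat")

# Pair each unannotated doc with its reference annotations
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() reads the labels from the examples and returns an optimizer
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)

# Save the trained pipeline (a temporary directory keeps the sketch self-contained)
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(Path(tmp_dir) / "pipeline")
```

With so little data the resulting model is not meaningful; the point is only the order of calls: add_pipe, initialize, update, to_disk.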

Adding custom components

https://spacy.io/api/language#component

from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # do something to doc
    return doc

The new my_component can now be added to the pipeline via Language.add_pipe("my_component"). (Language.create_pipe() is now deprecated for direct use.)
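As a fuller illustration, the sketch below registers a component, adds it to a blank pipeline, and runs it on a text. It assumes spaCy v3; the component name token_counter and the custom attribute n_tokens are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute that stores the component's result
Doc.set_extension("n_tokens", default=None, force=True)


@Language.component("token_counter")
def token_counter(doc):
    # Record the number of tokens on the doc, then pass the doc along
    doc._.n_tokens = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")

doc = nlp("Hello world")
print(nlp.pipe_names)
print(doc._.n_tokens)
```

A function component like this has no trainable weights, so it needs no training step before it can be used.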

Saving and loading the pipeline

spacy train

Specify an output directory when training, then run spacy package /path/to/pipeline /path/to/output.

Language.to_disk()

After modifying the Language object, run Language.to_disk("/path/to/pipeline"), then spacy package /path/to/pipeline /path/to/output.
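A saved pipeline directory can be read back with spacy.load(). The round trip below is a minimal sketch, assuming spaCy v3; the sentencizer component and the temporary directory stand in for a real pipeline and a real path.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_dir = Path(tmp_dir) / "pipeline"
    # Write the config, vocab, and component data to disk
    nlp.to_disk(pipeline_dir)
    # Rebuild the pipeline from the saved directory
    reloaded = spacy.load(pipeline_dir)

doc = reloaded("First sentence. Second sentence.")
sentences = [sent.text for sent in doc.sents]
print(reloaded.pipe_names)
print(sentences)
```

Pipelines that contain custom components can only be loaded when the code registering those components has been imported first.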