Fine-tuning vs. training spaCy NER from scratch #9233
-
I am having difficulty understanding whether my model is being fine-tuned or trained from scratch. My objective is to fine-tune an NER model on my data with the PER, GPE, and ORG labels. My config is as follows:
My data consisted of 12 texts, which were initially in the following format after annotation:
This was then converted into spaCy's binary format like this:
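For reference, a common way to build that binary training file uses `DocBin`, assuming the annotations are `(text, {"entities": [(start, end, label)]})` tuples with character offsets. The sample sentence and file name below are illustrative, not from the original post:

```python
import spacy
from spacy.tokens import DocBin

# Illustrative annotations in the common (text, {"entities": ...}) format;
# offsets are character-based (start, end, label)
TRAIN_DATA = [
    ("Apple was founded by Steve Jobs in California.",
     {"entities": [(0, 5, "ORG"), (21, 31, "PER"), (35, 45, "GPE")]}),
]

nlp = spacy.blank("en")  # blank pipeline, used only for tokenization
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip spans that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```

Note that `char_span` returns `None` for offsets that don't line up with token boundaries, which silently drops those entities; it's worth logging such cases when converting real data.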
Then, to fine-tune, I ran the following in my terminal:
This brings up several questions on my part:
Replies: 2 comments 1 reply
-
In most cases, if a component has a `factory`, it's being trained from scratch. If it has a `source` instead, it's being loaded from an existing pipeline. This is a little confusing with transformers, since even with a `factory` they load a pretrained transformer model, but it's true for most other components.

You have a `factory`, so the NER model is being trained from scratch here. To resume training, change the contents of this block to `source = "en_core_web_trf"`, and remove the other `components.ner` blocks.

However, note that in general, re-training models like this is tricky due to catastrophic forgetting. You'll typically get better performance by training from a full dataset.
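Concretely, the difference between the two kinds of component block looks like this in a spaCy `config.cfg` (a minimal sketch; the surrounding config sections are assumed):

```ini
# Trained from scratch: the component is created by its factory
# [components.ner]
# factory = "ner"

# Resuming from a pretrained pipeline: the component and its weights
# are loaded from an installed package
[components.ner]
source = "en_core_web_trf"
```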
-
Hello. I'm facing an issue related to the use of the
To fine-tune, I run this in a Jupyter notebook:
However, it does not work properly. This is the output message I get:
I have been searching for information about this. I'm using Python 3.12.9. Also, these are the versions of the different packages I'm using in my virtual environment:
I have created my training dataset in the same way the user who opened this thread did, but the entities I have in the dataset are
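When debugging environment issues like this, it can help to report the interpreter and package versions programmatically rather than from memory. A small sketch; the package list is illustrative and should be adjusted to your environment:

```python
import sys
import importlib.metadata as md

# Confirm the interpreter version (e.g. Python 3.12.x)
print(sys.version)

# Report installed versions of packages relevant to the pipeline;
# the names here are illustrative, not the poster's actual list
for pkg in ("spacy", "thinc", "numpy"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```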