
Commit 2df0a3a

eu9ene, marco-c, and gregtatum authored
Update the training guide (#239)
* Update training guide
* Fix docs
* Add index file
* Remove header
* Fix docs link
* Remove tensorboard section
* Add theme
* Update navigation
* Add logo
* Use absolute links
* Fix code links
* Fix code links
* Fix link
* Clarify what config is
* Fix note for bicleaner (Co-authored-by: Marco Castelluccio <[email protected]>)
* Fix typo (Co-authored-by: Greg Tatum <[email protected]>)
* Fix link
* Fix mentioning of Marian (Co-authored-by: Greg Tatum <[email protected]>)
* Remove "my"
* Make note about snakemake more visible
* Fix phrasing
* Add link to bicleaner paper
* Add clarifications
* Add links to default training configs
* Add reference to bicleaner section
* Small fixes

Co-authored-by: Marco Castelluccio <[email protected]>
Co-authored-by: Greg Tatum <[email protected]>
1 parent cf51faa commit 2df0a3a

15 files changed: +465, -184 lines changed

Makefile (+2, -2)

@@ -119,13 +119,13 @@ dag:
 ################################################
 
 # OpusCleaner is a data cleaner for training corpus
-# More details are in docs/opus-cleaner.md
+# More details are in docs/cleaning.md
 opuscleaner-ui:
 	poetry install --only opuscleaner
 	opuscleaner-server serve --host=0.0.0.0 --port=8000
 
 # Utils to find corpus etc
-install utils:
+install-utils:
 	poetry install --only utils
 
 # Black is a code formatter for Python files. Running this command will check that

README.md (+1, -1)

@@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118.
 
 The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
 
-[Documentation](/docs)
+[Documentation](https://mozilla.github.io/firefox-translations-training/)
 
 ## Pipeline
 

docs/_config.yml (new file, +12)

@@ -0,0 +1,12 @@
+remote_theme: just-the-docs/just-the-docs
+#color_scheme: dark
+title: Firefox Translations Training
+description: Documentation for the Firefox Translations training pipelines
+heading_anchors: true
+# doesn't work
+favicon_ico: "img/logo.svg"
+# Aux links for the upper right navigation
+aux_links:
+  "GitHub":
+    - "https://github.com/mozilla/firefox-translations-training"
+

docs/cleaning.md (new file, +84)

@@ -0,0 +1,84 @@
+---
+layout: default
+title: Data cleaning
+nav_order: 5
+---
+
+# Data cleaning
+
+Making datasets less noisy to improve quality of translation.
+
+## Regular pipeline
+
+
+Config setting:
+```
+use-opuscleaner: false
+```
+
+### Dataset fixing
+
+Some datasets require fixes like detokenization.
+Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes).
+Naming convention:
+- `<dataset_name>.sh` for parallel dataset cleaning
+- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
+- `/` in dataset name should be replaced with `_`
+
+### Cleaning scripts
+
+Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.
+
+
+### Bicleaner
+
+It is recommended to use Bicleaner ML models to filter noisy data.
+See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner).
+
+
+## OpusCleaner
+
+Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by the HPLT project.
+
+Config setting:
+```
+use-opuscleaner: true
+```
+
+## Custom filter configs
+The idea behind OpusCleaner is to customize filter rules for each language pair and dataset
+to get a training corpus with less noise and train higher quality translation models.
+
+Filtering rules can be tuned in an interactive UI.
+
+### Installation
+
+Install the OpusCleaner UI on a server.
+See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).
+
+For local usage: run from a poetry shell `make opuscleaner-ui`.
+Then go to `http://0.0.0.0:8000`.
+
+### Making filters
+
+Choose a language pair and download the required OPUS datasets.
+They will correspond to `opus_...` training datasets in the training pipeline config.
+
+Configure cleaning rules for the datasets in the UI.
+
+Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
+`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/`.
+
+### Default config
+
+If no custom config was specified for the dataset,
+the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.
+
+Modify if needed. Some rules require specifying source or target language.
+The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
+The generated default config will be copied to the target dataset cleaning directory.
+
+### Running
+
+Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
+OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script.
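
For reference, the `use-opuscleaner` flag documented in the new docs/cleaning.md above belongs in the training pipeline config. A minimal sketch of how it might be set, assuming it sits under the `experiment` section of a config like configs/config.test.yml; the surrounding keys and their placement are illustrative assumptions, not part of this diff:

```yaml
# Hypothetical excerpt of a training config (e.g. configs/config.test.yml).
# Only the use-opuscleaner key and its true/false values come from the docs above;
# the other keys and the nesting are illustrative assumptions.
experiment:
  name: test-en-ru
  src: en
  trg: ru
  # switch between the regular cleaning scripts and OpusCleaner
  use-opuscleaner: true
```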

docs/data.md (+10, -39)

@@ -1,10 +1,12 @@
-# Data
+---
+layout: default
+title: Datasets
+nav_order: 4
+---
 
-This section includes instructions on how to find and configure datasets and cleaning procedures.
+# Dataset importers
 
-## Dataset importers
-
-Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml).
+Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml).
 
 Example:
 ```
@@ -25,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da
 [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
 Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"
 
-You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
+You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
 
 Set up a local [poetry](https://python-poetry.org/) environment.
 ```
@@ -36,38 +38,7 @@ python utils/find-corpus.py en ru sacrebleu
 ```
 Make sure to check licenses of the datasets before using them.
 
-### Adding a new importer
+## Adding a new importer
 
-Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `<prefix>.sh`
+Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `<prefix>.sh`
 and accepts the same parameters as the other scripts from the same folder.
-
-## Dataset fixing
-
-Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes).
-Naming convention:
-- `<dataset_name>.sh` for parallel dataset cleaning
-- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
-- `/` in dataset name should be replaced with `_`
-
-## Dataset cleaning
-Some parallel datasets require more aggressive filtering.
-Dataset specific Bicleaner thresholds can be set in config.
-`0` means skipping filtering entirely (useful for Paracrawl).
-
-Example:
-
-```
-experiment:
-  ...
-  bicleaner:
-    default-threshold: 0.5
-    dataset-thresholds:
-      opus_ParaCrawl/v8: 0
-      mtdata_neulab_tedtalksv1_train: 0.6
-```
-
-### OpusCleaner
-
-Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.
-
-See more details in the [dedicated doc](opus-cleaner.md).
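
The importer names in the table above map onto entries in the `datasets` section of the training config. A minimal sketch, assuming the section uses the train/devtest/test/mono-src keys of configs/config.test.yml; apart from the two dataset names taken from the Bicleaner example in this diff, the names are illustrative:

```yaml
# Hypothetical datasets section of a training config.
# Entries appear to follow an <importer>_<dataset> pattern; the sacrebleu and
# commoncrawl names below are examples only, not taken from this commit.
datasets:
  train:
    - opus_ParaCrawl/v8
    - mtdata_neulab_tedtalksv1_train
  devtest:
    - sacrebleu_wmt19
  test:
    - sacrebleu_wmt20
  mono-src:
    - commoncrawl_wmt16
```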

docs/development.md (+6)

@@ -1,3 +1,9 @@
+---
+layout: default
+title: Development
+nav_order: 7
+---
+
 # Development
 
 ## Architecture

docs/img/logo.svg (new file, +4; SVG preview not included)

docs/index.md (new file, +38)

@@ -0,0 +1,38 @@
+---
+layout: default
+title: Home
+nav_order: 1
+description: "Firefox Translations Training documentation."
+permalink: /
+---
+
+# Firefox Translations training
+Training pipelines for Firefox Translations machine translation models.
+
+The trained models are hosted in [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
+compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
+power the Firefox web page translation starting with version 118.
+
+The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
+
+## Training pipeline
+
+The pipeline is capable of training a translation model for a language pair end to end.
+Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
+Some settings, especially for low-resource languages, might require extra tuning.
+
+We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.
+
+## Learning resources
+
+- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
+- [Model training guide](training-guide.md) - practical advice on how to use the pipeline
+- [Reference papers](references.md)
+
+
+## Acknowledgements
+This project uses materials developed by:
+- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
+- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
+- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
+- Many other open source projects and research papers (see [References](references.md))

docs/opus-cleaner.md (deleted, -47)

This file was deleted.

docs/orchestrators.md (new file, +21)

@@ -0,0 +1,21 @@
+---
+layout: default
+title: Orchestrators
+nav_order: 6
+has_children: true
+has_toc: false
+---
+
+# Orchestrators
+
+An orchestrator is responsible for workflow management and parallelization.
+
+Supported orchestrators:
+
+- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI.
+  It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
+  [Usage instructions](task-cluster.md).
+- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster.
+  [Usage instructions](snakemake.md).
+
+Mozilla is currently switching to Taskcluster and the Snakemake workflow will be less actively maintained in the future.

docs/pipeline-steps.md (+8, -3)

@@ -1,3 +1,8 @@
+---
+layout: default
+title: Pipeline steps
+nav_order: 3
+---
 
 # Pipeline steps
 
@@ -10,14 +15,14 @@ Step | Description | Bottleneck | Comments
 --- | --- | --- | ---
 Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
 Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
-Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py).
+Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py).
 Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning).
 Merge and dedupe | Merges clean dataset and applies deduplication | CPU, Disk |
 Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU |
 Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
 Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
-Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
-Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
+Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
+Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
 Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode.
 Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive.
 Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization.

docs/references.md (+8, -1)

@@ -1,3 +1,9 @@
+---
+layout: default
+title: References
+nav_order: 8
+---
+
 # References
 
 Here is a list of selected publications on which the training pipeline is based.
@@ -15,7 +21,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020
 
 3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2
 
-
 4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020)
 
 5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019)
@@ -32,3 +37,5 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020
 14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL.
 15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016)
 16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018)
+17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022)
+18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. Rush, EMNLP 2016)

docs/snakemake.md (+7, -13)

@@ -1,3 +1,10 @@
+---
+layout: default
+title: Snakemake
+nav_order: 2
+parent: Orchestrators
+---
+
 # Snakemake
 
 This section includes the instructions on how to run the pipeline
@@ -284,16 +291,3 @@ The main directories inside `SHARED_ROOT` are:
 │ └ ru-en
 │ └ test
 │ └ clean_corpus.log
-
-
-## Utilities
-
-### Tensorboard
-
-To see training graphs run tensorboard:
-
-```
-make install-tensorboard
-make tensorboard
-```
-Then port forward 6006.
