# Data cleaning

Making datasets less noisy to improve the quality of translation.
## Regular pipeline
Config setting:
```
use-opuscleaner: false
```
### Dataset fixing
Some datasets require fixes like detokenization.
Dataset and language specific fixes are implemented in [pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language-specific cleaning of a parallel or monolingual dataset
- `/` in a dataset name should be replaced with `_`
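
For example, a detokenization fix for a hypothetical dataset `my_corpus/v1` could look like the sketch below, assuming the fix scripts read the raw text on stdin and write the fixed text to stdout (check the existing scripts in the folder for the exact interface):

```
#!/bin/bash
# Hypothetical fix script: pipeline/clean/fixes/my_corpus_v1.en.sh
# ("/" in the dataset name replaced with "_", ".en" marks an English-specific fix).
# Assumed interface: raw text on stdin, fixed text on stdout.
set -euo pipefail

# Remove the spaces a tokenizer left before punctuation.
sed -e 's/ \([,.!?;:]\)/\1/g'
```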
### Cleaning scripts
Make sure the language is present in the [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.
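
The script keeps per-language settings near the top of the file, so adding a new language typically means extending that structure. A hypothetical sketch (the names and exact structure here are illustrative only; check the script itself):

```
# Hypothetical sketch: per-language character classes used by the cleaning
# rules to decide whether a sentence looks like the expected language.
CHARS = {
    "ru": r"[а-яА-ЯёЁ]",
    "ca": r"[a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ·]",  # newly added language
}
```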
### Bicleaner
It is recommended to use Bicleaner ML models to filter noisy data.
See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md#bicleaner).
## OpusCleaner
Another option is to use the all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by the HPLT project.
Config setting:
```
use-opuscleaner: true
```
## Custom filter configs
The idea behind OpusCleaner is to customize the filter rules for each language pair and dataset in order to get a training corpus with less noise and train higher-quality translation models.
Filtering rules can be tuned in an interactive UI.
### Installation
Install the OpusCleaner UI on a server.
See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).
For local usage, run `make opuscleaner-ui` from a poetry shell.
Then go to `http://0.0.0.0:8000`.
### Making filters
Choose a language pair and download the required OPUS datasets.
They will correspond to `opus_...` training datasets in the training pipeline config (for example, `ParaCrawl/v8` in OpusCleaner corresponds to `opus_ParaCrawl/v8` in the config).
Configure cleaning rules for the datasets in the UI.

Copy the JSON files for the produced filters (`data/train-parts/*.filter.json`) to the corresponding directory under `pipeline/clean/opuscleaner/configs/`. If no custom filter config is specified for a dataset, the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.
Modify it if needed. Some rules require specifying the source or target language.
The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
The generated default config will be copied to the target dataset cleaning directory.
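
For illustration, the substitution amounts to something like the following (the output filename is hypothetical; the pipeline performs the equivalent step internally):

```
# Render the default filter template for a hypothetical en->de pair.
src=en
trg=de
sed -e "s/<src>/${src}/g" -e "s/<trg>/${trg}/g" \
    pipeline/clean/opuscleaner/configs/default.filters.json \
    > "default.${src}-${trg}.filters.json"
```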
### Running
Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script.

This section includes instructions on how to find and configure datasets and cleaning procedures.
# Dataset importers
Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml).

Example:
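
The snippet below is a sketch; the dataset names are illustrative, combining an importer prefix from the table that follows with a dataset name:

```
datasets:
  train:
    - opus_ParaCrawl/v8
    - mtdata_neulab_tedtalksv1_train
  test:
    - sacrebleu_wmt20
```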

Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | --- | ---
Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"
You can also use the [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted for use in the config.
Set up a local [poetry](https://python-poetry.org/) environment.
```
python utils/find-corpus.py en ru sacrebleu
```
Make sure to check licenses of the datasets before using them.
## Adding a new importer
Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) that is named `<prefix>.sh`
and accepts the same parameters as the other scripts from the same folder.
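
A skeleton of what such a script might look like (the parameter list and URL are assumptions for illustration; mirror the arguments of the existing scripts in the folder):

```
#!/bin/bash
# Hypothetical importer: pipeline/data/importers/mono/<prefix>.sh
# The positional parameters are illustrative; copy the signature of the
# existing importers in the same folder.
set -euo pipefail

dataset=$1        # dataset name as written in the config (without the prefix)
lang=$2           # language code
output_prefix=$3  # output path prefix for the downloaded data

# Download and store compressed, as the other importers do (assumption).
wget -qO- "https://example.com/${dataset}.${lang}.txt.gz" > "${output_prefix}.${lang}.gz"
```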
## Dataset cleaning
Some parallel datasets require more aggressive filtering.
Dataset-specific Bicleaner thresholds can be set in the config.
`0` means skipping filtering entirely (useful for Paracrawl).
Example:
```
experiment:
  ...
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds:
      opus_ParaCrawl/v8: 0
      mtdata_neulab_tedtalksv1_train: 0.6
```

### OpusCleaner

Another option is to use the all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by the HPLT project. See more details in the [dedicated doc](opus-cleaner.md).

---
description: "Firefox Translations Training documentation."
permalink: /
---
# Firefox Translations training
Training pipelines for Firefox Translations machine translation models.
The trained models are hosted in the [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and
power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of the [Bergamot](https://browser.mt/) project, which focuses on improving client-side machine translation in a web browser.
## Training pipeline
The pipeline is capable of training a translation model for a language pair end to end.
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
Some settings, especially for low-resource languages, might require extra tuning.

We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.
## Learning resources
- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
- [Model training guide](training-guide.md) - practical advice on how to use the pipeline
- [Reference papers](references.md)
## Acknowledgements
This project uses materials developed by:
- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]

# Pipeline steps

Step | Description | Bottleneck | Comments
--- | --- | --- | ---
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size; sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset; see [Dataset cleaning](#dataset-cleaning).
Merge and dedupe | Merges clean datasets and applies deduplication | CPU, Disk |
Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates a mono corpus combined from monolingual datasets in the target language using the shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
Training teacher | Trains an ensemble of big transformer models on the augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on dataset size.
Fine-tuning teacher | Continues training the ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on dataset size.
Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode.
Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization.

---
layout: default
title: References
nav_order: 8
---
# References
Here is a list of selected publications on which the training pipeline is based.
3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2
4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020)
5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019)
14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL.
15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016)