Run the experiment in Google Cloud

Note that, in total, running the experiments in the paper will cost around $250 in compute credits. You'll get $300 of compute credits when you make a GCP account.

Setup

First, create a bucket called pretrain-on-test-accuracies.
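For example, with the gcloud CLI installed and authenticated (the us-west4 location here is just a guess based on the default region mentioned below):

    # Create the bucket; us-west4 matches the default region noted below
    gsutil mb -l us-west4 gs://pretrain-on-test-accuracies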

Also consider:

  1. Adding an alert to notify you if any errors pop up during a cloud run.

  2. Increasing your quotas for GPUS_ALL_REGIONS and NVIDIA_T4_GPUS to 1 (or to q > 1 to run experiments in parallel) and SSD_TOTAL_GB to 250 * q. The default region is us-west4. Alternatively, make quota requests as error messages direct you. FYI: I couldn't get my quota past 4 GPUs.

  3. Adding a secret, HF_TOKEN, containing your Hugging Face login token, if you need to use Mistral or other models that require authorization before downloading weights. Then give your service account permission to access this secret (see the sketch below).
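A minimal sketch of that last step, assuming the Secret Manager API is enabled; the token and service account names are placeholders:

    # Store your Hugging Face token as the secret HF_TOKEN (read from stdin)
    printf '%s' 'hf_YOUR_TOKEN' | gcloud secrets create HF_TOKEN \
        --replication-policy="automatic" \
        --data-file=-

    # Grant your service account read access; the member below is a placeholder
    gcloud secrets add-iam-policy-binding HF_TOKEN \
        --member="serviceAccount:YOUR_SERVICE_ACCOUNT@YOUR_PROJECT.iam.gserviceaccount.com" \
        --role="roles/secretmanager.secretAccessor"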

Consider locally testing that cloud logging and storage works

Run a mini experiment on your computer and check that data was uploaded to GCP.
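The steps here (and the launches below) assume your gcloud CLI points at the project that hosts the bucket. If you haven't configured that yet, something like the following should do it (the project ID is a placeholder):

    # Point gcloud at the project hosting the bucket
    gcloud config set project YOUR_PROJECT_ID

    # Set up application-default credentials for the Google Cloud client libraries
    gcloud auth application-default login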

  1. Install the gcp requirements (at the repo root):

    python -m pip install ".[gcp]"
  2. From the repo root, run the mini CPU test (after ensuring your gcloud is set to the project that hosts the bucket, as above):

    PRETRAIN_ON_TEST_CLOUD_PROVIDER="gcp" \
    PRETRAIN_ON_TEST_BUCKET_NAME="pretrain-on-test-accuracies" \
    ./experiment_mini.sh
  3. Check that logs were written (search for the most recent log group whose name starts with run-) and that data was uploaded to the bucket pretrain-on-test-accuracies. See the spot check below.
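If you prefer the CLI to the console, a quick spot check of both (assuming the log and bucket names above):

    # Look for the most recent run-* log
    gcloud logging logs list | grep run-

    # Confirm the run's data landed in the bucket
    gsutil ls -r gs://pretrain-on-test-accuracies/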

Test that cloud launches work

Launch a cloud instance which will run a mini experiment, and check that data was uploaded to GCP.

  1. Run the mini CPU test (after ensuring your gcloud is set to the project that hosts the bucket):

    python launch.py --run_type cpu-test
  2. Check that logs were written (search for the most recent log group whose name starts with run-) and that data was uploaded to the bucket pretrain-on-test-accuracies, as in the spot check above.

  3. Consider deleting these logs:

    python delete_old_test_logs.py

Run the experiment on GPUs

Launch a cloud GPU instance which will run the full experiment, and check that data was uploaded to GCP. Note that the instance will automatically stop, even if there's an error.

  1. Run the full experiment (after ensuring your gcloud is set to whatever project hosts the bucket):

    python launch.py

    To run multiple experiments in parallel / on multiple instances, put some bash files in a directory (e.g., ./experiments/m100/n500/bert/) and run:

    python launch.py --sh_dir_or_filename experiments/m100/n500/bert/

    If you get an error with code ZONE_RESOURCE_POOL_EXHAUSTED (because there aren't any T4 GPUs available in the requested zone), consider adding the flag --any_zone to the launch.py command. This flag causes the script to automatically search for a zone with availability. See the combined example after this list.

  2. Check that logs were written (search for the most recent log group whose name starts with run-) and that data was uploaded to the bucket pretrain-on-test-accuracies.
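Combining the flags above, a parallel launch that searches for an available zone might look like:

    python launch.py \
        --sh_dir_or_filename experiments/m100/n500/bert/ \
        --any_zone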

Merge data

After running all of the experiments, merge their data into a single directory which can be used for analysis.

  1. cd to:

    cd ../../analysis/dirty_file_processing
  2. Copy data from GCP storage into a local runs directory, for example:

    mkdir -p runs
    cd runs
    gsutil -m cp -r \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_06-52-44-m50_n100_gpt2_4" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_06-52-52-m50_n100_gpt2_2" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_06-52-52-m50_n100_gpt2_5" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_06-53-09-m50_n100_gpt2_7" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_14-23-58-m50_n100_gpt2_6" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_14-24-05-m50_n100_gpt2_3" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_14-24-48-m50_n100_gpt2_1" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_19-59-41-m50_n100_bert_2" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_20-18-05-m50_n100_bert_4" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_20-19-25-m50_n100_bert_6" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_20-21-02-m50_n100_bert_5" \
      "gs://pretrain-on-test-accuracies/run-2024-06-17_21-46-23-m50_n100_bert_7" \
      "gs://pretrain-on-test-accuracies/run-2024-06-18_00-25-45-m50_n100_bert_1" \
      "gs://pretrain-on-test-accuracies/run-2024-06-18_03-00-59-m50_n100_bert_3" \
      .
    cd ..
  3. Merge them into a new directory, accuracies:

    python merge_runs.py --runs_dir runs --destination_dir accuracies
  4. Verify that the same set of datasets was run for both models:

    diff <(ls accuracies/m50/n100/bert) <(ls accuracies/m50/n100/gpt2)
  5. When you're ready to analyze this data, copy (or move, if you prefer) accuracies into the analysis dir:

    cp -a accuracies ../
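For reference, the merged layout implied by the run names and the diff above should look roughly like this (the dataset names are placeholders):

    accuracies/
    └── m50/
        └── n100/
            ├── bert/
            │   ├── <dataset_1>
            │   └── <dataset_2>
            └── gpt2/
                ├── <dataset_1>
                └── <dataset_2>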

Run the analysis

All of the analyses can be run locally, but I hit performance issues for $n = 50$ and $n = 100$ because the number of subsamples for each dataset is $100$. Multiprocessing also wasn't working locally, so I ran the analyses in the cloud instead.

Launch a high-memory, 4-core CPU instance which will run the analyses, e.g., those in ./analyses/m100:

    python launch.py \
        --run_type analysis \
        --sh_dir_or_filename analyses/m100