Notebooks which analyze the accuracy scores in `./accuracies_*/`.
At the repo root, install these dependencies (in a virtual environment):
`python -m pip install ".[stat]"`
- `dataset.ipynb` visualizes the datasets.
- `./fit_posteriors/` fits the hierarchical models (see the sketch after this list).
- `./meta/` assesses the importance of repeated subsampling of each dataset.
- `./results/` visualizes effects of interest. This is the main result.
- `./contamination/` checks whether a particular contamination test raises an arguably false alarm.
- `test.ipynb` statistically tests that the inference code works.
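To give a feel for what "fits the hierarchical models" means, here is a minimal sketch of a partially pooled model over per-dataset accuracy differences. PyMC, the simulated data, and the particular parameterization are all assumptions for illustration; the actual models in `./fit_posteriors/` may be set up differently.

```python
# Hypothetical sketch of a hierarchical model over per-dataset accuracy
# differences. Illustrative only; the real models in ./fit_posteriors/ may differ.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
num_datasets, num_subsamples = 5, 20
# acc_diff[d, s]: (treatment - control) accuracy for dataset d, subsample s (simulated)
acc_diff = rng.normal(0.01, 0.02, size=(num_datasets, num_subsamples))
dataset_idx = np.repeat(np.arange(num_datasets), num_subsamples)

with pm.Model():
    # Population-level effect and between-dataset spread
    mu = pm.Normal("mu", 0, 0.1)
    tau = pm.HalfNormal("tau", 0.1)
    # Dataset-level effects, partially pooled toward mu
    effect = pm.Normal("effect", mu, tau, shape=num_datasets)
    sigma = pm.HalfNormal("sigma", 0.1)
    pm.Normal("obs", effect[dataset_idx], sigma, observed=acc_diff.ravel())
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

The dataset-level `effect` terms get pulled toward the population mean `mu`, which is the shrinkage discussed in the last section below.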
## Why is m = 100 in the zero-shot experiments?
The `./accuracies_zero_shot*/` data come from experiments run with `--num_train 100` (the default value), even though the training dataset is unused. `--num_train 100` is supplied to keep every subsample's test split identical across few-shot and zero-shot experiments, in case we ever want to compare them. The subsampling code works by stratify-sampling (by the label).
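A minimal sketch of what label-stratified subsampling with a fixed train size and seed can look like. scikit-learn's `train_test_split` and the `subsample` function below are my illustrative assumptions, not the repo's actual code; the point is that the test split is drawn from whatever remains after the train split, so changing `num_train` would change the test split too.

```python
# Hypothetical sketch of label-stratified subsampling with a fixed seed.
# The real subsampling code may differ; the test split only stays identical
# if num_train (and the seed) stay fixed, because the test split is drawn
# from whatever remains after the train split.
import numpy as np
from sklearn.model_selection import train_test_split

def subsample(labels, num_train: int = 100, num_test: int = 200, seed: int = 0):
    """Return (train_idx, test_idx), each stratified by the label."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    train_idx, rest_idx = train_test_split(
        idx, train_size=num_train, stratify=labels, random_state=seed
    )
    test_idx, _ = train_test_split(
        rest_idx, train_size=num_test, stratify=labels[rest_idx], random_state=seed
    )
    return train_idx, test_idx
```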
## Why do this fancy stuff?
Even a small evaluation bias is noteworthy, so we need to say how confident we are in our measurement. The paper explains why naively computing standard errors is not great for that confidence part (it usually is great and sufficient).
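For concreteness, here is what I take "naively computing standard errors" to mean: pool all accuracy scores and treat them as independent (an assumption on my part; the paper gives the actual argument). Scores from subsamples of the same dataset are correlated, which is one reason such an interval can come out too narrow.

```python
# Toy sketch (my example): the "naive" standard error pools all scores and
# treats them as independent, ignoring that subsamples of the same dataset
# share a dataset-level effect.
import numpy as np

rng = np.random.default_rng(0)
dataset_effects = rng.normal(0.01, 0.02, size=5)                        # 5 datasets
scores = dataset_effects[:, None] + rng.normal(0, 0.01, size=(5, 20))   # 20 subsamples each

pooled = scores.ravel()
naive_se = pooled.std(ddof=1) / np.sqrt(pooled.size)  # pretends n = 100 independent scores
print(f"mean = {pooled.mean():+.4f}, naive SE = {naive_se:.4f}")
```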
The paper also motivates the estimation of task-level effects. When picking lemons or cherries from a big tree, the magnitudes of the picked effects are overestimated due to selection bias; priors shrink them toward 0 to improve estimation.
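As a toy illustration of that shrinkage (my numbers, not the paper's model): with a zero-centered normal prior on an effect and a normal likelihood, the posterior mean is the raw estimate multiplied by tau² / (tau² + se²), so noisier estimates are pulled more strongly toward 0.

```python
# Toy illustration (not the paper's model): a Normal(0, tau) prior combined
# with a Normal(raw_estimate, se) likelihood gives a posterior mean that
# shrinks the raw estimate toward 0, more aggressively when se is large.
def shrink(raw_estimate: float, se: float, tau: float = 0.02) -> float:
    weight = tau**2 / (tau**2 + se**2)  # in (0, 1); smaller when se is large
    return weight * raw_estimate

print(shrink(raw_estimate=0.05, se=0.01))  # mild shrinkage: ~0.04
print(shrink(raw_estimate=0.05, se=0.05))  # heavy shrinkage: ~0.007
```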
All in all, I want to expose and communicate the considerable variance involved in this research, and reduce estimation bias.