Notebooks which analyze the accuracy scores in `./accuracies_*/`.
At the repo root, install these dependencies (in a virtual environment):
`python -m pip install ".[stat]"`
- `dataset.ipynb` visualizes the datasets.
- `./fit_posteriors/` fits the hierarchical models (see the sketch after this list).
- `./meta/` assesses the importance of repeated subsampling of each dataset.
- `./results/` visualizes effects of interest. This is the main result.
- `./contamination/` checks whether a particular contamination test raises an arguably false alarm.
- `test.ipynb` statistically tests that the inference code works.
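To give a feel for what "fits the hierarchical models" means, here is a minimal sketch of a partially pooled model over per-dataset accuracy differences. PyMC, the simulated data, and the particular parameterization are all assumptions for illustration; the actual models in `./fit_posteriors/` may be set up differently.

```python
# Hypothetical sketch of a hierarchical model over per-dataset accuracy
# differences. Illustrative only; the real models in ./fit_posteriors/ may differ.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
num_datasets, num_subsamples = 5, 20
# acc_diff[d, s]: (treatment - control) accuracy for dataset d, subsample s (simulated)
acc_diff = rng.normal(0.01, 0.02, size=(num_datasets, num_subsamples))
dataset_idx = np.repeat(np.arange(num_datasets), num_subsamples)

with pm.Model():
    # Population-level effect and between-dataset spread
    mu = pm.Normal("mu", 0, 0.1)
    tau = pm.HalfNormal("tau", 0.1)
    # Dataset-level effects, partially pooled toward mu
    effect = pm.Normal("effect", mu, tau, shape=num_datasets)
    sigma = pm.HalfNormal("sigma", 0.1)
    pm.Normal("obs", effect[dataset_idx], sigma, observed=acc_diff.ravel())
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

The dataset-level `effect` terms get pulled toward the population mean `mu`, which is the shrinkage discussed in the last section below.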
## Why is m = 100 in the zero-shot experiments?
The `./accuracies_zero_shot*/` data come from experiments run with `--num_train 100` (the default value), even though the training dataset is unused. `--num_train 100` is supplied to keep every subsample's test split identical across few-shot and zero-shot experiments, in case we ever want to compare them. The subsampling code works by stratify-sampling (by the label).
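A minimal sketch of what label-stratified subsampling with a fixed train size and seed can look like. scikit-learn's `train_test_split` and the `subsample` function below are my illustrative assumptions, not the repo's actual code; the point is that the test split is drawn from whatever remains after the train split, so changing `num_train` would change the test split too.

```python
# Hypothetical sketch of label-stratified subsampling with a fixed seed.
# The real subsampling code may differ; the test split only stays identical
# if num_train (and the seed) stay fixed, because the test split is drawn
# from whatever remains after the train split.
import numpy as np
from sklearn.model_selection import train_test_split

def subsample(labels, num_train: int = 100, num_test: int = 200, seed: int = 0):
    """Return (train_idx, test_idx), each stratified by the label."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    train_idx, rest_idx = train_test_split(
        idx, train_size=num_train, stratify=labels, random_state=seed
    )
    test_idx, _ = train_test_split(
        rest_idx, train_size=num_test, stratify=labels[rest_idx], random_state=seed
    )
    return train_idx, test_idx
```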
## Why do this fancy stuff?
Even a small evaluation bias is noteworthy, so we need to say how confident we are in our measurement. The paper explains why naively computing standard errors is not great for that confidence part (it usually is great and sufficient).
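For concreteness, here is what I take "naively computing standard errors" to mean: pool all accuracy scores and treat them as independent (an assumption on my part; the paper gives the actual argument). Scores from subsamples of the same dataset are correlated, which is one reason such an interval can come out too narrow.

```python
# Toy sketch (my example): the "naive" standard error pools all scores and
# treats them as independent, ignoring that subsamples of the same dataset
# share a dataset-level effect.
import numpy as np

rng = np.random.default_rng(0)
dataset_effects = rng.normal(0.01, 0.02, size=5)                        # 5 datasets
scores = dataset_effects[:, None] + rng.normal(0, 0.01, size=(5, 20))   # 20 subsamples each

pooled = scores.ravel()
naive_se = pooled.std(ddof=1) / np.sqrt(pooled.size)  # pretends n = 100 independent scores
print(f"mean = {pooled.mean():+.4f}, naive SE = {naive_se:.4f}")
```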
The paper also motivates the estimation of task-level effects. When picking lemons or cherries from a big tree, the magnitudes of the picked effects are overestimated due to selection bias; priors shrink them toward 0 to improve estimation.
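As a toy illustration of that shrinkage (my numbers, not the paper's model): with a zero-centered normal prior on an effect and a normal likelihood, the posterior mean is the raw estimate multiplied by tau² / (tau² + se²), so noisier estimates are pulled more strongly toward 0.

```python
# Toy illustration (not the paper's model): a Normal(0, tau) prior combined
# with a Normal(raw_estimate, se) likelihood gives a posterior mean that
# shrinks the raw estimate toward 0, more aggressively when se is large.
def shrink(raw_estimate: float, se: float, tau: float = 0.02) -> float:
    weight = tau**2 / (tau**2 + se**2)  # in (0, 1); smaller when se is large
    return weight * raw_estimate

print(shrink(raw_estimate=0.05, se=0.01))  # mild shrinkage: ~0.04
print(shrink(raw_estimate=0.05, se=0.05))  # heavy shrinkage: ~0.007
```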
All in all, I want to expose and communicate the considerable variance involved in this research, and reduce estimation bias.