Skip to content

Latest commit

 

History

History
78 lines (69 loc) · 4.08 KB

README.md

File metadata and controls

78 lines (69 loc) · 4.08 KB

Evaluating Uncertainty in Neural Text Generators Against Human Production Variability

This repository contains code for the paper What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability.

Data

Models

Data Pre-Processing

We provide several methods in utils.py to transform the WMT, ASSET, and DailyDialog++ datasets into a unified format. WritingPrompts can be converted using the python file convert_writingprompts.py, which also allows to filter data instances according to minimum number of story completions available for each prompt.

Fine-tuning

For convenience and reproducibility, we provide the fine-tuning scripts we used for text simplification and story generation: finetune_flanT5.py and finetune_gpt.py. We recommend using up-to-date Huggingface code; see, for example, the Huggingface trainers documentation and finetuning script.

Generation

You can launch the generation script by providing a configuration file in json format:

python src/generate.py --configs_path generation_config.json

This is what a configuration file looks like:

{
    "config_name": "FlanT5-large finetuned typical d=0.95", 
    "task": "simplification", 
    "context_key": "source",   # the key to access the context/input in the data file
    "data_path": "data/asset/val.json", 
    "out_path": "data/asset/flanT5_large_finetuned-typical_095-val.json",
    "model_name": "models/asset/flan-t5-large", 
    "do_sample": true, 
    "n_samples": 50, 
    "max_length": 100, 
    "typical_p": 0.95,
    "temperature": null, 
    "top_k": null, 
    "top_p": null, 
    "n_beams": null
}

Alternatively, you can pass arguments directly (e.g., --data_path PATH --out_path PATH --model_name MODEL). See all available arguments in src/generate.py.

Computing similarity scores

To compute similarity scores between two systems, you can use the script src/write_scores.py. For example, the following configuration will output similarity scores between the outputs of one NLG system cand_system_name:

python src/write_scores.py \
    --cand_system "cand_system_name" \
    --tgt_system "cand_system_name" \
    --cand_path "/path/to/cand_system_outputs.json" \
    --tgt_path "/path/to/tgt_system_outputs.json" \
    --cand_response_key "generation" \
    --tgt_response_key "generation" \
    --context_key "input" \
    --out_path "/path/to/output/" \
    --n_cand_samples "10" \
    --n_tgt_samples "10" \
    --n_contexts "1000" \
    --dataset "asset" \
    --max_response_length "100"

Precommputed scores for a large variety of models and decoding algorithms are available at: https://doi.org/10.5281/zenodo.10025272.

Probing Representations of Uncertainty and Assessing Statistical Fitness

We assess the statistical fitness of the candidate system by analysing its outputs w.r.t. itself (self-variability) or a system known to be plausible (human references) for different divergence measures. This analysis is done in fitness_analysis.ipynb. This notebook also contains the plots included in the paper.