# dynamic-evaluation **Repository Path**: xsongx/dynamic-evaluation ## Basic Information - **Project Name**: dynamic-evaluation - **Description**: Dynamic evaluation for pytorch language models, now includes hyperparameter tuning - **Primary Language**: Python - **License**: BSD-2-Clause - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 1 - **Forks**: 0 - **Created**: 2020-11-05 - **Last Updated**: 2022-06-28 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README #### Dynamic evaluation for pytorch language models as implemented in [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432). #### Requirements: python 3 (tested in 3.5, 3.6), pytorch (tested in 0.1.12, 0.2) #### Instructions for use: 1. Train a language model using an existing repository, such as the [pytorch language modeling tutorial](https://github.com/pytorch/examples/tree/master/word_language_model) . This should save a .pt file with the trained model 2. Copy the file dynamiceval.py into the repository 3. Run dynamic evaluation with: `python dynamiceval.py --model modelname.pt` #### AWD-LSTM To replicate results in paper for AWD-LSTM + dynamic eval, train the language model using the [Salesforce AWD-LSTM repository](https://github.com/salesforce/awd-lstm-lm). We used the original codebase from this repository, with the goal of exact replication of results from their paper (which we failed to achieve). The default settings and hyper-parameters for dynamiceval.py are tuned for for AWD-LSTM + dynamic eval on PTB. #### AWD-QRNN This code also supports the [pytorch QRNN](https://github.com/salesforce/pytorch-qrnn) with the --QRNN option. AWD-QRNN + dynamic eval obtains very similar results to AWD-LSTM + dynamic eval, and is much faster to train and evaluate. Training an AWD-QRNN on PTB using the Salesforce AWD-LSTM repository, and running dynamic eval with the default settings gives a test perplexity of 50.5. Increasing the sequence segment length from 5 to 20 runs 3x faster (1 minute vs. 3 minutes on PTB), and gives validation (use --val flag) and test perplexities of `51.4/50.5` with the following arguments: `python dynamiceval.py --model PTB.pt --QRNN --lr 0.00012 --lamb 0.02 --bptt 20` AWD-QRNN trained on wikitext-2 gives validation (use --val flag) and test perplexities of `45.9/44.0` with the following arguments: `python dynamiceval.py --model WT2.pt --QRNN --lr 0.00012 --lamb 0.008 --bptt 20 --data data/wikitext-2` #### Hyper-parameter search To get stronger results with any other model or dataset, you can run with: `python dynamiceval.py --model modelname.pt --grid` This will do a hyper-parameter search on the validation set, takes a few hours on PTB with LSTM and default settings, can be much faster with QRNN and/or larger --bptt. If the default model size/settings are changed too much, the hyper-parameters in `lrlist` and `lamblist` may need to be changed for best results. If you want to do a faster search, you can try running with --gridfast to use a subset of the validation set, or you can reduce the number of elements in lamblist (tuning lr is more important). #### Command line arguments: `--model` (required) -filename of the trained model to be evaluated `--data` -location of the data corpus `--grid` -hyper-parameter grid search over lambda and eta, gives both valid and test error `--gridfast` -same as grid, but only uses first 30k validation tokens for search `--val ` -measure validation error instead of test error `--gpu` -specify a gpu device, uses device 0 by default (set negative for cpu) `--QRNN` -apply dynamic eval to a QRNN `--bptt` -sequence segment length for dynamic eval, also used for gradient statistics on training data `--batch_size` -batch size for gradient statistics on training data `--lr` -learning rate eta (ignored if --grid is set) `--lamb` -decay rate lambda (ignored if --grid is set) `--epsilon` -stabilization parameter epsilon `--max_batches` -max number of batches for training gradient statistics (-1 uses full training set) `--oldhyper` -The original version code inadvertently scaled a couple of terms differently from the equations described in the paper, which affects the hyper-parameters. The code was changed to reflect the paper equations, and this flag applies a hyper-parameter transformation in a way that accounts for this change. If you run this version of the code with the --oldhyper flag, it is equivalent to running the old version of the code. This will also print out hyper-parameter values that can be used with the new version of the code (without this flag) to obtain the same results. Some previous results that report hyper-parameters with the old code would require applying this hyper-parameter scaling to achieve the same results with this code. The following replicates the exact settings used to obtain results for AWD-LSTM in the paper for PTB: `python dynamiceval.py --model PTB.pt --lr 0.002 --lamb 0.02 --epsilon 0.001 --oldhyper` and for wikitext-2: `python dynamiceval.py --model WT2.pt --lr 0.002 --lamb 0.02 --epsilon 0.001 --oldhyper --data data/wikitext-2` Note that while the original hyper-parameters are the same for Wikitext-2 and PTB, the new/scaled hyper-parameters are a little different.