A seq2seq framework built on TensorFlow but maintained independently of tensorflow/models:
Homepage: https://google.github.io/seq2seq/
Paper: Massive Exploration of Neural Machine Translation Architectures
Overall architecture diagram:
First, install python-tk (Python's Tk GUI bindings):
apt-get install python-tk
Then set matplotlib's backend:
echo "backend : Agg" >> $HOME/.config/matplotlib/matplotlibrc
Then install seq2seq:
cd seq2seq
pip install -e .
Run the test suite to verify the installation:
python -m unittest seq2seq.test.pipeline_test
Classical tokenization:
The usual approach is whitespace splitting or a standard tokenizer (e.g. Moses' tokenizer.perl, or the tokenizers in spaCy/NLTK/Stanford).
For NMT, however, this leads to an open-vocabulary problem: the model works with a fixed vocabulary, while real text contains rare and unseen words. One way to address this is to learn subword units from the given text. For example, "loved" can be split into "lov" and "ed", and "loving" into "lov" and "ing". This lets the model produce new (previously unknown) words on the one hand, and keeps the vocabulary size small on the other. The example here uses Byte Pair Encoding (BPE).
Usage:
# Clone from Github
git clone https://github.com/rsennrich/subword-nmt
cd subword-nmt
# Learn a vocabulary using 10,000 merge operations
./learn_bpe.py -s 10000 < train.tok > codes.bpe
# Apply the vocabulary to the training file
./apply_bpe.py -c codes.bpe < train.tok > train.tok.bpe
The output looks like this; rare words (such as Nikitin in the example) are split into subword units, with "@@" marking where a word was split:
Madam President , I should like to draw your attention to a case in which this Parliament has consistently shown an interest. It is the case of Alexander Ni@@ ki@@ tin .
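To make the merge-learning idea concrete, here is a minimal, self-contained Python sketch (not the subword-nmt implementation; the function name learn_bpe_merges is made up for illustration). It repeatedly merges the most frequent adjacent symbol pair in a tiny corpus:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: `words` maps a word (tuple of symbols) to its corpus frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

# "loved" and "loving" end up sharing the subword "lov" after a few merges.
corpus = {tuple("loved"): 10, tuple("loving"): 8, tuple("love"): 12}
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(vocab)
```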
Training uses the preprocessed English-German WMT'16 Translation Task dataset. See wmt16_en_de.sh; you need to set the OUTPUT_DIR variable. The script downloads the data, tokenizes it with the Moses tokenizer, cleans the training data, and learns subword units with a vocabulary of about 32,000. Alternatively, you can download the preprocessed WMT'16 EN-DE Data (502MB) directly from Google Drive.
The files are organized as follows:
Filename | Description |
---|---|
train.tok.clean.bpe.32000.en | The English training data, one sentence per line, processed using BPE. |
train.tok.clean.bpe.32000.de | The German training data, one sentence per line, processed using BPE. |
vocab.bpe.32000 | The full vocabulary used in the training data, one token per line. |
newstestXXXX.* | Development and test data sets, in the same format as the training data. We provide both pre-processed and original data files used for evaluation. |
Then set the environment variables:
# Set this to where you extracted the downloaded file
export DATA_PATH=
export VOCAB_SOURCE=${DATA_PATH}/vocab.bpe.32000
export VOCAB_TARGET=${DATA_PATH}/vocab.bpe.32000
export TRAIN_SOURCES=${DATA_PATH}/train.tok.clean.bpe.32000.en
export TRAIN_TARGETS=${DATA_PATH}/train.tok.clean.bpe.32000.de
export DEV_SOURCES=${DATA_PATH}/newstest2013.tok.bpe.32000.en
export DEV_TARGETS=${DATA_PATH}/newstest2013.tok.bpe.32000.de
export DEV_TARGETS_REF=${DATA_PATH}/newstest2013.tok.de
export TRAIN_STEPS=1000000
For the toy data, just run toy.sh to generate it, then set the environment variables:
export DATA_PATH=
export DATA_TYPE=copy # or reverse
export VOCAB_SOURCE=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/train/vocab.sources.txt
export VOCAB_TARGET=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/train/vocab.targets.txt
export TRAIN_SOURCES=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/train/sources.txt
export TRAIN_TARGETS=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/train/targets.txt
export DEV_SOURCES=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/dev/sources.txt
export DEV_TARGETS=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/dev/targets.txt
export DEV_TARGETS_REF=${DATA_PATH}/nmt_data/toy_${DATA_TYPE}/dev/targets.txt
export TRAIN_STEPS=1000
The standard model is seq2seq with attention, which has a large number of hyperparameters. Under example_configs there are small/medium/large configurations; for example, the medium one:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerBahdanau
  attention.params:
    num_units: 256
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 256
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 256
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 256
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 2
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
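As a reference for what attention.params / num_units controls, here is a rough NumPy sketch of additive (Bahdanau) attention scoring. It uses randomly initialized weights purely for illustration; it is not the trained AttentionLayerBahdanau module:

```python
import numpy as np

def bahdanau_attention(enc_states, dec_state, num_units=256, seed=0):
    """Additive (Bahdanau) attention over encoder states.

    enc_states: [src_len, enc_dim], dec_state: [dec_dim].
    Returns (context_vector, attention_weights).
    """
    rng = np.random.default_rng(seed)
    src_len, enc_dim = enc_states.shape
    dec_dim = dec_state.shape[0]
    # Random weights stand in for trained parameters.
    W_enc = rng.normal(scale=0.1, size=(enc_dim, num_units))
    W_dec = rng.normal(scale=0.1, size=(dec_dim, num_units))
    v = rng.normal(scale=0.1, size=(num_units,))
    # score_i = v^T tanh(W_enc h_i + W_dec s)
    scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ enc_states          # weighted sum of encoder states
    return context, weights

# A bidirectional GRU encoder with 256 units per direction gives 512-dim states.
enc = np.random.randn(7, 512)
dec = np.random.randn(256)
ctx, w = bahdanau_attention(enc, dec)
print(ctx.shape, w.round(3))
```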
On a single GPU (e.g. a TitanX), even the small model takes several days to converge well on the WMT'16 English-German data. On an 8-GPU cluster using TensorFlow's distributed training, the large model takes 2-3 days. On the toy data, 1000 steps take about 10 minutes on a CPU.
export MODEL_DIR=${TMPDIR:-/tmp}/nmt_tutorial
mkdir -p $MODEL_DIR
python -m bin.train \
  --config_paths="
      ./example_configs/nmt_small.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 32 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR
The main flags are: config_paths (the YAML configuration files to merge), model_params (overrides such as the vocabulary files), input_pipeline_train / input_pipeline_dev (how the training and dev data are read), batch_size, train_steps, and output_dir (where checkpoints and summaries are written).
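The files passed via config_paths are merged into a single configuration before training starts. The sketch below only illustrates that idea (it is not the framework's actual loading code), under the assumption that values in later files override earlier ones:

```python
import yaml

def deep_merge(base, override):
    """Recursively merge two dicts; values from `override` win."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

def merge_configs(paths):
    """Load several YAML config files and deep-merge them, later files winning (assumption)."""
    merged = {}
    for path in paths:
        with open(path) as f:
            cfg = yaml.safe_load(f) or {}
        merged = deep_merge(merged, cfg)
    return merged

config = merge_configs([
    "./example_configs/nmt_small.yml",
    "./example_configs/train_seq2seq.yml",
    "./example_configs/text_metrics_bpe.yml",
])
print(config.get("model"), list(config.keys()))
```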
Monitor the output directory with TensorBoard (it shows metrics such as log_perplexity and BLEU):
tensorboard --logdir $MODEL_DIR
export PRED_DIR=${MODEL_DIR}/pred
mkdir -p ${PRED_DIR}
python -m bin.infer \
  --tasks "
    - class: DecodeText" \
  --model_dir $MODEL_DIR \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt
Instead of greedily picking the most likely word at each step, beam search keeps several hypotheses, or "beams", in memory and chooses the best one based on a scoring function. It can be enabled via model_params; note that it makes inference significantly slower.
python -m bin.infer \
  --tasks "
    - class: DecodeText
    - class: DumpBeams
      params:
        file: ${PRED_DIR}/beams.npz" \
  --model_dir $MODEL_DIR \
  --model_params "
    inference.beam_search.beam_width: 5" \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt
The example above also shows that the tasks flag accepts multiple tasks; here, in addition to decoding, the beams are dumped to ${PRED_DIR}/beams.npz. They can be inspected like this:
import numpy as np
r = np.load("/tmp/nmt_tutorial/pred/beams.npz")
print(r.files)
# => ['predicted_ids', 'beam_parent_ids', 'log_probs', 'scores']
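The arrays in beams.npz can be turned back into full hypotheses by following the parent pointers from the last decoding step backwards. The sketch below is illustrative only and assumes the per-sentence arrays have shape [time, beam_width]; check the actual shapes in your dump before relying on it:

```python
import numpy as np

def backtrack_beam(predicted_ids, beam_parent_ids, beam=0):
    """Follow parent pointers backwards to recover one hypothesis.

    predicted_ids, beam_parent_ids: int arrays of shape [time, beam_width] (assumption).
    `beam` selects which final beam to trace.
    """
    tokens = []
    for t in range(predicted_ids.shape[0] - 1, -1, -1):
        tokens.append(int(predicted_ids[t, beam]))
        beam = int(beam_parent_ids[t, beam])  # move to this token's parent beam
    return tokens[::-1]                       # reverse into left-to-right order

r = np.load("/tmp/nmt_tutorial/pred/beams.npz")
print(r.files)
# Example (adjust indexing to the actual shapes in your dump):
# ids = backtrack_beam(r["predicted_ids"][0], r["beam_parent_ids"][0])
```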
By default, the infer script only evaluates the latest checkpoint; to evaluate a specific one, pass the checkpoint_path flag.
BLEU can be computed with the following command:
./bin/tools/multi-bleu.perl ${DEV_TARGETS_REF} < ${PRED_DIR}/predictions.txt
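multi-bleu.perl is the standard script. If you only want a rough sanity check from Python, NLTK's corpus_bleu is an alternative (a different implementation, so the number will not match multi-bleu.perl exactly):

```python
import os
from nltk.translate.bleu_score import corpus_bleu

# Reuse the environment variables set earlier in this tutorial.
ref_path = os.environ["DEV_TARGETS_REF"]                   # e.g. newstest2013.tok.de
hyp_path = os.path.join(os.environ["PRED_DIR"], "predictions.txt")

with open(ref_path) as f:
    references = [[line.split()] for line in f]            # one reference per sentence
with open(hyp_path) as f:
    hypotheses = [line.split() for line in f]

print(corpus_bleu(references, hypotheses))
```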
Note: if you are working with the toy data, you can simply git clone https://github.com/daiwk/seq2seq.git and then run dwk_train.sh.