首页 > 常用平台 > 正文

老Paddle使用整理

标签：老paddle

2016-10-14

老Paddle使用整理
- 1. 预处理
  - 1.1 预处理基本知识
  - 1.1.1 预处理的文本结构：
- 2. 训练

老Paddle使用整理

以bi-lstm+crf进行品牌词识别为例，对老paddle的使用进行总结。

1. 预处理

1.1 预处理基本知识

1.1.1 预处理的文本结构：

label;slotid0 fea1[:weight] fea2[:weight];slotid1 fea1[:weight] ...;

fea:可以是离散值也可以是连续值
slot数至少为1，也就是，至少有两个分号

说明	样本例子
连续特征，单slot，特征维数是5	0;0 3.15 0 0 0 6.28; 1;0 3.14 6.28 9.42; 非法，对于连续特征，每个slot必须包含的特征数应该跟维数一致。
连续特征，2个slot，特征维数都是5	0;0 0 1 2 3 4.5;1 3.14 0 0 0 0; 1;1 3.14 6.28 0 0 0; 合法，slot0为空。
离散特征，单slot，特征维数是1024	0;0 1 2 3 5; 1;0 1 32 64 512 1023; 0;0 1023 1024; 非法，特征维数是1024，从0开始，所以最大的特征index只能是1023，特征越界。
离散特征，3个slot，第1个slot有1024维特征，第2、3个slot有512维特征	0;0 1 4 1023;1 4 7 3;2 2 6 511; 1;2 1 5 88 511;
离散带权特征，单slot，特征维数是1024	0;0 1:3.14 2:6.28 0:0 4:12.56 1023:3.1415926;

1.1.2 预处理的7个param

param1：不需要修改，默认dnn_inst_
param2：多少个slot，每个slot分别是什么类型

param2取值	含义
0	连续值特征，fea为浮点数的情况
1	离散值特征，不带权，fea为整数的情况
2	离散值特征，带权，fea为整数，后面跟着冒号和浮点数weight的情况
3	id特征，仅包含一个整数，可作为样本的label或者id （从Paddle 1.0.0.7版本开始支持）

使用场景：

使用场景	用法
样本只有1个slot，该slot由离散型特征组成	param2=1
样本只有1个slot，该slot由连续型特征组合	param2=0
样本有3个slot，其中slot0包含连续特征，其余slot包含离散特征	param2=“0 1 1” 【记得这里有双引号】
样本有4个slot，其中slot0和slot2的特征是连续的，其他都是离散型特征	param2=“0 1 0 1” 记得这里有双引号
样本有若干个slot，所有slot的特征都是离散型特征	param2=1

param3：维数1 维数2 … label数例如：

param3	说明
"100 100 336 10"	多slot样本，三个slot的维数分别为100,100和336，label为10，即十分类问题
"536 2"	单slot样本，第一个slot为536维特征，label为2，即二分类问题

param4：是否对转换后二进制格式的数据压缩，默认不压缩(False)
param5：label/哪个slot需要使用词典替换特征。例如：param5=”–dict=0:dict11,2:dict2” 表示，label用dict1替换，slot0，slot1【写1,2，实际上是slot0,1】用dict2替换
param6：如果文本格式中的slot_id是不是从0开始的整数，而是字符串等，需要设置该参数，设置样例：param6=”–slotid=slotid0 slotid1 slotid2”
param7：如果忽略不在词典中的word，需要设置该参数为True，设置样例：param7=”–ignore=True”

1.1.3 proto格式

假定一条样本为中文句子“百度公司创立于 2000年”，样本的类别为“0”。假定“百度”在词表中的id为23，“公司”为35，“创立于”为10，“2000年”为87，词表大小为23984，共有3个类别。首先我们将其转换成文本格式（label; slot_id word_id1 word_id2 word_id3 word_id4……）【注意，slotid没有实际意义，只是一个编号，表示有多少维特征，从0开始递增】:

0;0 23 35 10 87;

而这里我们要需要进行序列标注，所以在wordid这一维特征之外，还要用到别的两个特征，总共如下：

wordid：假设训练集使用jieba分词完有4870个词。
词性：pos，假设使用jieba进行词性标注，那么有110个类别。ICTCLAS 汉语词性标注集
占位符：全部写0【老paddle需要这个字段，新paddle不用】

比如，一个单词有8个汉字，那么，我们转化为：

0;0 1383 2523 4396 1253 3967 4333 490 613;1 48 94 94 32 86 17 70 25;2 0 0 0 0 0 0 0 0;

接下来，使用txt2proto工具进行转换：

1.1.3.1 local预处理

使用txt2proto这个bin，三个参数，第一个是输出文件，第二个是param2，第三个是param3

cat INPUT_FILE | txt2proto OUTPUT_FILE "1" "23984 3"

1.1.3.2 cluster预处理

需要编写两个conf：

以train为例：

# Local dir.
LOCAL_HADOOP_HOME=xxxxx

# HDFS.
conf_hdfs=hdfs://xxxxxx
conf_ugi=xxxx,xxxx
conf_tracker=xxxx

# Data dir.
input_dir="xxx/brand_recognize/input/data_test"
output_dir="xxx/brand_recognize/preprocess/data_test_pb"
# Dictionary dir
dict_dir=
#If force_reuse_output_path is True ,paddle will remove outputdir without check outputdir exist
force_reuse_output_path=
# Job parameters.
JOBNAME="xxxx_gen_proto_test_brand_recognize"
MAPCAP=5000
REDCAP=5000
MAPNUM=1000
REDNUM=1000

1.1.4 使用pyDataProvider格式

2. 训练

需要准备common.conf和trainer_config.conf

集群版的common.conf: local版的common.conf：

trainer.conf:

import math

################################### Data Configuration ###################################
word_dim = 4870
pos_dim = 110
label_dim = 4
TrainData(ProtoData(files = "train.list", type = "proto_sequence"))
TestData(ProtoData(files = "test.list", type = "proto_sequence"))

################################### Algorithm Configuration ###################################
Settings(
         algorithm='sgd',
         learning_rate_decay_a=0,
         learning_rate_decay_b=0,
         batch_size=12,
         learning_rate=0.01,
         learning_method='adagrad',
         #ada_epsilon=1.0,
         #num_batches_per_send_parameter=2,
         #num_batches_per_get_parameter=1,
        )
################################### Network Configuration ###################################
Inputs("word", "pos", "label", "place_holder")
embedding_size = 256
hidden_dim = 256
sentence_vec_dim = 256
output_dim = 256
pos_embedding_size = 20
num_rnn_layers = 1

lr_hid_col = 1
lr_output = 1
lr_keep = 0.1

Layer(name = "word", type = "data", size = word_dim)
Layer(name = "pos", type = "data", size = pos_dim)
Layer(name = "label", type = "data", size = label_dim)
Layer(name = "place_holder", type = "data", size = 1)

name = "word"
parameter_name = ["word"]
offset = 0

MixedLayer(
    name = name + "_word_embedding",
    size = embedding_size,
    bias = False,
    inputs = TableProjection(
        name, 
        initial_std = 1 / math.sqrt(embedding_size),
        learning_rate = lr_keep,
        parameter_name = "embedding",
        #sparse_remote_update=True
        )
)

for j in range(num_rnn_layers):
    input_name = (name + "_reverse_rnn%d" % (j-1)) if j > 0 else name + "_word_embedding"
    MixedLayer(
        name = name + "_rnn%d_input" % j,
        size = hidden_dim, 
        bias = False,
        inputs = FullMatrixProjection(input_name,
                initial_std = 1 / math.sqrt(hidden_dim), 
                parameter_name="%s_rnn%d_input_to_hidden" % (parameter_name[offset], j),
                learning_rate =lr_hid_col)
    )
    Layer(
        name = name + "_rnn%d" % j,
        type = "lstmemory",
        active_type = "tanh",
        active_state_type = "tanh",
        active_gate_type = "sigmoid",
        bias = Bias(initial_std=0, parameter_name = '%s_rnn%d_bias' % (parameter_name[offset], j), learning_rate = lr_hid_col if j > 0 else lr_keep),
        inputs = Input(name + "_rnn%d_input" % j, 
            initial_std = 1/math.sqrt(hidden_dim),
            parameter_name="%s_rnn%d_weight" % (parameter_name[offset], j),
            learning_rate = lr_hid_col)
    )
    MixedLayer(
        name = name + "_reverse_rnn%d_input" % j,
        size = hidden_dim, 
        bias = False,
        inputs = [FullMatrixProjection(name + "_rnn%d" % j,
                initial_std = 1 / math.sqrt(hidden_dim), 
                parameter_name = "%s_reverse_rnn%d_input1_to_hidden" % (parameter_name[offset], j),
                learning_rate =lr_hid_col), 
            FullMatrixProjection(input_name, 
                initial_std = 1 / math.sqrt(hidden_dim),
                parameter_name = '%s_reverse_rnn%d_input2_to_hidden' % (parameter_name[offset], j),
                learning_rate = lr_hid_col)]
    )
    Layer(
        name = name + "_reverse_rnn%d" % j,
        type = "lstmemory",
        active_type = "tanh",
        active_state_type = "tanh",
        active_gate_type = "sigmoid",
        reversed = True,
        bias = Bias(initial_std=0, parameter_name = '%s_reverse_rnn%d_bias' % (parameter_name[offset], j), learning_rate = lr_hid_col),
        inputs = Input(name + "_reverse_rnn%d_input" % j, 
            initial_std = 1/math.sqrt(hidden_dim),
            parameter_name="%s_reverse_rnn%d_weight" % (parameter_name[offset], j),
            learning_rate = lr_hid_col)
    )

MixedLayer(
    name = name + "_embeddding_and_hidden",
    size = sentence_vec_dim,
    active_type = "relu",
    bias = False,
    inputs = [FullMatrixProjection(name + "_rnn%d" % j,
            initial_std = 0.000001 / math.sqrt(output_dim),
            parameter_name="%s_rnn%d_hidden_to_pooling" % (parameter_name[offset], j),
            learning_rate = lr_hid_col) for j in range(num_rnn_layers)] +
        [FullMatrixProjection(name + "_reverse_rnn%d" % j,
            initial_std = 0.000001 / math.sqrt(output_dim),
            parameter_name="%s_reverse_rnn%d_hidden_to_pooling" % (parameter_name[offset], j),
            learning_rate = lr_hid_col) for j in range(num_rnn_layers)] +
        [FullMatrixProjection(name + "_word_embedding",
            initial_std = 0.000001 / math.sqrt(output_dim),
            parameter_name = "%s_embedding_to_pooling" % (parameter_name[offset]),
            learning_rate = lr_hid_col)]
)

MixedLayer(
    name = "pos_embedding",
    size = pos_embedding_size,
    bias = False,
    inputs = TableProjection(
        "pos", 
        initial_std = 1 / math.sqrt(pos_embedding_size),
        learning_rate = lr_output,
        parameter_name = "pos_embedding",
        #sparse_remote_update=True
        )
)

FCLayer(
    name = "combine_vec",
    size = output_dim,
    active_type = "tanh",
    bias = Bias(initial_std = 0, parameter_name = 'combine_vec_bias', learning_rate = lr_output),
    inputs = [Input(
        name + "_embeddding_and_hidden",
        learning_rate = lr_output, 
        initial_std = 1 / math.sqrt(sentence_vec_dim),
        parameter_name = 'combine_%s_weight' % name,
        )] + 
        [Input(
        "pos_embedding",
        learning_rate = lr_output,
        initial_std = 1 / math.sqrt(pos_embedding_size),
        parameter_name = 'combine_pos_weight'
        )]
)


FCLayer(
    name = "similarity",
    size = output_dim,
    active_type = "tanh",
    bias = Bias(initial_std = 0, parameter_name = 'similarity_bias', learning_rate = lr_output),
    inputs = [FullMatrixProjection(
        "combine_vec",
        learning_rate = lr_output, 
        initial_std = 1 / math.sqrt(output_dim),
        parameter_name = 'similarity_weight'
        )]
)

MixedLayer(
    name = "output",
    size = label_dim,
    #active_type = "softmax",
    #active_type = "sigmoid",
    #bias = Bias(initial_std = 0, parameter_name = 'output_bias', learning_rate = lr_output),
    bias = False,
    inputs = [FullMatrixProjection(
        "similarity",
        #learning_rate = lr_output, 
        initial_std = 1 / math.sqrt(output_dim),
        #parameter_name = 'output_weight'
        )]
)

CRFLayer(
    name = "crf_cost",
    size = label_dim,
    inputs = [
    Input("output", parameter_name="crfw"),
    Input("label")
    ]
)

Layer(
    name = "crf_layer",
    size = label_dim,
    type = 'crf_decoding',
    #bias = Bias(initial_std = 0, parameter_name = 'crf_bias', learning_rate = lr_output),
    inputs = [
        Input("output"),
        #Input("label")
    ]
)

Evaluator(
    name = "error",
    type = "sum",
    inputs = "crf_layer",
)

#Evaluator(
#    name = "chunk_f1",
#    type = "chunk",
#    inputs = ["crf_layer", "label"],
#    #chunk_scheme = "plain",
#    chunk_scheme = "IOB",
#    #num_chunk_types = 3,
#    num_chunk_types = 2,
#)

Outputs("crf_cost")

老paddle支持的所有layer： http://wiki.baidu.com/pages/viewpage.action?pageId=42366874

注意：运行submit的时候，会自己产出train.list和test.list，并且会生成tester_config.conf

使用的是ecom的线下集群，

查看所有任务：http://xxxx/job/

查看当前任务（jobid=287811.xxx）运行： http://xxxx/job/i-287811/

干掉这个job： qdel 287811.xxxx

预测： predict这个bin的用法： I1020 12:35:10.277036 25364 Main.cpp:44] Model path or feature type or both missing. Please read the usage below: I1020 12:35:10.277549 25364 Main.cpp:46] ./predict model_path feature_type(s) [logLevel] [isSequence] I1020 12:35:10.277555 25364 Main.cpp:47] @model_path: path where the model stored. the directory indicated by model_path should contain 1 binary network configuration file and 1 sub directory naming ‘model’ that contains the model itself. I1020 12:35:10.277562 25364 Main.cpp:52] @feature_type: [0|1|2] integer(s) to indicate the type of features in each instance. I1020 12:35:10.277567 25364 Main.cpp:54] 0 -> continues values. floating points. I1020 12:35:10.277572 25364 Main.cpp:55] 1 -> discrete values without weights. integers. I1020 12:35:10.277577 25364 Main.cpp:56] 2 -> discrete values with weights. integers:float. I1020 12:35:10.277581 25364 Main.cpp:57] @logLevel: from 0 to 4(default), 0 for print debug info (which will lead to core dump on failure), 4 for suppress debug info. I1020 12:35:10.277586 25364 Main.cpp:60] isSequence is used for SparseNonValuePredictor or MultipleTypesPredictor. I1020 12:35:10.277591 25364 Main.cpp:61] If isSequence=1, sparse_non_value slots (and only those slots) will be treated as SEQUENCE. I1020 12:35:10.277596 25364 Main.cpp:62] example: I1020 12:35:10.277601 25364 Main.cpp:63] ./predict ./myModel 0 I1020 12:35:10.277604 25364 Main.cpp:64] ./predict ./myModel “0 0 1 2” <- instance with 4 slots, each slot contain different type of features.

[INFO][PredictorInternal.cpp][readBinaryConf][40] content length of binary conf is [3923]. [INFO][PredictorInternal.cpp][readBinaryConf][72] read binary conf file done. [INFO][PredictorInternal.cpp][init][420] output[0]: dim=[4], name=[crf_layer]. [INFO][PredictorInternal.cpp][init][429] input[0]: dim=[4870], name=[word]. [INFO][PredictorInternal.cpp][init][429] input[1]: dim=[110], name=[pos]. [INFO][PredictorInternal.cpp][init][429] input[2]: dim=[1], name=[place_holder].

原创文章，转载请注明出处！
本文链接：http://hxhlwf.github.io/posts/platform-old-paddle-tutorial.html

上篇： paddlepaddle快速入门教程

下篇： paddlepaddle中的seq2seq的demo

comment here..