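Theoretical Foundations

A Language Model Based on a Recurrent Neural Network

The figure below shows a character-level language model built on a recurrent neural network. Given the current input and the inputs that came before it, the model predicts the next character in the sequence. A recurrent neural network introduces a hidden variable $H$, and $H_{t}$ denotes its value at time step $t$. $H_{t}$ is computed from $X_{t}$ and $H_{t-1}$, so it summarizes the sequence up to and including the current character; the model then uses $H_{t}$ to predict the next character.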

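Structure of a Recurrent Neural Network

Recurrent neural networks can be constructed in many ways; a common one is used here. Let $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$ be the mini-batch input at time step $t$ and $\boldsymbol{H}_t \in \mathbb{R}^{n \times h}$ the hidden variable at that time step. Then: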

$\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h).$

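Here $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$, $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$, $\boldsymbol{b}_{h} \in \mathbb{R}^{1 \times h}$, and $\phi$ is a nonlinear activation function. Because of the term $\boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$, $H_{t}$ can capture the history of the sequence up to the current time step, much like the state or memory of the network at that step. Since $H_{t}$ is computed from $H_{t-1}$, the computation above is recurrent, and a network that uses such a recurrent computation is called a recurrent neural network (RNN).

At time step $t$, the output of the output layer is: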

$\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.$

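where $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ and $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$.

Notes:

The network uses the same model parameters at every time step, so the number of parameters does not grow with the number of time steps.

During mini-batch training, parameters are updated once per batch, so every sample and every time step within a batch sees the same parameter values.

An RNN handles sequences of different lengths by applying the same set of parameters repeatedly; the number of parameters is therefore independent of the input sequence length.

The hidden state $H_{t}$ depends on $H_{1},\ldots,H_{t-1}$, so the time steps cannot be computed in parallel.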
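The two equations and the notes above can be sketched in a few lines of PyTorch. This is only an illustration under assumed toy sizes (n, d, h, q below are made-up values, and rnn_forward is a hypothetical helper name); it shows that one parameter set serves sequences of any length and that the steps have to run in order.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 2, 5, 4, 5  # batch size, input size, hidden units, outputs (illustrative values)

# One set of parameters, shared by every time step
W_xh = torch.randn(d, h) * 0.01
W_hh = torch.randn(h, h) * 0.01
b_h = torch.zeros(1, h)
W_hq = torch.randn(h, q) * 0.01
b_q = torch.zeros(1, q)


def rnn_forward(X_seq, H):
    """Apply H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h) and O_t = H_t W_hq + b_q."""
    outputs = []
    for X_t in X_seq:  # steps run in order because H_t needs H_{t-1}
        H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
        outputs.append(H @ W_hq + b_q)
    return outputs, H


num_params = sum(p.numel() for p in (W_xh, W_hh, b_h, W_hq, b_q))
for num_steps in (3, 7):  # two different sequence lengths, same parameters
    X_seq = [torch.randn(n, d) for _ in range(num_steps)]
    outputs, H = rnn_forward(X_seq, torch.zeros(n, h))
    print(num_steps, len(outputs), tuple(outputs[0].shape), tuple(H.shape), num_params)
```

The parameter count printed on both lines is identical, while the number of outputs follows the sequence length.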

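Implementing an RNN from Scratch

Loading the Data

Load the corpus that the previous article produced by preprocessing the Jay Chou lyrics dataset.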

python

deeplearning_03.py

import torch
import torch.nn as nn
import time
import math
import deeplearning_02 as dl_2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the lyrics data
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = dl_2.load_data_jay_lyrics()

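Character Representation

Characters must be represented as vectors before they can be fed into a neural network for vector computation. One-hot encoding is used here.

One-Hot Encoding

Suppose the dictionary has size $N$ and each character corresponds to a unique index from $0$ to $N-1$. The character's vector then has length $N$: if the character's index is $i$, position $i$ of the vector is $1$ and every other position is $0$.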

python

def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i, 0]] = 1
    return result

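Representing a Mini-Batch of Characters

Each sampled mini-batch has shape (batch_size, num_steps). After every character of every sample is one-hot encoded, the mini-batch becomes a list of matrices of shape (batch_size, dictionary size $N$), one matrix per time step. In other words, the input at time step $t$ is $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the size of the character vector, i.e. the one-hot vector length (the dictionary size).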

python

def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]
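As a quick check of the shapes described above, the two helpers can be run on a small mini-batch. The functions are repeated so the snippet runs on its own; the input X and the dictionary size of 12 are made-up values for illustration.

```python
import torch


def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result


def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]


X = torch.arange(10).view(2, 5)    # batch_size=2, num_steps=5, indices 0..9
inputs = to_onehot(X, n_class=12)  # 12 is a made-up dictionary size
print(len(inputs), tuple(inputs[0].shape))  # one matrix per time step
print(inputs[0])                            # rows encode X[:, 0], i.e. indices 0 and 5
```

The list has one entry per time step, and each entry holds one one-hot row per sample in the batch.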

Initializing Model Parameters

python

num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)

num_inputs = d = number of features = one-hot vector length = dictionary size
num_hiddens = h = number of hidden units (a hyperparameter)
num_outputs = q = number of outputs (= number of classes)

With these in mind, the RNN equations from above are easier to read:

$\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h).$

where $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$, $\boldsymbol{H}_t \in \mathbb{R}^{n \times h}$, $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$, $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$, and $\boldsymbol{b}_{h} \in \mathbb{R}^{1 \times h}$.

$\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.$

where $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ and $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$.

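Defining the Model

First initialize the hidden state. The function returns a tuple containing a single zero-valued tensor of shape (batch size, number of hidden units). A tuple is used so that hidden states made up of several tensors are easier to handle later: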

python

def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

Next, define the RNN computation:

python

def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Loop over the time steps, computing each one in turn
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

Finally, remember to check the dimensions:

python

X = torch.arange(10).view(2, 5)  # assumed sample input: batch_size=2, num_steps=5
state = init_rnn_state(X.shape[0], num_hiddens, device)
inputs = to_onehot(X.to(device), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
print(len(outputs), outputs[0].shape, state_new[0].shape)

5 torch.Size([2, 1027]) torch.Size([2, 256])

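Defining the Training and Prediction Functions

Before training, two concepts are introduced: gradient clipping, which keeps the parameter updates stable, and perplexity, which is used to evaluate the model during training.

Gradient Clipping

Recurrent neural networks are prone to vanishing or exploding gradients, which can make the network almost impossible to train. Gradient clipping is a way to deal with exploding gradients. Suppose the gradients of all model parameters are concatenated into one vector $\boldsymbol{g}$ and the clipping threshold is $\theta$. The clipped gradient

$\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\boldsymbol{g}$

has an $L_2$ norm of at most $\theta$.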

python

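def grad_clipping(params, theta, device):
    # Materialize params so that a generator such as model.parameters()
    # can be traversed twice (once for the norm, once for the rescaling)
    params = list(params)
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()  # L2 norm of all gradients taken together
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)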
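Perplexity

Perplexity is used to evaluate how good a language model is. It is the value obtained by exponentiating the cross-entropy loss.

Best case: the model always assigns probability 1 to the label class, and the perplexity is 1;
Worst case: the model always assigns probability 0 to the label class, and the perplexity is positive infinity;
Baseline (a random classifier): the model always assigns the same probability to every class, and the perplexity equals the number of classes.

Any useful model must therefore have a perplexity below the number of classes. Here, that means the perplexity must be below the dictionary size vocab_size.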
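The best case and the baseline case can be reproduced with a few lines of standard-library Python. This is a sketch: the perplexity helper is a hypothetical name, and its input is assumed to be the probability the model assigned to the true label at each position.

```python
import math


def perplexity(label_probs):
    """exp of the average cross-entropy, given the probability assigned to each true label."""
    avg_nll = -sum(math.log(p) for p in label_probs) / len(label_probs)
    return math.exp(avg_nll)


vocab_size = 1027  # dictionary size seen in the shape check earlier

best = perplexity([1.0, 1.0, 1.0])             # perfect predictions
baseline = perplexity([1.0 / vocab_size] * 3)  # uniform guessing over the dictionary
middle = perplexity([0.5, 0.25, 0.125])        # somewhere in between
print(best, baseline, middle)
```

The uniform model lands at the dictionary size, and the third case works out to 4 because the average negative log-probability is $2\ln 2$.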

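Training and Prediction Functions

The random sampling and consecutive sampling methods for sequential data from the previous article are reused here.

Note: the hidden state is initialized differently depending on the sampling method.
With consecutive sampling, two successive mini-batches are contiguous in time, so the model initializes the current hidden state with the final hidden state of the previous mini-batch. In practice this means the hidden state is initialized once at the start of each epoch rather than at every iteration.
Without detach_(), the hidden state carried over from the previous iteration stays attached to the computational graph: it is a non-leaf node with requires_grad = True. The graphs of successive iterations then remain connected, because backpropagation never reaches a stopping point until the hidden state is next re-initialized. The graph grows very large, and so does the cost of computing gradients.
Calling detach_() at every iteration avoids this. The values of the previous hidden state are still used, but the history is cut: the previous output becomes a leaf node with requires_grad = False, and gradient computation starts afresh from there.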
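The effect of detach_() can be observed directly on a toy hidden state. The snippet below is a minimal sketch: W stands in for W_hh, the sizes are made up, and the constant 1.0 replaces the input term only to keep the example short.

```python
import torch

W = torch.randn(3, 3, requires_grad=True)  # stand-in for W_hh
H = torch.zeros(1, 3)                      # initial hidden state

# First mini-batch: H becomes a non-leaf node that remembers how it was computed
H = torch.tanh(H @ W + 1.0)
before = (H.is_leaf, H.requires_grad, H.grad_fn is not None)
print(before)

# Before the next mini-batch: keep the values, drop the history
H.detach_()
after = (H.is_leaf, H.requires_grad, H.grad_fn is None)
print(after)

# Second mini-batch: backward() now stops at the detached H
loss = torch.tanh(H @ W + 1.0).sum()
loss.backward()
print(W.grad.shape)
```

After detach_(), the gradient of the second loss reaches W only through the second step, which is what keeps the cost per mini-batch bounded.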

python

def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = dl_2.data_iter_random
    else:
        data_iter_fn = dl_2.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # with random sampling, re-initialize the hidden state before every mini-batch
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # otherwise detach the hidden state from the computational graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # after concatenation the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector of
            # shape (num_steps * batch_size,) so that it lines up with the rows of outputs
            y = torch.flatten(Y.T)
            # average classification error under the cross-entropy loss
            l = loss(outputs, y.long())

            # zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            dl_2.sgd(params, lr, 1)  # the loss is already averaged, so the gradients need no further averaging
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, device, idx_to_char, char_to_idx))
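The training function calls predict_rnn, which is not listed in this article. The sketch below is one plausible version, assumed to mirror the predict_rnn_pytorch function defined later: it feeds the prefix one character at a time, then keeps feeding back the most likely next character. The to_onehot variant built on F.one_hot, the toy_rnn stand-in, and the four-letter vocabulary are all illustrative assumptions, included only so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F


def to_onehot(X, n_class):
    # Same role as the article's to_onehot, written with F.one_hot for brevity
    return [F.one_hot(X[:, i].long(), n_class).float() for i in range(X.shape[1])]


def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )


def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)  # batch of one sequence
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # The previous character is the input for the current time step
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        (Y, state) = rnn(X, state, params)
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])  # still consuming the prefix
        else:
            output.append(int(Y[0].argmax(dim=1).item()))  # greedy choice
    return ''.join([idx_to_char[i] for i in output])


def toy_rnn(inputs, state, params):
    # Hypothetical stand-in model: always scores "current index + 1" highest
    return [torch.roll(X, shifts=1, dims=1) for X in inputs], state


idx_to_char = ['a', 'b', 'c', 'd']
char_to_idx = {c: i for i, c in enumerate(idx_to_char)}
result = predict_rnn('ab', 3, toy_rnn, None, init_rnn_state, 8, 4,
                     torch.device('cpu'), idx_to_char, char_to_idx)
print(result)
```

With the real rnn and trained params in place of toy_rnn and None, the same loop would produce the lyric continuations.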

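Training the Model and Generating Lyrics

Every 50 epochs (pred_period), the model generates a lyric of pred_len = 50 characters (not counting the prefix) for each of the prefixes “喜欢” ("like") and “分手” ("break up").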

python

num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['喜欢', '分手']

Results with random sampling:

python

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

epoch 50, perplexity 70.843629, time 0.60 sec
喜欢我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我
分手我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我不要再想我
…
epoch 250, perplexity 1.301393, time 0.60 sec
喜欢一只在娘妥依话就停驳别底在角落不爽就反驳到底拽什么懂不懂篮球有种不要走三对三斗牛三
分手那只么一步两步三颗四步望著天看星星一颗两颗三颗四颗连成线一著背默默许下心愿看远方的星是否

Results with consecutive sampling:

python

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

epoch 50, perplexity 61.914999, time 0.61 sec
喜欢我想要这你谁我有你想一直我想一空我想你的可爱女人坏坏我有你谁我有你想一直我想一空
分手我想要这你谁我有你想一直我想一空我想你的可爱女人坏坏我有你谁我有你想一直我想一空
…
epoch 250, perplexity 1.160371, time 0.60 sec
喜欢一候在一只悲的我有你的有模有样什么兵器最喜欢双截棍柔中带刚想要去河南嵩山学少林跟武当快
分手一候她如果我都没有错亏我叫你一声爸爸我回来了不要再这样打我妈妈你以你当榜样好多的假像

Implementing an RNN with PyTorch

Defining the Model

Key constructor parameters of nn.RNN:

input_size - The number of expected features in the input x
hidden_size – The number of features in the hidden state h
nonlinearity – The non-linearity to use. Can be either ‘tanh’ or ‘relu’. Default: ‘tanh’
batch_first – If True, then the input and output tensors are provided as (batch_size, num_steps, input_size). Default: False

batch_first determines the layout of the input. With the default value False, the input has shape (num_steps, batch_size, input_size).

The arguments of the forward function are:

input of shape (num_steps, batch_size, input_size): tensor containing the features of the input sequence.
h_0 of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

The forward function returns:

output of shape (num_steps, batch_size, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t.
h_n of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the hidden state for t = num_steps.
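The shapes listed above can be confirmed on a random input. The sizes below reuse the values from this article (35 time steps, a dictionary of 1027 characters, 256 hidden units) with an assumed batch size of 2; the variable names are chosen for this sketch only.

```python
import torch
import torch.nn as nn

num_steps, batch_size, vocab_size, num_hiddens = 35, 2, 1027, 256
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)  # batch_first=False

X = torch.rand(num_steps, batch_size, vocab_size)
output, h_n = rnn_layer(X)  # h_0 omitted, so it defaults to zeros
print(output.shape, h_n.shape)

# With batch_first=True only the input/output layout changes; h_n keeps its layout
rnn_bf = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens, batch_first=True)
output_bf, h_n_bf = rnn_bf(X.transpose(0, 1))
print(output_bf.shape, h_n_bf.shape)
```

For a single-layer, unidirectional RNN, the last time step of output holds the same values as h_n[0].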

Building the RNN Model

python

class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1) 
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)

    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state

Defining the Prediction Function

python

def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                      char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output holds the prefix followed by the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward pass does not need the model parameters passed in
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])

Defining the Training Function

Only consecutive sampling is used.

python

def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                corpus_indices, idx_to_char, char_to_idx,
                                num_epochs, num_steps, lr, clipping_theta,
                                batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = dl_2.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # consecutive sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # detach the hidden state from the computational graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else: 
                    state.detach_()
            (output, state) = model(X, state) # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())
            
            optimizer.zero_grad()
            l.backward()
            grad_clipping(model.parameters(), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        
        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char,
                    char_to_idx))

Training the Model

python

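num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['喜欢', '分手']
# Assumption: model wraps a single-layer nn.RNN (its construction is not shown in this article)
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
model = RNNModel(rnn_layer, vocab_size).to(device)
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                            corpus_indices, idx_to_char, char_to_idx,
                            num_epochs, num_steps, lr, clipping_theta,
                            batch_size, pred_period, pred_len, prefixes)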

epoch 50, perplexity 1.015603, time 0.37 sec
喜欢人潮中你只属于我的那画面经过苏美女神身边我以女神之名许愿思念像底格里斯河般的漫延当古文明只
分手一切当年家你想大声布对你依依不舍连隔壁邻居都猜到我现在的感受河边的风在吹着头发飘动
…
epoch 250, perplexity 1.006805, time 0.37 sec
喜欢在潮中你融化在宇宙里我每天每天每天在想想想想著你这样的甜蜜让我开始乡相信命运感谢地心引力
分手那回忆的路上时间变好慢老街坊小弄堂是属于那年代白墙黑瓦的淡淡的忧伤消失的旧时光一九