DL: Deep Learning Notes

Xiaopeng Xu included in Technology

Jun 14, 2021 1741 words 4 minutes

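NN basics

Binary classification

Given a 64x64x3 image, decide whether it contains a cat (1) or not (0).

Basic notation

Each row of X corresponds to one sample, and each column corresponds to one feature.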

Logistic regression

In the binary case, the corresponding activation function is the sigmoid function.

Sigmoid function
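
The formula under this heading is not reproduced in these notes. A minimal sketch, assuming the standard definition sigmoid(z) = 1 / (1 + exp(-z)); the function name `sigmoid` and the sample inputs are illustrative choices, not taken from the original.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid applied elementwise: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs (assumption): a negative, a zero and a positive score.
z = np.array([-5.0, 0.0, 5.0])
a = sigmoid(z)  # every value lies strictly between 0 and 1, and sigmoid(0) is 0.5
```

Because the output lies in (0, 1), it can be read as the predicted probability that the label is 1.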

Cost function

Goal:

Loss (error) function

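The loss function is the error computed for each individual sample.
With a sigmoid output, the MSE loss makes the optimization non-convex, so gradient descent can get stuck in a local minimum instead of finding the global minimum; it is therefore rarely used in practice.
What is more commonly used in practice is the following:
- Example analysis: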
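
The formula itself did not survive in these notes. The loss usually paired with a sigmoid output is binary cross-entropy, and the block below assumes that this is the one meant here; the symbols $\hat{y}$ (prediction) and $y$ (label) are notation chosen for this sketch.

```latex
\mathcal{L}(\hat{y}, y) = -\bigl( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \bigr)

% Case analysis:
% y = 1:  \mathcal{L} = -\log \hat{y},        small only when \hat{y} is close to 1
% y = 0:  \mathcal{L} = -\log (1 - \hat{y}),  small only when \hat{y} is close to 0
```

Under this assumption, the loss pushes the prediction toward the true label in both cases, and it is convex in w and b for logistic regression.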

Cost function

The cost function is the average of the loss over all samples; it is computed only once per iteration.
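
The relationship between loss and cost can be sketched as follows. This assumes the per-sample loss is binary cross-entropy (the notes do not show the formula); the function names and the toy labels and predictions are made up for illustration.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Per-sample loss (assumed form): one value for each sample."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost: the average of the per-sample losses, a single scalar."""
    return np.mean(cross_entropy_loss(y_hat, y))

# Toy labels and predictions (assumption, for illustration only).
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])

losses = cross_entropy_loss(y_hat, y)  # shape (4,), one loss per sample
J = cost(y_hat, y)                     # scalar, computed once per iteration
```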

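Gradient descent

Goal: find the w and b that minimize the cost function (the average loss).

Gradient

Idea: in each iteration, update the parameters by a step against the derivative, scaled by a learning rate, so that they approach the minimum.
The gradient is obtained from the partial derivatives with respect to each parameter.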
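
A minimal sketch of this update loop for logistic regression. It assumes a cross-entropy loss, rows of X as samples, and hypothetical defaults for `learning_rate` and `num_iterations`; the toy data set is invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, learning_rate=0.1, num_iterations=1000):
    """X: (m, n), one sample per row. y: (m,) labels in {0, 1}."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iterations):
        y_hat = sigmoid(X @ w + b)   # forward pass
        dz = y_hat - y               # dL/dz for sigmoid + cross-entropy
        dw = X.T @ dz / m            # partial derivatives of the cost w.r.t. w
        db = dz.mean()               # partial derivative of the cost w.r.t. b
        w -= learning_rate * dw      # step against the gradient
        b -= learning_rate * db
    return w, b

# Toy data (assumption): the label is 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = gradient_descent(X, y)
predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
```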

Derivative

Derivative of a linear function

Derivative of a quadratic function
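
The worked examples for these two headings are missing from the notes. As a hedged illustration, the sketch below estimates slopes numerically with a central difference; the functions f(a) = 3a and f(a) = a^2 and the helper `numerical_derivative` are assumptions chosen for the example.

```python
def numerical_derivative(f, a, eps=1e-6):
    """Central-difference estimate of df/da at the point a."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)

def linear(a):
    return 3 * a      # slope is the same everywhere

def quadratic(a):
    return a ** 2     # slope 2a depends on where it is measured

slope_linear_at_2 = numerical_derivative(linear, 2.0)
slope_linear_at_5 = numerical_derivative(linear, 5.0)
slope_quadratic_at_2 = numerical_derivative(quadratic, 2.0)
slope_quadratic_at_5 = numerical_derivative(quadratic, 5.0)
```

The linear function gives approximately 3 at both points, while the quadratic gives approximately 4 at a = 2 and 10 at a = 5.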

Computation graph

Computing derivatives with a computation graph
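
The graph used in the original is not shown. A small sketch under the assumption of the function J = 3(a + b*c), split into the intermediate nodes u and v; all names and input values here are illustrative.

```python
# Forward pass: evaluate the graph from left to right.
a, b, c = 5.0, 3.0, 2.0
u = b * c        # 6.0
v = a + u        # 11.0
J = 3 * v        # 33.0

# Backward pass: apply the chain rule from right to left.
dJ_dv = 3.0               # J = 3v
dJ_da = dJ_dv * 1.0       # v = a + u, so dv/da = 1
dJ_du = dJ_dv * 1.0       # v = a + u, so dv/du = 1
dJ_db = dJ_du * c         # u = b*c, so du/db = c
dJ_dc = dJ_du * b         # u = b*c, so du/dc = b
```

Each derivative reuses the one computed for the node to its right, which is why a single backward sweep is enough.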

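Deep neural networks

Notation
- L is the number of layers
- n[l] is the number of units in layer l
- z[l] is the pre-activation of layer l: W[l]*a[l-1] + b[l], where a[0] = x
- a[l] is the activation of layer l, g(z[l]); g is commonly ReLU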
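
The notation above can be sketched as a forward pass over the layers. This assumes samples are stored as columns of x and, as a simplification, applies ReLU in every layer including the last; the function names and layer sizes are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, weights, biases):
    """x: (n[0], m), one sample per column.
    weights[l-1] has shape (n[l], n[l-1]); biases[l-1] has shape (n[l], 1)."""
    a = x                            # a[0] = x
    for W, b in zip(weights, biases):
        z = W @ a + b                # z[l] = W[l] a[l-1] + b[l]
        a = relu(z)                  # a[l] = g(z[l])
    return a

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]              # n[0], n[1], n[2] (illustrative)
weights = [rng.standard_normal((layer_sizes[l], layer_sizes[l - 1]))
           for l in range(1, len(layer_sizes))]
biases = [np.zeros((layer_sizes[l], 1)) for l in range(1, len(layer_sizes))]

x = rng.standard_normal((3, 5))      # 5 samples with 3 features each
a_L = forward(x, weights, biases)    # shape (n[L], m) = (2, 5)
```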

RNN

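Intuitively, an RNN adds memory over the time sequence to a conventional NN. Large networks such as the Transformer additionally gain a CNN-like abstraction capability (namely multi-head attention).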

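Applications

Speech recognition, music generation, sentiment classification, DNA sequence analysis, machine translation, video activity recognition, named entity recognition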

Word representation

Why not use a standard neural network?

A simple RNN

Forward propagation
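
The forward equations are not reproduced in these notes. A sketch of one forward step, assuming the usual form a_next = tanh(Wax xt + Waa a_prev + ba) followed by a softmax prediction. The parameter names and the cache layout follow the `rnn_cell_backward` code later in these notes; the `softmax` helper and the random test shapes are assumptions.

```python
import numpy as np

def softmax(x):
    """Column-wise softmax; subtracting the max keeps exp() stable."""
    e = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """xt: (n_x, m) input at step t. a_prev: (n_a, m) previous hidden state."""
    Wax, Waa, Wya = parameters["Wax"], parameters["Waa"], parameters["Wya"]
    ba, by = parameters["ba"], parameters["by"]

    a_next = np.tanh(Wax @ xt + Waa @ a_prev + ba)   # new hidden state
    yt_pred = softmax(Wya @ a_next + by)             # prediction at step t
    cache = (a_next, a_prev, xt, parameters)         # kept for the backward pass
    return a_next, yt_pred, cache

# Random shapes for illustration only.
rng = np.random.default_rng(1)
n_x, n_a, n_y, m = 3, 5, 2, 4
parameters = {
    "Wax": rng.standard_normal((n_a, n_x)),
    "Waa": rng.standard_normal((n_a, n_a)),
    "Wya": rng.standard_normal((n_y, n_a)),
    "ba": np.zeros((n_a, 1)),
    "by": np.zeros((n_y, 1)),
}
xt = rng.standard_normal((n_x, m))
a_prev = np.zeros((n_a, m))
a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
```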

Simplified notation

Backpropagation through time

RNN types

Many-to-many, many-to-one, one-to-one, one-to-many

GRU (Gated Recurrent Unit)

RNN unit

GRU (simplified)

Full GRU

u=update, r=remember
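
The GRU equations are not shown in these notes. The sketch below assumes the commonly taught full-GRU form, where the r gate (called "remember" here, often named the relevance or reset gate elsewhere) scales the previous state inside the candidate, and the update gate mixes candidate and previous state. The parameter names `Wu`, `Wr`, `Wc`, `bu`, `br`, `bc` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_forward(xt, c_prev, parameters):
    """One GRU step. xt: (n_x, m). c_prev: (n_c, m).
    Each weight has shape (n_c, n_c + n_x); each bias has shape (n_c, 1)."""
    Wu, bu = parameters["Wu"], parameters["bu"]   # update gate
    Wr, br = parameters["Wr"], parameters["br"]   # r gate
    Wc, bc = parameters["Wc"], parameters["bc"]   # candidate value

    concat = np.concatenate((c_prev, xt), axis=0)
    gamma_u = sigmoid(Wu @ concat + bu)
    gamma_r = sigmoid(Wr @ concat + br)

    # The candidate sees the previous state scaled by the r gate.
    gated_concat = np.concatenate((gamma_r * c_prev, xt), axis=0)
    c_tilde = np.tanh(Wc @ gated_concat + bc)

    # "1 - update" takes the place of a separate forget gate.
    c_next = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_next

# Random shapes for illustration only.
rng = np.random.default_rng(0)
n_x, n_c, m = 3, 4, 2
parameters = {
    "Wu": rng.standard_normal((n_c, n_c + n_x)), "bu": np.zeros((n_c, 1)),
    "Wr": rng.standard_normal((n_c, n_c + n_x)), "br": np.zeros((n_c, 1)),
    "Wc": rng.standard_normal((n_c, n_c + n_x)), "bc": np.zeros((n_c, 1)),
}
xt = rng.standard_normal((n_x, m))
c_prev = rng.standard_normal((n_c, m))
c_next = gru_cell_forward(xt, c_prev, parameters)
```

When the update gate is close to 0, the state is copied through almost unchanged, which is what lets the unit carry information across many time steps.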

LSTM (long short-term memory)

u=update, r=remember, o=output, f=forget

Compared with the LSTM, the GRU no longer passes a separate hidden state a between time steps, drops the forget gate in favor of "1 - update", and has a remember gate that plays a role similar to the output gate.

This significantly improves computational efficiency.

Forward propagation illustration

A gate is a sigmoid applied to an affine map with its own weight W and bias b. The softmax prediction layer also has its own W and b.

Forget gate

Wf: forget gate weight, $\mathbf{W}_f$
bf: forget gate bias, $\mathbf{b}_f$
ft: forget gate, $\mathbf{\Gamma}_f^{\langle t \rangle}$

Candidate value

cct: candidate value, $\tilde{\mathbf{c}}^{\langle t \rangle}$

Update gate

Wi: update gate weight, $\mathbf{W}_i$
bi: update gate bias, $\mathbf{b}_i$
it: update gate, $\mathbf{\Gamma}_i^{\langle t \rangle}$

Cell state

c: cell state, including all time steps, $\mathbf{c}$, shape $(n_a, m, T_x)$
c_next: new (next) cell state, $\mathbf{c}^{\langle t \rangle}$, shape $(n_a, m)$
c_prev: previous cell state, $\mathbf{c}^{\langle t-1 \rangle}$, shape $(n_a, m)$

Output gate

Wo: output gate weight, $\mathbf{W}_o$
bo: output gate bias, $\mathbf{b}_o$
ot: output gate, $\mathbf{\Gamma}_o^{\langle t \rangle}$
a: hidden state, including all time steps, $\mathbf{a}$, shape $(n_a, m, T_x)$
a_prev: hidden state from the previous time step, $\mathbf{a}^{\langle t-1 \rangle}$, shape $(n_a, m)$
a_next: hidden state for the next time step, $\mathbf{a}^{\langle t \rangle}$, shape $(n_a, m)$
y_pred: prediction, including all time steps, $\mathbf{y}_{pred}$, shape $(n_y, m, T_x)$
yt_pred: prediction for the current time step $t$, $\mathbf{y}_{pred}^{\langle t \rangle}$, shape $(n_y, m)$

Code

import numpy as np

# Helper functions (assumed definitions; the original uses them without showing them)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"] # forget gate weight
    bf = parameters["bf"] # forget gate bias
    Wi = parameters["Wi"] # update gate weight (note the name: "i", not "u")
    bi = parameters["bi"] # update gate bias (note the name: "i", not "u")
    Wc = parameters["Wc"] # candidate value weight
    bc = parameters["bc"] # candidate value bias
    Wo = parameters["Wo"] # output gate weight
    bo = parameters["bo"] # output gate bias
    Wy = parameters["Wy"] # prediction weight
    by = parameters["by"] # prediction bias

    # Stack a_prev (n_a, m) on top of xt (n_x, m) -> (n_a + n_x, m)
    concat = np.concatenate((a_prev, xt), axis=0)

    # Compute the gates, the candidate value, the new cell state and the new hidden state
    ft = sigmoid(Wf.dot(concat) + bf)       # forget gate
    it = sigmoid(Wi.dot(concat) + bi)       # update gate
    cct = np.tanh(Wc.dot(concat) + bc)      # candidate value
    c_next = ft * c_prev + it * cct         # new cell state
    ot = sigmoid(Wo.dot(concat) + bo)       # output gate
    a_next = ot * np.tanh(c_next)           # new hidden state

    # Compute the prediction of the LSTM cell
    yt_pred = softmax(Wy.dot(a_next) + by)

    # Store the values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

Backpropagation

Formulas

$dx^{\langle t \rangle}$ is represented by dxt,
$dW_{ax}$ is represented by dWax,
$da_{prev}$ is represented by da_prev,
$dW_{aa}$ is represented by dWaa,
$db_a$ is represented by dba,
dz is not derived above, but it can optionally be derived to simplify the repeated calculations.

Code

import numpy as np

def rnn_cell_backward(da_next, cache):
    # Retrieve values from cache
    (a_next, a_prev, xt, parameters) = cache

    # Retrieve the weights needed for the gradients
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]

    # Gradient through tanh, using a_next = tanh(Wax.xt + Waa.a_prev + ba),
    # so the derivative of tanh is 1 - a_next ** 2
    dtanh = da_next * (1 - a_next ** 2)

    # Gradients with respect to the input xt and the weight Wax
    dxt = Wax.transpose().dot(dtanh)
    dWax = dtanh.dot(xt.transpose())

    # Gradients with respect to the previous hidden state a_prev and the weight Waa
    da_prev = Waa.transpose().dot(dtanh)
    dWaa = dtanh.dot(a_prev.transpose())

    # Gradient with respect to the bias ba (summed over the batch)
    dba = np.sum(dtanh, axis=1, keepdims=True)

    # Store the gradients in a Python dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}

    return gradients
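
One way to gain confidence in gradients like these is a numerical check. The sketch below re-implements the same tanh step in compact form (`rnn_step` and `rnn_step_backward` are hypothetical names) and compares the analytic dWax against a central-difference estimate, assuming a scalar loss whose gradient with respect to a_next is da_next.

```python
import numpy as np

def rnn_step(xt, a_prev, Wax, Waa, ba):
    return np.tanh(Wax @ xt + Waa @ a_prev + ba)

def rnn_step_backward(da_next, xt, a_prev, Wax, Waa, ba):
    a_next = rnn_step(xt, a_prev, Wax, Waa, ba)
    dtanh = da_next * (1 - a_next ** 2)
    return {
        "dxt": Wax.T @ dtanh,
        "da_prev": Waa.T @ dtanh,
        "dWax": dtanh @ xt.T,
        "dWaa": dtanh @ a_prev.T,
        "dba": dtanh.sum(axis=1, keepdims=True),
    }

# Random shapes for illustration only.
rng = np.random.default_rng(0)
n_x, n_a, m = 3, 4, 2
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da_next = rng.standard_normal((n_a, m))

def loss(Wax_candidate):
    # Scalar loss chosen so that d(loss)/d(a_next) equals da_next.
    return np.sum(da_next * rnn_step(xt, a_prev, Wax_candidate, Waa, ba))

analytic = rnn_step_backward(da_next, xt, a_prev, Wax, Waa, ba)["dWax"]

# Central differences, perturbing one entry of Wax at a time.
eps = 1e-6
numeric = np.zeros_like(Wax)
for i in range(n_a):
    for j in range(n_x):
        W_plus, W_minus = Wax.copy(), Wax.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

A very small `max_abs_diff` suggests the analytic formula for dWax is consistent with the forward computation; the same loop can be repeated for Waa, ba, xt and a_prev.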