ML|Hung-yi Lee

date
Oct 19, 2024
slug
ml-hung-yi-lee
status
Published
tags
AI
Pytorch
type
Post

Introduction

Machine Learning ≈ Looking for a Function
  • Regression: The function outputs a scalar
  • Classification: Given options, the function outputs the correct one
  • Structured Learning: Create something with structure
Steps to find the function (the three steps of ML):
  1. Define a function with unknown parameters
  2. Define the loss from training data
  3. Optimization

Neuron and Neural Network

We can approximate any function by stacking many sigmoid functions; by adjusting w, b, and c we can shape each sigmoid as we like. Each sigmoid function is one neuron.
notion image
notion image
notion image
notion image
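As a rough sketch (the layer width, target function, and training constants below are my own assumptions, not the course code), the three steps above can be run end to end by fitting a sum of sigmoid neurons to a toy function:

```python
import torch

# Step 1: a function with unknown parameters w, b, c (a sum of sigmoid neurons).
# Step 2: loss defined from training data (mean squared error against y = sin(x)).
# Step 3: optimization by gradient descent.
x = torch.linspace(-3, 3, 200).unsqueeze(1)      # training inputs, shape (200, 1)
y = torch.sin(x)                                 # toy target function

w = torch.randn(1, 16, requires_grad=True)       # slope of each sigmoid
b = torch.randn(16, requires_grad=True)          # horizontal shift of each sigmoid
c = torch.randn(16, 1, requires_grad=True)       # height of each sigmoid
b0 = torch.zeros(1, requires_grad=True)          # overall bias

opt = torch.optim.SGD([w, b, c, b0], lr=0.1)
for step in range(2000):
    y_hat = b0 + torch.sigmoid(x @ w + b) @ c    # each sigmoid column is one neuron
    loss = torch.mean((y_hat - y) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```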

Hw1

Code

General Guide

Optimization Issue

notion image
  • Gain insights from comparison.
  • Start from shallower networks (or other models), which are easier to optimize.
  • If deeper networks do not obtain smaller loss on the training data, then there is an optimization issue.
Solution: more powerful optimization technology.

Local Minima and Saddle Points

Points where the gradient = 0 are called critical points.
notion image
The loss near $\theta'$ can be approximated with a Taylor series: $L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \frac{1}{2} (\theta - \theta')^T H (\theta - \theta')$
$g = \nabla L(\theta')$ is the gradient (the first-order derivatives); $H$ is the Hessian matrix of second-order derivatives, $H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}(\theta')$.
notion image
  1. For all $\mathbf{v}$: $\mathbf{v}^T H \mathbf{v} > 0$
      • Around $\theta'$: $L(\theta) > L(\theta')$
      • Local minimum
      • $H$ is positive definite ⇔ all eigenvalues are positive.
  2. For all $\mathbf{v}$: $\mathbf{v}^T H \mathbf{v} < 0$
      • Around $\theta'$: $L(\theta) < L(\theta')$
      • Local maximum
      • $H$ is negative definite ⇔ all eigenvalues are negative.
  3. Sometimes $\mathbf{v}^T H \mathbf{v} > 0$, sometimes $\mathbf{v}^T H \mathbf{v} < 0$
      • Saddle point
      • Some eigenvalues are positive and some are negative.
      • In this case, take $\mathbf{v}$ to be an eigenvector $\mathbf{u}$ of $H$; then $\mathbf{u}^T H \mathbf{u} = \mathbf{u}^T(\lambda\mathbf{u}) = \lambda\|\mathbf{u}\|^2$.
      • Picking an eigenvalue $\lambda < 0$ gives $\mathbf{u}^T H \mathbf{u} < 0$, i.e. $L(\theta) < L(\theta')$. Setting $\theta = \theta' + \mathbf{u}$ therefore escapes the saddle point and decreases $L$, but this is not used in practice because the computation is too expensive.
When there are many parameters, local minima are rare and most critical points are saddle points; in high dimensions there is almost always a direction along which the loss can still decrease.
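A small sketch of the classification above, using a toy two-parameter loss (my own example) and the Hessian's eigenvalues at a critical point:

```python
import torch

# Toy loss with a critical point at (0, 0): L(θ) = θ0² − θ1²
def loss_fn(theta):
    return theta[0] ** 2 - theta[1] ** 2

theta = torch.zeros(2)                                # gradient is 0 here
H = torch.autograd.functional.hessian(loss_fn, theta)
eigvals = torch.linalg.eigvalsh(H)                    # eigenvalues of the symmetric Hessian

if (eigvals > 0).all():
    print("local minimum")                            # H positive definite
elif (eigvals < 0).all():
    print("local maximum")                            # H negative definite
else:
    print("saddle point:", eigvals.tolist())          # mixed signs
```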

Overfitting

notion image
notion image

Cross Validation

notion image

N-fold Cross Validation

notion image

Batch and Momentum

Small Batch and Large Batch

notion image

Momentum

notion image
Starting at $\theta^0$, movement $m^0 = 0$.
Compute gradient $g^0$; movement $m^1 = \lambda m^0 - \eta g^0$; move to $\theta^1 = \theta^0 + m^1$.
Compute gradient $g^1$; movement $m^2 = \lambda m^1 - \eta g^1$; move to $\theta^2 = \theta^1 + m^2$.
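A minimal sketch of this momentum update on a toy quadratic loss (the loss, learning rate, and momentum coefficient are assumptions):

```python
import torch

# Gradient descent with momentum, following m^{t+1} = λ m^t − η g^t.
theta = torch.tensor([3.0, -2.0])
m = torch.zeros_like(theta)        # movement m^0 = 0
eta, lam = 0.1, 0.9                # learning rate η and momentum coefficient λ

for t in range(100):
    g = 2 * theta                  # gradient of L(θ) = ||θ||²
    m = lam * m - eta * g          # movement also depends on the previous movement
    theta = theta + m              # θ^{t+1} = θ^t + m^{t+1}
```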

Adaptive Learning Rate

Adagrad

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$, where $\sigma_i^t = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} \left(g_i^k\right)^2}$ is the root mean square of the past gradients of parameter $i$.
notion image

RMSProp

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$, where $\sigma_i^t = \sqrt{\alpha \left(\sigma_i^{t-1}\right)^2 + (1 - \alpha)\left(g_i^t\right)^2}$
This gives more weight to recent gradients, so gradients from the distant past have less influence.
Adam = RMSProp + Momentum
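A rough sketch of the per-parameter adaptive step size on a toy loss (constants are assumptions); it computes both the Adagrad and RMSProp versions of $\sigma_i^t$ and uses the RMSProp one in the update:

```python
import torch

theta = torch.tensor([3.0, -2.0])
eta, alpha, eps = 0.1, 0.9, 1e-8
sum_sq = torch.zeros_like(theta)   # Adagrad: running sum of squared gradients
ema_sq = torch.zeros_like(theta)   # RMSProp: moving average of squared gradients

for t in range(100):
    g = 2 * theta                                    # gradient of L(θ) = ||θ||²
    sum_sq = sum_sq + g ** 2
    sigma_adagrad = torch.sqrt(sum_sq / (t + 1))     # root mean square of all past gradients
    ema_sq = alpha * ema_sq + (1 - alpha) * g ** 2
    sigma_rmsprop = torch.sqrt(ema_sq)               # recent gradients weigh more
    theta = theta - eta / (sigma_rmsprop + eps) * g  # per-parameter step size η / σ
```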

Warm Up

notion image
notion image
The left figure shows the usual learning-rate schedule; the right shows Warm Up. Warm Up makes the model explore with small steps at the start, which reduces the variance of the statistic $\sigma_i^t$.
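A sketch of one possible warm-up schedule (linear increase followed by inverse-square-root decay; the peak learning rate and warm-up length are assumptions):

```python
# Learning rate at a given step: linear warm-up, then inverse-square-root decay.
def lr_at(step, peak_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # small steps while statistics are noisy
    return peak_lr * (warmup_steps / step) ** 0.5   # decay after the warm-up phase
```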

Batch Normalization

notion image
If $x_2$ is much larger than $x_1$, then $x_2$ contributes far more to $L$ than $x_1$, which gives the situation in the left figure. Normalization turns it into the right figure.
And since it is impractical to compute the mean and standard deviation over all the data flowing through the network, we compute them over a single batch instead.
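A minimal sketch of normalizing one batch per feature, plus the equivalent PyTorch module (the batch size and feature count are assumptions):

```python
import torch

x = torch.randn(32, 4)                       # a batch of 32 examples with 4 features
mu = x.mean(dim=0)                           # per-feature mean over the batch
sigma = x.std(dim=0, unbiased=False)         # per-feature std over the batch
x_norm = (x - mu) / (sigma + 1e-5)           # normalized features

# nn.BatchNorm1d does the same thing, plus a learnable scale/shift
# and running statistics for use at test time.
bn = torch.nn.BatchNorm1d(4)
y = bn(x)
```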

Hw2

Code

CNN

notion image
notion image
Pooling is optional; for example, AlphaGo uses a CNN but no pooling.
For more on this topic, see the CNN section of another article.
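A rough CNN sketch in PyTorch (the layer sizes, input resolution, and class count are assumptions, not the homework model):

```python
import torch.nn as nn

# Pooling is optional: remove MaxPool2d (and adjust the Linear input size)
# for an AlphaGo-style all-convolution network.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3-channel image -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halves the height and width
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),                  # assumes 32x32 inputs and 10 classes
)
```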

Hw3

Code

ResNet18

from CSDN
notion image
notion image

Self-attention

notion image
notion image
notion image
notion image
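A minimal sketch of single-head self-attention, $\text{softmax}(QK^T/\sqrt{d})\,V$, with assumed shapes (5 tokens, dimension 8):

```python
import torch

x = torch.randn(5, 8)                                 # 5 tokens, dimension 8
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))    # projection matrices (random here)

Q, K, V = x @ Wq, x @ Wk, x @ Wv
attn = torch.softmax(Q @ K.T / (8 ** 0.5), dim=-1)    # attention weights, shape (5, 5)
out = attn @ V                                        # each token: weighted sum over all tokens
```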

Multi-head

notion image
notion image
Multi-head self-attention is essentially a "CNN Pro Max".
Multiple heads play the role of the multiple convolution kernels in a CNN (which determine the number of output channels).
At the same time, each head's self-attention can learn the range it attends to by adjusting its weights, instead of being limited to the fixed kernel-size box of a CNN.
Each head performs self-attention for every pixel, which is analogous to a filter sweeping over the image once.
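For multi-head self-attention, PyTorch's built-in module can be used directly; the embedding size and head count below are assumptions:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 8)          # (batch, sequence length, dimension)
out, weights = mha(x, x, x)       # self-attention: query = key = value = x
```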

Positional Encoding

No position information in self-attention.
  • Each position has a unique positional vector $e^i$
  • hand-crafted
  • learned from data
notion image
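A sketch of one common hand-crafted choice, the sinusoidal positional vectors from "Attention Is All You Need" (the function name is mine):

```python
import torch

def positional_encoding(seq_len, d_model):
    # assumes an even d_model
    pos = torch.arange(seq_len).float().unsqueeze(1)
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions
    return pe                         # e^i is added to the i-th token embedding
```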

Hw4

Code (no hyperparameter tuning; the results are poor)

Transformer

Encoder

notion image
notion image

Decoder

Comparison

Apart from the middle block that performs cross attention with the encoder output, the structure is the same as the encoder's.
notion image

Masked Self-attention

To keep the decoder from seeing future information, i.e. when decoding the i-th token of a sequence it may only rely on the outputs at time i and earlier, never on outputs after time i, we apply a mask so that the self-attention computation uses only the tokens up to time i. Because the decoder is used for prediction, and while training that ability we must not "peek at the answer", the future information has to be masked out.
notion image
For example, when the decoder's input is "器", it can only see "START", "机", and "器".
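A minimal sketch of how the mask is applied to the attention scores (the sequence length is an assumption):

```python
import torch

scores = torch.randn(5, 5)                                         # Q K^T / sqrt(d) for 5 tokens
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf"))                   # hide future positions
attn = torch.softmax(scores, dim=-1)     # row i only puts weight on positions <= i
```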

Autoregressive and Non-autoregressive

Autoregressive decoding adds an END mark to the vocabulary; when the decoder decides that the next token is the END mark, the sequence is finished.
notion image
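A rough sketch of autoregressive greedy decoding with an END token; `decoder`, `encoder_out`, and the token ids are hypothetical placeholders, not the course code:

```python
import torch

def greedy_decode(decoder, encoder_out, start_id=1, end_id=2, max_len=50):
    tokens = [start_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), encoder_out)   # (1, t, vocab size)
        next_id = logits[0, -1].argmax().item()                 # most likely next token
        if next_id == end_id:                                   # the sequence ends at END
            break
        tokens.append(next_id)
    return tokens[1:]
```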
Non-autoregressive decoding can:
  • Use another predictor to decide the output length
  • Output a very long sequence and ignore the tokens after END
The advantage is that decoding can be parallelized; the drawback is that it usually performs worse than autoregressive decoding.
notion image

Encoder-Decoder

notion image

Cross Attention

notion image
The cross attention does not have to use the output of the encoder's last layer.
notion image
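A minimal sketch of cross attention with assumed shapes: the queries come from the decoder, the keys and values from the encoder output:

```python
import torch

dec = torch.randn(4, 8)                               # 4 decoder positions (queries)
enc = torch.randn(6, 8)                               # 6 encoder positions (keys/values)
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))

Q, K, V = dec @ Wq, enc @ Wk, enc @ Wv
out = torch.softmax(Q @ K.T / (8 ** 0.5), dim=-1) @ V  # (4, 8): decoder attends to the encoder
```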

Training

During training, the decoder sees the ground truth, i.e. the true tokens, while during testing it sees its own output from the previous time step (the left figure shows training).
notion image
notion image
 
This causes a problem: during training the decoder only ever sees the ground truth and never sees any noise, so at test time, if the output at some step is wrong, the rest of the sequence can fall apart. Scheduled Sampling helps (feed the decoder a mix of its own outputs and the ground truth).
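A rough sketch of the Scheduled Sampling idea (the probability and names are assumptions): with some probability, the next decoder input is the model's own previous prediction instead of the ground-truth token.

```python
import torch

def next_decoder_input(ground_truth_token, predicted_token, p=0.3):
    # with probability p, feed the decoder its own previous prediction instead of ground truth
    use_prediction = torch.rand(()).item() < p
    return predicted_token if use_prediction else ground_truth_token
```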
 
notion image
The left is cross entropy; the right is BLEU (higher is better).
BLEU is not differentiable, so gradient descent cannot be applied to it directly. When you don't know how to optimize, just use reinforcement learning (RL)!

Hw5

to do
It feels quite hard.
If you have any questions about this article, feel free to contact me.