The following is a summary of notes taken while watching the Udemy course PyTorch: Deep Learning and Artificial Intelligence.
Gradient Descent
backbone of deep learning
k-means clustering
hidden Markov models
matrix factorization
Big learning rate: steps are too big, so it can overshoot the minimum or even diverge
small learning rate: it still converges, but very slowly
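As a quick sketch (my own, not from the course slides), here is plain gradient descent on a toy quadratic using PyTorch autograd; the loss f(w) = (w - 3)^2 and the learning rate value are just illustrative:

import torch

# toy example: minimize f(w) = (w - 3)^2 with plain gradient descent
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1   # illustrative value; too big overshoots, too small crawls

for step in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()                     # compute df/dw
    with torch.no_grad():
        w -= learning_rate * w.grad     # the gradient descent update
    w.grad.zero_()                      # clear the gradient for the next step

print(w.item())   # approaches 3.0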
Stochastic Gradient Descent
optimizer = 'sgd'
stochastic: each update uses a randomly sampled mini-batch instead of the full dataset
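A minimal PyTorch sketch of mini-batch SGD (random toy data, placeholder model and learning rate; optimizer = 'sgd' is the Keras-style string, the PyTorch equivalent is torch.optim.SGD):

import torch
import torch.nn as nn

# toy regression data standing in for a real dataset
X = torch.randn(256, 1)
y = 2.0 * X + 1.0 + 0.1 * torch.randn(256, 1)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

batch_size = 32
for epoch in range(20):
    perm = torch.randperm(X.size(0))           # "stochastic": reshuffle every epoch
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]           # one random mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        optimizer.step()                       # update from this batch's gradient only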
Momentum
SGD
something that keeps moving
physics momentum
zigzag
Without momentum, the path zigzags a lot.
With momentum, it gets down to the minimum much faster.
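Writing the classical momentum update out in NumPy on a made-up "narrow valley" loss (the loss and numbers are mine, just to show the zigzag damping); in practice you simply pass momentum=0.9 to torch.optim.SGD:

import numpy as np

# gradient of a toy loss w0^2 + 10*w1^2: steep in one direction, shallow in the other
def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([1.0, 1.0])
v = np.zeros_like(w)        # velocity
lr, mu = 0.05, 0.9          # mu = 0.9 is the usual default

for step in range(100):
    v = mu * v - lr * grad(w)   # momentum accumulates past gradients
    w = w + v                   # parameters keep "moving" in the averaged direction

print(w)   # ends up near the minimum at [0, 0]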
Variable and Adaptive Learning Rates
momentum is nice: huge performance gains, almost no work. 0.9 is usually fine.
learning rate scheduling
#1. step decay
#2. exponential decay
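Both schedules are available in PyTorch under torch.optim.lr_scheduler; a minimal sketch with a placeholder model and placeholder numbers:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# #1. step decay: multiply lr by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# #2. exponential decay: multiply lr by gamma every epoch
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(30):
    # ... one training epoch would go here ...
    scheduler.step()        # decay the learning rate
    print(epoch, optimizer.param_groups[0]['lr'])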
learning rate:
too large is bad, and too small is also bad
too slow -> increase learning rate
be careful! may hit a plateau temporarily
AdaGrad: Adaptive Learning Rate Techniques
everything is element-wise
each scalar parameter and its learning rate is updated independently of the others
It has been observed that AdaGrad decreases learning rate too aggressively
RMSProp
Introduced by Geoff Hinton and his team
since cache is growing too fast, let's decrease it on each update:
major packages have implemented both
TensorFlow initializes cache = 1
Keras initializes cache = 0
we can't even tell which one is correct
AdaGrad:
at every batch:
cache += gradient ** 2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
RMSProp
At every batch:
cache = decay * cache + (1-decay) * gradient ** 2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
epsilon = 10 ** -8, 10 ** -9, 10 ** -10, etc..., decay = 0.9, 0.99, 0.999, 0.9999, etc...
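The two update rules above, written out as runnable NumPy on a toy quadratic (the names follow the pseudocode; the gradient function and constants are only illustrative). PyTorch ships both as torch.optim.Adagrad and torch.optim.RMSprop.

import numpy as np

def toy_gradient(param):
    return 2.0 * param      # gradient of a simple quadratic loss, for illustration

learning_rate, epsilon, decay = 0.1, 1e-8, 0.9
param_ada = np.array([1.0, -2.0]); cache_ada = np.zeros(2)   # AdaGrad state
param_rms = np.array([1.0, -2.0]); cache_rms = np.zeros(2)   # RMSProp state

for batch in range(100):
    # AdaGrad: cache only grows, so the effective learning rate only shrinks
    g = toy_gradient(param_ada)
    cache_ada += g ** 2
    param_ada -= learning_rate * g / np.sqrt(cache_ada + epsilon)

    # RMSProp: cache is an exponentially-decaying average, so it can shrink too
    g = toy_gradient(param_rms)
    cache_rms = decay * cache_rms + (1 - decay) * g ** 2
    param_rms -= learning_rate * g / np.sqrt(cache_rms + epsilon)

print(param_ada, param_rms)   # both have moved toward the minimum at [0, 0]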
Adam optimizer
go-to default these days
"RMSprop with momentum"
Exponentially-Smoothed Averages
RMS = "root mean square"
Adam:
m and v: exponentially-smoothed estimates of the gradient (first moment) and of the squared gradient (second moment)
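A sketch of the Adam update in NumPy (betas and epsilon are the usual defaults; the toy gradient is mine): m is the smoothed gradient (the momentum part), v is the smoothed squared gradient (the RMSprop part), and both are bias-corrected because they start at 0. In PyTorch this is torch.optim.Adam.

import numpy as np

def toy_gradient(param):
    return 2.0 * param      # gradient of a simple quadratic loss, for illustration

lr, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8
param = np.array([1.0, -2.0])
m = np.zeros(2)             # first moment: smoothed gradient
v = np.zeros(2)             # second moment: smoothed squared gradient

for t in range(1, 1001):
    g = toy_gradient(param)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)    # bias correction
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (np.sqrt(v_hat) + epsilon)

print(param)   # close to the minimum at [0, 0]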