
The notes below are a summary of the Udemy course Pytorch: Deep Learning and Artificial Intelligence.

Gradient Descent

backbone of deep learning; the same optimization idea also shows up in other ML models:

k-means clustering

hidden Markov models

matrix factorization

 

Big learning rate: the steps overshoot and the loss can oscillate or even diverge.

small learning rate: convergence is very slow.
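A minimal sketch of plain gradient descent on a 1-D quadratic, just to make the learning-rate trade-off concrete (this toy example is mine, not from the course):

# gradient descent on f(w) = w^2, whose gradient is 2w
def gradient_descent(learning_rate, steps=20):
    w = 5.0                           # starting point
    for _ in range(steps):
        grad = 2 * w                  # gradient of w^2
        w = w - learning_rate * grad  # the gradient descent update
    return w

print(gradient_descent(0.1))   # reasonable rate: w shrinks toward the minimum at 0
print(gradient_descent(1.1))   # rate too big: the iterates blow up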

Stochastic Gradient Descent

optimizer = 'sgd'

 

"stochastic" = each update uses a randomly sampled (mini-)batch of data instead of the full dataset, so the gradient is a noisy estimate of the true gradient.
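The optimizer = 'sgd' line looks like the Keras-style spelling; in PyTorch the same optimizer is torch.optim.SGD. A minimal training-loop sketch (the linear model and random data are placeholders of my own):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                           # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)     # one fake mini-batch

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    optimizer.zero_grad()                # clear gradients from the previous step
    loss = loss_fn(model(x), y)          # forward pass on the mini-batch
    loss.backward()                      # backprop: compute gradients
    optimizer.step()                     # SGD update: param -= lr * grad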

 

Momentum

An add-on to SGD, borrowed from the idea of physical momentum: the update keeps some of its previous velocity, like a ball that keeps moving downhill.

Without momentum, the path zigzags a lot.

With momentum, the zigzag is damped out and the optimizer reaches the bottom much faster.
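A sketch of one common formulation of the momentum update, plus how it is turned on in PyTorch (the exact formulation in the course may differ slightly):

# velocity = mu * velocity - learning_rate * gradient
# param    = param + velocity       (mu is the momentum term, e.g. 0.9)

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)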

 

 

Variable and Adaptive Learning Rates

momentum is nice: huge performance gains, almost no work. 0.9 is usually fine.

learning rate scheduling

#1. step decay

#2. exponential decay

 

learning rate: too large is bad, and too small is also bad.

training too slow -> increase the learning rate

be careful! the loss may only be hitting a plateau temporarily
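A minimal PyTorch sketch of both schedules mentioned above, using the built-in lr_scheduler module (the model and the schedule numbers are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# step decay: multiply the learning rate by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# exponential decay alternative:
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... run one epoch of training (forward, backward, optimizer.step()) ...
    scheduler.step()      # decay the learning rate once per epoch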

 

AdaGrad: Adaptive Learning Rate Techniques

everything is element-wise

each scalar parameter and its learning rate is updated independently of the others

It has been observed that AdaGrad decreases the learning rate too aggressively

 

RMSProp

Introduced by Geoff Hinton + team

since cache is growing too fast, let's decrease it on each update:

 

It is not even clear which cache initialization is correct; the major packages have implemented both:

TensorFlow initializes cache = 1

Keras initializes cache = 0

 

AdaGrad:

at every batch:

cache += gradient ** 2

param = param - learning_rate * gradient / sqrt(cache + epsilon)
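A small NumPy sketch of this AdaGrad update on a toy quadratic loss, just to make the pseudocode concrete (the toy problem and names are mine):

import numpy as np

# toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w
w = np.array([5.0, -3.0])
cache = np.zeros_like(w)             # per-parameter sum of squared gradients
learning_rate, epsilon = 1.0, 1e-8

for step in range(100):
    gradient = w                     # gradient of the toy loss
    cache += gradient ** 2           # everything is element-wise
    w = w - learning_rate * gradient / np.sqrt(cache + epsilon)

print(w)   # each component decays toward 0 with its own effective step size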

 

RMSProp

At every batch:

cache = decay * cache + (1-decay) * gradient ** 2

param = param - learning_rate * gradient / sqrt(cache + epsilon)

 

epsilon = 10 ** -8, 10 ** -9, 10 ** -10, etc.; decay = 0.9, 0.99, 0.999, 0.9999, etc.
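In PyTorch the same update is available as torch.optim.RMSprop; a minimal usage sketch (placeholder model), where alpha plays the role of decay and eps of epsilon:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, eps=1e-8)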

 

Adam optimizer

the go-to default these days

"RMSprop with momentum"

exponentially-smoothed averages

 

RMS = "root mean square"

 

Adam:

keeps two exponentially-smoothed averages per parameter: m (the smoothed gradient, the momentum part) and v (the smoothed squared gradient, the RMSProp cache)
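A sketch of the standard Adam update with bias correction (the formulation from the original Adam paper; the variable names match the m and v above), followed by the one-line PyTorch equivalent:

import numpy as np

def adam_update(param, gradient, m, v, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # m: smoothed gradient (momentum part); v: smoothed squared gradient (RMSProp part)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction, since m and v start at 0
    v_hat = v / (1 - beta2 ** t)      # t is the 1-based step count
    param = param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v

# In PyTorch: optimizer = torch.optim.Adam(model.parameters(), lr=0.001)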

 
