Transformer Paper Review

Today I am going to review the Transformer paper. Since I already wrote a short post about the Transformer itself, I will skip the parts covered there.

https://roki9914.tistory.com/15

https://arxiv.org/abs/1706.03762

Attention Is All You Need

Introduction

The Transformer addresses the problems of the RNN-family models that had dominated sequence modeling, and it does so by building the whole model on the attention mechanism.

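RNN-family models process a sequence recurrently, so each hidden state depends on the previous one. This rules out parallelization within a training example, which becomes a serious bottleneck as sequences get longer, on top of well-known issues such as vanishing gradients on long sequences. Follow-up work improved computational efficiency through factorization tricks and conditional computation, but the fundamental constraint of sequential computation remained.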

The attention mechanism at the core of this paper was not new either, but until then it had almost always been used together with an RNN.

This paper shows that a model with no recurrence, using the attention mechanism alone, reached state of the art on translation tasks.

Background

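Earlier attempts to reduce sequential computation include the Extended Neural GPU, ByteNet, and ConvS2S. All of them use CNNs as the basic building block, and the number of operations needed to relate two positions grows with the distance between them (linearly for ConvS2S, logarithmically for ByteNet), so dependencies between distant input and output positions remained hard to learn.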

Self-attention, on the other hand, had already performed well on several tasks such as reading comprehension and textual entailment.

End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and they performed well on simple-language question answering and language modeling tasks.

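Building on this line of work, the Transformer is the first transduction model that computes representations of its input and output with self-attention alone, without sequence-aligned RNNs or convolution.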

Model Architecture

I covered the model architecture in full in the previous post.

https://roki9914.tistory.com/15

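However, in that post I only said that positional encoding is done in a fixed (non-learned) way. Concretely, a value computed from sine and cosine functions is added to the embedding at each position.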

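In the formula below, pos is the position of the token, i is the index of the dimension pair (not the size of the dimension), and d_model is the model dimension.

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$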

The paper reports that learned positional embeddings gave nearly identical results, and that the sinusoidal formula was chosen because it may let the model extrapolate to sequences longer than those seen during training.
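To make the sinusoidal encoding concrete, here is a minimal sketch in plain Python. The function name `sinusoidal_positional_encoding` and the toy sizes (`max_len=50`, `d_model=8`) are my own choices for illustration, not something from the paper, and the sketch assumes an even `d_model`.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table (list of lists) of position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # The same angle feeds the sine (even index) and cosine (odd index).
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)
            row[2 * i + 1] = math.cos(angle)
        table.append(row)
    return table


# Toy sizes chosen only for illustration.
pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
print(len(pe), len(pe[0]))  # number of positions, encoding width
print(pe[0])                # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Because the table depends only on pos and i, it can be computed for any length without training, which is what makes extrapolation to longer sequences at least possible.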

Why Self-Attention

In this section the paper compares self-attention with recurrent and convolutional layers along several axes to explain why self-attention is the better choice.

In Table 1 of the paper, n is the input sequence length, d is the representation dimensionality, and k is the kernel size of the convolution.

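According to that table, the complexity per layer is $O(n^2 \cdot d)$ for self-attention and $O(n \cdot d^2)$ for a recurrent layer, so self-attention is cheaper whenever n < d. With the sentence representations used in machine translation (word-piece, byte-pair), n is usually smaller than d, so this is a considerable advantage.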

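Sequential operations is the minimum number of operations that must run one after another; the smaller it is, the easier the layer is to parallelize. Self-attention needs only $O(1)$ here, which ties with the convolutional layer for the best value, while a recurrent layer needs $O(n)$.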

The reason for using maximum path length as a metric is as follows:

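The length of the paths that forward and backward signals have to traverse in the network strongly affects how well dependencies can be learned. The shorter the path between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies, so a smaller maximum path length makes learning easier. Self-attention is the only layer type with a constant $O(1)$ here (recurrent layers need $O(n)$, and stacked dilated convolutions need $O(\log_k(n))$), so it can be considered the easiest to train in this respect.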
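The three criteria can be compared with a quick back-of-the-envelope calculation. The sketch below simply plugs numbers into the big-O terms from the paper's table with constant factors dropped, so the results are rough orders of magnitude rather than real operation counts. The function names and the example sizes (n = 50, d = 512, k = 3) are assumptions made for illustration.

```python
def per_layer_cost(n, d, k):
    """Big-O terms for complexity per layer (constant factors ignored)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }


def sequential_ops(n):
    """Minimum number of operations that must run one after another."""
    return {"self_attention": 1, "recurrent": n, "convolutional": 1}


def conv_path_length(n, k):
    """Smallest number of stacked dilated conv layers L with k**L >= n."""
    layers, reach = 0, 1
    while reach < n:
        reach *= k
        layers += 1
    return layers


def max_path_length(n, k):
    """Longest path between any two positions, per layer type."""
    return {
        "self_attention": 1,
        "recurrent": n,
        "convolutional": conv_path_length(n, k),
    }


# Hypothetical sizes: a 50-token sentence, d = 512, kernel size 3.
n, d, k = 50, 512, 3
cost = per_layer_cost(n, d, k)
ops = sequential_ops(n)
path = max_path_length(n, k)
for name in cost:
    print(name, cost[name], ops[name], path[name])
```

With these sizes n < d, so the self-attention term (1,280,000) comes out roughly ten times smaller than the recurrent one (13,107,200), while both its sequential operations and its maximum path length stay at 1.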

Training

Training Data

The paper trains on the WMT 2014 English-German dataset (4.5M sentence pairs) and the WMT 2014 English-French dataset (36M sentence pairs, 32K word-piece vocabulary).

Hardware

Training ran on 8 NVIDIA P100 GPUs: the base model for 100,000 steps (12 hours) and the big model for 300,000 steps (3.5 days).

Optimizer

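The Adam optimizer was used with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The learning rate follows the schedule below with warmup_steps = 4000: it increases linearly during warmup and then decays in proportion to the inverse square root of the step number.

$lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$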
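The learning rate schedule is easier to picture in code. Below is a minimal sketch of the warmup-then-decay rule from the paper; the function name `transformer_lrate` is my own, and the defaults d_model = 512 and warmup_steps = 4000 correspond to the base model.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num starts at 1)."""
    if step_num < 1:
        raise ValueError("step_num must be >= 1")
    # Linear increase during warmup, inverse-square-root decay afterwards.
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 20000, 100000):
    print(step, transformer_lrate(step))
```

The two branches of `min` meet exactly at step_num = warmup_steps, which is where the learning rate peaks (about 0.0007 for the base model settings).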

Regularization

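Dropout was applied to the output of each sub-layer (before it is added to the sub-layer input and normalized) and to the sums of the embeddings and the positional encodings. Label smoothing with $\epsilon_{ls} = 0.1$ was also applied.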
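As a small illustration of what label smoothing does to the training target, here is a sketch in plain Python. It assumes the common variant that mixes the one-hot target with a uniform distribution over all classes; the paper does not spell out the exact formula, so treat that choice, the function name `smooth_labels`, and the 5-class toy vocabulary as assumptions.

```python
def smooth_labels(target_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution (assumed variant)."""
    uniform_share = eps / num_classes
    dist = [uniform_share] * num_classes
    dist[target_index] += 1.0 - eps
    return dist


# Toy example: 5 classes, correct class is index 2.
dist = smooth_labels(target_index=2, num_classes=5, eps=0.1)
print(dist)
print(sum(dist))
```

Instead of pushing the correct token toward probability 1, the model is trained toward a slightly softer target. The paper notes that this hurts perplexity but improves accuracy and BLEU.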

Result

As a result, the model reached state of the art on machine translation. The paper also reports experiments on variations of the model, and shows that the Transformer generalizes well to English constituency parsing.

Today there are many well-known models and services built on the Transformer, such as BERT, ChatGPT, and DALL-E, so it is fair to call this a landmark paper in the history of AI.
