Longformer: The Long-Document Transformer (Paper Summary)




 

 

[ Longformer vs. Transformer ]

• complexity: O(n), scales linearly vs. O(n^2), scales quadratically

• attention: local windowed attention + global attention (pattern (d) in the paper's attention-pattern figure) vs. full self-attention (pattern (a))

• max length: 4,096 vs. 512
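
A quick way to see the 4,096-token limit in practice (a minimal sketch, assuming the Hugging Face `transformers` library and the public `allenai/longformer-base-4096` checkpoint, which are not part of this post):

```python
# Minimal sketch: assumes the Hugging Face `transformers` package and the
# public "allenai/longformer-base-4096" checkpoint are available.
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

print(model.config.max_position_embeddings)  # roughly 4,096 positions (plus a few special slots)
print(model.config.attention_window)         # per-layer sliding-window sizes
```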

 

 

[ Attention Pattern ]

 

 

 

1) Sliding Window

• fixed-size window attention for local context

• complexity: O(n × w)

   • n: input sequence length

   • w: fixed window size (can differ from layer to layer)

• small window size: local information, efficiency

• large window size: higher-level representation, performance

 

• stacking layers leads to high-level features (similar to CNNs)

• receptive field size: l × w

   • l: num of layers
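
A toy sketch of this pattern (my own PyTorch illustration, not the paper's optimized custom kernel): each query attends only to the w // 2 keys on either side of it, so only n × w attention scores are ever needed.

```python
import torch

def sliding_window_mask(n: int, w: int) -> torch.Tensor:
    """Boolean (n, n) mask: True where query i may attend to key j.
    Each token attends to w // 2 neighbors on each side (plus itself)."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= w // 2

# toy example: n = 8 tokens, window size w = 4
mask = sliding_window_mask(8, 4)
print(mask.int())

# masking full (n, n) scores here is only for illustration; an efficient
# implementation never materializes them and keeps just the n * w in-window scores
scores = torch.randn(8, 8)
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print(attn.shape)
```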

 

2) Dilated Sliding Window

• analogous to dilated CNNs

    • the window has gaps of size dilation d

 

 

• receptive field size: l × d × w

   • d: dilation size
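
Extending the toy mask above with dilation (again an illustrative sketch, not the paper's implementation): keeping only every d-th position inside a window stretched by d creates the gaps and widens the per-layer receptive field from w to d × w.

```python
import torch

def dilated_window_mask(n: int, w: int, d: int) -> torch.Tensor:
    """Boolean (n, n) mask for a dilated sliding window: query i attends to
    keys at offsets 0, ±d, ±2d, ..., up to w // 2 dilated steps away."""
    idx = torch.arange(n)
    offset = idx[None, :] - idx[:, None]
    in_window = offset.abs() <= (w // 2) * d   # window span stretched by the dilation
    on_grid = offset % d == 0                  # keep every d-th position -> gaps in the window
    return in_window & on_grid

# toy example: n = 12 tokens, window w = 4, dilation d = 2
print(dilated_window_mask(12, 4, 2).int())
```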

 

3) Global Attention

• to learn task-specific representations

• add global attention at a few pre-selected input locations

    ex) classification task: [CLS] token, QA task: all question tokens

• symmetric

    • a token with a global attention attends to all tokens

    • all tokens attend to a token with a global attention
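
For reference, a minimal sketch of how this looks with the Hugging Face `transformers` implementation, which exposes a `global_attention_mask` argument (this library usage is my assumption, not something stated in the post): marking a token with 1 gives it the symmetric global attention described above.

```python
# Minimal sketch: assumes the Hugging Face `transformers` LongformerModel, whose
# forward pass accepts a `global_attention_mask` (1 = global token, 0 = local only).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("a long document ...", return_tensors="pt")

# classification-style setup: global attention only on the first (<s> / [CLS]) token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```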

 

[ Pretraining and Finetuning ]

• continues from the RoBERTa checkpoint (instead of pretraining from scratch)

• sliding window attention w/ window size 512

• add extra positional embeddings (up to 4,096 positions)

    • copy RoBERTa's 512 positional embeddings multiple times (instead of randomly initializing)
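
A minimal sketch of that initialization idea (illustrative tensor shapes only, ignoring RoBERTa's few reserved position slots; not the authors' conversion script):

```python
import torch

# illustrative shapes only: RoBERTa learns 512 position embeddings of hidden
# size 768, while Longformer needs 4,096 positions
roberta_pos_emb = torch.randn(512, 768)   # stand-in for RoBERTa's learned table
num_copies = 4096 // 512                  # = 8

# copy the 512 embeddings repeatedly instead of random initialization
longformer_pos_emb = roberta_pos_emb.repeat(num_copies, 1)
print(longformer_pos_emb.shape)           # torch.Size([4096, 768])
```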

 

 

cf. BigBird's attention pattern combines random, window, and global attention (see the explainer video in the references below)

 

 

 

 

References

Longformer original paper: https://arxiv.org/pdf/2004.05150.pdf

BigBird paper explained: https://youtu.be/WVPE62Gk3EM

 
