Moonde Moonde

SAM (Sharpness Aware Minimization)

2021-07-25T00:00:00+00:00

Data Structure

2020-04-29T00:00:00+00:00

Data Structure

This post introduces the basic data structures (DS) used throughout computer science.

Preliminaries

Big O notation

Let f, g be real-valued functions.

Then $f(x) = O(g(x))$ if there exist a constant $M > 0$ and a point $x'$ such that

$|f(x)| \le M |g(x)|$

for all $x > x'$.

Example.

$3n + 4 = O(n)$

$4n^2 + 3n + 1 = O(n^2)$
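Both examples can be checked directly against the definition. The constants below ($M = 4, x' = 4$ and $M = 8, x' = 1$) are just one convenient choice made here for illustration; any larger values work as well.

```latex
3n + 4 \le 3n + n = 4n \quad (n \ge 4)
\qquad\Rightarrow\qquad 3n + 4 = O(n) \ \text{with } M = 4,\ x' = 4

4n^2 + 3n + 1 \le 4n^2 + 3n^2 + n^2 = 8n^2 \quad (n \ge 1)
\qquad\Rightarrow\qquad 4n^2 + 3n + 1 = O(n^2) \ \text{with } M = 8,\ x' = 1
```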

Why use this?
- To measure the efficiency of an algorithm!
What makes a good algorithm?
- It needs little memory (low space complexity)
- It is fast (low time complexity)

Sorting Algorithm

selection sort

time complexity : $O(n^2)$

def selectionSort(arr):
    # Repeatedly move the smallest remaining element to position i.
    for i in range(len(arr)):
        min_idx = i
        for j in range(i + 1, len(arr)):
            if arr[min_idx] > arr[j]:
                min_idx = j

        arr[min_idx], arr[i] = arr[i], arr[min_idx]

    return arr

Bubble sort

time complexity : $O(n^2)$

def bubbleSort(arr):
    
    for i in range(len(arr)):
        for j in range(len(arr)-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

    return arr

Merge sort

time complexity : $O(n \log n)$

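def merge(arr1, arr2):
    # Merge two sorted lists with index pointers; list.pop(0) is O(n)
    # per call and would make the merge quadratic.
    result = []
    i = j = 0

    while i < len(arr1) and j < len(arr2):
        if arr1[i] <= arr2[j]:
            result.append(arr1[i])
            i += 1
        else:
            result.append(arr2[j])
            j += 1

    # One side is exhausted; append whatever remains of the other.
    return result + arr1[i:] + arr2[j:]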

def mergeSort(arr):

    if len(arr) == 0 or len(arr) == 1:
        return arr
    
    if len(arr) == 2:
        if arr[0] < arr[1]:
            return arr
        else:
            return [arr[1], arr[0]]

    mid = len(arr) // 2
    leftHalf = mergeSort(arr[:mid])
    rightHalf = mergeSort(arr[mid:])

    return merge(leftHalf, rightHalf)

Quick sort

time complexity : $O(n \log n)$ on average, $O(n^2)$ in the worst case

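def quickSort(arr, low, high):
    # Sorts arr[low:high] in place; high is exclusive,
    # so the initial call is quickSort(arr, 0, len(arr)).
    if low >= high:
        return

    target = arr[low]  # pivot
    mid = low          # boundary: arr[low+1..mid] are smaller than the pivot
    for i in range(low + 1, high):
        if arr[i] < target:
            mid = mid + 1
            arr[i], arr[mid] = arr[mid], arr[i]

    arr[low], arr[mid] = arr[mid], arr[low]  # put the pivot in its final place
    quickSort(arr, mid + 1, high)
    quickSort(arr, low, mid)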

Dynamic Programming

Without dynamic programming

def fibo(n):
    if n == 0 or n == 1:
        return 1
    return fibo(n-1) + fibo(n-2)

With dynamic programming
- Use a recurrence relation and a cache!
- A trade-off between space complexity and time complexity

def fibo(n):
    # fibo_arr[i] caches the i-th value; fibo(0) = fibo(1) = 1 as above.
    fibo_arr = [1] * (n + 1)
    for i in range(2, n + 1):
        fibo_arr[i] = fibo_arr[i-1] + fibo_arr[i-2]

    return fibo_arr[n]

Array

add?
remove?

Linked List

singly linked list

circularly linked list
doubly linked list

Stack

python list

Queue

queue
circular queue
double-ended queue (deque)
priority queue

Heap

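A binary tree in which every node's value is greater than or equal to its parent's (a min-heap)
Complete := every level holds the maximal number of nodes, except possibly the last
insert : $O(\log n)$. The new value is placed at the bottom.
delete : $O(\log n)$. The top value (the min) is removed.
When implemented with an array, for a parent $p$ stored at index $f(p)$
- left child : $2f(p) + 1$
- right child : $2f(p) + 2$
What if a heap is used to build a priority queue?
- insert and remove are $O(\log n)$
- compared with an unsorted list / sorted list / heap
  - if inserts are frequent, use an unsorted list
  - if min lookups are frequent, use a heap or a sorted list
  - if both happen reasonably often, use a heap
Bottom-up construction takes $O(n)$
heap sort
- repeatedly removing the min yields a sorted sequence
- $O(n)$ to build the heap + $O(\log n)$ per removal * n removals = $O(n \log n)$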
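The heap notes above can be sketched as a small array-backed min-heap. This is an illustrative sketch, not a reference implementation: the class name `MinHeap` and the method names are made up here, and the child indices follow the $2i + 1$ / $2i + 2$ rule from the notes.

```python
class MinHeap:
    """Array-backed binary min-heap (illustrative sketch; names are hypothetical)."""

    def __init__(self, items=()):
        self.data = list(items)
        # Bottom-up construction: sift down every internal node, O(n) overall.
        for i in range(len(self.data) // 2 - 1, -1, -1):
            self._sift_down(i)

    def insert(self, value):
        # Put the new value at the bottom, then bubble it up: O(log n).
        self.data.append(value)
        i = len(self.data) - 1
        while i > 0:
            parent = (i - 1) // 2
            if self.data[parent] <= self.data[i]:
                break
            self.data[parent], self.data[i] = self.data[i], self.data[parent]
            i = parent

    def pop_min(self):
        # Swap the root (the min) with the last leaf, remove it,
        # then sift the new root down: O(log n).
        data = self.data
        if not data:
            raise IndexError("pop from empty heap")
        data[0], data[-1] = data[-1], data[0]
        smallest = data.pop()
        if data:
            self._sift_down(0)
        return smallest

    def _sift_down(self, i):
        data, n = self.data, len(self.data)
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            smallest = i
            if left < n and data[left] < data[smallest]:
                smallest = left
            if right < n and data[right] < data[smallest]:
                smallest = right
            if smallest == i:
                return
            data[i], data[smallest] = data[smallest], data[i]
            i = smallest


def heap_sort(values):
    heap = MinHeap(values)                                # O(n) build
    return [heap.pop_min() for _ in range(len(values))]   # n pops, O(log n) each
```

In everyday Python code the standard-library `heapq` module provides the same operations on a plain list; the class above only spells out the mechanics.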

Hash table, Map, Skip List

Map : a structure that stores (key, value) pairs

Hash map

hash function mod

chain
probe
- linear probe
- quadratic probe
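As a rough sketch of the "hash function mod" plus linear-probe idea listed above: the class below hashes a key, takes it modulo the table size, and walks forward one slot at a time on a collision. The class name, the doubling policy, and the load-factor threshold of 1/2 are assumptions chosen for illustration, and deletion is left out to keep the sketch short.

```python
class LinearProbingMap:
    """Open-addressing hash map with linear probing (illustrative sketch)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity   # each slot is None or a (key, value) pair
        self.size = 0

    def _index(self, key):
        # "hash function mod": map the hash onto the table.
        return hash(key) % self.capacity

    def put(self, key, value):
        # Keep the table at most half full so probing always finds a free slot.
        if self.size * 2 >= self.capacity:
            self._resize(self.capacity * 2)
        i = self._index(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity   # linear probe: try the next slot
        if self.slots[i] is None:
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key, default=None):
        i = self._index(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        return default

    def _resize(self, new_capacity):
        old = [slot for slot in self.slots if slot is not None]
        self.capacity = new_capacity
        self.slots = [None] * new_capacity
        self.size = 0
        for k, v in old:
            self.put(k, v)
```

Chaining would instead keep a small list per slot, and quadratic probing would replace the `i + 1` step with offsets that grow quadratically.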

skip list

Lets a linked list be searched in a binary-search-like way

Tree

Tree
Binary Search Tree
Balanced Search Tree
- rotation
  - single rotation
  - double rotation (trinode restructuring)
AVL Tree
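- Height Balance Property : For every internal position p of T, the heights of the children of p differ by at most 1
- AVL tree := a binary search tree that satisfies the property above
- detects height changes and applies trinode restructuring
(2, 4) Tree
- a multiway search tree in which all external nodes have the same depth and every node has at most 4 children
- if an insertion gives a node more than 4 children, the middle key moves up one level (split)
- if a node runs out of keys, a key is brought down from the parent (fusion)
AVL trees and (2, 4) trees spend a lot of time on balancing
- each restructuring is O(1), but several steps may be needed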
Red - Black Tree
- balance 에 O(1)
- rule
  - root is black
  - external node is black
  - children of red is black
  - every external node has same black depth
- insertion
  - sibling is black -> trinode restructuring
  - sibling is red -> push the black color one level down (recoloring)
- deletion
  - what if the removed node is red?
    - move up a child that is red, or
    - if both are black, recolor one of the black ones red
  - sibling is black & sibling has a red child -> trinode restructuring
  - sibling is black & both of its children are black -> recoloring
  - sibling is red -> single rotation & recoloring
- Try it out on this site: https://www.cs.usfca.edu/~galles/visualization/RedBlack.html

Graph

DFS & BFS

Shortest Path
- Dijkstra’s Algorithm
- Bellman-Ford Algorithm
Minimum Spanning Tree
- Kruskal’s Algorithm
- Prim-Jarník Algorithm
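Of the algorithms listed above, Dijkstra's ties back most directly to the heap section, since it uses a priority queue to pick the closest unsettled vertex. The sketch below assumes a graph given as a dict mapping each vertex to a list of `(neighbor, weight)` pairs with non-negative weights; that representation and the function name are choices made here for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return a dict of shortest distances from source (non-negative weights assumed)."""
    dist = {source: 0}
    heap = [(0, source)]                  # priority queue of (distance, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale entry: a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
distances = dijkstra(graph, "a")
```

With a binary heap this runs in $O((V + E) \log V)$. Negative edge weights break the greedy argument, which is where Bellman-Ford comes in.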

Compiler

2020-03-19T00:00:00+00:00

Compiler

Introduction

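A compiler is a program that translates human-friendly source code into machine-friendly machine code. Modern compilers use a variety of optimization techniques to maximize the performance of the generated code. Compilers matter: every new hardware architecture needs a compiler that targets it, and that compiler has to be aware of the optimizations the hardware performs. In modern software, JIT compilers, which compile at run time, are used for JavaScript and WebAssembly.

The general structure of a compiler is as follows.

The source code, a 1D string, is turned into a 2D structure by the Parser. The resulting tree is called the source-code AST (abstract syntax tree). An AST is easier to analyze than raw text. The type checker inspects the AST to decide, before the program runs, whether it will run correctly. If the type checker finds no risk of type errors, the AST is passed as input to IRgen.

IRgen turns the AST into an IR (intermediate representation), which sits roughly halfway between source code and machine code. The IR has several characteristics, such as loading variables into registers instead of operating on memory directly and never reusing a variable name; these exist to turn the source code into a language that is easy to optimize. During optimization, one must check carefully that interference does not change the meaning of the code.

Asmgen takes the IR and converts it into an AST of machine code. The IR assumes an unlimited number of registers; Asmgen maps these onto the real, finite register set and selects instructions that fit the hardware architecture. This is the only part of the compiler that is determined by the hardware.

To summarize: the Parser takes source code and builds an AST; IRgen converts the AST into IR, which is then optimized; Asmgen adapts the code to the hardware so that it can run on finite physical resources. Finally, the Printer converts the machine-code AST into a 1D string, and this machine code is what runs on the hardware.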

IR (Intermediate Representation)

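IR is the source-code AST expressed in a form that is convenient for optimization. It has several characteristics. First, it has the structure of a CFG (control-flow graph) consisting only of instruction blocks, jumps, and conditional jumps. It makes heavy use of registers to guarantee non-interference. It also renames variables so that they are unambiguous and uses explicit types. In addition, it uses SSA (Static Single Assignment), which guarantees that each register is defined at most once in the code.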

CFG (Control-Flow Graph)

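The AST is organized into instruction blocks, and the blocks are connected by jumps to form the IR. Structuring it this way lets exceptional control flow be expressed in the same form as ordinary control flow. Expressing control flow only with jumps makes the IR easy both to lower to assembly later and to optimize.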

Register Machine

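Memory is accessed only through load and store; values are loaded into registers and computation uses register values only. The benefit is non-interference. A value in memory can be changed by intervening code that writes through its address, but a register's value does not change unless the register is reassigned. Optimizations can therefore be applied with confidence.

No ambiguous variable names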

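The IR does not reuse a single variable name for several different values the way C code uses i here and there. If i appears multiple times, each distinct i is given a different name. This requires working out what each i refers to at each point, and the step that does so is called name resolution. Name resolution stores a mapping name -> variable. It is also referred to as semantic analysis or symbol table lookup.

Languages use either dynamic scoping, where the variable mapping is determined by the environment at the time something like a function runs, or static scoping, where the mapping is determined by the AST. Today static scoping is more common because it makes code easier to modularize.

Explicitly annotated types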

The IR states variable types explicitly, because Asmgen needs them to choose instructions. For example, 42 + 0.42 first converts 42 from int32 to float32 and then uses the instruction that adds two float32 values. The types are usually those used in C, such as scalars, pointers, structs, and arrays.

SSA (Static Single Assignment)

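SSA means that each register is defined at most once in the program text, that is, statically. In code such as a loop, where a condition decides where to jump, a code block may execute several times, so the register is redefined dynamically at run time even though it has a single static definition.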
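A minimal sketch of the renaming idea behind SSA, restricted to straight-line code. The tuple format `(dest, op, args)` and the `name_version` naming scheme are assumptions made here for illustration; real SSA construction also has to place phi nodes where control flow merges (for example at loop headers), which this sketch deliberately omits.

```python
def to_ssa(instructions):
    """Rename variables so that every destination is assigned exactly once.

    instructions: list of (dest, op, args), where dest is a variable name and
    args is a list of variable names or constant literals (all strings).
    """
    version = {}   # variable name -> latest version number
    result = []
    for dest, op, args in instructions:
        # Each use refers to the latest definition of that variable;
        # constants and not-yet-defined inputs are left untouched.
        new_args = [f"{a}_{version[a]}" if a in version else a for a in args]
        version[dest] = version.get(dest, 0) + 1
        result.append((f"{dest}_{version[dest]}", op, new_args))
    return result


program = [
    ("i", "const", ["0"]),
    ("i", "add", ["i", "1"]),
    ("j", "mul", ["i", "2"]),
]
ssa = to_ssa(program)
# ssa == [("i_1", "const", ["0"]), ("i_2", "add", ["i_1", "1"]), ("j_1", "mul", ["i_2", "2"])]
```

After renaming, every use points at exactly one definition, which is what makes analyses such as constant propagation straightforward.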

Rust

2020-03-10T00:00:00+00:00

Word level embedding

2020-02-17T00:00:00+00:00

One-line summary

Explains well-known word-level embedding techniques.

NPLM

Word2Vec

FastText

GloVe

Swivel

Fundamental Real Analysis

2020-02-16T00:00:00+00:00

Real numbers

The real numbers are an ordered field that has the least upper bound property. The terms in this statement are examined one by one below.

An ordered set is a set in which, for any two elements x and y, exactly one of x < y, x > y, x = y holds, and the order is transitive.

A field is a set on which addition and multiplication are defined and satisfy the field axioms (associativity, commutativity, and so on; roughly ten in all).

An ordered field is a set that is both an ordered set and a field, and whose addition, multiplication, and order satisfy the following two conditions.

$1.\ x + y < x + z \ if \ x, y, z \in F\ and\ y < z$

$2.\ xy > 0 \ if \ x \in F, y \in F, x > 0, and\ y > 0$

The least upper bound property says that every nonempty subset of an ordered set that is bounded above has a least upper bound (supremum).

These properties of the reals make it possible to prove the archimedean property, the density of Q in R, and the existence and uniqueness of n-th roots $x^{1/n}$.

$thm$ archimedean property : $\exists \ positive \ integer \ n \ s.t. \ nx > y \ for \ y \in R, \ positive \ x \in R$

$pf$

Suppose no such n exists, and let A be the set of all nx where n is a positive integer. Then A is bounded above by y, so sup A exists by the least upper bound property. Since x > 0, sup A - x < sup A, so sup A - x is not an upper bound of A; hence there exists a positive integer m s.t. mx > sup A - x by the definition of supremum. Then (m+1)x > sup A, but (m+1)x is in A. => contradiction!

$thm$ Q is dense in R, which implies that there exist p $\in$ Q s.t. x<p<y if x, y $\in$ R and x < y

$pf$

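there exists a positive integer n s.t. (y-x)n > 1 by archimedean.

there exists an integer m s.t. m-1 $\le$ nx $\lt$ m by archimedean.

then nx $\lt$ m $\le$ nx + 1 $\lt$ ny, so x $\lt$ $\frac{m}{n}$ $\lt$ y and p is $\frac{m}{n}$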

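$thm$ $\exists ! y > 0 \ s.t. \ y^n=x \ for \ x\in R, \ x > 0, \ integer \ n > 0$

$pf$

Define E as the set of positive t with $t^n < x$. E is a nonempty set of reals that is bounded above, so it has a supremum. Showing that the n-th power of this supremum is neither greater than nor less than $x$ proves that it equals $x$. The details are quite technical and are omitted.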

The real numbers exist.

$pf$ Define cuts as follows.

A cut $\alpha$ is a proper subset of Q that is not empty. For every $p \in \alpha$, $p < r$ for some $r \in \alpha$, and if $q < p$ for $q \in Q$, then $q \in \alpha$. The set of all cuts is then an ordered field with the least upper bound property, and is therefore R. Moreover, a subfield of the cuts can be defined that is isomorphic to Q. Such an R is also unique: if some set satisfies the definition of the reals, it is isomorphic to R and has a subfield isomorphic to Q.

The reals will be discussed further; before that, the basic topological concepts that are needed are summarized.

Basic Topology

$def$

Attention is all you need

2020-02-11T00:00:00+00:00

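One-line summary

A look at the Transformer architecture.

Introduction

The models mainly used until now (RNN, LSTM, GRU) are designed so that $h_{t+1}$ can only be computed after $h_{t}$. This structure is hard to parallelize, and its running time depends on the input. The paper proposes the Transformer, a neural network architecture that pushes attention to the limit and has no sequential dependence on time steps.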

Architecture

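The basic structure is as follows. The decoder input is the output sequence shifted right, which prevents the model from simply copying its input to its output. As a result, the word at position i is predicted from positions i-1, ..., 1 only. At inference time, the outputs start with only a start-of-sentence token as the first element; the model predicts the next word, that word is appended as the second element, and the process repeats, building up the outputs one token at a time.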
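The shift-right input and the token-by-token decoding loop can be sketched without any model at all. In the sketch below, `next_token` is a hypothetical callable standing in for the trained Transformer (it receives the tokens generated so far and returns the next one), and the `<sos>` / `<eos>` token strings are likewise assumptions.

```python
def shift_right(target_tokens, sos="<sos>"):
    """Decoder input for training: the target shifted right by one position."""
    return [sos] + target_tokens[:-1]


def greedy_decode(next_token, max_len, sos="<sos>", eos="<eos>"):
    """Build the output one token at a time, feeding back what was generated."""
    outputs = [sos]
    for _ in range(max_len):
        token = next_token(outputs)   # stand-in for a forward pass of the model
        outputs.append(token)
        if token == eos:
            break
    return outputs[1:]                # drop the start-of-sentence token


# Toy stand-in "model" that always continues a fixed sentence.
sentence = ["I", "am", "here", "<eos>"]
decoded = greedy_decode(lambda prefix: sentence[len(prefix) - 1], max_len=10)
```

During training the whole shifted target is fed in at once and masking keeps position i from seeing later positions; the loop is only needed at inference time.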

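import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder, Decoder and init_weight are not shown in this post.
class Transformer(nn.Module):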
    def __init__(self, params):
        super(Transformer, self).__init__()
        self.encoder = Encoder(params)
        self.decoder = Decoder(params)

    def forward(self, source, target):
        # source = [batch size, source length]
        # target = [batch size, target length]
        encoder_output = self.encoder(source)                            # [batch size, source length, hidden dim]
        output, attn_map = self.decoder(target, source, encoder_output)  # [batch size, target length, output dim]
        return output, attn_map

The components shown in the figure are examined one by one below.

Encoder

The encoder consists of 6 identical layers, and each layer has 2 sublayers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is applied around each sublayer, so each sublayer outputs norm(x+sublayer(x)). To make this possible, the output dimension of every sublayer, including the embeddings, is fixed at 512.

class EncoderLayer(nn.Module):
    def __init__(self, params):
        super(EncoderLayer, self).__init__()
        self.layer_norm = nn.LayerNorm(params.hidden_dim, eps=1e-6)
        self.self_attention = MultiHeadAttention(params)
        self.position_wise_ffn = PositionWiseFeedForward(params)

    def forward(self, source, source_mask):
        # source          = [batch size, source length, hidden dim]
        # source_mask     = [batch size, source length, source length]

        # Original Implementation: LayerNorm(x + SubLayer(x)) -> Updated Implementation: x + SubLayer(LayerNorm(x))
        normalized_source = self.layer_norm(source)
        output = source + self.self_attention(normalized_source, normalized_source, normalized_source, source_mask)[0]

        normalized_output = self.layer_norm(output)
        output = output + self.position_wise_ffn(normalized_output)
        # output = [batch size, source length, hidden dim]

        return output

Decoder

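The decoder is similar to the encoder, but each layer has one additional sub-layer that performs multi-head attention over the encoder output; there are 6 such layers. The self-attention sub-layer is modified with masking, which prevents a position from attending to subsequent positions. This guarantees that the prediction for position i depends only on the values before i.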

class DecoderLayer(nn.Module):
    def __init__(self, params):
        super(DecoderLayer, self).__init__()
        self.layer_norm = nn.LayerNorm(params.hidden_dim, eps=1e-6)
        self.self_attention = MultiHeadAttention(params)
        self.encoder_attention = MultiHeadAttention(params)
        self.position_wise_ffn = PositionWiseFeedForward(params)

    def forward(self, target, encoder_output, target_mask, dec_enc_mask):
        # target          = [batch size, target length, hidden dim]
        # encoder_output  = [batch size, source length, hidden dim]
        # target_mask     = [batch size, target length, target length]
        # dec_enc_mask    = [batch size, target length, source length]

        # Original Implementation: LayerNorm(x + SubLayer(x)) -> Updated Implementation: x + SubLayer(LayerNorm(x))
        norm_target = self.layer_norm(target)
        output = target + self.self_attention(norm_target, norm_target, norm_target, target_mask)[0]

        # In Decoder stack, query is the output from below layer and key & value are the output from the Encoder
        norm_output = self.layer_norm(output)
        sub_layer, attn_map = self.encoder_attention(norm_output, encoder_output, encoder_output, dec_enc_mask)
        output = output + sub_layer

        norm_output = self.layer_norm(output)
        output = output + self.position_wise_ffn(norm_output)
        # output = [batch size, target length, hidden dim]

        return output, attn_map

The specific sub-layers and structures are described below.

Embeddings & Softmax

The paper does not say exactly which embedding is used, but it appears to be a simple word-level embedding. As with weight tying in language models, the same weight matrix is shared between the input and output embedding layers and the pre-softmax linear transformation. In addition, the embedding weights are multiplied by $\sqrt {d_{model}}$ to rescale them.

=> If the encoder and decoder embeddings share one weight matrix, words with similar dictionary meanings but different nuances seem hard to handle well, so why does it work well?

Positional encoding

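The Transformer has no built-in notion of position, so position information has to be added to the embedding vectors. A vector with the same dimension as the embedding is generated and added to the embedding vector. The positional encoding vector uses $\sin(pos/10000^{2i / d_{model}})$ in the even dimensions $2i$ and $\cos(pos/10000^{2i / d_{model}})$ in the odd dimensions $2i+1$. (pos is the position of the word in the sentence, and i indexes pairs of elements of the positional encoding vector.)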
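The sinusoidal formula can be written out in a few lines of plain Python. This is a small sketch using only the standard library rather than the tensor code used elsewhere in this post; the function name and the nested-list output format are choices made here for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):            # k = 2i, the even dimension
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)          # even dimension: sine
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # odd dimension: cosine
    return pe


pe = positional_encoding(max_len=4, d_model=6)
```

Row `pos` of the table is added element-wise to the embedding of the token at that position.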

Attention

Attention is basically a function from a query and a set of key-value pairs to an output; the output is a weighted sum of the values, where the weights come from a compatibility function of the query and the keys. The two main kinds are scaled dot-product attention and additive attention. Additive attention computes the compatibility function with a feed-forward network; scaled dot-product attention is the one used in the paper and is described next. According to the paper, the two are similar in theoretical complexity, but the latter is faster and more space-efficient in practice.

- scaled-dot product attention

Here the attention function is $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_{k}}})V$. That is, the more similar a query is to a key, the larger the weight on the corresponding value. $d_{k}$ is the dimension of the queries and keys; scaling by $\frac{1}{\sqrt{d_{k}}}$ keeps the dot products from growing large and pushing softmax into regions where the gradient is small. This yields more stable gradients.

class SelfAttention(nn.Module):
    def __init__(self, params):
        super(SelfAttention, self).__init__()
        self.hidden_dim = params.hidden_dim
        self.attention_dim = params.hidden_dim // params.n_head

        self.q_w = nn.Linear(self.hidden_dim, self.attention_dim, bias=False)
        self.k_w = nn.Linear(self.hidden_dim, self.attention_dim, bias=False)
        self.v_w = nn.Linear(self.hidden_dim, self.attention_dim, bias=False)
        init_weight(self.q_w)
        init_weight(self.k_w)
        init_weight(self.v_w)

        self.dropout = nn.Dropout(params.dropout)
        self.scale_factor = torch.sqrt(torch.FloatTensor([self.attention_dim])).to(params.device)

    def forward(self, query, key, value, mask=None):
        # query, key, value = [batch size, sentence length, hidden dim]

        # create Q, K, V matrices using identical input sentence to calculate self-attention score
        q = self.q_w(query)
        k = self.k_w(key)
        v = self.v_w(value)
        # q, k, v = [batch size, sentence length, attention dim]

        self_attention = torch.bmm(q, k.permute(0, 2, 1))
        self_attention = self_attention / self.scale_factor
        # self_attention = [batch size, sentence length, sentence length]

        if mask is not None:
            self_attention = self_attention.masked_fill(mask, -np.inf)

        # normalize self attention score by applying soft max function on each row
        attention_score = F.softmax(self_attention, dim=-1)
        norm_attention_score = self.dropout(attention_score)
        # attention_score = [batch size, sentence length, sentence length]

        # compute "weighted" value matrix using self attention score and V matrix
        weighted_v = torch.bmm(norm_attention_score, v)
        # weighted_v = [batch size, sentence length, attention dim]

        return self.dropout(weighted_v), attention_score

- Multi-head attention

Multi-head attention does not apply attention directly to the $d_{model}$-dimensional embedding vectors as Q, K, V; instead it linearly projects them num_head times into smaller vectors and applies scaled dot-product attention to each projected Q, K, V. That is, with X the embedding vectors, Query(Q), Key(K), Value(V) are obtained as $W_q X, W_k X, W_v X$. The benefit seems to be that it supplies num_head different views of the query, key, and value from which to gather information, aiming at a kind of ensemble effect.

The resulting attention outputs are concatenated and linearly projected to $d_{model}$, and that value is used as the output. That is,

$MultiHead(Q, K, V) = Concat(head_{1}, \dots, head_{8})W^O \ where \ head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$

class MultiHeadAttention(nn.Module):
    def __init__(self, params):
        super(MultiHeadAttention, self).__init__()
        assert params.hidden_dim % params.n_head == 0
        self.attentions = nn.ModuleList([SelfAttention(params)
                                         for _ in range(params.n_head)])
        self.o_w = nn.Linear(params.hidden_dim, params.hidden_dim, bias=False)
        init_weight(self.o_w)
        self.dropout = nn.Dropout(params.dropout)

    def forward(self, query, key, value, mask=None):
        # query, key, value = [batch size, sentence length, hidden dim]

        self_attentions = [attention(query, key, value, mask) for attention in self.attentions]
        # self_attentions = [batch size, sentence length, attention dim] * num head
        weighted_vs = [weighted_v[0] for weighted_v in self_attentions]
        attentions = [weighted_v[1] for weighted_v in self_attentions]

        weighted_v = torch.cat(weighted_vs, dim=-1)
        # weighted_v = [batch size, sentence length, hidden dim]

        output = self.dropout(self.o_w(weighted_v))
        # output = [batch size, sentence length, hidden dim]

        return output, attentions

Position-wise Feed-forward Networks

After attention, a feed-forward network $max(0, xW_{1} + b_{1})W_{2}+b_{2}$ is applied. The structure is the same in every layer, but each layer has its own parameters. The input and output dimension is 512 and the inner dimension of the FFN is 2048.

class PositionWiseFeedForward(nn.Module):
    def __init__(self, params):
        super(PositionWiseFeedForward, self).__init__()
        # nn.Conv1d takes input of shape (N, C, L): N is the batch size, C the number of channels, L the sequence length
        self.conv1 = nn.Conv1d(params.hidden_dim, params.feed_forward_dim, kernel_size=1)
        self.conv2 = nn.Conv1d(params.feed_forward_dim, params.hidden_dim, kernel_size=1)
        init_weight(self.conv1)
        init_weight(self.conv2)
        self.dropout = nn.Dropout(params.dropout)

    def forward(self, x):
        # x = [batch size, sentence length, hidden dim]

        # permute x's indices to apply nn.Conv1d on input 'x'
        x = x.permute(0, 2, 1)                        # x = [batch size, hidden dim, sentence length]
        output = self.dropout(F.relu(self.conv1(x)))  # output = [batch size, feed forward dim, sentence length]
        output = self.conv2(output)                   # output = [batch size, hidden dim, sentence length]

        # permute again to restore output's original indices
        output = output.permute(0, 2, 1)              # output = [batch size, sentence length, hidden dim]
        return self.dropout(output)

Non Autoregressive Neural Machine Translation

2020-02-11T00:00:00+00:00

Distilling the Knowledge in a Neural Network

2020-02-08T00:00:00+00:00

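One-line summary

Uses the probability distribution produced by softmax as the training data for model compression, and introduces a new ensemble technique based on specialist models.

Introduction

Just as an insect stays a larva or becomes an adult depending on its environment, machine learning needs a similar process, because research and development are different. Research does not need to weigh cost much, but real deployment has to consider model size, execution speed, and more. A good way to resolve this is to transfer knowledge from a large model to a small one.

When training a model, the objective is usually defined so as to increase the $\log$ probability of the correct label. In the process the model also assigns probabilities to the incorrect labels, and these probabilities carry a lot of information: given an image of a BMW, for example, the probability of garbage truck is higher than that of carrot. Call the probability labels soft targets and the original labels hard targets. Training a small model on soft targets lets it learn more information per example than training on hard targets, and the variance of the gradient across examples is smaller. The small model can therefore be trained on less data and with a higher learning rate.

The paper goes further and proposes raising the temperature of the softmax function. This means using a slightly modified softmax, $\frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)}$, until the probability vector is sufficiently soft. With this function, one can avoid labels whose probability becomes something like $10^{-8}$ and which therefore carry too little information. This method is called distillation.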

Distillation

\[q_{i} = \frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)}\]

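Here T is called the temperature; an ordinary softmax uses T = 1. A higher T makes the probability distribution softer. The basic distillation procedure first trains the large model and then uses the trained model, with a suitably large T, to produce the soft targets for a transfer set. The small model to be distilled is trained on this transfer set using the same T as the large model. After training, the distilled model's T is set back to 1.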
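The effect of the temperature is easy to see numerically. The sketch below is plain Python; the function name and the example logits are made up for illustration and are not taken from the paper.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (T = 1 is the ordinary softmax)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [10.0, 5.0, 1.0]                     # hypothetical logits
p_hard = softmax_with_temperature(logits, T=1.0)
p_soft = softmax_with_temperature(logits, T=8.0)
```

With T = 1 almost all of the mass sits on the first class, while T = 8 spreads it out, so the relative ranking of the wrong classes becomes visible to the small model.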

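Adding the correct-label information to the soft targets allows more accurate training. To do this, the objective function is a weighted average of the cross entropy with the soft targets and the cross entropy with the correct labels. The best results came from placing a small weight on the cross entropy with the correct labels.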
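A weighted-average objective of this kind might look like the following sketch for a single example. The function names, the weight `alpha = 0.9` on the soft term, and `T = 4.0` are assumptions picked for illustration, not values from the paper.

```python
import math


def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Weighted average of the soft-target and hard-target cross entropies."""
    # Soft term: teacher and student are both softened with the same temperature T.
    soft_targets = softmax(teacher_logits, T)
    soft_loss = cross_entropy(soft_targets, softmax(student_logits, T))
    # Hard term: ordinary cross entropy with the correct label at T = 1.
    hard_targets = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_targets, softmax(student_logits, 1.0))
    # A small weight (1 - alpha) goes on the correct-label term.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The paper also notes that the gradients from the soft targets scale as $1/T^2$, so the soft term is often multiplied by $T^2$; that factor is left out here to mirror the plain weighted average described above.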

Experiment

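In experiments on the MNIST data set, the distilled model did much better than the same model trained without distillation. With many units per layer (around 300), a T of 8 or more worked well; with around 30, T between 2.5 and 4 worked well. The model was even correct on labels it had never seen: a distilled model that was never shown an example of the digit 3 still reached high accuracy on 3s, provided the bias was adjusted properly.

In speech recognition experiments as well, the baseline model's accuracy was 58.9% and an ensemble of 10 such models reached 61.1%, while a single distilled model reached 60.8%.

Ensemble with specialist models

If the cumbersome model and the data are not too large, an ensemble can be handled with the method above; otherwise training takes a long time even when the ensemble is parallelized. This can be addressed by ensembling a global (generalist) model with specialist models that classify only the confusable labels. The drawback is that specialist models overfit heavily, which can be mitigated with soft targets. First the generalist model is trained and its weights are used to initialize the specialist model; the training set is then built with half of the data from the confusable labels and the rest sampled at random from the original data, and the specialist is trained on it. The clusters of confusable labels are obtained by applying the K-means algorithm to the column vectors of the covariance matrix of the generalist model's predictions.

Using soft targets in KD also has a strong regularization effect. According to the paper's experiments, a distilled model trained with hard targets did not differ much in training accuracy from one trained with soft targets, but the soft-target model was far better in test accuracy. The soft-target model also converged on its own, so early stopping was not needed.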

Model Compression

2020-02-08T00:00:00+00:00

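One-line summary

Use an ensemble model to label unlabeled data, then train a small neural net on the labeled data obtained this way.

Introduction

Ensembling several decent models produces a somewhat better model. But that model is too large and too slow to run, so it is hard to use when computing power is limited or real-time response is needed. The authors' approach rests on the assumption that, because neural networks have the universal approximator property, they can approximate any function given enough data; so they generate a large amount of data and train a small model on it. The labels for this data are obtained from an ensemble of existing good models. The neural network can then be expected to perform about as well as the original model, so a small model obtains the effect of a large ensemble of many models. What if even unlabeled data is hard to obtain? As a solution the authors introduce MUNGE, a method that synthesizes data with the same distribution as the training set.

If a fast, compact, and expressive model is trained on enough pseudo data, overfitting will not occur.

The universal approximator property, which is used as the basis for the assumption above, should be taken critically. The property explains what is possible but does not say how. Which concrete function to build is a significant question, and the fact that particular neural network architectures suit particular tasks is evidence that not every neural network becomes a model that solves a given task well.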

Method

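The key question is how to generate the pseudo data. The distribution of the synthesized data has to be similar to that of the original data for the target function to be inferred accurately. Three methods are used for this: RANDOM, Naive Bayes Estimation, and MUNGE.

RANDOM

Each attribute is treated as following its own marginal distribution, assumed to be normal, uniform, or similar, and data is synthesized from it. This is simplistic, and the conditional relationships between attributes are lost. The samples also end up spread broadly compared with the original data distribution.

NBE (Naive Bayes Estimation)

Estimate the joint distribution and sample from it. The joint distribution can be estimated with a mixture model, but it can also be obtained simply with NBE by assuming that the attributes are conditionally independent. The paper uses NBE, which performed about as well as a Bayesian network trained on the same data.

MUNGE

This is the algorithm the paper proposes for approximating the full joint distribution, and it is simple. Pick one example from the data and find its nearest vector by Euclidean distance. Call the two vectors $a, b$, and let $sd = \frac {|a[i] - b[i]|}{s}$.

For each attribute, with probability $p$, set $a[i] = norm(b[i], sd), b[i] = norm(a[i], sd)$, where $norm(\mu, \sigma)$ is a sample from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Do this for every example, then add the resulting data to the original data.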
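The procedure above can be sketched for purely continuous attributes as follows. This is a simplified reading of the description in this post, not the paper's reference implementation: the function name, the default values of `p` and `s`, the `seed` argument, and the toy data are all assumptions, and handling of nominal attributes is left out.

```python
import random


def munge(data, p=0.5, s=1.0, seed=0):
    """Return one synthetic copy of data (needs at least two rows)."""
    rng = random.Random(seed)
    rows = [list(r) for r in data]            # work on a copy of the data
    for idx, a in enumerate(rows):
        # Nearest neighbour of a by (squared) Euclidean distance, excluding a itself.
        j = min(
            (k for k in range(len(rows)) if k != idx),
            key=lambda k: sum((x - y) ** 2 for x, y in zip(a, rows[k])),
        )
        b = rows[j]
        for i in range(len(a)):
            if rng.random() < p:
                sd = abs(a[i] - b[i]) / s     # local standard deviation
                # Each value is redrawn around the other example's value.
                a[i], b[i] = rng.gauss(b[i], sd), rng.gauss(a[i], sd)
    return rows


data = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]   # toy data
synthetic = munge(data, p=0.5, s=1.0)
augmented = data + synthetic                  # add the synthetic rows to the original data
```

A smaller `s` makes the local standard deviation larger and the synthetic points more spread out; `p` controls how many attributes are perturbed per example.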

A plot comparing the real two-dimensional data with the output of the three data-generation models above follows.

Conclusion

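According to the paper's experiments, MUNGE did not work as well as hoped on complex data, and when plenty of real data is available it is better to use that. Model compression worked as expected: performance dropped slightly, while running time and size shrank by a factor of about 1000.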
