Description

Recurrent Neural Network


[Vanilla RNN for parity function: 4 points]

Let us define a sequence parity function as a function that takes in a sequence of binary inputs and returns a sequence indicating the parity of the 1’s seen so far; specifically, if at time t the number of 1’s in the input so far is even it returns 1, and 0 if it is odd. For example, given the input sequence [0; 1; 0; 1; 1; 0], the parity sequence is [1; 0; 0; 1; 0; 0]. Let x_{t} denote the input at time t and y_{t} the boolean output of this parity function at time t.
Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details of the activations for the different inputs x_{t} (0 or 1).
Hint: Recall that we have implemented AND and OR functions in problem set 2, which may be useful here.
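The target function can be sketched in a few lines of Python. This reference implementation (a plain loop, not an RNN) is useful for checking the outputs of the network you design:

```python
def parity_sequence(bits):
    """Return the running-parity sequence: 1 if the number of 1's
    seen so far (inclusive of the current bit) is even, else 0."""
    out = []
    ones = 0
    for b in bits:
        ones += b
        out.append(1 if ones % 2 == 0 else 0)
    return out

# Example from the problem statement:
# parity_sequence([0, 1, 0, 1, 1, 0]) -> [1, 0, 0, 1, 0, 0]
```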


[LSTM for parity function: 4 points]

Let us recall the different gates in an LSTM network. The first gate is the “forget gate layer”:

    f_{t} = σ(W_{f} · [h_{t-1}; x_{t}] + b_{f})        (1)

where f_{t} is the output of the forget gate, W_{f} is the weight matrix, h_{t-1} is the hidden state at step t-1, x_{t} is the current input, and b_{f} is the bias value.
Next we have the “input gate layer”:

    i_{t} = σ(W_{i} · [h_{t-1}; x_{t}] + b_{i})        (2)

    C̃_{t} = tanh(W_{C} · [h_{t-1}; x_{t}] + b_{C})        (3)

where i_{t} decides which values we will update and C̃_{t} are the new candidate values that could be added to the cell state. Next we compute the new cell state:

    C_{t} = f_{t} ⊙ C_{t-1} + i_{t} ⊙ C̃_{t}        (4)
Finally, we have the output gate and the hidden state:

    o_{t} = σ(W_{o} · [h_{t-1}; x_{t}] + b_{o})        (5)

    h_{t} = o_{t} ⊙ tanh(C_{t})        (6)
Design an LSTM network for the bit parity problem mentioned in Question 1. Specifically, given W_{f} = [0; 1], provide values for b_{f}, W_{i}, b_{i}, W_{C}, b_{C}, W_{o}, and b_{o} such that the cell state C_{t} stores the parity of the bit string. Please mention any assumptions you make. For this problem, you may assume the following step approximations for the sigmoid and tanh functions:







    σ(x) = 1 if x > 0; 0 otherwise        (7)

    tanh(x) = 1 if x > 0; 0 if x = 0; -1 if x < 0        (8)






Hint: Recall that XOR of x and y can be represented as (x ∧ ¬y) ∨ (¬x ∧ y). Think about how you can leverage this information for equation (4).
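To sanity-check a candidate set of weights, one can simulate a single LSTM step under the step activations of equations (7) and (8). This is a minimal sketch; the weights in the usage example are arbitrary placeholders for illustration, not the assignment's answer:

```python
import numpy as np

def step_sigmoid(x):
    # Step sigmoid from Eq. (7): 1 if x > 0, else 0.
    return (x > 0).astype(float)

def step_tanh(x):
    # Step tanh from Eq. (8): sign of x (1, 0, or -1).
    return np.sign(x).astype(float)

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step (Eqs. 1-6) with the step activations above.
    params = (W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o); each W acts on
    the concatenated vector [h_{t-1}; x_t]."""
    v = np.array([h_prev, x_t])
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    f = step_sigmoid(W_f @ v + b_f)           # forget gate, Eq. (1)
    i = step_sigmoid(W_i @ v + b_i)           # input gate, Eq. (2)
    C_tilde = step_tanh(W_C @ v + b_C)        # candidate values, Eq. (3)
    C = f * C_prev + i * C_tilde              # new cell state, Eq. (4)
    o = step_sigmoid(W_o @ v + b_o)           # output gate, Eq. (5)
    h = o * step_tanh(C)                      # hidden state, Eq. (6)
    return h, C

# Placeholder weights (NOT the answer) just to exercise the function:
params = (np.array([0., 1.]), -0.5,   # W_f, b_f
          np.array([0., 1.]), -0.5,   # W_i, b_i
          np.array([1., 1.]), -0.5,   # W_C, b_C
          np.array([0., 0.]),  1.0)   # W_o, b_o
h1, C1 = lstm_step(x_t=1.0, h_prev=0.0, C_prev=1.0, params=params)
```

Feeding a bit string through `lstm_step` one bit at a time lets you verify whether your chosen weights keep the parity in C_{t}.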

[When to stop in beam search?: 4 points]
Beam search is a technique widely used to improve the output text quality of neural text generation. But it is difficult to decide when to stop beam search so as to obtain optimality, because hypotheses can finish at different steps. Assume an input x from which we generate an output sentence y, which is a completed hypothesis:






    y* = argmax_{y : comp(y)} ∏_{i ≤ |y|} p(y_{i} | x, y_{<i})        (9)
where y_{<i} is a shorthand notation for y_{0} y_{1} ... y_{i-1} and comp(y) is shorthand for a completed hypothesis. We say that a hypothesis y is completed (comp(y)) if its last word is </s>, i.e.,

    comp(y) ≜ (y_{|y|} = </s>)

in which case it will not be further expanded.
Given that B_{i-1} is a b-length ordered list consisting of pairs ⟨y, s⟩ of a current prefix y and its accumulated score (product of probabilities) s, we use beam search to approximately find the best output y*. Beam search can be defined as an operator top^{b} that selects the top b scoring items:





    B_{i} = top^{b}{ ⟨y' ∘ y_{i}, s · p(y_{i} | x, y')⟩ | ⟨y', s⟩ ∈ B_{i-1} }        (10)




Let us define a beam search algorithm that always returns the optimal-score completed hypothesis (modulo beam size) and finishes as soon as optimality is reached. Let best_{i} be the best completed hypothesis so far up to step i, i.e.,






    best_{i} = max{ y ∈ ∪_{j ≤ i} B_{j} | comp(y) }        (11)





We update it every time we find a completed hypothesis (if there is none yet, it remains undefined). Prove that at any step i, if best_{i} is defined and the highest-scoring item in the current beam B_{i} scores worse than or equal to best_{i}, then best_{i} is the overall highest-probability completed hypothesis, and future steps will be no better than it.
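The stopping rule in question can be exercised on a toy example. The sketch below assumes a hypothetical `toy_lm` distribution (purely illustrative, not from any paper): it tracks best_{i} and returns as soon as the top of the beam can no longer beat it, since accumulated scores only shrink as the product gets longer:

```python
def beam_search(step_probs, b=2, eos="</s>", max_steps=10):
    """Toy beam search with the optimal-stopping rule.
    `step_probs(prefix)` returns a dict {token: p(token | x, prefix)}."""
    beam = [((), 1.0)]   # pairs <y, s> of prefix and accumulated score
    best = None          # best completed hypothesis so far (best_i)
    for _ in range(max_steps):
        # Expand every prefix in the beam by one token (Eq. 10).
        cands = []
        for y, s in beam:
            for tok, p in step_probs(y).items():
                cands.append((y + (tok,), s * p))
        cands.sort(key=lambda pair: pair[1], reverse=True)
        beam = []
        for y, s in cands:
            if y[-1] == eos:                  # completed: update best_i
                if best is None or s > best[1]:
                    best = (y, s)
            elif len(beam) < b:               # keep top-b open prefixes
                beam.append((y, s))
        # Stopping criterion: nothing left in the beam scores better
        # than the best completed hypothesis, so stop early.
        if best is not None and (not beam or beam[0][1] <= best[1]):
            return best
    return best

def toy_lm(prefix):
    # Hypothetical two-token language model used only for illustration.
    if len(prefix) < 2:
        return {"a": 0.6, "</s>": 0.4}
    return {"</s>": 1.0}
```

Running `beam_search(toy_lm, b=2)` stops at step 2: the surviving prefix scores 0.36, which cannot overtake the completed hypothesis of score 0.4.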

[Exploding Gradients: 4 points; Extra credit for 4803]
Learning long-term dependencies in recurrent networks suffers from a particular numerical challenge: gradients propagated over many timesteps tend to either ‘vanish’ (i.e., converge to 0; frequent) or ‘explode’ (i.e., diverge to infinity; rare, but more damaging to the optimization). To study this problem in a simple setting, consider the following recurrence relation without any nonlinear activation function or input x:








    h_{t} = W^{⊤} h_{t-1}        (12)







where W is a weight matrix shared across all timesteps t. Let λ_{1}, ..., λ_{n} be the eigenvalues of the weight matrix W ∈ C^{n×n}. Its spectral radius ρ(W) is defined as:







    ρ(W) = max{|λ_{1}|, ..., |λ_{n}|}        (13)






Assuming the initial hidden state is h_{0}, write the relation between h_{T} and h_{0} and explain the role of the eigenvalues of W in determining the ‘vanishing’ or ‘exploding’ behavior as T ≫ 0.
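The role of the spectral radius can be observed numerically. A minimal sketch, using diagonal matrices so the eigenvalues are explicit (the specific values are arbitrary assumptions for illustration):

```python
import numpy as np

def hidden_norms(W, h0, T):
    """Iterate h_t = W.T @ h_{t-1} for T steps (Eq. 12) and record the
    norm of the hidden state at each step."""
    h = h0
    norms = []
    for _ in range(T):
        h = W.T @ h
        norms.append(np.linalg.norm(h))
    return norms

# Diagonal W makes the eigenvalues explicit: rho(W) = max |lambda_i|.
W_shrink = np.diag([0.5, 0.3])   # rho(W) < 1: hidden state vanishes
W_grow   = np.diag([1.5, 0.3])   # rho(W) > 1: hidden state explodes
h0 = np.ones(2)

print(hidden_norms(W_shrink, h0, 20)[-1])   # tiny (~0.5**20)
print(hidden_norms(W_grow, h0, 20)[-1])     # large (~1.5**20)
```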

[Paper Review: 4 points]
Explainability is an increasingly important area of deep learning research today. The following paper, from CVPR 2018, introduces a unique multimodal approach to explainability by joint textual rationale generation and attention visualization in the task of visual question answering (VQA). This is a major improvement over preexisting unimodal explainable models, with the results showing complementary explanatory strengths from the visual and textual explanations. The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.

Coding: Sequence models for image captioning [55 regular points for CS7643, 45 regular points for CS4803]
The coding part of this assignment will consist of implementation of sequence models for captioning images. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/