*****************************************************************
AlphaGo Movie: https://www.youtube.com/watch?v=WXuK6gekU1Y
Trailer: https://www.youtube.com/watch?v=8tq1C8spV_g
Hide and seek example: https://www.youtube.com/watch?v=kopoLzvh5jY
*****************************************************************
Discussion - How should we build AI: try to understand it, vs. let it learn?
Self-driving cars? Delivery robots? Domestic robots?
We could be talking about AI research, or what you do for a living.
*****************************************************************
Reinforcement learning
"Good dog!" / "Bad dog!" - learning from reward and punishment.
*****************************************************************
Pavlov
Behaviorism vs. cognitive science
*****************************************************************
Bandit problem
Learn a policy - which game to play, or what to eat.
(A small bandit code sketch appears at the end of these notes.)
*****************************************************************
Bandit problem with states/context, but no dependence on past history
Learn a policy - a mapping from state to action.
First learn a model of the reward function: Q(s,a) = predicted reward.
Policy comes from max_a Q(s,a).
*****************************************************************
Bandit problem with states and time (dependencies on past history)
Credit assignment: which earlier actions deserve credit for the current reward?
What is a state?
Greedy policy - max_a L(s,a) - ignores the effect of actions on the future.
*****************************************************************
Dynamic programming: considering all possible paths explicitly is not necessary -
work backward in time.
Learn a value function V(s) - it encodes the value of the possible future.
V(s) = max_a (L(s,a) + V(s_next)) = max_a (L(s,a) + V(F(s,a)))
Need to know the dynamics s_next = F(s,a).
One can learn a model of the dynamics using function approximation.
Will this converge?
RL becomes optimal control.
(Value iteration sketch at the end of these notes.)
*****************************************************************
Avoiding learning a model
Learn a Q function Q(s,a).
See a transition s -> s_next (so we don't need a model):
Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(L(s,a) + discount*max_a' Q(s_next,a'))
Need an exploration policy:
trade off exploitation (do what you think is best) vs. exploration (try new things).
Inefficient - takes lots of simulated or real experience.
(Q-learning sketch at the end of these notes.)
*****************************************************************
Optimize a parameterized policy(s,p), where p is a set of parameters.
Can use gradient descent on dcost(p)/dp, where cost(p) is the cost of using
the policy from a number of initial states for some period of time.
(Policy gradient sketch at the end of these notes.)
*****************************************************************
Networks with internal state
The previous networks were transformations, mappings, or functions with no internal state.
LSTMs
Recurrent networks
*****************************************************************
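Code sketch: bandit problem (epsilon-greedy)

A minimal Python sketch of the bandit setting above: no states, just repeated choices among a few options with unknown average payoffs. The payoff values, the Gaussian noise, and the epsilon value are made up for illustration; the notes do not prescribe a particular exploration rule, epsilon-greedy is just one common choice.

import numpy as np

true_means = [0.2, 0.5, 0.8]          # unknown to the learner (hypothetical)
n_arms = len(true_means)
epsilon = 0.1                         # fraction of the time we explore

rng = np.random.default_rng(0)
counts = np.zeros(n_arms)
estimates = np.zeros(n_arms)          # running average reward per arm

for t in range(5000):
    if rng.random() < epsilon:
        a = rng.integers(n_arms)      # explore: try a random arm
    else:
        a = int(np.argmax(estimates)) # exploit: play the best-looking arm
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean update

print("estimated means:", estimates)  # should approach the true means
print("best arm:", int(np.argmax(estimates)))
*****************************************************************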
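Code sketch: dynamic programming / value iteration

A sketch of V(s) = max_a (L(s,a) + V(F(s,a))) from the dynamic programming section, assuming a tiny deterministic chain problem with known dynamics F and known reward L; the chain, the goal reward, and the discount factor are illustrative choices (the discount is borrowed from the Q-learning update above so the values converge).

import numpy as np

n_states = 5          # states 0..4 along a chain; state 4 is the goal (hypothetical)
actions = [-1, +1]    # move left or right

def F(s, a):
    """Known dynamics: next state, clipped to the ends of the chain."""
    return min(max(s + a, 0), n_states - 1)

def L(s, a):
    """Known reward: +1 for arriving at the goal, 0 otherwise."""
    return 1.0 if F(s, a) == n_states - 1 else 0.0

discount = 0.9
V = np.zeros(n_states)

# Repeated backward sweeps: V(s) = max_a (L(s,a) + discount * V(F(s,a)))
for sweep in range(200):
    V_new = np.array([max(L(s, a) + discount * V[F(s, a)] for a in actions)
                      for s in range(n_states)])
    done = np.max(np.abs(V_new - V)) < 1e-6
    V = V_new
    if done:
        break

# Greedy policy read off from the converged value function
policy = [max(actions, key=lambda a: L(s, a) + discount * V[F(s, a)])
          for s in range(n_states)]
print("V:", V)
print("policy:", policy)
*****************************************************************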
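Code sketch: tabular Q-learning (no model)

A sketch of the Q update from the "avoiding learning a model" section. The learner only sees sampled transitions s -> s_next and rewards; it never calls the dynamics or reward model directly. The chain environment, step limits, and constants (alpha, discount, epsilon) are hypothetical, chosen just to make the update rule concrete.

import numpy as np

n_states, n_actions = 5, 2           # chain of 5 states; actions 0=left, 1=right
alpha, discount, epsilon = 0.1, 0.9, 0.1

def step(s, a):
    """Simulator standing in for the real world: returns (reward, s_next)."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return reward, s_next

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

s = 0
for t in range(20000):
    # Epsilon-greedy exploration: mostly exploit, sometimes try something new.
    if rng.random() < epsilon:
        a = rng.integers(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    reward, s_next = step(s, a)
    # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(L(s,a) + discount*max_a' Q(s_next,a'))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + discount * np.max(Q[s_next]))
    s = 0 if s_next == n_states - 1 else s_next   # restart the episode at the goal

print(Q)
print("greedy policy:", np.argmax(Q, axis=1))
*****************************************************************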
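Code sketch: optimizing a parameterized policy

A sketch of gradient descent on dcost(p)/dp from the parameterized-policy section. A linear policy a = -p*s drives a 1D point toward the origin; cost(p) averages a quadratic rollout cost over several initial states. The dynamics, cost, policy form, and the use of a finite-difference estimate for dcost/dp are all illustrative assumptions; the notes do not say how the gradient is computed.

import numpy as np

def rollout_cost(p, s0, horizon=20):
    """Run the policy a = -p*s from s0 and sum a quadratic state/action cost."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -p * s
        total += s**2 + 0.1 * a**2
        s = s + a                      # simple known dynamics: s_next = s + a
    return total

initial_states = [-2.0, -1.0, 1.0, 2.0]   # "a number of initial states"

def cost(p):
    return np.mean([rollout_cost(p, s0) for s0 in initial_states])

p, lr, eps = 0.0, 1e-3, 1e-4
for it in range(1000):
    # Finite-difference estimate of dcost(p)/dp, then a gradient descent step.
    grad = (cost(p + eps) - cost(p - eps)) / (2 * eps)
    p -= lr * grad

print("learned gain p =", p, "cost =", cost(p))
*****************************************************************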