*****************************************************************
AlphaGo Movie: https://www.youtube.com/watch?v=WXuK6gekU1Y
Trailer: https://www.youtube.com/watch?v=8tq1C8spV_g
Hide and seek example: https://www.youtube.com/watch?v=kopoLzvh5jY
*****************************************************************
Discussion - How should we build AI: try to understand it, vs. let it learn?
Self-driving cars? Delivery robots? Domestic robots?
We could be talking about AI research, or what you do for a living.
*****************************************************************
Reinforcement learning
"Good dog!" / "Bad dog!" - learning from reward and punishment.
*****************************************************************
Pavlov
Behaviorism vs. cognitive science
*****************************************************************
Bandit problem
Learn a policy - which game to play, or what to eat.
(A small bandit code sketch appears at the end of these notes.)
*****************************************************************
Bandit problem with states/context, but no dependence on past history
Learn a policy - a mapping from state to action.
First learn a model of the reward function: Q(s,a) = predicted reward.
Policy comes from max_a Q(s,a).
*****************************************************************
Bandit problem with states and time (dependencies on past history)
Credit assignment: which earlier actions deserve credit for the current reward?
What is a state?
Greedy policy - max_a L(s,a) - ignores the effect of actions on the future.
*****************************************************************
Dynamic programming: considering all possible paths explicitly is not necessary -
work backward in time.
Learn a value function V(s) - it encodes the value of the possible future.
V(s) = max_a (L(s,a) + V(s_next)) = max_a (L(s,a) + V(F(s,a)))
Need to know the dynamics s_next = F(s,a).
One can learn a model of the dynamics using function approximation.
Will this converge?
RL becomes optimal control.
(Value iteration sketch at the end of these notes.)
*****************************************************************
Avoiding learning a model
Learn a Q function Q(s,a).
See a transition s -> s_next (so we don't need a model):
Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(L(s,a) + discount*max_a' Q(s_next,a'))
Need an exploration policy:
trade off exploitation (do what you think is best) vs. exploration (try new things).
Inefficient - takes lots of simulated or real experience.
(Q-learning sketch at the end of these notes.)
*****************************************************************
Optimize a parameterized policy(s,p), where p is a set of parameters.
Can use gradient descent on dcost(p)/dp, where cost(p) is the cost of using
the policy from a number of initial states for some period of time.
(Policy gradient sketch at the end of these notes.)
*****************************************************************
Networks with internal state
The previous networks were transformations, mappings, or functions with no internal state.
LSTMs
Recurrent networks
*****************************************************************
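Code sketch: bandit problem (epsilon-greedy)

A minimal Python sketch of the bandit setting above: no states, just repeated choices among a few options with unknown average payoffs. The payoff values, the Gaussian noise, and the epsilon value are made up for illustration; the notes do not prescribe a particular exploration rule, epsilon-greedy is just one common choice.

import numpy as np

true_means = [0.2, 0.5, 0.8]          # unknown to the learner (hypothetical)
n_arms = len(true_means)
epsilon = 0.1                         # fraction of the time we explore

rng = np.random.default_rng(0)
counts = np.zeros(n_arms)
estimates = np.zeros(n_arms)          # running average reward per arm

for t in range(5000):
    if rng.random() < epsilon:
        a = rng.integers(n_arms)      # explore: try a random arm
    else:
        a = int(np.argmax(estimates)) # exploit: play the best-looking arm
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean update

print("estimated means:", estimates)  # should approach the true means
print("best arm:", int(np.argmax(estimates)))
*****************************************************************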
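Code sketch: dynamic programming / value iteration

A sketch of V(s) = max_a (L(s,a) + V(F(s,a))) from the dynamic programming section, assuming a tiny deterministic chain problem with known dynamics F and known reward L; the chain, the goal reward, and the discount factor are illustrative choices (the discount is borrowed from the Q-learning update above so the values converge).

import numpy as np

n_states = 5          # states 0..4 along a chain; state 4 is the goal (hypothetical)
actions = [-1, +1]    # move left or right

def F(s, a):
    """Known dynamics: next state, clipped to the ends of the chain."""
    return min(max(s + a, 0), n_states - 1)

def L(s, a):
    """Known reward: +1 for arriving at the goal, 0 otherwise."""
    return 1.0 if F(s, a) == n_states - 1 else 0.0

discount = 0.9
V = np.zeros(n_states)

# Repeated backward sweeps: V(s) = max_a (L(s,a) + discount * V(F(s,a)))
for sweep in range(200):
    V_new = np.array([max(L(s, a) + discount * V[F(s, a)] for a in actions)
                      for s in range(n_states)])
    done = np.max(np.abs(V_new - V)) < 1e-6
    V = V_new
    if done:
        break

# Greedy policy read off from the converged value function
policy = [max(actions, key=lambda a: L(s, a) + discount * V[F(s, a)])
          for s in range(n_states)]
print("V:", V)
print("policy:", policy)
*****************************************************************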
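Code sketch: tabular Q-learning (no model)

A sketch of the Q update from the "avoiding learning a model" section. The learner only sees sampled transitions s -> s_next and rewards; it never calls the dynamics or reward model directly. The chain environment, step limits, and constants (alpha, discount, epsilon) are hypothetical, chosen just to make the update rule concrete.

import numpy as np

n_states, n_actions = 5, 2           # chain of 5 states; actions 0=left, 1=right
alpha, discount, epsilon = 0.1, 0.9, 0.1

def step(s, a):
    """Simulator standing in for the real world: returns (reward, s_next)."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return reward, s_next

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

s = 0
for t in range(20000):
    # Epsilon-greedy exploration: mostly exploit, sometimes try something new.
    if rng.random() < epsilon:
        a = rng.integers(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    reward, s_next = step(s, a)
    # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(L(s,a) + discount*max_a' Q(s_next,a'))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + discount * np.max(Q[s_next]))
    s = 0 if s_next == n_states - 1 else s_next   # restart the episode at the goal

print(Q)
print("greedy policy:", np.argmax(Q, axis=1))
*****************************************************************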
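Code sketch: optimizing a parameterized policy

A sketch of gradient descent on dcost(p)/dp from the parameterized-policy section. A linear policy a = -p*s drives a 1D point toward the origin; cost(p) averages a quadratic rollout cost over several initial states. The dynamics, cost, policy form, and the use of a finite-difference estimate for dcost/dp are all illustrative assumptions; the notes do not say how the gradient is computed.

import numpy as np

def rollout_cost(p, s0, horizon=20):
    """Run the policy a = -p*s from s0 and sum a quadratic state/action cost."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -p * s
        total += s**2 + 0.1 * a**2
        s = s + a                      # simple known dynamics: s_next = s + a
    return total

initial_states = [-2.0, -1.0, 1.0, 2.0]   # "a number of initial states"

def cost(p):
    return np.mean([rollout_cost(p, s0) for s0 in initial_states])

p, lr, eps = 0.0, 1e-3, 1e-4
for it in range(1000):
    # Finite-difference estimate of dcost(p)/dp, then a gradient descent step.
    grad = (cost(p + eps) - cost(p - eps)) / (2 * eps)
    p -= lr * grad

print("learned gain p =", p, "cost =", cost(p))
*****************************************************************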