Course Notes

CECS 477 : Neural Networks

Lecture of February 16, 2000

By Jim Ries, JimR@acm.org

 

Reinforcement Learning

Markov Decision Process

t=0,1,2,…

s ∈ S = {1, 2, …}

given state i, choose action u ∈ U(i)

i → U(i)

i → cost; our task is to minimize total cost

or

i → reward; our task is to maximize total reward

given i, u → j (new state) with probability Pij(u)

where Pij(u) is the probability of going from state i to j given action u
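
As a concrete illustration (a hypothetical sketch, not from the lecture), the pieces above can be written down as a tiny Python MDP; the states, actions, costs, and transition probabilities are made-up values:

# Tiny example MDP (hypothetical numbers, for illustration only).
# U(i): actions available in state i.
U = {1: ["stay", "go"], 2: ["stay"]}

# ci(u): immediate cost of taking action u in state i.
cost = {(1, "stay"): 1.0, (1, "go"): 2.0, (2, "stay"): 0.0}

# Pij(u): P[(i, u)][j] = probability of moving from i to j under action u.
P = {
    (1, "stay"): {1: 0.9, 2: 0.1},
    (1, "go"):   {1: 0.2, 2: 0.8},
    (2, "stay"): {2: 1.0},
}

The later sketches in these notes assume dictionaries shaped like U, cost, and P.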

action policy:

μ: i → u

Jμ(i) = value estimate of policy μ starting from state i

infinite horizon: Jμ(i) = Eμ[ Σt=0..∞ γ^t ct | s0 = i ]

ct = cost in time step t

s0 = start state

γ = discount factor (the further a cost or reward lies in the future, the less it matters)

0 < γ ≤ 1 (when γ < 1, the weight γ^t decreases with time)

The finite-horizon case is the same, but the sum runs only to a limit M, so the discount factor can be 1 without problems.

Sometimes, we want the average cost, e.g.,

Eμ[ Σt=0..M γ^t ct | s0 = i ] / M

(or the lim as M→∞ of the above in the infinite-horizon case)

We use dynamic programming to find the optimal policy. Reinforcement learning is a variant of dynamic programming.

value iteration

policy iteration

offline - traditional dynamic programming

online - reinforcement learning

Bellman equation

Jk+1(i) = min u∈U(i) [ ci(u) + γ Σj∈S Pij(u) Jk(j) ] = min u∈U(i) Qk(i,u)  (the bracketed term is Qk(i,u))

The above equation without the k subscripts is the Bellman equation.

greedy policy

"one-step backup" - uses next state to update estimate of current state.

Policy Iteration

J(i) = ci(u) + γ Σj∈S Pij(u) J(j)

where u is the action chosen by the policy, u = μ(i)

update policy, go back, …

algorithm:

init current policy
while current policy ≠ next policy
    learn value function of policy
    update policy (greedy actions based on current learned value function)
endwhile

The process will converge to the optimal policy.
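
A minimal sketch of this loop, assuming the toy MDP dictionaries above; here the policy's value function is learned by iterative sweeps rather than by solving the linear system exactly (policy_iteration and eval_sweeps are illustrative names):

# Policy iteration: evaluate the current policy, then act greedily with respect
# to its value function; repeat until the policy stops changing.
def policy_iteration(U, cost, P, gamma=0.9, eval_sweeps=200):
    policy = {i: U[i][0] for i in U}                # init current policy
    while True:
        # learn value function of the current policy:
        #   J(i) = ci(u) + gamma * sum_j Pij(u)*J(j), with u = policy(i)
        J = {i: 0.0 for i in U}
        for _ in range(eval_sweeps):
            J = {i: cost[(i, policy[i])]
                    + gamma * sum(p * J[j] for j, p in P[(i, policy[i])].items())
                 for i in U}
        # update policy: greedy actions based on the learned value function
        new_policy = {i: min(U[i],
                             key=lambda u, i=i: cost[(i, u)]
                             + gamma * sum(p * J[j] for j, p in P[(i, u)].items()))
                      for i in U}
        if new_policy == policy:                    # current policy == next policy
            return policy, J
        policy = new_policy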

Claim 1: Only policies that are greedy with regard to their own value function are optimal policies.

QJ*(i, μ*(i)) = min u∈U(i) QJ*(i,u)

μ* is the optimal policy

J* is the value function of μ*

Claim 2: A necessary and sufficient condition for a value function to be the value function of the optimal policy is:

J(i) = min u∈U(i) [ ci(u) + γ Σj∈S Pij(u) J(j) ] (Bellman equation)

Thus, a solution to Bellman's equation leads to an optimal policy.

Synchronous update: Jk+1(i) = min u Qk(i,u) (update all states in parallel)

Asynchronous update: update each state iteratively.

Both converge.
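
A minimal Python sketch of value iteration using the Bellman backup above, assuming MDP dictionaries shaped like the toy example earlier (value_iteration and its parameters are illustrative names). Each sweep here is a synchronous update; updating J[i] in place one state at a time would give the asynchronous variant:

# Value iteration: Jk+1(i) = min over u in U(i) of [ ci(u) + gamma * sum_j Pij(u)*Jk(j) ]
def value_iteration(U, cost, P, gamma=0.9, sweeps=100):
    J = {i: 0.0 for i in U}                         # initial value estimates
    for _ in range(sweeps):
        # synchronous sweep: new values for all states are computed from the old J
        J = {i: min(cost[(i, u)]
                    + gamma * sum(p * J[j] for j, p in P[(i, u)].items())
                    for u in U[i])
             for i in U}
    # greedy policy with respect to the final value function
    policy = {i: min(U[i],
                     key=lambda u, i=i: cost[(i, u)]
                     + gamma * sum(p * J[j] for j, p in P[(i, u)].items()))
              for i in U}
    return J, policy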

Real-time Dynamic Programming (an instance of on-line)

Random exploration

Prob(i,a) = e^(Q(i,a)/τ) / Σa' e^(Q(i,a')/τ)

τ = "temperature"; τ starts very high and settles ("cools") down after a while, so action selection starts out random and becomes more deterministic over time.
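
A small sketch of this Boltzmann (softmax) exploration rule; boltzmann_action and q_of_a are illustrative names, not from the lecture:

import math
import random

# Prob(i,a) = e^(Q(i,a)/tau) / sum over a' of e^(Q(i,a')/tau).
# q_of_a maps each action available in the current state to its Q-value.
# As written this favours large Q; if Q is a cost to be minimized,
# e^(-Q/tau) would be used instead.
def boltzmann_action(q_of_a, tau):
    actions = list(q_of_a)
    weights = [math.exp(q_of_a[a] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

τ would start high (nearly uniform, random choices) and be cooled toward a small value, for example tau = max(0.01, 0.99 * tau) after each step, so the choice becomes more deterministic over time.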

Dynamic Programming with incomplete information

Non-Bayesian: indirect estimate. The direct estimate is Q-learning (model-free RL).

Barto et al., 1990

indirect estimate

Pij(u) = ηij(u) / Σl∈S ηil(u)

where ηij(u) is the number of observed transitions from i to j after action u

(maximum likelihood estimate)
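
A sketch of this counting estimate; counts, record_transition, and estimate_P are illustrative names:

from collections import defaultdict

# Indirect (maximum-likelihood) estimate of the transition model from observed
# transitions: Pij(u) = count(i, u, j) / sum over l of count(i, u, l).
counts = defaultdict(int)                 # counts[(i, u, j)] plays the role of ηij(u)

def record_transition(i, u, j):
    counts[(i, u, j)] += 1

def estimate_P(i, u, states):
    total = sum(counts[(i, u, l)] for l in states)
    if total == 0:
        return {j: 1.0 / len(states) for j in states}   # no data yet: uniform guess
    return {j: counts[(i, u, j)] / total for j in states}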

Q-Learning

Off-line Q-learning

Adaptive Q-learning

Real-time Q-learning (also just called "Q-learning"; original Q-learning)

Off-line Q-learning

Qk+1(i,u) = (1 − αk(i,u)) Qk(i,u) + αk(i,u) [ ci(u) + γ · min over u' of Q(succ(i,u), u') ]

where αk(i,u) is the learning rate; rewritten as:

ΔQk+1(i,u) = αk(i,u) [ ci(u) + γ · min over u' of Q(succ(i,u), u') − Qk(i,u) ]

So, since we update many times and learn only a little (α) each time, we can sample many successors, and the estimate comes to reflect the most probable ones.

In real-time Q-learning, we have no model of the world, so we simply get the successor from the real world (prevailing conditions).

α should decrease over time. If each (state, action) pair is updated infinitely often and α decreases over time, the algorithm will converge with probability 1.
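
A sketch of one such backup in the delta form above; q_update and its arguments are illustrative names:

# One real-time Q-learning backup:
#   Q(i,u) += alpha * [ ci(u) + gamma * min over u' of Q(j,u') - Q(i,u) ]
# where j is the successor state actually observed in the world.
# Q is a lookup table Q[(state, action)]; U[j] lists the actions available in j.
def q_update(Q, U, i, u, c, j, alpha=0.1, gamma=0.9):
    target = c + gamma * min(Q[(j, u2)] for u2 in U[j])
    Q[(i, u)] += alpha * (target - Q[(i, u)])
    return Q[(i, u)]

A schedule such as alpha = 1 / (1 + number of visits to (i, u)) is one common way to make the learning rate decrease over time, as the convergence condition above requires.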

Q-Learning                            | Value Iteration
--------------------------------------|--------------------------------
more space                            | only need value of each state
update value many times               | one-step updates
each step fast (good for real-time)   | each step slow (due to the Σ)

Real-time and off-line learning can be mixed: "contemplate" after actions are learned.

Reference:

Barto, A.G., Bradtke, S.J., and Singh, S.P. (1995) Learning to act using real-time dynamic programming, Artificial Intelligence 72: 81–138.

(see http://www.umass.edu/neuro/faculty/barto.html for more on Andrew Barto)

Storing Value functions (Q or J)

Exact Method

Lookup table representation, i.e., a table with one row per (state, action) pair:

State | Action | Q-value

Approximate methods:

 

Paper presented by Jonathon Marjamaa

"Self-Improving Reactive Agents Based On Reinforcement Learning, Planning, and Teaching"

By Long-Ji Lin, Carnegie Mellon University 1992

http://riesj.hmi.missouri.edu/CECS477/RL/