RL 01

Let's say we move from (0, 0) to (1, 0). The next state (1, 0) is stored at q_table[0][1], since the table is indexed [y][x], so we take the max over all of its actions:

max(Q((1,0), all actions)) =

max(q_table[0][1][0],  # ↑
    q_table[0][1][1],  # ↓
    q_table[0][1][2],  # ←
    q_table[0][1][3])  # →

The full 6×6 grid of states, indexed as [y][x]:

[0][0] [0][1] [0][2] [0][3] [0][4] [0][5]
[1][0] [1][1] [1][2] [1][3] [1][4] [1][5]
[2][0] [2][1] [2][2] [2][3] [2][4] [2][5]
[3][0] [3][1] [3][2] [3][3] [3][4] [3][5]
[4][0] [4][1] [4][2] [4][3] [4][4] [4][5]
[5][0] [5][1] [5][2] [5][3] [5][4] [5][5]
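This max over the next state's actions is the lookahead term in the Q-learning update. Here is a minimal sketch in Python, assuming a 6×6 grid with 4 actions and made-up values for the learning rate alpha, discount gamma, and step reward (none of these numbers come from the example above):

import numpy as np

q_table = np.zeros((6, 6, 4))   # 6x6 grid, 4 actions, indexed [y][x][action]
alpha, gamma = 0.1, 0.9         # learning rate and discount factor (assumed values)

# Update for the move above: action 3 (→) taken from (0, 0), landing in (1, 0)
y, x, action = 0, 0, 3
new_y, new_x = 0, 1             # state (1, 0) is stored at q_table[0][1]
reward = 0                      # assumed reward for an ordinary valid step

best_next = max(q_table[new_y][new_x])   # max(Q((1, 0), all actions))
q_table[y][x][action] += alpha * (reward + gamma * best_next - q_table[y][x][action])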


If the agent tries to go to an invalid position (like (-1, 0)):

q_table[0][0][2]  # y=0, x=0, action index 2 = ←

  • The move is detected as out of bounds

  • The agent gets a reward = -1

  • The episode ends

  • The agent still updates the Q-value for the action it tried (even though it failed)

  • The Q-value will slowly approach -1 over time (see the sketch right after this list)
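A minimal sketch of that boundary check and update, using a step helper and reward values I am assuming for illustration (the post does not give this code):

def step(q_table, y, x, action, alpha=0.1, gamma=0.9):
    # Action deltas in [y][x] terms: 0 = ↑, 1 = ↓, 2 = ←, 3 = →
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    dy, dx = moves[action]
    new_y, new_x = y + dy, x + dx

    if not (0 <= new_y < 6 and 0 <= new_x < 6):
        # Out of bounds: reward -1, episode ends, tried action is still updated
        reward, done = -1, True
        target = reward          # terminal, so no discounted next-state term
    else:
        reward, done = 0, False  # assumed reward for a valid step
        target = reward + gamma * max(q_table[new_y][new_x])

    q_table[y][x][action] += alpha * (target - q_table[y][x][action])
    next_state = (y, x) if done else (new_y, new_x)
    return next_state, done

Calling step(q_table, 0, 0, 2) tries to go left from (0, 0), hits the out-of-bounds branch, and pulls q_table[0][0][2] toward -1.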

For example:

q_table[0][0][2] # y=0, x=0, action index 2 = ←

The Q-table stores a value, but not for (-1, 0). Instead, it stores the value of trying to go left (←) from (0, 0), which is q_table[0][0][2]. After 100 trials, that value will be close to -1, so left becomes the least attractive action from (0, 0).

So the agent will choose (a small selection sketch follows this list):

  • Action index = 3

  • Direction = → (right)

  • Move to = (1, 0)
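A minimal sketch of that greedy choice (the numbers in q_state are made-up placeholders for q_table[0][0] after training, not real results):

import numpy as np

action_names = ["↑", "↓", "←", "→"]

# Hypothetical learned values for state (0, 0); index 2 (←) has drifted toward -1
q_state = np.array([-0.5, 0.2, -1.0, 0.8])   # stands in for q_table[0][0]

best_action = int(np.argmax(q_state))        # -> 3
print(action_names[best_action])             # -> "→", so the agent moves to (1, 0)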
