RL 01
Let's say we move from (0, 0) to (1, 0):
max(Q((1,0), all actions)) =
max(q_table[0][1][0], # ↑
q_table[0][1][1], # ↓
q_table[0][1][2], # ←
q_table[0][1][3]) # →
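That max is the bootstrap term of the Q-learning update. Below is a minimal sketch of how the update could look with this [y][x][action] table; alpha and gamma are assumed names for the learning rate and discount factor, not taken from the original code.

alpha, gamma = 0.1, 0.9  # assumed learning rate and discount factor

def update_q(q_table, state, action, reward, next_state):
    x, y = state                      # positions are (x, y); the table is indexed [y][x]
    nx, ny = next_state
    best_next = max(q_table[ny][nx])  # max over the 4 actions available in the next state
    td_target = reward + gamma * best_next
    q_table[y][x][action] += alpha * (td_target - q_table[y][x][action])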
The full 6×6 grid of Q-table state indices ([y][x]):

[0][0] [0][1] [0][2] [0][3] [0][4] [0][5]
[1][0] [1][1] [1][2] [1][3] [1][4] [1][5]
[2][0] [2][1] [2][2] [2][3] [2][4] [2][5]
[3][0] [3][1] [3][2] [3][3] [3][4] [3][5]
[4][0] [4][1] [4][2] [4][3] [4][4] [4][5]
[5][0] [5][1] [5][2] [5][3] [5][4] [5][5]
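One way such a table could be initialized, assuming 6×6 states and 4 actions in the order ↑, ↓, ←, → (a sketch, not the original setup code):

q_table = [[[0.0] * 4 for _ in range(6)] for _ in range(6)]  # indexed [y][x][action], all zeros
print(q_table[0][1])  # the 4 action values for the state at y=0, x=1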
If the agent tries to go to an invalid position (like (-1, 0)), for example by taking ← from (0, 0), the entry being updated is:

q_table[0][0][2]  # y=0, x=0, action index 2 = ←

Then:
- The move is detected as out of bounds
- The agent gets a reward of -1
- The episode ends
- The agent still updates the Q-value for the action it tried (even though it failed)
- The Q-value will slowly approach -1 over time
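The points above could translate into a step function like the sketch below; GRID_SIZE, MOVES and the 0/-1 rewards are assumptions for illustration, not the original environment code.

GRID_SIZE = 6
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # ↑, ↓, ←, → as (dx, dy), assuming y grows downward

def step(q_table, state, action, alpha=0.1, gamma=0.9):
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE):
        reward, done = -1, True            # out of bounds: reward -1, episode ends
        target = reward                    # no next state to bootstrap from
        next_state = state
    else:
        reward, done = 0, False            # assumed reward for an ordinary move
        target = reward + gamma * max(q_table[ny][nx])
        next_state = (nx, ny)
    # the Q-value of the attempted action is updated either way
    q_table[y][x][action] += alpha * (target - q_table[y][x][action])
    return next_state, reward, done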
For example:

q_table[0][0][2]  # y=0, x=0, action index 2 = ←

The Q-table stores a value here, but not for (-1, 0). Instead, it stores the value of trying to go left (←) from (0, 0), which ends up very close to -1 after 100 trials.
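To see why that stored value ends up near -1, here is a tiny simulation of 100 failed ← attempts from (0, 0), assuming a learning rate of 0.1:

q, alpha = 0.0, 0.1
for trial in range(100):
    q += alpha * (-1 - q)   # failed move: the update target is just the reward -1
print(q)                    # about -0.99997, i.e. effectively -1 after 100 trials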
So the agent will choose:

- Action index = 3
- Direction = → (right)
- Move to = (1, 0)
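That greedy choice can be written as an argmax over the 4 values stored for (0, 0); the ACTIONS list is an assumed helper for printing directions:

ACTIONS = ["↑", "↓", "←", "→"]
values = q_table[0][0]                          # the 4 Q-values stored for state (0, 0)
best = max(range(4), key=lambda a: values[a])   # argmax over the action indices
print(best, ACTIONS[best])                      # 3 → when "right" has the highest value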