Foundations of Artificial Intelligence and Robotics – RB31-02
Let's look at the maze:
[ [ 0, 0,  0, 0, 0],
  [ 0, 1, -1, 1, 0],
  [ 0, 1,  0, 1, 0],
  [ 0, 1, +4, 1, 0],
  [ 0, 0,  0, 0, 0] ]
Here's the plotted 5×5 maze for the reinforcement learning environment:
- Walls (0) are represented in black and block movement.
- Restart (-1) cells are labeled and restart the game if the agent lands on them.
- Win (+4) cells are labeled and represent the goal of the game.
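As a minimal sketch, the same maze can also be written down as a plain Python list of lists (the `maze` variable name and the small helper below are mine, just for illustration):

```python
# 5x5 maze for the reinforcement learning environment.
# Cell codes follow the legend above:
#   0  = wall (shown in black, blocks movement)
#  -1  = restart cell (restarts the game if the agent lands on it)
#  +4  = win cell (the goal of the game)
#   1  = ordinary free cell
maze = [
    [0, 0,  0, 0, 0],
    [0, 1, -1, 1, 0],
    [0, 1,  0, 1, 0],
    [0, 1,  4, 1, 0],
    [0, 0,  0, 0, 0],
]

def blocks_movement(row, col):
    """Per the legend, cells marked 0 (and anything off the grid) block movement."""
    return not (0 <= row < 5 and 0 <= col < 5) or maze[row][col] == 0
```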
MDP – Markov decision process
Parts of an MDP
- States (S):
  These are the places you can be in the grid. In our grid, each cell (like [0,0], [0,1], and so on) is a "state."
- Actions (A):
  Actions are what you can do in the grid. For example:
  - Move up
  - Move down
  - Move left
  - Move right
- Rewards (R):
  Some cells in the grid give rewards, some give penalties, and others give nothing. In our example:
  - +4 is a big reward (good!).
  - -1 is a penalty (bad!).
  - 0 means no reward or penalty.
- Transitions (T):
  This is about the chances of moving to a new state when you take an action. For example, if you move right from [2,2], you might end up in [2,3]. But if there's uncertainty, you might accidentally move somewhere else.
- Policy (π):
The policy can and should change from the beginning to the end, evolving from random decisions to a learned, optimized policy:
- Policy at the start: Random.
- Policy during learning: Improves as the agent gains experience.
- Policy at the end: Optimized for the best outcomes.
Initially, the robot doesn't know anything about the grid, rewards, or penalties, so a rule like "If I'm near [3,2], go toward it to get the +4 reward" is something the robot would only come up with after learning.
The policy is the plan for what action to take in each state. It tells the robot: "If you're at cell [1,2], go down."
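Putting the MDP parts above (S, A, R, T) together, here is a minimal, illustrative sketch in Python. It reuses the `maze` list from the earlier sketch, assumes the reward of a step is simply the value of the cell the agent lands on (which matches the worked path example further down), and models uncertainty with an optional "slip" probability. All names here are my own, not any particular library's API.

```python
import random

# States (S): every (row, col) cell of the 5x5 grid.
states = [(r, c) for r in range(5) for c in range(5)]

# Actions (A): the four moves, written as (row_delta, col_delta).
actions = {
    "up":    (-1, 0),
    "down":  (+1, 0),
    "left":  (0, -1),
    "right": (0, +1),
}

def reward(state):
    """Reward (R): the value of the cell we land on (+4 win, -1 restart, +1 or 0 otherwise)."""
    row, col = state
    return maze[row][col]

def transition(state, action, slip_prob=0.0):
    """Transition (T): usually move as intended, but with probability slip_prob
    a random action happens instead (the uncertainty mentioned above).
    Moves into walls (cells marked 0) or off the grid leave the state unchanged."""
    if random.random() < slip_prob:
        action = random.choice(list(actions))
    d_row, d_col = actions[action]
    row, col = state[0] + d_row, state[1] + d_col
    if 0 <= row < 5 and 0 <= col < 5 and maze[row][col] != 0:
        return (row, col)
    return state
```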
Before Learning: Random Policy
At the beginning, the robot might follow a random policy, meaning:
- It chooses actions randomly without knowing what’s good or bad.
For example:
- At [2,2], it might randomly decide to go up, down, left, or right.
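A random policy is easy to write down as a tiny sketch (the function name is mine):

```python
import random

def random_policy(state):
    """Before learning: ignore the state and pick one of the four moves at random."""
    return random.choice(["up", "down", "left", "right"])

# At [2,2] the robot might go in any direction:
print(random_policy((2, 2)))  # e.g. "left"
```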
Trajectory
So we can say that S, A, R make up the trajectory:
τ = S0, A0, R0, S1, A1, R1, …
Episode: from the start to the end (one complete run of the game).
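As a sketch of how a trajectory and an episode fit together (reusing `maze`, `transition`, `reward`, and `random_policy` from the sketches above, and assuming the episode ends when the +4 cell is reached):

```python
def run_episode(start=(2, 2), policy=random_policy, max_steps=50):
    """Roll out one episode, recording the trajectory tau = S0, A0, R0, S1, A1, R1, ..."""
    trajectory = []
    state = start
    for _ in range(max_steps):
        action = policy(state)
        next_state = transition(state, action)
        step_reward = reward(next_state)
        trajectory.append((state, action, step_reward))
        if maze[next_state[0]][next_state[1]] == 4:    # win cell -> episode ends
            break
        if maze[next_state[0]][next_state[1]] == -1:   # restart cell -> back to the start
            next_state = start
        state = next_state
    return trajectory

# One full episode, from the start until the end (or until we give up):
print(run_episode())
```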
We would like to optimize the reward (R) in the long term.
In chess, for example, the goal is to win by checkmating the opponent's king. Just capturing as many of the opponent's pieces as possible might maximize the immediate reward, but it can leave us exposed and let the opponent beat us. Optimizing for the long-term outcome is what makes an action correct.
Reward vs Return and discount factor
Scenario
- The robot starts at [2,2].
- It follows this path: [2,2] → [3,2] → [3,3] → [2,3].
Step-by-Step Rewards
- Moving from [2,2] to [3,2]: Reward = +4
- Moving from [3,2] to [3,3]: Reward = +1
- Moving from [3,3] to [2,3]: Reward = +1
Return for the Path
The return is the total of all rewards, discounted over time. If γ = 0.9 (the discount factor), the return is calculated as:
Return = R0 + γ·R1 + γ²·R2
- R0 = +4 (first step)
- R1 = +1 (second step)
- R2 = +1 (third step)
Return = 4 + (0.9)(1) + (0.9²)(1) = 4 + 0.9 + 0.81 = 5.71
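The same calculation as a tiny sketch (the function name is mine):

```python
def discounted_return(rewards, gamma=0.9):
    """Return = R0 + gamma*R1 + gamma^2*R2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([4, 1, 1]))  # 4 + 0.9 + 0.81 ≈ 5.71
```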
How will the agent know the best path, or the shortest path?
In this case every step gives R = 1 and the goal gives R = 4, so all the paths look equally good, and this is not good.
The longer it takes to get to the reward, the more that reward is discounted.
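To isolate the effect of the discount factor, here is a quick sketch (reusing `discounted_return` from above) that looks only at the +4 goal reward and sets the intermediate rewards to 0: the later the +4 arrives, the less it is worth.

```python
gamma = 0.9

# Only the +4 goal reward, arriving after 1 step vs. after 3 steps.
goal_after_1_step  = [4]
goal_after_3_steps = [0, 0, 4]

print(discounted_return(goal_after_1_step, gamma))   # 4.0
print(discounted_return(goal_after_3_steps, gamma))  # 0.9**2 * 4 ≈ 3.24
```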