Foundations of Artificial Intelligence and Robotics – RB31-02


 

Let's look at the maze:

Here's the plotted 5×5 maze for the reinforcement learning environment:

  • Walls (0) are represented in black and block movement.
  • Restart (-1) cells are labeled and restart the game if the agent lands on them.
  • Win (+4) cells are labeled and represent the goal of the game.
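
As a rough Python sketch of such an environment, the grid can be stored as a nested list. The layout below is made up for illustration (only the +4 goal at [3,2] follows the worked example later in these notes); the real layout is the one in the plotted figure.

```python
# Hypothetical 5x5 layout (an assumption; the real maze is in the plotted figure).
WALL, RESTART, WIN, FREE = "wall", -1, +4, 0

maze = [
    [FREE,    FREE, FREE, RESTART, FREE],
    [FREE,    WALL, FREE, WALL,    FREE],
    [FREE,    FREE, FREE, FREE,    FREE],
    [RESTART, FREE, WIN,  FREE,    FREE],   # the +4 win cell placed at [3, 2]
    [FREE,    FREE, FREE, FREE,    FREE],
]

def reward_at(cell):
    """Reward for landing on a cell: -1 restarts, +4 wins, 0 otherwise.
    Walls block movement, so the agent never actually lands on them."""
    value = maze[cell[0]][cell[1]]
    return 0 if value == WALL else value

print(reward_at((3, 2)))   # +4 (the goal)
```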

 

MDP – Markov decision process

Parts of an MDP

  1. States (S):
    These are the places you can be in the grid. In our grid, each cell (like [0,0] or [0,1], …) is a "state."
  2. Actions (A):
    Actions are what you can do in the grid. For example:

    • Move up
    • Move down
    • Move left
    • Move right
  3. Rewards (R):
    Some cells in the grid give rewards, some give penalties, and others give nothing.

    • In our example:
      • +4 is a big reward (good!).
      • -1 is a penalty (bad!).
      • 0 means no reward or penalty.
  4. Transitions (T):
    This is about the chances of moving to a new state when you take an action.
    For example, if you move right from [2,2], you might end up in [2,3]. But if there's uncertainty, you might accidentally move somewhere else (see the code sketch after the policy section below).

 

  5. Policy (π):

The policy can and should change from the beginning to the end, evolving from random decisions to a learned, optimized policy.

  • Policy at the start: Random.
  • Policy during learning: Improves as the agent gains experience.
  • Policy at the end: Optimized for the best outcomes.

Initially, the robot doesn’t know anything about the grid, rewards, or penalties,

so a rule like "If I’m near [3,2], go toward it to get the +4 reward" is something the robot would only come up with after learning.

A policy is the plan for what action to take in each state. It tells the robot: "If you're at cell [1,2], go down."

Before Learning: Random Policy

At the beginning, the robot might follow a random policy, meaning:

  • It chooses actions randomly without knowing what’s good or bad.
    For example:
  • At [2,2], it might randomly decide to go up, down, left, or right.
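
Below is a minimal Python sketch of the transition function from part 4 together with the random starting policy. The 10% "slip" probability is purely an assumption used to illustrate a stochastic transition, and walls are ignored to keep the sketch short; a deterministic maze would always return the intended next cell.

```python
import random

# The four actions as (row, col) offsets on the 5x5 grid.
ACTIONS = {
    "up":    (-1, 0),
    "down":  ( 1, 0),
    "left":  ( 0, -1),
    "right": ( 0,  1),
}

def step(state, action, slip_prob=0.1, size=5):
    """Transition T(s, a): usually move as intended, but with a small
    slip probability (an assumed 10%) move in a random direction instead.
    Moves that would leave the grid keep the agent in place."""
    if random.random() < slip_prob:
        action = random.choice(list(ACTIONS))      # accidental move
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < size and 0 <= c < size:
        return (r, c)
    return state                                   # bumped into the border

def random_policy(state):
    """Before learning: ignore the state and pick an action uniformly at random."""
    return random.choice(list(ACTIONS))

# Example: the untrained robot at [2, 2] picks a random action and takes a step.
state = (2, 2)
action = random_policy(state)
print(action, "->", step(state, action))
```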

 


 

 

 

Trajectory

So we can say that the sequence of states, actions, and rewards S, A, R forms the trajectory:

τ = S₀, A₀, R₀, S₁, A₁, R₁, …

 

Episode: from the beginning to the end (one complete run of the game, from the start state until the game ends).
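
Here is a minimal sketch of collecting one trajectory over a single episode with a random policy. The env_step and collect_episode helpers are hypothetical stand-ins (deterministic moves, no walls, +4 only at an assumed goal cell [3,2]), not the full maze above.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (3, 2)   # assumed goal cell, matching the worked example below

def env_step(state, action, size=5):
    """Simplified dynamics: deterministic move, clamped to the grid,
    reward +4 on reaching the goal and 0 otherwise."""
    dr, dc = ACTIONS[action]
    next_state = (min(max(state[0] + dr, 0), size - 1),
                  min(max(state[1] + dc, 0), size - 1))
    reward = 4 if next_state == GOAL else 0
    return next_state, reward

def collect_episode(start=(2, 2), max_steps=50):
    """One episode: follow a random policy and record the trajectory
    tau = S0, A0, R0, S1, A1, R1, ... until the goal or the step limit."""
    trajectory, state = [], start
    for _ in range(max_steps):
        action = random.choice(list(ACTIONS))
        next_state, reward = env_step(state, action)
        trajectory.append((state, action, reward))
        if next_state == GOAL:          # terminal state ends the episode
            break
        state = next_state
    return trajectory

print(collect_episode())
```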


 

We would like to optimize the rewards (R) over the long term.

In chess, for example, we want to win by getting to the opponent's king. Just capturing the opponent's pieces might give us the maximum immediate reward, but it can let the opponent beat us and we lose; to optimize, we need to choose the actions that pay off in the long term.

Reward vs Return and discount  factor

Scenario

  1. The robot starts at [2,2].
  2. It follows this path: [2,2] → [3,2] → [3,3] → [2,3].

Step-by-Step Rewards

  • Moving from [2,2] to [3,2]: Reward = +4
  • Moving from [3,2] to [3,3]: Reward = +1
  • Moving from [3,3] to [2,3]: Reward = +1

Return for the Path

The return is the total of all rewards, discounted over time. If γ = 0.9 (the discount factor), the return is calculated as:

Return = R₀ + γ·R₁ + γ²·R₂

  • R₀ = +4 (first step)
  • R₁ = +1 (second step)
  • R₂ = +1 (third step)

Return = 4 + (0.9)(1) + (0.9²)(1) = 4 + 0.9 + 0.81 = 5.71
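
The same calculation written as a few lines of Python, as a quick check of the arithmetic:

```python
def discounted_return(rewards, gamma=0.9):
    """Return = R0 + gamma*R1 + gamma^2*R2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(round(discounted_return([4, 1, 1]), 2))   # 4 + 0.9 + 0.81 = 5.71
```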


 

How will the agent know the best path, or the shortest path?

 

In this case every step gives R = 1 and the goal gives R = 4, so the plain sum of rewards gives the agent no reason to prefer a shorter path to the goal, and this is not good.

The longer it takes to reach the reward, the more heavily it is discounted (it is multiplied by a higher power of γ), so with γ < 1 shorter paths end up with a higher return.
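
A quick numeric check of this idea, reusing the discounted_return helper from above. To isolate the effect of the discount factor, this sketch assumes ordinary steps give 0 (as on the maze's free cells) and only the goal gives +4; the two path lengths are made up.

```python
def discounted_return(rewards, gamma=0.9):
    """Return = R0 + gamma*R1 + gamma^2*R2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

short_path = [0, 0, 4]             # reaches the +4 goal in 3 steps
long_path  = [0, 0, 0, 0, 0, 4]    # reaches the same goal in 6 steps

print(round(discounted_return(short_path), 3))   # 0.9^2 * 4 = 3.24
print(round(discounted_return(long_path), 3))    # 0.9^5 * 4 ≈ 2.362
# With gamma < 1 the shorter path has the higher return, so the agent learns
# to prefer it; with gamma = 1 both paths would score the same total of 4.
```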
