Foundations of Artificial Intelligence and Robotics – RB31-02
Let's look at the maze:
[ [ 0, 0,  0, 0, 0],
  [ 0, 1, -1, 1, 0],
  [ 0, 1,  0, 1, 0],
  [ 0, 1, +4, 1, 0],
  [ 0, 0,  0, 0, 0] ]
Here's the plotted 5×5 maze for the reinforcement learning environment:
- Walls (0) are represented in black and block movement.
- Restart (-1) cells are labeled and restart the game if the agent lands on them.
- Win (+4) cells are labeled and represent the goal of the game.
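As a minimal sketch, the same maze can also be written down as a plain Python list of lists (the `maze` variable name and the small helper below are mine, just for illustration):

```python
# 5x5 maze for the reinforcement learning environment.
# Cell codes follow the legend above:
#   0  = wall (shown in black, blocks movement)
#  -1  = restart cell (restarts the game if the agent lands on it)
#  +4  = win cell (the goal of the game)
#   1  = ordinary free cell
maze = [
    [0, 0,  0, 0, 0],
    [0, 1, -1, 1, 0],
    [0, 1,  0, 1, 0],
    [0, 1,  4, 1, 0],
    [0, 0,  0, 0, 0],
]

def blocks_movement(row, col):
    """Per the legend, cells marked 0 (and anything off the grid) block movement."""
    return not (0 <= row < 5 and 0 <= col < 5) or maze[row][col] == 0
```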
MDP – Markov decision process
Parts of an MDP
- States (S):
  These are the places you can be in the grid. In our grid, each cell (like [0,0], [0,1], and so on) is a "state."
- Actions (A):
  Actions are what you can do in the grid. For example:
  - Move up
  - Move down
  - Move left
  - Move right
- Rewards (R):
  Some cells in the grid give rewards, some give penalties, and others give nothing. In our example:
  - +4 is a big reward (good!).
  - -1 is a penalty (bad!).
  - 0 means no reward or penalty.
- Transitions (T):
  This is about the chances of moving to a new state when you take an action. For example, if you move right from [2,2], you might end up in [2,3]. But if there's uncertainty, you might accidentally move somewhere else.
- Policy (π):
The policy can and should change from the beginning to the end, evolving from random decisions to a learned, optimized policy:
- Policy at the start: Random.
- Policy during learning: Improves as the agent gains experience.
- Policy at the end: Optimized for the best outcomes.
Initially, the robot doesn't know anything about the grid, rewards, or penalties, so a rule like "If I'm near [3,2], go toward it to get the +4 reward" is something the robot would only come up with after learning.
The policy is the plan for what action to take in each state. It tells the robot: "If you're at cell [1,2], go down."
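Putting the MDP parts above (S, A, R, T) together, here is a minimal, illustrative sketch in Python. It reuses the `maze` list from the earlier sketch, assumes the reward of a step is simply the value of the cell the agent lands on (which matches the worked path example further down), and models uncertainty with an optional "slip" probability. All names here are my own, not any particular library's API.

```python
import random

# States (S): every (row, col) cell of the 5x5 grid.
states = [(r, c) for r in range(5) for c in range(5)]

# Actions (A): the four moves, written as (row_delta, col_delta).
actions = {
    "up":    (-1, 0),
    "down":  (+1, 0),
    "left":  (0, -1),
    "right": (0, +1),
}

def reward(state):
    """Reward (R): the value of the cell we land on (+4 win, -1 restart, +1 or 0 otherwise)."""
    row, col = state
    return maze[row][col]

def transition(state, action, slip_prob=0.0):
    """Transition (T): usually move as intended, but with probability slip_prob
    a random action happens instead (the uncertainty mentioned above).
    Moves into walls (cells marked 0) or off the grid leave the state unchanged."""
    if random.random() < slip_prob:
        action = random.choice(list(actions))
    d_row, d_col = actions[action]
    row, col = state[0] + d_row, state[1] + d_col
    if 0 <= row < 5 and 0 <= col < 5 and maze[row][col] != 0:
        return (row, col)
    return state
```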
Before Learning: Random Policy
At the beginning, the robot might follow a random policy, meaning:
- It chooses actions randomly without knowing what’s good or bad.
For example:
- At [2,2], it might randomly decide to go up, down, left, or right.
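A random policy is easy to write down as a tiny sketch (the function name is mine):

```python
import random

def random_policy(state):
    """Before learning: ignore the state and pick one of the four moves at random."""
    return random.choice(["up", "down", "left", "right"])

# At [2,2] the robot might go in any direction:
print(random_policy((2, 2)))  # e.g. "left"
```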
Trajectory
So we can say that S, A, R make up the trajectory:
τ = S0, A0, R0, S1, A1, R1, …
Episode: from the start to the end (one complete run of the game).
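As a sketch of how a trajectory and an episode fit together (reusing `maze`, `transition`, `reward`, and `random_policy` from the sketches above, and assuming the episode ends when the +4 cell is reached):

```python
def run_episode(start=(2, 2), policy=random_policy, max_steps=50):
    """Roll out one episode, recording the trajectory tau = S0, A0, R0, S1, A1, R1, ..."""
    trajectory = []
    state = start
    for _ in range(max_steps):
        action = policy(state)
        next_state = transition(state, action)
        step_reward = reward(next_state)
        trajectory.append((state, action, step_reward))
        if maze[next_state[0]][next_state[1]] == 4:    # win cell -> episode ends
            break
        if maze[next_state[0]][next_state[1]] == -1:   # restart cell -> back to the start
            next_state = start
        state = next_state
    return trajectory

# One full episode, from the start until the end (or until we give up):
print(run_episode())
```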
We would like to optimize the reward (R) in the long term.
In chess, for example, the goal is to win by checkmating the opponent's king. Just capturing as many of the opponent's pieces as possible might maximize the immediate reward, but it can leave us exposed and let the opponent beat us. Optimizing for the long-term outcome is what makes an action correct.
Reward vs Return and discount factor
Scenario
- The robot starts at [2,2].
- It follows this path: [2,2] → [3,2] → [3,3] → [2,3].
Step-by-Step Rewards
- Moving from [2,2] to [3,2]: Reward = +4
- Moving from [3,2] to [3,3]: Reward = +1
- Moving from [3,3] to [2,3]: Reward = +1
Return for the Path
The return is the total of all rewards, discounted over time. If γ = 0.9 (the discount factor), the return is calculated as:
Return = R0 + γ·R1 + γ²·R2
- R0 = +4 (first step)
- R1 = +1 (second step)
- R2 = +1 (third step)
Return = 4 + (0.9)(1) + (0.9²)(1) = 4 + 0.9 + 0.81 = 5.71
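The same calculation as a tiny sketch (the function name is mine):

```python
def discounted_return(rewards, gamma=0.9):
    """Return = R0 + gamma*R1 + gamma^2*R2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([4, 1, 1]))  # 4 + 0.9 + 0.81 ≈ 5.71
```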
How will the agent know the best path, or the shortest path?
In this case every step gives R = 1 and the goal gives R = 4, so all the paths look equally good, and this is not good.
The longer it takes to get to the reward, the more that reward is discounted.
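To isolate the effect of the discount factor, here is a quick sketch (reusing `discounted_return` from above) that looks only at the +4 goal reward and sets the intermediate rewards to 0: the later the +4 arrives, the less it is worth.

```python
gamma = 0.9

# Only the +4 goal reward, arriving after 1 step vs. after 3 steps.
goal_after_1_step  = [4]
goal_after_3_steps = [0, 0, 4]

print(discounted_return(goal_after_1_step, gamma))   # 4.0
print(discounted_return(goal_after_3_steps, gamma))  # 0.9**2 * 4 ≈ 3.24
```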