AI Fundamentals Course RB26-23: Using Reinforcement Learning (RL)

Enrichment

Class exercise

Create an image with AI Studio, then animate it at https://app.pixverse.ai/

Part A: Creating a book with AI

Link to: StoryBook Gemini

Class exercise: Part A

1. Create a story for a target audience of your choice.

Write a few lines describing the background of the story:
the period,
the region of the world,
the characters and their names.

2. Write a teaching story for an adult audience on a technical topic, such as how AI learns, what a worm gear is, or encryption and keys.

State which technical topics you want the story to cover.
Describe or request several different examples.
Describe the background and the illustrations you want included.

Examples built with Google AI Studio

1. Screen sharing from the phone and real-time help with writing formulas (Gemini Share)

2. Creating a landing page: updating it and upgrading it by adding fields

Class exercise 2

1. Build a landing page / form

3. Creating a computer game: PANG

Enrichment

We start with something fun: creating a story with AI.

Explanation of the feature

Reinforcement Learning in code, step by step, in plain language and with no prior knowledge required. The pygame drawing parts are not explained here, only what relates to the machine learning.

1) Action space (what the agent can do)

Role: defines all the actions the agent can take in any state. In the code these are 'up', 'down', 'left' and 'right'.
Concept: in RL, an action is a choice the agent makes from a given state.

2) Reward function (what is good and what is bad)

Role: defines the rules for earning positive or negative points.
Concepts:
- Positive reward (+100) when the agent reaches the goal (the gold cell).
- Negative reward (-2) when it steps on a red cell (danger).
- A small penalty for every step (-0.01), to encourage it to reach the goal by the shortest path.

2.1 In Reinforcement Learning the agent has to choose an action at every step.
It can choose between two approaches:

Exploration
Trying new actions to learn things it has not tried yet, even if they do not currently look like the best ones.
The goal: find new strategies that may yield a higher reward in the future.
Exploitation
Choosing the action that currently looks best according to the existing information (the Q-values or the model learned so far).
The goal: maximize reward in the short term, based on what is already known.

Why are both needed?

With exploitation only, the agent keeps using the action that looks best right now and may miss a better action it never discovered.
With exploration only, it wastes a lot of time on inefficient attempts and does not accumulate enough reward.

The right balance lets the agent learn from experience while still discovering new opportunities.

A simple example

Imagine you are trying out restaurants in a new city:

Exploitation: you keep going back to the same restaurant you like.
Exploration: you try a different restaurant, which may be even better (or worse).

If you always return to the same restaurant, you may have missed the best one in town.
If you always try a new one, you never get to enjoy the good one you found.

How it is implemented in the code (ε-greedy policy)

random.random() < epsilon → exploration: choose a random action.
Otherwise → exploitation: choose the action with the highest Q-value for the current state.
epsilon sets the balance:
- high value → more exploration
- low value → more exploitation
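The rule above can be sketched as a small standalone function. This is a minimal illustration rather than the exact code of the program: the function name `epsilon_greedy` and the sample Q-values are assumptions made for the example.

```python
import random

import numpy as np


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: any action
    return int(np.argmax(q_values))          # exploitation: best known action


# Hypothetical Q-values of one state for the actions up, down, left, right.
q_row = np.array([0.0, 1.5, -0.2, 0.3])

# With epsilon = 0 the agent never explores, so it always picks index 1.
greedy_action = epsilon_greedy(q_row, epsilon=0.0)
```

With epsilon = 1.0 the same function ignores the Q-values entirely and returns a uniformly random index.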

Rule of thumb for training

Training usually starts with a high ε (a lot of exploration at the beginning)
and reduces it gradually as the agent learns (epsilon decay): first discover the world, then maximize reward.
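One possible decay schedule is sketched below. The program in this post keeps epsilon fixed at 0.1, so everything here is an assumption for illustration: the function name `decayed_epsilon`, the starting value, the floor, and the decay factor are all hypothetical choices.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.95):
    """Exponentially decayed exploration rate, never dropping below eps_min."""
    return max(eps_min, eps_start * decay ** episode)


# Exploration rate for each of 100 episodes: starts at 1.0 and shrinks
# by 5% per episode until it reaches the floor eps_min.
schedule = [decayed_epsilon(ep) for ep in range(100)]
```

The floor keeps a little exploration alive late in training, which helps if the environment changes, as it does here when the red cells are re-randomized.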

3) Q-table (the model in Tabular Q-learning)

Role: initializes the Q-table with zeros.
Concept:
- Tabular Q-learning: the model is a table of size (number of states × number of actions).
- A state here is a position on the board (row, column).
- For every state and every action there is a Q-value that represents the expected cumulative reward for taking that action in that state.
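A short sketch of how such a table can be stored and read with NumPy, using the same grid size and action names as the program. The value written into the table is a made-up number, only there to show the lookup.

```python
import numpy as np

ROWS, COLS = 5, 5
ACTIONS = ['up', 'down', 'left', 'right']

# One Q-value per (row, col, action): 25 states x 4 actions, all zero at first.
Q = np.zeros((ROWS, COLS, len(ACTIONS)))

# Hypothetical learned value: moving right from cell (0, 0) looks promising.
Q[0, 0, ACTIONS.index('right')] = 0.5

# Reading the best known action for a state is an argmax over its 4 values.
best = ACTIONS[int(np.argmax(Q[0, 0]))]
```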

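4) Learning parameters (hyperparameters)

Learning rate (α): how quickly the values in the table are updated (0.1 in the code).
Discount factor (γ): how much future rewards count; a value close to 1 means the future matters a lot (0.9 in the code).
Exploration rate (ε): the probability of choosing a random action instead of the best known one (0.1 in the code).
Episode: one attempt from the start until the game ends.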
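To see what the discount factor does, the sketch below computes the return of one idealized episode: the shortest route from (0, 0) to (4, 4) takes 8 moves, so the agent collects seven step penalties of -0.01 and then +100. It assumes no red cell lies on that route, and the helper name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Seven ordinary steps followed by the move that reaches the gold cell.
shortest_path = [-0.01] * 7 + [100]

g = discounted_return(shortest_path, gamma=0.9)             # about 47.78
undiscounted = discounted_return(shortest_path, gamma=1.0)  # about 99.93
```

With γ = 0.9 the goal reward is worth less than half its face value when it is 8 moves away, which is what pushes the agent toward shorter routes.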

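5) State transition

Role: computes the next state from the current position and the chosen action. A move that would leave the board keeps the agent in place.
Concept: the agent does not learn a model of the environment; this is Model-Free RL (it learns directly from experience).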

6) Reward function

Role: returns the reward, whether the game has ended, and whether it ended in a win or a loss.
Concept: this function provides the agent with its learning signal (reward signal).

7) ε-greedy policy (exploration versus exploitation)

Role: chooses an action in the current state (r, c).
Concept:
- With probability ε → a random action (exploration).
- Otherwise → the action with the highest Q-value (exploitation).

8) Q-learning update (the heart of the learning)

Role: updates the Q-value of a state-action pair (r, c, a) according to the Bellman equation.
Concepts:
- Target: reward + γ * max(Q(next_state, all_actions))
- TD error: the difference between the target and the current value.
- Update: Q ← Q + α * (TD error)
The code performs the update twice in every step. This is not standard, but it is kept to preserve the original behavior.
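The update rule can be checked on a single number. The sketch below is an illustration with a hypothetical helper `q_update`, using α = 0.1 and γ = 0.9 as in the program. It also shows what applying the same update twice amounts to: algebraically, two identical updates equal one update with an effective learning rate of 2α - α², which is 0.19 here.

```python
def q_update(q, reward, next_max, alpha=0.1, gamma=0.9):
    """One Q-learning step: move q toward the target by a fraction alpha."""
    target = reward + gamma * next_max   # Bellman target
    td_error = target - q                # how far off the current estimate is
    return q + alpha * td_error


# The move that reaches the gold cell: reward 100, terminal, so next_max is 0.
q0 = 0.0
once = q_update(q0, reward=100, next_max=0.0)     # 10% of the way to 100
twice = q_update(once, reward=100, next_max=0.0)  # second identical update

# Equivalent single update with effective learning rate 2*alpha - alpha**2.
effective_alpha = 2 * 0.1 - 0.1 ** 2
equivalent = q0 + effective_alpha * (100 - q0)
```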

9) Episode loop

Role: runs many episodes so that the agent learns from experience.
Concept: the more interactions there are, the more the values in the table converge and lead to a better policy.

10) Training cycles and restart

Role: after every 100 episodes, waits 10 seconds, resets the Q-table, and re-randomizes the red cells.
Concept: lets you watch the learning start over each time under different environment conditions.

Summary

Algorithm: Tabular Q-learning, a model-free form of reinforcement learning.
Model: the Q-table, an array of state-action values (stored in the code as a 5×5×4 NumPy array).
Policy: ε-greedy, a mix of exploration and exploitation.
Update: the Bellman equation for updating the Q-values.
Goal: learn to reach the goal quickly while avoiding the red cells, by maximizing cumulative reward.

The program simulates a Q-learning agent that navigates a 5×5 grid, learns to avoid the red circles and reach the gold circle, and displays the Q-table and the game in real time.

pip install pygame numpy pyinstaller

import pygame
import sys
import random
import numpy as np
import time
import os

os.environ["SDL_VIDEO_WINDOW_POS"] = "120,80"

# Grid & Constants
ROWS, COLS = 5, 5
ACTIONS = ['up', 'down', 'left', 'right']
GAME_SQUARE_SIZE = 40
QTABLE_SQUARE_SIZE = int(70 * 1.24)
GOLD_POS = (4, 4)
START_POS = (0, 0)

# Rewards
REWARD_GOAL = 100
PENALTY_RED = -2
STEP_PENALTY = -0.01

def generate_red_positions(n=3):
    all_pos = [(r, c) for r in range(ROWS) for c in range(COLS)
               if (r, c) != START_POS and (r, c) != GOLD_POS]
    return random.sample(all_pos, n)

RED_POSITIONS = generate_red_positions()

# Q-learning
Q = np.zeros((ROWS, COLS, len(ACTIONS)))
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 100  # per cycle

pygame.init()
clock = pygame.time.Clock()

# Fixed window size
SCREEN_WIDTH, SCREEN_HEIGHT = 1400, 900

# Offsets
Q_OFFSET_X = 30
GRID_OFFSET_Y_BASE = 100
GRID_OFFSET_Y_Q = GRID_OFFSET_Y_BASE - 100     # Q-table up 100px
GRID_OFFSET_Y_GAME = GRID_OFFSET_Y_BASE        # Game grid unchanged vertically
GAME_OFFSET_X = SCREEN_WIDTH - (COLS * GAME_SQUARE_SIZE) - 30 - 100  # Game grid moved 100px left

# Graph (moved 140px down)
GRAPH_HEIGHT = 230
GRAPH_MARGIN = 50
GRAPH_LEFT = Q_OFFSET_X
GRAPH_TOP = GRID_OFFSET_Y_GAME + ROWS * GAME_SQUARE_SIZE + 20 + 40 + 140
GRAPH_WIDTH = SCREEN_WIDTH - 2 * Q_OFFSET_X

screen = pygame.display.set_mode((SCREEN_WIDTH, SCREEN_HEIGHT), pygame.RESIZABLE)
pygame.display.set_caption("Q-learning RL | Left: Q-table | Right: Game Grid")

# Fonts
font_q = pygame.font.SysFont("Courier New", int(18 * 0.8))
font_game = pygame.font.SysFont("Courier New", 20)
font_footer = pygame.font.SysFont("Courier New", 18)
font_axis = pygame.font.SysFont("Courier New", 16)

# Colors
WHITE = (255, 255, 255)
GRAY = (180, 180, 180)
GRAY_DARK = (120, 120, 120)
RED = (255, 0, 0)
BLUE = (0, 100, 255)
GOLD = (255, 215, 0)
BLACK = (0, 0, 0)

LAST_EP_TEXT = ""
steps_history = []
status_history = []  # 'win' | 'lose' | 'finished'
paused = False

def draw_legend(x, y):
    pad = 8
    line_h = 22
    box_w, box_h = 190, line_h * 2 + pad * 2
    pygame.draw.rect(screen, (250, 250, 250), (x, y, box_w, box_h))
    pygame.draw.rect(screen, GRAY, (x, y, box_w, box_h), 1)
    pygame.draw.circle(screen, BLUE, (x + 16, y + pad + 10), 10)
    screen.blit(font_axis.render("Blue = Win", True, BLACK), (x + 36, y + pad))
    pygame.draw.circle(screen, RED, (x + 16, y + pad + 10 + line_h), 6)
    screen.blit(font_axis.render("Red = Game Over", True, BLACK), (x + 36, y + pad + line_h - 2))

def draw_graph(history, statuses):
    plot_x = GRAPH_LEFT + GRAPH_MARGIN
    plot_y = GRAPH_TOP + GRAPH_MARGIN
    plot_w = max(100, GRAPH_WIDTH - 2 * GRAPH_MARGIN)
    plot_h = max(80, GRAPH_HEIGHT - 2 * GRAPH_MARGIN)

    pygame.draw.rect(screen, (245, 245, 245), (GRAPH_LEFT, GRAPH_TOP, GRAPH_WIDTH, GRAPH_HEIGHT))
    pygame.draw.rect(screen, GRAY, (GRAPH_LEFT, GRAPH_TOP, GRAPH_WIDTH, GRAPH_HEIGHT), 1)

    pygame.draw.line(screen, BLACK, (plot_x, plot_y + plot_h), (plot_x + plot_w, plot_y + plot_h), 2)
    pygame.draw.line(screen, BLACK, (plot_x, plot_y), (plot_x, plot_y + plot_h), 2)

    screen.blit(font_axis.render("Episode (X)", True, BLACK), (plot_x + plot_w - 140, plot_y + plot_h + 8))
    screen.blit(font_axis.render("Steps (Y)", True, BLACK), (plot_x - 5, plot_y - 26))

    # Legend on the RIGHT
    draw_legend(GRAPH_LEFT + GRAPH_WIDTH - 190 - 8, GRAPH_TOP + 8)

    if not history:
        return

    y_max = max(1, max(history))
    screen.blit(font_axis.render("1", True, BLACK), (plot_x - 6, plot_y + plot_h + 8))
    screen.blit(font_axis.render(str(episodes), True, BLACK), (plot_x + plot_w - 20, plot_y + plot_h + 8))
    screen.blit(font_axis.render("0", True, BLACK), (plot_x - 25, plot_y + plot_h - 10))
    screen.blit(font_axis.render(str(y_max), True, BLACK), (plot_x - 40, plot_y - 10))

    pts = []
    for i, steps in enumerate(history, start=1):
        t = (i - 1) / max(1, episodes - 1)
        x = plot_x + t * plot_w
        y = plot_y + (1 - (steps / y_max)) * plot_h
        pts.append((int(x), int(y)))

    if len(pts) >= 2:
        pygame.draw.lines(screen, GRAY_DARK, False, pts, 2)

    for idx, p in enumerate(pts):
        status = statuses[idx] if idx < len(statuses) else 'finished'
        if status == 'win':
            pygame.draw.circle(screen, BLUE, p, 10)
        elif status == 'lose':
            pygame.draw.circle(screen, RED, p, 6)
        else:
            pygame.draw.circle(screen, GRAY_DARK, p, 3)

def draw_grid(agent_pos):
    screen.fill(WHITE)

    # Q-table
    for row in range(ROWS):
        for col in range(COLS):
            x = Q_OFFSET_X + col * QTABLE_SQUARE_SIZE
            y = GRID_OFFSET_Y_Q + row * QTABLE_SQUARE_SIZE
            pygame.draw.rect(screen, GRAY, (x, y, QTABLE_SQUARE_SIZE, QTABLE_SQUARE_SIZE), 1)
            up, down, left, right = Q[row, col, 0], Q[row, col, 1], Q[row, col, 2], Q[row, col, 3]

            def val_color(v):
                if v > 0.1:
                    return BLUE
                elif v < -0.1:
                    return RED
                else:
                    return GRAY_DARK

            screen.blit(font_q.render(f"{up:.1f}",    True, val_color(up)),    (x + 18, y + 2))
            screen.blit(font_q.render(f"{down:.1f}",  True, val_color(down)),  (x + 18, y + QTABLE_SQUARE_SIZE - 20))
            screen.blit(font_q.render(f"{left:.1f}",  True, val_color(left)),  (x + 2,  y + QTABLE_SQUARE_SIZE // 2 - 8))
            screen.blit(font_q.render(f"{right:.1f}", True, val_color(right)), (x + QTABLE_SQUARE_SIZE - 38, y + QTABLE_SQUARE_SIZE // 2 - 8))

    # Game grid
    for row in range(ROWS):
        for col in range(COLS):
            x = GAME_OFFSET_X + col * GAME_SQUARE_SIZE
            y = GRID_OFFSET_Y_GAME + row * GAME_SQUARE_SIZE
            pygame.draw.rect(screen, GRAY, (x, y, GAME_SQUARE_SIZE, GAME_SQUARE_SIZE), 1)
            screen.blit(font_game.render(f"{row},{col}", True, GRAY), (x + 2, y + 2))
            center = (x + GAME_SQUARE_SIZE // 2, y + GAME_SQUARE_SIZE // 2)
            radius = GAME_SQUARE_SIZE // 3
            if (row, col) == GOLD_POS:
                pygame.draw.circle(screen, GOLD, center, radius)
            elif (row, col) in RED_POSITIONS:
                pygame.draw.circle(screen, RED, center, radius)

    # Agent
    ax = GAME_OFFSET_X + agent_pos[1] * GAME_SQUARE_SIZE + GAME_SQUARE_SIZE // 2
    ay = GRID_OFFSET_Y_GAME + agent_pos[0] * GAME_SQUARE_SIZE + GAME_SQUARE_SIZE // 2
    pygame.draw.circle(screen, BLUE, (ax, ay), GAME_SQUARE_SIZE // 3)

    # Footer
    if LAST_EP_TEXT:
        footer_y = GRID_OFFSET_Y_GAME + ROWS * GAME_SQUARE_SIZE + 20
        screen.blit(font_footer.render(LAST_EP_TEXT, True, BLACK), (GAME_OFFSET_X - 250, footer_y))

    draw_graph(steps_history, status_history)
    pygame.display.flip()

def move(pos, action):
    r, c = pos
    if action == 'up' and r > 0: r -= 1
    elif action == 'down' and r < ROWS - 1: r += 1
    elif action == 'left' and c > 0: c -= 1
    elif action == 'right' and c < COLS - 1: c += 1
    return (r, c)

def get_reward(pos):
    if pos in RED_POSITIONS:
        return PENALTY_RED, True, False, True
    if pos == GOLD_POS:
        return REWARD_GOAL, True, True, False
    return STEP_PENALTY, False, False, False

def handle_events_toggle_pause():
    global paused
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit(); sys.exit()
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_q:
                pygame.quit(); sys.exit()
            if event.key == pygame.K_SPACE:
                paused = not paused

def wait_with_events(seconds, message="Restarting fresh in"):
    end_time = time.time() + seconds
    while time.time() < end_time:
        handle_events_toggle_pause()
        # Show countdown
        remaining = int(end_time - time.time()) + 1
        info = f"{message} {remaining}s ..."
        y = GRID_OFFSET_Y_GAME + ROWS * GAME_SQUARE_SIZE + 50
        draw_grid(START_POS)
        screen.blit(font_footer.render(info, True, BLACK), (GAME_OFFSET_X - 250, y))
        pygame.display.flip()
        clock.tick(30)

# Initial draw
draw_grid(START_POS)

cycle = 0
while True:  # infinite cycles until quit
    cycle += 1
    # Fresh start each cycle
    Q[:] = 0
    RED_POSITIONS = generate_red_positions()
    steps_history.clear()
    status_history.clear()
    print(f"=== Starting cycle {cycle} ===")

    for ep in range(episodes):  # 100 episodes
        pos = START_POS
        done = win = game_over = False
        steps = 0

        while not done:
            handle_events_toggle_pause()
            if paused:
                draw_grid(pos)
                clock.tick(60)
                continue

            r, c = pos
            a = random.randint(0, 3) if random.random() < epsilon else int(np.argmax(Q[r, c]))
            next_pos = move(pos, ACTIONS[a])
            reward, done, win, game_over = get_reward(next_pos)

            next_max = np.max(Q[next_pos[0], next_pos[1]]) if not done else 0
            Q[r, c, a] += alpha * (reward + gamma * next_max - Q[r, c, a])
            Q[r, c, a] += alpha * (reward + gamma * next_max - Q[r, c, a])  # second, identical update kept from the original code (non-standard)

            pos = next_pos
            steps += 1

            draw_grid(pos)
            clock.tick(60)

        status_txt = "Win" if win else "Game Over" if game_over else "Finished"
        LAST_EP_TEXT = f"Cycle {cycle} | Episode {ep+1} | Total moves: {steps} | {status_txt}"
        steps_history.append(steps)
        status_history.append('win' if win else 'lose' if game_over else 'finished')

        print(LAST_EP_TEXT)
        draw_grid(pos)
        time.sleep(0.2)

    print(f"Cycle {cycle} complete. Waiting 10 seconds before restart...")
    wait_with_events(10, "Restarting fresh in")

AI Agent Lands Lunar on the Moon

Below is an explanation of the rules, the policy, and the reward and punishment system of an AI agent in the LunarLander-v2 environment (landing on the Moon) from OpenAI Gym/Gymnasium.

General description of the environment

The game simulates a soft landing of a lander on the Moon.
At every step the agent receives a state, chooses an action, and receives a reward according to its behavior.

State space (a vector of 8 values):

Horizontal and vertical position of the lander (x, y)
Horizontal and vertical velocity (ẋ, ẏ)
Angle of the lander
Angular velocity
Two boolean values: whether the left leg and whether the right leg touch the ground

Action space (4 possible actions):

Do nothing
Fire the left side engine (rotates right)
Fire the main engine (thrust upward)
Fire the right side engine (rotates left)

Reward and penalty rules (reward function)

During the steps:

Moving toward the landing zone, slowing down, and keeping the angle stable → positive reward
+10 points for each leg touching the ground
Using the engines costs points (a small penalty):
- Main engine: -0.3 points per frame
- Side engines: -0.03 points per frame

At the end of the episode:

A soft, successful landing (coming to rest) → +100 points
A crash → -100 points
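The engine costs above add up over an episode, which is why a good policy fires the engines sparingly. The sketch below only does this arithmetic and does not use the Gym/Gymnasium library: the action numbering follows the action list in this post, while the names `ENGINE_COST` and `fuel_penalty` and the action sequence are hypothetical.

```python
MAIN_ENGINE_COST = 0.3    # points lost per frame with the main engine firing
SIDE_ENGINE_COST = 0.03   # points lost per frame with a side engine firing

# Action index -> cost per frame (0: nothing, 1: left, 2: main, 3: right).
ENGINE_COST = {0: 0.0, 1: SIDE_ENGINE_COST, 2: MAIN_ENGINE_COST, 3: SIDE_ENGINE_COST}


def fuel_penalty(actions):
    """Total (negative) reward caused by engine use over a list of actions."""
    return -sum(ENGINE_COST[a] for a in actions)


# Hypothetical episode: 100 frames of main engine, 50 of a side engine, 50 idle.
episode_actions = [2] * 100 + [1] * 50 + [0] * 50
penalty = fuel_penalty(episode_actions)   # roughly -31.5 points
```

Against a landing bonus of +100, a penalty of this size is noticeable but not decisive, so the agent is still rewarded for braking when it needs to.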

An episode ends when:

The lander has crashed
The lander has left the screen boundaries
The lander has landed and stays at rest (the physics body is "asleep")

Training goal

The agent has to learn a policy that maximizes the total reward:

Reach the landing zone
Keep the speed low and the angle stable
Avoid crashing
Save fuel (use the engines sparingly)

Policy and learning

With Q-Learning / DQN, a Q-table or a neural network estimates the value of every action in every state (Q(s,a)). The policy is usually ε-greedy: sometimes choose a random action (exploration), and otherwise choose the action with the highest Q-value (exploitation).
With Policy Gradient / PPO / Actor-Critic methods, the model directly learns a probability distribution over actions in each state (π(a|s)) and improves it according to the rewards it received.
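The difference between the two families shows up in how an action is picked. A value-based agent takes an argmax over Q-values, while a policy-based agent samples from π(a|s). The sketch below illustrates the second case with a softmax over four made-up scores; the scores stand in for the output of a policy network and are an assumption, not values from a trained model.

```python
import numpy as np


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


rng = np.random.default_rng(0)

# Hypothetical scores for the 4 LunarLander actions in one state.
logits = [0.1, -1.0, 2.0, 0.3]
probs = softmax(logits)                          # this is pi(a|s)
action = int(rng.choice(len(probs), p=probs))    # sample an action from it
```

Sampling gives exploration for free: the most likely action (index 2 here) is chosen most often, but the others still get tried occasionally.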

Summary table

Component	Details
State	Position, velocity, angle, angular velocity, leg contact
Actions	0: nothing, 1: left side engine, 2: main engine, 3: right side engine
Reward per step	+proximity/stability, +10 per leg in contact; -0.3 for the main engine, -0.03 for the side engines
Final reward	+100 for a soft landing, -100 for a crash
Episode end	Crash, leaving the screen, final landing
Solving the environment	Score ≥ 200
