I'm using OpenAI's Gym Python package to build a PPO model that plays a simple grid-based game, similar to the GridWorld example. Most actions return a positive reward; typically only one action gives a negative reward.
During learning I can tell the model is doing well by printing the actions and rewards inside the environment's step() function: it rarely picks the action that leads to a negative reward.
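For reference, here is a stripped-down sketch of what GameEnv roughly looks like. The real environment has more state, varied rewards, and a proper termination condition; the observation shape, action count, and reward rule below are placeholders, not my actual game logic:

import gym
import numpy as np
from gym import spaces

class GameEnv(gym.Env):
    """Stripped-down stand-in for my real environment (placeholder logic)."""

    def __init__(self):
        super().__init__()
        # Dict observation space, which is why I use "MultiInputPolicy".
        self.observation_space = spaces.Dict(
            {"agent": spaces.Box(low=0, high=4, shape=(2,), dtype=np.int64)}
        )
        self.action_space = spaces.Discrete(4)

    def reset(self):
        self._pos = np.zeros(2, dtype=np.int64)
        return {"agent": self._pos}

    def step(self, action):
        # Placeholder reward rule: one action is penalized, the rest rewarded.
        reward = -5 if action == 3 else 1
        print(f"action, reward = {action}, {reward}")  # the logging mentioned above
        done = False  # real termination logic omitted
        return {"agent": self._pos}, reward, done, {}

    def render(self, mode="human"):
        print(f"agent at {self._pos}")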
When I then try to test the model by having it play a new game, it goes haywire: it picks a few good actions and then settles on the single action that gives a negative reward. Once it finds the bad action, it sticks with it.
Is there a bug in the code I use to test / predict with the model?
from stable_baselines3 import PPO

env = GameEnv()
obs = env.reset()

# Train the model.
model = PPO("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# Test the trained model on a new game.
obs = env.reset()
for i in range(50):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(int(action))
    env.render()
    if done:
        obs = env.reset()
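In case my hand-written loop is the problem: I assume the same deterministic evaluation could also be done with Stable-Baselines3's built-in evaluate_policy helper, roughly as in the sketch below (I have not confirmed it behaves any differently on my environment):

from stable_baselines3.common.evaluation import evaluate_policy

# Average the return over several episodes with the same deterministic
# policy used in the loop above.
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=10, deterministic=True
)
print(f"mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")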
Sample output during learning:
action, reward = 2, 1
action, reward = 3, 1
action, reward = 2, 5
action, reward = 0, 1
action, reward = 0, 9
action, reward = 1, 1
action, reward = 3, 1
action, reward = 3, -5
action, reward = 2, 1
Sample output during testing:
action, reward = 0, 1
action, reward = 1, 5
action, reward = 2, 1
action, reward = 0, 1
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
action, reward = 0, -5
...