#!/usr/bin/env python3
"""
Official first environment of the Swarm Robotics Project.

By default, this loads default.xml, which consists of a body in the center of
a bounded 3x3 area, with four agents located in the corners of the environment.

Useful references:
https://mujoco.readthedocs.io/en/latest/programming.html
https://mujoco.readthedocs.io/en/latest/APIreference.html
https://openai.github.io/mujoco-py/build/html/reference.html#pymjdata-time-dependent-data
"""
from mujoco_py import load_model_from_path, MjSim, MjViewer
import numpy as np
import glfw  # window/OpenGL context library; used here to close the viewer window
# To see the ordering of, or to change, the movable parts of the agents,
# check the xmls/ directory.
# Actions right now:
# sim.data.ctrl[0] = X direction of bot1, [-5 to 5], determines speed
# sim.data.ctrl[1] = Y direction of bot1, [-5 to 5], determines speed
# sim.data.ctrl[2] = X direction of bot2, [-5 to 5], determines speed
# sim.data.ctrl[3] = Y direction of bot2, [-5 to 5], determines speed
# ...and so on for bot3 (ctrl[4], ctrl[5]) and bot4 (ctrl[6], ctrl[7]).
"""
Description:
    A humanoid is dropped at position (0, 0) in the center of the environment.
    The environment is bounded to within 5x5 (meters, I think). As we develop
    this environment, we may want to add randomness and create custom XML so
    that the agent is not just learning to move to one specific spot.
    Currently, the idea is that when the agents get close enough to the target,
    they automatically gain the target's information. In reality this would
    require preprocessing, but implementation-specific details are not
    important here.

Observation:
    Position: no programmatic limit, but a physical barrier in the default
        environment limits travel to between -5 and 5.
    Velocity: currently bounded by a combination of force (bounded to -10 to 10)
        and mass. This will change with testing; I'm not currently sure about
        the units or the conversion to velocity.
    Box() - i.e. continuous
    [0] Dictionary mapping each bot's name to its (x, y, xVelocity, yVelocity).
    [1] Target information: None if the target hasn't been found;
        (x, y, xVelocity, yVelocity) once it has.
    There is also Z information, but we don't really need that.
    Note: we can also add xquat and xvelr (rotation) information if we want.
    The current information is only for the torso. We can pull all the other
    body-part information, but there are about 20 parts.

Actions:
    Discrete()
    A list of actions from bot1 ... to ... botN (currently bot4):
    [-1, 0, 1] - Left, Nop, Right
    [-1, 0, 1] - Down, Nop, Up
    For now, the bots can only travel at full speed.
    e.g.
    [[-1], [-1],
     [0],  [0],
     [1],  [1],
     [1],  [1]]
    The shape of this is list[8][1]. We can modify this if we want; I chose it
    for easy conversion to the layout that MuJoCo expects.

Reward:
    +10 for discovering the target.
    A shaping reward is given based on the target's proximity to the
    destination, measured by Euclidean distance: moving the target towards the
    destination gives a positive reward, moving it away gives a negative one.
        Reward = (old_target_destination_distance - new_target_destination_distance) * 100
    +10 for the target reaching the destination.
    -1 every 1000 frames, to encourage the agents to make progress.

Starting State:
    The humanoid starts in a standing position (and will immediately fall).
    The agents are preset to fixed starting locations with no randomly
    initialized values. We may want to change this in the future to better
    generalize our agents.

Episode Termination:
    Episode length is greater than 20,000 timesteps.
    The target has reached the destination.

Solved Requirements:
    Considered solved when the average return is 500 over 100 trials.
    (This is completely arbitrary right now, but we can test its feasibility
    later.)
"""
class Environment:
    def __init__(self, name='default.xml'):
        self.model = load_model_from_path("xmls/" + name)
        self.sim = MjSim(self.model)
        self.viewer = MjViewer(self.sim)
        self.reset_state = self.sim.get_state()
        self.SPEED = 5
        self.VISION_DIST = 1
        self.destination = [4.5, 4.5]  # chosen arbitrarily for now
        self.discovered = False
        self.timestep = 0
        self.MAX_TIMESTEP = 20000
        self.prev_target_dist = None  # previous target-to-destination distance, used for reward shaping

    # Helper: Euclidean distance between two points of the same length.
    @staticmethod
    def dist(x, y):
        return np.linalg.norm(np.asarray(x) - np.asarray(y))
    def step(self, action):
        self.timestep += 1
        # Actions arrive as list[8][1]; flatten to match sim.data.ctrl.
        self.sim.data.ctrl[:] = np.asarray(action).flatten()
        self.sim.step()

        observation = {}
        target = list(self.sim.data.get_body_xpos("target")[:2])
        target.extend(self.sim.data.get_body_xvelp("target")[:2])

        agent_dists = []
        for name in ("agent1", "agent2", "agent3", "agent4"):
            agent = list(self.sim.data.get_body_xpos(name)[:2])
            agent.extend(self.sim.data.get_body_xvelp(name)[:2])
            observation[name] = agent
            agent_dists.append(self.dist(agent[:2], target[:2]))

        reward = 0
        done = False
        if min(agent_dists) < self.VISION_DIST:
            if not self.discovered:  # Reward finding the target only the first time.
                self.discovered = True
                reward += 10
        if self.discovered:  # Once the target has been found, its location is always known.
            observation["target"] = target
            target_to_dest = self.dist(target[:2], self.destination)
            # Shaping reward: positive when the target moved towards the destination,
            # negative when it moved away.
            if self.prev_target_dist is not None:
                reward += (self.prev_target_dist - target_to_dest) * 100
            self.prev_target_dist = target_to_dest
            if target_to_dest < 0.5:
                done = True
                reward += 10
        if self.timestep % 1000 == 0:
            reward -= 1  # Small periodic penalty to encourage progress.
        if self.timestep > self.MAX_TIMESTEP:
            done = True
        return observation, reward, done, {}  # Debug info can go in the dict if needed.
    def render(self):
        self.viewer.render()

    def reset(self):
        self.sim.set_state(self.reset_state)
        self.discovered = False
        self.timestep = 0
        self.prev_target_dist = None

    def close(self):
        glfw.destroy_window(self.viewer.window)
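
# A minimal usage sketch (not part of the original environment): it assumes an
# xmls/default.xml matching the layout described above and drives the four bots
# with random actions in the list[8][1] shape expected by step(). This is only
# an illustration of the API, not a training loop.
if __name__ == "__main__":
    env = Environment()
    done = False
    while not done:
        # One action component in {-1, 0, 1} per controllable degree of freedom
        # (X and Y for each of the four bots), wrapped into the list[8][1] shape.
        action = [[np.random.choice([-1, 0, 1])] for _ in range(8)]
        observation, reward, done, _ = env.step(action)
        env.render()
    env.close()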