Artificial Intelligence 7750 - Graduate final project
Relevant papers:
- Neural Learning of Heuristic Functions for General Game Playing
- Neural network architecture for a snake game
- Playing Atari with Deep Reinforcement Learning
- A Deep Q-Learning based approach applied to the Snake game
The project compares a Neural Network trained with a heuristic target against Deep Q-Network (DQN) learning, each used to incrementally improve the agent at playing a Snake game.
action = The snake's 3 possible actions, one-hot encoded with 1 = the action to take and 0 = an action not taken (see the sketch after this list).
[1, 0, 0] = go forward (continue in current direction)
[0, 1, 0] = turn right
[0, 0, 1] = turn left
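As an illustration, a minimal sketch (hypothetical helper names, not necessarily the project's exact code) of turning a one-hot action into the snake's next heading:

    # Hypothetical sketch: map a one-hot action onto a new direction.
    # Directions are ordered clockwise so "turn right" is the next index
    # and "turn left" is the previous one (screen coordinates, y grows downward).
    CLOCKWISE = ["right", "down", "left", "up"]

    def next_direction(current_direction, action):
        idx = CLOCKWISE.index(current_direction)
        if action == [1, 0, 0]:            # go forward: keep current heading
            return CLOCKWISE[idx]
        elif action == [0, 1, 0]:          # turn right: next clockwise direction
            return CLOCKWISE[(idx + 1) % 4]
        else:                              # [0, 0, 1] turn left: previous clockwise direction
            return CLOCKWISE[(idx - 1) % 4]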
state = Represents 11 conditions as binary flags, with 1 = condition met and 0 = condition unmet.
- Whether danger (the snake colliding with its own body or the game window boundary) lies straight ahead, to the right, and/or to the left of the snake.
- Whether the snake's current direction is left, right, up, or down.
- Whether the mice (food) is to the left, right, above, and/or below the snake (two of these flags can be set at once when the food is diagonal to the snake).
[danger_forward, danger_right_turn, danger_left_turn, going_left, going_right, going_up, going_down, mice_left, mice_right, mice_up, mice_down]
Ex: state = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0] = danger to the left of the snake, the snake is moving downward, and the mice (food) is to the right of and above the snake.
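A sketch of how this 11-flag state could be built from the game, assuming (x, y) grid coordinates with y growing downward, a 20-pixel block size, and an is_collision(point) helper supplied by the game (all hypothetical names):

    # Hypothetical sketch of building the 11-flag state vector.
    # `head` and `mice` are (x, y) points, `direction` is one of
    # "left"/"right"/"up"/"down", and `is_collision(point)` reports danger.
    def get_state(head, mice, direction, is_collision, block=20):
        x, y = head
        point_l = (x - block, y)
        point_r = (x + block, y)
        point_u = (x, y - block)
        point_d = (x, y + block)

        going_l = direction == "left"
        going_r = direction == "right"
        going_u = direction == "up"
        going_d = direction == "down"

        return [
            # danger straight ahead
            int((going_r and is_collision(point_r)) or (going_l and is_collision(point_l))
                or (going_u and is_collision(point_u)) or (going_d and is_collision(point_d))),
            # danger if the snake turns right
            int((going_u and is_collision(point_r)) or (going_d and is_collision(point_l))
                or (going_l and is_collision(point_u)) or (going_r and is_collision(point_d))),
            # danger if the snake turns left
            int((going_d and is_collision(point_r)) or (going_u and is_collision(point_l))
                or (going_r and is_collision(point_u)) or (going_l and is_collision(point_d))),
            int(going_l), int(going_r), int(going_u), int(going_d),
            int(mice[0] < x),   # mice_left
            int(mice[0] > x),   # mice_right
            int(mice[1] < y),   # mice_up (y grows downward)
            int(mice[1] > y),   # mice_down
        ]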
Uses a heuristic function to determine the target action to take (see the sketch after this list):
- decided_action = only direction(s) where there is no danger
- decided_action = if the mice is in the same direction the snake is heading, return the "go forward" action
- decided_action = if the mice is in a direction the snake can turn towards, return that turn
- decided_action = if no previous condition matched (or there is danger everywhere), return a random action
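A hedged sketch of the heuristic described above, operating directly on the 11-flag state (the helper name heuristic_action is ours, not necessarily the project's):

    import random

    # Hypothetical sketch of the heuristic target function: takes the
    # 11-flag state and returns a one-hot target action.
    def heuristic_action(state):
        danger_fwd, danger_right, danger_left = state[0], state[1], state[2]
        going_l, going_r, going_u, going_d = state[3], state[4], state[5], state[6]
        mice_l, mice_r, mice_u, mice_d = state[7], state[8], state[9], state[10]

        forward, right, left = [1, 0, 0], [0, 1, 0], [0, 0, 1]

        # Keep only the actions with no danger.
        safe = []
        if not danger_fwd:
            safe.append(forward)
        if not danger_right:
            safe.append(right)
        if not danger_left:
            safe.append(left)

        # Mice lies in the direction the snake is already heading: go forward.
        mice_ahead = (going_l and mice_l) or (going_r and mice_r) or \
                     (going_u and mice_u) or (going_d and mice_d)
        if mice_ahead and forward in safe:
            return forward

        # Mice lies in a direction the snake can turn towards: take that turn.
        mice_to_right = (going_u and mice_r) or (going_d and mice_l) or \
                        (going_l and mice_u) or (going_r and mice_d)
        mice_to_left = (going_u and mice_l) or (going_d and mice_r) or \
                       (going_l and mice_d) or (going_r and mice_u)
        if mice_to_right and right in safe:
            return right
        if mice_to_left and left in safe:
            return left

        # No condition matched (or danger everywhere): return a random action.
        return random.choice(safe) if safe else random.choice([forward, right, left])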
- eat_mice = +10
- game_over = -10
- idle_steps_after_long_time = -10 (the idle/useless-step limit is proportional to the snake's length: length * 100 steps)
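A small sketch of how these rewards could be assigned inside one game step (the function name, flags, and (reward, done) return shape are illustrative assumptions):

    # Hypothetical reward scheme per game step. `idle_steps` counts steps
    # since the last mice was eaten; the cap of 100 * snake length matches
    # the idle/useless-step limit above.
    def step_reward(ate_mice, collided, idle_steps, snake_length):
        if collided or idle_steps > 100 * snake_length:
            return -10, True       # game over (or forced end after too many idle steps)
        if ate_mice:
            return +10, False
        return 0, False            # ordinary step: no reward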
Uses the Bellman equation to calculate new Q values:
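A sketch of the Bellman target used to update the Q value of the taken action, assuming a PyTorch model that maps a state tensor to the 3 action Q-values (not necessarily the project's exact trainer code):

    import torch

    # Bellman target: Q_new = reward + gamma * max_a' Q(next_state, a'),
    # with the bootstrap term dropped when the game is over (done).
    def q_target(model, reward, next_state, done, gamma=0.9):
        q_new = reward
        if not done:
            q_new = reward + gamma * torch.max(model(next_state)).item()
        return q_new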
epsilon = 80 - m
Random exploration if randint(0, 200) < epsilon, else exploitation (take the action with the highest predicted Q value).
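A sketch of this epsilon-greedy choice, assuming m is the number of games played so far and model is the Q-network (both are our assumptions about the notation above):

    import random
    import torch

    # Hypothetical epsilon-greedy action selection for the DQN agent.
    def choose_action(model, state, m):
        epsilon = 80 - m                     # exploration shrinks as more games are played
        action = [0, 0, 0]
        if random.randint(0, 200) < epsilon:
            action[random.randint(0, 2)] = 1                    # explore: random one-hot action
        else:
            state_t = torch.tensor(state, dtype=torch.float)
            action[torch.argmax(model(state_t)).item()] = 1     # exploit: best predicted Q value
        return action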
gamma = 0.9
Results fare better when gamma (the discount factor) is set closer to 1, i.e., future rewards are valued almost as much as immediate rewards.
Both experiments ran for 10 minutes.
Conclusion
As can be seen, the Neural Network with heuristic approach improves in performance quickly, but as time passes the improvement plateaus. The DQN does not improve as quickly at first, but as time goes on there is a clear and continuous increase in performance, with no sign of plateauing even after 10 minutes.