This project defines and solves a range of reinforcement learning (RL) environments using tabular methods such as SARSA, Q-learning, and Double Q-learning. We explore both deterministic and stochastic versions of several scenarios: Frozen Lake, Lawn Mower, and Squirrel Maze. We also apply these techniques to a stock trading environment to demonstrate their versatility.
The primary objective is for the agent to learn a policy that maximizes the cumulative reward over time in various grid-world environments. Each environment is designed with unique states, actions, and rewards to test the robustness of RL algorithms.
- `environments/`: Source code for the environments.
- `images/`: Images used in the environments.
- `models/`: Saved models and results.
- `reports/`: Project reports and documentation.
- `README.md`: Project overview and instructions.
*(Environment renderings: Lawn Mower, Squirrel Maze, Frozen Lake)*
The Frozen Lake environment is a 4x4 grid where the agent (a skater) must navigate from the start to the goal while avoiding holes and collecting gems.
- States: Positions on the grid, including start, goal, holes, and gems.
- Actions: Move left, right, up, or down.
- Rewards: Positive rewards for reaching the goal and collecting gems; negative rewards for falling into holes or moving away from the goal.
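A custom environment like this can be written against the Gymnasium `Env` interface (the project lists the Gymnasium library among its technologies). The sketch below is a minimal, hypothetical version of the 4x4 grid described above; the class name, cell layout, and reward values are assumptions for illustration, not the repository's actual implementation.

```python
import gymnasium as gym
from gymnasium import spaces

class SkaterFrozenLakeEnv(gym.Env):
    """Hypothetical 4x4 grid: a skater starts at (0, 0) and must reach the goal,
    collecting gems and avoiding holes. Layout and reward values are illustrative."""

    ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # left, right, up, down

    def __init__(self):
        self.grid_size = 4
        self.goal = (3, 3)
        self.holes = {(1, 1), (2, 3)}   # assumed hole positions
        self.gems = {(0, 2), (3, 1)}    # assumed gem positions
        self.observation_space = spaces.Discrete(self.grid_size ** 2)
        self.action_space = spaces.Discrete(4)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = (0, 0)
        self.collected = set()
        return self._obs(), {}

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.grid_size - 1)
        c = min(max(self.pos[1] + dc, 0), self.grid_size - 1)
        self.pos = (r, c)

        reward, terminated = -0.1, False          # small step penalty (assumed)
        if self.pos in self.gems and self.pos not in self.collected:
            self.collected.add(self.pos)
            reward += 2.0                          # gem bonus (assumed)
        if self.pos in self.holes:
            reward, terminated = -5.0, True        # fell into a hole
        elif self.pos == self.goal:
            reward, terminated = 10.0, True        # reached the goal

        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        # Flatten the (row, col) position into a single discrete state index.
        return self.pos[0] * self.grid_size + self.pos[1]
```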
The Lawn Mower environment simulates a mower navigating a lawn to cut grass while avoiding obstacles.
- States: Grid positions, including the mower's location, battery and rock cells, and the goal state.
- Actions: Move left, right, up, or down.
- Rewards: Positive rewards for reaching a battery; negative rewards for hitting obstacles (rocks).
The Squirrel Maze environment involves a squirrel navigating a grid to collect acorns and avoid hunters.
- States: Positions on the grid, including start, acorns, hunters, and home.
- Actions: Move left, right, up, or down.
- Rewards: Positive rewards for collecting acorns and reaching home; negative rewards for encountering hunters.
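The project builds both deterministic and stochastic versions of these grid worlds. One common way to make such an environment stochastic, shown here only as an assumed illustration of the idea rather than the repository's exact mechanism, is to let the agent "slip" and execute a random action with some probability:

```python
import random

def stochastic_action(intended_action: int, slip_prob: float = 0.2) -> int:
    """With probability slip_prob the agent slips and performs a random action
    instead of the intended one (illustrative transition-noise model)."""
    if random.random() < slip_prob:
        return random.randrange(4)   # 4 actions: left, right, up, down
    return intended_action

# Inside a stochastic env's step(), the intended action would be replaced:
# actual = stochastic_action(action, slip_prob=0.2)
```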
The Stock Trading environment simulates trading in a stock market, where the agent learns to buy and sell stocks to maximize profit.
- States: Market conditions and portfolio status.
- Actions: Buy, sell, hold.
- Rewards: Profit or loss from trades.
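A tabular trading environment can be kept small by discretizing the state. The sketch below is a hypothetical version with a coarse (price-trend, holding-flag) state, buy/sell/hold actions, and realized profit or loss as the reward; all names and design choices are assumptions for illustration, not the repository's implementation.

```python
import numpy as np

class StockTradingEnv:
    """Hypothetical tabular trading environment: the state is a coarse
    (price-trend bucket, holding flag) pair, actions are buy/sell/hold,
    and the reward is the profit or loss realized on a sell."""

    BUY, SELL, HOLD = 0, 1, 2

    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=float)
        self.n_actions = 3

    def reset(self):
        self.t = 1
        self.holding = False
        self.buy_price = 0.0
        return self._state()

    def step(self, action):
        price, reward = self.prices[self.t], 0.0
        if action == self.BUY and not self.holding:
            self.holding, self.buy_price = True, price
        elif action == self.SELL and self.holding:
            reward = price - self.buy_price      # realized profit or loss
            self.holding = False
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done

    def _state(self):
        trend = int(self.prices[self.t] > self.prices[self.t - 1])  # 0 = down, 1 = up
        return trend * 2 + int(self.holding)     # 4 discrete states
```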
- Technical Proficiency with the Implemented Algorithms: The project implements advanced RL algorithms such as Q-learning and Double Q-learning, providing a quick start for understanding and applying these techniques. The custom design of multiple RL environments (Frozen Lake, Lawn Mower, Squirrel Maze, and Stock Trading) builds the ability to create and manipulate simulations, an essential skill in many AI and data science roles.
- Problem-Solving with Deterministic and Stochastic Environments and Reward Optimization: Addressing both deterministic and stochastic environments improves our ability to tackle the uncertainty and variability common in real-world scenarios. The strategic design of reward systems to guide agent behavior builds an understanding of optimization and objective-driven development.
- Data Analysis and Visualization: Detailed analysis and comparison of the algorithms provide insights and strengthen the ability to draw meaningful conclusions from data. Using matplotlib for environment visualization and results plotting improves proficiency in presenting data clearly and effectively.
- Versatility Across Diverse Domains: Applying RL to both grid-world scenarios and a stock trading environment demonstrates how these techniques can be adapted to various domains, from robotics to finance.
- Innovation and Creativity with Custom Environments: Creating unique environments like Lawn Mower and Squirrel Maze exercises creativity and the ability to think outside the box, essential for innovation in technology roles. The inclusion of safety measures ensures ethical considerations are addressed, reflecting a responsible approach to AI development.
- Defining RL Environments - Frozen Lake, Lawn Mower, Squirrel Maze, and Stock Trading: Detailed creation of both deterministic and stochastic environments.
- SARSA Implementation: Step-by-step solution of the environments with the SARSA algorithm (the standard tabular update rules are sketched after this list).
- Q-Learning Implementation: In-depth application of Q-learning to solve the custom environments.
- Other Tabular Methods - Double Q-Learning: Exploration of additional tabular methods for RL problem-solving.
- Stock Trading Environment: Unique application of Q-learning in a simulated stock trading scenario.
- Comprehensive Analysis: Extensive evaluation and comparison of different RL methods.
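The tabular updates behind these implementations can be summarized in a few lines. The sketch below shows the standard textbook update rules for SARSA, Q-learning, and Double Q-learning; variable names and hyperparameter defaults are illustrative assumptions, not the repository's exact code.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy SARSA: bootstrap from the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy Q-learning: bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    """Double Q-learning: one table selects the greedy action, the other
    evaluates it, reducing the maximization bias of plain Q-learning."""
    if rng.random() < 0.5:
        best = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, best] - Q1[s, a])
    else:
        best = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, best] - Q2[s, a])
```

Here `Q`, `Q1`, and `Q2` are 2-D NumPy arrays indexed by (state, action); the epsilon-greedy behavior policy that generates `a` and `a_next` is omitted for brevity.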
- Deterministic Q-Learning: Showed a smooth decrease in epsilon and a steady increase in total rewards per episode, stabilizing at higher values indicating effective learning.
- Stochastic Q-Learning: Demonstrated more fluctuation in rewards due to randomness, with a less smooth epsilon decay than in the deterministic case.
- Deterministic Double Q-Learning: Performed slightly better than Q-Learning, with higher and more stable rewards.
- Stochastic Double Q-Learning: Similar fluctuations as stochastic Q-Learning, but generally showed better performance over time.
- Epsilon Decay Plot: Shows how the epsilon value decreases over episodes, indicating the agent's transition from exploration to exploitation.
- Total Rewards per Episode: Illustrates the cumulative reward obtained by the agent in each episode, showing the learning progress over time.
- Q-Learning vs. Double Q-Learning Comparison: Compares the performance of the Q-Learning and Double Q-Learning algorithms in the Frozen Lake environment.
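These plots are straightforward to reproduce with matplotlib once the per-episode epsilon values and returns are logged during training. A minimal sketch, assuming the training loop has stored them in plain Python lists:

```python
import matplotlib.pyplot as plt

def plot_training_curves(epsilons, episode_rewards):
    """Plot epsilon decay and total reward per episode side by side.
    Assumes both lists were recorded once per episode during training."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(epsilons)
    ax1.set_xlabel("Episode")
    ax1.set_ylabel("Epsilon")
    ax1.set_title("Epsilon Decay")

    ax2.plot(episode_rewards)
    ax2.set_xlabel("Episode")
    ax2.set_ylabel("Total Reward")
    ax2.set_title("Total Rewards per Episode")

    fig.tight_layout()
    plt.show()
```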
- SARSA: Steady increase in rewards indicating effective policy learning. Performance might vary if adapted to stochastic environments.
- Double Q-Learning: Achieved higher rewards compared to SARSA, showing improved learning efficiency. Potential for robust policy in stochastic environments.
- N-Step Bootstrapping: Outperformed SARSA with a steady increase in rewards, showing the benefit of multi-step updates; it can handle stochasticity more robustly (the n-step return is sketched after the plots below).
- Epsilon Decay Plots: epsilon decay curves for SARSA, Q-Learning, and Double Q-Learning.
- Total Rewards per Episode Plots: total rewards curves for SARSA, Q-Learning, and Double Q-Learning.
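For reference, the multi-step target used by n-step bootstrapping can be computed as in the sketch below. This is the textbook n-step SARSA return, with names and defaults chosen for illustration rather than taken from the repository.

```python
def n_step_return(rewards, Q, s_n, a_n, gamma=0.99, done=False):
    """Textbook n-step SARSA target: the sum of n discounted rewards plus a
    bootstrapped estimate Q[s_n][a_n] of the state-action reached after n steps.

    rewards: the n rewards r_{t+1}, ..., r_{t+n}
    s_n, a_n: state and action n steps ahead (ignored if the episode ended)
    """
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    if not done:
        G += (gamma ** len(rewards)) * Q[s_n][a_n]
    return G

# The update then mirrors SARSA: Q[s_t][a_t] += alpha * (G - Q[s_t][a_t])
```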
- Deterministic Q-Learning: Showed steady increase in rewards with a smooth epsilon decay.
- Stochastic Q-Learning: High variance in rewards due to environment randomness, less smooth epsilon decay.
- Deterministic Double Q-Learning: Higher and more stable rewards compared to Q-Learning, with better handling of the state-action space.
- Stochastic Double Q-Learning: High variance similar to stochastic Q-Learning, but occasionally higher peaks in rewards.
| Algorithm | Environment | Model Variation | Max Reward (Episode) | Episode 1000 Reward | Epsilon Decay Trend |
|---|---|---|---|---|---|
| Q-learning | Deterministic | Base Model | 400+ | 390 | Slow Decline |
| Q-learning | Deterministic | Hyperparameter Tuning (Max Timesteps: 20, Decay Rate: 0.75) | 800+ | 765 | Slow Decline |
| Q-learning | Deterministic | Hyperparameter Tuning (Max Timesteps: 20, Decay Rate: 0.995) | 800+ | 765 | Slow Decline |
| Q-learning | Stochastic | Base Model | 400+ | 305 | Slow Decline |
| Q-learning | Stochastic | Hyperparameter Tuning (Max Timesteps: 20, Decay Rate: 0.995) | 800+ | 660 | Slow Decline |
| Double Q-learning | Deterministic | Base Model | 400 | - | - |
| Double Q-learning | Stochastic | Base Model | 400+ | - | - |
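The "Decay Rate" column above controls how quickly epsilon is annealed per episode. A common multiplicative schedule, consistent with the slow decline seen in the plots, is sketched below; this is an assumed formulation for illustration, and the repository may use a different schedule.

```python
def decay_epsilon(epsilon, decay_rate=0.995, epsilon_min=0.01):
    """Multiplicative epsilon decay applied once per episode.
    A rate close to 1.0 (e.g. 0.995) decays slowly; 0.75 decays much faster."""
    return max(epsilon_min, epsilon * decay_rate)

# Example: starting from epsilon = 1.0 with rate 0.995, after 1000 episodes
# 0.995 ** 1000 ≈ 0.007, so epsilon sits at the 0.01 floor (mostly exploitation).
```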
- Q-Learning: Demonstrated a gradual increase in account value over episodes with expected fluctuations, indicating effective learning despite market volatility. The approach can be adapted to stochastic market conditions, reflecting real-world market dynamics and making it useful in algorithmic trading.
- Account Value Plot: Shows how the agent's account value changes over time, indicating the profitability of the trading strategy (a sketch of how such a curve can be produced follows below).
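The account-value curve can be produced by rolling out the learned greedy policy and recording the running balance after each step. A minimal sketch, assuming a tabular Q and an environment with the same `reset()`/`step()` shape as the hypothetical `StockTradingEnv` above (both are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def evaluate_account_value(env, Q, initial_cash=10_000.0):
    """Run one greedy episode and track the account value (starting cash plus
    realized profit) after every step. Assumes env.step returns
    (state, reward, done) with reward equal to realized profit/loss."""
    state, done = env.reset(), False
    account, history = initial_cash, [initial_cash]
    while not done:
        action = int(np.argmax(Q[state]))      # greedy action from the Q-table
        state, reward, done = env.step(action)
        account += reward
        history.append(account)
    return history

# plt.plot(evaluate_account_value(env, Q)); plt.xlabel("Step"); plt.ylabel("Account value")
```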
| Environment | Algorithm | Deterministic Environment Results | Stochastic Environment Results |
|---|---|---|---|
| Frozen Lake | Q-Learning | Epsilon Decay: smooth exponential decay; Total Rewards per Episode: gradually increasing, stable at higher values | Epsilon Decay: fluctuating due to randomness; Total Rewards per Episode: fluctuating with occasional high rewards |
| Frozen Lake | Double Q-Learning | Epsilon Decay: smooth exponential decay; Total Rewards per Episode: gradually increasing, slightly higher than Q-Learning | Epsilon Decay: fluctuating due to randomness; Total Rewards per Episode: fluctuating, generally lower than deterministic |
| Frozen Lake | Comparison | Double Q-Learning performs slightly better in stable environments | More variance observed, with occasional high rewards |
| Lawn Mower | SARSA | Total Rewards per Episode: increasing steadily | Performance might vary if adapted to a stochastic setting |
| Lawn Mower | Double Q-Learning | Total Rewards per Episode: higher rewards compared to SARSA | Potential for a robust policy in a stochastic environment |
| Lawn Mower | N-Step Bootstrapping | Total Rewards per Episode: steady increase, higher than SARSA | Can handle stochasticity with more robustness |
| Lawn Mower | Comparison | Double Q-Learning achieves better long-term rewards | Higher variance expected with stochastic adjustments |
| Squirrel Maze | Q-Learning | Epsilon Decay: smooth exponential decay; Total Rewards per Episode: increasing steadily | Epsilon Decay: fluctuating due to randomness; Total Rewards per Episode: fluctuating with high variance |
| Squirrel Maze | Double Q-Learning | Epsilon Decay: smooth exponential decay; Total Rewards per Episode: higher and more stable compared to Q-Learning | Epsilon Decay: fluctuating due to randomness; Total Rewards per Episode: fluctuating, generally lower than deterministic |
| Squirrel Maze | Comparison | Double Q-Learning shows better reward dynamics | High variance with occasional peaks in rewards |
| Stock Trading | Q-Learning | Epsilon Decay: smooth exponential decay; Total Rewards per Episode: initial fluctuations, eventually stabilizing; Evaluation: account value increases over time with fluctuations | Adapts well to deterministic trading strategies; can be adapted to stochastic market conditions; reflects real-world market dynamics, useful in algorithmic trading |
- Python
- Gymnasium Library
- Matplotlib (for visualizations)
- Various RL Algorithms
The project in this repository is intended solely as inspiration for your own projects and should be referenced accordingly. It is not meant to be used by students to fulfill academic project requirements. If a student uses it for such purposes, the creators are not responsible; the student is solely accountable for any violation of academic integrity. Any academic integrity issues should therefore be addressed with the student, not the creators.