MP #4 - Deep RL methods
Download [attachment]
Late Policy
- You have 4 free late days in total across all 4 homework assignments.
- You can use late days for any assignments, in whole-day increments (i.e., one day being the minimum unit). A late day extends the deadline by 24 hours.
- Once you have used all 4 late days, the penalty is 10% for each additional late day (down to a minimum of 0 points).
Bonus Policy
- The maximum score for each assignment is capped at 100, i.e., final score = min(score after bonus, 100).
- The cap applies to the assignment as a whole, not to individual questions.
This assignment is designed for you to gain hands‑on experience with deep reinforcement learning using neural‑network‑based policy and value function approximators. You will implement and evaluate DQN, Policy Gradient (PG), and Actor–Critic (AC) on the CartPole‑v1 environment from OpenAI Gym.
- Part 1: Implementing deep RL algorithms (65 points)
- Part 2: Fine‑tuning and analyses (35 points)
A XueTang assignment page has been created. Submit your written report in PDF format, following the same naming rule: studentID-assignment4.pdf.
Environment: CartPole‑v1
In this assignment, you will use the CartPole‑v1 environment (gym.make('CartPole-v1')).
The following content is copied from the Gymnasium documentation, the maintained successor to OpenAI Gym (https://gymnasium.farama.org/environments/classic_control/cart_pole/).
Description
This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”. A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.
Action Space
The action is an ndarray with shape (1,) which can take values {0, 1} indicating the direction of the fixed force the cart is pushed with.
- 0: Push cart to the left
- 1: Push cart to the right
Note: The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.
Observation Space
The observation is an ndarray with shape (4,) with the values corresponding to the following positions and velocities:
| Num | Observation | Min | Max |
|---|---|---|---|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |
Note: While the ranges above denote the possible values for observation space of each element, it is not reflective of the allowed values of the state space in an unterminated episode. Particularly:
- The cart x-position (index 0) can take values between (-4.8, 4.8), but the episode terminates if the cart leaves the (-2.4, 2.4) range.
- The pole angle can be observed between (-.418, .418) radians (or ±24°), but the episode terminates if the pole angle is not in the range (-.2095, .2095) (or ±12°).
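The termination ranges in the note above can be checked directly from an observation. The following is a minimal sketch, not the environment's own code: the function name `is_terminated` and the constant names are made up for illustration, and the observation is assumed to be ordered as in the table above.

```python
import math

# Thresholds quoted in the note above; the radian value is 12 degrees converted.
X_THRESHOLD = 2.4
THETA_THRESHOLD = 12 * math.pi / 180  # about 0.2095 rad


def is_terminated(obs):
    """Return True if an observation lies outside the allowed ranges.

    obs is assumed to be ordered as in the observation table:
    (cart position, cart velocity, pole angle, pole angular velocity).
    """
    x, _x_dot, theta, _theta_dot = obs
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_terminated((0.0, 0.0, 0.0, 0.0)))   # centered cart, upright pole
print(is_terminated((2.5, 0.0, 0.0, 0.0)))   # cart outside (-2.4, 2.4)
print(is_terminated((0.0, 0.0, 0.25, 0.0)))  # pole tilted past 12 degrees
```

The two velocity components are unbounded and play no role in termination, which is why the sketch ignores them.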
Rewards
Since the goal is to keep the pole upright for as long as possible, by default, a reward of +1 is given for every step taken, including the termination step. The default reward threshold is 500 for v1 and 200 for v0 due to the time limit on the environment.
If sutton_barto_reward=True, then a reward of 0 is awarded for every non-terminating step and -1 for the terminating step. As a result, the reward threshold is 0 for v0 and v1.
Starting State
All observations are assigned a uniformly random value in (-0.05, 0.05).
Episode End
The episode ends if any one of the following occurs:
- Termination: Pole angle leaves the ±12° range
- Termination: Cart position leaves the ±2.4 range (center of the cart reaches the edge of the display)
- Truncation: Episode length is greater than 500 (200 for v0)
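To make the reward and episode-end rules concrete, here is a rough sketch of a single-episode loop. It assumes the Gymnasium-style API, in which `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`; older Gym releases use a different signature. The names `run_episode` and `random_policy` are illustrative and are not part of the starter code.

```python
import random


def run_episode(env, policy):
    """Run one episode and return (total_reward, steps, ended_by).

    env is assumed to follow the Gymnasium API:
    reset() -> (obs, info); step(a) -> (obs, reward, terminated, truncated, info).
    """
    obs, _info = env.reset()
    total_reward, steps = 0.0, 0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            ended_by = "terminated" if terminated else "truncated"
            return total_reward, steps, ended_by


def random_policy(_obs):
    """Push left (0) or right (1) uniformly at random, ignoring the observation."""
    return random.choice((0, 1))
```

With Gymnasium installed, this could be called as `run_episode(gymnasium.make("CartPole-v1"), random_policy)`. Under the default +1-per-step reward, the returned total reward equals the episode length.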
Part 1: Implementing Deep RL Algorithms (65 points)
You will complete implementations for the following algorithms:
- Deep Q‑Network (DQN)
- Policy Gradient (PG, using REINFORCE)
- Actor–Critic (AC, one‑step advantage‑based)
Starter code is provided in agents/. All algorithm agents derive from an abstract base class. You need to implement all the abstract methods. Specifically, fill in the action selections, training loops, neural network updates, and helper functions. You may add new methods where needed.
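As a reminder of how abstract base classes behave in Python, the sketch below shows the general pattern. Every name in it (`BaseAgent`, `select_action`, `update`, `RandomAgent`) is hypothetical; the actual base class and method names in agents/ may differ, so follow the starter code where it disagrees.

```python
import random
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Hypothetical base class; the real one in agents/ may look different."""

    def __init__(self, n_actions):
        self.n_actions = n_actions

    @abstractmethod
    def select_action(self, obs):
        """Map an observation to an action index."""

    @abstractmethod
    def update(self, transition):
        """Learn from one (obs, action, reward, next_obs, done) tuple."""


class RandomAgent(BaseAgent):
    """Trivial subclass: acts uniformly at random and learns nothing."""

    def select_action(self, obs):
        return random.randrange(self.n_actions)

    def update(self, transition):
        pass  # a real agent would adjust its network parameters here


agent = RandomAgent(n_actions=2)
action = agent.select_action((0.0, 0.0, 0.0, 0.0))
```

Python refuses to instantiate a subclass that leaves any abstract method unimplemented (it raises `TypeError`), which is a quick way to check that nothing has been missed.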
Deliverables:
- (5 points - NN parameterization) Provide your neural network implementation. Clearly explain the input and output format of each network, and briefly explain how the networks differ from their discrete counterparts (the tabular settings of previous MPs).
- (20 points - DQN) Describe the design of DQN, including the optimization objective and the use of the replay buffer. Paste your implementation of DQNAgent.
- (20 points - PG) Derive the policy gradient formula and state the mathematical formulation of REINFORCE. Paste your implementation of PGAgent.
- (20 points - AC) Describe the design of the Actor–Critic algorithm, highlighting its differences from PG. Paste your implementation of ACAgent.
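The three algorithms differ mainly in the learning signal they compute from experience. The sketch below shows the standard textbook forms of those signals in plain Python, without any neural network; the function names and the discount parameter `gamma` are illustrative, and it is not a substitute for the agents you are asked to implement.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step (REINFORCE)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def dqn_target(reward, next_q_values, done, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a'), or r if done."""
    return reward if done else reward + gamma * max(next_q_values)


def td_advantage(reward, value, next_value, done, gamma):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Here `done` is meant to mark true termination. Whether to bootstrap from the next state after a time-limit truncation is a design choice worth discussing in your report.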
Part 2: Fine-tuning and Analyses (35 points)
Once deep neural networks are used to approximate policy and value functions, hyperparameter tuning becomes a key factor in the resulting performance.
Deliverables:
- (10 points - Replay Buffer) Tune the replay buffer settings in the DQN algorithm. Analyze the effect of replay buffer size, batch size, and access scheme (a queue vs. a priority queue) on final performance.
- (10 points - Tuning) For each algorithm, tune at least one hyperparameter (hidden size, learning rate, etc.). Report the performance under each hyperparameter setting and give a brief analysis.
- (15 points - Comparison) Plot training reward as a function of the number of training episodes for the three (tuned) algorithms on CartPole-v1. Compare their pros and cons in terms of:
  - Final performance
  - Convergence speed
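For the replay buffer study, the two access schemes can be prototyped in a few lines. The sketch below is one possible reading of "queue vs. priority queue": a FIFO buffer with uniform sampling, plus sampling proportional to a stored priority. The proportional variant is a simplified stand-in for prioritized experience replay, not a heap-based priority queue; it omits priority updates and importance-sampling corrections. The class and method names are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """FIFO buffer: once full, the oldest transition is dropped."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, transition, priority=1.0):
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample_uniform(self, batch_size):
        """Each stored transition is equally likely (without replacement)."""
        return random.sample(list(self.transitions), batch_size)

    def sample_prioritized(self, batch_size):
        """Sample with probability proportional to priority (with replacement)."""
        return random.choices(
            list(self.transitions), weights=list(self.priorities), k=batch_size
        )

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(("transition", i), priority=float(i))  # placeholder transitions
print(len(buffer))  # 3: transitions 0 and 1 were evicted
```

In a DQN agent, the priority would typically be derived from the magnitude of the TD error, so that surprising transitions are replayed more often.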
Submission
Submit the following to Gradescope:
- PDF report (name: studentID-assignment4.pdf)
- Your completed code files
- Figures described above
Late submissions follow course policy.
