import numpy as np

# Initialise Q and N
Q_pi = np.ones((nb_states, nb_actions))
N = np.zeros_like(Q_pi)
for i_generation in range(nb_generation):
    # Define π' = ε-greedy policy w.r.t. Q_π
    pi = epsilon_greedy_policy(Q_pi, eps)
    # Generate one episode by following π'
    states, actions, rewards = generate_episode(env, pi)
    # From this episode, get a rough approximation of Q_π'
    for t in range(len(states)):
        # Compute the discounted return G_t from time t:
        # G_t = rewards[t] + gamma * G_{t+1}
        Gt = compute_gain(rewards, t, gamma)
        # delta_t = G_t - Q(S_t, A_t)
        delta_t = Gt - Q_pi[states[t]][actions[t]]
        # Count this visit of the state-action pair
        N[states[t]][actions[t]] += 1
        # Incremental mean update: Q(S_t, A_t) += delta_t / N(S_t, A_t)
        Q_pi[states[t]][actions[t]] += delta_t / N[states[t]][actions[t]]
    # Decay the exploration rate after each generation
    eps *= 0.99
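The snippet relies on three helpers (epsilon_greedy_policy, generate_episode, compute_gain) and a few globals (env, nb_states, nb_actions, nb_generation, eps, gamma) that are not shown here. Below is a minimal sketch of what they might look like, assuming a tabular Gym-style environment where env.reset() returns a state index and env.step(action) returns (state, reward, done, info); the actual gist may define them differently.

# Hypothetical sketch of the helpers used above (not the gist's own definitions).
import numpy as np

def epsilon_greedy_policy(Q, eps):
    # Return a policy that picks a random action with probability eps,
    # otherwise the greedy action w.r.t. Q.
    nb_actions = Q.shape[1]
    def pi(state):
        if np.random.rand() < eps:
            return np.random.randint(nb_actions)
        return int(np.argmax(Q[state]))
    return pi

def generate_episode(env, pi):
    # Roll out one episode following pi; return the visited states,
    # the actions taken, and the rewards received.
    states, actions, rewards = [], [], []
    state = env.reset()
    done = False
    while not done:
        action = pi(state)
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards

def compute_gain(rewards, t, gamma):
    # Discounted return G_t = rewards[t] + gamma * rewards[t+1] + ...
    Gt = 0.0
    for reward in reversed(rewards[t:]):
        Gt = reward + gamma * Gt
    return Gt

With these in place, the loop above could be run on a small tabular task such as gym.make('FrozenLake-v1'), with nb_states = env.observation_space.n, nb_actions = env.action_space.n, and illustrative values like eps = 1.0, gamma = 0.99, and a few thousand generations; the environment and hyperparameter choices here are assumptions, not part of the original snippet.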