Actor-critic methods were among the earliest to be investigated in RL. In the 1990s they were supplanted by action-value methods such as Q-learning. These methods directly model the value of each action in each state and choose actions based on the best values. The approach is appealingly simple, but it has been shown to run into theoretical difficulties when combined with function approximation.
The problem is that, with function approximation, each update is “global”: changing the parameters changes the values of many state-action pairs at once, not just the one being updated. Action-value methods take “hard” actions (for example, greedy ones), so a slight change in the value function can lead to a very different policy. As a result, action-value methods have been shown in certain cases not to converge, even with linear function approximation.
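To make the “hard” versus “soft” contrast concrete, here is a minimal sketch. The two-action value vectors and the helper names `greedy_policy` and `softmax_policy` are illustrative assumptions, not something from a particular library.

```python
import numpy as np

def greedy_policy(q):
    """'Hard' policy: all probability on the highest-valued action."""
    probs = np.zeros_like(q, dtype=float)
    probs[np.argmax(q)] = 1.0
    return probs

def softmax_policy(prefs, temperature=1.0):
    """'Soft' policy: probabilities vary smoothly with the preferences."""
    z = (prefs - np.max(prefs)) / temperature
    e = np.exp(z)
    return e / e.sum()

# Hypothetical value estimates before and after one small "global" update.
q_before = np.array([1.00, 0.99])
q_after = np.array([0.99, 1.00])  # each value moved by only 0.01

# Total change in action probabilities caused by that small update.
greedy_change = np.abs(greedy_policy(q_before) - greedy_policy(q_after)).sum()
soft_change = np.abs(softmax_policy(q_before) - softmax_policy(q_after)).sum()
```

Under these assumed numbers the greedy policy flips completely (`greedy_change` is 2.0), while the softmax policy barely moves (`soft_change` is about 0.01).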
The alternative is to parameterize the policy itself instead of deriving it from a value function. We can use a stochastic policy, which gives a probability distribution over possible actions given the state, or a deterministic policy, which gives a single action. One advantage of a parameterized policy over an action-value method is that, although parameter updates are still “global” under function approximation, the action is determined not by a single action-value but by many parameters, so a small update changes the policy only slightly. This is especially true of a stochastic policy, where the actions we take are “soft”. Another advantage shows up in continuous action spaces, where we can use a deterministic policy. We will talk about this later in the talk.
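As a rough sketch of the two kinds of parameterized policy, assume linear parameters on a state feature vector. The shapes, the softmax form, and the names `stochastic_policy` and `deterministic_policy` are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_policy(theta, state):
    """Softmax over linear action preferences.

    theta has shape (n_actions, n_features); returns a distribution over actions.
    """
    prefs = theta @ state
    prefs = prefs - prefs.max()  # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def deterministic_policy(w, state):
    """Linear map from state features to a single continuous action.

    w has shape (action_dim, n_features); returns one action vector.
    """
    return w @ state

state = np.array([1.0, 0.5, -0.2])   # hypothetical state features
theta = rng.normal(size=(4, 3))      # 4 discrete actions
w = rng.normal(size=(2, 3))          # 2-dimensional continuous action

probs = stochastic_policy(theta, state)
action = rng.choice(len(probs), p=probs)         # sample a discrete action
continuous_action = deterministic_policy(w, state)
```

The stochastic policy returns a distribution we sample from, while the deterministic one returns the action directly, which is what makes it convenient when the action space is continuous.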
With a parameterized policy, the question then is how to update the policy directly toward higher reward, irrespective of the accuracy of any value function. That is, I want to improve my policy so that it achieves higher reward even when I don’t have a good estimate of the value function. One solution is policy gradient.
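In symbols, the idea can be sketched as gradient ascent on expected return. The notation here is assumed for illustration: $\theta$ are the policy parameters, $\pi_\theta$ the policy, $\gamma$ a discount factor, $\alpha$ a step size, and $Q^{\pi_\theta}$ the true action-value function of the current policy.

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta),
\qquad
\nabla_\theta J(\theta) \;\propto\;
\mathbb{E}_{s,\, a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)
\right]
```

The last expression is the form usually quoted for stochastic policies. It only needs samples of states and actions from the current policy, plus some estimate of how good each sampled action was.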
With just a parameterized policy and no critic, you get an “actor-only” approach. Here we don’t try to model the value function and just move the parameters along the gradient of the reward. An example is Williams’ REINFORCE algorithm, which updates the weights of a neural network based only on the reward received (in episodic tasks, the sampled return), with no learned value estimate.
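Here is a minimal actor-only sketch in the spirit of REINFORCE, on an assumed multi-armed bandit with immediate rewards. It uses a softmax over per-arm preferences rather than a neural network, and the arm means, noise level, step size, and function names are all hypothetical.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Actor-only update: theta += lr * reward * grad log pi(action)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))  # one preference per arm
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)
        # Noisy immediate reward for the chosen arm (assumed Gaussian noise).
        r = mean_rewards[a] + rng.normal(scale=0.1)
        # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * r * grad_log_pi    # no critic, no baseline
    return softmax(theta)

final_probs = reinforce_bandit(np.array([0.2, 1.0, 0.5]))
```

With these assumed settings the policy should end up putting most of its probability on the arm with the highest mean reward. Nothing here estimates a value function: the sampled reward alone scales each update.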
Combining policy gradient with an action-value method gives you actor-critic: the actor is the parameterized policy, and the critic is the action-value function. In the actor-critic framework, the critic observes the reward from the environment and estimates the action-value function. The actor updates its parameters along the policy gradient, using the critic’s action-value estimate via the “policy gradient theorem”. This procedure is similar to generalized policy iteration, which has two steps: policy evaluation (corresponding to the critic update) and policy improvement (corresponding to the actor update). The difference is that we require the actor to choose an action based on a parameterized policy, instead of directly on the state-action values.
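The loop below is a small tabular sketch of this interplay, under several assumptions of mine: a hypothetical five-state chain environment, a SARSA-style critic, a softmax actor, and a baseline computed from the critic subtracted in the actor update to reduce variance. It is one possible instance, not a reference implementation.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2  # hypothetical chain: action 0 = left, 1 = right
GAMMA = 0.9

def step(state, action):
    """Move along the chain; reaching the right end pays 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic(episodes=500, actor_lr=0.1, critic_lr=0.2, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))  # actor: action preferences
    q = np.zeros((N_STATES, N_ACTIONS))      # critic: action-value estimates
    for _ in range(episodes):
        s = 0
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        for _ in range(max_steps):
            s2, r, done = step(s, a)
            a2 = rng.choice(N_ACTIONS, p=softmax(theta[s2]))
            # Critic update (policy evaluation): move Q(s, a) toward the TD target.
            target = r if done else r + GAMMA * q[s2, a2]
            q[s, a] += critic_lr * (target - q[s, a])
            # Actor update (policy improvement): grad log pi(a|s) scaled by the
            # critic's estimate minus a baseline (assumed variance reduction).
            probs = softmax(theta[s])
            advantage = q[s, a] - probs @ q[s]
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            theta[s] += actor_lr * advantage * grad_log_pi
            if done:
                break
            s, a = s2, a2
    return theta, q

theta, q = actor_critic()
# Learned action probabilities for the non-terminal states.
policy = np.array([softmax(theta[s]) for s in range(N_STATES - 1)])
```

Note that the actor never takes an argmax over `q`. It always samples from its own softmax policy, and the critic only influences how the policy parameters move.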
One important theoretical result about actor-critic is that, given a policy, you can derive a “compatible” critic: one whose approximation of the action-value introduces no bias into the policy gradient. In practice, however, people often use non-compatible critics, which are more flexible, together with various variance reduction methods to help them converge.
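For reference, the compatibility conditions are usually stated roughly as follows. The notation is assumed here: $\pi_\theta$ is the policy, $Q_w$ the critic with parameters $w$, and $Q^{\pi_\theta}$ the true action-value function.

```latex
Q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)^{\top} w,
\qquad
w \in \arg\min_{w'} \;
\mathbb{E}_{s,\, a \sim \pi_\theta}\!\left[
  \left( Q^{\pi_\theta}(s, a) - Q_{w'}(s, a) \right)^{2}
\right]
\;\;\Longrightarrow\;\;
\nabla_\theta J(\theta) \;\propto\;
\mathbb{E}_{s,\, a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s) \, Q_w(s, a)
\right]
```

In words: if the critic is linear in the policy’s score function and its weights minimize the mean-squared error to the true action-values, then substituting it for the true action-values leaves the policy gradient unchanged.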
- Natural Actor-Critic
- A Survey of Actor-Critic Reinforcement Learning
- Incremental Natural Actor-Critic Algorithms
- Comparing Policy-Gradient Algorithms
- Deterministic Policy Gradient Algorithms
- Continuous Control with Deep Reinforcement Learning