the goal of inverse reinforcement learning is to infer the reward function from expert demonstrations or an expert policy. the idea is to show the agent a bunch of demonstrations of how an expert in the real world solves a certain problem, so that it learns what problem the expert is trying to solve. for example, if you show it a million examples of a master chess player playing games, you expect your agent to learn that 'checkmate is very important', or that checkmate has high reward.
a generic irl algorithm looks like this:
- initialize the reward function estimate
- until converged:
- solve mdp using that estimated reward function
- update the estimate using some method (this varies from algorithm to algorithm)
for the update step, various algorithms over the years have proposed different solutions, ranging from basic linear programming to generative adversarial networks.
various papers have formalized the problem in slightly different ways, which we will list out here.
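to make the loop concrete, here is a minimal sketch in python. it is illustrative only: `solve_mdp` and `update_reward_estimate` are placeholder callables standing in for the algorithm-specific pieces, not functions from any particular paper.

```python
import numpy as np

def generic_irl(demos, solve_mdp, update_reward_estimate,
                n_features, n_iters=100, tol=1e-4):
    """Generic IRL loop (illustrative sketch).

    demos                  -- expert demonstrations (e.g. a list of trajectories)
    solve_mdp(theta)       -- returns a policy that is optimal for the reward
                              parameterized by theta
    update_reward_estimate -- algorithm-specific update (LP, gradient step, GAN, ...)
    """
    theta = np.random.randn(n_features)              # initialize the reward estimate
    for _ in range(n_iters):
        policy = solve_mdp(theta)                    # solve the MDP under the current estimate
        new_theta = update_reward_estimate(theta, policy, demos)
        if np.linalg.norm(new_theta - theta) < tol:  # crude convergence check
            return new_theta
        theta = new_theta
    return theta
```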
the basic setup for the irl problem comes from reinforcement learning itself.
in the most general setting, we consider the following.
a markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: a set of states, a set of actions, transition probabilities $P(s' \mid s, a)$, a reward function $R$, and a discount factor $\gamma \in [0, 1)$.
the total discounted reward is $G_t = \sum_{k=0}^{T} \gamma^k R_{t+k}$.
a policy is a function $\pi: \mathcal{S} \to \mathcal{A}$ (or, more generally, a distribution $\pi(a \mid s)$ over actions).
the goal of reinforcement learning is to find the optimal policy $\pi^*$ (so that over an episode the expected total discounted reward is maximized).
the goal of inverse reinforcement learning, on the other hand, is to find the reward function given the optimal policy (there is some work on how to tackle sub-optimal policies, but we will come to that later).
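as a quick numeric example of the discounted return defined above (the reward sequence here is made up):

```python
# discounted return G_0 = sum_k gamma^k * R_k for a made-up episode
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
g0 = sum(gamma ** k * r for k, r in enumerate(rewards))
print(g0)   # 0.9**2 * 1.0 + 0.9**4 * 5.0 = 4.0905
```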
Two methods of Feature Matching
- start with some estimate of the reward parameters
- calculate feature expectations
- calculate the inverse learning error
- update the parameters, either
    - using gradient descent, or
    - using an SVM
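feature expectations here are just (discounted) feature counts averaged over trajectories. a minimal sketch, where the trajectory format and `feature_fn` are my own assumptions rather than anything from the papers:

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.9):
    """Empirical feature expectations mu = E[ sum_t gamma^t f(s_t) ].

    trajectories -- list of trajectories, each a list of states
    feature_fn   -- maps a state to a 1-d numpy feature vector f(s)
    """
    mu = None
    for traj in trajectories:
        for t, s in enumerate(traj):
            phi = (gamma ** t) * feature_fn(s)
            mu = phi if mu is None else mu + phi
    return mu / len(trajectories)
```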
http://ai.stanford.edu/~ang/papers/icml00-irl.pdf
first IRL algorithm for trajectories. this algorithm does not explicitly require access to the expert policy and can work with trajectories only. the way I understand this algorithm is something like:
- start with a random reward function and a random policy
- write the reward function as a dot product of parameters and features, and formulate the value of a trajectory in terms of these parameters. our goal is to estimate these parameters
- until some stopping criterion is met:
    - generate trajectories based on the current policy
    - get the coefficients of the value of the trajectory based on the current reward function estimate
    - solve the following linear programming problem to get the new reward parameters
maximize
$$ \sum_{i=1}^{k} p\left( V^{\pi^*}(s_0) - V^{\pi_i}(s_0) \right) $$
such that
$$ |\alpha_i| \leq 1 $$
where $p$ is the function $p(x) = x$ if $x \geq 0$ and $p(x) = \lambda x$ otherwise; $\lambda$ is a hyperparameter, and ng and russell chose it to be 2.
- get a new policy that is optimal wrt this new reward function and add it to the list of policies
- return the parameters
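to make the LP step concrete, here is a sketch using `scipy.optimize.linprog`. it assumes a linear reward $R(s) = \alpha \cdot \phi(s)$, so that each value gap is a dot product $d_i \cdot \alpha$, and uses the standard trick of encoding $p(x) = \min(x, \lambda x)$ with auxiliary variables; all the variable names are mine, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def lp_reward_update(value_gaps, lam=2.0):
    """One LP step of the reward update (illustrative sketch).

    value_gaps -- (m, k) array; row i holds d_i such that
                  V^{pi*}(s_0) - V^{pi_i}(s_0) = d_i . alpha under a linear reward
    lam        -- penalty slope for negative gaps (the paper uses 2)

    Maximizes sum_i p(d_i . alpha) with p(x) = min(x, lam * x) and |alpha_j| <= 1,
    by adding auxiliary variables t_i <= d_i . alpha and t_i <= lam * (d_i . alpha).
    """
    m, k = value_gaps.shape
    # decision vector x = [alpha (k entries), t (m entries)]; linprog minimizes,
    # so we negate the t coefficients to maximize sum_i t_i
    c = np.concatenate([np.zeros(k), -np.ones(m)])
    # inequality rows: t_i - d_i.alpha <= 0  and  t_i - lam * d_i.alpha <= 0
    A_ub = np.zeros((2 * m, k + m))
    A_ub[:m, :k] = -value_gaps
    A_ub[m:, :k] = -lam * value_gaps
    A_ub[:m, k:] = np.eye(m)
    A_ub[m:, k:] = np.eye(m)
    b_ub = np.zeros(2 * m)
    bounds = [(-1, 1)] * k + [(None, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]   # the new reward parameters alpha
```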
- the assumption here is that the expert is exponentially more likely to take trajectories that generate higher reward
- for a parameterized reward function $R_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$, trajectories are distributed as
$$ p(\tau) = \frac{1}{Z} e^{R_\psi(\tau)} \quad \text{with} \quad Z = \sum_\tau e^{R_\psi(\tau)} $$
- so we want to parametrize the reward function in the way that maximizes the log-likelihood of the $M = |D|$ demonstrations
$$ L(\psi) = \sum_{\tau \in D} \log p_\psi(\tau) $$ $$ = \sum_{\tau \in D} \left( \log \frac{1}{Z} + R_\psi(\tau) \right) $$ $$ = \sum_{\tau \in D} R_\psi(\tau) - M \log Z $$ $$ = \sum_{\tau \in D} R_\psi(\tau) - M \log \sum_\tau e^{R_\psi(\tau)} $$
- so the gradient wrt the parameters is
$$ \nabla_\psi L(\psi) = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \frac{1}{\sum_\tau e^{R_\psi(\tau)}} \sum_\tau e^{R_\psi(\tau)} \frac{dR_\psi(\tau)}{d\psi} $$
$$ = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \sum_\tau \frac{e^{R_\psi(\tau)}}{\sum_{\tau'} e^{R_\psi(\tau')}} \frac{dR_\psi(\tau)}{d\psi} $$ $$ = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \sum_\tau p(\tau \mid \psi) \frac{dR_\psi(\tau)}{d\psi} $$
the expectation over trajectories $p(\tau \mid \psi)$ can be rewritten as an expectation over states using $p(s \mid \psi)$, otherwise known as the state-visitation frequency. so, finally,
$$ \nabla_\psi L(\psi) = \frac{1}{|D|} \sum_{\tau_d \in D} \frac{dr_\psi(\tau_d)}{d\psi} - \sum_s p(s \mid \psi) \frac{dr_\psi(s)}{d\psi} $$
- How do we calculate the state-visitation frequency in an efficient way?
TODO
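one standard answer (e.g. in ziebart-style maxent irl) is a dynamic-programming forward pass that propagates the state distribution one step at a time instead of enumerating trajectories. here is a tabular sketch assuming known dynamics and a given stochastic policy (in maxent irl that policy would be the soft-optimal policy under the current reward estimate); the names are mine:

```python
import numpy as np

def state_visitation_frequency(P, policy, p0, T):
    """Expected state-visitation frequencies rho[s] = sum_t Pr(s_t = s).

    P      -- (S, A, S) array, P[s, a, s'] = transition probability
    policy -- (S, A) array, policy[s, a] = probability of action a in state s
    p0     -- (S,) initial state distribution
    T      -- horizon
    """
    S, A, _ = P.shape
    d = np.zeros((T, S))
    d[0] = p0
    for t in range(1, T):
        # d[t, s'] = sum_{s, a} d[t-1, s] * policy[s, a] * P[s, a, s']
        d[t] = np.einsum("s,sa,sax->x", d[t - 1], policy, P)
    return d.sum(axis=0)
```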
- So the final algorithm looks like:
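a minimal sketch of that loop, assuming a linear reward $r_\psi(s) = \psi \cdot f(s)$ over a tabular state space. `planner` (e.g. soft value iteration under the current reward) and `svf_fn` (e.g. the forward pass sketched above) are passed in as callables; none of these names come from a specific implementation.

```python
import numpy as np

def maxent_irl(features, demos, planner, svf_fn, lr=0.1, n_iters=200):
    """MaxEnt IRL with a linear reward r_psi(s) = psi . f(s) (illustrative sketch).

    features -- (S, k) array of per-state feature vectors f(s)
    demos    -- list of demonstrations, each a list of state indices
    planner  -- planner(reward) -> stochastic policy under the current reward
    svf_fn   -- svf_fn(policy) -> expected state-visitation frequencies, shape (S,)
    """
    S, k = features.shape
    psi = np.random.uniform(size=k)

    # empirical expected feature counts of the demonstrations
    demo_counts = np.zeros(k)
    for traj in demos:
        for s in traj:
            demo_counts += features[s]
    demo_counts /= len(demos)

    for _ in range(n_iters):
        reward = features @ psi                 # r_psi(s) for every state
        policy = planner(reward)                # soft-optimal policy under r_psi
        rho = svf_fn(policy)                    # state-visitation frequencies
        grad = demo_counts - features.T @ rho   # gradient of the log-likelihood
        psi += lr * grad                        # gradient ascent on L(psi)
    return features @ psi                       # learned per-state reward
```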
- claims to learn nearly optimal rewards, along with policy
- GAIL recovers the policy only, not transferable rewards
- Reward Learning is difficult because
    - many optimal policies can explain the same demonstrations
    - many reward functions can induce the same optimal policy
- Empowerment is a mutual-information-based potential function, like a value function, which intuitively quantifies, for a state, the extent to which an agent can influence its environment.
- Empowerment acts as a regularizer in policy update
- Empowerment is the maximum of the mutual information between a sequence of $K$ actions $a$ and the final state $s'$ reached after execution of those actions, conditioned on the current state $s$:
$$ \Phi(s) = \max I(a, s' \mid s) = \max \mathbb{E}_{p(s' \mid a, s)\, w(a \mid s)} \left[ \log \left( \frac{p(a, s' \mid s)}{w(a \mid s)\, p(s' \mid s)} \right) \right] $$
- after some mathematical gymnastics, they approximate Empowerment and end up optimizing it with the loss function
$$ l_I(s, a, s') = \left| \beta \log q_\phi(a \mid s', s) - \big( \log \pi_\theta(a \mid s) + \Phi_\varphi(s) \big) \right| $$
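written as code, the regularizer is a direct transcription of that formula (the names are mine):

```python
import numpy as np

def empowerment_regularizer(log_q, log_pi, phi, beta=1.0):
    """l_I(s, a, s') from the formula above.

    log_q  -- log q_phi(a | s', s) from the inverse model
    log_pi -- log pi_theta(a | s) from the policy
    phi    -- empowerment estimate Phi_varphi(s)
    """
    return np.abs(beta * log_q - (log_pi + phi))
```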
- the algorithm has four models, and they are trained simultaneously:
- inverse model (a maximum log-likelihood supervised learning problem) that, given a set of trajectories, minimizes the mean-squared error between its predicted action $q_\phi(a \mid s', s)$ and the action $a$ taken in the generated trajectory
$$ l_q(s, a, s') = (q_\phi(\cdot \mid s, s') - a)^2 $$
- empowerment optimization, as noted before
- reward function
first compute the discriminator as
$$ D_{\zeta, \varphi}(s, a, s') = \frac{\exp[r_\zeta(s, a) + \gamma \Phi_{\varphi'}(s') - \Phi_\varphi(s)]}{\exp[r_\zeta(s, a) + \gamma \Phi_{\varphi'}(s') - \Phi_\varphi(s)] + \pi_\theta(a \mid s)} $$
then train the parameters $\zeta$ by binary logistic regression to discriminate between expert and generated trajectories via
$$ \mathbb{E}_{\tau_E} [\log D_{\zeta, \varphi}(s, a, s')] + \mathbb{E}_{\tau} [\log (1 - D_{\zeta, \varphi}(s, a, s'))] $$
- policy optimization
train the policy $\pi_\theta(a \mid s)$ to maximize the discriminative reward $\hat{r}(s, a, s') = \log D(s, a, s') - \log (1 - D(s, a, s'))$ and to minimize the loss function $l_I(s, a, s') = \mid \beta \log q_\phi(a \mid s, s') - (\log \pi_\theta(a \mid s) + \Phi_\varphi(s)) \mid$, which accounts for the empowerment regularization. the overall training objective becomes
$$ \mathbb{E}_{\pi} [\log \pi_\theta(a \mid s)\, \hat{r}(s, a, s')] + \lambda_I\, \mathbb{E}_{\tau} [l_I(s, a, s')] $$
- So, the final algorithm looks something like the rough sketch below.
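this is only a structural skeleton, in my own words; the model objects, their `update_*` methods, and the sampling helper are hypothetical stand-ins, not the paper's implementation.

```python
def eairl_training_loop(env, policy, inverse_model, empowerment, reward,
                        expert_trajs, n_iters=1000, lambda_i=1.0):
    """Skeleton of an empowerment-regularized adversarial IRL loop (illustrative)."""
    for _ in range(n_iters):
        # 1. roll out the current policy to get generated trajectories
        gen_trajs = policy.sample_trajectories(env)

        # 2. inverse model: supervised regression of actions from (s, s') pairs (l_q)
        inverse_model.update(gen_trajs)

        # 3. empowerment: fit Phi_varphi(s) using the regularizer loss l_I
        empowerment.update(gen_trajs, policy, inverse_model)

        # 4. reward / discriminator: logistic regression, expert vs generated
        reward.update_discriminator(expert_trajs, gen_trajs, empowerment, policy)

        # 5. policy: maximize r_hat = log D - log(1 - D), regularized by lambda_I * l_I
        policy.update(gen_trajs, reward, empowerment, lambda_i)

    return reward, policy
```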
so all algorithms differ in two ways -
- how to measure the difference between current estimation and expert demonstrations
- how to update the reward function estimate
if that is the case, we can probably frame irl completely as a supervised learning problem. then again, maybe we cannot, as giving the reward function in the training set is giving away the answer.
is it possible to have an end-to-end approach to IRL? we input a bunch of environments and expert trajectories and get reward functions as outputs? would current deep learning techniques be able to tackle this? my hunch is that this will not generalize well. but how do I prove that?
setup
- state space $\mathcal{S} \subseteq \mathbb{R}^n$
- action space $\mathcal{A} \subseteq \mathbb{R}^m$
- a feature function $f: \mathcal{S} \to \mathbb{R}^k : s \mapsto f(s)$
- unknown reward function $\mathcal{R}: \mathcal{S} \to \mathbb{R}$
assumptions
- reward function is some non-linear function of the features: $\mathcal{R}(s) = φ(f(s))$
have
- trajectories
$$ζ_1, ζ_2, …, ζ_l$$ $$ζ_i = (s_1, a_1), (s_2, a_2), …$$
want to arrive at a formulation in which trajectories with more reward are exponentially more likely to be taken by the expert
maximum entropy principle: choose the probability distribution that has maximum information entropy amongst all distributions that satisfy the data
derivation
expert feature count from a trajectory $$ F(\zeta) = \sum_{s \in \zeta} f(s) \hspace{2cm} \in \mathbb{R}^k $$
expert feature count from all trajectories $$ F^* = \frac{1}{l} \sum_{i=1}^{l} F(\zeta_i) $$
want a reward function that explains these feature counts (i.e. all the trajectories), assuming optimal rationality of the expert
total reward obtained in a path $$ G(\zeta) = \sum_{s \in \zeta} \mathcal{R}(s) $$
matching features
the probability distribution over all possible paths should be such that the expected feature count matches the expert feature count: $$ \sum_{\text{all possible } \zeta} P(\zeta) F(\zeta) = F^* $$
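to connect this back to the 'exponentially more likely' goal above: maximizing entropy subject to this feature-matching constraint (plus normalization) is a standard constrained optimization, and solving it with lagrange multipliers $\theta$ on the feature constraint (my notation) gives the exponential form directly:
$$ \max_P \; -\sum_\zeta P(\zeta) \log P(\zeta) \quad \text{s.t.} \quad \sum_\zeta P(\zeta) F(\zeta) = F^*, \quad \sum_\zeta P(\zeta) = 1 $$
setting the derivative of the lagrangian with respect to each $P(\zeta)$ to zero gives
$$ P(\zeta) = \frac{1}{Z(\theta)} e^{\theta^\top F(\zeta)}, \qquad Z(\theta) = \sum_\zeta e^{\theta^\top F(\zeta)} $$
so with a linear reward $\mathcal{R}(s) = \theta^\top f(s)$ the exponent is exactly the total reward $G(\zeta)$, i.e. higher-reward paths are exponentially more likely, which is the form used in the maxent section above.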
challenges
- accurate inference (underspecified problem)
- generalizability
- correctness of prior knowledge (if you are using feature functions then they must be accurate and your method of IRL should be less sensitive to the accuracy of the prior knowledge)
- solution complexity (grows with state-action space size)