the goal of inverse reinforcement learning is to infer the reward function from expert demonstrations or an expert policy. the idea is to show the agent a bunch of demonstrations of how an expert in the real world solves a certain problem, so that it learns what problem the expert is trying to solve. for example, if you show it a million examples of a master chess player playing games, you expect your agent to learn that 'checkmate is very important', or that checkmate has high reward.
a generic irl algorithm looks like this:
- initialize the reward function estimate
- until converged:
- solve mdp using that estimated reward function
- update the estimate using some method (this varies from algorithm to algorithm)
for the update step, various algorithms over the years have proposed different solutions, ranging from basic linear programming to generative adversarial networks.
various papers have formalized the problem in slightly different ways, which we will list out here.
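to make the loop concrete, here is a minimal sketch in python. it is illustrative only: `solve_mdp` and `update_reward_estimate` are placeholder callables standing in for the algorithm-specific pieces, not functions from any particular paper.

```python
import numpy as np

def generic_irl(demos, solve_mdp, update_reward_estimate,
                n_features, n_iters=100, tol=1e-4):
    """Generic IRL loop (illustrative sketch).

    demos                  -- expert demonstrations (e.g. a list of trajectories)
    solve_mdp(theta)       -- returns a policy that is optimal for the reward
                              parameterized by theta
    update_reward_estimate -- algorithm-specific update (LP, gradient step, GAN, ...)
    """
    theta = np.random.randn(n_features)              # initialize the reward estimate
    for _ in range(n_iters):
        policy = solve_mdp(theta)                    # solve the MDP under the current estimate
        new_theta = update_reward_estimate(theta, policy, demos)
        if np.linalg.norm(new_theta - theta) < tol:  # crude convergence check
            return new_theta
        theta = new_theta
    return theta
```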
the basic setup for the irl problem comes from reinforcement learning itself.
in the most general setting, we consider the following.
a markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: a set of states, a set of actions, transition probabilities $P(s' \mid s, a)$, a reward function $R$, and a discount factor $\gamma \in [0, 1)$.
the total discounted reward is $G_t = \sum_{k=0}^{T} \gamma^k R_{t+k}$.
a policy is a function $\pi: \mathcal{S} \to \mathcal{A}$ (or, more generally, a distribution $\pi(a \mid s)$ over actions).
the goal of reinforcement learning is to find the optimal policy $\pi^*$ (so that over an episode the expected total discounted reward is maximized).
the goal of inverse reinforcement learning, on the other hand, is to find the reward function given the optimal policy (there is some work on how to tackle sub-optimal policies, but we will come to that later).
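as a quick numeric example of the discounted return defined above (the reward sequence here is made up):

```python
# discounted return G_0 = sum_k gamma^k * R_k for a made-up episode
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
g0 = sum(gamma ** k * r for k, r in enumerate(rewards))
print(g0)   # 0.9**2 * 1.0 + 0.9**4 * 5.0 = 4.0905
```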
Two methods of Feature Matching
- start with some estimate of the reward parameters
- calculate feature expectations
- calculate the inverse learning error
- update the parameters, either
    - using gradient descent, or
    - using an SVM
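feature expectations here are just (discounted) feature counts averaged over trajectories. a minimal sketch, where the trajectory format and `feature_fn` are my own assumptions rather than anything from the papers:

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.9):
    """Empirical feature expectations mu = E[ sum_t gamma^t f(s_t) ].

    trajectories -- list of trajectories, each a list of states
    feature_fn   -- maps a state to a 1-d numpy feature vector f(s)
    """
    mu = None
    for traj in trajectories:
        for t, s in enumerate(traj):
            phi = (gamma ** t) * feature_fn(s)
            mu = phi if mu is None else mu + phi
    return mu / len(trajectories)
```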
http://ai.stanford.edu/~ang/papers/icml00-irl.pdf
first IRL algorithm for trajectories. this algorithm does not explicitly require access to the expert policy and can work with trajectories only. the way I understand this algorithm is something like:
- start with a random reward function and a random policy
- write the reward function as a dot product of parameters and features, and formulate the value of a trajectory in terms of these parameters. our goal is to estimate these parameters
- until some stopping criterion is met:
    - generate trajectories based on the current policy
    - get the coefficients of the value of the trajectory based on the current reward function estimate
    - solve the following linear programming problem to get the new reward parameters
maximize
$$ \sum_{i=1}^{k} p\left( V^{\pi^*}(s_0) - V^{\pi_i}(s_0) \right) $$
such that
$$ |\alpha_i| \leq 1 $$
where $p$ is the function $p(x) = x$ if $x \geq 0$ and $p(x) = \lambda x$ otherwise; $\lambda$ is a hyperparameter, and ng and russell chose it to be 2.
- get a new policy that is optimal wrt this new reward function and add it to the list of policies
- return the parameters
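to make the LP step concrete, here is a sketch using `scipy.optimize.linprog`. it assumes a linear reward $R(s) = \alpha \cdot \phi(s)$, so that each value gap is a dot product $d_i \cdot \alpha$, and uses the standard trick of encoding $p(x) = \min(x, \lambda x)$ with auxiliary variables; all the variable names are mine, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def lp_reward_update(value_gaps, lam=2.0):
    """One LP step of the reward update (illustrative sketch).

    value_gaps -- (m, k) array; row i holds d_i such that
                  V^{pi*}(s_0) - V^{pi_i}(s_0) = d_i . alpha under a linear reward
    lam        -- penalty slope for negative gaps (the paper uses 2)

    Maximizes sum_i p(d_i . alpha) with p(x) = min(x, lam * x) and |alpha_j| <= 1,
    by adding auxiliary variables t_i <= d_i . alpha and t_i <= lam * (d_i . alpha).
    """
    m, k = value_gaps.shape
    # decision vector x = [alpha (k entries), t (m entries)]; linprog minimizes,
    # so we negate the t coefficients to maximize sum_i t_i
    c = np.concatenate([np.zeros(k), -np.ones(m)])
    # inequality rows: t_i - d_i.alpha <= 0  and  t_i - lam * d_i.alpha <= 0
    A_ub = np.zeros((2 * m, k + m))
    A_ub[:m, :k] = -value_gaps
    A_ub[m:, :k] = -lam * value_gaps
    A_ub[:m, k:] = np.eye(m)
    A_ub[m:, k:] = np.eye(m)
    b_ub = np.zeros(2 * m)
    bounds = [(-1, 1)] * k + [(None, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]   # the new reward parameters alpha
```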
- the assumption here is that the expert is exponentially more likely to take trajectories that generate higher reward
- for a parameterized reward function $R_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$, trajectories are distributed as
$$ p(\tau) = \frac{1}{Z} e^{R_\psi(\tau)} \quad \text{with} \quad Z = \sum_\tau e^{R_\psi(\tau)} $$
- so we want to parametrize the reward function in the way that maximizes the log-likelihood of the $M = |D|$ demonstrations
$$ L(\psi) = \sum_{\tau \in D} \log p_\psi(\tau) $$ $$ = \sum_{\tau \in D} \left( \log \frac{1}{Z} + R_\psi(\tau) \right) $$ $$ = \sum_{\tau \in D} R_\psi(\tau) - M \log Z $$ $$ = \sum_{\tau \in D} R_\psi(\tau) - M \log \sum_\tau e^{R_\psi(\tau)} $$
- so the gradient wrt the parameters is
$$ \nabla_\psi L(\psi) = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \frac{1}{\sum_\tau e^{R_\psi(\tau)}} \sum_\tau e^{R_\psi(\tau)} \frac{dR_\psi(\tau)}{d\psi} $$
$$ = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \sum_\tau \frac{e^{R_\psi(\tau)}}{\sum_{\tau'} e^{R_\psi(\tau')}} \frac{dR_\psi(\tau)}{d\psi} $$ $$ = \sum_{\tau \in D} \frac{dR_\psi(\tau)}{d\psi} - M \sum_\tau p(\tau \mid \psi) \frac{dR_\psi(\tau)}{d\psi} $$
the expectation over trajectories $p(\tau \mid \psi)$ can be rewritten as an expectation over states using $p(s \mid \psi)$, otherwise known as the state-visitation frequency. so, finally,
$$ \nabla_\psi L(\psi) = \frac{1}{|D|} \sum_{\tau_d \in D} \frac{dr_\psi(\tau_d)}{d\psi} - \sum_s p(s \mid \psi) \frac{dr_\psi(s)}{d\psi} $$
- How do we calculate the state-visitation frequency in an efficient way?
TODO
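one standard answer (e.g. in ziebart-style maxent irl) is a dynamic-programming forward pass that propagates the state distribution one step at a time instead of enumerating trajectories. here is a tabular sketch assuming known dynamics and a given stochastic policy (in maxent irl that policy would be the soft-optimal policy under the current reward estimate); the names are mine:

```python
import numpy as np

def state_visitation_frequency(P, policy, p0, T):
    """Expected state-visitation frequencies rho[s] = sum_t Pr(s_t = s).

    P      -- (S, A, S) array, P[s, a, s'] = transition probability
    policy -- (S, A) array, policy[s, a] = probability of action a in state s
    p0     -- (S,) initial state distribution
    T      -- horizon
    """
    S, A, _ = P.shape
    d = np.zeros((T, S))
    d[0] = p0
    for t in range(1, T):
        # d[t, s'] = sum_{s, a} d[t-1, s] * policy[s, a] * P[s, a, s']
        d[t] = np.einsum("s,sa,sax->x", d[t - 1], policy, P)
    return d.sum(axis=0)
```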
- So the final algorithm looks like:
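a minimal sketch of that loop, assuming a linear reward $r_\psi(s) = \psi \cdot f(s)$ over a tabular state space. `planner` (e.g. soft value iteration under the current reward) and `svf_fn` (e.g. the forward pass sketched above) are passed in as callables; none of these names come from a specific implementation.

```python
import numpy as np

def maxent_irl(features, demos, planner, svf_fn, lr=0.1, n_iters=200):
    """MaxEnt IRL with a linear reward r_psi(s) = psi . f(s) (illustrative sketch).

    features -- (S, k) array of per-state feature vectors f(s)
    demos    -- list of demonstrations, each a list of state indices
    planner  -- planner(reward) -> stochastic policy under the current reward
    svf_fn   -- svf_fn(policy) -> expected state-visitation frequencies, shape (S,)
    """
    S, k = features.shape
    psi = np.random.uniform(size=k)

    # empirical expected feature counts of the demonstrations
    demo_counts = np.zeros(k)
    for traj in demos:
        for s in traj:
            demo_counts += features[s]
    demo_counts /= len(demos)

    for _ in range(n_iters):
        reward = features @ psi                 # r_psi(s) for every state
        policy = planner(reward)                # soft-optimal policy under r_psi
        rho = svf_fn(policy)                    # state-visitation frequencies
        grad = demo_counts - features.T @ rho   # gradient of the log-likelihood
        psi += lr * grad                        # gradient ascent on L(psi)
    return features @ psi                       # learned per-state reward
```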
- claims to learn nearly optimal rewards, along with policy
- GAIL recovers the policy only, not transferable rewards
- Reward Learning is difficult because
    - many optimal policies can explain the same demonstrations
    - many reward functions can induce the same optimal policy
- Empowerment is a mutual-information-based potential function, like a value function, which intuitively quantifies, for a state, the extent to which an agent can influence its environment.
- Empowerment acts as a regularizer in policy update
- Empowerment is the maximum of the mutual information between a sequence of $K$ actions $a$ and the final state $s'$ reached after execution of those actions, conditioned on the current state $s$:
$$ \Phi(s) = \max I(a, s' \mid s) = \max \mathbb{E}_{p(s' \mid a, s)\, w(a \mid s)} \left[ \log \left( \frac{p(a, s' \mid s)}{w(a \mid s)\, p(s' \mid s)} \right) \right] $$
- after some mathematical gymnastics, they approximate Empowerment and end up optimizing it with the loss function
$$ l_I(s, a, s') = \left| \beta \log q_\phi(a \mid s', s) - \big( \log \pi_\theta(a \mid s) + \Phi_\varphi(s) \big) \right| $$
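written as code, the regularizer is a direct transcription of that formula (the names are mine):

```python
import numpy as np

def empowerment_regularizer(log_q, log_pi, phi, beta=1.0):
    """l_I(s, a, s') from the formula above.

    log_q  -- log q_phi(a | s', s) from the inverse model
    log_pi -- log pi_theta(a | s) from the policy
    phi    -- empowerment estimate Phi_varphi(s)
    """
    return np.abs(beta * log_q - (log_pi + phi))
```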
- the algorithm has four models, and they are trained simultaneously:
- inverse model (a maximum log-likelihood supervised learning problem) that, given a set of trajectories, minimizes the mean-squared error between its predicted action $q_\phi(a \mid s', s)$ and the action $a$ taken in the generated trajectory
$$ l_q(s, a, s') = (q_\phi(\cdot \mid s, s') - a)^2 $$
- empowerment optimization, as noted before
- reward function
first compute the discriminator as
$$ D_{\zeta, \varphi}(s, a, s') = \frac{\exp[r_\zeta(s, a) + \gamma \Phi_{\varphi'}(s') - \Phi_\varphi(s)]}{\exp[r_\zeta(s, a) + \gamma \Phi_{\varphi'}(s') - \Phi_\varphi(s)] + \pi_\theta(a \mid s)} $$
then train the parameters $\zeta$ by binary logistic regression to discriminate between expert and generated trajectories via
$$ \mathbb{E}_{\tau_E} [\log D_{\zeta, \varphi}(s, a, s')] + \mathbb{E}_{\tau} [\log (1 - D_{\zeta, \varphi}(s, a, s'))] $$
- policy optimization
train the policy $\pi_\theta(a \mid s)$ to maximize the discriminative reward $\hat{r}(s, a, s') = \log D(s, a, s') - \log (1 - D(s, a, s'))$ and to minimize the loss function $l_I(s, a, s') = \mid \beta \log q_\phi(a \mid s, s') - (\log \pi_\theta(a \mid s) + \Phi_\varphi(s)) \mid$, which accounts for the empowerment regularization. the overall training objective becomes
$$ \mathbb{E}_{\pi} [\log \pi_\theta(a \mid s)\, \hat{r}(s, a, s')] + \lambda_I\, \mathbb{E}_{\tau} [l_I(s, a, s')] $$
- So, the final algorithm looks something like the rough sketch below.
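this is only a structural skeleton, in my own words; the model objects, their `update_*` methods, and the sampling helper are hypothetical stand-ins, not the paper's implementation.

```python
def eairl_training_loop(env, policy, inverse_model, empowerment, reward,
                        expert_trajs, n_iters=1000, lambda_i=1.0):
    """Skeleton of an empowerment-regularized adversarial IRL loop (illustrative)."""
    for _ in range(n_iters):
        # 1. roll out the current policy to get generated trajectories
        gen_trajs = policy.sample_trajectories(env)

        # 2. inverse model: supervised regression of actions from (s, s') pairs (l_q)
        inverse_model.update(gen_trajs)

        # 3. empowerment: fit Phi_varphi(s) using the regularizer loss l_I
        empowerment.update(gen_trajs, policy, inverse_model)

        # 4. reward / discriminator: logistic regression, expert vs generated
        reward.update_discriminator(expert_trajs, gen_trajs, empowerment, policy)

        # 5. policy: maximize r_hat = log D - log(1 - D), regularized by lambda_I * l_I
        policy.update(gen_trajs, reward, empowerment, lambda_i)

    return reward, policy
```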
so all algorithms differ in two ways -
- how to measure the difference between current estimation and expert demonstrations
- how to update the reward function estimate
if that is the case, we can probably frame irl completely as a supervised learning problem. then again, maybe we cannot, as giving the reward function in the training set is giving away the answer.
is it possible to have an end-to-end approach to IRL? we input a bunch of environments and expert trajectories and get reward functions as outputs? would current deep learning techniques be able to tackle this? my hunch is that this will not generalize well. but how do I prove that?
setup
- state space $\mathcal{S} \subseteq \mathbb{R}^n$
- action space $\mathcal{A} \subseteq \mathbb{R}^m$
- a feature function $f: \mathcal{S} \to \mathbb{R}^k : s \mapsto f(s)$
- unknown reward function $\mathcal{R}: \mathcal{S} \to \mathbb{R}$
assumptions
- reward function is some non-linear function of the features: $\mathcal{R}(s) = φ(f(s))$
have
- trajectories
$$ζ_1, ζ_2, …, ζ_l$$ $$ζ_i = (s_1, a_1), (s_2, a_2), …$$
want to arrive at a formulation in which trajectories with more reward are exponentially more likely to be taken by the expert
maximum entropy principle: choose the probability distribution that has maximum information entropy amongst all distributions that satisfy the data
derivation
expert feature count from a trajectory $$ F(\zeta) = \sum_{s \in \zeta} f(s) \hspace{2cm} \in \mathbb{R}^k $$
expert feature count from all trajectories $$ F^* = \frac{1}{l} \sum_{i=1}^{l} F(\zeta_i) $$
want a reward function that explains these feature counts (i.e. all the trajectories), assuming optimal rationality of the expert
total reward obtained in a path $$ G(\zeta) = \sum_{s \in \zeta} \mathcal{R}(s) $$
matching features
the probability distribution over all possible paths should be such that the expected feature count matches the expert feature count: $$ \sum_{\text{all possible } \zeta} P(\zeta) F(\zeta) = F^* $$
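to connect this back to the 'exponentially more likely' goal above: maximizing entropy subject to this feature-matching constraint (plus normalization) is a standard constrained optimization, and solving it with lagrange multipliers $\theta$ on the feature constraint (my notation) gives the exponential form directly:
$$ \max_P \; -\sum_\zeta P(\zeta) \log P(\zeta) \quad \text{s.t.} \quad \sum_\zeta P(\zeta) F(\zeta) = F^*, \quad \sum_\zeta P(\zeta) = 1 $$
setting the derivative of the lagrangian with respect to each $P(\zeta)$ to zero gives
$$ P(\zeta) = \frac{1}{Z(\theta)} e^{\theta^\top F(\zeta)}, \qquad Z(\theta) = \sum_\zeta e^{\theta^\top F(\zeta)} $$
so with a linear reward $\mathcal{R}(s) = \theta^\top f(s)$ the exponent is exactly the total reward $G(\zeta)$, i.e. higher-reward paths are exponentially more likely, which is the form used in the maxent section above.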
challenges
- accurate inference (underspecified problem)
- generalizability
- correctness of prior knowledge (if you are using feature functions then they must be accurate and your method of IRL should be less sensitive to the accuracy of the prior knowledge)
- solution complexity (grows with state-action space size)