
Notes and Comments on IRL

Intro/ Definitions

the goal of inverse reinforcement learning is to infer the reward function from expert demonstrations or an expert policy. the idea is to show the agent a bunch of demonstrations of how an expert in the real world solves a certain problem, so that it learns what problem the expert is trying to solve. for example, if you show one million examples of a master chess player playing games, you expect your agent to learn that ‘checkmate is very important’, or that checkmate has high reward.

a generic irl algorithm looks like this:

  • initialize the reward function estimate
  • until converged
    • solve mdp using that estimated reward function
    • update the estimate using some method (this varies from algorithm to algorithm)

for the update step, various algorithms over the years have proposed different solutions, ranging from basic linear programming to generative adversarial networks.
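a minimal sketch of this loop in python (solve_mdp and update_estimate are placeholder names I am assuming here; each concrete algorithm below fills them in differently):

  import numpy as np

  def generic_irl(expert_demos, n_params, solve_mdp, update_estimate,
                  max_iters=100, tol=1e-4):
      # initialize the reward function estimate (here: a parameter vector)
      theta = np.zeros(n_params)
      for _ in range(max_iters):
          # solve the mdp under the current estimated reward
          policy = solve_mdp(theta)
          # update the estimate; this is the step that varies between algorithms
          new_theta = update_estimate(theta, policy, expert_demos)
          if np.linalg.norm(new_theta - theta) < tol:  # converged
              break
          theta = new_theta
      return theta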

various papers have used multiple formulations of the problem, which we will list out here.

the basic setup for the irl problem comes from reinforcement learning itself.

in the most general setting, we consider the following.

a Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, γ)$.

$\mathcal{S}$ is the discrete or continuous state-space.

$\mathcal{A}$ is the discrete or continuous action-space, available for each state.

$\mathcal{R} : \mathcal{S} × \mathcal{A} × \mathcal{S} → [0, ∞)$ is the reward function. $\mathcal{R}(s, a, s')$ determines the numerical reward obtained when taking a particular action $a$ in a particular state $s$ which leads the agent to the next state $s'$.

$\mathcal{P}: \mathcal{S} × \mathcal{A} × \mathcal{S} → [0,1]$ is the transition probability. if the states and actions are discrete, this can be represented as a tensor. $\mathcal{P}(s, a, s')$ determines the probability of reaching state $s'$ when taking action $a$ in state $s$.

$γ ∈ [0, 1]$ is the discount factor. it determines the value of future rewards at the current timestep.

the total discounted reward is $g_t = ∑_{k=0}^{T} γ^k r_{t+k}$, where $T$ is the (possibly infinite) episode horizon.

a policy is a function $π : \mathcal{S} × \mathcal{A} → [0, 1]$ which determines the agent's probability of taking action $a$ in state $s$.

the goal of reinforcement learning is to find the optimal policy (so that over an episode $g_t$ is maximized).

the goal of inverse reinforcement learning, on the other hand, is to find the reward function given the optimal policy (there is some work on how to tackle sub-optimal policies, but we will come to that later).
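for concreteness, a tiny tabular version of the objects defined above (the sizes and the random reward are just placeholders):

  import numpy as np

  n_states, n_actions, gamma = 4, 2, 0.9

  # transition tensor P(s, a, s'); each P[s, a] is a distribution over next states
  P = np.full((n_states, n_actions, n_states), 1.0 / n_states)
  # reward tensor R(s, a, s')
  R = np.random.rand(n_states, n_actions, n_states)
  # a policy pi(a | s), here uniform over actions
  pi = np.full((n_states, n_actions), 1.0 / n_actions)

  # total discounted reward g_t of a sampled episode
  def discounted_return(rewards, gamma):
      return sum(gamma**k * r for k, r in enumerate(rewards))

  print(discounted_return([1.0, 0.0, 2.0], gamma))  # 1.0 + 0.9*0.0 + 0.81*2.0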

Theorems

First Theorem from Ng & Russell
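from memory, so worth double-checking against the paper: for a finite mdp with transition matrices $\{\mathbf{P}_a\}$ and the expert policy taken to be $π(s) ≡ a_1$, the policy is optimal if and only if, for every other action $a$,

$$(\mathbf{P}_{a_1} - \mathbf{P}_a)(\mathbf{I} - γ \mathbf{P}_{a_1})^{-1} \mathbf{R} \succeq 0$$

componentwise, where $\mathbf{R}$ is the reward written as a vector over states. the set of rewards consistent with a given optimal policy is therefore a polytope that always contains $\mathbf{R} = 0$, which is why the linear programming formulations below need extra objectives/penalties to pick a non-degenerate reward.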

Algorithms

Early Methods

Two methods of Feature Matching

  • start with some estimate of reward parameters
  • Calculate feature expectations
  • Calculate Inverse Learning Error
  • Update the parameters
    • using gradient descent
    • using SVM
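a sketch of the feature-expectation step in python, under the usual linear-reward assumption $r(s) = w^⊤ f(s)$ (the trajectory format and feature_fn are assumptions on my part):

  import numpy as np

  def feature_expectations(trajectories, feature_fn, gamma=0.9):
      # empirical discounted feature expectations: (1/m) sum_i sum_t gamma^t f(s_t)
      mu = None
      for traj in trajectories:  # traj: a list of (state, action) pairs
          phi = sum(gamma**t * feature_fn(s) for t, (s, _) in enumerate(traj))
          mu = phi if mu is None else mu + phi
      return mu / len(trajectories)

  # the quantity the update step tries to shrink: the gap between expert and
  # current-policy feature expectations (closed via gradient steps or an SVM/QP)
  def feature_gap(mu_expert, mu_policy):
      return np.linalg.norm(mu_expert - mu_policy)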

Linear Programming

linear programming irl from trajectories

http://ai.stanford.edu/~ang/papers/icml00-irl.pdf

this was the first IRL algorithm that works from trajectories: it does not explicitly require access to the expert policy and can work with sampled trajectories only. the way I understand this algorithm is something like:

  • start with a random reward function and random policy
  • write the reward function as a dot product of parameters and basis functions, and formulate the value function of a trajectory in terms of these parameters. our goal is to estimate these parameters
  • until some stopping criterion is met:
    • generate trajectories based on the current policy
    • estimate the value of each trajectory under the current reward function estimate, as a linear expression in the parameters
    • solve the following linear program for the new reward parameters: maximize $$∑_{i=1}^{k} p\big(v^{π^*}(s_0) - v^{π_i}(s_0)\big)$$ such that $$|α_i| \leq 1$$ where $p$ is the function $p(x) = x$ if $x \geq 0$ and $p(x) = λx$ otherwise; $λ$ is a hyperparameter, which Russell and Ng chose to be 2.
    • get a new policy that is optimal wrt this new reward function and add it to the list of policies
  • return the parameters
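a sketch of the inner linear program with scipy, assuming the value-difference coefficients have already been estimated from sampled trajectories; the auxiliary variables $t_i$ are the standard trick for encoding the piecewise-linear $p$:

  import numpy as np
  from scipy.optimize import linprog

  def lp_step(d, lam=2.0):
      # maximize sum_i p(d_i . alpha) s.t. |alpha_j| <= 1,
      # with p(x) = x if x >= 0 else lam * x, i.e. p(x) = min(x, lam * x).
      # d: (k, n) array, one row of value-difference coefficients per candidate policy.
      k, n = d.shape
      # variables x = [alpha (n), t (k)]; maximizing sum_i t_i subject to
      # t_i <= d_i . alpha and t_i <= lam * d_i . alpha recovers p exactly.
      c = np.concatenate([np.zeros(n), -np.ones(k)])   # linprog minimizes, so negate
      A_ub = np.block([[-d, np.eye(k)],                # t_i - d_i . alpha       <= 0
                       [-lam * d, np.eye(k)]])         # t_i - lam * d_i . alpha <= 0
      b_ub = np.zeros(2 * k)
      bounds = [(-1, 1)] * n + [(None, None)] * k      # |alpha_j| <= 1, t free
      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
      return res.x[:n]                                 # the new reward parameters alpha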

Apprenticeship Learning

Maximum Margin Planning

Maximum Entropy Based Methods

Maximum Entropy IRL

  • the assumption here is that the expert is exponentially more likely to take trajectories that yield higher reward
  • for a parameterized reward function $R_ψ(τ) = ∑_t r_ψ(s_t, a_t)$

    $$p(τ) = \frac{1}{Z} e^{R_ψ(τ)}$$ with $Z = ∑_τ e^{R_ψ(τ)}$

  • so we want to parametrize the reward function in the way that maximizes the log-likelihood of the $M = |D|$ demonstrations

    $$L(ψ) = ∑_{τ ∈ D} \log p_ψ(τ)$$ $$= ∑_{τ ∈ D} \left(\log \frac{1}{Z} + R_ψ(τ)\right)$$ $$= ∑_{τ ∈ D} R_ψ(τ) - M \log Z$$ $$= ∑_{τ ∈ D} R_ψ(τ) - M \log ∑_τ e^{R_ψ(τ)}$$

  • so the gradient wrt the parameters is

    $$∇_ψ L(ψ) = ∑_{τ ∈ D} \frac{dR_ψ(τ)}{dψ} - M \frac{1}{∑_τ e^{R_ψ(τ)}} ∑_τ \left( e^{R_ψ(τ)} \frac{dR_ψ(τ)}{dψ} \right)$$

    $$= ∑_{τ ∈ D} \frac{dR_ψ(τ)}{dψ} - M ∑_τ \frac{e^{R_ψ(τ)}}{∑_{τ'} e^{R_ψ(τ')}} \frac{dR_ψ(τ)}{dψ}$$ $$= ∑_{τ ∈ D} \frac{dR_ψ(τ)}{dψ} - M ∑_τ p(τ \mid ψ) \frac{dR_ψ(τ)}{dψ}$$ since the reward decomposes over states, the expectation over $p(τ \mid ψ)$ can be rewritten in terms of $p(s \mid ψ)$, otherwise known as the state-visitation frequency. so, finally, $$\frac{1}{\mid D \mid} ∇_ψ L(ψ) = \frac{1}{\mid D \mid} ∑_{τ_d ∈ D} \frac{dR_ψ(τ_d)}{dψ} - ∑_s p(s \mid ψ) \frac{dr_ψ(s)}{dψ}$$

  • How do we calculate the state-visitation frequency in an efficient way?

    TODO — roughly, Ziebart et al. do it with dynamic programming: a backward pass computes the soft-optimal policy under the current reward, and a forward pass propagates the start-state distribution through that policy and the transition model, summing the per-timestep state distributions (see the sketch below).

  • So the final algorithm looks like: (figure: MEIRL.png; a code sketch follows below)
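a tabular numpy sketch of the whole procedure for a linear reward $r = Fθ$ (the horizon $T$, the learning rate, and the array shapes are my assumptions; the forward pass is one way to compute the state-visitation frequencies asked about above):

  import numpy as np
  from scipy.special import logsumexp

  def maxent_irl(P, F, expert_svf, p0, gamma=0.99, T=50, lr=0.1, iters=200):
      # P: (S, A, S) transition tensor, F: (S, k) state features,
      # expert_svf: (S,) average state-visitation counts from the demonstrations,
      # p0: (S,) initial-state distribution.
      S, A, _ = P.shape
      theta = np.zeros(F.shape[1])
      for _ in range(iters):
          r = F @ theta                               # current reward estimate r(s)
          # backward pass: soft value iteration gives the maxent stochastic policy
          V = np.zeros(S)
          for _ in range(T):
              Q = r[:, None] + gamma * (P @ V)        # (S, A)
              V = logsumexp(Q, axis=1)                # soft maximum over actions
          pi = np.exp(Q - V[:, None])                 # pi(a | s)
          # forward pass: expected state-visitation frequencies under pi
          d, svf = p0.copy(), np.zeros(S)
          for _ in range(T):
              svf += d
              d = np.einsum('i,ij,ijk->k', d, pi, P)  # propagate one step forward
          # gradient = expert feature counts - expected feature counts, then ascend
          theta += lr * (F.T @ expert_svf - F.T @ svf)
      return theta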

Maximum Causal Entropy IRL


Deep MEIRL

Guided Cost Learning (GCL/ GAN-GCL)

GAIL

AIRL

Empowerment Based Adversarial Inverse Reinforcement Learning

  • claims to learn nearly optimal rewards, along with the policy
  • GAIL recovers the policy only, not transferable rewards
  • Reward Learning is difficult because
    • many optimal policies explain the same demonstration
    • many reward functions induce the same optimal policy
  • Empowerment is a mutual-information-based potential function, like value functions, which intuitively quantifies, for a state, the extent to which an agent can influence its environment.
  • Empowerment acts as a regularizer in the policy update
  • Empowerment is the maximum of the mutual information between a sequence of $K$ actions $a$ and the final state $s'$ reached after executing those actions, conditioned on the current state $s$.

    $$Φ(s) = \max_w I(a, s' \mid s) = \max_w \mathbb{E}_{p(s'\mid a,s)\, w(a\mid s)} \left[ \log \frac{p(a, s' \mid s)}{w(a \mid s)\, p(s' \mid s)} \right]$$

  • after some mathematical gymnastics, they approximate empowerment with a variational lower bound (involving the inverse model $q_φ$) and optimize it using the loss function

    $$l_I(s,a,s') = \mid β \log q_φ(a \mid s', s) - (\log π_θ(a \mid s) + Φ_\varphi(s)) \mid$$

  • the algorithm has four models, which are trained simultaneously.
    • inverse model (a maximum-likelihood supervised learning problem) that, given a set of trajectories, minimizes the mean-squared error between its predicted action $q(a \mid s', s)$ and the action $a$ actually taken in the generated trajectory: $$l_q(s,a,s') = (q_φ(\cdot \mid s, s') - a)^2$$
    • empowerment optimization as noted before
    • reward function

      first compute the discriminator as $$D_{ζ, \varphi}(s,a,s') = \frac{\exp[r_ζ(s,a) + γ Φ_{\varphi'}(s') - Φ_\varphi(s)]}{\exp[r_ζ(s,a) + γ Φ_{\varphi'}(s') - Φ_\varphi(s)] + π_θ(a \mid s)}$$ finally, train the parameters $ζ$ by binary logistic regression to discriminate between expert and generated trajectories via

      $$\mathbb{E}_{τ_E}[\log D_{ζ, \varphi}(s,a,s')] + \mathbb{E}_{τ}[\log(1 - D_{ζ, \varphi}(s,a,s'))]$$

    • policy optimization

      train the policy $π_θ(a \mid s)$ to maximize the discriminative reward $\hat{r}(s,a,s') = \log D(s,a,s') - \log(1 - D(s,a,s'))$ and to minimize the loss function $l_I(s,a,s') = \mid β \log q_φ(a \mid s, s') - (\log π_θ(a \mid s) + Φ_\varphi(s)) \mid$, which accounts for the empowerment regularization. the overall training objective becomes $$\mathbb{E}_π[\log π_θ(a \mid s)\, \hat{r}(s,a,s')] + λ_I \mathbb{E}_τ[l_I(s,a,s')]$$

  • So, the final algorithm looks something like: alternately update the four models above on batches of expert and generated trajectories (a small sketch of the discriminator-induced reward follows below).
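a small numpy sketch of the discriminator and the induced reward used in the policy step (the arguments stand in for outputs of the learned networks $r_ζ$, $Φ_\varphi$ and $π_θ$, which I'm assuming are computed elsewhere):

  import numpy as np

  def discriminator(r, phi_s, phi_next, log_pi, gamma=0.99):
      # D = exp(f) / (exp(f) + pi(a|s)), with f = r(s,a) + gamma*Phi(s') - Phi(s)
      f = r + gamma * phi_next - phi_s
      return np.exp(f) / (np.exp(f) + np.exp(log_pi))

  def discriminative_reward(D, eps=1e-8):
      # r_hat = log D - log(1 - D), the reward the policy is trained to maximize
      return np.log(D + eps) - np.log(1.0 - D + eps)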

Bayesian

BIRL

MAP BIRL

Hierarchical BIRL

Thoughts and Ideas

so all algorithms differ in two ways -

  • how to measure the difference between current estimation and expert demonstrations
  • how to update the reward function estimate

if that is the case, we can probably frame irl completely as a supervised learning problem. maybe we cannot, as giving the reward function in the training set is giving away the answer.

is it possible to have an end-to-end approach to IRL? we input a bunch of environments and expert trajectories and get reward function as outputs? would current deep learning techniques be able to tackle this? my hunch is that this will not generalize well. but how do I prove that?

Marry Bayesian Uncertainty with SOTA EAIRL

Proving Non-linear Version of MEIRL

setup

  • state space $\mathcal{S} ⊆ \mathbb{R}^n$
  • action space $\mathcal{A} ⊆ \mathbb{R}^m$
  • a feature function $f: \mathcal{S} → \mathbb{R}^k : s \mapsto f(s)$
  • unknown reward function $\mathcal{S} → \mathbb{R}$

assumptions

  • reward function is some non-linear function of features $\mathcal{R}(s) = φ(f(s))$

have

  • trajectories $$ζ_1, ζ_2, …, ζ_l$$ $$ζ_i = (s_1, a_1), (s_2, a_2), …$$

want to arrive at a formulation in which trajectories with higher reward are exponentially more likely to be taken by the expert

maximum entropy principle: choose the probability distribution that has maximum information entropy amongst all distributions that are consistent with the data

derivation

expert feature count from a trajectory $$F(ζ) = ∑_{s ∈ ζ} f(s) \hspace{2cm} ∈ \mathbb{R}^k$$

expert feature count from all trajectories $$F^* = \frac{1}{l} ∑_{i=1}^{l} F(ζ_i)$$

want a reward function that explains these feature counts (i.e. all the trajectories), assuming optimal-rationality of the expert

total reward obtained in a path $$G(ζ) = ∑_{s ∈ ζ} \mathcal{R}(s)$$

matching features

the probability distribution over all possible paths should be such that $$∑_{\text{all possible } ζ} P(ζ) F(ζ) = F^*$$

if $F^*_j$ is the j-th component of the vector $F^*$, then via principle of maximum entropy we arrive at

$$P(ζ) = \frac{1}{Z} e^{∑_j λ_j F_j(ζ)}$$

$$Z = ∑_{\text{all possible } ζ} e^{∑_j λ_j F_j(ζ)}$$
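a toy numpy check of this distribution over a small, explicitly enumerated set of paths (feature_fn, the path format, and λ are placeholders I'm assuming):

  import numpy as np

  def path_feature_count(path, feature_fn):
      # F(zeta) = sum of f(s) over the states visited in the path
      return sum(feature_fn(s) for s, _ in path)

  def maxent_path_distribution(paths, feature_fn, lam):
      # P(zeta) proportional to exp(sum_j lam_j F_j(zeta)), normalized over the
      # given (enumerated) set of paths
      scores = np.array([lam @ path_feature_count(p, feature_fn) for p in paths])
      scores -= scores.max()               # subtract the max for numerical stability
      weights = np.exp(scores)
      return weights / weights.sum()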

The Challenges in IRL

  • accurate inference (underspecified problem)
  • generalizability
  • correctness of prior knowledge (if you are using feature functions then they must be accurate and your method of IRL should be less sensitive to the accuracy of the prior knowledge)
  • solution complexity (grows with state-action space size)