Suppose we have a generative MDP with discrete states and discrete actions, but the state and action spaces are hard to enumerate, and we still want to solve it with a traditional tabular RL algorithm.
So I implemented a `DictPolicy` that stores state-action values in a dictionary. (Of course, users need to define `Base.isequal()` and `Base.hash()` for their state and action types.)
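For example, for a mutable custom state type the extra methods could look roughly like this (`GridState` is just a hypothetical placeholder, not part of the proposed code):

```julia
# Hypothetical user-defined state type, only to illustrate the requirement.
mutable struct GridState
    x::Int
    y::Int
end

# hash and isequal so that (state, action) tuples work as Dict keys
Base.hash(s::GridState, h::UInt) = hash(s.y, hash(s.x, h))
Base.isequal(a::GridState, b::GridState) = a.x == b.x && a.y == b.y
```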
DictPolicy.jl :
```julia
# Assumes the POMDPs.jl interface (Policy, actions, ...) and showpolicy from
# POMDPTools (formerly POMDPPolicies) are in scope, e.g. inside a package module.

struct DictPolicy{P<:Union{POMDP,MDP}, T<:AbstractDict{Tuple,Float64}} <: Policy
    mdp::P
    value_dict::T
end

# Returns the action that the policy deems best for the current state
function action(p::DictPolicy, s)
    available_actions = actions(p.mdp, s)
    max_action = nothing
    max_action_value = 0
    for a in available_actions
        if haskey(p.value_dict, (s, a))
            action_value = p.value_dict[(s, a)]
            if action_value > max_action_value
                max_action = a
                max_action_value = action_value
            end
        else
            p.value_dict[(s, a)] = 0  # initialize unseen state-action pairs to zero
        end
    end
    if max_action === nothing
        max_action = available_actions[1]  # fall back to the first available action
    end
    return max_action
end

# Returns the values of each action at state s in a dict
function actionvalues(p::DictPolicy, s)::Dict
    available_actions = actions(p.mdp, s)
    action_dict = Dict()
    for a in available_actions
        action_dict[a] = haskey(p.value_dict, (s, a)) ? p.value_dict[(s, a)] : 0
    end
    return action_dict
end

function Base.show(io::IO, mime::MIME"text/plain", p::DictPolicy{M}) where M<:MDP
    summary(io, p)
    println(io, ':')
    ds = get(io, :displaysize, displaysize(io))
    ioc = IOContext(io, :displaysize => (first(ds) - 1, last(ds)))
    showpolicy(ioc, mime, p.mdp, p)
end
```
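A minimal usage sketch for reference (`mymdp` is just a placeholder for whatever generative MDP the user defines):

```julia
# Sketch only: `mymdp` stands for a user-defined generative MDP.
using POMDPs, POMDPTools, Random

rng = MersenneTwister(1)
policy = DictPolicy(mymdp, Dict{Tuple,Float64}())
s = rand(rng, initialstate(mymdp))
a = action(policy, s)          # falls back to the first available action while the dict is empty
qs = actionvalues(policy, s)   # Dict of action => stored value (0 for unseen pairs)
```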
Then we have a Q-learning solver based on key-value storage, so we don't need to enumerate the state and action spaces in the MDP definition. (Most of the code is copied from TabularTDLearning.jl; only the way Q-values are stored and read is changed.)
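For reference, the update in the inner loop is the standard tabular Q-learning rule, except that the max over next-state values is taken only over the $(s', a')$ pairs already stored in the dictionary:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr)$$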
dict_q_learning.jl :
```julia
# Assumes the POMDPs.jl interface plus RolloutSimulator and ExplorationPolicy from
# POMDPTools (or the older POMDPSimulators/POMDPPolicies); @with_kw is from Parameters.jl.

@with_kw mutable struct QLearningSolver{E<:ExplorationPolicy} <: Solver
    n_episodes::Int64 = 100
    max_episode_length::Int64 = 100
    learning_rate::Float64 = 0.001
    exploration_policy::E
    Q_vals::Union{Nothing, Dict{Tuple,Float64}} = nothing
    eval_every::Int64 = 10
    n_eval_traj::Int64 = 20
    rng::AbstractRNG = Random.GLOBAL_RNG
    verbose::Bool = true
end

function solve(solver::QLearningSolver, mdp::MDP)
    rng = solver.rng
    if solver.Q_vals === nothing
        Q = Dict{Tuple,Float64}()
    else
        Q = solver.Q_vals
    end
    exploration_policy = solver.exploration_policy
    sim = RolloutSimulator(rng=rng, max_steps=solver.max_episode_length)
    on_policy = DictPolicy(mdp, Q)
    k = 0
    for i = 1:solver.n_episodes
        s = rand(rng, initialstate(mdp))
        t = 0
        while !isterminal(mdp, s) && t < solver.max_episode_length
            a = action(exploration_policy, on_policy, k, s)
            k += 1
            sp, r = @gen(:sp, :r)(mdp, s, a, rng)
            # max over the stored Q-values whose key matches the next state sp
            max_sp_prediction = 0
            for key in keys(Q)
                if sp == key[1] && max_sp_prediction < Q[key]
                    max_sp_prediction = Q[key]
                end
            end
            current_s_prediction = 0
            haskey(Q, (s, a)) ? (current_s_prediction = Q[(s, a)]) : (Q[(s, a)] = 0)
            Q[(s, a)] += solver.learning_rate * (r + discount(mdp) * max_sp_prediction - current_s_prediction)
            s = sp
            t += 1
        end
        if i % solver.eval_every == 0
            r_tot = 0.0
            for traj in 1:solver.n_eval_traj
                r_tot += simulate(sim, mdp, on_policy, rand(rng, initialstate(mdp)))
            end
            solver.verbose ? println("On Iteration $i, Returns: $(r_tot/solver.n_eval_traj)") : nothing
        end
    end
    return on_policy
end
```
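And a minimal sketch of how I use it (again, `mymdp` is a placeholder; any `ExplorationPolicy` such as `EpsGreedyPolicy` from POMDPTools should work):

```julia
# Sketch only: `mymdp` is a placeholder for a user-defined generative MDP.
using POMDPs, POMDPTools, Random

solver = QLearningSolver(
    exploration_policy = EpsGreedyPolicy(mymdp, 0.1),
    n_episodes = 500,
    learning_rate = 0.1,
    rng = MersenneTwister(1),
)
policy = solve(solver, mymdp)   # returns a DictPolicy backed by the learned Q dictionary
a = action(policy, rand(solver.rng, initialstate(mymdp)))
```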
What do you think? Do you have any advice?
Thank you for taking the time to read my issue.
If you think it's meaningful, I can open a PR and add some tests.
It's also okay if you think it isn't general enough to include; I just wrote it to solve my own MDP.
@NeroBlackstone sorry that we never responded to this! This is actually something that people often want to do. If you're still interested in contributing it, I think we can integrate it in with a few small adjustments. Let me know if you're interested in doing that.