Implement Deep Q-Network #617

Merged
merged 36 commits on Aug 10, 2020
Changes from 2 commits
Commits (36)
5dc50eb
Copy code from Colab
seungjaeryanlee Jun 29, 2020
9f049af
Merge branch 'master' into dqn
seungjaeryanlee Jun 29, 2020
6268d58
Use .scalarized() to convert TF scalar to Swift
seungjaeryanlee Jun 30, 2020
51d1fad
Improve code clarity
seungjaeryanlee Jun 30, 2020
84df320
Save isDone as Tensor<Bool>
seungjaeryanlee Jun 30, 2020
2b99489
Save and use isDone for target calculation
seungjaeryanlee Jun 30, 2020
18e6294
Add commented parallelized training implementation
seungjaeryanlee Jun 30, 2020
36c1ddf
Save learning curve plot
seungjaeryanlee Jun 30, 2020
da57062
Use parallelized training with custom gatherNd
seungjaeryanlee Jul 1, 2020
2ec956c
Add minBufferSize parameter
seungjaeryanlee Jul 1, 2020
dab2a3f
Remove comments and refactor code
seungjaeryanlee Jul 1, 2020
0bc60ca
Fix bug where state was updated
seungjaeryanlee Jul 1, 2020
01074d9
Simplify code
seungjaeryanlee Jul 1, 2020
eca8a92
Save TD loss curve
seungjaeryanlee Jul 1, 2020
ae087dd
Purge uses of _Raw operations
seungjaeryanlee Jul 2, 2020
4acd6ce
Use Huber loss instead of MSE
seungjaeryanlee Jul 2, 2020
22aaf75
Simplify Tensor initialization
seungjaeryanlee Jul 2, 2020
24392f3
Set device explicitly on Tensor creation
seungjaeryanlee Jul 2, 2020
441ab35
Merge branch 'master' into dqn
seungjaeryanlee Aug 3, 2020
ccfa087
Add minBufferSize to Agent argument
seungjaeryanlee Aug 3, 2020
65de04e
Use soft target updates
seungjaeryanlee Aug 3, 2020
bcbb7e2
Fix bug where isDone was used wrong
seungjaeryanlee Aug 3, 2020
a203226
Fix bug where target net is initialized with soft update
seungjaeryanlee Aug 3, 2020
e757c0f
Follow hyperparameters in swift-rl
seungjaeryanlee Aug 3, 2020
d2be5bd
Run evaluation episode for every training episode
seungjaeryanlee Aug 3, 2020
6a118ab
Implement combined experience replay
seungjaeryanlee Aug 4, 2020
ce539e5
Implement double DQN
seungjaeryanlee Aug 4, 2020
cf7b96a
Add options to toggle CER and DDQN
seungjaeryanlee Aug 4, 2020
98b4647
Refactor code
seungjaeryanlee Aug 4, 2020
e00901a
Add updateTargetQNet to Agent class
seungjaeryanlee Aug 4, 2020
bca2614
Use TF-Agents hyperparameters
seungjaeryanlee Aug 4, 2020
45b880e
Changed ReplayBuffer to play better with GPU eager mode, restructured…
BradLarson Aug 5, 2020
356c989
Fix ReplayBuffer pass-by-value bug
seungjaeryanlee Aug 6, 2020
d774fad
Use epsilon decay for more consistent performance
seungjaeryanlee Aug 6, 2020
a10f201
Add documentation and improve names
seungjaeryanlee Aug 7, 2020
4aa9296
Document Agent and ReplayBuffer parameters
seungjaeryanlee Aug 7, 2020
278 changes: 278 additions & 0 deletions Gym/DQN/main.swift
@@ -0,0 +1,278 @@
// Copyright 2020 The TensorFlow Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#if canImport(PythonKit)
import PythonKit
#else
import Python
#endif
import TensorFlow

// Force unwrapping with `!` does not provide source location when unwrapping `nil`, so we instead
// make a utility function for debuggability.
fileprivate extension Optional {
func unwrapped(file: StaticString = #filePath, line: UInt = #line) -> Wrapped {
guard let unwrapped = self else {
fatalError("Value is nil", file: (file), line: line)
}
return unwrapped
}
}
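// Example (illustrative): `Float(np.random.uniform()).unwrapped()` reports this file and
// line if the conversion fails, whereas a bare `!` would not say where the nil came from.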

// Initialize Python. This comment is a hook for internal use, do not remove.

let np = Python.import("numpy")
let gym = Python.import("gym")

typealias State = Tensor<Float>
typealias Action = Tensor<Int32>
typealias Reward = Tensor<Float>

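// A fixed-capacity store of (state, action, reward, nextState) transitions, kept in
// preallocated tensors and overwritten in circular fashion once full.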
class ReplayBuffer {
var states: Tensor<Float>
var actions: Tensor<Int32>
var rewards: Tensor<Float>
var nextStates: Tensor<Float>
let capacity: Int
var count: Int
var index: Int

init(capacity: Int) {
self.capacity = capacity

states = Tensor<Float>(numpy: np.zeros([capacity, 4], dtype: np.float32))!
actions = Tensor<Int32>(numpy: np.zeros([capacity, 1], dtype: np.int32))!
rewards = Tensor<Float>(numpy: np.zeros([capacity, 1], dtype: np.float32))!
nextStates = Tensor<Float>(numpy: np.zeros([capacity, 4], dtype: np.float32))!
count = 0
index = 0
}

func append(state: Tensor<Float>, action: Tensor<Int32>, reward: Tensor<Float>, nextState: Tensor<Float>) {
if count < capacity {
count += 1
}
// Erase oldest SARS if the replay buffer is full
states[index] = state
actions[index] = Tensor<Int32>(numpy: np.expand_dims(action.makeNumpyArray(), axis: 0))!
rewards[index] = Tensor<Float>(numpy: np.expand_dims(reward.makeNumpyArray(), axis: 0))!
nextStates[index] = nextState
index = (index + 1) % capacity
}
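// Worked example of the circular overwrite above (illustrative): with capacity 3, appends
// land at indices 0, 1, 2, and a fourth append wraps back to index 0, replacing the oldest
// transition while `count` stays capped at 3.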

func sample(batchSize: Int) -> (stateBatch: Tensor<Float>, actionBatch: Tensor<Int32>, rewardBatch: Tensor<Float>, nextStateBatch: Tensor<Float>) {
let randomIndices = Tensor<Int32>(numpy: np.random.randint(count, size: batchSize, dtype: np.int32))!

let stateBatch = _Raw.gather(params: states, indices: randomIndices)
let actionBatch = _Raw.gather(params: actions, indices: randomIndices)
let rewardBatch = _Raw.gather(params: rewards, indices: randomIndices)
let nextStateBatch = _Raw.gather(params: nextStates, indices: randomIndices)

return (stateBatch, actionBatch, rewardBatch, nextStateBatch)
}
}

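// A two-layer fully connected network: the observation passes through a ReLU hidden layer
// and maps to one Q-value per action.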
struct Net: Layer {
typealias Input = Tensor<Float>
typealias Output = Tensor<Float>

var l1, l2: Dense<Float>

init(observationSize: Int, hiddenSize: Int, actionCount: Int) {
l1 = Dense<Float>(inputSize: observationSize, outputSize: hiddenSize, activation: relu, weightInitializer: heNormal())
l2 = Dense<Float>(inputSize: hiddenSize, outputSize: actionCount, weightInitializer: heNormal())
}

@differentiable
func callAsFunction(_ input: Input) -> Output {
return input.sequenced(through: l1, l2)
}
}

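// A DQN agent: epsilon-greedy action selection from `qNet`, with TD targets computed from a
// separate target network and training batches drawn from the replay buffer.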
class Agent {
// Q-network
var qNet: Net
// Target Q-network
var targetQNet: Net
// Optimizer
let optimizer: Adam<Net>
// Replay Buffer
let replayBuffer: ReplayBuffer
// Discount Factor
let discount: Float

init(qNet: Net, targetQNet: Net, optimizer: Adam<Net>, replayBuffer: ReplayBuffer, discount: Float) {
self.qNet = qNet
self.targetQNet = targetQNet
self.optimizer = optimizer
self.replayBuffer = replayBuffer
self.discount = discount
}

func getAction(state: Tensor<Float>, epsilon: Float) -> Tensor<Int32> {
if Float(np.random.uniform()).unwrapped() < epsilon {
// print("getAction | state: \(state)")
// print("getAction | epsilon: \(epsilon)")
let npAction = np.random.randint(0, 2, dtype: np.int32)
// print("getAction | npAction: \(npAction)")
return Tensor<Int32>(numpy: np.array(npAction, dtype: np.int32))!
}
else {
// Neural network input needs to be 2D
let tfState = Tensor<Float>(numpy: np.expand_dims(state.makeNumpyArray(), axis: 0))!
let qValues = qNet(tfState)
let leftQValue = Float(qValues[0][0]).unwrapped()
let rightQValue = Float(qValues[0][1]).unwrapped()
return leftQValue < rightQValue ? Tensor<Int32>(numpy: np.array(1, dtype: np.int32))! : Tensor<Int32>(numpy: np.array(0, dtype: np.int32))!
}
}
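// Note (a sketch, not used by this code): for environments with more than two actions, the
// greedy branch could avoid hard-coded indices with something like
//   return qNet(tfState).argmax(squeezingAxis: 1)
// which yields a Tensor<Int32> holding the index of the highest Q-value.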

func train(batchSize: Int) {
// Don't train if replay buffer is too small
if replayBuffer.count >= batchSize {
// print("train | Start training")
let (tfStateBatch, tfActionBatch, tfRewardBatch, tfNextStateBatch) = replayBuffer.sample(batchSize: batchSize)

// TODO: Find equivalent function of tf.gather_nd in S4TF to parallelize Q-value computation (_Raw.gather_nd does not exist)
// Gradients are accumulated since we calculate every element in the batch individually
var totalGrad = qNet.zeroTangentVector
for i in 0..<batchSize {
let 𝛁qNet = gradient(at: qNet) { qNet -> Tensor<Float> in

let stateQValueBatch = qNet(tfStateBatch)
let tfAction: Tensor<Int32> = tfActionBatch[i][0]
let action = Int(tfAction.makeNumpyArray()).unwrapped()
let prediction: Tensor<Float> = stateQValueBatch[i][action]

let nextStateQValueBatch = self.targetQNet(tfNextStateBatch)
let tfReward: Tensor<Float> = tfRewardBatch[i][0]
let leftQValue = Float(nextStateQValueBatch[i][0].makeNumpyArray()).unwrapped()
let rightQValue = Float(nextStateQValueBatch[i][1].makeNumpyArray()).unwrapped()
let maxNextStateQValue = leftQValue > rightQValue ? leftQValue : rightQValue
let target: Tensor<Float> = tfReward + self.discount * maxNextStateQValue

return squaredDifference(prediction, withoutDerivative(at: target))
}
totalGrad += 𝛁qNet
}
optimizer.update(&qNet, along: totalGrad)
}
}
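// A possible vectorized alternative to the per-sample loop above (a sketch only, assuming
// one-hot masking is acceptable in place of the missing gather_nd; not part of this change):
//
//   let 𝛁qNet = gradient(at: qNet) { qNet -> Tensor<Float> in
//     let qValueBatch = qNet(tfStateBatch)                              // [batchSize, 2]
//     let actionMask = Tensor<Float>(oneHotAtIndices: tfActionBatch.flattened(), depth: 2)
//     let prediction = (qValueBatch * actionMask).sum(squeezingAxes: 1) // [batchSize]
//     let nextQValueBatch = self.targetQNet(tfNextStateBatch)
//     let target = tfRewardBatch.flattened()
//       + self.discount * nextQValueBatch.max(squeezingAxes: 1)
//     return squaredDifference(prediction, withoutDerivative(at: target)).mean()
//   }
//
// so a single gradient computation covers the whole batch.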
}

func updateTargetQNet(source: Net, target: inout Net) {
target.l1.weight = Tensor<Float>(source.l1.weight)
target.l1.bias = Tensor<Float>(source.l1.bias)
target.l2.weight = Tensor<Float>(source.l2.weight)
target.l2.bias = Tensor<Float>(source.l2.bias)
}
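// A soft (Polyak-averaged) variant of the hard copy above, sketching the direction a later
// commit in this PR ("Use soft target updates") takes; `tau` is an assumed small constant
// such as 0.005:
//
//   func softUpdateTargetQNet(source: Net, target: inout Net, tau: Float) {
//     target.l1.weight = tau * source.l1.weight + (1 - tau) * target.l1.weight
//     target.l1.bias = tau * source.l1.bias + (1 - tau) * target.l1.bias
//     target.l2.weight = tau * source.l2.weight + (1 - tau) * target.l2.weight
//     target.l2.bias = tau * source.l2.bias + (1 - tau) * target.l2.bias
//   }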

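// Wraps a Python Gym environment so that reset() and step() return Swift tensors instead of
// Python objects (step still passes the raw isDone flag and info dict through).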
class TensorFlowEnvironmentWrapper {
let originalEnv: PythonObject
let action_space: PythonObject
let observation_space: PythonObject

init(_ env: PythonObject) {
self.originalEnv = env
self.action_space = env.action_space
self.observation_space = env.observation_space
}

func reset() -> Tensor<Float> {
let state = self.originalEnv.reset()
return Tensor<Float>(numpy: np.array(state, dtype: np.float32))!
}

func step(_ action: Tensor<Int32>) -> (Tensor<Float>, Tensor<Float>, PythonObject, PythonObject) {
let npAction = action.makeNumpyArray().item()
let (state, reward, isDone, info) = originalEnv.step(npAction).tuple4
let tfState = Tensor<Float>(numpy: np.array(state, dtype: np.float32))!
let tfReward = Tensor<Float>(numpy: np.array(reward, dtype: np.float32))!
return (tfState, tfReward, isDone, info)
}
}

// Hyperparameters
let discount: Float = 0.99
let learningRate: Float = 0.01
let hiddenSize: Int = 64
let startEpsilon: Float = 0.5
let maxEpisode: Int = 500
let replayBufferCapacity: Int = 1000
let batchSize: Int = 32
let targetNetUpdateRate: Int = 1
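// Note: with targetNetUpdateRate = 1, the target network is hard-copied from the Q-network
// after every environment step, so it closely tracks the online network.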

// Initialize environment
let env = TensorFlowEnvironmentWrapper(gym.make("CartPole-v0"))

// Initialize agent
let actionCount = Int(env.action_space.n).unwrapped()
var qNet = Net(observationSize: 4, hiddenSize: hiddenSize, actionCount: actionCount)
var targetQNet = Net(observationSize: 4, hiddenSize: hiddenSize, actionCount: actionCount)
updateTargetQNet(source: qNet, target: &targetQNet)
let optimizer = Adam(for: qNet, learningRate: learningRate)
var replayBuffer: ReplayBuffer = ReplayBuffer(capacity: replayBufferCapacity)
var agent = Agent(qNet: qNet, targetQNet: targetQNet, optimizer: optimizer, replayBuffer: replayBuffer, discount: discount)

// RL Loop
var stepIndex = 0
var episodeIndex = 0
var episodeReturn: Int = 0
var episodeReturns: Array<Int> = []
var state = env.reset()
while episodeIndex < maxEpisode {
stepIndex += 1
// print("Step \(stepIndex)")

// Interact with environment
// Linearly decay epsilon from startEpsilon to 0 over the course of training
let epsilon = startEpsilon * Float(maxEpisode - episodeIndex) / Float(maxEpisode)
let action = agent.getAction(state: state, epsilon: epsilon)
// print("action: \(action)")
let (nextState, reward, isDone, _) = env.step(action)
// print("state: \(state)")
// print("nextState: \(nextState)")
// print("reward: \(reward)")
// print("isDone: \(isDone)")
episodeReturn += Int(reward.makeNumpyArray().item()).unwrapped()
// print("episodeReturn: \(episodeReturn)")

// Save interaction to replay buffer
replayBuffer.append(state: state, action: action, reward: reward, nextState: nextState)
// print("Append successful")

// Train agent
agent.train(batchSize: batchSize)
// print("Train successful")

// Periodically update Target Net from the agent's (trained) Q-network
if stepIndex % targetNetUpdateRate == 0 {
updateTargetQNet(source: agent.qNet, target: &agent.targetQNet)
}
// print("Target net update successful")

// End-of-episode
if isDone == true {
state = env.reset()
episodeIndex += 1
print("Episode \(episodeIndex) Return \(episodeReturn)")
if episodeReturn > 199 {
print("Solved in \(episodeIndex) episodes with \(stepIndex) steps!")
break
}
episodeReturns.append(episodeReturn)
episodeReturn = 0
}

// End-of-step: carry the transition's next state forward (unless the episode just reset)
if isDone == false {
state = nextState
}
}
1 change: 1 addition & 0 deletions Gym/README.md
@@ -31,4 +31,5 @@ To build and run the models, run:
swift run Gym-CartPole
swift run Gym-FrozenLake
swift run Gym-Blackjack
swift run Gym-DQN
```
1 change: 1 addition & 0 deletions Package.swift
@@ -49,6 +49,7 @@ let package = Package(
.target(name: "Gym-FrozenLake", path: "Gym/FrozenLake"),
.target(name: "Gym-CartPole", path: "Gym/CartPole"),
.target(name: "Gym-Blackjack", path: "Gym/Blackjack"),
.target(name: "Gym-DQN", path: "Gym/DQN"),
.target(
name: "VGG-Imagewoof",
dependencies: ["Datasets", "ImageClassificationModels", "TrainingLoop"],