# What is a Transformation? #2
To give a short reply, I can list a few requirements that I would need one way or the other.
Some discussion topics:
Additionally, what would be really powerful is if we didn't need to subtype from Transformation at all, as long as the query functions were implemented for the type, and anything implied by the traits was implemented. What if we re-think the type tree to be the type tree of traits? Then Transformation is the abstract type of the trait, not the actual transformation object. Let me think out loud:
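To make the "type tree of traits" idea concrete, here is a minimal sketch of Holy-trait-style dispatch. All names here (`LearnableTrait`, `learnable`, `Center`) are hypothetical, and it uses current `struct` syntax rather than the 0.5-era `type`/`immutable` used elsewhere in this thread:

```julia
# Hypothetical sketch: generic code dispatches on a trait returned by a
# query function, so types need not subtype an abstract Transformation.
abstract type LearnableTrait end
struct IsLearnable  <: LearnableTrait end
struct NotLearnable <: LearnableTrait end

# query function; the default says "not learnable"
learnable(::Type) = NotLearnable()

# any type can opt in, no subtyping required
struct Center
    mu::Vector{Float64}
end
learnable(::Type{Center}) = IsLearnable()

# generic entry point forwards to the trait-specific method
learn!(t, x) = learn!(learnable(typeof(t)), t, x)
learn!(::IsLearnable, t, x) = (copyto!(t.mu, x); t)
learn!(::NotLearnable, t, x) = error("$(typeof(t)) is not learnable")

c = Center(zeros(3))
learn!(c, [1.0, 2.0, 3.0])
```

The point of the trait layer is that `learn!` never inspects the concrete type directly, so a package can opt its types in after the fact.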
If the type is parameterized, then
I'm on board with this. I regret the decision in OnlineStats.
Yes... Tim Holy's
This isn't universally better. It's great for SGD, but bad for coordinate descent. Enforcing one or the other may be enough to keep people from buying in to JuliaML. I would prefer the default to match notation (observations in rows), but I could be convinced otherwise.
I think we need at least three (maybe four) verbs here:
Also, I have a nitpick about this example:
Should instead be:
(unless mu is the intercept in the model). The more important point I want to make is that it is actually better to think about the above model like this:
The difference is subtle, but important, for GLMs. Basically you think of the total transformation as a linear predictor
From a practical standpoint, if you want a generative model, you need to specify
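A rough sketch of the "linear predictor" view described above, with made-up function names: the transformation computes η = w⋅x + b, and the response distribution only ever sees η. For a GLM you then map η through an inverse link:

```julia
using LinearAlgebra

# The total transformation is an affine map to a linear predictor eta;
# the response distribution is parameterized by eta, not by x directly.
linpred(w, b, x) = dot(w, x) + b

# e.g. logistic regression: the inverse link maps eta to a Bernoulli mean
invlink_logistic(η) = 1 / (1 + exp(-η))

w, b = [0.5, -0.25], 0.1
x = [2.0, 4.0]
η = linpred(w, b, x)       # 0.5*2.0 - 0.25*4.0 + 0.1 = 0.1
p = invlink_logistic(η)    # mean of the Bernoulli response
```

Swapping the inverse link (identity, exp, logistic) changes the GLM without touching the affine transformation, which is the separation the comment is arguing for.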
+1, though I could also be convinced otherwise.
One more thought. If we are really going the Bayes Net route, we should talk with @sisl and maybe sync up with their BayesNets.jl package. cc: @tawheeler
I like this, if we can figure out the right way to do it. It should probably be
For SVMs to be fast, I need an observation to be adjacent in memory.
Yeah, I guess we should have our convention be
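For what it's worth, Julia arrays are column-major, so the observations-as-columns convention is what gives the memory adjacency the SVM comment asks for. A tiny illustration (variable names are mine):

```julia
# Column-major storage: elements within a column are contiguous in memory.
# With observations stored as columns, each observation is one contiguous block.
X = [1.0 3.0 5.0;
     2.0 4.0 6.0]        # 2 features, 3 observations

obs2 = X[:, 2]            # copies the (contiguous) second observation
obs2_view = view(X, :, 2) # no-copy contiguous view of the same observation
```

With observations in rows, a single observation would instead be strided across memory, which hurts tight inner loops like kernel evaluations.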
I pushed up some of my experiments... I think they're pretty promising. Check out the main Transformations.jl in the tom branch. Here's an example from runtests which wraps an `OnlineStats.Means`:

```julia
# "link" this object into the transformations world
Transformations.input_size(m::Means) = size(m.value)
Transformations.output_size(m::Means) = size(m.value)
Transformations.is_learnable(m::Means) = true
Transformations.transform!(y, m::Means, x) = (y[:] = x - m.value)
Transformations.learn!(m::Means, x) = OnlineStats.fit!(m, x)

# instantiate an OnlineStats.Means, which computes an online mean of a vector
m = Means(5)

# wrap the object in a Transformation, which stores input/output dimensions in its type,
# and allows for the common functionality to apply to the Transformation object.
t = transformation(m)

# update/learn the parameters for this transformation
# (at this point, we don't care what the transformation is!)
learn!(t, 1:5)

# center the data by applying the transform. this dispatches generically.
# we only need to define a single `transform!` method
y = transform(t, 10ones(5))

# make sure everything worked
@test y == 10ones(5) - (1:5)
```
Hey! Happy to discuss BayesNets.
Hi Tim. Yes, a computational graph is part of it, and I do care about backprop.
One thing I'm curious about is whether we can use the same set of abstractions to specify a computation graph as well as a generative model (more or less as a Bayes net). At the moment, I am interested in the latter for my day job, but I might be in the minority. This has made me think it isn't so crazy to support both: http://edwardlib.org/

P.S. @tawheeler - Are you in the Bay Area for the summer? I'm at Stanford as well, but am in Livermore until October. Would love to meet and chat about BayesNets.jl in person.
This sounds like a very neat concept. I would think that

@ahwillia I am, but will be heading to Germany from Oct through the end of the year. Happy to connect you with SISL though, or meet thereafter.
A little package name nitpick that may come up along the way. I discovered that there is an implicit convention to call packages

Not sure if there is a subtle difference between a transform and a transformation, but I thought I'd bring it up.
Should we stick with Transformations then? Maybe it would be useful to have a term just for the ML side of things. I shall refactor the stuff I am working on to
I voted for "Transform" over "Transformation" a while back because it is
I am ok with
For reference, copying from gitter:
Since then I also implemented many
One piece I'm a little unsure about is how best to add penalties to the mix. Are the penalties a component in each learnable transformation? Or maybe they are passed into every

```julia
transform!(chain::Chain) = (foreach(transform!, chain.ts); chain.output.val)
grad!(chain::Chain) = foreach(grad!, reverse(chain.ts))
```

Penalties will mess that up :(
I'm going to move on to ObjectiveFunctions next week, and try to piece some of this stuff together. I'm thinking that package will be no more than conveniences to connect Transformations, Losses, and Penalties together for use in StochasticOptimization. At some point we can revisit the package organization... let's get some stuff working in the wild first.
Thanks for taking a stab at this - I really like what you've done 💯 Adding the contribution of the penalty to the gradient should be taken care of by a different package, right? I don't see why it would be too difficult.

```julia
using Transformations
using Penalties
using Losses

type PenalizedDeepNet{P<:Penalty}
    layers::Chain
    penalties::Vector{P}  # penalty at each layer
    loss::SupervisedLoss
end

function grad!(net::PenalizedDeepNet, target)
    # gradient of output weights w.r.t. loss
    grad!(net.loss, target, net.layers[end])

    # backprop
    grad!(net.layers)

    # the penalty terms are just added to the overall objective,
    # so their contribution can now be added
    for (layer, penalty) in zip(net.layers, net.penalties)
        addgrad!(layer, penalty)
    end
end
```

I just did that quickly... but is that the gist?
It looks like there is quite a lot going on in the code. We should benchmark the affine prediction against a minimal handcrafted version that just does the matrix multiplication plus the bias. I like the generality, but we must tread carefully here. I will give this a more detailed review once 0.5 is released and I have a bit more time on my hands. Thanks for working on this, Tom.
Please do benchmark. In theory it should be very close in computation to a

I started ObjectiveFunctions last night with something similar to your code.
Thinking out loud... if a Transformation has a

```julia
type RegularizedObjective{T<:Transformation, L<:LossTransform, P<:Penalty}
    transformation::T
    loss::L
    penalty::P
end

# assumes we've already updated the Node objects through a forward pass
function grad!(objfunc::RegularizedObjective)
    grad!(objfunc.loss)
    grad!(objfunc.transformation)
    addgrad!(grad(objfunc.transformation), objfunc.penalty, params(objfunc.transformation))
end

# and then the penalty would be something like:
function addgrad!(∇θ::AbstractVector, penalty::Penalty, θ::AbstractVector)
    for i in eachindex(θ)
        ∇θ[i] += deriv(penalty, θ[i])
    end
end
```

So Losses and Transformations would know nothing about Penalties, and vice versa. ObjectiveFunctions would bind them together in a single
Not sure I completely follow, but I think the pieces are there. Is
Sort of. I forgot to put it

```julia
# Copy input values into the input node, then transform
function transform!(t::Transformation, input::AbstractVector)
    copy!(input_value(t), input)
    transform!(t)
end

# Copy the gradient into the output node, and propagate it back.
function grad!(t::Transformation, ∇out::AbstractVector)
    copy!(output_grad(t), ∇out)
    grad!(t)
end
```

So the "no data args" versions assume that the forward pass already has the input value loaded and populates the output value, and the backward pass assumes the output gradient is loaded and populates the input gradient (as well as the gradients of any learnable parameters).

Part of the Transformations API that I'm playing with is that a Transformation holds forward and backward state vectors. This simplifies creating and passing temporary vectors around and lets you operate on a transformation knowing that there are value and gradient vectors pre-computed before implementing the actual value/gradient logic for each transformation type.

This all might seem like a heavy-handed way to do some matrix multiplies and adds, and apply some functions, but in fact the inner loops of learning algos should be pretty lightweight. Storage can be shared between connected inputs/outputs among transformations, and the complexity is greatly reduced. (I hope!)

Right now I'm working on the ObjectiveFunctions design; how to nicely make the forward and backward passes, and when to pass in the
How do tree-based learners fit into this? What about classic statistical tests (Wald, etc.)?
Well, they may not be differentiable, so the
I'm not prepared to give a complete answer, but note that you should always be able to get the params/gradients of all learnable parameters in a transformation, possibly as a CatView of underlying parameter arrays. This lets you treat a transformation as a black-box function
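A minimal hand-rolled sketch of that black-box view, without the CatView type itself (`flat_params` and `set_params!` are made-up names): flatten every learnable array into one vector so a generic optimizer only ever sees θ and ∇θ.

```julia
# A transformation with two learnable arrays
struct Affine
    W::Matrix{Float64}
    b::Vector{Float64}
end

# copy-out: one flat parameter vector over all learnable arrays
flat_params(t::Affine) = vcat(vec(t.W), t.b)

# copy-in: write a flat vector back into the underlying arrays
function set_params!(t::Affine, θ::Vector{Float64})
    n = length(t.W)
    copyto!(vec(t.W), view(θ, 1:n))          # vec(W) shares memory with W
    copyto!(t.b, view(θ, n+1:length(θ)))
    t
end

# a generic gradient step that knows nothing about W or b
t = Affine(zeros(2, 2), zeros(2))
θ = flat_params(t)
∇θ = ones(length(θ))
set_params!(t, θ .- 0.1 .* ∇θ)
```

A real CatView would avoid the copy-in/copy-out entirely by presenting the underlying arrays as one mutable vector, but the optimizer-facing interface is the same.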
Sounds all very nice so far (judging from a distance). Does your design allow for special handling of sparse arrays? (In the sense that down the road we could dispatch on that and execute more efficient code for those cases.) Not sure how sparse arrays have evolved recently, but I remember that in KSVM there was a huge speedup when I handled them manually.
I would say that sparse arrays can certainly be included, though they can't
Makes sense to me!
I was just reading back through this issue, and I wanted to comment on how we might integrate with a "BayesNet approach". cc: @tawheeler

I think we could have a

To expand a little:

```julia
type NormalTransformation{T} <: BayesTransform
    mu::Node{:input,T}
    sigma::Node{:input,T}
    output::Node{:output,T}
    ...
end

function transform!(t::NormalTransformation)
    dist = Normal(value(t.mu), value(t.sigma))
    rand!(dist, value(t.output))
end
```

Once we have an "arbitrary graph" type, we could presumably track the names/labels of each transformation in a similar manner to how you do it in BayesNet.
FYI, we already have this in LearnBase:

```julia
abstract StochasticTransformation <: Transformation
```
I like it, but is

If all the input/outputs are Gaussian transformations then you should be able to analytically calculate the full distribution as output rather than just sample from it. Could be nice to support this as well at some point.
Could be either, I suppose. This could be parameterized.

More generally, it would be good to be able to compute priors and posteriors for a given transformation... either learning from data/sampling or through analytic means when possible. I think it would be easy to do this using Monte Carlo techniques without too much headache, but analytic distributions could be tricky. (I hope others step in to help with this!)
This isn't my area of expertise, but I actually think the opposite! Gaussians are the only analytically tractable case, and there is a cookbook we can follow for them. On the other hand, making effective Monte Carlo samplers seems to be quite involved. There are also variational techniques... But let's start with just implementing the forward sampling approach you've outlined. That will still be a useful first step.
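For reference, the Gaussian "cookbook" case is short enough to sketch: with known observation variance, a Normal prior on the mean is conjugate, so the posterior is again Normal (the function name here is mine).

```julia
# Conjugate update for the mean of a Normal with known variance σ²:
#   prior:      μ ~ Normal(m0, s0²)
#   likelihood: xᵢ ~ Normal(μ, σ²)
# The posterior precision is the sum of prior and data precisions.
function posterior_normal_mean(m0, s0², σ², x)
    n  = length(x)
    s² = inv(1/s0² + n/σ²)              # posterior variance
    m  = s² * (m0/s0² + sum(x)/σ²)      # posterior mean
    return m, s²
end

# standard-normal prior, unit-variance data, four observations equal to 1
m, s² = posterior_normal_mean(0.0, 1.0, 1.0, fill(1.0, 4))
# s² = 1/(1 + 4) = 0.2, and m = 0.2 * 4 = 0.8
```

This is exactly the kind of closed-form update a Gaussian-only graph could propagate analytically instead of sampling.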
FWIW, I would really like to see more conjugate priors in JuliaStats. That is something I want to add to all learning in BayesNets.jl.

The "BayesNets" approach is hopefully straightforward: each node is a CPD which supplies functions to obtain its name, parents, evaluation, learning, etc. It sounds like Transformations is a more general framework based on the same principles.
I was about to start writing code, but I think it's worth having a discussion first. If we're going to realize my eventual vision of autonomous creation of valid "directed graphs of transformations", then we need to approach the problem systematically. Every transformation should have the same type signature, otherwise we're going to have too many special cases for the framework to be useful.
A Transformation should:
(`nothing` is the input. Does it always get a scalar, or a vector?) Should this info be part of the type signature as parameters? It might have to be, though I'm going to attempt to solve this through "query functions" that produce traits to be dispatched on. If I'm successful, we don't need the wrappers for Distributions at all... we just need the generic:
For generative/stochastic transformations (generating distributions):
I wonder if we should think of the input as "randomness". i.e. we take "randomness" as input, and generate `rand(t, output_dims)`. I don't know the best way to represent this... maybe just `immutable Randomness end`.
The `transform` should, I think, take in nothing, and output the center. Think about if you have a generative process:

where T is a Transformation. What does it mean to `transform` x into y? I think it probably means to give our best estimate of y given x, so: `transform(T, x) == w * x + mu`. If that's the case, then `transform(N) == mu`.
What does it mean to `generate` an output from T? We're walking into the land of Bayesians here... I started to answer this but need more time to think it through.
What does it mean to `learn` the parameters `w`? I think: `learn(T, x, y)` means to learn `w`, `mu`, and `sigma` in parallel.

Do we agree on these definitions? If so then we need some drastic changes to the first PR.
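For the linear-Gaussian case discussed above, those three verbs could be sketched like this (everything here is hypothetical; `learn!` just does ordinary least squares plus a residual standard deviation):

```julia
using Statistics

# hypothetical generative transformation: y ~ Normal(w*x + mu, sigma)
mutable struct LinGauss
    w::Float64
    mu::Float64
    sigma::Float64
end

# transform: report the center of the distribution (best estimate of y)
transform(t::LinGauss, x) = t.w * x + t.mu
transform(t::LinGauss)    = t.mu     # no input: the center is just mu

# generate: sample from the implied distribution
generate(t::LinGauss, x) = transform(t, x) + t.sigma * randn()

# learn: fit w, mu, and sigma together from paired data
function learn!(t::LinGauss, x::Vector{Float64}, y::Vector{Float64})
    t.w     = cov(x, y) / var(x)
    t.mu    = mean(y) - t.w * mean(x)
    t.sigma = std(y .- transform.(Ref(t), x))
    return t
end

t = learn!(LinGauss(0.0, 0.0, 1.0), [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
# exact fit here: w == 2, mu == 0, so transform(t, 5.0) == 10.0
```

Under these definitions, `transform` is deterministic and `generate` is the only stochastic verb, which matches the split proposed for generative transformations earlier in the thread.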