Add transformer defined by R-style formula #406
Maybe some minor work involved in translating between the ScientificTypes.jl convention used in MLJ and the one specific to StatsModels.jl. Happy to provide guidance around this.
Yup, I think the main bits of work are integrating scientific types, resolving the one-sided/two-sided thing, and supporting table-to-table transformation. The last bit might actually be pretty simple; you can do something like:

```julia
modeltable(t::AbstractTerm, data) =
    NamedTuple(Symbol(name) => col for (name, col) in
               zip(coefnames(t), eachcol(modelcols(t, data))))
```
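The column-table idea can be sketched without any StatsModels machinery; here the matrix and the names are hypothetical stand-ins for what `modelcols` and `coefnames` would return:

```julia
# Sketch: build a column table (a NamedTuple of column vectors) from a
# design matrix and its column names -- the table-to-table idea above.
# X and names are stand-ins for modelcols/coefnames output.
X = [1.0 4.0; 2.0 5.0; 3.0 6.0]
names = ["x", "x & y"]                      # coefnames-style labels
cols = NamedTuple{Tuple(Symbol.(names))}(Tuple(eachcol(X)))
cols.x                                      # first column, as a view
```

Note the names become `Symbol`s, including composite labels like `Symbol("x & y")` for interaction terms, so downstream table-consuming code sees informative column names.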
For ScientificTypes support you'd probably want to override …
It should be fairly straightforward to implement this using StatsModels.jl.
I think it would be useful (especially to R users) to have an MLJ formula-based transformer that can be inserted anywhere in an MLJ pipeline (or other composite model). Here "formula" means "one-sided formula"; I don't think two-sided formulas make much sense in the MLJ context, because the target and features are treated separately, as in sklearn.
The `StatsModels.@formula` apparatus appears to provide most of what is needed here already; check out the docs. So this is hopefully just a matter of wrapping that.
This transformer would probably be a `Static` model with a one-sided StatsModels formula as a parameter. Ideally, and for consistency, it would perform a table-to-table transformation, rather than the table-to-matrix transformation that StatsModels performs. This does cause problems for very high-cardinality categorical features (which get one-hot encoded when you apply a StatsModels formula?) but has the advantage that new columns come with informative names, for interpretation downstream of the transformer. Actually, it probably makes sense not to force one-hot encoding anyway, as not all supervised models need it, and we already have transformers for one-hot encoding which generate the new column names.

I recall slack discussions with @kleinschmidt about this (now lost to the ether). Perhaps he would care to chime in.
See also #314 and JuliaAI/MLJGLMInterface.jl#13