feat: support more parts of end-to-end ML workflow #19
Comments
Thanks, @deepyaman, for putting these together. Letting users complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Users need to understand the data thoroughly before selecting suitable features and preprocessing strategies, so integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase is important. That understanding lets them make informed decisions during feature selection and preprocessing.
I agree that it could be valuable to handle more where Ibis is well-suited (e.g. some EDA). Your open issue on feature engineering is a much bigger topic; I could see Ibis-ML expanding in that direction to include some auto-FE (à la Featuretools), but it's not clear whether that's a priority. It's also a bit separate from the initial focus.
For consideration from @jcrist just now: Consider something like @deepyaman: The
IbisML 0.1.0 is released and covers most of this.
Objectives
TL;DR
Start at the "Alternatives considered" section.
Constraints
Mapping the landscape
Data processing for ML is a broad area. We need a strategy to differentiate our value and to narrow the scope to where we can provide immediate value.
Breaking down an end-to-end ML pipeline
Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.
Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline
The article also describes each step of the pipeline. Based on the previously-established constraints, we will limit ourselves to the data preparation and model training components.
The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:
Note
The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.
Feature comparison (WIP)
Details
Tecton
Scikit-learn
BigQuery ML
DATA_SPLIT_* parameters to your CREATE MODEL statement to control how train-test splitting is done. You can’t extract the split dataset.
HPARAM_* parameters to your CREATE MODEL statement.
NVTabular
Dask-ML
Ray
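For contrast with the BigQuery ML note above (the split happens inside CREATE MODEL and can't be extracted), here is a minimal sketch of a train-test split that can be materialized and reproduced, done by hashing a unique row key. This is only an illustration in plain Python: `assign_split`, `train_test_split`, and their parameters are hypothetical names, not the API of Ibis-ML or of any library listed here.

```python
import hashlib


def assign_split(key: str, test_fraction: float = 0.25, seed: int = 0,
                 buckets: int = 10_000) -> str:
    """Deterministically map a unique row key to 'train' or 'test'.

    Hypothetical helper: hashing the key (plus a seed) means the same row
    always lands in the same split, regardless of row order.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "test" if bucket < test_fraction * buckets else "train"


def train_test_split(rows, key_field, test_fraction=0.25, seed=0):
    """Split a list of dict rows into (train, test) by hashing `key_field`."""
    train, test = [], []
    for row in rows:
        split = assign_split(str(row[key_field]), test_fraction, seed)
        (test if split == "test" else train).append(row)
    return train, test


# Toy data; in practice the key would be a unique identifier column.
rows = [{"id": i, "x": i * 0.5} for i in range(1000)]
train, test = train_test_split(rows, "id", test_fraction=0.25, seed=42)
```

Because the assignment depends only on the key and the seed, both splits can be written out and inspected, which is the capability the BigQuery ML note says is missing.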
Ibis-ML product hypotheses
Scope
ibis.Table as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
Alternatives considered
End-to-end, IMO, also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:
.from_sklearn() or something).
Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified.
@jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.
Proposal
I propose to go with option #3 of the alternatives considered. In practice, this will mean:
from_sklearn (and, in the future, potentially other libraries)
This also means that the following will be out of scope (at least, for now):
Deliverables
Guiding principles
Demo workflows
We are currently targeting the NVTabular demo on the RecSys2020 Challenge as a demo workflow.
We need variants for all of:
With less priority:
High-level deliverables
P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).
[P0] Support handoff to XGBoost (for training and inference)
Update: to_dmatrix/to_dask_dmatrix are already implemented
tidymodels
from_sklearn
from_sklearn (i.e. those with predict functions that don't require UDFs)
from_sklearn (e.g. PCA, or some more frequently used step)
from_sklearn (e.g. SGDRegressor)
Questions for validation
Changelog
2024-03-19
Based on discussion around the Ibis-ML use cases and vision with stakeholders, some of the priorities have shifted:
from_sklearn is no longer a priority, from P0 to P3.
sklearn.preprocessing is a higher priority. We break down the relative priority of implementing steps in a separate issue.
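To make concrete what a sklearn.preprocessing-style step involves, here is a minimal sketch of standard scaling with fitting separated from transforming: statistics are learned once from training data, then applied unchanged to any later data. The class names `ScaleStandardSketch` and `FittedScaler` are hypothetical and written in plain Python for illustration; they are not Ibis-ML's or scikit-learn's API.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class FittedScaler:
    """Holds statistics learned at fit time and applies them to new rows."""
    column: str
    mean: float
    std: float

    def transform(self, rows):
        # Guard against constant columns to avoid dividing by zero.
        scale = self.std if self.std > 0 else 1.0
        return [
            {**row, self.column: (row[self.column] - self.mean) / scale}
            for row in rows
        ]


@dataclass(frozen=True)
class ScaleStandardSketch:
    """Unfitted step: only records which column to scale."""
    column: str

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        # Population standard deviation (divides by n, not n - 1).
        return FittedScaler(self.column, fmean(values), pstdev(values))


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
fitted = ScaleStandardSketch("x").fit(train)

scaled_train = fitted.transform(train)
# New data is scaled with the training statistics, not its own.
scaled_new = fitted.transform([{"x": 5.0}])
```

Keeping the fitted statistics in a separate object is one way to ensure that inference-time data never influences the scaling parameters.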