Deal with class imbalance in blocked cross-validation #262
Comments
Note: as a double check, I made the orange bars in the histogram transparent and fixed the bins to be consistent, and it is indeed still an issue.
@AndrewAnnex thanks for posting this 👍 Let me see if I understand your problem. What are you trying to predict exactly? Is it the spatial distribution of 2 or more classes/categories? If so, then #261 and #268 will be of interest to you. We can't do it properly just yet since Verde only does regression-type models, but #268 would solve this. The fold imbalance part is another issue, and I'm not entirely sure how. @jessepisel, this seems like something you might be interested in (or know how to proceed with).
Also, check out #254, which adds the
@leouieda I am trying to predict the elevation of a given surface layer: essentially, given an x,y,z position, what stratigraphic surface is present at that position. This is broadly similar to producing a 3D spline interpolation of a surface for a single layer. I have multiple layers, so I currently use the GemPy project because it is one of the few open-source geomodeling packages available. Looking at #268, it is essentially a model XYZ -> C, where C is the target or prediction to be made, and there are N possible categories in C. Otherwise it seems that the first case in #261 is basically what I am doing.

As I understand it, BlockKFold works like this: you define spatially disjoint boxes so that, when the data for a fold are split into test and train sets, no mixing between them is guaranteed, along with some criterion that distributes the test blocks spatially so they are not all in one corner or another. For a stratified block fold, I would imagine that the blocks would need to be balanced so that there is an equal proportion of each class either in the test/train set as a whole or in each spatial block (that seems harder).

My idea, although it is just a hunch at the moment, would be to use space-filling curves (like a Hilbert curve) to provide a 1-dimensional index that could be used to produce another categorical or ordinal column through which the data could be spatially stratified. A conventional multi-label stratification could then be performed using built-in methods in sklearn. Space-filling curves can be tuned to create a desired number of uniform "blocks" (to n, it is a quadtree-like structure...) and there are a few to choose from that have different properties.
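To make the space-filling-curve idea concrete, here is a minimal pure-Python sketch, not anything that exists in Verde. It uses the standard iterative algorithm for converting a grid cell to its position along a Hilbert curve. The function names `hilbert_index` and `hilbert_blocks`, and the `order` and `n_blocks` parameters, are hypothetical and chosen only for illustration.

```python
def hilbert_index(order, x, y):
    """Position of integer grid cell (x, y) along a Hilbert curve.

    The grid has 2**order cells per side, so 0 <= x, y < 2**order.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next-level sub-curve is oriented
        # consistently with its parent.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_blocks(easting, northing, order=4, n_blocks=8):
    """Assign each point a spatial block label from its Hilbert index.

    Hypothetical helper: points are binned onto a 2**order grid, and the
    curve is cut into n_blocks segments of equal curve length.
    """
    n = 1 << order
    xmin, xmax = min(easting), max(easting)
    ymin, ymax = min(northing), max(northing)

    def to_cell(value, low, high):
        if high == low:
            return 0
        # Scale to [0, n) and clamp the maximum value onto the last cell.
        return min(int((value - low) / (high - low) * n), n - 1)

    labels = []
    for x, y in zip(easting, northing):
        d = hilbert_index(order, to_cell(x, xmin, xmax), to_cell(y, ymin, ymax))
        labels.append(d * n_blocks // (n * n))
    return labels
```

Because consecutive positions on the curve are neighbouring cells, each label covers a spatially contiguous region. Labels like these could then serve as the extra categorical column described above. One possible route, assuming scikit-learn 1.0 or newer, would be to pass them as `groups` to `StratifiedGroupKFold`, which keeps groups intact while trying to preserve class proportions. Whether that matches the behaviour wanted here is untested.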
I think it could potentially work, if one had enough data points, to first perform the block K-fold and then, for each fold, subsample the test/train data to equalize the counts of each class. But it depends on what block K-fold is really doing, as it sounds like it tries to equalize the counts of data for each block? There is also the imblearn package, which implements a number of undersampling techniques as well as oversampling methods like SMOTE, but those methods rely on either some form of interpolation or sampling with replacement, which I think is undesirable for my use case.
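The per-fold subsampling step could look roughly like the sketch below. It is pure Python and samples without replacement, so no points are duplicated or interpolated. The function name `undersample_balanced` is hypothetical, and the sketch assumes the train or test indices for a fold have already been produced by some blocked splitter.

```python
import random
from collections import defaultdict


def undersample_balanced(indices, labels, seed=0):
    """Subsample `indices` so every class present appears equally often.

    `labels[i]` is the class of sample i. Each class is reduced to the size
    of the rarest class found in `indices`, sampling without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    if not by_class:
        return []
    n_keep = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)
```

Note that this only equalizes the classes that are present in the fold. A class that falls entirely inside the test blocks is still missing from training afterwards, so subsampling alone does not solve that part of the problem.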
Description of the desired feature
I am using GemPy to produce geologic models of multiple geologic layers simultaneously. In Verde it seems that points are only ever considered part of one surface and one class to predict, but in GemPy I of course have multiple layers. Additionally, there needs to be a way to make sure that every class is present in the training dataset; otherwise the model will not be able to predict that class. That functionality is already present in sklearn's stratified K-fold, but of course the block portion is not there.
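For reference, a small sketch of the stratified behaviour mentioned above, using `StratifiedKFold` from scikit-learn on synthetic stand-in data (the class sizes and coordinates are made up for illustration). It has no notion of spatial blocks, which is exactly the missing piece.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
# Synthetic stand-in data: 3 classes with very different sizes.
classes = np.repeat([0, 1, 2], [60, 30, 10])
coordinates = rng.uniform(0, 100, size=(classes.size, 2))

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_classes_per_fold = []
for train, test in splitter.split(coordinates, classes):
    # Record which classes the model would see during training in this fold.
    train_classes_per_fold.append(set(classes[train].tolist()))
```

Every fold keeps all three classes in the training set because each class has at least as many members as there are splits. Neighbouring points can still end up on opposite sides of the split, so spatial leakage is not prevented.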
Example image of the issue:
In the map view on the left, the red dots are the test data and the blue dots are the training data; on the right, the test data are orange. There are 22 classes, but it is clear that around class 14/15 the full sample of that class is present only in the test dataset.
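The situation in the figure can also be detected without plotting. The sketch below is a hypothetical helper (the name `classes_missing_from_training` is an assumption, not part of Verde or sklearn) that reports which classes never appear in the training indices of a fold.

```python
def classes_missing_from_training(labels, train_indices):
    """Return the sorted classes that occur in `labels` but not in training.

    `labels[i]` is the class of sample i and `train_indices` are the sample
    indices assigned to the training set of one fold.
    """
    all_classes = set(labels)
    train_classes = {labels[i] for i in train_indices}
    return sorted(all_classes - train_classes)
```

Running a check like this on every fold of a blocked split would flag folds such as the one shown, where a class sits entirely inside the test blocks.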
Are you willing to help implement and maintain this feature? Yes/No
Yes and no. I can dig into the code to see how difficult this is, but I think I would need a deep understanding of the paper referenced, and the changes needed to make this happen would diverge from that implementation sufficiently to require an entirely new function. I also have my own ideas for how to make this work that I could try out and contribute back, but they won't be peer reviewed.