Feature/leaky splits #203
Conversation
@jacobsela when finalized, could you provide a code snippet for how you imagine this feature working?
```python
s.save()

@staticmethod
def _image_hash(image, hash_size=24):
```
I think image hashing is a generally useful enough function that it should be pulled out. For example, the `compute_exact_duplicates` method uses this hash but could also easily be changed to use an image hash.
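For reference, a pulled-out helper might look roughly like the following average-hash sketch. This is my own illustration under assumed details (PIL input, binary-string output), not necessarily the implementation in this diff:

```python
import numpy as np
from PIL import Image

def compute_image_hash(image, hash_size=24):
    """Computes a perceptual (average) hash of the given PIL image."""
    # Downsample to a small grayscale thumbnail so the hash reflects
    # overall structure rather than pixel-exact content
    thumb = image.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(thumb, dtype=float)

    # One bit per pixel: is it brighter than the mean?
    bits = (pixels > pixels.mean()).flatten()
    return "".join("1" if b else "0" for b in bits)
```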
@mwoodson1
The interface seems a bit messy to me. I was hoping for something like:

```python
dataset = foz.load_zoo_dataset(...)

leaks = fob.compute_data_leaks(
    dataset,
    method,      # use hash or embedding soft similarity
    brain_key,   # which similarity index / embeddings to use
    model,       # which model to use to compute embeddings
    ...
)
```

This would follow similar patterns to …
@mwoodson1 Thanks for the feedback; I agree that this isn't ideal. I'm holding off on creating the final …
```python
split_tags=None,
threshold=0.2,
similarity_brain_key=None,
embeddings_field=None,
```
Currently, this code results in all sample embeddings being computed:

```python
view1 = dataset.limit(500)
view2 = dataset.skip(500).limit(500)

fob.compute_leaky_splits(
    dataset,
    split_views={"train": view1, "test": view2},
    brain_key="leak_key2",
    embeddings_field="clip_embeddings",
)
```

when in practice only 1000 samples need embeddings. I think this is fine, just calling it out.
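If it ever becomes worth optimizing, one option would be to pre-compute embeddings for just the union of the split views and then point `compute_leaky_splits` at the populated field. A sketch (this pre-computation step is a suggestion, not part of the PR, and the zoo model name is just an example):

```python
import fiftyone.zoo as foz

# Restrict embedding computation to samples that appear in some split
union_ids = set(view1.values("id")) | set(view2.values("id"))
union_view = dataset.select(list(union_ids))

model = foz.load_zoo_model("clip-vit-base32-torch")
union_view.compute_embeddings(model, embeddings_field="clip_embeddings")
```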
See comment below.
```python
# Don't allow overwriting an existing run with same key, since we
# need the existing run in order to perform workflows like
# automatically cleaning up the backend's index
brain_method.register_run(samples, brain_key, overwrite=False)
```
You should check to make sure this doesn't break anything, but by adding `cleanup=False` in `register_run()` and commenting out the `cleanup()` method in `LeakySplits` below, I was able to get rid of the error with `Limit` not being registered.
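Concretely, the suggested change would be (assuming `register_run()` accepts a `cleanup` flag as described above):

```python
# Register the run without scheduling backend cleanup; per the note
# above, this avoided the error with Limit not being registered
brain_method.register_run(samples, brain_key, overwrite=False, cleanup=False)
```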
This works for registration and shouldn't break anything as far as I can tell but it doesn't fix the core issue when deleting a run.
Really? It did for me, at least when running `dataset.delete_brain_runs()`.
```python
return views


def _throw_index_is_bigger_warning(sample_id):
```
I like this idea, but it gets very annoying when this happens 300 times in a row
I now make it very explicit in the docs that the similarity index's samples should be equal to the samples passed to leaky splits, which should equal the union of the splits. I think if the user chooses to ignore that, they should either deal with the annoyance or suppress warnings. What's the alternative for people who make the mistake unwittingly and need an error?
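A middle ground, if wanted, could be to emit each warning only once per offending sample. A sketch (only the helper name comes from the diff; the message text here is assumed):

```python
import warnings

_warned_sample_ids = set()

def _throw_index_is_bigger_warning(sample_id):
    # Warn once per sample ID so a mismatched similarity index doesn't
    # produce hundreds of identical messages in a row
    if sample_id in _warned_sample_ids:
        return

    _warned_sample_ids.add(sample_id)
    warnings.warn(
        "Similarity index contains sample '%s', which is not in the "
        "provided splits" % sample_id
    )
```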
This is looking very solid! I think it's almost at the finish line.
I've tested the following:
- using `split_field`, `split_tags`, and `split_views` to specify splits, with or without all samples included
- using existing embeddings or not; using an existing similarity index or not
- edge cases where just one split is given, and where every sample is assigned to its own split. In the former, an error is thrown, as is appropriate. In the latter case, leaky splits works (although it is slow, as is to be expected)
Currently testing on a larger dataset to make sure nothing breaks, but getting very close.
Minor nit @jacobsela: make sure to lint all code before merging the PR.
@mwoodson1 can you take a look at this PR as well? There are a lot of moving pieces, so more people trying to break things now will be better in the end.
Noting that I was a bit surprised by the behavior of …
@jacobmarks This was also something I thought about. The issue with having no arguments is that I need to know where the user wants to keep images. Just removing all leaks from every split wastes data. I can maybe make it so that the argument is the split to keep/remove leaks from. What do you think?
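For illustration, that argument might look something like this (the method and argument names are hypothetical, since the interface is still under discussion):

```python
# Hypothetical: keep the "train" split intact and remove its leaked
# counterparts from the other splits
clean_view = leaks_index.no_leaks_view(keep_split="train")
```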
Overall LGTM. Please make the changes requested before merging into main:
- change `leaks` to `leaks_view`
- implementation of tagging

Also make sure to add documentation and unit tests to the main FO repo. Please request me as a reviewer on those PRs.
Link to docs PR: voxel51/fiftyone#5189
Computes leaks on a dataset using embeddings and the similarity brain method.
To try out:
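(The original snippet was cut off here; the following sketch is reconstructed from the usage shown earlier in this thread, so the dataset, brain key, and field names are illustrative.)

```python
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Illustrative splits; in practice use your own views/tags/field
view1 = dataset.limit(100)
view2 = dataset.skip(100).limit(100)

fob.compute_leaky_splits(
    dataset,
    split_views={"train": view1, "test": view2},
    brain_key="leak_key",
    embeddings_field="clip_embeddings",
)
```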