
Feature/leaky splits #203

Merged
merged 60 commits into from
Nov 25, 2024
Conversation

jacobsela
Contributor

@jacobsela jacobsela commented Oct 21, 2024

Computes leaks on a dataset using embeddings and the similarity brain method.

To try out:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10")
model = foz.load_zoo_model("clip-vit-base32-torch")

dataset.compute_embeddings(model, "embeds_clip", batch_size=32, num_workers=16)

index_clip, leaks_clip = fob.compute_leaky_splits(
    dataset,
    "foobar",
    split_field="split",
    threshold=0.2,
    embeddings_field="embeds_clip",
)

@jacobsela jacobsela added the draft Work in a draft state label Oct 21, 2024
@mwoodson1
Contributor

@jacobsela when finalized could you provide a code snippet for how you imagine this feature working?

s.save()

@staticmethod
def _image_hash(image, hash_size=24):
Contributor

I think image hashing is a generally useful enough function that it should be pulled out. For example, the compute_exact_duplicates method uses this hash but could also easily be changed to use an image hash.
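For reference, a generic difference-hash ("dHash") illustrates the idea of a standalone image-hash utility. This is a sketch only, not the PR's `_image_hash` implementation; the helper name, signature, and grayscale-array input are assumptions:

```python
import numpy as np

def image_hash(image, hash_size=24):
    """Difference hash (dHash) sketch: downsample to a small grid and
    compare adjacent pixels. Illustrative only, not the PR's actual
    ``_image_hash``.

    Args:
        image: 2D numpy array of grayscale pixel values
        hash_size: side length of the downsampled grid

    Returns:
        a hex string hash
    """
    h, w = image.shape

    # crude nearest-neighbor downsample to (hash_size, hash_size + 1)
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)
    small = image[np.ix_(rows, cols)]

    # each bit: is the left pixel brighter than its right neighbor?
    diff = small[:, :-1] > small[:, 1:]
    return np.packbits(diff.flatten()).tobytes().hex()

# identical images hash identically
img = np.zeros((50, 50), dtype=np.uint8)
assert image_hash(img) == image_hash(img.copy())
```

Perceptual hashes like this tolerate re-encoding and resizing, which is exactly why they could complement the exact file hash in `compute_exact_duplicates`.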

@jacobsela
Contributor Author

@mwoodson1
Basic snippet I used in the demo:

import fiftyone as fo
import fiftyone.brain.internal.core.leaky_splits as ls

config = ls.LeakySplitsSKLConfig(
    split_tags=['train', 'test'],
    model="resnet18-imagenet-torch"
)

# skl backend
index = ls.LeakySplitsSKL(config).initialize(dataset, "foo")

index.set_threshold(0.1)
leaks = index.leaks

session = fo.launch_app(leaks, auto=False)

# hash backend
config = ls.LeakySplitsHashConfig(
    split_tags=['train', 'test'],
    method='image',
    hash_field='hash'
)

index = ls.LeakySplitsHash(config).initialize(dataset, "foo")

session = fo.launch_app(index.leaks, auto=False)
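Conceptually, the embedding (skl) backend above reduces to flagging cross-split pairs whose embedding distance falls below the threshold. A minimal numpy sketch of that idea (illustrative only; `find_leaks` is a made-up name, and the real backend goes through FiftyOne's similarity machinery):

```python
import numpy as np

def find_leaks(train_embeds, test_embeds, threshold):
    """Return (train_idx, test_idx) pairs whose cosine distance is
    below ``threshold``. Illustrative sketch, not the PR's backend."""
    # L2-normalize so the dot product equals cosine similarity
    a = train_embeds / np.linalg.norm(train_embeds, axis=1, keepdims=True)
    b = test_embeds / np.linalg.norm(test_embeds, axis=1, keepdims=True)
    dists = 1.0 - a @ b.T  # cosine distance matrix, shape (n_train, n_test)
    return list(zip(*np.where(dists < threshold)))

train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[0.99, 0.01], [-1.0, 0.0]])

# only train[0]/test[0] are near-duplicates across splits
assert find_leaks(train, test, threshold=0.2) == [(0, 0)]
```

Raising the threshold flags more pairs as leaks, which is why threshold control matters in the final interface.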

@mwoodson1
Contributor

The interface seems a bit messy to me. I was hoping for something like

dataset = foz.load_zoo_dataset(...)

leaks = fob.compute_data_leaks(
    dataset,
    method, # use hash or embedding soft similarity
    brain_key, # which similarity index / embeddings to use,
    model, # which model to use to compute embeddings
    ...
)

This would follow similar patterns to fob.compute_visualization and fob.compute_uniqueness. For example, see the work happening in #201

@jacobsela
Contributor Author

@mwoodson1 Thanks for the feedback; I agree that this isn't ideal. I'm holding off on creating the final compute_leaks (or compute_leaky_splits, as it currently is in the code) until we finalize what we want the behavior to look like (e.g. in terms of thresholds). Putting together a final easy-to-use function at the end should be quick, so I'd rather do it once.

@jacobmarks jacobmarks self-requested a review November 21, 2024 19:01
split_tags=None,
threshold=0.2,
similarity_brain_key=None,
embeddings_field=None,
Contributor

Currently, this code results in all sample embeddings being computed:

view1 = dataset.limit(500)
view2 = dataset.skip(500).limit(500)

fob.compute_leaky_splits(
    dataset,
    split_views={"train": view1, "test": view2},
    brain_key="leak_key2",
    embeddings_field="clip_embeddings",
)

In practice, only 1000 of them need embeddings. I think this is fine; just calling it out.

Contributor Author

See comment below.

# Don't allow overwriting an existing run with same key, since we
# need the existing run in order to perform workflows like
# automatically cleaning up the backend's index
brain_method.register_run(samples, brain_key, overwrite=False)
Contributor

You should check to make sure this doesn't break anything, but by adding cleanup=False in register_run() and commenting out the cleanup() method in LeakySplits below, I was able to get rid of the error about Limit not being registered.

Contributor Author

This works for registration and shouldn't break anything as far as I can tell, but it doesn't fix the core issue when deleting a run.

Contributor

Really? It did for me, at least when running dataset.delete_brain_runs()

return views


def _throw_index_is_bigger_warning(sample_id):
Contributor

I like this idea, but it gets very annoying when this happens 300 times in a row

Contributor Author

I now make it very explicit in the docs that the similarity index's samples should be equal to the samples passed to leaky splits, which should equal the union of the splits. I think if the user chooses to ignore that, they should either deal with the annoyance or suppress warnings. What's the alternative for people who make the mistake unwittingly and need an error?
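For what it's worth, Python's warnings filters can already collapse repeats like this, provided the message text is constant. A quick sketch of the pattern (illustrative; the function name is made up and this is not what the PR does):

```python
import warnings

def warn_index_bigger(sample_id):
    # keep sample-specific detail out of the message so the warnings
    # filter can collapse repeats into a single emission
    warnings.warn("similarity index is bigger than the splits", stacklevel=2)

def run(filter_action):
    """Emit the warning 300 times under the given filter and count
    how many actually get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        for sample_id in range(300):
            warn_index_bigger(sample_id)
    return len(caught)

assert run("always") == 300  # one line per offending sample
assert run("once") == 1      # collapsed to a single warning
```

With a `"once"`-style filter, users who knowingly pass a larger index see the warning once instead of 300 times, while an error (as discussed above) remains the stricter alternative.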

Contributor

@jacobmarks jacobmarks left a comment

This is looking very solid! I think it's almost at the finish line.

I've tested the following:

  • using split_field, split_tags, and split_views to specify splits, with or without all samples included
  • using existing embeddings or not; using existing similarity index or not
  • edge cases where just one split is given, and where every sample is assigned to its own split. In the former, an error is thrown, as is appropriate. In the latter case, leaky splits works (although it is slow, as is to be expected)

Currently testing on a larger dataset to make sure nothing breaks, but getting very close.

Minor nit @jacobsela: make sure to lint all code before merging the PR

@mwoodson1 can you take a look at this PR as well? There are a lot of moving pieces, so more people trying to break things now will be better in the end

@jacobmarks
Contributor

Noting that

  • everything works as expected at the scale of 50k samples.
  • threshold control works as expected
  • tag_leaks() works as expected
  • index.leaks works as expected

I was a bit surprised by the behavior of no_leaks_view(), namely that it requires a view as input. Need to think on this.

@jacobsela
Contributor Author

@jacobmarks This was also something I thought about. The issue with having no arguments is that I need to know where the user wants to keep images. Just removing all leaks from every split wastes data. I can maybe make it so that the argument is the split to keep/remove leaks from. What do you think?
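To make that alternative concrete, here is a plain-Python sketch of what a "split to keep" argument could look like (hypothetical helper operating on sets of sample IDs, not the PR's actual API):

```python
def remove_leaks(splits, leak_pairs, keep):
    """Drop leaked samples from every split except ``keep``.

    Hypothetical sketch, not part of the PR's API.

    Args:
        splits: dict mapping split name -> set of sample IDs
        leak_pairs: iterable of (id_a, id_b) cross-split duplicate pairs
        keep: name of the split whose copies of the leaks are retained

    Returns:
        a new dict of cleaned splits
    """
    leaked = {s for pair in leak_pairs for s in pair}
    return {
        name: ids if name == keep else ids - leaked
        for name, ids in splits.items()
    }

splits = {"train": {1, 2, 3}, "test": {4, 5, 6}}

# samples 2 and 5 are duplicates; keep train's copy, drop test's
cleaned = remove_leaks(splits, [(2, 5)], keep="train")
assert cleaned == {"train": {1, 2, 3}, "test": {4, 6}}
```

Keeping the leak in exactly one split avoids wasting data, which is the concern raised above about removing leaks from every split.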

@jacobmarks jacobmarks self-requested a review November 25, 2024 16:04
Contributor

@jacobmarks jacobmarks left a comment

Overall LGTM. Please make the changes requested before merging into main:

  • change leaks to leaks_view
  • implementation of tagging

Also make sure to add documentation and unit tests to the main FO repo. Please request me as a reviewer on those PRs.

@jacobsela
Contributor Author

Link to docs PR: voxel51/fiftyone#5189

@jacobsela jacobsela merged commit 7b0259c into develop Nov 25, 2024
5 checks passed
@jacobsela jacobsela deleted the feature/leaky-splits branch November 25, 2024 19:32