
Feature/leaky splits #203

Merged
merged 60 commits into from
Nov 25, 2024
Conversation

jacobsela
Contributor

@jacobsela jacobsela commented Oct 21, 2024

Computes leaks on a dataset using embeddings and the similarity brain method.

To try out:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10")
model = foz.load_zoo_model("clip-vit-base32-torch")

dataset.compute_embeddings(model, "embeds_clip", batch_size=32, num_workers=16)

index_clip, leaks_clip = fob.compute_leaky_splits(
    dataset,
    "foobar",
    split_field="split",
    threshold=0.2,
    embeddings_field="embeds_clip",
)

@jacobsela jacobsela added the draft Work in a draft state label Oct 21, 2024
@mwoodson1
Contributor

@jacobsela when finalized could you provide a code snippet for how you imagine this feature working?

s.save()

@staticmethod
def _image_hash(image, hash_size=24):
Contributor

I think image hashing is a generally useful enough function that it should be pulled out. For example, the compute_exact_duplicates method uses this hash but could also easily be changed to use an image hash.
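For reference, a generic difference-hash ("dHash") illustrates the idea of a standalone image-hash utility. This is a sketch only, not the PR's `_image_hash` implementation; the helper name, signature, and grayscale-array input are assumptions:

```python
import numpy as np

def image_hash(image, hash_size=24):
    """Difference hash (dHash) sketch: downsample to a small grid and
    compare adjacent pixels. Illustrative only, not the PR's actual
    ``_image_hash``.

    Args:
        image: 2D numpy array of grayscale pixel values
        hash_size: side length of the downsampled grid

    Returns:
        a hex string hash
    """
    h, w = image.shape

    # crude nearest-neighbor downsample to (hash_size, hash_size + 1)
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)
    small = image[np.ix_(rows, cols)]

    # each bit: is the left pixel brighter than its right neighbor?
    diff = small[:, :-1] > small[:, 1:]
    return np.packbits(diff.flatten()).tobytes().hex()

# identical images hash identically
img = np.zeros((50, 50), dtype=np.uint8)
assert image_hash(img) == image_hash(img.copy())
```

Perceptual hashes like this tolerate re-encoding and resizing, which is exactly why they could complement the exact file hash in `compute_exact_duplicates`.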

@jacobsela
Contributor Author

@mwoodson1
Basic snippet I used in the demo:

import fiftyone as fo
import fiftyone.brain.internal.core.leaky_splits as ls

config = ls.LeakySplitsSKLConfig(
    split_tags=['train', 'test'],
    model="resnet18-imagenet-torch"
)

# skl backend
index = ls.LeakySplitsSKL(config).initialize(dataset, "foo")

index.set_threshold(0.1)
leaks = index.leaks

session = fo.launch_app(leaks, auto=False)

# hash backend
config = ls.LeakySplitsHashConfig(
    split_tags=['train', 'test'],
    method='image',
    hash_field='hash'
)

index = ls.LeakySplitsHash(config).initialize(dataset, "foo")

session = fo.launch_app(index.leaks, auto=False)
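Conceptually, the embedding (skl) backend above reduces to flagging cross-split pairs whose embedding distance falls below the threshold. A minimal numpy sketch of that idea (illustrative only; `find_leaks` is a made-up name, and the real backend goes through FiftyOne's similarity machinery):

```python
import numpy as np

def find_leaks(train_embeds, test_embeds, threshold):
    """Return (train_idx, test_idx) pairs whose cosine distance is
    below ``threshold``. Illustrative sketch, not the PR's backend."""
    # L2-normalize so the dot product equals cosine similarity
    a = train_embeds / np.linalg.norm(train_embeds, axis=1, keepdims=True)
    b = test_embeds / np.linalg.norm(test_embeds, axis=1, keepdims=True)
    dists = 1.0 - a @ b.T  # cosine distance matrix, shape (n_train, n_test)
    return list(zip(*np.where(dists < threshold)))

train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[0.99, 0.01], [-1.0, 0.0]])

# only train[0]/test[0] are near-duplicates across splits
assert find_leaks(train, test, threshold=0.2) == [(0, 0)]
```

Raising the threshold flags more pairs as leaks, which is why threshold control matters in the final interface.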

@mwoodson1
Contributor

The interface seems a bit messy to me. I was hoping for something like

dataset = foz.load_zoo_dataset(...)

leaks = fob.compute_data_leaks(
    dataset,
    method, # use hash or embedding soft similarity
    brain_key, # which similarity index / embeddings to use,
    model, # which model to use to compute embeddings
    ...
)

This would follow similar patterns to fob.compute_visualization and fob.compute_uniqueness. For example, see the work happening in #201

@jacobsela
Contributor Author

@mwoodson1 Thanks for the feedback; I agree that this isn't ideal. I'm holding off on creating the final compute_leaks (or compute_leaky_splits, as it currently is in the code) until we finalize what we want the behavior to look like (e.g. in terms of thresholds). Putting together a final easy-to-use function at the end should be quick, so I'd rather do it once.

@jacobmarks jacobmarks self-requested a review November 21, 2024 19:01
split_tags=None,
threshold=0.2,
similarity_brain_key=None,
embeddings_field=None,
Contributor

Currently, this code results in all sample embeddings being computed:

view1 = dataset.limit(500)
view2 = dataset.skip(500).limit(500)

fob.compute_leaky_splits(
    dataset,
    split_views={"train": view1, "test": view2},
    brain_key="leak_key2",
    embeddings_field="clip_embeddings",
)

In practice, only 1000 of them need embeddings. I think this is fine; just calling it out.

Contributor Author

See comment below.

# Don't allow overwriting an existing run with same key, since we
# need the existing run in order to perform workflows like
# automatically cleaning up the backend's index
brain_method.register_run(samples, brain_key, overwrite=False)
Contributor

You should check to make sure this doesn't break anything, but by adding cleanup=False in register_run() and commenting out the cleanup() method in LeakySplits below, I was able to get rid of the error about Limit not being registered.

Contributor Author

This works for registration and shouldn't break anything as far as I can tell, but it doesn't fix the core issue when deleting a run.

Contributor

Really? It did for me, at least when running dataset.delete_brain_runs()

return views


def _throw_index_is_bigger_warning(sample_id):
Contributor

I like this idea, but it gets very annoying when this happens 300 times in a row

Contributor Author

I now make it very explicit in the docs that the similarity index's samples should be equal to the samples passed to leaky splits, which should equal the union of the splits. I think if the user chooses to ignore that, they should either deal with the annoyance or suppress warnings. What's the alternative for people who make the mistake unwittingly and need an error?
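For what it's worth, Python's warnings filters can already collapse repeats like this, provided the message text is constant. A quick sketch of the pattern (illustrative; the function name is made up and this is not what the PR does):

```python
import warnings

def warn_index_bigger(sample_id):
    # keep sample-specific detail out of the message so the warnings
    # filter can collapse repeats into a single emission
    warnings.warn("similarity index is bigger than the splits", stacklevel=2)

def run(filter_action):
    """Emit the warning 300 times under the given filter and count
    how many actually get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        for sample_id in range(300):
            warn_index_bigger(sample_id)
    return len(caught)

assert run("always") == 300  # one line per offending sample
assert run("once") == 1      # collapsed to a single warning
```

With a `"once"`-style filter, users who knowingly pass a larger index see the warning once instead of 300 times, while an error (as discussed above) remains the stricter alternative.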

Contributor

@jacobmarks jacobmarks left a comment

This is looking very solid! I think it's almost at the finish line.

I've tested the following:

  • using split_field, split_tags, and split_views to specify splits, with or without all samples included
  • using existing embeddings or not; using existing similarity index or not
  • edge cases where just one split is given, and where every sample is assigned to its own split. In the former, an error is thrown, as is appropriate. In the latter case, leaky splits works (although it is slow, as is to be expected)

Currently testing on a larger dataset to make sure nothing breaks, but getting very close.

Minor nit @jacobsela: make sure to lint all code before merging the PR

@mwoodson1 can you take a look at this PR as well? There are a lot of moving pieces, so more people trying to break things now will be better in the end

@jacobmarks
Contributor

Noting that

  • everything works as expected at the scale of 50k samples.
  • threshold control works as expected
  • tag_leaks() works as expected
  • index.leaks works as expected

I was a bit surprised by the behavior of no_leaks_view(), namely that it requires a view as input. Need to think on this.

@jacobsela
Contributor Author

@jacobmarks This was also something I thought about. The issue with having no arguments is that I need to know where the user wants to keep images. Just removing all leaks from every split wastes data. I can maybe make it so that the argument is the split to keep/remove leaks from. What do you think?
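To make that alternative concrete, here is a plain-Python sketch of what a "split to keep" argument could look like (hypothetical helper operating on sets of sample IDs, not the PR's actual API):

```python
def remove_leaks(splits, leak_pairs, keep):
    """Drop leaked samples from every split except ``keep``.

    Hypothetical sketch, not part of the PR's API.

    Args:
        splits: dict mapping split name -> set of sample IDs
        leak_pairs: iterable of (id_a, id_b) cross-split duplicate pairs
        keep: name of the split whose copies of the leaks are retained

    Returns:
        a new dict of cleaned splits
    """
    leaked = {s for pair in leak_pairs for s in pair}
    return {
        name: ids if name == keep else ids - leaked
        for name, ids in splits.items()
    }

splits = {"train": {1, 2, 3}, "test": {4, 5, 6}}

# samples 2 and 5 are duplicates; keep train's copy, drop test's
cleaned = remove_leaks(splits, [(2, 5)], keep="train")
assert cleaned == {"train": {1, 2, 3}, "test": {4, 6}}
```

Keeping the leak in exactly one split avoids wasting data, which is the concern raised above about removing leaks from every split.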

@jacobmarks jacobmarks self-requested a review November 25, 2024 16:04
Contributor

@jacobmarks jacobmarks left a comment

Overall LGTM. Please make the changes requested before merging into main:

  • change leaks to leaks_view
  • implementation of tagging

Also make sure to add documentation and unit tests to the main FO repo. Please request me as a reviewer on those PRs.

@jacobsela
Contributor Author

Link to docs PR: voxel51/fiftyone#5189

@jacobsela jacobsela merged commit 7b0259c into develop Nov 25, 2024
5 checks passed
@jacobsela jacobsela deleted the feature/leaky-splits branch November 25, 2024 19:32