Feature/leaky splits #203

Merged
merged 60 commits into from
Nov 25, 2024
Changes from 51 commits
Commits
c8207ce
initial commit
Oct 15, 2024
d8c85e2
after much deliberation, quick implementation, and shell of lengthier…
Oct 17, 2024
0a5412b
small fixes
Oct 21, 2024
d4d2b99
added basic filpath hash functionality
Oct 21, 2024
07d06a7
refactor - very wip
Oct 21, 2024
a2b9594
some fixes
Oct 22, 2024
d13a494
to views implmented
Oct 22, 2024
d87d769
implemented leaks for hash
Oct 22, 2024
e42aaf5
sklearn backend basic functionality implemented and integrated
Oct 22, 2024
110a1c0
made the hash backend give out an ordered view
Oct 22, 2024
8f630bb
cache leak view after first time it's computed
Oct 23, 2024
6b4eaec
cache leak view after first time it's computed
Oct 23, 2024
4733226
some documentation and cleanup
Oct 24, 2024
276489b
far better caching mechanism
Oct 24, 2024
a923c6a
filter res so it's actually leaks and not just sim
Oct 25, 2024
80fe828
bugfix
Oct 29, 2024
855c9ac
more bugfixes
Oct 29, 2024
f6a6652
added model kwargs to leaky splits sklearn backend
Nov 1, 2024
fe28ce7
wrote main function
Nov 5, 2024
7d4a552
removed remove_leaks, replaced it with view_without_leaks
Nov 5, 2024
d55c9c7
cleanup and documentation
Nov 13, 2024
09b5e51
cleanup and documentation
Nov 13, 2024
7449545
removed patches
Nov 19, 2024
2d86e79
added checks for non empty support and no overlap when providing spli…
Nov 19, 2024
5b56182
refactor + bugfix sometimes a sample would be kept even when it had n…
Nov 20, 2024
a607ff5
fixed accessing previous brain runs
Nov 20, 2024
7c741ee
updated main function and fixed serialization bug
Nov 20, 2024
12b0975
typo
Nov 20, 2024
a390118
added cleanup
Nov 20, 2024
41190c4
another probably redundant optimization check
Nov 20, 2024
fe89ea7
a lot of thinking and not a lot of writing code
Nov 20, 2024
ee20404
optimized leak finding
Nov 20, 2024
07d5bd3
removed old code
Nov 21, 2024
2a63a37
updated docs
Nov 21, 2024
a5ad99c
moved compute function to __init__
Nov 21, 2024
24b3a29
updated docs
Nov 21, 2024
f53022c
removed more old code
Nov 21, 2024
4047459
moved similarity registration out of class, doesn't make sense for it…
Nov 21, 2024
3b54720
documentation fixes
Nov 21, 2024
644b8d5
cleaned up imports
Nov 21, 2024
eb34ca0
dealt with leaks by sample edge case
Nov 21, 2024
3e3ddb8
assume loading of brain run happens correctly
Nov 21, 2024
d2bdbd6
changed variable name
Nov 21, 2024
13364f7
made the ethod name lowercase
Nov 21, 2024
080e8d7
renamed leaks_by_sample to leaks_for_sample
Nov 21, 2024
54ecb5a
renamed view_without_leaks to no_leak_view
Nov 21, 2024
ff08fd3
updated docs
Nov 21, 2024
a666b58
compute embeddings on the fly
Nov 21, 2024
e1a7b4f
changed method type property
Nov 21, 2024
a352898
fixed passing tags
Nov 21, 2024
1cf9b5f
fixed order of precedence for defaults, similarity conf dict, and arg…
Nov 21, 2024
c445865
made id2split internal
Nov 22, 2024
9a189e0
throw warning when a considered sample is not in any of the splits
Nov 22, 2024
49059c9
added warnings for view matching heuristics
Nov 22, 2024
992767e
updated docs to reflect importance of arguments
Nov 22, 2024
27be7fa
changed variable for clarity
Nov 22, 2024
4f6f4a0
removed unnused variable
Nov 22, 2024
7d9e418
changed leaks to leaks_view
Nov 25, 2024
a12c349
made tag leaks use tag_samples
Nov 25, 2024
9c51d46
changed _to_views docs
Nov 25, 2024
96 changes: 96 additions & 0 deletions fiftyone/brain/__init__.py
@@ -703,3 +703,99 @@ def compute_exact_duplicates(
    return fbd.compute_exact_duplicates(
        samples, num_workers, skip_failures, progress
    )


def compute_leaky_splits(
    samples,
    brain_key=None,
    split_views=None,
    split_field=None,
    split_tags=None,
    threshold=0.2,
    similarity_brain_key=None,
    embeddings_field=None,
Contributor

Currently, this code results in all sample embeddings being computed:

view1 = dataset.limit(500)
view2 = dataset.skip(500).limit(500)

fob.compute_leaky_splits(
    dataset,
    split_views={"train": view1, "test": view2},
    brain_key="leak_key2",
    embeddings_field="clip_embeddings",
)

In practice, only the 1,000 samples in the two views need embeddings. I think this is fine, just calling it out.

Contributor Author

See comment below.

    model=None,
    model_kwargs=None,
    similarity_backend=None,
    similarity_config_dict=None,
    **kwargs,
):
    """Uses a similarity index or creates one on the spot to find leaks.

    Calling this method only creates the index. You can then call the methods
    exposed on the returned object to perform the following operations:

    - :meth:`leaks <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks>`:
      Returns a view of all leaks in the dataset.

    - :meth:`no_leaks_view <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`:
      Returns a subset of the given view without any leaks.

    - :meth:`leaks_for_sample <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks_for_sample>`:
      Returns a view with leaks corresponding to the given sample.

    - :meth:`tag_leaks <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`:
      Tags leaks in the dataset as leaks.


    Args:
        samples: a :class:`fiftyone.core.collections.SampleCollection`
        brain_key (None): a brain key under which to store the results of
            this method. If no brain key is provided, the results will not
            be saved
        split_views (None): a dict mapping split names to
            :class:`fiftyone.core.view.DatasetView` instances corresponding
            to the different splits in the dataset. Only one of
            ``split_views``, ``split_field``, and ``split_tags`` should be
            provided
        split_field (None): the name of a field that holds the split of each
            sample. Each unique value in the field is treated as a split.
            Only one of ``split_views``, ``split_field``, and ``split_tags``
            should be provided
        split_tags (None): a list of tags, each corresponding to a different
            split. Only one of ``split_views``, ``split_field``, and
            ``split_tags`` should be provided
        threshold (0.2): the threshold to run the algorithm with. Values
            between 0.1 and 0.25 tend to give good results
        similarity_brain_key (None): a brain key for the similarity index.
            If the brain key already exists, the similarity index stored
            under it is loaded. If it does not exist, a new similarity index
            is created and saved under this key
        embeddings_field (None): the field of embeddings to feed the index.
            This argument's behavior depends on whether a ``model`` is
            provided:

            - If no ``model`` is provided, this argument specifies the field
              of pre-computed embeddings to use
            - If a ``model`` is provided, this argument specifies where to
              store the model's embeddings
        model (None): a :class:`fiftyone.core.models.Model` or the name of a
            model from the
            `FiftyOne Model Zoo <https://docs.voxel51.com/user_guide/model_zoo/index.html>`_
            to use, or that was already used, to generate embeddings. The
            model must expose embeddings (``model.has_embeddings = True``)
        model_kwargs (None): a dictionary of optional keyword arguments to
            pass to the model's ``Config`` when a model name is provided
        similarity_backend (None): the name of the similarity backend to
            use. The supported values are
            ``fiftyone.brain.brain_config.similarity_backends.keys()`` and
            the default is
            ``fiftyone.brain.brain_config.default_similarity_backend``
        similarity_config_dict (None): a dict of parameters used to build
            the similarity backend. Arguments passed directly to this method
            (e.g. ``model``) take precedence over the values in the dict

    Returns:
        a :class:`fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex`,
        a :class:`fiftyone.core.view.DatasetView`
    """

    from fiftyone.brain.internal.core.leaky_splits import compute_leaky_splits

    return compute_leaky_splits(
        samples,
        brain_key=brain_key,
        split_views=split_views,
        split_field=split_field,
        split_tags=split_tags,
        threshold=threshold,
        similarity_brain_key=similarity_brain_key,
        embeddings_field=embeddings_field,
        model=model,
        model_kwargs=model_kwargs,
        similarity_backend=similarity_backend,
        similarity_config_dict=similarity_config_dict,
        **kwargs,
    )
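
To make the role of `threshold` concrete, here is a minimal, self-contained sketch of threshold-based leak detection across splits. It is not the code from this PR: the brute-force pairwise loop, the choice of cosine distance, and the names `cosine_distance` and `find_leaks` are assumptions made for illustration, whereas the function above relies on a similarity index. The idea it shows is that two samples count as a leak only when they are closer than the threshold and belong to different splits.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def find_leaks(embeddings, id2split, threshold=0.2):
    """Map each sample ID to the IDs of near-duplicates in other splits.

    Hypothetical helper, for illustration only.

    embeddings: dict mapping sample ID -> embedding vector
    id2split: dict mapping sample ID -> split name
    threshold: maximum cosine distance for two samples to count as a leak
    """
    leaks = {}
    ids = sorted(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Similar samples in the same split are duplicates, not leaks
            if id2split[a] == id2split[b]:
                continue
            if cosine_distance(embeddings[a], embeddings[b]) <= threshold:
                leaks.setdefault(a, []).append(b)
                leaks.setdefault(b, []).append(a)
    return leaks


# Toy data: s1, s2, and s4 are nearly identical; s3 is unrelated
embeddings = {
    "s1": [1.0, 0.0],
    "s2": [0.99, 0.05],
    "s3": [0.0, 1.0],
    "s4": [1.0, 0.01],
}
id2split = {"s1": "train", "s2": "test", "s3": "test", "s4": "train"}

leaks = find_leaks(embeddings, id2split, threshold=0.2)
print(leaks)
```

In this toy data, `s1` and `s4` are near-duplicates of each other but are not reported as a pair because both are in `train`; each is reported only against `s2`, which sits in `test`. A larger threshold flags more pairs, so it trades missed leaks against false positives.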