Compute near duplicates + ROI fields #214

Merged: 2 commits merged into develop from near-duplicates on Dec 3, 2024
Conversation

brimoor (Contributor) commented Nov 29, 2024

Change log

  • Adds a fob.compute_near_duplicates() method that provides a use case-centric interface to the existing DuplicatesMixin.find_duplicates() method
  • Also adds roi_field arguments to compute_similarity() and compute_leaky_splits() for consistency with compute_uniqueness() and compute_representativeness()

Example near duplicates usage

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load some COCO data
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=1000,
    persistent=True,
)

# Compute embeddings
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Find near duplicates
index = fob.compute_near_duplicates(dataset, embeddings="embeddings")

print(index.thresh)  # 0.2
print(index.duplicate_ids)  # ['XXXX', ...]

duplicates = index.duplicates_view()

session = fo.launch_app(duplicates)

Example roi_field usage

Setup

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
import fiftyone.utils.random as four

dataset = foz.load_zoo_dataset("quickstart")

# model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
# dataset.compute_patch_embeddings(model, "ground_truth", embeddings_field="embeddings")

Similarity w/ ROI

index = fob.compute_similarity(
    dataset[:100],
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.total_index_size)  # 100

embeddings, sample_ids, _ = index.compute_embeddings(dataset[100:])
index.add_to_index(embeddings, sample_ids)

print(index.total_index_size)  # 200

Near duplicates w/ ROI

index = fob.compute_near_duplicates(
    dataset,
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.thresh)  # 0.2
print(index.duplicate_ids)  # ['XXXX', ...]

duplicates = index.duplicates_view()

Leaky splits w/ ROI

dataset.untag_samples(dataset.distinct("tags"))
four.random_split(dataset, {"train": 0.7, "test": 0.3})
index = fob.compute_leaky_splits(
    dataset,
    ["train", "test"],
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.thresh)  # 0.2
print(index.leak_ids)  # ['XXXX', ...]

leaks = index.leaks_view()

@brimoor brimoor added the feature (Work on a feature request) label Nov 29, 2024
@brimoor brimoor requested a review from jacobsela November 29, 2024 23:27
@brimoor brimoor changed the title from "Compute near duplicates" to "Compute near duplicates + ROI fields" Nov 30, 2024
"%s mixin" % fbs.DuplicatesMixin
)

similarity_index.find_duplicates(thresh=threshold)

Contributor

@brimoor This is what I mean by embeddings/models/similarity boilerplate. This function is 60 lines of code for effectively one line.

Contributor Author (brimoor), Dec 2, 2024

Yeah but this boilerplate does serve some purposes:

  • Allow compute_near_duplicates() to use a different default model than compute_similarity()
  • Enforce that a pre-existing similarity_index must implement the DuplicatesMixin mixin (see the sketch below)
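
A minimal sketch of that check, for illustration only (assuming fbs aliases fiftyone.brain.similarity, as in the reviewed snippet above; this is not the actual implementation):

import fiftyone.brain.similarity as fbs

def _ensure_supports_duplicates(similarity_index):
    # Illustrative only: an existing index passed via a brain key must
    # implement DuplicatesMixin so that find_duplicates() is available
    if not isinstance(similarity_index, fbs.DuplicatesMixin):
        raise ValueError(
            "Similarity index must implement the %s mixin"
            % fbs.DuplicatesMixin
        )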

@@ -30,18 +30,21 @@ def compute_leaky_splits(
samples,
splits,
threshold=None,
roi_field=None,

Contributor

I like this for copy-paste augmentations and other strange edge cases.

_DEFAULT_MODEL = "mobilenet-v2-imagenet-torch"
_DEFAULT_BATCH_SIZE = None


def compute_similarity(
samples,
patches_field,
roi_field,

Contributor

What's the difference between roi_field and patches_field from a functional standpoint?

Contributor Author (brimoor)

The patches_field argument throughout the Brain instructs the relevant methods to operate on object patches rather than samples. So for example compute_similarity(patches_field=) says to generate a similarity index keyed by label ID rather than sample ID.

The roi_field argument means that you're doing something at the sample-level, but you want to use a specific ROI in the image rather than the full image to do the analysis. The most common use case here would be if you have a single Detection per image. If you have Detections, then roi_field aggregates the per-object embeddings into a single embedding (currently by averaging them) and uses that vector to represent the sample.
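
To make the aggregation concrete, here is a toy sketch of the idea (illustrative only; the shapes and values are made up):

import numpy as np

# Suppose a sample has 5 object patches, each with a 1280-dim embedding
patch_embeddings = np.random.rand(5, 1280)

# roi_field-style aggregation: average the per-object embeddings into a
# single vector that represents the whole sample
sample_embedding = patch_embeddings.mean(axis=0)

print(sample_embedding.shape)  # (1280,)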

Contributor

So in that case, passing a single patch per sample to patches_field is equivalent to roi_field? It feels strange to have them as two different arguments.

Contributor Author (brimoor)

No they're not quite equivalent. In that case (one patch per sample), the embeddings in the index will be the same. But if you use roi_field then the primary key for the index will be sample ID, while if you use patches_field the primary key will be label ID.

And of course, if there are in fact multiple object patches per sample, then roi_field vs patches_field also differs: roi_field will have # samples vectors in the index (one aggregated vector per sample), while patches_field will have # objects vectors.
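
For example, on the quickstart dataset the difference shows up in how the index is keyed and sized (an illustrative sketch, not from this PR; exact sizes depend on the dataset):

import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Keyed by label ID: one vector per ground_truth object
patch_index = fob.compute_similarity(dataset, patches_field="ground_truth")
print(patch_index.total_index_size)  # == dataset.count("ground_truth.detections")

# Keyed by sample ID: per-object embeddings are averaged into one vector per sample
roi_index = fob.compute_similarity(dataset, roi_field="ground_truth")
print(roi_index.total_index_size)  # == len(dataset)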

Base automatically changed from leaky-splits-updates to develop December 2, 2024 18:39
@brimoor brimoor merged commit fcbafef into develop Dec 3, 2024
5 checks passed
@brimoor brimoor deleted the near-duplicates branch December 3, 2024 15:21