Compute near duplicates + ROI fields #214

Merged: 2 commits merged into develop from near-duplicates on Dec 3, 2024
Conversation

brimoor (Contributor) commented Nov 29, 2024

Change log

  • Adds a fob.compute_near_duplicates() method that provides a use case-centric interface to the existing DuplicatesMixin.find_duplicates() method
  • Also adds roi_field arguments to compute_similarity() and compute_leaky_splits() for consistency with compute_uniqueness() and compute_representativeness()

Example near duplicates usage

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load some COCO data
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=1000,
    persistent=True,
)

# Compute embeddings
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Find near duplicates
index = fob.compute_near_duplicates(dataset, embeddings="embeddings")

print(index.thresh)  # 0.2
print(index.duplicate_ids)  # ['XXXX', ...]

duplicates = index.duplicates_view()

session = fo.launch_app(duplicates)

Example roi_field usage

Setup

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
import fiftyone.utils.random as four

dataset = foz.load_zoo_dataset("quickstart")

# model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
# dataset.compute_patch_embeddings(model, "ground_truth", embeddings_field="embeddings")

Similarity w/ ROI

index = fob.compute_similarity(
    dataset[:100],
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.total_index_size)  # 100

embeddings, sample_ids, _ = index.compute_embeddings(dataset[100:])
index.add_to_index(embeddings, sample_ids)

print(index.total_index_size)  # 200

Near duplicates w/ ROI

index = fob.compute_near_duplicates(
    dataset,
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.thresh)  # 0.2
print(index.duplicate_ids)  # ['XXXX', ...]

duplicates = index.duplicates_view()

Leaky splits w/ ROI

dataset.untag_samples(dataset.distinct("tags"))
four.random_split(dataset, {"train": 0.7, "test": 0.3})
index = fob.compute_leaky_splits(
    dataset,
    ["train", "test"],
    roi_field="ground_truth",
    # embeddings="embeddings",
)

print(index.thresh)  # 0.2
print(index.leak_ids)  # ['XXXX', ...]

leaks = index.leaks_view()

@brimoor brimoor added the feature (Work on a feature request) label Nov 29, 2024
@brimoor brimoor requested a review from jacobsela November 29, 2024 23:27
@brimoor brimoor changed the title from "Compute near duplicates" to "Compute near duplicates + ROI fields" Nov 30, 2024
"%s mixin" % fbs.DuplicatesMixin
)

similarity_index.find_duplicates(thresh=threshold)

Contributor

@brimoor This is what I mean by embeddings/models/similarity boilerplate. This function is 60 lines of code for effectively one line.

Contributor Author (brimoor), Dec 2, 2024

Yeah but this boilerplate does serve some purposes:

  • Allow compute_near_duplicates() to use a different default model than compute_similarity()
  • Enforce that a pre-existing similarity_index must implement the DuplicatesMixin mixin (see the sketch below)
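
A minimal sketch of that check, for illustration only (assuming fbs aliases fiftyone.brain.similarity, as in the reviewed snippet above; this is not the actual implementation):

import fiftyone.brain.similarity as fbs

def _ensure_supports_duplicates(similarity_index):
    # Illustrative only: an existing index passed via a brain key must
    # implement DuplicatesMixin so that find_duplicates() is available
    if not isinstance(similarity_index, fbs.DuplicatesMixin):
        raise ValueError(
            "Similarity index must implement the %s mixin"
            % fbs.DuplicatesMixin
        )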

@@ -30,18 +30,21 @@ def compute_leaky_splits(
samples,
splits,
threshold=None,
roi_field=None,

Contributor

I like this for copy-paste augmentations and other strange edge cases.

_DEFAULT_MODEL = "mobilenet-v2-imagenet-torch"
_DEFAULT_BATCH_SIZE = None


def compute_similarity(
samples,
patches_field,
roi_field,

Contributor

What's the difference between roi_field and patches_field from a functional standpoint?

Contributor Author (brimoor)

The patches_field argument throughout the Brain instructs the relevant methods to operate on object patches rather than samples. So for example compute_similarity(patches_field=) says to generate a similarity index keyed by label ID rather than sample ID.

The roi_field argument means that you're doing something at the sample-level, but you want to use a specific ROI in the image rather than the full image to do the analysis. The most common use case here would be if you have a single Detection per image. If you have Detections, then roi_field aggregates the per-object embeddings into a single embedding (currently by averaging them) and uses that vector to represent the sample.
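
To make the aggregation concrete, here is a toy sketch of the idea (illustrative only; the shapes and values are made up):

import numpy as np

# Suppose a sample has 5 object patches, each with a 1280-dim embedding
patch_embeddings = np.random.rand(5, 1280)

# roi_field-style aggregation: average the per-object embeddings into a
# single vector that represents the whole sample
sample_embedding = patch_embeddings.mean(axis=0)

print(sample_embedding.shape)  # (1280,)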

Contributor

So in that case, passing a single patch per sample to patches_field is equivalent to roi_field? It feels strange to have them as two different arguments.

Contributor Author (brimoor)

No they're not quite equivalent. In that case (one patch per sample), the embeddings in the index will be the same. But if you use roi_field then the primary key for the index will be sample ID, while if you use patches_field the primary key will be label ID.

And of course, if there are in fact multiple object patches per sample, then roi_field vs patches_field also differs: roi_field will have # samples vectors in the index (one aggregated vector per sample), while patches_field will have # objects vectors.
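
For example, on the quickstart dataset the difference shows up in how the index is keyed and sized (an illustrative sketch, not from this PR; exact sizes depend on the dataset):

import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Keyed by label ID: one vector per ground_truth object
patch_index = fob.compute_similarity(dataset, patches_field="ground_truth")
print(patch_index.total_index_size)  # == dataset.count("ground_truth.detections")

# Keyed by sample ID: per-object embeddings are averaged into one vector per sample
roi_index = fob.compute_similarity(dataset, roi_field="ground_truth")
print(roi_index.total_index_size)  # == len(dataset)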

Base automatically changed from leaky-splits-updates to develop December 2, 2024 18:39
@brimoor brimoor merged commit fcbafef into develop Dec 3, 2024
5 checks passed
@brimoor brimoor deleted the near-duplicates branch December 3, 2024 15:21