Mapping from spatial features to summary feature #81
Comments
Hi. Unfortunately, there isn't a direct way to "ground" the spatial features so that they live in the same representation space as the summary token(s). I've been looking into doing this for the CLIP head so that we get true grounding (thus unlocking zero-shot semantic segmentation), but I haven't found an approach that works well yet.
I see. Yeah, zero-shot semantic segmentation would be amazing to see, which I guess amounts to taking the crop down to a single pixel. What if we use the trained RADIO model that uses global pooling to get the summary feature? That way we should be able to crop the spatial features and then pool them to get the summary. I don't think it would work for very tiny crops or single pixels, but it might work for bigger crops. Do you have plans to release the weights of the pooling version of RADIO mentioned in the paper? I can run tests to check feasibility.
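To make the crop-then-pool idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the spatial features are treated as an H x W grid of C-dimensional vectors, the summary is assumed to be a plain mean over the grid (the pooled RADIO variant may pool differently), and the patch size of 16 and the helper names `pixel_box_to_patch_box` and `pool_crop` are hypothetical, not part of the RADIO API. It would not apply to the released CLS-token models.

```python
def pixel_box_to_patch_box(box, patch_size):
    """Convert a pixel box (x0, y0, x1, y1) to (top, left, height, width) in patch units.

    The result covers every patch the pixel box overlaps.
    """
    x0, y0, x1, y1 = box
    left = x0 // patch_size
    top = y0 // patch_size
    right = -(-x1 // patch_size)   # ceiling division
    bottom = -(-y1 // patch_size)  # ceiling division
    return top, left, bottom - top, right - left


def pool_crop(features, top, left, height, width):
    """Mean-pool a rectangular crop of an H x W x C feature grid into one C-dim vector.

    Assumption: mean pooling stands in for whatever pooling the model was trained with.
    """
    rows = features[top:top + height]
    cells = [cell for row in rows for cell in row[left:left + width]]
    if not cells:
        raise ValueError("crop does not overlap the feature grid")
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]


# Toy 4 x 4 grid with 2 channels; each cell stores its own (row, col) as features.
# In practice this grid would come from a single forward pass of the encoder.
features = [[[float(r), float(c)] for c in range(4)] for r in range(4)]

# Assumed patch size of 16 pixels: the pixel box maps to rows 1-2, cols 2-3.
top, left, height, width = pixel_box_to_patch_box((32, 16, 64, 48), patch_size=16)
crop_summary = pool_crop(features, top, left, height, width)
```

The encoder runs once per image, and each additional crop costs only a slice and an average over the cached grid. Whether the pooled vector is close enough to the summary of a true re-encoded crop is exactly what would need testing.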
Hello, I was wondering if there is a way to map the spatial features (or a crop of them) to the summary feature?
I see that the released 2.5 models use CLS tokens for the summary rather than pooling the spatial features, so there is no direct mapping between the spatial and summary features.
Why do I care? Because I want to be able to get summary features for many crops of a single image without rerunning the whole encoder for each crop.
Why summary features? Because those are the ones that can be language-aligned via the CLIP summary adapter.
Your thoughts on this are highly appreciated.