Some questions about the prototype matrix #11
Hi, thanks for your interest. The cross-modal prototype matrix is meant to learn and record the cross-modal patterns of each class, not the image or sentence features of specific samples. For that reason, each cross-modal prototype is initialized from several clustered, concatenated features (high-level statistics that capture the common pattern) rather than from the representations of specific samples. Note that these cross-modal patterns are designed at the class level rather than the instance level, so they can be applied both to whole images/sentences and to fine-grained tokens within the same class. For example, some cross-modal patterns may guide the model in how to describe the content (style, detailed or brief). In addition, some learned cross-modal patterns can themselves be fine-grained: samples within a class are grouped, so each group may focus more on certain parts of the sentence or certain patches (imagine grouping ten different types of cars from the car category: what would the model focus on?). Moreover, the initialization ensures that the cross-modal prototype matrix carries good semantic information at the beginning. Through cross-modal prototype querying and responding, together with contrastive learning, the model learns which patterns should be recorded and optimizes the cross-modal prototype matrix during training. Hope this helps you figure out the problem.
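To make the initialization step concrete, here is a minimal sketch of building an N_l x N_p x D matrix from clustered, concatenated features. It is an illustration under stated assumptions, not the repository's code: the reply only says the features are "clustered", so the use of plain k-means is an assumption, and the names `kmeans` and `init_prototypes` are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed clustering choice); returns k centroids."""
    rng = random.Random(seed)
    # Start from k distinct samples; needs at least k points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a cluster went empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def init_prototypes(image_feats, text_feats, labels, n_classes, n_protos):
    """Return a nested list of shape N_l x N_p x D (hypothetical helper).

    D is the length of the concatenated image + text feature. Each row is a
    cluster centre, i.e. a statistic over many samples of one class, not the
    representation of any single sample.
    """
    matrix = []
    for c in range(n_classes):
        joint = [img + txt
                 for img, txt, y in zip(image_feats, text_feats, labels)
                 if y == c]
        matrix.append(kmeans(joint, n_protos))
    return matrix
```

Because every prototype is an average over a group of samples from one class, it summarizes a shared pattern of that group; training is then expected to refine these rows further, as described above.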
Thank you for your reply. I think I mostly understand. Can it be summarized in the following three points?
3. This will be further optimized in subsequent training.
In addition, is there some ambiguity in the notation r_j^s in the figures of the paper? The subscript of r in Figure 1 denotes a particular patch, while the subscript of r in Figure 2 denotes a particular sample. Maybe the r in Figure 2 should be bold? I am not sure whether I understand this correctly.
Hi, thanks for your code! I have some questions about the model.
When we construct the prototype matrix (N_l x N_p x D), each 1 x D vector in it is derived from a whole image/sentence.
However, the subsequent Cross-modal Prototype Querying and Cross-modal Prototype Responding steps look for the most suitable vector in the prototype matrix for each patch or word. Doesn't this seem mismatched (image vs. patch, sentence vs. word)?
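The querying step described in this question can be sketched as follows. This is only an illustration under assumptions: cosine similarity and a hard argmax are assumed here (the thread does not specify the similarity measure or the selection rule), token features are assumed to already live in the same D-dimensional space as the prototypes, and `cosine` and `query_prototypes` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_prototypes(tokens, class_prototypes):
    """For each token (a patch or word feature), return the index of the
    most similar prototype among the N_p prototypes of the sample's class."""
    picks = []
    for t in tokens:
        sims = [cosine(t, p) for p in class_prototypes]
        picks.append(sims.index(max(sims)))
    return picks


# Toy example: two prototypes for one class, three token features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.9, 0.1], [0.2, 0.8], [2.0, 0.0]]
print(query_prototypes(tokens, prototypes))  # [0, 1, 0]
```

The sketch only requires that tokens and prototypes share a feature space; it does not require the prototypes to have been computed from patches or words themselves.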