Unlike mainstream classification methods that rely on large amounts of annotated data, we introduce a cross-modal alignment method for zero-shot image classification. The key idea is to use text-attribute queries learned from seen classes to guide local feature responses on unseen classes. First, an encoder is used to semantically align visual features with their corresponding text attributes. Second, an attention module produces response maps from the feature maps activated by the text-attribute queries. Finally, a cosine distance metric measures the matching degree between each text attribute and its corresponding feature response.
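A minimal sketch of the attention-and-matching step described above, written in PyTorch under stated assumptions: the function name `attribute_response`, the tensor shapes, and the scaled-softmax attention are illustrative choices, not details confirmed by the paper. It shows how learned text-attribute queries can activate spatial feature maps via attention and how cosine similarity can score each attribute against its pooled response.

```python
import torch
import torch.nn.functional as F

def attribute_response(feat_map, attr_query):
    """feat_map: (B, C, H, W) visual feature maps from the encoder.
    attr_query: (A, C) text-attribute embeddings learned on seen classes.
    Names and shapes are hypothetical, for illustration only."""
    B, C, H, W = feat_map.shape
    feats = feat_map.flatten(2).transpose(1, 2)           # (B, H*W, C)
    # Attention: each attribute query attends over spatial locations.
    attn = torch.einsum('bnc,ac->ban', feats, attr_query)  # (B, A, H*W)
    attn = F.softmax(attn / C ** 0.5, dim=-1)
    # Response: attention-weighted pooling of local features per attribute.
    response = torch.einsum('ban,bnc->bac', attn, feats)   # (B, A, C)
    # Cosine similarity between each response and its attribute query.
    score = F.cosine_similarity(response, attr_query.unsqueeze(0), dim=-1)
    return score                                           # (B, A)

# Example: 2 images, 512-dim features on a 7x7 grid, 312 CUB attributes.
scores = attribute_response(torch.randn(2, 512, 7, 7), torch.randn(312, 512))
print(scores.shape)  # torch.Size([2, 312])
```

At inference, such per-attribute scores could be compared against the class-attribute signatures of unseen classes, which is the standard embedding-based zero-shot recognition protocol.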
Experimental results on the CUB-200-2011 dataset show that our method outperforms both existing embedding-based and generative zero-shot learning methods.