Abstract: Recently, video recognition is emerging with the help of multi-modal learning, which focuses on integrating distinct modalities to improve the performance or robustness of the model.
Abstract: Audio-visual zero-shot learning (ZSL) leverages both video and audio information for model training, aiming to classify new video categories that were not seen during the training. However, ...