ICCVW-2023-Papers

Application

What is Next in Multimodal Foundation Models?

Title Repo Paper Video

Coarse to Fine Frame Selection for Online Open-Ended Video Question Answering ➖
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models ➖ ➖
Video-and-Language (VidL) Models and their Cognitive Relevance ➖ ➖
Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification ➖
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection ➖
ClipCrop: Conditioned Cropping Driven by Vision-Language Model ➖ ➖
Towards an Exhaustive Evaluation of Vision-Language Foundation Models ➖ ➖
Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts ➖
Painter: Teaching Auto-Regressive Language Models to Draw Sketches ➖ ➖