ORYX

All

21 repositories

AIN
Public
AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains.
ocr culture remote-sensing vqa vlm vision-and-language lmm multi-images
HTML
•
MIT License
•0•13•0•0•Updated Feb 4, 2025Feb 4, 2025
LlamaV-o1
Public
Rethinking Step-by-step Visual Reasoning in LLMs
Python
•
Apache License 2.0
•15•224•2•2•Updated Jan 24, 2025Jan 24, 2025
GeoPixel
Public
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
remote-sensing segmentation-models foundation-models large-vision-language-models large-multimodal-models vision-language-models grounding-llms
1•41•2•0•Updated Jan 24, 2025Jan 24, 2025
Camel-Bench
Public
[NACCL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
benchmark vqa arabic multimodal-learning visual-question-answering mbzuai large-multimodal-models
Python
•
MIT License
•1•31•0•0•Updated Jan 23, 2025Jan 23, 2025
UniMed-CLIP
Public
Official repository of paper titled "UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities".
Python
•
Other
•4•70•0•0•Updated Dec 26, 2024Dec 26, 2024
BiMediX2
Public
Bio-Medical EXpert LMM with English and Arabic Language Capabilities
6•62•1•0•Updated Dec 15, 2024Dec 15, 2024
VideoGLaMM
Public
A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
vision-and-language lmm foundation-models vision-language-model llm-agent
1•40•3•0•Updated Dec 13, 2024Dec 13, 2024
ALM-Bench
Public
🔥 ALM-Bench is a multilingual multi-modal diverse cultural benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusitivity.
multilingual benchmarking multi-modal cultural gpt-4 multimodal-large-language-models
Python
•
Other
•1•28•0•0•Updated Nov 29, 2024Nov 29, 2024
GeoChat
Public
[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
remote-sensing vlm
Python
•39•503•33•1•Updated Nov 28, 2024Nov 28, 2024
BiMediX
Public
Bilingual Medical Mixture of Experts LLM
Other
•1•28•1•0•Updated Nov 23, 2024Nov 23, 2024
groundingLMM
Public
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
vision-and-language lmm foundation-models vision-language-model llm-agent
Python
•41•820•27•0•Updated Nov 23, 2024Nov 23, 2024
ClimateGPT
Public
[EMNLP'23] ClimateGPT: a specialized LLM for conversations related to Climate Change and Sustainability topics in both English and Arabic languages.
Python
•10•77•0•0•Updated Sep 24, 2024Sep 24, 2024
PALO
Public
(WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu.
Python
•
Apache License 2.0
•5•81•5•0•Updated Sep 10, 2024Sep 10, 2024
Video-ChatGPT
Public
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
chatbot llama clip mulit-modal vision-language vicuna gpt-4 vision-language-pretraining llava video-chatboat
Python
•
Creative Commons Attribution 4.0 International
•111•1.3k•21•0•Updated Aug 27, 2024Aug 27, 2024
CVRR-Evaluation-Suite
Public
Official repository of paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs".
Python
•
Creative Commons Attribution 4.0 International
•4•44•0•0•Updated Aug 23, 2024Aug 23, 2024
VideoGPT-plus
Public
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
chatbot clip image-encoder video-encoder multimodal dual-encoder vision-language vicuna gpt4 vision-language-pretraining
Python
•
Creative Commons Attribution 4.0 International
•15•251•15•1•Updated Aug 11, 2024Aug 11, 2024
XrayGPT
Public
[BIONLP@ACL 2024] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models.
Python
•58•482•19•2•Updated Aug 8, 2024Aug 8, 2024
LLaVA-pp
Public
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
conversation lmms vision-language llm llava llama3 phi3 llava-llama3 llava-phi3 llama3-llava
Python
•62•827•17•2•Updated Jul 10, 2024Jul 10, 2024
MobiLlama
Public
MobiLlama : Small Language Model tailored for edge devices
slm llm efficient-llm mobile-llm tiny-llm
Python
•
Apache License 2.0
•49•618•13•2•Updated Mar 3, 2024Mar 3, 2024
Video-LLaVA
Public
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
video transcription lmm grounding video-grounding llm video-conversation
Python
•12•249•15•0•Updated Jan 2, 2024Jan 2, 2024
Awesome-CV-Foundational-Models
Public
29•8•0•0•Updated Jul 31, 2023Jul 31, 2023