A tool for analyzing food intake from videos using VLMs and Pose Estimation.

isobarbaric/FoodVideoQA

FoodVideoQA: A Novel Framework for Dietary Monitoring

🔥 Highlights

  • Cost-Effective: Uses pre-trained vision-language models without requiring fine-tuning, expensive GPUs, or specialized datasets.
  • Context-Aware Analysis: Detects foods, utensils, and eating actions frame by frame, so they can be tracked across the whole input video.
  • Domain Adaptable/Scalable: Provides labeled dietary insights applicable to healthcare, childcare, and assisted living environments without additional equipment.

🚀 Functionality

Workflow Image

🧩 VLM-Driven Insights

Extracts nutritional information, ingredients, and utensils from video frames using Vision-Language Models, then groups frames into intervals based on consistent food item presence. The code can be modified to accommodate any of the following Hugging Face VLMs:

  • liuhaotian/llava-v1.5-7b
  • llava-hf/llava-1.5-7b-hf
  • llava-hf/llava-v1.6-mistral-7b-hf
  • Salesforce/blip2-opt-2.7b
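
The interval grouping described above can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `group_item_intervals`, the `detections` input format, and the `max_gap` parameter are hypothetical, and the exact role of the frame tolerance in the real code is not documented here. The sketch assumes an item's interval is extended while it reappears within `max_gap` frames, and sets `max_gap` to the frame step size plus the frame tolerance as one possible reading.

```python
from collections import defaultdict

def group_item_intervals(detections, max_gap):
    """Group per-frame detections into (start, end) frame intervals per item.

    detections: iterable of (frame_index, labels), sorted by frame_index,
                where labels is the set of items reported for that frame.
    max_gap:    hypothetical parameter; largest frame gap that still counts
                as continuous presence of an item.
    """
    intervals = defaultdict(list)  # label -> list of [start, end]
    for frame, labels in detections:
        for label in labels:
            runs = intervals[label]
            if runs and frame - runs[-1][1] <= max_gap:
                runs[-1][1] = frame           # extend the current interval
            else:
                runs.append([frame, frame])   # start a new interval
    return {label: [tuple(run) for run in runs] for label, runs in intervals.items()}

# Assumed values, taken from the hyperparameter table in this README.
TAU = 20      # frame step size
EPSILON = 15  # frame tolerance threshold

# Made-up detections for illustration only.
detections = [
    (0, {"pizza"}),
    (20, {"pizza", "fork"}),
    (40, {"pizza"}),
    (60, set()),
    (80, {"pizza"}),
    (100, {"salad"}),
]

result = group_item_intervals(detections, max_gap=TAU + EPSILON)
print(result["pizza"])  # [(0, 40), (80, 80)]
print(result["fork"])   # [(20, 20)]
```

In this example the pizza is missing at frame 60, so the gap from frame 40 to frame 80 is 40 frames, which exceeds the assumed limit of 35 and opens a second interval.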

🤖 Pose Estimation

Detects eating behavior by checking if the mouth is open and if food is near the mouth using bounding boxes and pose landmarks. We use DWPose to detect mouth landmarks, and GroundingDINO to localize food items.
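
As a rough sketch of the two checks described above, the snippet below combines a lip-separation test with an IoU test. It does not call DWPose or GroundingDINO; the inputs are plain tuples standing in for their outputs. The function names, the use of a mouth bounding box for the IoU comparison, the strict `>` comparisons, and the pixel units are all assumptions made for illustration.

```python
import math

LIP_SEPARATION_THRESHOLD = 8.0  # beta, from the hyperparameter table
IOU_THRESHOLD = 0.15            # delta, from the hyperparameter table

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_eating(upper_lip, lower_lip, mouth_box, food_boxes):
    """Return True if the mouth is open and some food box overlaps the mouth.

    upper_lip, lower_lip: (x, y) landmark points (assumed pixel coordinates).
    mouth_box, food_boxes: (x1, y1, x2, y2) boxes (hypothetical input format).
    """
    mouth_open = math.dist(upper_lip, lower_lip) > LIP_SEPARATION_THRESHOLD
    food_near = any(iou(mouth_box, box) > IOU_THRESHOLD for box in food_boxes)
    return mouth_open and food_near

# Made-up coordinates for illustration only.
mouth_box = (80, 190, 120, 220)
eating = is_eating((100, 200), (100, 212), mouth_box, [(90, 200, 140, 250)])
far_food = is_eating((100, 200), (100, 212), mouth_box, [(300, 300, 350, 350)])
closed_mouth = is_eating((100, 200), (100, 203), mouth_box, [(90, 200, 140, 250)])
print(eating, far_food, closed_mouth)  # True False False
```

Requiring both conditions means an open mouth alone (for example, talking) or food merely held near the face with a closed mouth is not counted as eating.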

Example frame of a person eating:

Example frame of a person NOT eating:

Example face plot using DWPose:

🔧 Hyperparameters

| Hyperparameter | Symbol | Value |
|---|---|---|
| Frame Step Size | $\tau$ | 20 frames |
| Frame Tolerance Threshold | $\epsilon$ | 15 frames |
| Lip Separation Threshold | $\beta$ | 8.0 |
| IoU Threshold | $\delta$ | 0.15 |

View and modify hyperparameters here.
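
For readers who want to experiment with these values outside the repository, one way to bundle them is a small dataclass. The class name, field names, and the `sampled_frame_indices` helper are hypothetical and do not reflect the repository's actual configuration file. The helper also assumes that the frame step size means processing every $\tau$-th frame starting at frame 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    """Default values copied from the hyperparameter table (names are illustrative)."""
    frame_step_size: int = 20               # tau, in frames
    frame_tolerance: int = 15               # epsilon, in frames
    lip_separation_threshold: float = 8.0   # beta
    iou_threshold: float = 0.15             # delta

def sampled_frame_indices(total_frames, params=Hyperparameters()):
    """Frame indices to analyze, assuming every tau-th frame is kept."""
    return list(range(0, total_frames, params.frame_step_size))

indices = sampled_frame_indices(100)
print(indices)  # [0, 20, 40, 60, 80]

# A coarser sampling for a quick pass over a long video.
coarse = sampled_frame_indices(100, Hyperparameters(frame_step_size=50))
print(coarse)  # [0, 50]
```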

🙏 Acknowledgements
