- Cost-Effective: Uses pre-trained vision-language models without requiring fine-tuning, expensive GPUs, or specialized datasets.
- Context-Aware Analysis: Detects foods, utensils, and eating actions frame by frame for accurate tracking throughout the input video.
- Domain Adaptable/Scalable: Provides labeled dietary insights applicable to healthcare, childcare, and assisted living environments without additional equipment.
Extracts nutritional information, ingredients, and utensils from video frames using Vision-Language Models, then groups frames into intervals based on consistent food item presence. The code can be modified to accommodate any of the following HuggingFace VLMs (see the sketch after this list):
- `liuhaotian/llava-v1.5-7b`
- `llava-hf/llava-1.5-7b-hf`
- `llava-hf/llava-v1.6-mistral-7b-hf`
- `Salesforce/blip2-opt-2.7b`
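A minimal sketch of the frame-level extraction step, assuming the `llava-hf/llava-1.5-7b-hf` checkpoint from the list above and the `transformers` and `opencv-python` packages. The prompt wording, the sampling loop, and the video filename `meal.mp4` are illustrative, not the exact pipeline; the other checkpoints require their own model classes and prompt formats.

```python
import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative prompt; the actual prompt used by the project may differ.
PROMPT = (
    "USER: <image>\n"
    "List the food items, ingredients, and utensils visible in this frame. ASSISTANT:"
)

def describe_frame(frame_bgr):
    """Run the VLM on a single OpenCV (BGR) frame and return its text output."""
    image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, text=PROMPT, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Sample every Nth frame (Frame Step Size from the hyperparameter table below).
FRAME_STEP_SIZE = 20
cap = cv2.VideoCapture("meal.mp4")
descriptions = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % FRAME_STEP_SIZE == 0:
        descriptions.append((frame_idx, describe_frame(frame)))
    frame_idx += 1
cap.release()
```

Consecutive sampled frames whose extracted food items match can then be merged into a single interval, which is how the frame grouping described above is intended to work.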
Detects eating behavior by checking whether the mouth is open and whether food is near the mouth, using bounding boxes and pose landmarks. We use DWPose to detect mouth landmarks and GroundingDINO to localize food items; a sketch of the decision logic is shown below.
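A sketch of the per-frame decision logic, assuming mouth landmarks and a mouth bounding box have already been obtained from DWPose and food bounding boxes from GroundingDINO. The helper names (`iou`, `is_eating`) and the pixel-coordinate conventions are hypothetical; the thresholds come from the hyperparameter table below.

```python
import numpy as np

LIP_SEPARATION_THRESHOLD = 8.0  # pixel distance treated as an open mouth
IOU_THRESHOLD = 0.15            # mouth/food box overlap counted as "food near mouth"

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes in pixels."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_eating(upper_lip, lower_lip, mouth_box, food_boxes):
    """upper_lip / lower_lip: (x, y) landmarks from the pose model;
    mouth_box and food_boxes: (x1, y1, x2, y2) boxes in pixel coordinates."""
    lip_separation = np.linalg.norm(np.array(upper_lip) - np.array(lower_lip))
    mouth_open = lip_separation > LIP_SEPARATION_THRESHOLD
    food_near_mouth = any(iou(mouth_box, fb) > IOU_THRESHOLD for fb in food_boxes)
    return mouth_open and food_near_mouth
```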
| Hyperparameter | Value |
|---|---|
| Frame Step Size | 20 frames |
| Frame Tolerance Threshold | 15 frames |
| Lip Separation Threshold | 8.0 |
| IoU Threshold | 0.15 |
View and modify hyperparameters here.
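One way to keep these values in a single place is a small config object, sketched below; the field names are illustrative, and the comment on the frame tolerance threshold (merging nearby detection intervals) is an assumed role rather than a confirmed one.

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    frame_step_size: int = 20            # sample every Nth frame for VLM analysis
    frame_tolerance_threshold: int = 15  # gap in frames tolerated when merging intervals (assumed role)
    lip_separation_threshold: float = 8.0  # pixel distance treated as an open mouth
    iou_threshold: float = 0.15          # mouth/food box overlap counted as "food near mouth"

HPARAMS = Hyperparameters()
```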
- This work was supported by the National Research Council Canada (NRC) through the Aging in Place (AiP) Challenge Program, project number AiP-006.
- The authors thank the Vision and Image Processing Lab (VIP Lab) at the University of Waterloo for facilitating this project.