BOLLD employs a multi-modal approach, integrating body language analysis, lip transcriptions, and reinforcement learning to detect threats in real time using computer vision and natural language processing.
- Detects potential threats or violent language when audio is corrupted or unavailable during meetings.
- Aims to enhance safety by providing an alternative threat detection system that doesn't rely on sound.
- Can be applied in public safety scenarios, such as campus surveillance, to alert authorities of potential threats in real time.
- Assistive technology: can be implemented in camera-equipped glasses to help people with disabilities, such as blindness, by notifying them of potential threats they might not visually perceive.
Run the app using the following command:

```bash
streamlit run app.py
```
Currently, app.py contains the body language training code (details in the body_lang_decoder folder) and the lip transcription component (details in the lip_to_text folder). Each transcribed key word is compared against a list of threatening words and a threat level is calculated, as sketched below. That threat level determines the state of the system, which is passed into the Q-learning table, where reinforcement learning selects the action to take.
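For illustration, a minimal sketch of how transcribed key words could be scored against a list of threatening words (the word list, weights, and function name here are hypothetical, not the exact contents of lip_to_text):

```python
# Hypothetical threat dictionary: word -> weight in [0, 1].
THREAT_WORDS = {"gun": 0.9, "kill": 1.0, "fight": 0.6, "hurt": 0.5}

def keyword_threat_level(words):
    """Return a 0-1 threat level for a list of transcribed key words."""
    if not words:
        return 0.0
    scores = [THREAT_WORDS.get(w.lower(), 0.0) for w in words]
    # Use the strongest match so a single threatening word is not diluted.
    return max(scores)
```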
- Use a trained body language model and lip reading (via Mediapipe landmarks) to compute a numerical threat probability (0-1) for each.
- Combine both values into a single combined threat score (see the sketch after this list).
- Based on the two inputs from the first stage, train a reinforcement learning model to recognize sequences of actions and lip movements that suggest malicious behavior.
- Output: 0 = non-malicious, 1 = malicious, plus a scale (0-1) representing the threat level of key words (0 = non-threatening, 1 = threatening).
- The model influences the environment state:
  - De-escalate if the threat is correctly identified.
  - All clear! if the threat is incorrectly identified.
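As a rough illustration, the two stage-1 outputs might be blended as follows (the 50/50 weighting and 0.5 threshold are assumptions, not necessarily what app.py uses):

```python
def combined_threat_score(body_prob, lip_prob, body_weight=0.5):
    """Blend body-language and lip-reading threat probabilities (both 0-1)."""
    score = body_weight * body_prob + (1.0 - body_weight) * lip_prob
    return min(max(score, 0.0), 1.0)

# Example: strong body-language signal, mild verbal signal.
score = combined_threat_score(0.8, 0.3)   # -> 0.55
label = 1 if score >= 0.5 else 0          # 1 = malicious, 0 = non-malicious
```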
- Using the EMOLIPS model (CNN-LSTM) to detect emotions from lip movement based on face details.
- Negative emotions (e.g., anger, disgust) can assist in identifying potential threats.
- Oct 27: Shifted to a facial emotion recognition model using DeepFace due to better performance (see the sketch below).
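A minimal sketch of reading a facial emotion from a single frame with DeepFace (the negative-emotion set and the exact integration into app.py are assumptions):

```python
from deepface import DeepFace

NEGATIVE_EMOTIONS = {"angry", "disgust", "fear"}  # assumed mapping to "potential threat"

def frame_emotion_threat(frame):
    """Return (dominant_emotion, is_negative) for a single BGR frame."""
    result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
    # Recent DeepFace versions return a list with one entry per detected face.
    analysis = result[0] if isinstance(result, list) else result
    emotion = analysis["dominant_emotion"]
    return emotion, emotion in NEGATIVE_EMOTIONS
```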
- Integrating body language into a threat vs. non-threat classification using Mediapipe. The model trains on landmark coordinates extracted from frames with associated labels (see the sketch after these notes).
- Jan 13: Decided to use a single body language model (Mediapipe) after facing multiprocessing conflicts when running two models simultaneously (the initial goal was to average their outputs).
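A minimal sketch of turning Mediapipe pose landmarks from one frame into a feature row for the threat vs. non-threat classifier (the trained classifier itself is assumed to be loaded from the body_lang_decoder folder):

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def landmark_features(frame, holistic):
    """Flatten pose landmark coordinates from one BGR frame into a feature row."""
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = holistic.process(rgb)
    if results.pose_landmarks is None:
        return None
    row = []
    for lm in results.pose_landmarks.landmark:
        row.extend([lm.x, lm.y, lm.z, lm.visibility])
    return np.array(row)

# Usage: features = landmark_features(frame, mp_holistic.Holistic()), then feed the
# row to the trained classifier to get a 0-1 threat probability.
```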
- Closely following the methods of LipNet, as it's proven and well documented.
- Methodology: uses Dlib for facial landmark detection and preprocesses the GRID dataset, followed by a CNN architecture with bidirectional GRUs; CTC loss is used for training.
- Jan 13: Switching models as the previous one couldn't handle live video streams. Transitioning to a more suitable approach (e.g., the Whisper model) to transcribe lip movement to text, then applying custom models to detect violence levels.
- Jan 21: Exploring a new technique that uses lip/mouth landmarks to detect phonemes and then identify key words stored in a dictionary with associated threat levels.
- Jan 27: Enhanced the LipNet model to process live video streams and detect the mouth region with Dlib + ShapePredictor68 (see the sketch after this list).
- Jan 29: Added an algorithm to detect key words and produce a violence value.
- Jan 31: Integrated into app.py.
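A minimal sketch of the Dlib + ShapePredictor68 mouth-region crop mentioned above (padding and sizing are assumptions; landmark indices 48-67 of the 68-point model outline the mouth):

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame, pad=10):
    """Return the mouth region of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model outline the mouth.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    return frame[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
```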
- Project Kickoff: Setup environment and tools
- Task Assignment
- Define goals and objectives
- Data exploration and preparation
- Create basic frontend & backend
- Set up OpenCV for video processing
- Split into lip reading and reinforcement learning (RL) stages
- Research different models and methods for both stages
- Start implementation
- Finish body language part of stage 1
- Set up RL environment
- Finish preprocessing for lip-to-text part of stage 1
- Continue implementation of lip-to-text training
- Finish training lip-to-text part of stage 1
- Complete RL stage 2
- Create a demo video
- Connect stages 1 and 2
- Continue reinforcement learning model training
- Frontend & backend integration with ML scripts
- Finalize body language model
- Finalize lip-to-text model
- Continue working on RL
- Finish lip-to-text model
- Integrate lip-to-text into the main app.py
- Final touches
- Improve accuracy and fine-tuning
- Test the model with webcam integration
Below are images of the key landmarks used to detect the lip area:
Additionally, app.py contains the reinforcement learning code, detailed below:
State Space:

```python
def get_state(threatness_level):
    if threatness_level < 0.4:
        return "low"
    elif 0.4 <= threatness_level <= 0.7:
        return "medium"
    else:
        return "high"
```
The state space is simplified into three levels (low, medium, high) based on the threat probability from the body language model. This simplification keeps learning manageable while still capturing the essential threat levels.
Action Space:

```python
actions = ["escalate", "de-escalate"]
```

The action space is reduced to two actions (escalate and de-escalate), one of which is chosen in each state. This simplifies both the learning process and the decision-making process.
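The snippets that follow reference a few values defined elsewhere in app.py (learning_rate, discount_factor, epsilon, and the session-state Q-table); placeholder values are shown here for context, and the actual values used in the app may differ:

```python
import numpy as np
import streamlit as st

learning_rate = 0.1      # step size for Q-value updates (assumed value)
discount_factor = 0.9    # weight given to future rewards (assumed value)
epsilon = 0.2            # exploration rate for epsilon-greedy selection (assumed value)

# The Q-table lives in Streamlit session state so it persists across reruns.
if "q_table" not in st.session_state:
    st.session_state.q_table = {}
```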
Q-Learning Table:

```python
def update_q_table(state, action, reward, next_state):
    # Lazily add unseen states with zero-initialized action values.
    if state not in st.session_state.q_table:
        st.session_state.q_table[state] = {a: 0 for a in actions}
    if next_state not in st.session_state.q_table:
        st.session_state.q_table[next_state] = {a: 0 for a in actions}
    # Standard Q-learning update: move Q(s, a) toward reward + discounted best next value.
    st.session_state.q_table[state][action] += learning_rate * (
        reward + discount_factor * max(st.session_state.q_table[next_state].values()) -
        st.session_state.q_table[state][action]
    )
```

The Q-learning table is a dictionary that stores the Q-value for each state-action pair. Each update uses the current state, action, reward, and next state, and the resulting Q-values are used to pick the best action in subsequent states.
Action Selection:

```python
def choose_action(state):
    # Explore: with probability epsilon, pick a random action.
    if np.random.rand() < epsilon:
        return np.random.choice(actions)
    # Exploit: otherwise pick the action with the highest Q-value for this state.
    if state in st.session_state.q_table:
        return max(st.session_state.q_table[state], key=st.session_state.q_table[state].get)
    return np.random.choice(actions)
```

Action selection is epsilon-greedy: with probability epsilon a random action is chosen (exploration); otherwise the action with the highest Q-value for the current state is chosen (exploitation). States not yet in the Q-table fall back to a random action.
Reward Calculation:

```python
# Reward depends on whether the chosen action matches the observed threat level.
if action == "escalate":
    reward = -1 if threatness_level < 0.5 else 1
else:
    reward = 1 if threatness_level < 0.5 else -1
```

The reward depends on the current action and the threat probability: escalating is rewarded (+1) when the threat level is 0.5 or higher and penalized (-1) when it is below 0.5, while de-escalating receives the opposite rewards. This pushes the agent to escalate only on genuine threats.
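Putting the pieces together, one reinforcement learning step per processed frame might look roughly like this (a sketch built from the functions above; the actual loop in app.py may differ):

```python
def rl_step(threatness_level, next_threatness_level):
    """Run one Q-learning step for the current and next combined threat levels (0-1)."""
    state = get_state(threatness_level)
    action = choose_action(state)
    # Escalating on a genuine threat (>= 0.5) is rewarded; otherwise it is penalized.
    if action == "escalate":
        reward = -1 if threatness_level < 0.5 else 1
    else:
        reward = 1 if threatness_level < 0.5 else -1
    next_state = get_state(next_threatness_level)
    update_q_table(state, action, reward, next_state)
    return action, reward
```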
- Learning from trial and error improves the accuracy of the model.
- It allows adaptation to new situations.
- The reward system provides immediate feedback on the appropriateness of actions.
- It continuously improves its decision-making based on experience.
- And many more to be added soon...
- Evaluate the performance of the reinforcement learning model and create graphs to visualize the learning process.
- Create a decision tree that shows all possible actions and their outcomes, and how the RL model learns from them and chooses its actions.
- Update the research doc (currently in progress).
- Update the process flow diagram.