I propose Ear-VoM, an integrated system designed to prevent and address voice phishing, consisting of three stages: prevention, response & reporting, and investigation. This repository contains two simple experiments to support the core ideas of Ear-VoM.
This work was carried out as an individual project for the Challenge Semester (도전학기제) program at Ewha Womans University in the second semester of 2024.

The d-vector model learns speaker embeddings that remain discriminative across emotional variation: embeddings of different speakers are pushed apart, while embeddings of the same speaker stay close together even when the utterances carry different emotions. A minimal encoder sketch follows the list below.
- Dataset: Emotion-tagged free conversations (adults)
- Source: AI Hub
- Verify that d-vector embeddings can effectively separate different speakers across a range of emotional combinations.
- Ensure that embeddings of the same speaker remain closely aligned even when the emotion varies.
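The sketch below is a hypothetical illustration of the idea, not the repository's actual training code: an LSTM-based d-vector encoder produces unit-length utterance embeddings, and cosine similarity is used to compare two utterances of the same speaker spoken with different emotions. The layer sizes, feature dimensions, and dummy inputs are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVectorEncoder(nn.Module):
    """Maps a sequence of log-mel frames to a fixed-size speaker embedding (d-vector)."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        out, _ = self.lstm(mel)
        emb = self.proj(out[:, -1, :])           # use the last frame's hidden state
        return F.normalize(emb, dim=-1)          # unit-length d-vector

encoder = DVectorEncoder()

# Dummy log-mel features standing in for two utterances of the SAME speaker,
# e.g. one "happy" and one "angry" clip; real features would come from the dataset.
utt_happy = torch.randn(1, 180, 40)
utt_angry = torch.randn(1, 180, 40)

sim = F.cosine_similarity(encoder(utt_happy), encoder(utt_angry))
print(f"same-speaker similarity across emotions: {sim.item():.3f}")
```

With a trained encoder, same-speaker pairs such as this one should score near 1.0 regardless of emotion, while different-speaker pairs should score noticeably lower.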

The x-vector model aims to distinguish between different deep voice generation models by analyzing their generated outputs.
- Dataset: ASVspoof 2019
- Reference: Wang, X., Yamagishi, J., Todisco, M., Delgado, H., Nautsch, A., Evans, N., ... & Ling, Z. H. (2020). ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64, 101114.
- Effectively classify and differentiate the outputs of various deep voice generation models (see the sketch below).
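The following is a minimal, hypothetical sketch of an x-vector-style classifier: a TDNN (dilated 1-D convolutions) over frame-level features, statistics pooling (mean and standard deviation over time), and a small segment-level network that outputs logits over generation systems. The layer sizes and the number of classes are illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class XVectorClassifier(nn.Module):
    """TDNN x-vector classifier predicting which generation system produced an utterance."""
    def __init__(self, n_mels=40, n_systems=19, emb_dim=512):
        super().__init__()
        self.tdnn = nn.Sequential(               # frame-level dilated 1-D convolutions
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Sequential(            # segment-level layers after pooling
            nn.Linear(1500 * 2, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, n_systems),
        )

    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        frames = self.tdnn(mel)
        # Statistics pooling: concatenate mean and std over the time axis.
        stats = torch.cat([frames.mean(dim=2), frames.std(dim=2)], dim=1)
        return self.segment(stats)               # logits over generation systems

model = XVectorClassifier()
logits = model(torch.randn(2, 40, 300))          # two dummy utterances
print(logits.shape)                              # torch.Size([2, 19])
```

In the ASVspoof 2019 setting, the class labels would correspond to the spoofing systems annotated in the dataset's protocol files; the embedding taken before the final linear layer can also be reused to compare unseen generation models.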