embeddings-benchmark / mteb Public


Add audioset (WIP) #2331


RahulSChand wants to merge 1 commit into embeddings-benchmark:maeb from anime-sh:audioSet


RahulSChand commented Mar 11, 2025 •


Adds the AudioSet dataset as part of #2319. Also addresses #2049.

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
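The subsampling item in the checklist above can be illustrated with a minimal sketch. This is not mteb's `self.stratified_subsampling()` implementation; `stratified_subsample` is a hypothetical helper, and it assumes one label per example, which is a simplification for a multilabel dataset such as AudioSet. It only shows the idea of cutting a large split down to roughly 2048 examples while keeping label proportions similar.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples=2048, seed=42):
    """Return sorted indices of a subsample that roughly preserves label proportions.

    Hypothetical helper for illustration only; not mteb's stratified_subsampling().
    Assumes exactly one (sortable) label per example.
    """
    # Nothing to do if the split is already small enough.
    if n_samples >= len(labels):
        return list(range(len(labels)))

    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fraction = n_samples / len(labels)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Keep at least one example per label so rare classes survive.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)
```

Because each label's share is rounded independently, the result can differ from `n_samples` by a few examples. A real multilabel split would need a stratification scheme that accounts for label combinations.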

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.



Added audioset draft commit

f257289

RahulSChand added the maeb Audio extension label

RahulSChand self-assigned this

RahulSChand marked this pull request as draft

March 11, 2025 19:13

RahulSChand mentioned this pull request

Add AudioSet #2049

Open

RahulSChand changed the title ~~Add audioset draft commit (WIP)~~ Add audioset (WIP)

KennethEnevoldsen requested changes


Thanks for the PR. The description of the task is missing.

Please remove irrelevant stuff (model checklist) from the message. I would also really like to see that a model has actually been run on the task to confirm that it works.


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

Comment on lines +38 to +40

+                      descriptive_stats={
+                          "n_samples": {"test": 8961},  # Need to change
+                      },

KennethEnevoldsen Mar 11, 2025

Suggested change

-                      descriptive_stats={
-                          "n_samples": {"test": 8961},  # Need to change
-                      },


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

Comment on lines +28 to +30

+                      task_subtypes=[
+                          "Environment Sound Classification"
+                      ],  # Since this dataset has sounds of ALL types, this seems to be the best option

KennethEnevoldsen Mar 11, 2025

Suggested change

-                      task_subtypes=[
-                          "Environment Sound Classification"
-                      ],  # Since this dataset has sounds of ALL types, this seems to be the best option
+                      task_subtypes=[],

Hmm not sure about this one


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

+              class AudioSetMultilingualClassification(AbsTaskAudioMultilabelClassification):
+                  metadata = TaskMetadata(
+                      name="AudioSet",
+                      description="Multilabel Audio Classification.",

KennethEnevoldsen Mar 11, 2025

It is hard to learn anything about the dataset from this description.


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

+                      dataset={
+                          "path": "agkphysics/AudioSet",
+                          "revision": "5a2fa42a1506470d275a47ff8e1fdac5b364e6ef",
+                      },  # this is actually used to download the data

KennethEnevoldsen Mar 11, 2025

Suggested change

-                      },  # this is actually used to download the data
+                      },


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

Comment on lines +23 to +26

+                      date=(
+                          "2020-01-01",
+                          "2020-01-30",
+                      ),  # Estimated date when this dataset was committed, what should be the second tuple?

KennethEnevoldsen Mar 11, 2025

This is the estimated time span in which the data was created (sounds produced, tweets posted, images taken). The tuple is (A, B): the data was created between A and B.
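To make the (A, B) convention concrete, here is a minimal sketch of a sanity check on such a date pair. `check_date_range` is a hypothetical helper, not an mteb function; it only assumes the dates are ISO `YYYY-MM-DD` strings, as in the diff above, and that the start must not come after the end.

```python
from datetime import date


def check_date_range(date_range):
    """Validate a (start, end) pair of ISO 'YYYY-MM-DD' strings.

    Hypothetical helper for illustration only; not part of mteb.
    Returns the parsed (start, end) dates or raises ValueError.
    """
    start_str, end_str = date_range
    start = date.fromisoformat(start_str)
    end = date.fromisoformat(end_str)
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return start, end


# The placeholder range from the diff under review.
start, end = check_date_range(("2020-01-01", "2020-01-30"))
span_days = (end - start).days  # length of the creation window in days
```

The check says nothing about whether the range is accurate for the dataset; it only catches malformed or reversed pairs.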


mteb/tasks/Audio/AudioMultilabelClassification/eng/AudioSet.py

+                          "2020-01-01",
+                          "2020-01-30",
+                      ),  # Estimated date when this dataset was committed, what should be the second tuple?
+                      domains=["Web"],  # obtained from Freesound - online collaborative platform

KennethEnevoldsen Mar 11, 2025

Suggested change

-                      domains=["Web"],  # obtained from Freesound - online collaborative platform
+                      domains=[],

I'm not sure about this one - it's hard to say with the description, though.


RahulSChand commented Mar 11, 2025

> Thanks for the PR. The description of the task is missing.
>
> Please remove irrelevant stuff (model checklist) from the message. I would also really like to see that a model has actually been run on the task to confirm that it works.

Yeah, makes sense. It's currently a draft PR; I haven't run the actual model yet to get the numbers. Again, this dataset is bigger than the others, so it will take some time.

KennethEnevoldsen reacted with thumbs up emoji


Reviewers: KennethEnevoldsen (requested changes)
Labels: maeb Audio extension
Projects: None yet
Milestone: No milestone
2 participants
