Bags of identifiers generated from 140,000 most starred projects on GitHub in October 2016 - ~112k after deduplication.
Example:
from sourced.ml.models import BOW
bow = BOW().load("1e3da42a-28b6-4b33-94a2-a5671f4102f4")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
ID | 1e3da42a-28b6-4b33-94a2-a5671f4102f4 |
Uploaded | 2017-06-19 09:16:08.942880 |
Version | 1.0.0 |
File | https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e3da42a-28b6-4b33-94a2-a5671f4102f4.asdf |
Size | 380.8 MB |
Data collection date | October 2016 |
Number of (sub)tokens | 999,424 |
Number of repositories | 112,273 |
License |