Try Out New Machine Learning Models #142

Open
maxachis opened this issue Jan 25, 2025 · 0 comments

maxachis commented Jan 25, 2025

This is a broader issue for trying out different machine learning models for URL relevance and classification.

Models

Bag-Of-Words (BoW) Training

A means of measuring word frequency without taking word order into account. For example:

{"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1}

The theory is that certain words are more likely to be present for URLs of certain categories, and that these counts alone can help with URL classification.

This is also a fairly simple and lightweight representation that can be fed into similarly simple and lightweight machine learning models.
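As a rough sketch, scikit-learn's CountVectorizer can produce this kind of count dictionary from page text; the sample documents and variable names below are illustrative, not from our pipeline:

```python
# Minimal bag-of-words sketch using scikit-learn; documents are placeholders.
from sklearn.feature_extraction.text import CountVectorizer

url_texts = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(url_texts)  # sparse document-term count matrix

# Map each vocabulary word to its total count across the corpus.
totals = counts.sum(axis=0).A1
print(dict(zip(vectorizer.get_feature_names_out(), totals)))
# e.g. {'also': 1, 'football': 1, 'games': 1, 'john': 1, 'likes': 3, ...}
```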

TF-IDF Training

Stands for "Term Frequency - Inverse Document Frequency"

A step up in sophistication compared to BoW: words are weighted by how "important" they are to a document, discounted by how frequently they appear across the corpus in general.

This requires taking into account the broader collection of documents in the corpus, not just the word counts of a single page.
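A minimal sketch, again using scikit-learn (TfidfVectorizer), with placeholder documents:

```python
# Minimal TF-IDF sketch using scikit-learn; documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "John likes movies and football games",
    "Mary likes movies too",
    "Mary also likes football highlights",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # rows: documents, columns: vocabulary terms

# "likes" appears in every document, so it gets the lowest IDF weight;
# rarer words such as "highlights" get higher weights.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(vectorizer.idf_[idx], 3))
```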

Use DeepSeek's API

DeepSeek's API can be queried for free, so I can pipe relevant data about the URL to it and have it output a classification.

LLMs, in my experience, tend to be fairly strong at classification within limited context windows, and don't get easily tripped up by semi-structured data.

The downside is that, as long as I'm using the API, I can't improve the model itself. If it's 90% accurate, it will stay that way.

Still, that's a higher level of accuracy than our current models, and we could switch it out later if need be. So I'll give it a shot and see how it plays out.
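Since DeepSeek's API is described as OpenAI-compatible, a classification call could look roughly like the sketch below; the model name, prompt wording, label set, and classify_url helper are assumptions for illustration, not settled design:

```python
# Rough sketch of classifying a URL via DeepSeek's OpenAI-compatible API.
# Model name, prompt wording, and labels here are placeholder assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def classify_url(url: str, page_title: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "Classify the URL as one of: relevant, not_relevant. Reply with the label only.",
            },
            {"role": "user", "content": f"URL: {url}\nTitle: {page_title}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_url("https://example.com/annual-report", "Annual Report 2024"))
```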

Latent Dirichlet Allocation (LDA)

An effective algorithm for topic discovery. Could be combined with representations like Bag-of-Words and TF-IDF to provide a means of categorizing different web pages.
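A rough sketch of that combination using scikit-learn's LatentDirichletAllocation on top of bag-of-words counts; the corpus and topic count are placeholders:

```python
# Sketch of topic discovery with LDA over bag-of-words counts (scikit-learn).
# Corpus and number of topics are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "football match highlights and scores",
    "latest football transfer news",
    "easy pasta recipes for dinner",
    "quick dinner recipes and cooking tips",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic distributions

# Show the top terms for each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"Topic {topic_idx}: {top_terms}")
```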
