This is a broader issue about trying out different machine learning models for URL relevancy and classification.
Models
Bag-Of-Words (BoW) Training
A means of measuring word frequency without taking word order into account. For example, "the cat sat on the mat" becomes {the: 2, cat: 1, sat: 1, on: 1, mat: 1}.
The theory is that certain words are more likely to appear in URLs of certain categories, so these counts alone can help with URL classification.
This is also a fairly simple and lightweight metric that can be used in similarly simple and lightweight machine learning models.
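A minimal sketch of this in Python using only the standard library. The URL and the tokenization scheme (lowercase, split on non-alphanumerics) are illustrative assumptions, not the project's actual preprocessing:

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on non-alphanumeric characters, then count
    # occurrences. Word order is deliberately discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Example: turning a URL into a word-count mapping.
url = "https://example.com/news/sports/football-scores"
counts = bag_of_words(url)
```

The resulting `Counter` can be fed directly into simple models (e.g. naive Bayes) as a sparse feature vector.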
TF-IDF Training
Stands for "Term Frequency - Inverse Document Frequency"
A step up in sophistication compared to BoW, where certain words are identified as having more "importance" in a corpus, adjusted for the fact that some words appear more frequently in general.
This requires taking into account the broader collection of words in a corpus.
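A sketch of one common (unsmoothed) TF-IDF formulation in plain Python; the tokenizer and the tiny corpus are illustrative assumptions, and a real pipeline would likely use a library implementation instead:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(docs):
    """TF-IDF for a small corpus (list of strings).

    TF  = term count in doc / doc length
    IDF = log(N / number of docs containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many docs does each term appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        length = len(doc)
        scores.append(
            {w: (c / length) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return scores
```

Note that a term appearing in every document gets an IDF of log(1) = 0, which is exactly the "adjusted for words that appear frequently in general" behavior described above.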
Use DeepSeek's API
DeepSeek's API can be queried for free, so I can pipe relevant data about the URL to it and have it output a classification.
In my experience, LLMs tend to be fairly strong at classification within limited context windows, and aren't easily tripped up by semi-structured data.
The downside is that, as long as I'm using the API, I can't improve on the model. If it's 90% accurate, it will stay that way.
Still, that's a higher level of accuracy than our current models, and we could swap it out later if need be. So I'll give it a shot and see how it plays out.
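A rough sketch of what this could look like against DeepSeek's OpenAI-compatible chat completions endpoint. The model name, endpoint path, category list, and prompt wording are assumptions for illustration, not a tested integration:

```python
import json
import os
import urllib.request

CATEGORIES = ["news", "shopping", "sports", "other"]  # hypothetical label set

def build_payload(url, page_title):
    # Keep the prompt small: LLMs classify well in short context windows.
    prompt = (
        f"Classify this URL into one of {CATEGORIES}.\n"
        f"URL: {url}\nTitle: {page_title}\n"
        "Answer with the category name only."
    )
    return {
        "model": "deepseek-chat",  # assumed model name; check DeepSeek's docs
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic output for classification
    }

def classify(url, page_title):
    req = urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(build_payload(url, page_title)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Keeping the prompt and label set fixed makes it easy to later swap the API call for a local model behind the same `classify` interface.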
Latent Dirichlet Allocation (LDA)
An effective algorithm for topic discovery. Could be combined with things like Bag-Of-Words and TF-IDF to provide a means of categorizing different web pages.
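As a sketch of the underlying machinery, here is a minimal collapsed Gibbs sampler for LDA in pure Python. The hyperparameters and the toy corpus are illustrative assumptions; a real system would more likely use a library such as scikit-learn or gensim:

```python
import random

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]        # doc -> topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic -> word counts
    nk = [0] * n_topics                         # topic totals
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:  # random initial topic assignments
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], widx[w]
                # Remove this token's current assignment from the counts...
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                # ...then resample its topic given all other assignments.
                weights = [
                    (ndk[d][k] + alpha) * (nkw[k][wi] + beta) / (nk[k] + V * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    return ndk, nkw, vocab
```

The per-document topic counts (`ndk`) are exactly the kind of low-dimensional feature that could feed a page categorizer, while `nkw` describes each discovered topic as a word distribution.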