Try Out New Machine Learning Models #142

Open
maxachis opened this issue Jan 25, 2025 · 0 comments

maxachis commented Jan 25, 2025

This is a broader issue for trying out different machine learning models for URL relevance and classification.

Models

Bag-Of-Words (BoW) Training

A means of measuring word frequency without taking word order into account. For example:

{"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1}

The theory is that certain words are more likely to be present for URLs of certain categories, and that these counts alone can help with URL classification.

This is also a fairly simple and lightweight representation that can be fed into similarly simple and lightweight machine learning models.
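As a rough sketch, scikit-learn's CountVectorizer can produce this kind of count dictionary from page text; the sample documents and variable names below are illustrative, not from our pipeline:

```python
# Minimal bag-of-words sketch using scikit-learn; documents are placeholders.
from sklearn.feature_extraction.text import CountVectorizer

url_texts = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(url_texts)  # sparse document-term count matrix

# Map each vocabulary word to its total count across the corpus.
totals = counts.sum(axis=0).A1
print(dict(zip(vectorizer.get_feature_names_out(), totals)))
# e.g. {'also': 1, 'football': 1, 'games': 1, 'john': 1, 'likes': 3, ...}
```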

TF-IDF Training

Stands for "Term Frequency - Inverse Document Frequency"

A step up in sophistication compared to BoW: words are weighted by how "important" they are to a document, discounted by how frequently they appear across the corpus in general.

This requires taking into account the broader collection of documents in the corpus, not just the word counts of a single page.
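A minimal sketch, again using scikit-learn (TfidfVectorizer), with placeholder documents:

```python
# Minimal TF-IDF sketch using scikit-learn; documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "John likes movies and football games",
    "Mary likes movies too",
    "Mary also likes football highlights",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # rows: documents, columns: vocabulary terms

# "likes" appears in every document, so it gets the lowest IDF weight;
# rarer words such as "highlights" get higher weights.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(vectorizer.idf_[idx], 3))
```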

Use DeepSeek's API

DeepSeek's API can be queried for free, so I can pipe relevant data about the URL to it and have it output a classification.

LLMs, in my experience, tend to be fairly strong at classification within limited context windows, and don't get easily tripped up by semi-structured data.

The downside is that, as long as I'm using the API, I can't improve the model itself. If it's 90% accurate, it will stay that way.

Still, that's a higher level of accuracy than our current models, and we could switch it out later if need be. So I'll give it a shot and see how it plays out.
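Since DeepSeek's API is described as OpenAI-compatible, a classification call could look roughly like the sketch below; the model name, prompt wording, label set, and classify_url helper are assumptions for illustration, not settled design:

```python
# Rough sketch of classifying a URL via DeepSeek's OpenAI-compatible API.
# Model name, prompt wording, and labels here are placeholder assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def classify_url(url: str, page_title: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "Classify the URL as one of: relevant, not_relevant. Reply with the label only.",
            },
            {"role": "user", "content": f"URL: {url}\nTitle: {page_title}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_url("https://example.com/annual-report", "Annual Report 2024"))
```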

Latent Dirichlet Allocation (LDA)

An effective algorithm for topic discovery. Could be combined with representations like Bag-of-Words and TF-IDF to provide a means of categorizing different web pages.
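A rough sketch of that combination using scikit-learn's LatentDirichletAllocation on top of bag-of-words counts; the corpus and topic count are placeholders:

```python
# Sketch of topic discovery with LDA over bag-of-words counts (scikit-learn).
# Corpus and number of topics are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "football match highlights and scores",
    "latest football transfer news",
    "easy pasta recipes for dinner",
    "quick dinner recipes and cooking tips",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic distributions

# Show the top terms for each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"Topic {topic_idx}: {top_terms}")
```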
