Pipeline: Fresh URLs into Huggingface Relevancy ML Model #133

maxachis · 2025-01-07T21:53:41Z

Now that we've started accumulating a number of fresh baby URLs, we need to decide which are actually relevant for our needs and which should be discarded.

Change url_metadata to collector_metadata in the URL table. The entire URL row is arguably metadata for the URL, so the column name should be specific about the source of the metadata.
Add a relevant boolean column to the URL table, initialized to null. This will be used to identify whether a URL is considered relevant or not.
Create a pipeline that submits all URLs with pending status and null relevancy to the Huggingface ML model. Based on the model's response, update the relevant column accordingly.
Create an endpoint which can be used to trigger this batch.
Determine how to log this operation so that we can debug it later if needed.
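
As a rough sketch of the batch step described above, assuming a SQLite table named `urls` with `status` and `relevant` columns (the table name, the `classify` callable, and the logger name are all placeholders, not the project's actual schema or code):

```python
import logging
import sqlite3
from typing import Callable, Optional

logger = logging.getLogger("relevancy_batch")  # hypothetical logger name

# Hypothetical classifier signature: URL in, True/False out, None if undecided.
Classifier = Callable[[str], Optional[bool]]


def run_relevancy_batch(conn: sqlite3.Connection, classify: Classifier) -> int:
    """Classify every pending URL whose `relevant` column is still NULL."""
    rows = conn.execute(
        "SELECT id, url FROM urls WHERE status = 'pending' AND relevant IS NULL"
    ).fetchall()
    updated = 0
    for url_id, url in rows:
        verdict = classify(url)
        logger.info("url_id=%s url=%s verdict=%s", url_id, url, verdict)
        if verdict is None:
            continue  # leave NULL so a later batch can retry this URL
        conn.execute(
            "UPDATE urls SET relevant = ? WHERE id = ?", (int(verdict), url_id)
        )
        updated += 1
    conn.commit()
    return updated


def demo() -> list:
    """Build an in-memory table and run one batch with a stand-in classifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, status TEXT, "
        "collector_metadata TEXT, relevant BOOLEAN DEFAULT NULL)"
    )
    conn.executemany(
        "INSERT INTO urls (url, status) VALUES (?, ?)",
        [
            ("https://example.com", "pending"),
            ("https://police.com", "pending"),
            ("https://old.example", "processed"),
        ],
    )
    # Stand-in for the real model call: a keyword check, for illustration only.
    run_relevancy_batch(conn, lambda url: "police" in url)
    return conn.execute("SELECT url, relevant FROM urls ORDER BY id").fetchall()
```

Passing the classifier in as a callable keeps the database logic testable without loading the model; the endpoint would only need to call `run_relevancy_batch` with the real model wrapper.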

maxachis · 2025-01-09T15:38:49Z

@josh-chamberlain So I may be missing some details, as my familiarity with HuggingFace is very fresh, but it doesn't seem like the relevancy model is ready for prime time just yet. I tried putting a few URLs into the relevancy pipeline and got results barely above 0.5, which doesn't seem like sufficient confidence for us to use it:

[{'label': 'LABEL_0', 'score': 0.5437766313552856}, {'url': 'https://coloradosprings.gov/police-department/article/news/i-25-traffic-safety-deployment-after-stop'}]
[{'label': 'LABEL_0', 'score': 0.5175026059150696}, {'url': 'https://example.com'}]
[{'label': 'LABEL_0', 'score': 0.5336290597915649}, {'url': 'https://police.com'}]

Here is the simple test I used to check this:

from transformers import pipeline

def test_relevancy_pipeline():
    pipe = pipeline("text-classification", model="PDAP/url-relevance")
    urls = ["example.com", "police.com", "https://coloradosprings.gov/police-department/article/news/i-25-traffic-safety-deployment-after-stop"]
    results = pipe(urls)
    for result, url in zip(results, urls):
        print(f"{url}: {result}")
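
One way to act on low scores like these is to treat anything under a confidence threshold as undecided rather than as a verdict. A minimal sketch, where the 0.9 threshold and the assumption that `LABEL_1` means "relevant" are both guesses that would need checking against the model card:

```python
from typing import Optional


def interpret(
    result: dict,
    threshold: float = 0.9,          # assumed cutoff, not a tuned value
    relevant_label: str = "LABEL_1",  # assumption: which label means "relevant"
) -> Optional[bool]:
    """Map one text-classification result to True/False, or None if unsure."""
    if result["score"] < threshold:
        return None  # not confident enough to call it either way
    return result["label"] == relevant_label
```

With this, a result such as `{'label': 'LABEL_0', 'score': 0.5437766313552856}` comes back as `None`, so the URL stays unclassified instead of being marked irrelevant.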

The repository also seems scant: within the /hugging_face/url_relevance directory in the repository, there are 11 "clean data examples", but that's all the training data I've found so far. Unsure if there's more somewhere else, or if this exists only as a proof of concept so far.

This suggests to me that the next thing to flesh out is the label studio pipeline! 👆

josh-chamberlain · 2025-01-09T16:17:30Z

@maxachis that confidence does seem low. this board implies the model has decently high accuracy: https://huggingface.co/PDAP/url-relevance/tensorboard

though, until the accuracy gets higher still, we should probably still be importing these as pre-labels for human users to confirm/deny.

training URLs should be here, around 4300 of them: https://huggingface.co/datasets/PDAP/training-urls

they are labeled for both relevance and record type. the idea is that this includes anything in our database plus anything we have labeled separately.

maxachis · 2025-01-09T20:14:54Z

@josh-chamberlain Got it, so in that case, we may need to modify our existing workflow. Does this make sense?

---
title: Early Workflow
---
flowchart TD
  
SC[Source Collector]
FFR[ML: Filter For Relevance]
RQ(Relevant?)
DS((Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
LSPA[Send Pre-Annotation to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]

style LS fill:#d2b48c, color:#000
style LSPA fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000

SC --> FFR
FFR --> LSPA
LSPA --> RQ
RQ -->|No| DS
DS --> UBM
UBM --> UTD
RQ -->|Yes| TFA
TFA --> AF
AF -->|No| LS
LS --> UBM
AF -->|Yes| DSAPI
DSAPI --> UBM

josh-chamberlain · 2025-01-09T21:27:35Z

@maxachis hmm, it depends how much talking back and forth we want to do with Label Studio. To be clear about terminology, I meant "pre-label" as in, when the user opens the labeling interface, the option is already selected as our best guess. I don't think we need to send it to label studio any sooner.

So, if the pipeline thinks the URL is relevant, there's a decent chance now that it will be wrong... but if it also thinks it found an agency for it, it's probably worth submitting to the database! And then if not, it can go to label studio.

I think the proportion of URLs that take the happy path to submission will be quite small at first, and ideally grow.
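
For reference, a pre-label in this sense could be attached to a Label Studio task as a prediction at import time. This is only a sketch based on my understanding of Label Studio's task JSON format; the `from_name`, `to_name`, and choice strings are hypothetical and would have to match whatever labeling config the project uses:

```python
import json


def build_prelabeled_task(url: str, relevant: bool, score: float) -> dict:
    """Build one Label Studio task dict with the model's guess pre-selected."""
    return {
        "data": {"url": url},
        "predictions": [
            {
                "model_version": "PDAP/url-relevance",
                "score": score,
                "result": [
                    {
                        "from_name": "relevance",  # hypothetical control tag name
                        "to_name": "url",          # hypothetical object tag name
                        "type": "choices",
                        "value": {
                            "choices": ["Relevant" if relevant else "Not Relevant"]
                        },
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_prelabeled_task("https://example.com", False, 0.52), indent=2))
```

The annotator would then see the model's choice already selected and only need to confirm or change it.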

maxachis · 2025-01-09T21:47:35Z

@josh-chamberlain Got it, so in that case, nothing is discarded based on relevancy at this point. We keep track of the model's tentative rating of relevant or not relevant, and use that tentative rating as the pre-annotation.

---
title: Early Workflow
---
flowchart TD
  
SC[Source Collector]
FFR[🤗 Get Tentative Relevance Classification]
DS((🗑️Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]
RQ[Relevant?]

style LS fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000

SC --> FFR
FFR --> TFA
DS --> UBM
UBM --> UTD
TFA --> AF
AF -->|No| LS
LS --> RQ
RQ --> |No| DS
RQ --> |Yes| DSAPI
AF -->|Yes| DSAPI
DSAPI --> UBM
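
The routing in the flowchart above can be written out as two small decisions. This is just a sketch of the diagram, with the enum and function names made up for illustration:

```python
from enum import Enum


class NextStep(Enum):
    DATA_SOURCES_API = "send to Data Sources API for approval"
    LABEL_STUDIO = "send to Label Studio"
    DISCARD = "discard"


def route_after_agency_lookup(agency_found: bool) -> NextStep:
    """Agency found: submit for approval. Otherwise: hand off to a human."""
    if agency_found:
        return NextStep.DATA_SOURCES_API
    return NextStep.LABEL_STUDIO


def route_after_label_studio(human_says_relevant: bool) -> NextStep:
    """Only the human decision in Label Studio can discard a URL."""
    if human_says_relevant:
        return NextStep.DATA_SOURCES_API
    return NextStep.DISCARD
```

Note that the model's tentative relevance does not appear as an input to either function: it only feeds the pre-annotation, and every path ends in the batch metadata update.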

josh-chamberlain · 2025-01-09T22:51:18Z

@maxachis nice, true! it feels like agency is most critical, so this makes more intuitive sense as well.

maxachis mentioned this issue Jan 19, 2025

Mc 133 huggingface pipeline #137

Merged
