Pipeline: Fresh URLs into Huggingface Relevancy ML Model #133

Open
5 tasks
maxachis opened this issue Jan 7, 2025 · 6 comments

@maxachis
Collaborator

maxachis commented Jan 7, 2025

Now that we've started accumulating a number of fresh baby URLs, we need to decide which are actually relevant for our needs and which should be discarded.

  • Change url_metadata to collector_metadata in the URL table. The entire URL row is arguably metadata for the URL, so we should be specific about the source of this metadata.
  • Add a relevant boolean column to the URL table, initialized to null. This will be used to identify whether a URL is considered relevant or not.
  • Create a pipeline that submits all URLs with pending status and null relevancy to the Huggingface ML model. Based on the response received from them, update the relevant column accordingly.
  • Create an endpoint that can be used to trigger this batch.
  • Determine how to log this operation so that we can debug it later if needed.
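
The batch step in the tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `urls` table layout, the `classify` callable (a stand-in for whatever wraps the Hugging Face model), and the function name `run_relevancy_batch` are all hypothetical and do not reflect the project's actual schema or code.

```python
import logging
import sqlite3
from typing import Callable, Optional

logger = logging.getLogger("relevancy_batch")


def run_relevancy_batch(
    conn: sqlite3.Connection,
    classify: Callable[[str], Optional[bool]],
) -> int:
    """Classify every pending URL whose `relevant` column is still NULL.

    `classify` is a hypothetical wrapper around the model: it returns
    True/False, or None when no verdict could be obtained.
    Returns the number of rows updated.
    """
    rows = conn.execute(
        "SELECT id, url FROM urls WHERE status = 'pending' AND relevant IS NULL"
    ).fetchall()

    updated = 0
    for url_id, url in rows:
        verdict = classify(url)
        if verdict is None:
            # Leave the row untouched so a later batch can retry it.
            logger.warning("No verdict for url id=%s (%s)", url_id, url)
            continue
        conn.execute(
            "UPDATE urls SET relevant = ? WHERE id = ?", (int(verdict), url_id)
        )
        # Log each decision so the batch can be audited when debugging.
        logger.info("url id=%s (%s) -> relevant=%s", url_id, url, verdict)
        updated += 1

    conn.commit()
    return updated
```

An endpoint that triggers the batch would then only need to call this function and report the returned count.
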
@maxachis
Collaborator Author

maxachis commented Jan 9, 2025

@josh-chamberlain So I may be missing some details, as my familiarity with HuggingFace is very fresh, but it doesn't seem like the relevancy model is ready for prime time just yet. I tried putting a few URLs into the relevancy pipeline and got results barely above 0.5, which doesn't seem like sufficient confidence for us to use it:

[{'label': 'LABEL_0', 'score': 0.5437766313552856}, {'url': 'https://coloradosprings.gov/police-department/article/news/i-25-traffic-safety-deployment-after-stop'}]
[{'label': 'LABEL_0', 'score': 0.5175026059150696}, {'url': 'https://example.com'}]
[{'label': 'LABEL_0', 'score': 0.5336290597915649}, {'url': 'https://police.com'}]

Here was my simple widdle test I used to validate this:

from transformers import pipeline

def test_relevancy_pipeline():
    # Load the PDAP relevance model as a text-classification pipeline.
    pipe = pipeline("text-classification", model="PDAP/url-relevance")
    urls = ["example.com", "police.com", "https://coloradosprings.gov/police-department/article/news/i-25-traffic-safety-deployment-after-stop"]
    # Classify all URLs in one call; results come back in input order.
    results = pipe(urls)
    for result, url in zip(results, urls):
        print(f"{url}: {result}")
The repository also seems scant: within the /hugging_face/url_relevance directory, there are 11 "clean data examples", but that's all the training data I've found so far. I'm unsure if there's more somewhere else, or if this exists only in proof-of-concept form so far.

This suggests to me that the next thing to flesh out is the label studio pipeline! 👆

@josh-chamberlain
Contributor

@maxachis that confidence does seem low. this board implies the model has decently high accuracy: https://huggingface.co/PDAP/url-relevance/tensorboard

though, until the accuracy gets higher still, we should probably still be importing these as pre-labels for human users to confirm/deny.

training URLs should be here, around 4300 of them: https://huggingface.co/datasets/PDAP/training-urls

they are labeled for both relevance and record type. the idea is that this includes anything in our database plus anything we have labeled separately.

@maxachis
Collaborator Author

maxachis commented Jan 9, 2025

@josh-chamberlain Got it, so in that case, we may need to modify our existing workflow. Does this make sense?

---
title: Early Workflow
---
flowchart TD
  
SC[Source Collector]
FFR[ML: Filter For Relevance]
RQ(Relevant?)
DS((Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
LSPA[Send Pre-Annotation to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]

style LS fill:#d2b48c, color:#000
style LSPA fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000

SC --> FFR
FFR --> LSPA
LSPA --> RQ
RQ -->|No| DS
DS --> UBM
UBM --> UTD
RQ -->|Yes| TFA
TFA --> AF
AF -->|No| LS
LS --> UBM
AF -->|Yes| DSAPI
DSAPI --> UBM

@josh-chamberlain
Contributor

@maxachis hmm, it depends how much talking back and forth we want to do with LabelStudio. To be clear about terminology, I meant "pre-label" as in, when the user opens the labeling interface, the option is already selected as our best guess. I don't think we need to send it to label studio any sooner.

So, if the pipeline thinks the URL is relevant, there's a decent chance now that it will be wrong... but if it also thinks it found an agency for it, it's probably worth submitting to the database! And then if not, it can go to label studio.

I think the proportion of URLs that take the happy path to submission will be quite small at first, and ideally grow.
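
A "pre-label" in the sense described above could be sent as a task with an attached prediction, so that the model's guess is already selected when the labeler opens it. The sketch below is modeled on Label Studio's JSON format for tasks with predictions. The `from_name`, `to_name`, choice strings, and `model_version` values are assumptions that would have to match the project's actual labeling configuration, and `build_prelabeled_task` is a hypothetical helper.

```python
from typing import Any, Dict


def build_prelabeled_task(
    url: str,
    relevant: bool,
    score: float,
    from_name: str = "relevance",  # assumed name of the choices control tag
    to_name: str = "url",  # assumed name of the object tag being labeled
    model_version: str = "url-relevance",  # assumed version label
) -> Dict[str, Any]:
    """Build a task dict whose prediction pre-selects the model's guess."""
    # Choice strings are assumptions; they must match the labeling config.
    choice = "Relevant" if relevant else "Not Relevant"
    return {
        "data": {"url": url},
        "predictions": [
            {
                "model_version": model_version,
                "score": score,
                "result": [
                    {
                        "from_name": from_name,
                        "to_name": to_name,
                        "type": "choices",
                        "value": {"choices": [choice]},
                    }
                ],
            }
        ],
    }
```

The human labeler can still confirm or override the pre-selected choice; the prediction is only a starting point.
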

@maxachis
Collaborator Author

maxachis commented Jan 9, 2025

@josh-chamberlain Got it, so in that case, nothing at this point is discarded based on relevancy. We keep track of the model's tentative rating of relevant or not relevant, and use that tentative rating as the pre-annotation.

---
title: Early Workflow
---
flowchart TD
  
SC[Source Collector]
FFR[🤗 Get Tentative Relevance Classification]
DS((🗑️Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]
RQ[Relevant?]

style LS fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000

SC --> FFR
FFR --> TFA
DS --> UBM
UBM --> UTD
TFA --> AF
AF -->|No| LS
LS --> RQ
RQ --> |No| DS
RQ --> |Yes| DSAPI
AF -->|Yes| DSAPI
DSAPI --> UBM
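
The routing in the flowchart above can also be written as a small decision function. This is a sketch only: the `Step` enum and `next_step` function are hypothetical names, and the tentative model rating is deliberately not an input, because in this workflow it serves only as the pre-annotation and never causes a discard by itself.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    LABEL_STUDIO = "Send to Label Studio"
    DATA_SOURCES_API = "Send to Data Sources API for Approval"
    DISCARD = "Discard"


def next_step(
    agency_found: bool, human_says_relevant: Optional[bool] = None
) -> Step:
    """Return where a URL goes next, following the flowchart.

    `human_says_relevant` is None until a labeler has answered
    the "Relevant?" question in Label Studio.
    """
    if agency_found:
        # Agency found: go straight to the Data Sources API.
        return Step.DATA_SOURCES_API
    if human_says_relevant is None:
        # No agency and no human decision yet: ask a labeler.
        return Step.LABEL_STUDIO
    # The human decision settles it.
    return Step.DATA_SOURCES_API if human_says_relevant else Step.DISCARD
```

In the flowchart, both the discard outcome and the Data Sources API outcome then feed into updating the batch metadata and the training dataset.
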


@josh-chamberlain
Contributor

@maxachis nice, true! it feels like agency is most critical, so this makes more intuitive sense as well.
