Pipeline: Fresh URLs into Huggingface Relevancy ML Model #133
@josh-chamberlain So I may be missing some details, as my familiarity with HuggingFace is very fresh, but it doesn't seem like the relevancy model is ready for prime time just yet. I tried putting a few URLs into the relevancy pipeline and got results barely above 0.5, which doesn't seem like sufficient confidence for us to use it:
Here was my simple widdle test I used to validate this:

from transformers import pipeline

def test_relevancy_pipeline():
    pipe = pipeline("text-classification", model="PDAP/url-relevance")
    urls = ["example.com", "police.com", "https://coloradosprings.gov/police-department/article/news/i-25-traffic-safety-deployment-after-stop"]
    results = pipe(urls)
    for result, url in zip(results, urls):
        print(f"{url}: {result}")

The repository also seems scant -- within the This suggests to me that the next thing to flesh out is the label studio pipeline! 👆
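The "sufficient confidence" judgment above could be made explicit with a small threshold check. This is only a sketch: it assumes each result is a dict with `label` and `score` keys (the shape the `transformers` text-classification pipeline returns), and both the 0.8 cutoff and the helper names are hypothetical, not values agreed on in this thread.

```python
from typing import Iterable, List, Mapping, Tuple

# Hypothetical cutoff; nobody in this thread has settled on a value.
CONFIDENCE_THRESHOLD = 0.8


def is_confident(result: Mapping[str, object], threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Return True when the predicted label's score meets the threshold."""
    return float(result["score"]) >= threshold


def split_by_confidence(
    urls: Iterable[str],
    results: Iterable[Mapping[str, object]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[List[str], List[str]]:
    """Split URLs into (confident, uncertain) lists based on model score."""
    confident: List[str] = []
    uncertain: List[str] = []
    for url, result in zip(urls, results):
        if is_confident(result, threshold):
            confident.append(url)
        else:
            uncertain.append(url)
    return confident, uncertain
```

With scores barely above 0.5, every URL would land in the uncertain list, which matches the impression that the model output should not be trusted on its own yet.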
@maxachis that confidence does seem low. this board implies the model has decently high accuracy: https://huggingface.co/PDAP/url-relevance/tensorboard
though, until the accuracy gets higher still, we should probably still be importing these as pre-labels for human users to confirm/deny.
training URLs should be here, around 4300 of them: https://huggingface.co/datasets/PDAP/training-urls
they are labeled for both relevance and record type. the idea is that this includes anything in our database plus anything we have labeled separately.
@josh-chamberlain Got it, so in that case, we may need to modify our existing workflow. Does this make sense?
---
title: Early Workflow
---
flowchart TD
SC[Source Collector]
FFR[ML: Filter For Relevance]
RQ(Relevant?)
DS((Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
LSPA[Send Pre-Annotation to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]
style LS fill:#d2b48c, color:#000
style LSPA fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000
SC --> FFR
FFR --> LSPA
LSPA --> RQ
RQ -->|No| DS
DS --> UBM
UBM --> UTD
RQ -->|Yes| TFA
TFA --> AF
AF -->|No| LS
LS --> UBM
AF -->|Yes| DSAPI
DSAPI --> UBM
@maxachis hmm, it depends how much talking back and forth we want to do with LabelStudio. To be clear about terminology, I meant "pre-label" as in, when the user opens the labeling interface, the option is already selected as our best guess. I don't think we need to send it to label studio any sooner. So, if the pipeline thinks the URL is relevant, there's a decent chance now that it will be wrong... but if it also thinks it found an agency for it, it's probably worth submitting to the database! And then if not, it can go to label studio. I think the proportion of URLs that take the happy path to submission will be quite small at first, and ideally grow.
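A "pre-label" in this sense could be expressed as a Label Studio task that carries a prediction alongside the URL. The sketch below follows Label Studio's JSON shape for pre-annotated tasks (`data` plus a `predictions` list); the `from_name` and `to_name` values, the `url` data key, the label strings, and the function name are all assumptions that would have to match whatever labeling config the project actually uses.

```python
from typing import Any, Dict


def build_preannotated_task(
    url: str,
    label: str,
    score: float,
    model_version: str = "url-relevance",  # assumed version tag
) -> Dict[str, Any]:
    """Build a Label Studio task dict with the model's guess pre-selected."""
    return {
        # "url" is an assumed data key; it must match the labeling config.
        "data": {"url": url},
        "predictions": [
            {
                "model_version": model_version,
                "score": score,
                "result": [
                    {
                        # Assumed control/object tag names from the labeling config.
                        "from_name": "relevance",
                        "to_name": "url",
                        "type": "choices",
                        "value": {"choices": [label]},
                    }
                ],
            }
        ],
    }
```

When a task like this is imported, the annotator would see the predicted choice already selected and only need to confirm or change it.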
@josh-chamberlain Got it, so in that case, nothing at this point is discarded based on relevancy. We keep track of its tentative rating of relevant or not relevant, and use that tentative score as the pre-annotation.
---
title: Early Workflow
---
flowchart TD
SC[Source Collector]
FFR[🤗 Get Tentative Relevance Classification]
DS((🗑️Discard))
TFA[Try to Find Agency]
AF(Agency Found?)
LS[Send to Label Studio]
DSAPI[Send to Data Sources API for Approval]
UBM[Update Batch Metadata]
UTD[🤗Update Training Dataset]
RQ[Relevant?]
style LS fill:#d2b48c, color:#000
style UBM fill:#2aa621, color:#000
style FFR fill:#75018a, color:#000
style TFA fill:#ba5400, color:#000
style DSAPI fill:#003eba, color:#000
style SC fill:#00bab7, color:#000
style UTD fill:#fbbc39, color:#000
SC --> FFR
FFR --> TFA
DS --> UBM
UBM --> UTD
TFA --> AF
AF -->|No| LS
LS --> RQ
RQ --> |No| DS
RQ --> |Yes| DSAPI
AF -->|Yes| DSAPI
DSAPI --> UBM
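The routing in this revised flowchart could be sketched roughly as below. The class, enum, and function names are hypothetical; the only logic taken from the diagram is that an agency match goes straight to the Data Sources API, everything else goes to Label Studio, and discarding happens only after a human marks the URL as not relevant.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Destination(Enum):
    DATA_SOURCES_API = "data_sources_api"
    LABEL_STUDIO = "label_studio"
    DISCARD = "discard"


@dataclass
class UrlCandidate:
    url: str
    # Model's tentative guess; kept only as a pre-annotation, never used to discard.
    tentative_relevant: Optional[bool] = None
    # Hypothetical identifier of the matched agency, None when no match was found.
    agency_id: Optional[int] = None


def route(candidate: UrlCandidate) -> Destination:
    """First routing step: agency found goes to the API, otherwise to a human."""
    if candidate.agency_id is not None:
        return Destination.DATA_SOURCES_API
    return Destination.LABEL_STUDIO


def route_after_labeling(human_says_relevant: bool) -> Destination:
    """Second step, after a human reviews the URL in Label Studio."""
    if human_says_relevant:
        return Destination.DATA_SOURCES_API
    return Destination.DISCARD
```

Note that `tentative_relevant` is deliberately not consulted in `route`, which reflects the point that nothing is discarded on the model's say-so at this stage.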
@maxachis nice, true! it feels like agency is most critical, so this makes more intuitive sense as well.
Now that we've started accumulating a number of fresh baby URLs, we need to decide which are actually relevant for our needs and which should be discarded.
- Rename `url_metadata` to `collector_metadata` in the `URL` table -- The entire `URL` row is arguably metadata for the URL, so we should be specific as to the source of metadata.
- Add a `relevant` boolean column to the `URL` table, initialized to `null`. This will be used to identify whether a URL is considered relevant or not.
- Send URLs with `pending` status and `null` relevancy to the Huggingface ML model. Based on the response received from them, update the `relevant` column accordingly.
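The column addition and the classification step described in this issue could be sketched as follows. This uses SQLite purely for illustration; the table name `url`, the columns `id`, `url`, and `status`, and the `classify` callable standing in for the Huggingface model are all assumptions rather than the project's actual schema or code.

```python
import sqlite3
from typing import Callable


def add_relevant_column(conn: sqlite3.Connection) -> None:
    """Add a nullable `relevant` column; existing rows start as NULL."""
    conn.execute("ALTER TABLE url ADD COLUMN relevant BOOLEAN DEFAULT NULL")
    conn.commit()


def classify_pending(conn: sqlite3.Connection, classify: Callable[[str], bool]) -> int:
    """Classify every pending URL whose relevance is still NULL.

    `classify` is a stand-in for a call to the relevance model.
    Returns the number of rows updated.
    """
    rows = conn.execute(
        "SELECT id, url FROM url WHERE status = 'pending' AND relevant IS NULL"
    ).fetchall()
    for url_id, url in rows:
        conn.execute(
            "UPDATE url SET relevant = ? WHERE id = ?",
            (int(classify(url)), url_id),
        )
    conn.commit()
    return len(rows)
```

Keeping `relevant` nullable makes it possible to tell "not yet classified" apart from "classified as not relevant", which is what lets the job be re-run safely on new batches.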