Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features/context #109

Merged
merged 33 commits into from
Feb 20, 2024
Merged

Features/context #109

merged 33 commits into from
Feb 20, 2024

Conversation

steffencruz
Copy link
Collaborator

@steffencruz steffencruz commented Feb 19, 2024

  • Refactors dataset.py into a submodule with each dataset type having its' own file
  • Adds a base Dataset class which enables context generation via 3 methods
    • get (specific context retrieval, fully deterministic)
    • search (dataset-specific search algorithm, generally deterministic)
    • random (produces a random context, can be seeded)
  • Removes existing requests based wikipedia dataclass in favor of the wiki python api. Closes Improve wikipedia retrieval mechanism #76
  • Adds Selector class and variants, which enable customizable selection from sets of link-like objects
  • Adds Context dataclass which all datasets now return.
    • Enforces consistent schema.
    • Makes development, logging and testing more robust and efficient.
  • Adds MaxRetryError exception class for handling api call errors

@steffencruz steffencruz changed the base branch from features/crawler to pre-staging February 19, 2024 16:52
@steffencruz steffencruz marked this pull request as ready for review February 19, 2024 17:00
@steffencruz
Copy link
Collaborator Author

625 tests passing for each python version :)

@steffencruz
Copy link
Collaborator Author

Tracked experiment

from .base import Dataset
from ..selector import Selector

# DO WE NEED INTERNAL LINKS??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you define the page sections as internal_links, I believe that would be more than enough for our use case. Do we still need this comment?

month = self.rng.randint(1, 12)

max_days = 31 if month in (1, 3, 5, 7, 8, 10, 12) else 30
max_days = max_days if month != 2 else 29
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we are not handling possible leap year scenarios, which could break the code with ValueError: day is out of range for month, even though its probability of happening is small

Copy link
Contributor

@p-ferreira p-ferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting is a bit weird (which will be solved once black PR is merged) but overall this PR seems to add a very nice structure around the tasks and rewards organization, not to mention the dataset modularization + extra refactoring.

The only consideration I have is towards date qa not taking leap years in consideration, which could cause an exception but it wouldn't affect the flow that much considering that we have a task creation retry policy in place.

@p-ferreira p-ferreira mentioned this pull request Feb 20, 2024
@steffencruz steffencruz merged commit 3d7b577 into pre-staging Feb 20, 2024
3 checks passed
@steffencruz steffencruz mentioned this pull request Feb 21, 2024
@mccrindlebrian mccrindlebrian deleted the features/context branch April 16, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants