Features/context #109

Merged
merged 33 commits into from
Feb 20, 2024
Changes from 1 commit
051ace1
Add base dataset
steffencruz Feb 19, 2024
94cda65
Add selector class
steffencruz Feb 19, 2024
5d40ff7
Add wiki datasets (date and normal)
steffencruz Feb 19, 2024
41f1615
Add context class
steffencruz Feb 19, 2024
481632b
Add mock dataset
steffencruz Feb 19, 2024
17873e9
Add code dataset
steffencruz Feb 19, 2024
a6e7b6a
Add math dataset
steffencruz Feb 19, 2024
b2c8425
Add init
steffencruz Feb 19, 2024
3768036
Remove old monolithic dataset file
steffencruz Feb 19, 2024
9fcf494
Update submodule init
steffencruz Feb 19, 2024
ae9769a
Refactor QA task to use new context class, and cleanup
steffencruz Feb 19, 2024
849ab97
Refactor summarization task to use new context class, and cleanup
steffencruz Feb 19, 2024
7171a28
Update base task so that context can be unpacked into state dict
steffencruz Feb 19, 2024
d5ceec4
Refactor date QA task to use new context class, and cleanup
steffencruz Feb 19, 2024
9824cb5
Refactor math task to use new context class, and cleanup
steffencruz Feb 19, 2024
edb0a39
Refactor debugging task to use new context class, and cleanup
steffencruz Feb 19, 2024
b0cc7da
Add TASKS list in submodule init
steffencruz Feb 19, 2024
1f6b2af
Add MaxRetryError exception class
steffencruz Feb 19, 2024
b37b110
Catch MaxRetryError and continue validation
steffencruz Feb 19, 2024
7682f76
Update dependencies: synapse fork of mathgenerator and wiki sections
steffencruz Feb 19, 2024
bfdf7b3
Update fixtures for dataset tests to use updated dataset and context …
steffencruz Feb 19, 2024
73ed922
Update tests for dataset and context
steffencruz Feb 19, 2024
5aa8673
Update tests for tasks
steffencruz Feb 19, 2024
5dca2dd
Add pre-staging to workflows
steffencruz Feb 19, 2024
e282a27
Fix dataset name typos
steffencruz Feb 19, 2024
5543e90
Import REWARD_MODELS dict from pipeline for global access to reward m…
steffencruz Feb 19, 2024
a57a491
Import TASKS from tasks submodule
steffencruz Feb 19, 2024
a58c973
Remove redundant args
steffencruz Feb 19, 2024
9694848
Remove redundant args
steffencruz Feb 19, 2024
9229eac
Remove redundant args
steffencruz Feb 19, 2024
0dbd3f0
Add more task fields to tests
steffencruz Feb 19, 2024
77ee3ed
Add tests for reward and penalty definitions and make test_task_field…
steffencruz Feb 19, 2024
5112b47
Remove hanging reference to score decay
steffencruz Feb 20, 2024
Add base dataset
steffencruz committed Feb 19, 2024
commit 051ace127165be2aa2efaac6598f28ed75a6a2f9
80 changes: 80 additions & 0 deletions prompting/tools/datasets/base.py
@@ -0,0 +1,80 @@
# The MIT License (MIT)
# Copyright © 2024 Yuma Rao
# Copyright © 2023 Opentensor Foundation

# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the “Software”), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all copies or substantial portions of
# the Software.

# THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
# THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

import time
from abc import ABC, abstractmethod
from typing import Dict
import bittensor as bt

from ..selector import Selector
from .context import Context


class Dataset(ABC):
    """Base class for datasets."""

    max_tries: int = 10

    @abstractmethod
    def search(self, name):
        ...

    @abstractmethod
    def random(self, name):
        ...

    @abstractmethod
    def get(self, name):
        ...
    def next(self, method: str = 'random', selector: Selector = Selector(), **kwargs) -> Context:
        tries = 1
        t0 = time.time()

        while True:

            # TODO: Multithread the get method so that we don't have to suffer nonexistent pages
            info = {}
            if method == 'random':
                info = self.random(selector=selector, **kwargs)
            elif method == 'search':
                info = self.search(selector=selector, **kwargs)
            elif method == 'get':
                info = self.get(selector=selector, **kwargs)
            else:
                raise ValueError(f"Unknown dataset get method {method!r}")

            if info:
                break

            if tries >= self.max_tries:
                raise Exception(
                    f"Could not find a sample which meets {self.__class__.__name__} requirements after {tries} tries."
                )

            bt.logging.warning(f"Could not find a sample which meets {self.__class__.__name__} requirements after {tries} tries. Retrying... ({self.max_tries - tries} tries remaining.)")
            tries += 1

        info['stats'] = {
            'creator': self.__class__.__name__,
            'fetch_time': time.time() - t0,
            'num_tries': tries,
            'fetch_method': method,
            'next_kwargs': kwargs
        }
        return Context(**info)
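
For reviewers who want to exercise the retry behaviour of `next` without installing `bittensor` or the `prompting` package, here is a minimal standalone sketch. It is not code from this PR. The `Context` dataclass, its `title` and `content` fields, `FlakyDataset`, and the use of `RuntimeError` are all hypothetical stand-ins, and the base class is reduced to the `random` path only.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Context:
    """Hypothetical stand-in for the PR's Context class; the real fields may differ."""
    title: str
    content: str
    stats: Dict = field(default_factory=dict)


class Dataset(ABC):
    """Simplified version of the base class: 'random' only, no bittensor logging."""

    max_tries: int = 10

    @abstractmethod
    def random(self, **kwargs) -> Optional[Dict]:
        ...

    def next(self, **kwargs) -> Context:
        t0 = time.time()
        # Attempt up to max_tries fetches and stop at the first non-empty result.
        for tries in range(1, self.max_tries + 1):
            info = self.random(**kwargs)
            if info:
                info["stats"] = {
                    "creator": self.__class__.__name__,
                    "fetch_time": time.time() - t0,
                    "num_tries": tries,
                }
                return Context(**info)
        raise RuntimeError(
            f"No valid sample from {self.__class__.__name__} after {self.max_tries} tries."
        )


class FlakyDataset(Dataset):
    """Hypothetical dataset whose first `failures` fetches return nothing."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def random(self, **kwargs) -> Optional[Dict]:
        self.calls += 1
        if self.calls <= self.failures:
            return None  # simulates a missing or rejected page
        return {"title": "Example", "content": "Some text."}


context = FlakyDataset(failures=2).next()
print(context.stats["num_tries"])  # 3
```

A subclass only has to return a falsy value for a rejected sample. The base class owns the retry budget and attaches the `stats` dict, so each dataset stays free of bookkeeping.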