
Commit b042693

Update readmes, fix typing issue breaking older python, update weblinx.eval (#8)
* change type hint to be compatible with older pythons
* update a few instructions, including flash attention
* update title
* ignore more
* Remove reference to webtasks, make uid_key generic (instead of hardcoding "data-webtasks-id")
* ignore checkpoints
* Fix reference to candidates.path
* Fix naming in splits.json data file
* improve command line for weblinx.eval CLI
* Update modeling instructions
* Update main readme with examples
* Final readme update
1 parent 98a0220 commit b042693

15 files changed: +149 -94 lines

.gitignore

+5
@@ -159,3 +159,8 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# modeling specific
+modeling/logs/
+modeling/wl_data/
+modeling/checkpoints/

README.md

+3-1
@@ -75,7 +75,9 @@ To run the automatic evaluation, you can use the following command:
 python -m weblinx.eval --help
 ```

-Note: We are still working on the code for `weblinx.eval` and `weblinx.processing.outputs`. If you have any questions or would like to contribute docs, please feel free to open an issue or a pull request.
+For more examples on how to use `weblinx.eval`, take a look at the [modeling README](./modeling/README.md).
+
+> Note: We are still working on the code for `weblinx.eval` and `weblinx.processing.outputs`. If you have any questions or would like to contribute docs, please feel free to open an issue or a pull request.

 ### Citations

modeling/README.md

+80-49
@@ -1,42 +1,35 @@
 The following instructions assume you are running from this directory (you may need to `cd` to this directory).

-### Download Candidates
+### Download Data

-First, you need to download the `train.jsonl` candidate selected by `McGill-NLP/MiniLM-L6-DMR`:
+First, you need to download the `splits.json` file containing information about all the splits, as well as the `train.jsonl` candidate selected by `McGill-NLP/MiniLM-L6-DMR`:

 ```python
 from huggingface_hub import snapshot_download

+# splits.json
+snapshot_download(
+    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
+)
+
+# candidates files
 snapshot_download(
     repo_id="McGill-NLP/WebLINX-full",
     repo_type="dataset",
     allow_patterns="candidates/*.jsonl",
-    local_dir="./"
+    local_dir="./wl_data/"
 )
 ```

-Download entire dataset:
+Download the full dataset (warning: this will take a while):

 ```python
 from huggingface_hub import snapshot_download

 snapshot_download(repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data/")
-
-# If you only want the splits.json file, you can just run:
-snapshot_download(
-    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
-)
-
-# If you only want candidates:
-snapshot_download(
-    repo_id="McGill-NLP/WebLINX-full",
-    repo_type="dataset",
-    allow_patterns="candidates/*.jsonl",
-    local_dir="./wl_data/"
-)
 ```

-The default configs (`config.yml`) assume that the `train.jsonl` is located at `./candidates/train.jsonl`. If you want to change the path, you need to modify the `config.yml` accordingly.
+The default configs (`llama/conf/config.yml`) assume that the `train.jsonl` is located at `./wl_data/candidates/train.jsonl`. If you want to change the path, you need to modify the `config.yml` accordingly.

 ### Set `WEBLINX_PROJECT_DIR`

@@ -57,7 +50,54 @@ You need to install the dependencies by running the following command:
 pip install -r requirements.txt
 ```

-### Action Model: LLaMA
+However, since `flash-attention` requires `torch` to be pre-installed, it has to be installed right after everything else:
+```bash
+# Regular install
+pip install "flash-attn>=2.3.0"
+# If you have limited RAM, you can try this:
+MAX_JOBS=4 pip install "flash-attn>=2.3.0" --no-build-isolation
+# If you have issues with nvcc, try this:
+FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install "flash-attn>=2.3.0" --no-build-isolation
+```
+
+### Dense Markup Ranking (DMR)
+
+#### Train DMR
+
+You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
+
+```bash
+export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
+
+# Finetune MiniLM-L6-DMR (Default)
+python -m dmr.train
+
+# Finetune variant gte or bge
+python -m dmr.train +variant=gte
+python -m dmr.train +variant=bge
+```
+
+Results will be saved in `./results` and checkpoints in `./checkpoints`.
+
+#### Inference for DMR
+
+You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:
+
+```bash
+export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
+
+# On just one split
+python -m dmr.eval eval.split=valid
+
+# On multiple splits (e.g. test_iid, test_vis)
+python -m dmr.eval eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+
+# Or for bge, gte
+python -m dmr.eval +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+python -m dmr.eval +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+```
+
+### Action Model

 #### Train LLaMA

@@ -84,7 +124,7 @@ accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_13b.yaml -m lla
 Results will be saved in `./results` and checkpoints in `./checkpoints`.


-### Evaluate LLaMA
+#### Run LLaMA on Evaluation Splits

 You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:

@@ -98,39 +138,30 @@ python -m llama.eval +variant="ft_1.3b" eval.split=valid
 python -m llama.eval -m +variant="ft_2.7b" eval.split=test_iid,test_web,test_geo,test_cat,test_vis
 ```

-### Dense Markup Ranking (DMR)
-
-#### Train DMR
+### Evaluation

-You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
+To run the evaluation metrics, you can use the following command (from this directory):

 ```bash
-export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
-
-# Finetune MiniLM-L6-DMR (Default)
-python -m dmr.train
-
-# Finetune variant gte or bge
-python -m dmr.train +variant=gte
-python -m dmr.train +variant=bge
+python -m weblinx.eval -d results -b ./wl_data/demonstrations
 ```

-Results will be saved in `./results` and checkpoints in `./checkpoints`.
-
-#### Evaluate DMR
-
-You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:
-
-```bash
-export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
-
-# On just one
-python -m dmr.eval eval.split=valid
+In this case, `-b` is the base directory for the demonstrations, and `-d` is the directory containing the results (generated above by the `llama.eval` script). This will automatically run the evaluation metrics and save the results in the `results/aggregated_scores.json` file. If you are only interested in the overall score for a split (e.g. `valid`), you can look for the following entry in the aggregated score file (as an example):
+
+```json
+// ...
+{
+    "split": "valid",
+    "intent": "overall",
+    "metric": "overall",
+    "model_name": "princeton-nlp/Sheared-LLaMA-1.3B",
+    "project_name": "llama_ft",
+    "score": 0.21667765869744438,
+    "unconditional_score": 0.15307513104251605
+},
+// ...
+```

-# On multiple splits (e.g. test_iid, test_vis)
-python -m dmr.eval eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+Behind the scenes, this uses the `weblinx.eval.auto_eval_and_save` function to run the evaluation metrics. If you want more control, you can call `weblinx.eval.auto_eval_and_save` directly; for an example, check out `weblinx/eval/__main__.py`.

-# Or for bge, gte
-python -m dmr.eval +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
-python -m dmr.eval +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
-```
+Note that it might be slow the first time you run it, because it reads a lot of demonstrations and loads millions of files. However, a demo-level cache is automatically created (see `./.cache/demonstrations`), so the next time you run it, it should be much faster.
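Side note: the aggregated score file shown above is plain JSON, so pulling the overall number for one split out of it only takes the standard library. The sketch below is a minimal, illustrative example that assumes `results/aggregated_scores.json` is a list of entries shaped like the one in the README hunk above; the filtering logic is not part of the repository.

```python
import json

# Assumption: the file is a JSON list of entries like the example above,
# each with "split", "intent", "metric", "model_name", and "score" fields.
with open("results/aggregated_scores.json") as f:
    scores = json.load(f)

# Keep only the overall score for the "valid" split.
overall = [
    s for s in scores
    if s["split"] == "valid" and s["intent"] == "overall" and s["metric"] == "overall"
]
for entry in overall:
    print(entry["model_name"], entry["score"])
```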

modeling/llama/conf/config.yaml

+1-1
@@ -43,7 +43,7 @@ candidates:
   project_name: dmr # unused but potentially useful
   split: ${eval.split}
   train_path: ${project_dir}/wl_data/candidates/train.jsonl
-  path: ${project_dir}/wl_data/candidates/${split}.jsonl
+  path: ${project_dir}/wl_data/candidates/${candidates.split}.jsonl

 hydra:
   run:
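A note on this one-line fix: Hydra configs use OmegaConf interpolation, and `${split}` is resolved from the root of the config rather than relative to the `candidates` node, so it only works if a top-level `split` key exists. The sketch below is illustrative only (shortened paths, not the repository's full config) and assumes standard OmegaConf semantics; it shows how the corrected `${candidates.split}` reference resolves through `eval.split`.

```python
from omegaconf import OmegaConf

# Illustrative config fragment (not the repo's full config).
yaml_cfg = """
eval:
  split: valid
candidates:
  split: ${eval.split}
  path: wl_data/candidates/${candidates.split}.jsonl
"""

cfg = OmegaConf.create(yaml_cfg)
# ${candidates.split} points at the sibling key above, which in turn interpolates ${eval.split}.
print(cfg.candidates.path)  # -> wl_data/candidates/valid.jsonl
```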

modeling/llama/eval.py

+6-6
@@ -15,10 +15,10 @@
 )
 from transformers.pipelines.pt_utils import KeyDataset

-import webtasks as wt
-from webtasks.processing import load_candidate_elements
-from webtasks.processing.prompt import build_input_records_from_selected_turns, select_turns_and_candidates_for_prompts
-from webtasks.utils.hydra import save_path_to_hydra_logs
+import weblinx as wl
+from weblinx.processing import load_candidate_elements
+from weblinx.processing.prompt import build_input_records_from_selected_turns, select_turns_and_candidates_for_prompts
+from weblinx.utils.hydra import save_path_to_hydra_logs

 from .processing import (
     build_prompt_records_for_llama_truncated,
@@ -48,8 +48,8 @@ def main(cfg):
     tokenizer.pad_token = tokenizer.eos_token

     # Data loading
-    demo_names = wt.utils.load_demo_names_in_split(split_path, split=split)
-    demos = [wt.Demonstration(name, base_dir=cfg.data.base_dir) for name in demo_names]
+    demo_names = wl.utils.load_demo_names_in_split(split_path, split=split)
+    demos = [wl.Demonstration(name, base_dir=cfg.data.base_dir) for name in demo_names]

     format_intent = build_formatter_for_multichoice()
     build_prompt_records_fn = partial(

modeling/requirements.txt

+1-2
@@ -19,5 +19,4 @@ coloredlogs
 sacrebleu
 bert-score
 packaging
-ninja
-flash-attn>=2.3.0
+ninja

weblinx/__init__.py

+1-6
@@ -89,7 +89,7 @@ def replay(self) -> dict:

         Note
         ----
-        If you want a `webtasks.Replay` object, call `webtasks.Replay.from_demonstration(demo)`.
+        If you want a `Replay` object, call `Replay.from_demonstration(demo)`.
         """
         return self.load_json("replay.json")

@@ -292,8 +292,6 @@ def join(self, *args) -> Path:
         return self.path.joinpath(*args)


-# Example of a turn:
-# {'type': 'browser', 'timestamp': 58.901999950408936, 'state': {'screenshot': 'screenshot-3-0.png', 'page': 'page-4-0.html'}, 'action': {'intent': 'click', 'arguments': {'metadata': {'mouseX': 214, 'mouseY': 245, 'tabId': 102448212, 'timestamp': 1684736360511, 'url': 'https://www.google.com/search?q=google+scholar&oq=google+scholar&aqs=chrome..69i57j35i39i650j0i433i650j0i512j0i433i512l2j0i131i433i512j5.10083j0j4&sourceid=chrome&ie=UTF-8', 'viewportHeight': 714, 'viewportWidth': 1536, 'zoomLevel': 1.25}, 'properties': {'altKey': False, 'button': 0, 'buttons': 1, 'clientX': 267.5, 'clientY': 306.25, 'composed': True, 'ctrlKey': False, 'detail': 1, 'eventPhase': 0, 'layerX': 19, 'layerY': 63, 'metaKey': False, 'movementX': 0, 'movementY': 0, 'offsetX': 23.75, 'offsetY': 31.25, 'pageX': 267.5, 'pageY': 306.25, 'returnValue': True, 'screenX': 267.5, 'screenY': 435.0, 'shiftKey': False, 'timeStamp': 2889.2999999970198, 'x': 267.5, 'y': 306.25}, 'element': {'attributes': {'class': 'LC20lb MBeuO DKV0Md', 'data-webtasks-id': 'ba92f02d-debb-4985'}, 'bbox': {'bottom': 314.1718864440918, 'height': 38.75, 'left': 244.0625, 'right': 416.39062881469727, 'top': 275.4218864440918, 'width': 172.32812881469727, 'x': 244.0625, 'y': 275.4218864440918}, 'innerHTML': 'Google Scholar', 'outerHTML': '<h3 class="LC20lb MBeuO DKV0Md" data-webtasks-id="ba92f02d-debb-4985">Google Scholar</h3>', 'tagName': 'H3', 'textContent': 'Google Scholar', 'xpath': 'id("rso")/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/a[1]/h3[1]'}}, 'event_id': 4, 'element_html': '<h3 class="LC20lb MBeuO DKV0Md" data-webtasks-id="ba92f02d-debb-4985">Google Scholar</h3>', 'screenshot_effect': None}}
 class Turn(dict):
     def __init__(
         self,
@@ -981,9 +979,6 @@ def get_xpaths_dict(
     return xpaths


-# EXAMPLE OF REPLAY
-#
-# {'type': 'browser', 'timestamp': 47.28099989891052, 'state': {'screenshot': 'screenshot-6-0.png', 'page': 'page-6-0.html'}, 'action': {'intent': 'click', 'arguments': {'metadata': {'mouseX': 390, 'mouseY': 505, 'tabId': 2011623910, 'timestamp': 1685102753001, 'url': 'https://www.thefork.com/', 'viewportHeight': 657, 'viewportWidth': 1366, 'zoomLevel': 1}, 'properties': {'altKey': False, 'button': 0, 'buttons': 1, 'clientX': 390, 'clientY': 505, 'composed': True, 'ctrlKey': False, 'detail': 1, 'eventPhase': 0, 'layerX': 203, 'layerY': 27, 'metaKey': False, 'movementX': 0, 'movementY': 0, 'offsetX': 204, 'offsetY': 28, 'pageX': 390, 'pageY': 505, 'returnValue': True, 'screenX': 390, 'screenY': 576, 'shiftKey': False, 'timeStamp': 21110, 'x': 390, 'y': 505}, 'element': {'attributes': {'class': 'css-m080s5 ektx8jp0', 'data-test': 'search-form-submit-button', 'data-testid': 'search-form-submit-button', 'data-webtasks-id': '20fbac1a-3c62-475a', 'display': 'block', 'type': 'submit', 'width': '100%'}, 'bbox': {'bottom': 523.328125, 'height': 46, 'left': 186.5, 'right': 588.5, 'top': 477.328125, 'width': 402, 'x': 186.5, 'y': 477.328125}, 'innerHTML': 'Search', 'outerHTML': '<button width="100%" type="submit" data-test="search-form-submit-button" data-testid="search-form-submit-button" display="block" class="css-m080s5 ektx8jp0" data-webtasks-id="20fbac1a-3c62-475a">Search</button>', 'tagName': 'BUTTON', 'textContent': 'Search', 'xpath': 'id("root")/main[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/button[1]'}}, 'event_id': 6, 'element_html': '<button class="css-m080s5 ektx8jp0" data-test="search-form-submit-button" data-testid="search-form-submit-button" data-webtasks-id="20fbac1a-3c62-475a" display="block" type="submit" width="100%">Search</button>', 'screenshot_effect': None}}
 class Replay:
     """
     A replay is one of the core components of a demonstration. It is a list of turns, each of

weblinx/_data/splits.json

+5-5
@@ -1,5 +1,5 @@
 {
-    "blind": [
+    "test_vis": [
         "fndboyk",
         "thjakvr",
         "nvdejnk",
@@ -445,7 +445,7 @@
         "zoeiakj",
         "zwdzqmo"
     ],
-    "geography": [
+    "test_geo": [
         "brzunzn",
         "xmjfzvn",
         "bxnoafu",
@@ -737,7 +737,7 @@
         "svbtpbx",
         "zdnetmz"
     ],
-    "subcategory": [
+    "test_cat": [
         "qqrjzop",
         "ucjhfyp",
         "dtiyhkm",
@@ -962,7 +962,7 @@
         "kngasft",
         "pdlfqli"
     ],
-    "website": [
+    "test_web": [
         "apajlpi",
         "qqhbegt",
         "tlkvkmk",
@@ -2450,7 +2450,7 @@
         "kaxzpgm",
         "ejryoez"
     ],
-    "indomain": [
+    "test_iid": [
         "scicrdo",
         "iszaysr",
         "tdzkbmv",

weblinx/eval/__init__.py

+6-6
@@ -80,7 +80,7 @@ def run_evaluation(processed_results, metrics, num_turns=1):
     return scores


-def validate_reference_action(ref_action, next_turn):
+def validate_reference_action(ref_action, next_turn, uid_key="data-webtasks-id"):
     """
     This verifies that the reference action is valid for evaluation. This means:
     - The reference action is not None
@@ -101,11 +101,11 @@ def validate_reference_action(ref_action, next_turn):

     # If the reference action does not have an elemnt or an attributes field,
     # then we cannot evaluate. If it does have an attributes field, then we need
-    # to check if it has a data-webtasks-id field in it
+    # to check if it has a <uid> field in it
     if ref_action.get("element") is None or ref_action.get("element") is None:
         return False

-    if ref_action["element"]["attributes"].get("data-webtasks-id") is None:
+    if ref_action["element"]["attributes"].get(uid_key) is None:
         return False

     if next_turn is not None and ref_action["intent"] == "click":
@@ -117,11 +117,11 @@ def validate_reference_action(ref_action, next_turn):
            return False

        # If next turn is a submit intent, then we only keep the current click
-        # intent if the submit intent has a data-webtasks-id that is different
+        # intent if the submit intent has a <uid> that is different
        # from the current click intent
        if next_turn.intent == "submit":
-            next_uid = next_turn.element["attributes"]["data-webtasks-id"]
-            cur_uid = ref_action["element"]["attributes"]["data-webtasks-id"]
+            next_uid = next_turn.element["attributes"][uid_key]
+            cur_uid = ref_action["element"]["attributes"][uid_key]

            if next_uid == cur_uid:
                print(
