
Commit b042693

Update readmes, fix typing issue breaking older python, update weblinx.eval (#8)
* change type hint to be compatible with older pythons
* update a few instructions, including flash attention
* update title
* ignore more
* Remove reference to webtasks, make uid_key generic (instead of hardcoding "data-webtasks-id")
* ignore checkpoints
* Fix reference to candidates.path
* Fix naming in splits.json data file
* improve command line for weblinx.eval CLI
* Update modeling instructions
* Update main readme with examples
* Final readme update
1 parent 98a0220 commit b042693

15 files changed: +149 -94 lines

.gitignore

+5
@@ -159,3 +159,8 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# modeling specific
+modeling/logs/
+modeling/wl_data/
+modeling/checkpoints/

README.md

+3-1
@@ -75,7 +75,9 @@ To run the automatic evaluation, you can use the following command:
 python -m weblinx.eval --help
 ```

-Note: We are still working on the code for `weblinx.eval` and `weblinx.processing.outputs`. If you have any questions or would like to contribute docs, please feel free to open an issue or a pull request.
+For more examples on how to use `weblinx.eval`, take a look at the [modeling README](./modeling/README.md).
+
+> Note: We are still working on the code for `weblinx.eval` and `weblinx.processing.outputs`. If you have any questions or would like to contribute docs, please feel free to open an issue or a pull request.

 ### Citations

modeling/README.md

+80-49
@@ -1,42 +1,35 @@
 The following instructions assume you are running from this directory (you may need to `cd` to this directory).

-### Download Candidates
+### Download Data

-First, you need to download the `train.jsonl` candidate selected by `McGill-NLP/MiniLM-L6-DMR`:
+First, you need to download the `splits.json` file containing information about all the splits, as well as the `train.jsonl` candidate selected by `McGill-NLP/MiniLM-L6-DMR`:

 ```python
 from huggingface_hub import snapshot_download

+# splits.json
+snapshot_download(
+    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
+)
+
+# candidates files
 snapshot_download(
     repo_id="McGill-NLP/WebLINX-full",
     repo_type="dataset",
     allow_patterns="candidates/*.jsonl",
-    local_dir="./"
+    local_dir="./wl_data/"
 )
 ```

-Download entire dataset:
+Download the full dataset (warning: this will take a while):

 ```python
 from huggingface_hub import snapshot_download

 snapshot_download(repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data/")
-
-# If you only want the splits.json file, you can just run:
-snapshot_download(
-    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
-)
-
-# If you only want candidates:
-snapshot_download(
-    repo_id="McGill-NLP/WebLINX-full",
-    repo_type="dataset",
-    allow_patterns="candidates/*.jsonl",
-    local_dir="./wl_data/"
-)
 ```

-The default configs (`config.yml`) assume that the `train.jsonl` is located at `./candidates/train.jsonl`. If you want to change the path, you need to modify the `config.yml` accordingly.
+The default configs (`llama/conf/config.yml`) assume that the `train.jsonl` is located at `./wl_data/candidates/train.jsonl`. If you want to change the path, you need to modify the `config.yml` accordingly.

 ### Set `WEBLINX_PROJECT_DIR`

@@ -57,7 +50,54 @@ You need to install the dependencies by running the following command:
 pip install -r requirements.txt
 ```

-### Action Model: LLaMA
+However, since `flash-attention` requires `torch` to be pre-installed, it has to be installed right after everything else:
+```bash
+# Regular install
+pip install "flash-attn>=2.3.0"
+# If you have limited RAM, you can try this:
+MAX_JOBS=4 pip install "flash-attn>=2.3.0" --no-build-isolation
+# If you have issues with nvcc, try this:
+FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install "flash-attn>=2.3.0" --no-build-isolation
+```
+
+### Dense Markup Ranking (DMR)
+
+#### Train DMR
+
+You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
+
+```bash
+export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
+
+# Finetune MiniLM-L6-DMR (Default)
+python -m dmr.train
+
+# Finetune variant gte or bge
+python -m dmr.train +variant=gte
+python -m dmr.train +variant=bge
+```
+
+Results will be saved in `./results` and checkpoints in `./checkpoints`.
+
+#### Inference for DMR
+
+You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:
+
+```bash
+export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
+
+# On just one split
+python -m dmr.eval eval.split=valid
+
+# On multiple splits (e.g. test_iid, test_vis)
+python -m dmr.eval eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+
+# Or for bge, gte
+python -m dmr.eval +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+python -m dmr.eval +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+```
+
+### Action Model

 #### Train LLaMA

@@ -84,7 +124,7 @@ accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_13b.yaml -m lla
 Results will be saved in `./results` and checkpoints in `./checkpoints`.


-### Evaluate LLaMA
+#### Run LLaMA on Evaluation Splits

 You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:

@@ -98,39 +138,30 @@ python -m llama.eval +variant="ft_1.3b" eval.split=valid
 python -m llama.eval -m +variant="ft_2.7b" eval.split=test_iid,test_web,test_geo,test_cat,test_vis
 ```

-### Dense Markup Ranking (DMR)
-
-#### Train DMR
+### Evaluation

-You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
+To run the evaluation metrics, you can use the following command (from this directory):

 ```bash
-export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
-
-# Finetune MiniLM-L6-DMR (Default)
-python -m dmr.train
-
-# Finetune variant gte or bge
-python -m dmr.train +variant=gte
-python -m dmr.train +variant=bge
+python -m weblinx.eval -d results -b ./wl_data/demonstrations
 ```

-Results will be saved in `./results` and checkpoints in `./checkpoints`.
-
-#### Evaluate DMR
-
-You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `iid` split, you can run the following command:
-
-```bash
-export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use
-
-# On just one
-python -m dmr.eval eval.split=valid
+In this case, `-b` is the base directory for the demonstrations, and `-d` is the directory containing the results (generated above by the `llama.eval` script). This will automatically run the evaluation metrics and save the results in the `results/aggregated_scores.json` file. If you are only interested in the overall score for a split (e.g. `valid`), you can look for the following entry in the aggregated score file (as an example):
+
+```json
+// ...
+{
+    "split": "valid",
+    "intent": "overall",
+    "metric": "overall",
+    "model_name": "princeton-nlp/Sheared-LLaMA-1.3B",
+    "project_name": "llama_ft",
+    "score": 0.21667765869744438,
+    "unconditional_score": 0.15307513104251605
+},
+// ...
+```

-# On multiple splits (e.g. test_iid, test_vis)
-python -m dmr.eval eval.split=test_iid,test_web,test_geo,test_cat,test_vis
+Behind the scenes, this uses the `weblinx.eval.auto_eval_and_save` function to run the evaluation metrics. If you want more control, you can call `weblinx.eval.auto_eval_and_save` directly; for an example, check out `weblinx/eval/__main__.py`.

-# Or for bge, gte
-python -m dmr.eval +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
-python -m dmr.eval +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
-```
+Note that it might be slow the first time you run it, because it reads a lot of demonstrations and loads millions of files. However, a demo-level cache is automatically created (see `./.cache/demonstrations`), so the next time you run it, it should be much faster.
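Side note: the aggregated score file shown above is plain JSON, so pulling the overall number for one split out of it only takes the standard library. The sketch below is a minimal, illustrative example that assumes `results/aggregated_scores.json` is a list of entries shaped like the one in the README hunk above; the filtering logic is not part of the repository.

```python
import json

# Assumption: the file is a JSON list of entries like the example above,
# each with "split", "intent", "metric", "model_name", and "score" fields.
with open("results/aggregated_scores.json") as f:
    scores = json.load(f)

# Keep only the overall score for the "valid" split.
overall = [
    s for s in scores
    if s["split"] == "valid" and s["intent"] == "overall" and s["metric"] == "overall"
]
for entry in overall:
    print(entry["model_name"], entry["score"])
```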

modeling/llama/conf/config.yaml

+1-1
@@ -43,7 +43,7 @@ candidates:
   project_name: dmr # unused but potentially useful
   split: ${eval.split}
   train_path: ${project_dir}/wl_data/candidates/train.jsonl
-  path: ${project_dir}/wl_data/candidates/${split}.jsonl
+  path: ${project_dir}/wl_data/candidates/${candidates.split}.jsonl

 hydra:
   run:
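A note on this one-line fix: Hydra configs use OmegaConf interpolation, and `${split}` is resolved from the root of the config rather than relative to the `candidates` node, so it only works if a top-level `split` key exists. The sketch below is illustrative only (shortened paths, not the repository's full config) and assumes standard OmegaConf semantics; it shows how the corrected `${candidates.split}` reference resolves through `eval.split`.

```python
from omegaconf import OmegaConf

# Illustrative config fragment (not the repo's full config).
yaml_cfg = """
eval:
  split: valid
candidates:
  split: ${eval.split}
  path: wl_data/candidates/${candidates.split}.jsonl
"""

cfg = OmegaConf.create(yaml_cfg)
# ${candidates.split} points at the sibling key above, which in turn interpolates ${eval.split}.
print(cfg.candidates.path)  # -> wl_data/candidates/valid.jsonl
```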

modeling/llama/eval.py

+6-6
@@ -15,10 +15,10 @@
 )
 from transformers.pipelines.pt_utils import KeyDataset

-import webtasks as wt
-from webtasks.processing import load_candidate_elements
-from webtasks.processing.prompt import build_input_records_from_selected_turns, select_turns_and_candidates_for_prompts
-from webtasks.utils.hydra import save_path_to_hydra_logs
+import weblinx as wl
+from weblinx.processing import load_candidate_elements
+from weblinx.processing.prompt import build_input_records_from_selected_turns, select_turns_and_candidates_for_prompts
+from weblinx.utils.hydra import save_path_to_hydra_logs

 from .processing import (
     build_prompt_records_for_llama_truncated,
@@ -48,8 +48,8 @@ def main(cfg):
     tokenizer.pad_token = tokenizer.eos_token

     # Data loading
-    demo_names = wt.utils.load_demo_names_in_split(split_path, split=split)
-    demos = [wt.Demonstration(name, base_dir=cfg.data.base_dir) for name in demo_names]
+    demo_names = wl.utils.load_demo_names_in_split(split_path, split=split)
+    demos = [wl.Demonstration(name, base_dir=cfg.data.base_dir) for name in demo_names]

     format_intent = build_formatter_for_multichoice()
     build_prompt_records_fn = partial(

modeling/requirements.txt

+1-2
@@ -19,5 +19,4 @@ coloredlogs
 sacrebleu
 bert-score
 packaging
-ninja
-flash-attn>=2.3.0
+ninja

weblinx/__init__.py

+1-6
@@ -89,7 +89,7 @@ def replay(self) -> dict:

         Note
         ----
-        If you want a `webtasks.Replay` object, call `webtasks.Replay.from_demonstration(demo)`.
+        If you want a `Replay` object, call `Replay.from_demonstration(demo)`.
         """
         return self.load_json("replay.json")

@@ -292,8 +292,6 @@ def join(self, *args) -> Path:
         return self.path.joinpath(*args)


-# Example of a turn:
-# {'type': 'browser', 'timestamp': 58.901999950408936, 'state': {'screenshot': 'screenshot-3-0.png', 'page': 'page-4-0.html'}, 'action': {'intent': 'click', 'arguments': {'metadata': {'mouseX': 214, 'mouseY': 245, 'tabId': 102448212, 'timestamp': 1684736360511, 'url': 'https://www.google.com/search?q=google+scholar&oq=google+scholar&aqs=chrome..69i57j35i39i650j0i433i650j0i512j0i433i512l2j0i131i433i512j5.10083j0j4&sourceid=chrome&ie=UTF-8', 'viewportHeight': 714, 'viewportWidth': 1536, 'zoomLevel': 1.25}, 'properties': {'altKey': False, 'button': 0, 'buttons': 1, 'clientX': 267.5, 'clientY': 306.25, 'composed': True, 'ctrlKey': False, 'detail': 1, 'eventPhase': 0, 'layerX': 19, 'layerY': 63, 'metaKey': False, 'movementX': 0, 'movementY': 0, 'offsetX': 23.75, 'offsetY': 31.25, 'pageX': 267.5, 'pageY': 306.25, 'returnValue': True, 'screenX': 267.5, 'screenY': 435.0, 'shiftKey': False, 'timeStamp': 2889.2999999970198, 'x': 267.5, 'y': 306.25}, 'element': {'attributes': {'class': 'LC20lb MBeuO DKV0Md', 'data-webtasks-id': 'ba92f02d-debb-4985'}, 'bbox': {'bottom': 314.1718864440918, 'height': 38.75, 'left': 244.0625, 'right': 416.39062881469727, 'top': 275.4218864440918, 'width': 172.32812881469727, 'x': 244.0625, 'y': 275.4218864440918}, 'innerHTML': 'Google Scholar', 'outerHTML': '<h3 class="LC20lb MBeuO DKV0Md" data-webtasks-id="ba92f02d-debb-4985">Google Scholar</h3>', 'tagName': 'H3', 'textContent': 'Google Scholar', 'xpath': 'id("rso")/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/a[1]/h3[1]'}}, 'event_id': 4, 'element_html': '<h3 class="LC20lb MBeuO DKV0Md" data-webtasks-id="ba92f02d-debb-4985">Google Scholar</h3>', 'screenshot_effect': None}}
 class Turn(dict):
     def __init__(
         self,
@@ -981,9 +979,6 @@ def get_xpaths_dict(
     return xpaths


-# EXAMPLE OF REPLAY
-#
-# {'type': 'browser', 'timestamp': 47.28099989891052, 'state': {'screenshot': 'screenshot-6-0.png', 'page': 'page-6-0.html'}, 'action': {'intent': 'click', 'arguments': {'metadata': {'mouseX': 390, 'mouseY': 505, 'tabId': 2011623910, 'timestamp': 1685102753001, 'url': 'https://www.thefork.com/', 'viewportHeight': 657, 'viewportWidth': 1366, 'zoomLevel': 1}, 'properties': {'altKey': False, 'button': 0, 'buttons': 1, 'clientX': 390, 'clientY': 505, 'composed': True, 'ctrlKey': False, 'detail': 1, 'eventPhase': 0, 'layerX': 203, 'layerY': 27, 'metaKey': False, 'movementX': 0, 'movementY': 0, 'offsetX': 204, 'offsetY': 28, 'pageX': 390, 'pageY': 505, 'returnValue': True, 'screenX': 390, 'screenY': 576, 'shiftKey': False, 'timeStamp': 21110, 'x': 390, 'y': 505}, 'element': {'attributes': {'class': 'css-m080s5 ektx8jp0', 'data-test': 'search-form-submit-button', 'data-testid': 'search-form-submit-button', 'data-webtasks-id': '20fbac1a-3c62-475a', 'display': 'block', 'type': 'submit', 'width': '100%'}, 'bbox': {'bottom': 523.328125, 'height': 46, 'left': 186.5, 'right': 588.5, 'top': 477.328125, 'width': 402, 'x': 186.5, 'y': 477.328125}, 'innerHTML': 'Search', 'outerHTML': '<button width="100%" type="submit" data-test="search-form-submit-button" data-testid="search-form-submit-button" display="block" class="css-m080s5 ektx8jp0" data-webtasks-id="20fbac1a-3c62-475a">Search</button>', 'tagName': 'BUTTON', 'textContent': 'Search', 'xpath': 'id("root")/main[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/button[1]'}}, 'event_id': 6, 'element_html': '<button class="css-m080s5 ektx8jp0" data-test="search-form-submit-button" data-testid="search-form-submit-button" data-webtasks-id="20fbac1a-3c62-475a" display="block" type="submit" width="100%">Search</button>', 'screenshot_effect': None}}
 class Replay:
     """
     A replay is one of the core components of a demonstration. It is a list of turns, each of

weblinx/_data/splits.json

+5-5
@@ -1,5 +1,5 @@
 {
-    "blind": [
+    "test_vis": [
         "fndboyk",
         "thjakvr",
         "nvdejnk",
@@ -445,7 +445,7 @@
         "zoeiakj",
         "zwdzqmo"
     ],
-    "geography": [
+    "test_geo": [
         "brzunzn",
         "xmjfzvn",
         "bxnoafu",
@@ -737,7 +737,7 @@
         "svbtpbx",
         "zdnetmz"
     ],
-    "subcategory": [
+    "test_cat": [
         "qqrjzop",
         "ucjhfyp",
         "dtiyhkm",
@@ -962,7 +962,7 @@
         "kngasft",
         "pdlfqli"
     ],
-    "website": [
+    "test_web": [
         "apajlpi",
         "qqhbegt",
         "tlkvkmk",
@@ -2450,7 +2450,7 @@
         "kaxzpgm",
         "ejryoez"
     ],
-    "indomain": [
+    "test_iid": [
         "scicrdo",
         "iszaysr",
         "tdzkbmv",

weblinx/eval/__init__.py

+6-6
@@ -80,7 +80,7 @@ def run_evaluation(processed_results, metrics, num_turns=1):
     return scores


-def validate_reference_action(ref_action, next_turn):
+def validate_reference_action(ref_action, next_turn, uid_key="data-webtasks-id"):
     """
     This verifies that the reference action is valid for evaluation. This means:
     - The reference action is not None
@@ -101,11 +101,11 @@ def validate_reference_action(ref_action, next_turn):

     # If the reference action does not have an elemnt or an attributes field,
     # then we cannot evaluate. If it does have an attributes field, then we need
-    # to check if it has a data-webtasks-id field in it
+    # to check if it has a <uid> field in it
     if ref_action.get("element") is None or ref_action.get("element") is None:
         return False

-    if ref_action["element"]["attributes"].get("data-webtasks-id") is None:
+    if ref_action["element"]["attributes"].get(uid_key) is None:
         return False

     if next_turn is not None and ref_action["intent"] == "click":
@@ -117,11 +117,11 @@ def validate_reference_action(ref_action, next_turn):
            return False

        # If next turn is a submit intent, then we only keep the current click
-        # intent if the submit intent has a data-webtasks-id that is different
+        # intent if the submit intent has a <uid> that is different
        # from the current click intent
        if next_turn.intent == "submit":
-            next_uid = next_turn.element["attributes"]["data-webtasks-id"]
-            cur_uid = ref_action["element"]["attributes"]["data-webtasks-id"]
+            next_uid = next_turn.element["attributes"][uid_key]
+            cur_uid = ref_action["element"]["attributes"][uid_key]

            if next_uid == cur_uid:
                print(
