
Running CWQ #6

Open
KyuhwanYeom opened this issue Mar 3, 2025 · 5 comments

@KyuhwanYeom

Hi,

I tried to run inference on CWQ but ran into some issues.

For step 1 (graph-constrained decoding), there is a hyperparameter index_path_length with a default value of 2.
When I run scripts/graph_constrained_decoding.sh with this default setting, its performance on CWQ is fairly low (about 62 Hits@1 and 52 F1).

When I change this value to 4 and rerun decoding, it takes far too long (about two weeks).
Should I set index_path_length to 4 for CWQ, or is there another solution?
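
For context, here is a rough, hypothetical back-of-the-envelope sketch (not taken from the GCR codebase; the branching factor below is purely an assumption) of why raising the path-index length tends to blow up the runtime: the number of KG paths that must be indexed grows roughly exponentially in the path length.

```python
# Hypothetical sketch: why raising index_path_length from 2 to 4 can turn hours
# into weeks. If each entity has an average out-degree b, the number of paths of
# length k starting from one topic entity grows roughly like b**k, so the index
# (and the decoding constrained by it) grows by about b**2 going from 2 to 4.

def approx_path_count(branching_factor: float, length: int) -> float:
    """Rough count of distinct paths of a given length from a single entity."""
    return branching_factor ** length

b = 50  # assumed average out-degree of a Freebase entity; purely illustrative
for k in (2, 4):
    print(f"length {k}: ~{approx_path_count(b, k):,.0f} paths per topic entity")
# length 2: ~2,500 paths per topic entity
# length 4: ~6,250,000 paths per topic entity
```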

Best regards,
Kyuhwan

@RManLuo
Owner

RManLuo commented Mar 3, 2025

Hi, thanks for your interest in our work. I think your current settings are correct. Have you tried running graph_inductive_reasoning.sh to see the result? That script produces the final results reported in the paper.

@KyuhwanYeom
Author

Thanks for your reply.

We ran graph_inductive_reasoning.sh after graph_constrained_decoding.sh.
For inference, we did not change any settings in either script.

step1_result3090_GCR.txt
step2_result_GCR_gpt.txt

We have attached our output log files for convenience.

Best regards,
Kyuhwan

@RManLuo
Owner

RManLuo commented Mar 12, 2025

Hi, our experiment was run on an A100 GPU with BF16 precision. I think the 3090 only supports FP16, which could degrade the model's performance. I have attached our results for your reference; they are higher than your step 1 results.

Accuracy: 63.82162375903748
Hit: 69.11931818181819
F1: 49.26580602221372
Precision: 45.7756470959596
Recall: 63.01873370694562
Path F1: 28.599600479637243
Path Precision: 27.203869047619047
Path Recall: 42.35924793000597
Path Answer F1: 50.34972586790975
Path Answer Precision: 47.15007215007215
Path Answer Recall: 63.82162375903748

llama3.1-cwq.zip
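
As a quick way to rule out a precision mismatch, here is a minimal sketch (assuming the decoding step loads the model with PyTorch and Hugging Face Transformers; the actual loading code in this repo may differ, and the model id is a placeholder) that prefers BF16 when the GPU supports it and only falls back to FP16 otherwise:

```python
# Minimal sketch, assuming PyTorch + Hugging Face Transformers. The model id is a
# placeholder; use whatever model_name the scripts / args.txt actually specify.
import torch
from transformers import AutoModelForCausalLM

model_name = "path/or/hub-id-of-the-GCR-model"  # placeholder

# Prefer BF16 when the GPU supports it (e.g. A100, RTX 40-series); otherwise FP16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Loading {model_name} with dtype={dtype}")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=dtype,
    device_map="auto",
)
```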

@KaeHyun

KaeHyun commented Mar 15, 2025

Thank you for your response.
Based on your feedback, I reran the Step 1 experiment using an RTX 4090, which supports bf16, but the performance still did not improve.
The F1 score seems particularly low, and I can't figure out why.
When comparing the provided files, I noticed that the model_name in args.txt is different.
I have attached my result.

Best regards,
Kaehyun

llama3.1-8b GCR RTX4090.zip

@RManLuo
Owner

RManLuo commented Mar 17, 2025

Thanks for providing this information. I will try to investigate it. Meanwhile, I only renamed the model for better readability on HF. Can you check whether the Llama2 model on HF generates similar results?
