
Running CWQ #6

Open
KyuhwanYeom opened this issue Mar 3, 2025 · 5 comments

@KyuhwanYeom

Hi,

I tried to run inference on CWQ but ran into some issues.

For step 1 (graph-constrained decoding), there is a hyperparameter index_path_length with a default value of 2.
When I run scripts/graph_constrained_decoding.sh with this default setting, its performance on CWQ is fairly low (about 62 Hits@1 and 52 F1).

When I change this value to 4 and rerun decoding, it takes far too long (about two weeks).
Should I set index_path_length to 4 for CWQ, or is there another solution?
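
For context, here is a rough, hypothetical back-of-the-envelope sketch (not taken from the GCR codebase; the branching factor below is purely an assumption) of why raising the path-index length tends to blow up the runtime: the number of KG paths that must be indexed grows roughly exponentially in the path length.

```python
# Hypothetical sketch: why raising index_path_length from 2 to 4 can turn hours
# into weeks. If each entity has an average out-degree b, the number of paths of
# length k starting from one topic entity grows roughly like b**k, so the index
# (and the decoding constrained by it) grows by about b**2 going from 2 to 4.

def approx_path_count(branching_factor: float, length: int) -> float:
    """Rough count of distinct paths of a given length from a single entity."""
    return branching_factor ** length

b = 50  # assumed average out-degree of a Freebase entity; purely illustrative
for k in (2, 4):
    print(f"length {k}: ~{approx_path_count(b, k):,.0f} paths per topic entity")
# length 2: ~2,500 paths per topic entity
# length 4: ~6,250,000 paths per topic entity
```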

Best regards,
Kyuhwan

@RManLuo
Owner

RManLuo commented Mar 3, 2025

Hi, thanks for your interest in our work. I think your current settings are correct. Have you tried running graph_inductive_reasoning.sh to see the result? That script produces the final results reported in the paper.

@KyuhwanYeom
Author

Thanks for your reply.

We ran graph_inductive_reasoning.sh after graph_constrained_decoding.sh.
For inference, we did not change any settings in either script.

step1_result3090_GCR.txt
step2_result_GCR_gpt.txt

We have attached our output log files for convenience.

Best regards,
Kyuhwan

@RManLuo
Owner

RManLuo commented Mar 12, 2025

Hi, our experiment was run on an A100 GPU with BF16 precision. I think the 3090 only supports FP16, which could degrade the model's performance. I have attached our results for your reference; they are higher than your step 1 results.

Accuracy: 63.82162375903748
Hit: 69.11931818181819
F1: 49.26580602221372
Precision: 45.7756470959596
Recall: 63.01873370694562
Path F1: 28.599600479637243
Path Precision: 27.203869047619047
Path Recall: 42.35924793000597
Path Answer F1: 50.34972586790975
Path Answer Precision: 47.15007215007215
Path Answer Recall: 63.82162375903748

llama3.1-cwq.zip
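
As a quick way to rule out a precision mismatch, here is a minimal sketch (assuming the decoding step loads the model with PyTorch and Hugging Face Transformers; the actual loading code in this repo may differ, and the model id is a placeholder) that prefers BF16 when the GPU supports it and only falls back to FP16 otherwise:

```python
# Minimal sketch, assuming PyTorch + Hugging Face Transformers. The model id is a
# placeholder; use whatever model_name the scripts / args.txt actually specify.
import torch
from transformers import AutoModelForCausalLM

model_name = "path/or/hub-id-of-the-GCR-model"  # placeholder

# Prefer BF16 when the GPU supports it (e.g. A100, RTX 40-series); otherwise FP16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Loading {model_name} with dtype={dtype}")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=dtype,
    device_map="auto",
)
```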

@KaeHyun

KaeHyun commented Mar 15, 2025

Thank you for your response.
Based on your feedback, I reran the Step 1 experiment using an RTX 4090, which supports bf16, but the performance still did not improve.
The F1 score seems particularly low, and I can't figure out why.
When comparing the provided files, I noticed that the model_name in args.txt is different.
I have attached my result.

Best regards,
Kaehyun

llama3.1-8b GCR RTX4090.zip

@RManLuo
Owner

RManLuo commented Mar 17, 2025

Thanks for providing this information. I will try to investigate it. Meanwhile, I only renamed the model for better readability on HF. Can you check whether the Llama2 model on HF generates similar results?
