v-jinyi/Change Dataset used in newsrec to MIND #1153
Conversation
Using the MINDsmall dataset for training takes about 20 minutes per epoch, which is a little long for a quick start. A sample MINDdemo dataset is used in the Jupyter notebook; the file format is exactly the same as in MINDsmall and MINDlarge.
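For reference, a minimal sketch of inspecting MIND-format files with pandas, assuming the standard news.tsv/behaviors.tsv column layout; the local paths are placeholders, not the notebook's actual download location:

```python
import pandas as pd

# Assumed local paths after unzipping a MIND-format dataset (demo, small, or large).
NEWS_FILE = "MINDdemo_train/news.tsv"
BEHAVIORS_FILE = "MINDdemo_train/behaviors.tsv"

# news.tsv: one article per line, tab-separated.
news = pd.read_csv(
    NEWS_FILE,
    sep="\t",
    header=None,
    names=["news_id", "category", "subcategory", "title", "abstract",
           "url", "title_entities", "abstract_entities"],
)

# behaviors.tsv: one impression per line, tab-separated.
behaviors = pd.read_csv(
    BEHAVIORS_FILE,
    sep="\t",
    header=None,
    names=["impression_id", "user_id", "time", "history", "impressions"],
)

print(news.head())
print(behaviors.head())
```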
```diff
@@ -162,8 +162,10 @@ def test_naml_smoke(notebooks):
     )
     results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]

-    assert results["res_syn"]["group_auc"] == pytest.approx(0.5565, rel=TOL, abs=ABS_TOL)
-    assert results["res_syn"]["mean_mrr"] == pytest.approx(0.1811, rel=TOL, abs=ABS_TOL)
+    assert results["res_syn"]["group_auc"] == pytest.approx(
```
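For context, pytest.approx compares floats within a tolerance; a minimal self-contained example (the TOL and ABS_TOL values here are illustrative, not the repo's actual constants):

```python
import pytest

# Illustrative tolerances; the repo defines its own TOL / ABS_TOL constants.
TOL = 0.05      # relative tolerance (5%)
ABS_TOL = 0.05  # absolute tolerance

def test_metric_within_tolerance():
    group_auc = 0.57  # e.g. a metric read back from the executed notebook
    # Passes if the value is within 5% of 0.58 or within 0.05 of it.
    assert group_auc == pytest.approx(0.58, rel=TOL, abs=ABS_TOL)
```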
@yjw1029 there is something weird happening. I was running these tests and they are taking too long, a couple of hours on Prometheus (still hasn't finished). Then I created an NC24 machine and ran naml_MIND with 1 epoch. The training phase is taking long as well, and I don't see any GPU consumption.
Could you please check?
I ran the notebook recommenders/examples/00_quick_start/naml_MIND with the default parameters and 1 epoch; the training took Wall time: 12min 13s. The metrics were:
{'group_auc': 0.5825, 'mean_mrr': 0.2553, 'ndcg@5': 0.2802, 'ndcg@10': 0.3448}
Then I changed the batch size to 512 (hparams.batch_size = 512) and the training took longer: Wall time: 18min 50s. The metrics were:
{'group_auc': 0.5663, 'mean_mrr': 0.2403, 'ndcg@5': 0.2594, 'ndcg@10': 0.3283}
With bs=512 I expect the metrics to be lower, but the training time should be faster. Also, I still can't see memory consumption with nvidia-smi.
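One quick way to confirm whether TensorFlow (which the newsrec models use) actually sees the GPU is a small diagnostic like the sketch below; it assumes nvidia-smi is on the PATH and is independent of the notebook itself:

```python
import subprocess
import tensorflow as tf

# False means training is silently falling back to CPU.
print("GPU available to TensorFlow:", tf.test.is_gpu_available())

# Query current GPU memory usage outside of TensorFlow.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"]
).decode())
```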
Oh, I think I found the issue: epoch vs epochs :-)
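This is the kind of bug being described: if the keyword is misspelled, the hyperparameter object silently keeps its default. A hypothetical sketch (the class and attribute names are illustrative, not the repo's actual API):

```python
class HParams:
    def __init__(self, **kwargs):
        self.epochs = 8          # default number of training epochs
        self.batch_size = 64
        # A misspelled keyword ("epoch" instead of "epochs") just adds a new,
        # unused attribute and leaves the real default untouched.
        for key, value in kwargs.items():
            setattr(self, key, value)

hparams = HParams(epoch=1)       # intended to train for 1 epoch...
print(hparams.epochs)            # ...but still 8: training runs 8x longer than expected
```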
With a GPU, the time should be about 3 min per epoch. I have already fixed it; sorry for my mistake.
I will check the time with a batch size of 512.
Running pytest tests/smoke/test_notebooks_gpu.py::test_naml_smoke, will report the results.
The NAML model takes the news abstract, title, vert, and subvert as input, which is much bigger than the other newsrec models. It will cause an OOM error when batch_size is set to 512. I guess the longer training time may be caused by a shortage of GPU memory.
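A rough back-of-the-envelope comparison of per-sample input size; the sequence lengths and field counts below are assumed example values, not the repo's actual settings:

```python
# Assumed sequence lengths per news article (illustrative only).
title_len, abstract_len = 30, 50
vert_fields = 2            # category + subcategory ids
his_size, npratio = 50, 4  # clicked-history length, negative sampling ratio

# Tokens per impression for a title-only newsrec model.
title_only = (his_size + npratio + 1) * title_len

# Tokens per impression for NAML, which also encodes abstract, vert and subvert.
naml = (his_size + npratio + 1) * (title_len + abstract_len + vert_fields)

print(title_only, naml, naml / title_only)  # NAML input is ~2.7x larger here
```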
> running pytest tests/smoke/test_notebooks_gpu.py::test_naml_smoke, will report the results

On a K80 it takes 3090 s (around 50 min), which is too much for a smoke test. I haven't tried the integration tests yet, but if the NAML test runs 8 epochs, it would easily take more than 6 hours.
Could you please remove all the smoke tests for the news algos and reduce the integration tests to 1 epoch, so we get an affordable time? Also, if it is possible to increase the batch size so that NAML takes around 15-20 min in the integration test, that would be perfect.
To give you an estimate of what we had before: the integration test of all the GPU algos took a little over 2 h; I would say we should target 3 h max in total.
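A sketch of how an integration test could cap training at 1 epoch via papermill parameters; the fixture usage follows the pattern of the repo's existing tests, but the notebook key and parameter names are written from memory and should be treated as an approximation:

```python
import papermill as pm
import pytest

@pytest.mark.integration
@pytest.mark.gpu
def test_naml_integration(notebooks):  # 'notebooks' fixture maps names to notebook paths
    notebook_path = notebooks["naml_quickstart"]  # hypothetical key
    pm.execute_notebook(
        notebook_path,
        "output.ipynb",
        kernel_name="python3",
        # Cap the run at a single epoch so the whole GPU suite stays within budget.
        parameters=dict(epochs=1, batch_size=64, MIND_type="demo"),
    )
```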
I checked the test_naml_gpu time in the smoke tests again; it only takes 6 minutes.
And all the algos in the integration test only take 50 minutes.
However, the smoke test suite does take a lot of time overall, which may not be caused by the newsrec algos (we now use a sample demo dataset of MIND and run only 1 epoch). Could you please check the reason for the long smoke test again? @miguel-ferreira
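One way to see where the smoke-test time actually goes is pytest's built-in duration report; a minimal sketch invoking it from Python (the marker expression is an assumption about how the repo tags its smoke/GPU tests):

```python
import pytest

# Report the 20 slowest test phases (setup/call/teardown) for the GPU smoke tests.
pytest.main([
    "tests/smoke/test_notebooks_gpu.py",
    "-m", "smoke and gpu",   # assumed marker expression for the smoke GPU run
    "--durations=20",
])
```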
Description
Change the dataset used in the newsrec algorithms to MIND
Related Issues
#1152 (comment)
Checklist:
- This PR is made against the staging branch and not master.