
v-jinyi/Change Dataset used in newsrec to MIND #1153

Merged

Conversation

@yjw1029 (Contributor) commented Jul 20, 2020

Description

Change Dataset used in newsrec algorithms to MIND

Related Issues

#1152 (comment)

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging and not master.

@yjw1029 (Contributor, Author) commented Jul 20, 2020

Using the MINDsmall dataset for training takes about 20 minutes per epoch, which is a little long for a quick start. A sample MINDdemo dataset is used in the Jupyter notebook instead. The file format is exactly the same as in MINDsmall and MINDlarge.
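For reference, since the TSV layout is shared across the demo, small, and large variants, the files can be inspected directly. A minimal sketch with pandas (the paths are placeholders for a locally downloaded and unzipped MINDdemo split; the column names follow the published MIND format):

```python
# Minimal sketch: inspecting the MINDdemo TSV files with pandas.
# Paths are placeholders; the column names follow the published MIND format,
# which is identical for the demo, small, and large variants.
import pandas as pd

news = pd.read_csv(
    "MINDdemo_train/news.tsv",
    sep="\t",
    header=None,
    names=["news_id", "category", "subcategory", "title",
           "abstract", "url", "title_entities", "abstract_entities"],
)

behaviors = pd.read_csv(
    "MINDdemo_train/behaviors.tsv",
    sep="\t",
    header=None,
    names=["impression_id", "user_id", "time", "history", "impressions"],
)

print(news.head())
print(behaviors.head())
```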

@yjw1029 changed the title from "V jinyi/add news reco methods" to "v-jinyi/Change Dataset used in newsrec to MIND" on Jul 20, 2020
@@ -162,8 +162,10 @@ def test_naml_smoke(notebooks):
)
results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]

assert results["res_syn"]["group_auc"] == pytest.approx(0.5565, rel=TOL, abs=ABS_TOL)
assert results["res_syn"]["mean_mrr"] == pytest.approx(0.1811, rel=TOL, abs=ABS_TOL)
assert results["res_syn"]["group_auc"] == pytest.approx(
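For context, the assertions above rely on pytest.approx, which passes when the metric is within the larger of the relative and the absolute tolerance. A minimal standalone sketch (the TOL and ABS_TOL values here are illustrative, not necessarily the repository's constants):

```python
# Minimal sketch of how the pytest.approx tolerances in the diff behave.
# TOL and ABS_TOL here are illustrative values, not the repository's constants.
import pytest

TOL = 0.05      # relative tolerance
ABS_TOL = 0.05  # absolute tolerance

def test_group_auc_within_tolerance():
    group_auc = 0.57  # pretend value read from the executed notebook
    # Passes when |actual - expected| <= max(rel * |expected|, abs).
    assert group_auc == pytest.approx(0.5565, rel=TOL, abs=ABS_TOL)
```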
@miguelgfierro (Collaborator) commented Jul 21, 2020

@yjw1029 there is something weird happening. I was running these tests and they are taking too long: a couple of hours on Prometheus (still not finished). Then I created an NC24 machine and ran naml_MIND with 1 epoch. The training phase is taking a long time as well, and I don't see any GPU consumption.

Could you please check?

@yjw1029 (Contributor, Author) replied

I tried the naml_MIND notebook again on my GCR server. The run_eval part shows about 27% GPU utilization, and the training part about 93% GPU utilization.
[screenshots: GPU utilization during run_eval and training]
Do other notebooks have the same issue?

@miguelgfierro (Collaborator) commented Jul 21, 2020

I ran the notebook recommenders/examples/00_quick_start/naml_MIND with the default parameters and 1 epoch; the training took Wall time: 12min 13s. The metrics were:
{'group_auc': 0.5825, 'mean_mrr': 0.2553, 'ndcg@5': 0.2802, 'ndcg@10': 0.3448}
Then I changed the batch size to 512 (hparams.batch_size = 512) and the training took longer, Wall time: 18min 50s, with these metrics:
{'group_auc': 0.5663, 'mean_mrr': 0.2403, 'ndcg@5': 0.2594, 'ndcg@10': 0.3283}

With bs=512 I would expect the metrics to be lower, but the training time should be faster. I still can't see memory consumption with nvidia-smi.
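One way to confirm whether the GPU is actually busy while the notebook trains is to poll nvidia-smi from a separate process. A minimal sketch (not part of the repository; assumes the NVIDIA driver and nvidia-smi are installed on the machine):

```python
# Minimal sketch: periodically sample GPU utilization and memory by shelling
# out to nvidia-smi. Not part of the repository; assumes nvidia-smi is on PATH.
import subprocess
import time

def gpu_snapshot() -> str:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

for _ in range(5):           # take a few snapshots, a few seconds apart
    print(gpu_snapshot())    # e.g. "93 %, 10432 MiB, 11441 MiB"
    time.sleep(5)
```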

@miguelgfierro (Collaborator) commented Jul 21, 2020

oh I think I found the issue, epoch vs epochs :-)
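For illustration, this is the kind of pitfall an epoch vs epochs mix-up causes with a plain hparams-style object: the misspelled attribute is silently created instead of overriding the real one (a sketch only, not the repository's HParams class):

```python
# Illustrative sketch of the "epoch vs epochs" pitfall; this is NOT the
# repository's HParams class, just a plain attribute container.
class HParams:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

hparams = HParams(epochs=8, batch_size=64)

hparams.epoch = 1        # typo: silently creates a new attribute, no error
print(hparams.epochs)    # still 8 -> training would run 8 epochs

hparams.epochs = 1       # the intended override
print(hparams.epochs)    # 1
```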

@yjw1029 (Contributor, Author) replied

With the GPU, training should take about 3 minutes per epoch. I have already fixed it, sorry for my mistake.
I will check the time with a batch size of 512.

@miguelgfierro (Collaborator) replied

Running pytest tests/smoke/test_notebooks_gpu.py::test_naml_smoke; I will report the results.

@yjw1029 (Contributor, Author) replied

The NAML model takes the news abstract, title, vert, and subvert as input, so it is much bigger than the other newsrec models. It causes an OOM error when batch_size is set to 512. I guess the longer training time may be caused by a shortage of GPU memory.

@miguelgfierro (Collaborator) commented Jul 21, 2020

> Running pytest tests/smoke/test_notebooks_gpu.py::test_naml_smoke; I will report the results.

On a K80 it takes 3090s (around 50 min), which is too much for a smoke test. I haven't tried the integration tests yet, but if the NAML test runs 8 epochs, it would easily take more than 6 hours.

Could you please remove all the smoke tests for the news algos and reduce the integration tests to 1 epoch, so we get an affordable runtime?

Also, if it is possible to increase the batch size so that NAML takes around 15-20 min in the integration test, that would be perfect.

To give you an estimate of what we had before: the integration tests for all the GPU algos took a little more than 2 hours; I would say we should target 3 hours max in total.

@yjw1029 (Contributor, Author) replied

I checked the test_naml_gpu time in the smoke test again. It only takes 6 minutes.
And all algos in the integration test only take 50 minutes.

However, the full smoke test suite does take a lot of time, which may not be caused by the newsrec algos (we now use a sample demo dataset of MIND and only run 1 epoch). Could you please check the reason for the long smoke test again? @miguel-ferreira

@miguelgfierro merged commit f1891d3 into recommenders-team:staging on Jul 23, 2020