DeePMD-kit PyTorch backend error #4540
Comments
It seems the problem is related to the device. On my cluster, I am using a single A100 GPU, and the same error occurs when using 1 A100 on Bohrium. However, the issue disappears when using 2 A100 GPUs (previously, I was using 8 V100 GPUs).
TorchScript computations may require all tensors to reside on the same device (e.g., GPU or CPU). If certain tensors are inadvertently placed on different devices (e.g., one on CPU and another on GPU), it may result in a runtime error like the one above.
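As a concrete illustration of that device-mismatch scenario, here is a minimal plain-PyTorch sketch that loads the frozen TorchScript model and reports where its parameters and a sample input end up. It assumes `model.pth` is the frozen TorchScript file from the LAMMPS input; the tensor shapes and names are placeholders, not DeePMD-kit's real input layout.

```python
import torch

# Load the frozen TorchScript model (the same file passed to pair_style deepmd).
model = torch.jit.load("model.pth", map_location="cuda:0")

# Report which device(s) the model's parameters live on.
param_devices = {p.device for p in model.parameters()}
print("parameter devices:", param_devices)

# Placeholder input tensor; shapes are illustrative only.
coords = torch.zeros(1, 10, 3)      # created on CPU by default
print("input device before move:", coords.device)

# A mismatch such as parameters on cuda:0 and an input still on cpu is what
# typically produces "Expected all tensors to be on the same device" errors,
# so inputs should be moved explicitly to the model's device.
coords = coords.to("cuda:0")
print("input device after move:", coords.device)
```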
Is your log truncated? LAMMPS should normally print additional output after this point.
**Not yet. I also downloaded the log file directly from the Bohrium platform (tested on version 3.0.0), and it matches exactly what I see from my local computer center. I've attached a ZIP file containing all the outputs from Bohrium for further diagnosis and analysis.**

The excerpt from the Bohrium log shows the same lines as the error log quoted in the bug report below, from the `potential` header comment through `WARNING: No fixes with time integration, atoms won't move (src/verlet.cpp:60)`.
By the way, I noticed that the tests with one and two A100 GPUs on the Bohrium platform are running on dp-3.0.0, while the CentOS setup uses dp-3.0.1.
When I reduced the number of simulation atoms from 108,000 to 32,000, the error disappeared.
Perhaps it's due to running out of GPU memory.
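One way to test the out-of-memory hypothesis is to watch device-wide GPU memory while the 108,000-atom case runs. This is a minimal sketch assuming a CUDA build of PyTorch is available in the environment; it polls driver-level numbers, so it also reflects memory held by the LAMMPS/DeePMD-kit process running separately.

```python
import time
import torch

# Poll free/total memory on GPU 0 while the LAMMPS job runs in another process.
# torch.cuda.mem_get_info wraps cudaMemGetInfo, so the numbers are device-wide.
# Stop with Ctrl-C.
while True:
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    used_gib = (total_bytes - free_bytes) / 1024**3
    total_gib = total_bytes / 1024**3
    print(f"used {used_gib:.1f} GiB of {total_gib:.1f} GiB")
    time.sleep(5)
```

If the used memory climbs toward the card's capacity just before the crash, an out-of-memory condition would also explain why splitting the system across two A100s, or shrinking it to 32,000 atoms, makes the error disappear.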
Bug summary
After fine-tuning, distilling, and freezing a PyTorch model (model.pth), running a LAMMPS task with it works well on the Bohrium platform using DeePMD-kit version 3.0.1. However, on a CentOS system the same task fails with a DeePMD-kit PyTorch backend error. The error appears to originate from code/torch/deepmd/pt/model/model/dp_zbl_model.py.
Interestingly, the issue does not occur with a frozen model generated using DeePMD-kit version 3.0.0, which runs normally on the CentOS system. The problem therefore seems related to a change made between versions 3.0.0 and 3.0.1.
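A quick way to check whether the 3.0.1-frozen model itself triggers the TorchScript error, independent of LAMMPS, is to evaluate it through DeePMD-kit's Python inference API. This is only a minimal sketch: it assumes the `deepmd.infer.DeepPot` interface accepts the frozen .pth model, and the coordinates, cell, and atom types are tiny placeholders, not the real 108,000-atom structure or the model's actual type map.

```python
import numpy as np
from deepmd.infer import DeepPot

# Load the frozen PyTorch-backend model used in the LAMMPS run.
dp = DeepPot("model.pth")

# Placeholder configuration: 3 atoms in a 10 Angstrom cubic box.
coord = np.array([[0.0, 0.0, 0.0,
                   1.5, 0.0, 0.0,
                   0.0, 1.5, 0.0]])              # shape (nframes, natoms * 3)
cell = np.diag([10.0, 10.0, 10.0]).reshape(1, 9)  # shape (nframes, 9)
atype = [0, 1, 0]                                 # indices into the model's type map

energy, force, virial = dp.eval(coord, cell, atype)
print("energy:", energy)
```

If this small evaluation fails in the same way, the problem lies in the frozen model or the backend itself; if it succeeds, the difference is more likely in how the LAMMPS interface drives the model (for example the `forward_lower` path, device placement, or memory).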
DeePMD-kit Version
DeePMD-kit v3.0.1
Backend and its version
PyTorch
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Error log:
#-------------------------------------------potential-------------------------------------
pair_style deepmd model.pth
Summary of lammps deepmd module ...
variable TEMP_DT equal ${DT}100
variable TEMP_DT equal 0.001100
#-------------------------PKA-----------------------
dump PKAdump CENTER atom 1 PKA.atom
run 0
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
@Article{Gissinger24,
author = {Jacob R. Gissinger, Ilia Nikiforov, Yaser Afshar, Brendon Waters, Moon-ki Choi, Daniel S. Karls, Alexander Stukowski, Wonpil Im, Hendrik Heinz, Axel Kohlmeyer, and Ellad B. Tadmor},
title = {Type Label Framework for Bonded Force Fields in LAMMPS},
journal = {J. Phys. Chem. B},
year = 2024,
volume = 128,
number = 13,
pages = {3282--3297}
}
@Article{Wang_ComputPhysCommun_2018_v228_p178,
author = {Wang, Han and Zhang, Linfeng and Han, Jiequn and E, Weinan},
doi = {10.1016/j.cpc.2018.03.016},
url = {https://doi.org/10.1016/j.cpc.2018.03.016},
year = 2018,
month = {jul},
publisher = {Elsevier {BV}},
volume = 228,
journal = {Comput. Phys. Commun.},
title = {{DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics}},
pages = {178--184}
}
@misc{Zeng_JChemPhys_2023_v159_p054801,
title = {{DeePMD-kit v2: A software package for deep potential models}},
author = {Jinzhe Zeng and Duo Zhang and Denghui Lu and Pinghui Mo and Zeyu Li
and Yixiao Chen and Mari{\'a}n Rynik and Li'ang Huang and Ziyao Li and
Shaochen Shi and Yingze Wang and Haotian Ye and Ping Tuo and Jiabin
Yang and Ye Ding and Yifan Li and Davide Tisi and Qiyu Zeng and Han
Bao and Yu Xia and Jiameng Huang and Koki Muraoka and Yibo Wang and
Junhan Chang and Fengbo Yuan and Sigbj{\o}rn L{\o}land Bore and Chun
Cai and Yinnian Lin and Bo Wang and Jiayan Xu and Jia-Xin Zhu and
Chenxing Luo and Yuzhi Zhang and Rhys E A Goodall and Wenshuo Liang
and Anurag Kumar Singh and Sikai Yao and Jingchao Zhang and Renata
Wentzcovitch and Jiequn Han and Jie Liu and Weile Jia and Darrin M
York and Weinan E and Roberto Car and Linfeng Zhang and Han Wang},
journal = {J. Chem. Phys.},
volume = 159,
issue = 5,
year = 2023,
pages = 054801,
doi = {10.1063/5.0155600},
}
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
WARNING: No fixes with time integration, atoms won't move (src/verlet.cpp:60)
Generated 0 of 21 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 28 28 28
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 54, in forward_lower
do_atomic_virial: bool=False) -> Dict[str, Tensor]:
_4 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, None, _4, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
model_predict = annotate(Dict[str, Tensor], {})
torch._set_item(model_predict, "atom_energy", model_ret["energy"])
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 214, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 48, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 123, in forward_atomic
type_map_model = torch.to(mapping_list[0], ops.prim.device(extended_atype))
_30 = annotate(List[Optional[Tensor]], [extended_atype])
_31 = (_0).forward_common_atomic(extended_coord0, torch.index(type_map_model, 30), nlists[0], mapping, fparam, aparam, None, )
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_32 = torch.append(_29, _31["energy"])
mapping_list0 = self.mapping_list
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 52, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 95, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/se_atten_v2.py", line 43, in forward
type_embedding0 = None
se_atten = self.se_atten
_2 = (se_atten).forward(nlist, extended_coord0, extended_atype, g1_ext, None, type_embedding0, )
~~~~~~~~~~~~~~~~~ <--- HERE
g1, g2, h2, rot_mat, sw, = _2
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/se_atten.py", line 219, in forward
_04 = getattr(networks2, "0")
gg_s = (_04).forward(ss, )
gg5 = torch.add(torch.mul(gg_s, gg_t3), gg_s)
~~~~~~~~~ <--- HERE
nnei5 = self.nn
Steps to Reproduce
Install DeePMD-kit v3.0.1 on a CentOS system, activate the deepmd-kit environment with conda, and then run the LAMMPS case attached below:
overlap_1073.zip
Further Information, Files, and Links
No response