
DeePMD-kit PyTorch backend error #4540

Closed · Labels: bug
Jeremy1189 opened this issue Jan 7, 2025 · 8 comments

Bug summary

A PyTorch model (model.pth) that was fine-tuned, distilled, and frozen runs well in a LAMMPS task on the Bohrium platform using DeePMD-kit version 3.0.1. However, it encounters a DeePMD-kit PyTorch backend error on a CentOS system. The error appears to originate from the file code/torch/deepmd/pt/model/model/dp_zbl_model.py.

Interestingly, this issue does not occur with a frozen model generated by DeePMD-kit version 3.0.0, which runs normally on the same CentOS system. The problem therefore seems related to a change introduced between version 3.0.0 and 3.0.1.
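A minimal stand-alone check (a sketch added for illustration, assuming only that model.pth is the frozen TorchScript archive described above) can load the model outside LAMMPS to see whether the failure reproduces in plain PyTorch:

```python
import torch

# Sketch: load the frozen TorchScript archive outside LAMMPS to check
# whether the backend error reproduces in plain PyTorch. "model.pth" is
# the frozen model described in this report.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("model.pth", map_location=device)
print(model)  # inspect the structure of the serialized module
```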

DeePMD-kit Version

DeePMD-kit v3.0.1

Backend and its version

PyTorch

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Error log:
#-------------------------------------------potential-------------------------------------
pair_style deepmd model.pth
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /home/boxiao/deepmd-kit
source:
source branch: HEAD
source commit: c314f1b
source commit at: 2024-12-23 16:45:06 -0800
support model ver.: 1.1
build variant: cuda
build with tf inc: /home/boxiao/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/home/boxiao/deepmd-kit/include
build with tf lib: /home/boxiao/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/home/boxiao/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so;/home/conda/feedstock_root/build_artifacts/deepmd-kit_1735001361510/_build_env/targets/x86_64-linux/lib/stubs/libcuda.so;/home/boxiao/deepmd-kit/lib/libnvrtc.so;/home/boxiao/deepmd-kit/lib/libnvToolsExt.so;/home/boxiao/deepmd-kit/lib/libcudart.so;/home/boxiao/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /home/boxiao/deepmd-kit
pair_coeff * * Ta Ti Al Cr Fe Ni Co
#-------------------------------------------run npt-------------------------------------
thermo 100
thermo_style custom step time dt temp elapsed cpu tpcpu pe etotal press vol
variable DT equal dt

variable TEMP_DT equal ${DT}*100
variable TEMP_DT equal 0.001*100
#-------------------------PKA-----------------------
dump PKAdump CENTER atom 1 PKA.atom
run 0

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

@Article{Gissinger24,
author = {Jacob R. Gissinger, Ilia Nikiforov, Yaser Afshar, Brendon Waters, Moon-ki Choi, Daniel S. Karls, Alexander Stukowski, Wonpil Im, Hendrik Heinz, Axel Kohlmeyer, and Ellad B. Tadmor},
title = {Type Label Framework for Bonded Force Fields in LAMMPS},
journal = {J. Phys. Chem. B},
year = 2024,
volume = 128,
number = 13,
pages = {3282--3297}
}

  • USER-DEEPMD package:

@Article{Wang_ComputPhysCommun_2018_v228_p178,
author = {Wang, Han and Zhang, Linfeng and Han, Jiequn and E, Weinan},
doi = {10.1016/j.cpc.2018.03.016},
url = {https://doi.org/10.1016/j.cpc.2018.03.016},
year = 2018,
month = {jul},
publisher = {Elsevier {BV}},
volume = 228,
journal = {Comput. Phys. Commun.},
title = {{DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics}},
pages = {178--184}
}
@misc{Zeng_JChemPhys_2023_v159_p054801,
title = {{DeePMD-kit v2: A software package for deep potential models}},
author = {Jinzhe Zeng and Duo Zhang and Denghui Lu and Pinghui Mo and Zeyu Li
and Yixiao Chen and Mari{'a}n Rynik and Li'ang Huang and Ziyao Li and
Shaochen Shi and Yingze Wang and Haotian Ye and Ping Tuo and Jiabin
Yang and Ye Ding and Yifan Li and Davide Tisi and Qiyu Zeng and Han
Bao and Yu Xia and Jiameng Huang and Koki Muraoka and Yibo Wang and
Junhan Chang and Fengbo Yuan and Sigbj{\o}rn L{\o}land Bore and Chun
Cai and Yinnian Lin and Bo Wang and Jiayan Xu and Jia-Xin Zhu and
Chenxing Luo and Yuzhi Zhang and Rhys E A Goodall and Wenshuo Liang
and Anurag Kumar Singh and Sikai Yao and Jingchao Zhang and Renata
Wentzcovitch and Jiequn Han and Jie Liu and Weile Jia and Darrin M
York and Weinan E and Roberto Car and Linfeng Zhang and Han Wang},
journal = {J. Chem. Phys.},
volume = 159,
issue = 5,
year = 2023,
pages = 054801,
doi = {10.1063/5.0155600},
}

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

WARNING: No fixes with time integration, atoms won't move (src/verlet.cpp:60)
Generated 0 of 21 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 28 28 28
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 54, in forward_lower
do_atomic_virial: bool=False) -> Dict[str, Tensor]:
_4 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, None, _4, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
model_predict = annotate(Dict[str, Tensor], {})
torch._set_item(model_predict, "atom_energy", model_ret["energy"])
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 214, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 48, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 123, in forward_atomic
type_map_model = torch.to(mapping_list[0], ops.prim.device(extended_atype))
_30 = annotate(List[Optional[Tensor]], [extended_atype])
_31 = (_0).forward_common_atomic(extended_coord0, torch.index(type_map_model, 30), nlists[0], mapping, fparam, aparam, None, )
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_32 = torch.append(_29, _31["energy"])
mapping_list0 = self.mapping_list
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 52, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 95, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/se_atten_v2.py", line 43, in forward
type_embedding0 = None
se_atten = self.se_atten
_2 = (se_atten).forward(nlist, extended_coord0, extended_atype, g1_ext, None, type_embedding0, )
~~~~~~~~~~~~~~~~~ <--- HERE
g1, g2, h2, rot_mat, sw, = _2
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/se_atten.py", line 219, in forward
_04 = getattr(networks2, "0")
gg_s = (_04).forward(ss, )
gg5 = torch.add(torch.mul(gg_s, gg_t3), gg_s)
~~~~~~~~~ <--- HERE
nnei5 = self.nn

Steps to Reproduce

Install DeePMD-kit v3.0.1 on a CentOS system, activate the deepmd-kit conda environment, and then run the LAMMPS case attached below:
overlap_1073.zip

Further Information, Files, and Links

No response

anyangml (Collaborator) commented Jan 7, 2025

We only added model compression to the ZBL model in v3.0.1. See #4432, #4423.

Jeremy1189 (Author) commented:

It seems the problem is related to the device. On my cluster, I am using a single A100 GPU, and the same error occurs when using 1 A100 on Bohrium. However, the issue disappears when using 2 A100 GPUs (previously, I was using 8 V100 GPUs).
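A quick environment check (a sketch added here as a suggestion, not part of the original comment) to compare what PyTorch sees on the two setups:

```python
import torch

# Print the PyTorch/CUDA stack and the visible GPUs, to compare the
# Bohrium machines against the CentOS cluster.
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("GPUs  :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```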

Jeremy1189 (Author) commented:

TorchScript computations may require all tensors to reside on the same device (e.g., GPU or CPU). If certain tensors are inadvertently placed on different devices (e.g., one on CPU and another on GPU), it may result in a runtime error like the one above.

Possible causes:
Data is not explicitly moved between devices as needed.
Model layers are not correctly initialized on the designated device.

Possible solution:
Ensure that both the model and the data are consistently moved to the same device during preprocessing and initialization. Use .to(device) to transfer all relevant tensors and models to the designated device, as in the sketch below.
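A minimal sketch of that suggestion (the model path and the input shape are placeholders, not values from this report):

```python
import torch

# Pick one explicit device and move both the scripted model and every
# input tensor onto it, so TorchScript never mixes CPU and GPU tensors.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.jit.load("model.pth", map_location=device)  # model on `device`
coords = torch.rand(1, 192, 3, dtype=torch.float64)       # placeholder input
coords = coords.to(device)                                # same device as model
# A call such as model(coords, ...) now sees tensors on a single device.
```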

njzjz (Member) commented Jan 9, 2025

Is your log truncated? LAMMPS should print something like Last command: pair_style ..., which can be seen in other logs.

Jeremy1189 (Author) commented Jan 9, 2025

No, the log is not truncated. I also downloaded the log file directly from the Bohrium platform (a test on version 3.0.0), and it matches exactly what I see from my local computer center. I've attached a ZIP file containing all the outputs from Bohrium for further diagnosis and analysis.
out.zip

The log from the Bohrium platform:

#-------------------------------------------potential-------------------------------------
pair_style deepmd model.pth
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /opt/deepmd-kit-3.0.0
source:
source branch: HEAD
source commit: b1be266
source commit at: 2024-11-23 01:37:55 -0800
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit-3.0.0/lib/python3.12/site-packages/tensorflow/include;/opt/deepmd-kit-3.0.0/include
build with tf lib: /opt/deepmd-kit-3.0.0/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit-3.0.0/lib/python3.12/site-packages/torch/lib/libc10.so;/home/conda/feedstock_root/build_artifacts/deepmd-kit_1732355244818/_build_env/targets/x86_64-linux/lib/stubs/libcuda.so;/opt/deepmd-kit-3.0.0/lib/libnvrtc.so;/opt/deepmd-kit-3.0.0/lib/libnvToolsExt.so;/opt/deepmd-kit-3.0.0/lib/libcudart.so;/opt/deepmd-kit-3.0.0/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /opt/deepmd-kit-3.0.0
pair_coeff * * Ta Ti Al Cr Fe Ni Co
#-------------------------------------------run npt-------------------------------------
thermo 100
thermo_style custom step time dt temp elapsed cpu tpcpu pe etotal press vol
variable DT equal dt

variable TEMP_DT equal ${DT}*100
variable TEMP_DT equal 0.001*100
#-------------------------PKA-----------------------
dump PKAdump CENTER atom 1 PKA.atom
run 0

[... CITE block identical to the first log above ...]

WARNING: No fixes with time integration, atoms won't move (src/verlet.cpp:60)
Generated 0 of 21 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 28 28 28
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 54, in forward_lower
do_atomic_virial: bool=False) -> Dict[str, Tensor]:
_4 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, None, _4, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
model_predict = annotate(Dict[str, Tensor], {})
torch._set_item(model_predict, "atom_energy", model_ret["energy"])
File "code/torch/deepmd/pt/model/model/dp_zbl_model.py", line 214, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 48, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/linear_atomic_model.py", line 123, in forward_atomic
type_map_model = torch.to(mapping_list[0], ops.prim.device(extended_atype))
_30 = annotate(List[Optional[Tensor]], [extended_atype])
_31 = (_0).forward_common_atomic(extended_coord0, torch.index(type_map_model, 30), nlists[0], mapping, fparam, aparam, None, )
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_32 = torch.append(_29, _31["energy"])
mapping_list0 = self.mapping_list
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 52, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 95, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
File "code/torch/deepmd/pt/model/descriptor/se_atten_v2.py", line 43, in forward
type_embedding0 = None
se_atten = self.se_atten
_2 = (se_atten).forward(nlist, extended_coord0, extended_atype, g1_ext, None, type_embedding0, )
~~~~~~~~~~~~~~~~~ <--- HERE
g1, g2, h2, rot_mat, sw, = _2
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/se_atten.py", line 219, in forward
_04 = getattr(networks2, "0")
gg_s = (_04).forward(ss, )
gg5 = torch.add(torch.mul(gg_s, gg_t3), gg_s)
~~~~~~~~~ <--- HERE
nnei5 = self.nnei
_35 = t

Jeremy1189 (Author) commented:

By the way, I noticed that the tests with one and two A100 GPUs on the Bohrium platform were run on dp-3.0.0, while the CentOS setup uses dp-3.0.1.

Jeremy1189 (Author) commented:

When I reduced the number of simulation atoms from 108,000 to 32,000, the error disappeared.

njzjz (Member) commented Jan 9, 2025

Perhaps it's due to running out of GPU memory.
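One rough way to test this hypothesis (a sketch, assuming a CUDA-enabled PyTorch; not a confirmed diagnosis) is to compare CUDA memory use against the card's total while scaling the system from 32,000 toward 108,000 atoms:

```python
import torch

# Report CUDA memory use relative to the device total (e.g., 40 or 80 GiB
# on an A100) to see whether the larger system simply exhausts the GPU.
def report_cuda_memory(tag: str) -> None:
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(0).total_memory / gib
    alloc = torch.cuda.memory_allocated() / gib
    reserved = torch.cuda.memory_reserved() / gib
    print(f"{tag}: {alloc:.2f} GiB allocated, "
          f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")

report_cuda_memory("after model load")
```

Watching nvidia-smi during the run would give the same information from outside the process.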
