Memory error outputting vcf #139
Comments
Yes, unfortunately, the Biopython package is not very memory-efficient. We're working on various approaches to address this. Were you running the simulation on a whole genome? I would pare the input FASTA down to just the chromosome of interest and see if that gets across the finish line.
…-Josh
________________________________
From: giobus75 ***@***.***>
Sent: Monday, February 3, 2025 4:16 AM
To: ncsa/NEAT ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [ncsa/NEAT] Memory error outputting vcf (Issue #139)
Hi,
I was checking the fix #138 <https://github.com/ncsa/NEAT/pull/138> for issue #126 <https://github.com/ncsa/NEAT/issues/126>.
I ran the read simulator (using the same command as specified in issue #126) with NEAT version 4.2.8. The process generated a 2.4MB VCF file containing data for chr1, but it failed with a MemoryError while writing the output. This occurred despite the machine having 378GB RAM.
Log Excerpt (Final Lines)
2025-01-24 09:39:07,008:INFO:neat.read_simulator.runner:Generating variants for HLA-DRB1*16:02:01
2025-01-24 09:39:07,015:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 0.00 minutes
2025-01-24 09:39:07,015:INFO:neat.read_simulator.utils.generate_variants:Added 10 mutations to HLA-DRB1*16:02:01
2025-01-24 09:39:07,015:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...
2025-01-24 09:39:07,497:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 0.01 m
2025-01-24 09:39:07,497:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 0.01 m
2025-01-24 09:39:07,498:INFO:neat.read_simulator.runner:Outputting golden vcf: /home/neat/simulated_stuff_golden.vcf.gz
2025-01-30 01:20:32,838:ERROR:neat:read-simulator failed, see the traceback below
Traceback (most recent call last):
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 339, in read_simulator_runner
    output_file_writer.write_final_vcf(local_variant_files, reference_index)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/output_file_writer.py", line 163, in write_final_vcf
    ref, alt = variants.get_ref_alt(variant, reference[contig])
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/File.py", line 227, in __getitem__
    record = self._proxy.get(self._offsets[key])
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/SeqIO/_index.py", line 52, in get
    return next(self._iterator(StringIO(self.get_raw(offset).decode())))
MemoryError
Question
Is it expected that writing results consumes so much memory? Could this be a bug or an inefficiency in the output handling?
Thank you!
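One possible reading of the traceback, not confirmed by the maintainers: the lookup `reference[contig]` sits on the same line that handles a single `variant`, and the frames below it show the index-backed mapping reading and decoding the whole record on that lookup. If that lookup runs once per variant, each variant would pay for re-reading its entire contig. The sketch below shows the general idea of fetching a contig once and reusing it. It is not NEAT's code: `CountingReference` and `ref_bases_per_variant` are hypothetical names, and the stand-in class only counts lookups instead of parsing a real FASTA.

```python
class CountingReference:
    """Hypothetical stand-in for an index-backed FASTA mapping.

    Every lookup is counted to mimic a mapping that re-reads and
    re-parses the whole record each time it is indexed.
    """

    def __init__(self, records):
        self._records = records
        self.loads = 0

    def __getitem__(self, contig):
        self.loads += 1  # one simulated full-record read per lookup
        return self._records[contig]


def ref_bases_per_variant(variants, reference):
    """Return reference bases for (contig, position, length) tuples.

    Assumes the variants are grouped by contig, so the most recently
    fetched sequence can be reused until the contig changes.
    """
    results = []
    cached_name, cached_seq = None, None
    for contig, pos, length in variants:
        if contig != cached_name:
            # Only hit the index when the contig actually changes.
            cached_name, cached_seq = contig, reference[contig]
        results.append(cached_seq[pos:pos + length])
    return results


reference = CountingReference({"chr1": "ACGTACGT", "chr2": "GGCCTTAA"})
variants = [("chr1", 0, 2), ("chr1", 4, 1), ("chr2", 2, 3)]
bases = ref_bases_per_variant(variants, reference)
```

Here three variants cause two lookups, one per contig. Whether this matches what happens inside `write_final_vcf` is an assumption based only on the traceback above.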
Ok, I'm gonna try the approach you suggested to get one chromosome per run.
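For anyone else trying the one-chromosome-per-run workaround, here is a minimal sketch of cutting a single contig out of a FASTA file with only the Python standard library. It streams the input line by line, so the whole genome is never held in memory. The function name `extract_contig` and the demo data are made up for illustration, and it assumes the contig ID is the first whitespace-delimited token on the header line.

```python
import io


def extract_contig(fasta_in, fasta_out, contig_id):
    """Copy only the record whose header ID equals contig_id.

    fasta_in and fasta_out are text-mode file-like objects.
    Returns True if the contig was found, False otherwise.
    """
    found = False
    writing = False
    for line in fasta_in:
        if line.startswith(">"):
            if found:
                break  # already copied the wanted record, stop reading
            # The ID is taken as the first token after '>'.
            fields = line[1:].split()
            writing = bool(fields) and fields[0] == contig_id
            found = writing
        if writing:
            fasta_out.write(line)
    return found


# Small in-memory demo instead of a real reference file.
demo = ">chr1 first\nACGT\nTTGA\n>chr2\nGGCC\n>HLA-DRB1*16:02:01\nAAAA\n"
out = io.StringIO()
found = extract_contig(io.StringIO(demo), out, "chr1")
chr1_only = out.getvalue()
```

With real files, the same function can be called on two handles opened with `open(...)` in text mode, and the simulator can then be pointed at the smaller FASTA, one run per chromosome.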
No current timeframe, as we have no current funding for this project. Getting it to run faster and more efficiently is our top priority, though.
Other options: you can try NEAT3, which is closer in structure to the original version, or NEAT2 (requires Python 2.X), the original. They are faster and a little more reliable than NEAT4 is proving to be. Check our release page for the older versions.