long_name missing leading chr number #62
Comments
Thanks for opening an issue for this, and I'm glad you like the module. The FASTA long_names are actually read from the FASTA file and the "short" names used as keys are read from the .fai index file. Without looking at your file I believe that there might be spaces at the end of the line before the chromosome 4 header line. I expect that this behavior falls somewhere between "bug" and "garbage in garbage out" but if you could point me to the exact file I would love to take a look and see if I can add a case for this in the pyfaidx code.
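To illustrate the distinction between the two kinds of names, here is a minimal sketch, not pyfaidx's actual code. It assumes the common convention that the "short" name is the first whitespace-delimited token of the header line and the long name is the whole header line without the leading `>`. The function name `split_defline` and the example header are hypothetical.

```python
from typing import Tuple


def split_defline(defline: str) -> Tuple[str, str]:
    """Return (short_name, long_name) for one FASTA header line.

    Assumption: the short name is the first whitespace-delimited token,
    which is the usual faidx-style convention. This is an illustration,
    not the pyfaidx implementation.
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # The long name is everything after '>' with the line ending removed.
    long_name = defline[1:].rstrip("\r\n")
    # The short name is the first token, or empty if the header is blank.
    tokens = long_name.split()
    short_name = tokens[0] if tokens else ""
    return short_name, long_name


# Hypothetical header line, not taken from GRCm38_68.fa.
short, long_ = split_defline(">4 hypothetical description\n")
print(short)
print(long_)
```

With a correct reader, the long name always starts with the short name, which is exactly the property that fails for chr 4 in this report.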
Got it... If the last line in the previous chr (i.e., the previous line) is a full 60 characters, then the long name of the new chr is missing the first character. There are two instances of this in the version of GRCm38 that I need to use. I don't know whether the current version of 38 has this characteristic, but I'm confident that you can now recreate the problem and easily find the fix. Thanks!
Not spaces, but the previous line. I put the reproducer in the case.
Blank lines are somewhat common in fasta files, so it would be a polite community service to deal with those gracefully (though most of the unofficial fasta file specs do warn against blank lines). This particular problem sounds like a different issue, though I figured it was a good opportunity to bring this up.
-Sarah
Yes, @swheelan, blank lines are definitely common, and currently they are handled during indexing such that only one line per FASTA entry may contain an inconsistency, and this line must be the last line of that entry. This accounts for the possibility of either a trailing short line before the next defline, or a blank line between sequences, or a blank line at the end of the file. This particular issue is something else though, and I've downloaded the GRCm38_68 (ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa) file so I can figure out the issue.
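The rule described here (only the last line of an entry may differ in length) can be sketched as follows. This is an illustrative stand-alone check under that stated rule, not the indexing code in pyfaidx; the function name `check_line_lengths` is hypothetical.

```python
from typing import Dict, Iterable, List


def check_line_lengths(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each record name to its sequence line lengths.

    Raises ValueError if any line other than the last line of a record
    has a length that differs from the record's first line. A blank
    line counts as a line of length 0, so it is only tolerated as the
    final line of a record.
    """
    lengths: Dict[str, List[int]] = {}
    name = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = []
        elif name is not None:
            lengths[name].append(len(line))
    for record, lens in lengths.items():
        body = lens[:-1]  # every line except the last must be uniform
        if body and any(n != body[0] for n in body):
            raise ValueError(
                f"record {record!r}: inconsistent line length before the last line"
            )
    return lengths


# A trailing short line and a blank line at the end are both accepted.
ok = [">a\n", "ACGT\n", "ACGT\n", "AC\n", ">b\n", "ACGT\n", "\n"]
print(check_line_lengths(ok))
```

Note that a short line followed by a blank line inside the same record would be rejected by this sketch, since that makes two inconsistent lines in one entry.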
Thanks, Matt. I don't think you need that file, however. Just make the last line of a chromosome a full 60 bases in any fasta you have lying around. The next chr's long_name will be missing the first character.
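A small reproducer file along these lines could be generated as sketched below. The record names, descriptions, and sequences are made up for illustration; the only property that matters is that the first record's last sequence line is a full 60 characters. The helper names `write_reproducer` and `read_deflines` are hypothetical.

```python
LINE_WIDTH = 60  # line width mentioned in the thread


def write_reproducer(path: str) -> None:
    """Write a two-record FASTA whose first record ends on a full-width line."""
    with open(path, "w", newline="\n") as handle:
        handle.write(">1 first hypothetical record\n")
        handle.write("A" * LINE_WIDTH + "\n")
        # Last line of record 1 is a full 60 bases: the triggering condition.
        handle.write("C" * LINE_WIDTH + "\n")
        handle.write(">2 second hypothetical record\n")
        handle.write("G" * LINE_WIDTH + "\n")
        handle.write("T" * 25 + "\n")


def read_deflines(path: str) -> list:
    """Return the header lines (without '>') as they appear in the file."""
    with open(path) as handle:
        return [line.rstrip("\n")[1:] for line in handle if line.startswith(">")]
```

With pyfaidx installed, the value of `Fasta(path)["2"].long_name` can then be compared against `read_deflines(path)[1]`. According to this thread, affected versions return the long name of the second record without its first character.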
This is fixed and will be included in the v0.3.9 release.
I think I may have found a bug. I can work around it, but thought you would want to fix it. It is really weird.
fasta file: mouse NCBI 38: GRCm38_68.fa
Symptom: on one record (chr 4), the long_name is missing the leading chr number. All the others are OK.
Hope the listings below help. (and thanks VERY much for pyfaidx…)
-Al
Confirm that the record is correct in the file:
$ awk '/^>4/{print $0; exit;}' GRCm38_68.fa
Open the file and see the top level:
See that the long name is OK for chr 3:
See that chr 4 is not OK:
The next two listings are longish, so I describe them here: first a listing of all the names, then all the long names. The names are OK for all chrs. The long names are all OK except for 4.
Now the long names