Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

long_name missing leading chr number #62

Closed
AlSimonsJax opened this issue Apr 1, 2015 · 7 comments
Closed

long_name missing leading chr number #62

AlSimonsJax opened this issue Apr 1, 2015 · 7 comments
Assignees
Labels

Comments

@AlSimonsJax
Copy link

I think I may have found a bug. I can work around it, but thought you would want to fix it. It is really weird.

fasta file: mouse NCBI 38: GRCm38_68.fa

Symptom: on one record (chr 4), the long_name is missing the leading chr number. All the others are OK.

Hope the listings below help. (and thanks VERY much for pyfaidx…)
-Al

Confirm that the record is correct in the file:

$ awk '/^>4/{print $0; exit;}' GRCm38_68.fa

4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

Open the file and see the top level:

from pyfaidx import Fasta
fa = Fasta('GRCm38_68.fa')
fa.keys()
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'JH584295.1', 'JH584292.1', 'GL456368.1', 'GL456396.1', 'GL456359.1', 'GL456382.1', 'GL456392.1', 'GL456394.1', 'GL456390.1', 'GL456387.1', 'GL456381.1', 'GL456370.1', 'GL456372.1', 'GL456389.1', 'GL456378.1', 'GL456360.1', 'GL456385.1', 'GL456383.1', 'GL456213.1', 'GL456239.1', 'GL456367.1', 'GL456366.1', 'GL456393.1', 'GL456216.1', 'GL456379.1', 'JH584304.1', 'GL456212.1', 'JH584302.1', 'JH584303.1', 'GL456210.1', 'GL456219.1', 'JH584300.1', 'JH584298.1', 'JH584294.1', 'GL456354.1', 'JH584296.1', 'JH584297.1', 'GL456221.1', 'JH584293.1', 'GL456350.1', 'GL456211.1', 'JH584301.1', 'GL456233.1', 'JH584299.1']

See that the long name is OK for chr 3:

print fa['3'].long_name
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF

See that chr 4 is not OK:

print fa['4'].long_name
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

The next two listings are longish, so I describe them here: first a listing of all the names, then all the long names. The names are OK for all chrs. The long names are all OK except for 4.

for record in fa:
... print record.name
...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
MT
X
Y
JH584295.1
JH584292.1
GL456368.1
GL456396.1
GL456359.1
GL456382.1
GL456392.1
GL456394.1
GL456390.1
GL456387.1
GL456381.1
GL456370.1
GL456372.1
GL456389.1
GL456378.1
GL456360.1
GL456385.1
GL456383.1
GL456213.1
GL456239.1
GL456367.1
GL456366.1
GL456393.1
GL456216.1
GL456379.1
JH584304.1
GL456212.1
JH584302.1
JH584303.1
GL456210.1
GL456219.1
JH584300.1
JH584298.1
JH584294.1
GL456354.1
JH584296.1
JH584297.1
GL456221.1
JH584293.1
GL456350.1
GL456211.1
JH584301.1
GL456233.1
JH584299.1

Now the long names

for record in fa:
... print record.long_name
...
1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
JH584295.1 dna:scaffold scaffold:GRCm38:JH584295.1:1:1976:1 REF
JH584292.1 dna:scaffold scaffold:GRCm38:JH584292.1:1:14945:1 REF
GL456368.1 dna:scaffold scaffold:GRCm38:GL456368.1:1:20208:1 REF
GL456396.1 dna:scaffold scaffold:GRCm38:GL456396.1:1:21240:1 REF
L456359.1 dna:scaffold scaffold:GRCm38:GL456359.1:1:22974:1 REF
GL456382.1 dna:scaffold scaffold:GRCm38:GL456382.1:1:23158:1 REF
GL456392.1 dna:scaffold scaffold:GRCm38:GL456392.1:1:23629:1 REF
GL456394.1 dna:scaffold scaffold:GRCm38:GL456394.1:1:24323:1 REF
GL456390.1 dna:scaffold scaffold:GRCm38:GL456390.1:1:24668:1 REF
GL456387.1 dna:scaffold scaffold:GRCm38:GL456387.1:1:24685:1 REF
GL456381.1 dna:scaffold scaffold:GRCm38:GL456381.1:1:25871:1 REF
GL456370.1 dna:scaffold scaffold:GRCm38:GL456370.1:1:26764:1 REF
GL456372.1 dna:scaffold scaffold:GRCm38:GL456372.1:1:28664:1 REF
GL456389.1 dna:scaffold scaffold:GRCm38:GL456389.1:1:28772:1 REF
GL456378.1 dna:scaffold scaffold:GRCm38:GL456378.1:1:31602:1 REF
GL456360.1 dna:scaffold scaffold:GRCm38:GL456360.1:1:31704:1 REF
GL456385.1 dna:scaffold scaffold:GRCm38:GL456385.1:1:35240:1 REF
GL456383.1 dna:scaffold scaffold:GRCm38:GL456383.1:1:38659:1 REF
GL456213.1 dna:scaffold scaffold:GRCm38:GL456213.1:1:39340:1 REF
GL456239.1 dna:scaffold scaffold:GRCm38:GL456239.1:1:40056:1 REF
GL456367.1 dna:scaffold scaffold:GRCm38:GL456367.1:1:42057:1 REF
GL456366.1 dna:scaffold scaffold:GRCm38:GL456366.1:1:47073:1 REF
GL456393.1 dna:scaffold scaffold:GRCm38:GL456393.1:1:55711:1 REF
GL456216.1 dna:scaffold scaffold:GRCm38:GL456216.1:1:66673:1 REF
GL456379.1 dna:scaffold scaffold:GRCm38:GL456379.1:1:72385:1 REF
JH584304.1 dna:scaffold scaffold:GRCm38:JH584304.1:1:114452:1 REF
GL456212.1 dna:scaffold scaffold:GRCm38:GL456212.1:1:153618:1 REF
JH584302.1 dna:scaffold scaffold:GRCm38:JH584302.1:1:155838:1 REF
JH584303.1 dna:scaffold scaffold:GRCm38:JH584303.1:1:158099:1 REF
GL456210.1 dna:scaffold scaffold:GRCm38:GL456210.1:1:169725:1 REF
GL456219.1 dna:scaffold scaffold:GRCm38:GL456219.1:1:175968:1 REF
JH584300.1 dna:scaffold scaffold:GRCm38:JH584300.1:1:182347:1 REF
JH584298.1 dna:scaffold scaffold:GRCm38:JH584298.1:1:184189:1 REF
JH584294.1 dna:scaffold scaffold:GRCm38:JH584294.1:1:191905:1 REF
GL456354.1 dna:scaffold scaffold:GRCm38:GL456354.1:1:195993:1 REF
JH584296.1 dna:scaffold scaffold:GRCm38:JH584296.1:1:199368:1 REF
JH584297.1 dna:scaffold scaffold:GRCm38:JH584297.1:1:205776:1 REF
GL456221.1 dna:scaffold scaffold:GRCm38:GL456221.1:1:206961:1 REF
JH584293.1 dna:scaffold scaffold:GRCm38:JH584293.1:1:207968:1 REF
GL456350.1 dna:scaffold scaffold:GRCm38:GL456350.1:1:227966:1 REF
GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF
JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF
GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF
JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF

@mdshw5
Copy link
Owner

mdshw5 commented Apr 1, 2015

Thanks for opening an issue for this, and I'm glad you like the module. The fasta long_names are actually read from the FASTA file and the "short" names used as keys are read from the .fai index file. Without looking at your file I believe that there might be spaces at the end of the line before the chromosome 4 header line. I expect that this behavior falls somewhere between "bug" and "garbage in garbage out" but if you could point me to the exact file I would love to take a look and see if I can add a case for this in the pyfaidx code.

@AlSimonsJax
Copy link
Author

Got it...

If the last line in the previous chr (i.e., the previous line) is a full 60 characters, then the long name of the new chr is missing the first character. There are two instances of this in the version of GRCm38 that I need to use. I don't know whether the current version of 38 has this characteristic, but I'm confident that you can now recreate the problem and easily find the fix. Thanks!
-Al

@AlSimonsJax
Copy link
Author

Not spaces, but the previous line. I put the reproducer in the case.
-Al

From: Matt Shirley <[email protected]mailto:[email protected]>
Reply-To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Date: Wednesday, April 1, 2015 at 7:45 PM
To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Cc: Al Simons <[email protected]mailto:[email protected]>
Subject: Re: [pyfaidx] long_name missing leading chr number (#62)

Thanks for opening an issue for this, and I'm glad you like the module. The fasta long_names are actually read from the FASTA file and the "short" names used as keys are read from the .fai index file. Without looking at your file I believe that there might be spaces at the end of the line before the chromosome 4 header line. I expect that this behavior falls somewhere between "bug" and "garbage in garbage out" but if you could point me to the exact file I would love to take a look and see if I can add a case for this in the pyfaidx code.

Reply to this email directly or view it on GitHubhttps://github.com//issues/62#issuecomment-88665193.

The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.

@swheelan
Copy link
Collaborator

swheelan commented Apr 2, 2015

Blank lines are somewhat common in fasta files, so it would be a polite community service to deal with those gracefully (though most of the unofficial fasta file specs do warn against blank lines).

This particular problem sounds like a different issue, though I figured it was a good opportunity to bring this up.

-Sarah

On Apr 1, 2015, at 9:30 PM, Al Simons [email protected] wrote:

Not spaces, but the previous line. I put the reproducer in the case.
-Al

From: Matt Shirley <[email protected]mailto:[email protected]>
Reply-To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Date: Wednesday, April 1, 2015 at 7:45 PM
To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Cc: Al Simons <[email protected]mailto:[email protected]>
Subject: Re: [pyfaidx] long_name missing leading chr number (#62)

Thanks for opening an issue for this, and I'm glad you like the module. The fasta long_names are actually read from the FASTA file and the "short" names used as keys are read from the .fai index file. Without looking at your file I believe that there might be spaces at the end of the line before the chromosome 4 header line. I expect that this behavior falls somewhere between "bug" and "garbage in garbage out" but if you could point me to the exact file I would love to take a look and see if I can add a case for this in the pyfaidx code.

Reply to this email directly or view it on GitHubhttps://github.com//issues/62#issuecomment-88665193.

The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.

Reply to this email directly or view it on GitHub.

@mdshw5
Copy link
Owner

mdshw5 commented Apr 2, 2015

Yes, @swheelan, blank lines are definitely common, and currently they are handled during indexing such that only one line per FASTA entry may contain an inconsistency, and this line must be the last line of that entry. This accounts for the possibility of either a trailing short line before the next defline, or a blank line between sequences, or a blank line at the end of the file.

This particular issue is something else though, and I've downloaded the GRCm38_68 (ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa) file so I can figure out the issue.

@mdshw5 mdshw5 added the bug label Apr 2, 2015
@mdshw5 mdshw5 self-assigned this Apr 2, 2015
@AlSimonsJax
Copy link
Author

Thanks, Matt. I don't think you need that file, however. Just make the last line of a chromosome a full 60 bases in any fasta you have laying around. The next chr's long_name will be missing the first character.

From: Matt Shirley <[email protected]mailto:[email protected]>
Reply-To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Date: Thursday, April 2, 2015 at 9:21 AM
To: mdshw5/pyfaidx <[email protected]mailto:[email protected]>
Cc: Al Simons <[email protected]mailto:[email protected]>
Subject: Re: [pyfaidx] long_name missing leading chr number (#62)

Yes, @swheelanhttps://github.com/swheelan, blank lines are definitely common, and currently they are handled during indexinghttps://github.com/mdshw5/pyfaidx/blob/master/pyfaidx/init.py#L256 such that only one line per FASTA entry may contain an inconsistency, and this line must be the last line of that entry. This accounts for the possibility of either a trailing short line before the next defline, or a blank line between sequences, or a blank line at the end of the file.

This particular issue is something else though, and I've downloaded the GRCm38_68.fa file so I can figure out the issue.

Reply to this email directly or view it on GitHubhttps://github.com//issues/62#issuecomment-88894094.

The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.

mdshw5 added a commit that referenced this issue Apr 2, 2015
@mdshw5
Copy link
Owner

mdshw5 commented Apr 5, 2015

This is fixed and will be included in the v0.3.9 release.

@mdshw5 mdshw5 closed this as completed Apr 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants