Map all PRO terms used in CL to uniprot (where possible). #2293

Open
dosumis opened this issue Feb 23, 2024 · 9 comments
@dosumis
Contributor

dosumis commented Feb 23, 2024

We need to be able to map PRO terms used by CL to something the rest of the world can use. I think that means UniProt. Xrefs to UniProt are rare:

https://api.triplydb.com/s/tuAThwx4i

We mostly have xrefs to

  • PIR - which often has the mappings we need, but AFAIK has no API - so we'd need to scrape?
  • IUPHAR - need to research how we might use this.

Where we can't map based on ID, I think we may need to resort to lexical mapping. One option for this is GILDA.
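As a rough illustration of what lexical mapping could look like, here is a minimal sketch of exact matching on normalized labels. It is not GILDA and does not use its API; the function names are hypothetical, and the tiny name table reuses labels and accessions that appear later in this thread purely as sample data.

```python
import re


def normalize(label: str) -> str:
    """Lowercase, turn runs of punctuation into single spaces, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def build_index(names_by_accession: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized name to the set of accessions that carry it."""
    index: dict[str, set[str]] = {}
    for accession, names in names_by_accession.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(accession)
    return index


def lexical_match(pro_label: str, index: dict[str, set[str]]) -> set[str]:
    """Return accessions whose normalized name equals the normalized PRO label."""
    return index.get(normalize(pro_label), set())


# Sample data only: a real run would load names and synonyms from UniProt.
names_by_accession = {
    "P56528": ["ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1"],
    "P35329": ["B-cell receptor CD22"],
}
index = build_index(names_by_accession)
matches = lexical_match("B cell receptor CD22", index)
```

Exact matching after normalization only catches punctuation and case differences. Anything fuzzier would need synonym tables or a dedicated grounding tool.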

@addiehl - any other suggestions based on your prior work on these + other linked resources?

@dosumis dosumis added the tech label Feb 23, 2024
@dosumis
Contributor Author

dosumis commented Feb 23, 2024

@cmungall - any suggestions for strategy?

@addiehl
Contributor

addiehl commented Feb 23, 2024

It might be useful to ask Darren @nataled

@nataled

nataled commented Feb 23, 2024

I'll overlook the "something the rest of the world can use" comment ;)

The results of that SPARQL query fall into two types:

  1. The xref points to a protein family. These are cases where the PRO term was created on the basis of the indicated xref at the time the term was created. Prefixes include:
    PIRSF: https://proteininformationresource.org/cgi-bin/ipcSF?id=
    PANTHER: http://www.pantherdb.org/panther/family.do?clsAccession=
    IUPHARfam: http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?familyId=
    IUPHARobj: http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=

  2. The xref points to a specific protein or proteoform. For all these, the DTO and Reactome xrefs are superfluous in that they also have a UniProtKB xref. Prefixes include:
    UniProtKB: http://purl.uniprot.org/uniprot/
    DTO: http://www.drugtargetontology.org/dto/DTO_
    Reactome: http://www.reactome.org/content/detail/

For the first set, no single UniProtKB mapping is appropriate. Are you trying to obtain all the possible UniProtKB entries pertinent to those xrefs?

@dosumis
Contributor Author

dosumis commented Feb 24, 2024

@nataled - many thanks for the details.

Various uses. In general, including IDs that bioinformaticians are familiar with opens up more possibilities for them to use markers recorded in CL in their analyses.

More specifically, we're working on a Cell Type knowledge base with a focus on cell markers in human and mouse. We have other sources of known and potential markers - curated and computed. I'd like to find some way to fold in curated cell surface markers from CL.

It looks to me like in most cases 'family' here means a general term for the gene across species.

| i | pro_label | PRO ID | xref |
| --- | --- | --- | --- |
| 1 | CD19 molecule | obo:PR_000001002 | IUPHARobj:2764 |
| 2 | CD19 molecule | obo:PR_000001002 | PIRSF:PIRSF016630 |

It also looks like we could pull the mouse and human UniProt IDs from the PIR pages: https://proteininformationresource.org/cgi-bin/ipcSF?id=PIRSF016630. Is there an API option? If not, we will scrape. This will work for our KB plans. I think it would also be useful to include these IDs in CL under some annotation property (AP).
 

@dosumis
Contributor Author

dosumis commented Feb 24, 2024

Seems we can use the structure of PRO to extract many of these, e.g.

https://api.triplydb.com/s/WGSZidIVe

| PRO - CL Marker | Mouse specific subclass | mouse xref |
| --- | --- | --- |
| ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 | ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (mouse) | UniProtKB:P56528 |
| B-cell lymphoma 6 protein | B-cell lymphoma 6 protein homolog (mouse) | UniProtKB:P41183 |
| B-cell receptor CD22 | B-cell receptor CD22 (mouse) | UniProtKB:P35329 |
| C-C chemokine receptor type 1 | C-C chemokine receptor type 1 (mouse) | UniProtKB:P51675 |
| C-C chemokine receptor type 2 | C-C chemokine receptor type 2 (mouse) | UniProtKB:P51683 |

The subclasses are not (currently) in the import, and even if they were, we should still find some way to better support bioinformatician users. From looking at the numbers, this won't work in every case, but it is a good start.

Suggested mechanism to extract:

For all PRO terms used as markers for CL terms:

  • Look for a UniProt xref.
  • If there is no UniProt xref: find the immediate subclasses for mouse and human and extract their UniProt xrefs. The assumption is that direct subclasses link to the record for the protein in general ("representative isoform"?) rather than to specific isoforms.
  • ... some other strategy for remaining terms.
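The first two steps could be sketched roughly as follows. This is an illustration over a hypothetical in-memory structure (`ProTerm`, `map_marker`, and the `EX:` IDs are all made up), and it assumes the species of a subclass can be read off a "(human)" or "(mouse)" label suffix, as in the table above. A real implementation would more likely query the ontology for taxon restrictions.

```python
from dataclasses import dataclass, field


@dataclass
class ProTerm:
    label: str
    xrefs: list[str] = field(default_factory=list)
    subclasses: list[str] = field(default_factory=list)  # IDs of direct subclasses


def uniprot_xrefs(term: ProTerm) -> list[str]:
    """Keep only the xrefs that point at UniProtKB."""
    return [x for x in term.xrefs if x.startswith("UniProtKB:")]


def map_marker(term_id: str, terms: dict[str, ProTerm]) -> tuple[str, list[str]]:
    """Return (strategy, xrefs) for one PRO term used as a CL marker."""
    term = terms[term_id]
    direct = uniprot_xrefs(term)
    if direct:  # step 1: the term carries its own UniProt xref
        return "direct", direct
    inherited: list[str] = []
    for child_id in term.subclasses:  # step 2: direct subclasses only
        child = terms.get(child_id)
        # Assumption: species is recognizable from the label suffix.
        if child is not None and child.label.endswith(("(human)", "(mouse)")):
            inherited.extend(uniprot_xrefs(child))
    if inherited:
        return "subclass", inherited
    return "unmapped", []  # step 3: left for some other strategy


# Sample data using one row of the table above; the EX: IDs are placeholders.
terms = {
    "EX:cd22": ProTerm("B-cell receptor CD22", subclasses=["EX:cd22_mouse"]),
    "EX:cd22_mouse": ProTerm("B-cell receptor CD22 (mouse)", xrefs=["UniProtKB:P35329"]),
}
result = map_marker("EX:cd22", terms)
```

Returning the strategy alongside the xrefs makes it easy to count how many markers fall through to the "unmapped" bucket.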

TBD: Accessible representation in CL.

@dosumis
Contributor Author

dosumis commented Feb 26, 2024

CC @AvolaAmg

@cmungall
Member

cmungall commented Feb 26, 2024 via email

@nataled

nataled commented Feb 26, 2024

The file containing PIRSF membership can be found at https://proteininformationresource.org/projects/pirsf/. Note that the identifiers in this file don't contain 'PIR' (so, 'SF001234' instead of 'PIRSF001234'). This file goes beyond human and mouse, if that's what you need. If you only want human and mouse, then you can use our 'descendants' API for PRO:

https://lod.proconsortium.org/api.html#/DAG/getDescendantByProIDs

which is part of a larger set of APIs given here:

https://lod.proconsortium.org/api.html

You'll want to focus on the terms with local IDs that have UniProtKB accessions without a dash.
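Two small helpers illustrate the identifier handling described in this comment. This is a sketch under assumptions: that the terms in question have IDs of the form `PR:<UniProtKB accession>`, with isoform terms adding a dash and number, and that accessions follow the pattern published in the UniProt documentation. The function names and the isoform ID in the example are hypothetical.

```python
import re

# Accession pattern as given in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)


def add_pir_prefix(sf_id: str) -> str:
    """Turn 'SF001234' from the membership file into 'PIRSF001234'."""
    return sf_id if sf_id.startswith("PIR") else "PIR" + sf_id


def is_undashed_uniprot_term(pro_id: str) -> bool:
    """True if the local part of the ID is a bare UniProtKB accession (no dash)."""
    local_id = pro_id.split(":", 1)[-1]
    return UNIPROT_ACCESSION.fullmatch(local_id) is not None


descendants = ["PR:000001002", "PR:P56528", "PR:P56528-2"]  # illustrative IDs
kept = [d for d in descendants if is_undashed_uniprot_term(d)]
```

Filtering the descendants list this way drops both the numeric (non-UniProt-based) terms and the dashed ones.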


This issue has not seen any activity in the past 6 months; it will be closed automatically in one year from now if no action is taken.
