
Phenotypic similarity

Orion Buske edited this page Feb 21, 2016 · 2 revisions

Currently, the Elasticsearch relevance score, normalized to the range [0, 1], is used directly as the phenotypic similarity score. Because cases are indexed together with their phenotypes (and the ancestors of those phenotypes), we can query the Elasticsearch index directly to quickly fetch the most similar cases. This should scale efficiently to a very large number of cases.

From datastore.py:

result = self._db.search(index=self._index, body=query)
scored_patients = []
for hit in result['hits']['hits'][:n]:
    # Just use the ElasticSearch TF/IDF score, normalized to [0, 1]
    score = 1 - 1 / (1 + hit['_score'])
    scored_patients.append((score, Patient(hit['_source']['doc'])))
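
To make the normalization concrete, here is a minimal standalone sketch of the same mapping, 1 - 1 / (1 + score). The function name `normalize_score` is hypothetical and does not appear in datastore.py; the sketch also assumes raw scores are non-negative.

```python
def normalize_score(raw_score):
    """Map a non-negative raw score onto [0, 1) using 1 - 1 / (1 + s).

    Hypothetical helper for illustration; it mirrors the expression
    used in the datastore.py excerpt above.
    """
    if raw_score < 0:
        raise ValueError("raw score must be non-negative")
    return 1 - 1 / (1 + raw_score)


if __name__ == "__main__":
    # A raw score of 0 maps to 0, and larger raw scores approach 1.
    for raw in (0.0, 1.0, 3.0, 9.0):
        print(raw, normalize_score(raw))
```

The mapping is strictly increasing, so it preserves the ranking that Elasticsearch returns, and no finite raw score maps to exactly 1.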

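A simpler approach would be to iterate over all cases in the database and compute a phenotypic similarity score directly (e.g. the UI score: the number of implied phenotypes the two patients share, divided by the number of implied phenotypes present in either). This requires one comparison per case for every query.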

Here is some pseudocode for how this might work:

query_patient = Patient(...)
# Get the set of all the patient's phenotypes and their ancestors (the induced HPO subgraph)
query_phenotypes = query_patient._get_implied_present_phenotypes()
scored_patients = []
for match_patient in database:
    match_phenotypes = match_patient._get_implied_present_phenotypes()
    # The UI score: size of the intersection divided by size of the union
    shared = query_phenotypes.intersection(match_phenotypes)
    combined = query_phenotypes.union(match_phenotypes)
    # float() avoids integer division on Python 2; an empty union scores 0
    score = len(shared) / float(len(combined)) if combined else 0.0
    scored_patients.append((score, match_patient))
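
For a version that runs on its own, here is a small sketch of the same idea on a toy hierarchy. Everything in it is an assumption made for illustration: the term names, the `PARENTS` mapping, and the helpers `implied_terms`, `ui_score`, and `rank` are hypothetical and do not come from the codebase, and patients are reduced to plain lists of terms.

```python
# Toy is-a hierarchy (child -> parents). The term names are made up;
# a real deployment would load the HPO instead.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}


def implied_terms(terms):
    """Return the given terms plus all of their ancestors."""
    implied = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in implied:
            implied.add(term)
            stack.extend(PARENTS[term])
    return implied


def ui_score(query_terms, match_terms):
    """Size of the intersection over size of the union of implied terms."""
    query = implied_terms(query_terms)
    match = implied_terms(match_terms)
    union = query | match
    return len(query & match) / len(union) if union else 0.0


def rank(query_terms, database, n=5):
    """Score every patient in the database and return the top n."""
    scored = [(ui_score(query_terms, terms), pid)
              for pid, terms in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


database = {"p1": ["A1"], "p2": ["A2"], "p3": ["B1"]}
ranked = rank(["A1"], database)
print(ranked)  # [(1.0, 'p1'), (0.5, 'p2'), (0.2, 'p3')]
```

Including ancestors is what lets `p2` score above `p3` here: `A1` and `A2` are different terms, but they share the parent `A`, while `B1` shares only the root with the query.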