Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[App Performance] Profile the time it takes for the app to load the configs and patient data from the local DB. #2066

Closed
11 tasks
dubdabasoduba opened this issue Feb 20, 2023 · 23 comments · Fixed by #2592
Assignees
Labels
Discussion This is an open discussion that may or may not lead to actionable points Performance (App or Server)

Comments

@dubdabasoduba
Copy link
Member

dubdabasoduba commented Feb 20, 2023

Describe the enhancement

  • The FHIR core performs a comprehensive initial sync. This sync gets the following items.
    • Application Composition resource (Based on the Application ID)
    • Application Congis referenced on the Composition
    • Patient data based on the sync configs

Implementation

  • Review the time and device resources it takes for the application to query the local DB to fetch the following pieces of data.
    • The time and device resources to get the Configs used by the app views.
    • The time and device resources to get the Patient data. Here we can look at 3 different resources
      • Patient resource.
      • Group resource.
      • Task resource.
      • Observation resource.
      • CarePlan resource.
      • MeasureReports resource.

Device configuration

  • RAM 3GB
  • ROM 64GB

Expected results.

  • Profiling documentation can be done by recording the following
    • Time and device resources it takes for each data fetch.
    • Amount of resources used by the device for each request. i.e RAM & CPU
@dubdabasoduba dubdabasoduba added Discussion This is an open discussion that may or may not lead to actionable points Performance (App or Server) labels Feb 20, 2023
@pld
Copy link
Member

pld commented Feb 21, 2023

How do we close this? How about through a new test that fails if this request takes longer than some very lenient amount of time

@dubdabasoduba
Copy link
Member Author

dubdabasoduba commented Feb 22, 2023

How do we close this? How about through a new test that fails if this request takes longer than some very lenient amount of time

Makes sense. I think we add that. We also want to record the times and device resources and then use that as a baseline in order to validate improvements made to the code.

@pld
Copy link
Member

pld commented Feb 22, 2023

Cool nb if we need to add example FHIR resources into this repo to support content for testing, I think that is appropriate

@ndegwamartin
Copy link
Contributor

@pld @dubdabasoduba is this ticket about profiling to get the current metrics or about writing the integration test such that it passes or fails if it exceeds some period x?

@pld
Copy link
Member

pld commented Mar 14, 2023

I think it should be about the latter, which would also mean we have to do the former, at least for a specific code path, does that make sense?

@ekigamba ekigamba assigned ekigamba and unassigned ekigamba Mar 16, 2023
@ekigamba
Copy link
Contributor

ekigamba commented Mar 17, 2023

Loading times as tested on the Blu G60 with 3 GB RAM, 64 GB ROM, Octa-core 1.6 GHz and running on Android 9

The family register page currently takes at least 60 seconds and a maximum of 3 minutes to show 25 family items.
Each register page loads 25 items.

Time taken to load the register

Register Page Time Total count on the app
Household 60 secs to 3 mins 60
Children 12 sec 219
Sick child 8 secs 133
ANC 5 secs 114
PNC 6 secs 31
FP 5 secs 149
HIV 5 secs 25

Time taken to load profile pages

Profile Page Time
Household 12 - 17 secs
Children 10 sec
Sick child 8 secs
ANC 8 secs

Optimizations tried/made on the household page

  • Reduce calls to loadRegister. Calls to loadRegister was happening ~3 times instead of once whenever the app was openned. The method RegisterViewModel.retrieveRegisterUiState() seems to be calling paginateRegisterData() multiple times
  • Use count queries instead of loading the related resources that needed to be counted. The task fetch queries were the slowest because there are 8K tasks and we need to fetch tasks for each family member. These queries each took around 100 ms
  • Write raw queries instead of using the FHIR Engine search API. I had to add a method inside FhirEngine to allow raw queries and return a raw Cursor
  • Rewrite the register queries to cache values such as family member resourceUuid and familyUuid which is the unique identifier on the DB
  • Create a single normalized table to store the register data only

Time taken after improvements (household register only)

  • Time taken after database query optimizations ~ 7 seconds
  • Time taken after using a single RegisterFamilies table ~ 300 ms where the optimized queries update the table in the background taking 7 - 9 seconds. This however completely breaks configurability and would need more time na complexity to work for all the register pages and all registers

Table to store the RegisterFamilies and their data

CREATE TABLE IF NOT EXISTS "RegisterFamilies" (
	"resourceUuid"	BLOB UNIQUE,
	"lastUpdated"	INTEGER,
	"childCount" INTEGER,
	"taskCount" INTEGER,
	"taskStatus" TEXT,
	"pregnantWomenCount" INTEGER,
	"familyName" TEXT,
	"householdNo" TEXT,
	"householdLocation" TEXT,
	PRIMARY KEY("resourceUuid")
);

INDEX TO SPEED UP SORT BY lastUpdated

CREATE INDEX "index_RegisterFamilies_lastUpdated" ON "RegisterFamilies" (
	"lastUpdated"
);

QUERY TO UPDATE THE lastUpdated in RegisterFamilies

INSERT INTO RegisterFamilies (resourceUuid, lastUpdated)
SELECT a.resourceUuid, c.index_from
FROM ResourceEntity a
LEFT JOIN DateIndexEntity b
ON a.resourceType = b.resourceType AND a.resourceUuid = b.resourceUuid AND b.index_name = "_lastUpdated"
LEFT JOIN DateTimeIndexEntity c
ON a.resourceType = c.resourceType AND a.resourceUuid = c.resourceUuid AND c.index_name = "_lastUpdated"
WHERE a.resourceType = "Group"
AND a.resourceUuid IN (
SELECT resourceUuid FROM TokenIndexEntity
WHERE resourceType = "Group" AND index_name = "type" AND (index_value = "person" AND (index_system = "http://hl7.org/fhir/group-type"))
)
AND a.resourceUuid IN (
SELECT resourceUuid FROM TokenIndexEntity
WHERE resourceType = "Group" AND index_name = "code" AND (index_value = "35359004" AND IFNULL(index_system,'') = "https://www.snomed.org")
)
ORDER BY b.index_from DESC, c.index_from DESC

======

SELECT * FROM RegisterFamilies ORDER BY lastUpdated DESC LIMIT 25 OFFSET 0

=======

QUERY TO GET Family Member resourceUuid's

SELECT resourceUuid FROM ResourceEntity WHERE resourceType = "Patient" AND resourceId IN (
SELECT SUBSTR(index_value, 9) FROM ReferenceIndexEntity WHERE index_name = "member" 
AND resourceUuid = (SELECT resourceUuid FROM ResourceEntity WHERE resourceId = ?)
)

Query to get Family resourceUuid

SELECT resourceUuid FROM ResourceEntity WHERE resourceId = ?

QUERY TO GET THE TASK COUNT FOR A PATIENT

SELECT COUNT(*) FROM TokenIndexEntity WHERE resourceType = "Task" AND index_name = "status" 
AND resourceUuid IN (
SELECT resourceUuid FROM ReferenceIndexEntity WHERE resourceType = "Task" AND index_name = "subject" 
AND index_value IN (SELECT index_value FROM ReferenceIndexEntity WHERE resourceUuid = x'$groupUUID' AND index_name = "member")
) 
AND (index_value = "failed" OR index_value = "completed" OR index_value = "cancelled")

QUERY TO GET THE CHILD COUNT IN A FAMILY

SELECT COUNT(*) FROM TokenIndexEntity a JOIN DateIndexEntity b ON a.resourceUuid = b.resourceUuid  
WHERE a.resourceUuid IN ($memberSelector) AND a.index_name = "active" AND a.index_value = "true" 
AND b.index_name = "birthdate" AND b.index_from >= ?

QUERY TO GET THE NUMBER OF PREGNANT WOMEN IN A FAMILY

SELECT COUNT(*) FROM TokenIndexEntity WHERE resourceType = "Condition" 
AND index_name = "code" AND index_system = "http://snomed.info/sct" AND index_value = "77386006" 
AND resourceUuid IN (SELECT resourceUuid FROM ReferenceIndexEntity WHERE resourceType = "Condition" 
AND index_name = "subject" AND index_value IN (SELECT index_value FROM ReferenceIndexEntity WHERE index_name = "member" AND resourceUuid = x'$groupUUID') )

=== SPEEDUP FOR FETCH MEMBERS QUERY ===

CREATE INDEX `index_ResourceEntity_resourceId` ON `ResourceEntity` (`resourceId`)

@ekigamba
Copy link
Contributor

ekigamba commented Mar 17, 2023

On measuring performance on CI, we have a number of options that we can evaluate

We can get more ideas on this before commiting to an idea

@ekigamba
Copy link
Contributor

With the optimisations done by the team @ellykits and @ndegwamartin , performance has improved by 50%

Time taken to load the register

Register Page Before After
Household 60 secs to 3 mins 24 secs
Children 12 sec 4 sec
Sick child 8 secs 3 sec
ANC 5 secs 3 secs
PNC 6 secs 3 secs
FP 5 secs 3 secs
HIV 5 secs 1 secs

These times were taken with the device wifi turned off meaning background syncing jobs were not running. This also used the previous ECBIS preview instance data used during the first benchmark

@pld
Copy link
Member

pld commented Mar 27, 2023

@ekigamba is this from the latest RC? what branch/PR are these results based on?

@ekigamba
Copy link
Contributor

ekigamba commented Mar 27, 2023

@pld This is based on the latest master main. The performance numbers with FHIR staging data are even better, 5 - 11 seconds for the household register

@ekigamba
Copy link
Contributor

ekigamba commented May 4, 2023

Results after revInclude integration

Looks good for the profile pages, general improvement.

Household profile ~2 sec
Child profile = ~3 sec,
sick child profile = ~ 3 sec

No change in the performance of the household, children and sick child register. Respectively times are 24 sec, 3 sec and 3 sec.

The bug fixes done yesterday by @kitoto did fix the performance degradation on child and sick child registers

@ekigamba
Copy link
Contributor

ekigamba commented May 9, 2023

Results after forward include integration into FHIR Core

Household register = ~20 secs
Child register = ~2.5 secs
Sick child register = ~ 2.5 secs

Household profile = ~ 2 secs
Child profile = ~ 2 secs
sick child profile = ~ 3 secs

cc @ellykits @ndegwamartin

@jingtang10
Copy link

@ekigamba any more insight into why household register is slow? how can we help now that we have tried rev and forward include?

aditya's gonna merge those two changes soon.

@ellykits
Copy link
Collaborator

ellykits commented May 12, 2023

@jingtang10 We are actively working on improving the household register performance. We identified a few issues with the code used to fetch the register data that we are optimizing leveraging the recent functionalities from search APIs.

  1. Aggregating all counts for household members retrieved via forward include. At the moment we run count for each member then sum all in code.
  2. Retrieve all household members accompanying resources all at once, we'd been handling this recursively for each member before forward include implementation.

We'll share updated stats after the above issues are addressed.

Aditya mentioned:

  1. "A plan to refactor the api's to be just search and have include and revInclude just like has in the current search api.". Current implementation requires different calls for forward and reverse include. (Internally from our last discussion both functionalities run two queries one to fetch the primary resources and then another query to retrieve the accompanying resources)

  2. Parsing JSON to the corresponding FHIR representation Java model takes longer during processing of the query results, any details on this?

Do we have any plans to support filters on accompanying resources via rev/forward to reduce the amount of data to process? I understand this may not be available on the server _include/ _revinclude which the inspiration for the client implementation.

@jingtang10
Copy link

jingtang10 commented May 13, 2023 via email

@ellykits
Copy link
Collaborator

Thanks for your response @jingtang10

In agreement with you for the asks (priority in the order they are listed).

  1. Based on Aditya's finding (which I confirmed) on this Add index to DateTimeIndexEntity table index_from column google/android-fhir#1964 about the sort functionality not using the _lastUpdated index from the DateTimeIndexEntity table. Is it possible to investigate why the index is not used, could be related to the query generation. The query I shared on Aditya's PR was generated by the SDK:
    See section 1.2. Temporary Sorting B-Trees

  2. Another ask, possiblity to limit the amount of data extracted from the resources esp. the ones that are neither used in search nor sorting. E.g: Patient.address-use is something we do not use in our queries but it is extracted into the TokenIndexEntity table anyways. Our demo server has over 219K entries in that table alone (of course with data extracted from other resource types)

@pld
Copy link
Member

pld commented Jul 7, 2023

@ekigamba let's a add a perf test to close this out with

@pld
Copy link
Member

pld commented Jul 17, 2023

thinking same pattern here, which looks in line w/the draft pr, #2067 (comment)

@f-odhiambo
Copy link
Contributor

@ekigamba Whats the ETA for this ticket ?

@ekigamba
Copy link
Contributor

The remaining fixes might take 1 - 2 FTE days. I'll create issues any unresolved and potential issues originating from this ticket

@ekigamba
Copy link
Contributor

ekigamba commented Jul 25, 2023

Summary of performance testing implementation

Tools used

  1. Jetpack microbenchmark - it uses best practices for profiling performance such as getting the median of multiple iterations and using metrics only after performance has stabiliised. It also provides trace files of the tested methods. It can be configured to run at optimal
  2. Sample DB containing a lot of data. This allows us to avoid syncing and logging in on the emulator reducing the time it takes to profile the
  3. Gradle task to evaluate the results

Possible issues

  1. Flakiness due to change in Github CI runners and infrastructure. This might cause random improvements or degradations in performance. The changes that can cause this include increased hardware resources or assigned resources for the runner.
  2. Flakiness due to change in emulator configuration. Any changes in the emulator configuration might cause improvements or degradations in performance. Jetpack benchmark library is able to stabilise performance by applying tricks such as lock clocks to lock CPU clock speeds, apply activities to prevent interaction with other apps, run no-work threads to stabilise results for multi-core metrics
  3. Unrealistic results. Given that the Github CI infrastructure is faster than a Blu G60 phone, the metrics are not indicative of how much time it takes on a Blue G60 phone but they allow for us to easily detect degradations or improvements in performance. The threshold is 50% of the current metrics and this can be easily changed in the margin value

Possible improvements

  1. Reduce the hardware configuration for the emulator to fewer cores
  2. Use an emulator running on docker or kubernetes with fewer compute units. This allows us to closely emulate performance of the field devices such as the G60. Possible use on the Hetzner server
  3. Run on rooted emulator/device to allow for full configuration by the library to get more stable results
  4. Post the results on the PR for easier tracking of performance
  5. Run the tests manually on the Blu G60 device during QA before every release
  6. Update the benchmark library to allow running on > API 28 devices. This will require increasing the compile SDK version and related changes. This will also allow running using the official Benchmark runner which configures the emulator/device for more stable performance

...

@pld
Copy link
Member

pld commented Jul 26, 2023

Cool, so I am 100% fine w/all of those Possible issues, those are all expected given the infrastructure we're using. From the Possible improvements if we choose to do any of these they can come later, (5.) seems reasonable to request, (4.) maybe.

I guess my question is, when you say flakey, how flakey, like are you seeing 10x changes between runs? I think what we can test for is that it's staying within the expected order of magnitude, like if you're seeing 2-5x changes, we can fail the test if we see a 10x change, if you're seeing 20-50x changes, we can fail the test if we see a 100x change -- we're only trying to put a wide envelope on it for now, in follow-up we'll tighten it

@ekigamba
Copy link
Contributor

ekigamba commented Jul 26, 2023

I'd think that I wouldn't expect it to be flaky initially given that the tests are isolated and emulator barely has any other interaction when the tests are running, but any changes in infrastructure that we might not know off will become noticeable. I have set to limit to ~1.5x based on this assumption and we can change this if the assumption is wrong.

The library manages stability of results by running the tests until the results are consistent and then it uses the results from the stable iterations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion This is an open discussion that may or may not lead to actionable points Performance (App or Server)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants