
Incremental data-store generation #3202

Merged: 4 commits, Aug 20, 2019

Conversation

@dwsutherland (Member) commented Jun 29, 2019

The current data-store generation is inefficient (as was the old state_summary_mgr.py), in that it generates an entirely new data-store on every loop iteration in which any change has been flagged.
The update/creation has been broken down into the following actions/methods:

  1. Creation of workflow/taskdef/familydef data elements.
  2. Generation of graph edges and ghost-nodes/ghost-instances/ghost-proxies (containing def data and tree relationships) for each new cycle point in the pool.
  3. Update/population of [ghost] task instance fields with dynamic data (i.e. status info) for given pool tasks.
  4. Update/population of [ghost] family instance fields with dynamic data (i.e. status info) according to given cycle points (i.e. those cycle points of the updated tasks).
  5. Update/population of workflow (data element) fields with dynamic data (i.e. status info).
  6. Pruning of data-store elements (edges/nodes etc.) and of items in fields, on or with reference to a given list of point strings.

These are used within the scheduler as follows (see the sketch after this list):

  • Initiation of the data-store on run/restart/reload, using (1)-(5).
  • Incrementing of graph elements and pruning of old elements using (2), (5), and (6), worked out from the head and tail of the task pool respectively (the workflow element is updated afterwards with new state totals etc.).
  • Population/updating of the dynamic fields of task/family instances and then of the workflow element ((3), (4), then (5)), for tasks flagged as updated by the workflow.
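A rough sketch of how this might look in the scheduler's main loop (the ws_data_mgr method names are real, as seen in the profiling further down, but the pool access and update-flag plumbing here are illustrative assumptions rather than the actual implementation):

def update_data_structure(schd):
    # (2) + (6): generate elements for any new cycle points in the pool,
    # and prune points that have dropped out of it
    schd.ws_data_mgr.increment_graph_elements()
    # tasks flagged as updated since the last loop iteration
    updated = [t for t in schd.pool.get_all_tasks() if t.state.is_updated]
    if updated:
        # (3): dynamic fields of the updated task instances
        schd.ws_data_mgr.update_task_proxies(updated)
        # (4): family instances at the cycle points of those tasks
        schd.ws_data_mgr.update_family_proxies(
            sorted({str(t.point) for t in updated}))
        # (5): workflow-level fields (state totals etc.) last
        schd.ws_data_mgr.update_workflow()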

There was also a bug in the job pool preventing job element deletion; this has been fixed.

This change has the following impacts:

  • A significant reduction of load on the workflow, mainly from not having to generate the entire graph on every change (and from fixing that job deletion bug).
  • The data-store is used in both the WS and the UI Server as the driving data for GraphQL, so this change lays the groundwork for implementing an incremental data sync between the WS and UIS data stores.

Closes #3261
Closes #3212

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • No change log entry required (e.g. change is small or internal only).
  • No documentation update required for this change.

@dwsutherland force-pushed the increment-data-store branch from 2dbda8c to fa35518 on June 29, 2019
@dwsutherland force-pushed the increment-data-store branch 2 times, most recently from 3b73f44 to 8235d2d on June 29, 2019
@codecov bot commented Jun 29, 2019

Codecov Report

Merging #3202 into master will increase coverage by 0.11%.
The diff coverage is 98.26%.


@@            Coverage Diff             @@
##           master    #3202      +/-   ##
==========================================
+ Coverage   83.51%   83.62%   +0.11%     
==========================================
  Files         162      162              
  Lines       17900    18004     +104     
  Branches     3899     3929      +30     
==========================================
+ Hits        14949    15056     +107     
+ Misses       2951     2948       -3
Impacted Files Coverage Δ
cylc/flow/job_pool.py 84.5% <100%> (+1.4%) ⬆️
cylc/flow/scheduler.py 82.09% <100%> (+0.04%) ⬆️
cylc/flow/config.py 84.58% <75%> (-0.13%) ⬇️
cylc/flow/ws_data_mgr.py 95.89% <99.04%> (+1.29%) ⬆️
cylc/flow/task_pool.py 89.02% <0%> (-0.51%) ⬇️
cylc/flow/task_job_mgr.py 84.88% <0%> (-0.42%) ⬇️
cylc/flow/cycling/iso8601.py 87.87% <0%> (+0.19%) ⬆️
cylc/flow/cycling/__init__.py 76.47% <0%> (+0.53%) ⬆️
cylc/flow/loggingutil.py 92.17% <0%> (+0.86%) ⬆️
... and 1 more


@matthewrmshin (Contributor) left a comment:

A minor comment on style.

In general, it would be nice to have a bit more comments on the complex calculations being introduced here.

I guess you'll also add some tests for the change?

@hjoliver (Member) commented Jul 9, 2019

In general, it would be nice to have a bit more comments on the complex calculations being introduced here.

I second that: a comment wherever the purpose of a code block is not obvious by inspection (to someone who did not write it!).

@oliver-sanders (Member) left a comment:

Approach seems OK.

A few quick points on efficiency, which is quite important in this code due to the load it will put on the workflow; none of these comments are really blockers though.

@dwsutherland (Member, Author) commented Jul 10, 2019

@oliver-sanders, @matthewrmshin

Here's the missing piece of the puzzle:
https://developers.google.com/protocol-buffers/docs/reference/python-generated#repeated-message-fields

This is protobuf world, so the rules are a little different (fortunately/unfortunately).
I'll explore a little more to see if there's anything beyond what's in the docs we can do, but I'm not sure these repeated fields behave like "normal" lists (etc.).
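For reference, a minimal sketch of the repeated-message-field rules from those docs, using a standard well-known type rather than Cylc's own messages (the semantics shown are the documented protobuf ones, nothing Cylc-specific):

from google.protobuf.struct_pb2 import ListValue, Value

lv = ListValue()
# lv.values = [Value()]  # not allowed: assigning to a repeated field raises AttributeError
val = lv.values.add()    # append a new empty Value and get a reference to it
val.string_value = 'foo'
lv.values.extend([Value(number_value=1)])  # copy existing messages in
del lv.values[:]         # clearing is done via slice deletion

So mutation is via add/extend/slice-deletion rather than plain list assignment.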

@dwsutherland (Member, Author) commented

Another thing: I think I may need to change the approach somewhat. I'm a little worried about the edge case where suites leave the initial cycle point hanging around, so I may need to gather all the cycles in the pool and prune and/or generate edges and nodes based on that.

Thoughts?

@oliver-sanders (Member) commented

Thoughts?

Not sure; the current GUI will display the task in the initial cycle, attaching it to the active tasks via a scissor node.

@matthewrmshin matthewrmshin added this to the cylc-8.0a1 milestone Jul 16, 2019
@dwsutherland dwsutherland mentioned this pull request Jul 17, 2019
@dwsutherland (Member, Author) commented Jul 19, 2019

Found a flaw in this approach...
Graph definitions often:

  • do things at the initial cycle point (including future dependencies on it)
  • have offsets from the ICP, e.g. [[[+P1D/PT6H]]]

So the graph edges can't be generated incrementally; the start point would have to be the ICP (although there may be some logic in there somewhere to help with this). It would be nice to be able to continue the edge generation from where it stopped previously.

Will keep looking into this...

@dwsutherland (Member, Author) commented Jul 19, 2019

Looking at the old GUI, it used to ask for the entire graph:

def update_gui(self):

It made a call to get_graph_raw, i.e. (7.8.x):

curl -v -s -u cylc:$(cat /niwa/test/ecox_test/cylc-run/nztide/.service/passphrase) --digest --cookie-jar cookietemp --anyauth --insecure 'https://ec-cylc01.kupe.niwa.co.nz:43089/get_graph_raw?start_point_string=20190718T00&stop_point_string=20190719T00'

Using the oldest cycle point from get_latest_state, i.e. (7.8.x):

$ curl -v -s -u cylc:$(cat /niwa/test/ecox_test/cylc-run/nztide/.service/passphrase) --digest --cookie-jar cookietemp --anyauth --insecure 'https://ec-cylc01.kupe.niwa.co.nz:43089/get_latest_state'

Although in the case of ICP-offset graphs without ICP dependencies, this could still be wrong, because your start point would act as the ICP.

So the way forward might be to:

  • generate edges from the oldest cycle in the pool (like the old GUI)
  • only add edges whose source or target is in the pool.

Then I'll test the ICP offset scenario.

@dwsutherland (Member, Author) commented Jul 19, 2019

OK, I think I have a solution: in the get_graph_edges method (config.py) I set it to always use the ICP as the reference point. This is used along with an offset to obtain the correct source cycle point, and the sequence point only needs to begin at the desired/given start point.

Quite a simple change, and it appears to be quite efficient.
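Roughly the idea (a sketch only, with assumed sequence/offset helper names; the real get_graph_edges logic in config.py is more involved):

edges = []
# iterate sequence points from the requested start point, but keep offsets
# anchored to the ICP-based sequence so they resolve the same way
# regardless of where iteration begins
point = sequence.get_first_point(start_point)
while point is not None and point <= stop_point:
    for left, right, offset in edge_defs:
        # the source point is derived via its offset, so starting
        # mid-graph no longer shifts ICP-offset sources
        source_point = point - offset if offset else point
        edges.append(((left, source_point), (right, point)))
    point = sequence.get_next_point(point)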

@dwsutherland force-pushed the increment-data-store branch 2 times, most recently from 31e7197 to a79aa6c on July 20, 2019
@dwsutherland (Member, Author) commented Jul 20, 2019

OK, significant improvements made:

  • Graph edge generation now uses the ICP as the relative point for ICP offsets (IMO broken in the old get_graph_raw).
  • Pruning by cycle point where neither the source nor target point of an edge is in the task pool point set.
  • Other changes: xtriggers now show in the edge source, data element updates are more robust, job element deletion is done in the data manager pruning (for availability with non-task-pool tasks/nodes), etc.
  • More comments made throughout the code.

This change means the "edge" case of ICP dependencies is now handled correctly.

Overall I'm pretty happy with this update scheme; it will be far more efficient than before (even though a lot of logic has been added), because it only generates, updates, and/or prunes what's needed, when it's needed.

Next: Add functional tests
(this change is quite urgent, but I'll try to get this done quickly)

There are a few other PRs to follow this one:

  • Add an insert_absent option to the trigger and reset state commands, because for ghosts or other nodes the user won't know they aren't in the pool (should nodes not in the pool revert to ghost state? i.e. for suicides).
  • Get more information about xtriggers into the data store/schema, in particular treating suite_state xtriggers and polling tasks as real edges (for a multi-workflow graph view etc.).

Anyway, input/responses welcome.

@oliver-sanders (Member) left a comment:

I'm not in the best position to approve this, never having worked with the graph generation logic; however, at a cursory glance all looks well.

I haven't profiled the generation methods to see the speedup as that's a fairly lengthy piece of work.


@kinow (Member) left a comment:

I just need to quickly see if I can debug it and watch the pool being re-populated instead of re-created, but I have finished reviewing the code.

Great work adding the exclusions for coverage/analysis of the protobuf generated files. And great work on the tests too! 🎉

Left some comments for typos and a question, but none of them are blockers for merging, and they can be fixed afterwards.

Tested with the sibling PR of cylc-uiserver, and at least for the Cylc UI it's transparent: everything works the same as before (but should be better/faster now).

Thanks!


@kinow (Member) commented Aug 16, 2019

Oh, a silly question, but should we be logging the traffic between client and server via zmq? The messages are hard to read and seem to pollute the logs more than help, e.g.

[screenshot: protobuf-serialised zmq messages cluttering the suite log]

Maybe if the message were nicely formatted it could be helpful? I'm running the suite with cylc run --no-detach --verbose --debug five, as I normally do.

@dwsutherland (Member, Author) commented

Oh, a silly question, but should we be logging the traffic between client & server via zmq? The messages are hard to read and seem to pollute the logs more than help, e.g.

I wouldn't do it; it's in protobuf serialised format, so that's a dump of the data into the log. If anything we could print some metadata about the request/response, but definitely not the data.
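e.g. something like this (hypothetical, assuming LOG is the suite logger; just to illustrate metadata-only logging):

# log the endpoint and payload size rather than the serialised protobuf body
LOG.debug('zmq:send %s (%d bytes)', endpoint, len(message))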

@kinow
Copy link
Member

kinow commented Aug 16, 2019

Thanks @dwsutherland. I'm not sure where the logging is being done, but we can look into that later (I just wanted to check with others here, as I would otherwise definitely forget it).

@dwsutherland dwsutherland requested a review from kinow August 16, 2019 03:30
@kinow (Member) left a comment:

@dwsutherland great work!

I set a breakpoint in scheduler.update_data_structure, and later was taken into WsDataMgr.increment_graph_elements.

From what I could understand, you are basically calculating the differences twice: first the difference of what was added to the Scheduler.pool and now needs to be added to the pool used for workflows, and then the set of tasks pruned, which are removed from the workflow pool too.

That sets has_updated to True; when that happens, WsDataMgr.update_workflow is called, where it copies the current Workflow (used by resolvers to send the GraphQL query response). I didn't understand why you had to copy the Workflow rather than simply update its fields, but there must be a reason, and it seemed to work fine, so approved! 🎉
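In other words, the incremental step boils down to set arithmetic on cycle points, something like (names assumed, not the actual code):

old_points = set(data_mgr.pool_points)
new_points = {str(t.point) for t in schd.pool.get_all_tasks()}
to_add = new_points - old_points    # points needing new ghost nodes/edges
to_prune = old_points - new_points  # points to prune from the data store
has_updated = bool(to_add or to_prune)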

And a photo of the debugger in action: here the second iteration of scheduler.run was executing, after a runahead task was added to the Scheduler pool and now needs to be incrementally added to the workflow.


Commits pushed:

  • Unit tests added
  • Sorting query arg
  • ID delimiter var
  • Functional tests added
  • checksum to stamp
  • re-initialise data-store on reload
  • Use external id with store keys
  • graph sort bug fix
@oliver-sanders (Member) commented

Oh, a silly question, but should we be logging the traffic between client & server via zmq? The messages are hard to read and seem to pollute the logs more than help,

This is only done in --debug mode.

Perhaps the time has come to turn it off now the comms layer is tried and tested.

@dwsutherland (Member, Author) commented Aug 19, 2019

Some profile results showing the data-store generation being faster than the state summary, for one cycle of a cut-down version of the complex suite:
Cylc-7: state-summary
Showing the relevant parts:

(cylc7) sutherlander@cortex-vbox:~$ /usr/bin/time -v cylc7 run --no-detach --profile --mode=simulation complex2
            ._.                                                       
            | |            The Cylc Suite Engine [7.8.3-46-gd66ae]    
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you 
      !_____!           are welcome to redistribute it under certain  
PROFILE: Memory: 39724 KiB: scheduler.py: start configure
PROFILE: Memory: 39728 KiB: scheduler.py: before load_suiterc
PROFILE: Memory: 39728 KiB: config.py:config.py: start init config
PROFILE: Memory: 39728 KiB: config.py: before RawSuiteConfig init
PROFILE: Memory: 43412 KiB: config.py: after RawSuiteConfig init
PROFILE: Memory: 43412 KiB: config.py: before get(sparse=True
PROFILE: Memory: 43412 KiB: config.py: after get(sparse=True)
PROFILE: Memory: 43412 KiB: config.py: before _expand_runtime
PROFILE: Memory: 44168 KiB: config.py: after _expand_runtime
PROFILE: Memory: 44168 KiB: config.py: before compute_family_tree
PROFILE: Memory: 44180 KiB: config.py: after compute_family_tree
PROFILE: Memory: 44180 KiB: config.py: before inheritance
PROFILE: Memory: 46048 KiB: config.py: after inheritance
PROFILE: Memory: 46048 KiB: config.py: before get(sparse=False)
PROFILE: Memory: 47848 KiB: config.py: after get(sparse=False)
PROFILE: Memory: 47992 KiB: config.py: before load_graph()
PROFILE: Memory: 50104 KiB: config.py: after load_graph()
PROFILE: Memory: 50668 KiB: config.py: end init config
PROFILE: Memory: 50824 KiB: scheduler.py: after load_suiterc
2019-08-19T11:10:14Z INFO - Suite server: url=https://cortex-vbox:43064/ pid=18497
2019-08-19T11:10:14Z INFO - Run: (re)start=0 log=1
2019-08-19T11:10:14Z INFO - Cylc version: 7.8.3-46-gd66ae
2019-08-19T11:10:14Z INFO - Run mode: simulation
2019-08-19T11:10:14Z INFO - Initial point: 20150219T0000Z
2019-08-19T11:10:14Z INFO - Final point: 20150219T0000Z
PROFILE: Memory: 51892 KiB: scheduler.py: before load_tasks
2019-08-19T11:10:14Z INFO - Cold Start 20150219T0000Z
PROFILE: Memory: 54516 KiB: scheduler.py: after load_tasks
PROFILE: Memory: 55024 KiB: scheduler.py: end configure
PROFILE: Memory: 55028 KiB: scheduler.py: begin run while loop
.
.
.
2019-08-19T11:10:15Z INFO - PROFILE: scheduler loop dt (s): 1: 0.193 5: 0.193 15: 0.193
2019-08-19T11:10:15Z INFO - PROFILE: CPU %: 1: 81.5 5: 81.5 15: 81.5
PROFILE: Memory: 56440 KiB: scheduler.py: loop #0: 2019-08-19T11:10:15Z
.
.
.
2019-08-19T11:11:15Z INFO - PROFILE: scheduler loop dt (s): 1: 0.036 5: 0.039 15: 0.039
2019-08-19T11:11:15Z INFO - PROFILE: CPU %: 1: 16.4 5: 17.5 15: 17.5
PROFILE: Memory: 56968 KiB: scheduler.py: loop #60: 2019-08-19T11:11:15Z
.
.
.
2019-08-19T11:12:15Z INFO - PROFILE: scheduler loop dt (s): 1: 0.046 5: 0.043 15: 0.043
2019-08-19T11:12:15Z INFO - PROFILE: CPU %: 1: 6.8 5: 12.2 15: 12.2
PROFILE: Memory: 58284 KiB: scheduler.py: loop #120: 2019-08-19T11:12:15Z
.
.
.
2019-08-19T11:13:15Z INFO - PROFILE: scheduler loop dt (s): 1: 0.037 5: 0.041 15: 0.041
2019-08-19T11:13:15Z INFO - PROFILE: CPU %: 1: 5.7 5: 10.0 15: 10.0
PROFILE: Memory: 58552 KiB: scheduler.py: loop #180: 2019-08-19T11:13:15Z
2019-08-19T11:13:20Z INFO - [archive_logs.20150219T0000Z] status=running: (received)succeeded at 2019-08-19T11:13:20Z for job(00)
PROFILE: Memory: 58552 KiB: scheduler.py: end main loop (total loops 185): 2019-08-19T11:13:20Z
2019-08-19T11:13:20Z INFO - Suite shutting down - AUTOMATIC
2019-08-19T11:13:21Z INFO - DONE
         3482314 function calls (3396961 primitive calls) in 186.486 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.045    0.045  185.679  185.679 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/scheduler.py:1576(run)
      187  175.382    0.938  175.382    0.938 {time.sleep}

      187    0.027    0.000    1.498    0.008 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/scheduler.py:1642(update_state_summary)

       41    0.169    0.004    1.018    0.025 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/state_summary_mgr.py:50(update)

       41    0.047    0.001    0.406    0.010 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/state_summary_mgr.py:172(_get_tasks_info)
    16585    0.147    0.000    0.339    0.000 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/task_proxy.py:314(get_state_summary)

    45264    0.081    0.000    0.276    0.000 /home/sutherlander/repos/hilary/cylc-flow/lib/cylc/state_summary_mgr.py:138(<genexpr>)
	Command being timed: "cylc7 run --no-detach --profile --mode=simulation complex2"
	User time (seconds): 9.99
	System time (seconds): 2.70
	Percent of CPU this job got: 6%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:10.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 58684
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 455107
	Voluntary context switches: 2147
	Involuntary context switches: 2273
	Swaps: 0
	File system inputs: 0
	File system outputs: 68984
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

So we can see the state_summary update method running for 1.018 s in total, with a significant amount of the time due to looping over all the pool tasks calling get_state_summary (as expected).

Cylc-8: data-store
Showing the relevant parts:

(cylc-flow) sutherlander@cortex-vbox:~$ /usr/bin/time -v cylc run --no-detach --profile --mode=simulation complex2
            ._.                                                       
            | |                 The Cylc Suite Engine [8.0a0]         
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY.
      .___! |              It is free software, you are welcome to    
      !_____!             redistribute it under certain conditions;   
                        see `COPYING' in the Cylc source distribution. 
                                                                       
PROFILE: Memory: 43000 KiB: scheduler.py: start configure
PROFILE: Memory: 43028 KiB: scheduler.py: before load_suiterc
PROFILE: Memory: 43028 KiB: config.py:config.py: start init config
PROFILE: Memory: 43028 KiB: config.py: before RawSuiteConfig init
2019-08-19T22:47:06+12:00 WARNING - deprecated graph items were automatically upgraded in 'suite definition':
2019-08-19T22:47:06+12:00 WARNING -  * (8.0.0) [scheduling][dependencies][X][graph] -> [scheduling][graph][X] - for X in:
	R1
	T00
	T00, T06, T12, T18
	T00, T12
	T06
	T06, T18
	T12
	T18
PROFILE: Memory: 46836 KiB: config.py: after RawSuiteConfig init
PROFILE: Memory: 46836 KiB: config.py: before get(sparse=True
PROFILE: Memory: 46836 KiB: config.py: after get(sparse=True)
PROFILE: Memory: 46836 KiB: config.py: before _expand_runtime
PROFILE: Memory: 47152 KiB: config.py: after _expand_runtime
PROFILE: Memory: 47152 KiB: config.py: before compute_family_tree
PROFILE: Memory: 47152 KiB: config.py: after compute_family_tree
PROFILE: Memory: 47152 KiB: config.py: before inheritance
PROFILE: Memory: 48076 KiB: config.py: after inheritance
PROFILE: Memory: 48076 KiB: config.py: before get(sparse=False)
PROFILE: Memory: 48996 KiB: config.py: after get(sparse=False)
PROFILE: Memory: 49040 KiB: config.py: before load_graph()
PROFILE: Memory: 50968 KiB: config.py: after load_graph()
PROFILE: Memory: 51172 KiB: config.py: end init config
PROFILE: Memory: 51348 KiB: scheduler.py: after load_suiterc
2019-08-19T10:47:08Z INFO - Suite server: url=tcp://cortex-vbox:43061/ pid=18133
2019-08-19T10:47:08Z INFO - Run: (re)start=0 log=1
2019-08-19T10:47:08Z INFO - Cylc version: 8.0a0
2019-08-19T10:47:08Z INFO - Run mode: simulation
2019-08-19T10:47:08Z INFO - Initial point: 20150219T0000Z
2019-08-19T10:47:08Z INFO - Final point: 20150219T0000Z
PROFILE: Memory: 52240 KiB: scheduler.py: before load_tasks
2019-08-19T10:47:08Z INFO - Cold Start 20150219T0000Z
PROFILE: Memory: 53760 KiB: scheduler.py: after load_tasks
PROFILE: Memory: 54220 KiB: scheduler.py: end configure
PROFILE: Memory: 54232 KiB: scheduler.py: begin run while loop
.
.
.
2019-08-19T10:47:08Z INFO - PROFILE: scheduler loop dt (s): 1: 0.231 5: 0.231 15: 0.231
2019-08-19T10:47:08Z INFO - PROFILE: CPU %: 1: 86.0 5: 86.0 15: 86.0
PROFILE: Memory: 59644 KiB: scheduler.py: loop #0: 2019-08-19T10:47:08Z
.
.
.
2019-08-19T10:48:09Z INFO - PROFILE: scheduler loop dt (s): 1: 0.030 5: 0.033 15: 0.033
2019-08-19T10:48:09Z INFO - PROFILE: CPU %: 1: 14.8 5: 17.1 15: 17.1
PROFILE: Memory: 59680 KiB: scheduler.py: loop #61: 2019-08-19T10:48:09Z
.
.
.
2019-08-19T10:49:10Z INFO - PROFILE: scheduler loop dt (s): 1: 0.050 5: 0.042 15: 0.042
2019-08-19T10:49:10Z INFO - PROFILE: CPU %: 1: 6.0 5: 11.7 15: 11.7
PROFILE: Memory: 60240 KiB: scheduler.py: loop #121: 2019-08-19T10:49:10Z
.
.
.
2019-08-19T10:50:10Z INFO - PROFILE: scheduler loop dt (s): 1: 0.035 5: 0.039 15: 0.039
2019-08-19T10:50:10Z INFO - PROFILE: CPU %: 1: 5.3 5: 9.6 15: 9.6
PROFILE: Memory: 60384 KiB: scheduler.py: loop #181: 2019-08-19T10:50:10Z
2019-08-19T10:50:13Z INFO - [archive_logs.20150219T0000Z] status=running: (received)succeeded at 2019-08-19T10:50:13Z for job(00)
PROFILE: Memory: 60384 KiB: scheduler.py: end main loop (total loops 184): 2019-08-19T10:50:13Z
2019-08-19T10:50:13Z INFO - Suite shutting down - AUTOMATIC
2019-08-19T10:50:14Z INFO - DONE
         3212688 function calls (3207722 primitive calls) in 185.622 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.045    0.045  184.684  184.684 /home/sutherlander/repos/cylc-flow/cylc/flow/scheduler.py:1514(run)
      184  174.876    0.950  174.876    0.950 {built-in method time.sleep}
      186    0.004    0.000    0.891    0.005 /home/sutherlander/repos/cylc-flow/cylc/flow/scheduler.py:1587(update_data_structure)

       46    0.012    0.000    0.316    0.007 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:821(update_dynamic_elements)
       46    0.038    0.001    0.145    0.003 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:642(update_task_proxies)
       46    0.068    0.001    0.092    0.002 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:708(update_family_proxies)
       47    0.001    0.000    0.082    0.002 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:588(increment_graph_elements)
        1    0.015    0.015    0.080    0.080 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:393(generate_graph_elements)
       47    0.003    0.000    0.067    0.001 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:770(update_workflow_statuses)
        1    0.000    0.000    0.045    0.045 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:573(initiate_data_model)
        1    0.035    0.035    0.045    0.045 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:168(generate_definition_elements)
      246    0.018    0.000    0.019    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:284(generate_ghost_task)
      994    0.002    0.000    0.004    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:67(task_mean_elapsed_time)
        1    0.003    0.003    0.004    0.004 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:322(generate_ghost_families)
      727    0.003    0.000    0.003    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:693(<listcomp>)
        1    0.000    0.000    0.001    0.001 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:799(update_workflow)
      836    0.001    0.000    0.001    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:228(<listcomp>)
      773    0.001    0.000    0.001    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:829(<genexpr>)
      246    0.000    0.000    0.000    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:315(<listcomp>)
        1    0.000    0.000    0.000    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:257(<listcomp>)
        1    0.000    0.000    0.000    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:494(<listcomp>)
      727    0.000    0.000    0.000    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:679(<listcomp>)
       52    0.000    0.000    0.000    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:380(<listcomp>)
	Command being timed: "cylc run --no-detach --profile --mode=simulation complex2"
	User time (seconds): 9.13
	System time (seconds): 2.37
	Percent of CPU this job got: 6%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:08.46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 60736
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 351270
	Voluntary context switches: 1901
	Involuntary context switches: 2559
	Swaps: 0
	File system inputs: 0
	File system outputs: 75136
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

So update_data_structure (the update_state_summary equivalent) is called by the scheduler and updates the data store using:

       46    0.012    0.000    0.316    0.007 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:821(update_dynamic_elements)
       47    0.001    0.000    0.082    0.002 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:588(increment_graph_elements)

That's 0.4 seconds so far, but you should probably add in the initialisation calls:

        1    0.000    0.000    0.045    0.045 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:573(initiate_data_model)

Giving a grand total of 0.443 seconds. The graph edge generation only happened once:

        1    0.015    0.015    0.080    0.080 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:393(generate_graph_elements)

(as expected, with there being only one cycle point; this is included in increment_graph_elements). It will only run once for every new cycle point.

data store wins: 57% faster!
(0.443 vs 1.018 seconds)

Note: simulation runs don't include the job pool; however, that would just be more memory and a list comprehension over updated tasks (so maybe 0.01 seconds more).

@hjoliver (Member) left a comment:

OK, all good now 👍

@hjoliver (Member) commented

That seems to be four approvals 🥇

@hjoliver hjoliver merged commit e439a1d into cylc:master Aug 20, 2019
@dwsutherland (Member, Author) commented

For the record, master (without the state summary) was much slower before this merge:

(cylc-flow) sutherlander@cortex-vbox:~$ /usr/bin/time -v cylc run --no-detach --profile --mode=simulation complex2
            ._.                                                       
            | |                 The Cylc Suite Engine [8.0a0]         
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY.
      .___! |              It is free software, you are welcome to    
      !_____!             redistribute it under certain conditions;   
                        see `COPYING' in the Cylc source distribution. 
                                                                       
PROFILE: Memory: 43432 KiB: scheduler.py: start configure
PROFILE: Memory: 43436 KiB: scheduler.py: before load_suiterc
PROFILE: Memory: 43436 KiB: config.py:config.py: start init config
PROFILE: Memory: 43436 KiB: config.py: before RawSuiteConfig init
2019-08-20T18:48:07+12:00 WARNING - deprecated graph items were automatically upgraded in 'suite definition':
2019-08-20T18:48:07+12:00 WARNING -  * (8.0.0) [scheduling][dependencies][X][graph] -> [scheduling][graph][X] - for X in:
	R1
	T00
	T00, T06, T12, T18
	T00, T12
	T06
	T06, T18
	T12
	T18
PROFILE: Memory: 46872 KiB: config.py: after RawSuiteConfig init
PROFILE: Memory: 46872 KiB: config.py: before get(sparse=True
PROFILE: Memory: 46872 KiB: config.py: after get(sparse=True)
PROFILE: Memory: 46872 KiB: config.py: before _expand_runtime
PROFILE: Memory: 47172 KiB: config.py: after _expand_runtime
PROFILE: Memory: 47172 KiB: config.py: before compute_family_tree
PROFILE: Memory: 47172 KiB: config.py: after compute_family_tree
PROFILE: Memory: 47172 KiB: config.py: before inheritance
PROFILE: Memory: 48092 KiB: config.py: after inheritance
PROFILE: Memory: 48092 KiB: config.py: before get(sparse=False)
PROFILE: Memory: 49016 KiB: config.py: after get(sparse=False)
PROFILE: Memory: 49052 KiB: config.py: before load_graph()
PROFILE: Memory: 50980 KiB: config.py: after load_graph()
PROFILE: Memory: 51176 KiB: config.py: end init config
PROFILE: Memory: 51344 KiB: scheduler.py: after load_suiterc
2019-08-20T06:48:09Z INFO - Suite server: url=tcp://cortex-vbox:43072/ pid=3115
2019-08-20T06:48:09Z INFO - Run: (re)start=0 log=1
2019-08-20T06:48:09Z INFO - Cylc version: 8.0a0
2019-08-20T06:48:09Z INFO - Run mode: simulation
2019-08-20T06:48:09Z INFO - Initial point: 20150219T0000Z
2019-08-20T06:48:09Z INFO - Final point: 20150219T0000Z
PROFILE: Memory: 52236 KiB: scheduler.py: before load_tasks
2019-08-20T06:48:09Z INFO - Cold Start 20150219T0000Z
PROFILE: Memory: 53764 KiB: scheduler.py: after load_tasks
PROFILE: Memory: 54228 KiB: scheduler.py: end configure
PROFILE: Memory: 54244 KiB: scheduler.py: begin run while loop
.
.
.
2019-08-20T06:48:10Z INFO - PROFILE: scheduler loop dt (s): 1: 0.267 5: 0.267 15: 0.267
2019-08-20T06:48:10Z INFO - PROFILE: CPU %: 1: 89.3 5: 89.3 15: 89.3
PROFILE: Memory: 59576 KiB: scheduler.py: loop #0: 2019-08-20T06:48:10Z
.
.
.
2019-08-20T06:49:10Z INFO - PROFILE: scheduler loop dt (s): 1: 0.061 5: 0.064 15: 0.064
2019-08-20T06:49:10Z INFO - PROFILE: CPU %: 1: 17.2 5: 18.3 15: 18.3
PROFILE: Memory: 60904 KiB: scheduler.py: loop #60: 2019-08-20T06:49:10Z
.
.
.
2019-08-20T06:50:10Z INFO - PROFILE: scheduler loop dt (s): 1: 0.079 5: 0.071 15: 0.071
2019-08-20T06:50:10Z INFO - PROFILE: CPU %: 1: 9.0 5: 13.7 15: 13.7
PROFILE: Memory: 61780 KiB: scheduler.py: loop #120: 2019-08-20T06:50:10Z
.
.
.
PROFILE: Memory: 61952 KiB: scheduler.py: loop #181: 2019-08-20T06:51:11Z
2019-08-20T06:51:14Z INFO - [archive_logs.20150219T0000Z] status=running: (received)succeeded at 2019-08-20T06:51:14Z for job(00)
PROFILE: Memory: 61952 KiB: scheduler.py: end main loop (total loops 184): 2019-08-20T06:51:15Z
2019-08-20T06:51:15Z INFO - Suite shutting down - AUTOMATIC
2019-08-20T06:51:16Z INFO - DONE
         5371608 function calls (5336357 primitive calls) in 186.351 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.049    0.049  185.368  185.368 /home/sutherlander/repos/cylc-flow/cylc/flow/scheduler.py:1516(run)
      184  170.483    0.927  170.483    0.927 {built-in method time.sleep}

      186    0.004    0.000    5.512    0.030 /home/sutherlander/repos/cylc-flow/cylc/flow/scheduler.py:1583(update_data_structure)
       41    0.005    0.000    5.134    0.125 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:63(initiate_data_model)

       41    0.469    0.011    1.906    0.046 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:257(generate_graph_elements)

       41    0.357    0.009    1.552    0.038 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:334(update_task_proxies)

       41    1.032    0.025    1.533    0.037 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:71(generate_definition_elements)

    10086    0.337    0.000    0.367    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:180(_generate_ghost_task)

       41    0.159    0.004    0.185    0.005 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:204(_generate_ghost_families)

       41    0.056    0.001    0.075    0.002 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:368(update_family_proxies)

       41    0.015    0.000    0.063    0.002 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:414(update_workflow)

    34276    0.034    0.000    0.034    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:129(<listcomp>)
       41    0.022    0.001    0.031    0.001 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:153(<listcomp>)
       41    0.020    0.000    0.020    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:170(<listcomp>)
       41    0.018    0.000    0.018    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:315(<listcomp>)
    10086    0.015    0.000    0.015    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:198(<listcomp>)
       41    0.011    0.000    0.011    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:169(<listcomp>)
       41    0.008    0.000    0.008    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:447(<listcomp>)
     9608    0.002    0.000    0.002    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:353(<listcomp>)
       41    0.001    0.000    0.001    0.000 /home/sutherlander/repos/cylc-flow/cylc/flow/ws_data_mgr.py:449(<listcomp>)
	Command being timed: "cylc run --no-detach --profile --mode=simulation complex2"
	User time (seconds): 13.85
	System time (seconds): 2.99
	Percent of CPU this job got: 8%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:09.20
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 62276
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 409991
	Voluntary context switches: 2059
	Involuntary context switches: 3669
	Swaps: 0
	File system inputs: 8
	File system outputs: 66936
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Incremental data-store generation wins:
5.134 vs 0.443 seconds

(as expected)

@oliver-sanders oliver-sanders added the efficiency For notable efficiency improvements label Aug 23, 2019
Labels: efficiency (For notable efficiency improvements)

Successfully merging this pull request may close these issues:

  • Extra self parameter in mutations
  • Bug in mutation if there is no owner

5 participants