Use relative paths and mirror server file structure locally #340
Conversation
"""Constructor""" | ||
self._ssh = None | ||
self._retry_after = 5 | ||
self.root = server_setup.DATA_ROOT_DIR if root is None else root | ||
self.dest_root = server_setup.LOCAL_DIR |
Should this be an attribute at the DataAccess level?
I'm guessing it will, just waiting to generalize until the local implementation starts happening, at which point the duplicated code will be obvious and can be moved to the base class.
Do you plan to mirror the server structure in
All the usages I see will delete the local file afterward, so I was planning to leave that as is. However, it does need to be updated to work with relative paths.
- if not os.path.exists(server_setup.LOCAL_DIR):
-     os.makedirs(server_setup.LOCAL_DIR)
+ os.makedirs(server_setup.LOCAL_DIR, exist_ok=True)
I wasn't aware of this call signature, it's very nice!
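For reference, a standalone illustration of that call signature (not project code): with exist_ok=True, makedirs is idempotent and replaces the racy check-then-create pattern.

```python
import os
import tempfile

# exist_ok=True creates intermediate directories and is a no-op when the
# target already exists, so no os.path.exists() guard is needed.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "a", "b")
    os.makedirs(target, exist_ok=True)  # creates tmp/a and tmp/a/b
    os.makedirs(target, exist_ok=True)  # second call does not raise
    print(os.path.isdir(target))  # True
```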
    def execute_dir(self):
        return self._join(EXECUTE_DIR)
The idea of these being functions instead of attributes is that self.root can be changed and that change will propagate?
Right, but to be honest this was mostly an ad hoc way to simplify the code outside SSHDataAccess that was using the root dir for path joins. I feel like overall things could be further simplified, so this may be temporary.
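The trade-off under discussion can be shown with a toy sketch (names are illustrative, not the real class): computing joins in methods rather than storing them as attributes means a later change to self.root propagates to every derived path.

```python
import posixpath

class PathConfig:
    """Toy sketch: derived paths are recomputed on each call."""

    EXECUTE_DIR = "tmp"  # illustrative subdirectory name

    def __init__(self, root):
        self.root = root

    def execute_dir(self):
        # recomputed every call, so reassigning self.root takes effect
        return posixpath.join(self.root, self.EXECUTE_DIR)

paths = PathConfig("/mnt/bes/pcm")
print(paths.execute_dir())  # /mnt/bes/pcm/tmp
paths.root = "/mnt/backup"
print(paths.execute_dir())  # /mnt/backup/tmp
```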
Force-pushed from beeb42f to e3d20c9
@@ -0,0 +1,58 @@
import pandas as pd
This file is just moving tests around
@@ -0,0 +1,41 @@
import pandas as pd
This file is just moving tests around
@@ -3,3 +3,4 @@ markers =
    integration: marks tests that require external dependencies (deselect with '-m "not integration"')
    db: marks tests that connect to a local database
    ssh: marks tests that connect to the server over ssh
    wip: marks tests undergoing development
I found this useful while writing new tests so figured I'd leave it for future use?
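For reference, a sketch of how such a marker is typically used once registered in pytest.ini (test name is illustrative):

```python
# mark a test as work-in-progress
@pytest.mark.wip
def test_new_feature():
    ...

# run only wip tests:    pytest -m wip
# exclude them:          pytest -m "not wip"
```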
powersimdata/input/input_data.py (outdated)
if self.data_loc == "disk": | ||
self.data_access = SSHDataAccess(server_setup.BACKUP_DATA_ROOT_DIR) | ||
else: | ||
self.data_access = SSHDataAccess(server_setup.DATA_ROOT_DIR) |
Previously, we were passing any DataAccess object to the init; now we're hardcoding that we will create a new SSHDataAccess object. Maybe I'm missing something, but this seems like a regression in our abstraction process that will need to get undone as soon as we implement the local access. Could we accomplish the same functionality and keep the abstraction by still accepting a DataAccess object in the init and modifying self.data_access.root based on data_loc?
Yeah, this is technically a regression, though I was thinking the fix would be using a factory instead of having the parameter - for example:

    class Context:
        @staticmethod
        def get_data_access(root):
            if check_environment_variable():
                return LocalDataAccess(root)
            return SSHDataAccess(root)

and in input_data.py:

    self.data_access = Context.get_data_access(root)
Open to ideas here... one might be passing the context as a parameter to allow easy substitution of test implementations. Your suggestion also makes sense; let me think about that a bit more.
I think a factory would work, with the potential downside that we're creating a new SSH connection every time we want to access the server, rather than re-using the existing one.
This is a potential issue. I was opening new SSH connections all over the place in the first version of the scenario framework and the server was not happy. That said, I probably did a bad job of closing each connection after a new one was created.
Actually this is a good point... if we share the same DataAccess instance in multiple places, we should definitely not modify the root attribute. We could still achieve connection reuse, though - either cache equivalent DataAccess instances within the factory, or use a static connection (the self._ssh property) instead of one per instance.
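The caching-in-the-factory option could look roughly like this (a hypothetical sketch: the stand-in classes and the DATA_ACCESS_LOCAL environment variable are assumptions, not existing project code):

```python
import os

class SSHDataAccess:  # stand-in so the sketch runs on its own
    def __init__(self, root):
        self.root = root

class LocalDataAccess:  # stand-in so the sketch runs on its own
    def __init__(self, root):
        self.root = root

class Context:
    """Sketch of instance caching inside the factory: two requests for
    the same root get the same object, so the underlying SSH connection
    is reused rather than reopened."""

    _cache = {}

    @classmethod
    def get_data_access(cls, root):
        if root not in cls._cache:
            # DATA_ACCESS_LOCAL is an assumed environment variable
            if os.environ.get("DATA_ACCESS_LOCAL"):
                cls._cache[root] = LocalDataAccess(root)
            else:
                cls._cache[root] = SSHDataAccess(root)
        return cls._cache[root]

a = Context.get_data_access("/mnt/bes/pcm")
b = Context.get_data_access("/mnt/bes/pcm")
print(a is b)  # True: the same instance (and connection) is reused
```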
To clarify: we've historically had one active SSH connection per Scenario, shared among all the Scenario methods.
Yeah, that makes sense (and I agree it's slightly preferable to recreating connections). I was just realizing that with these changes, not all SSHDataAccess instances are interchangeable anymore, so sharing those won't make sense (but sharing the connection still makes sense, just need to write the code).
Where are we on this issue? This seems like a fairly fundamental design point for this PR. Would it be easier (possible?) to separate the path/folder part of this PR from the refactor of how we pass around DataAccess objects?
Are you referring to the connection reuse? I was thinking that if we close connections correctly (it seems like we do), it won't have much impact, and I could handle that in a follow-up PR if needed. As for passing in the instance, I also plan to implement the factory in the next PR when I add the local implementation (that's really the only way I see to do it, since we can't reuse the instance anymore). Let me know if this sounds reasonable.
After a conversation with Daniel I've added the initial factory layout, hopefully that gives an idea of how it will progress.
I deleted my ScenarioData folder and created a new Scenario, prepared it, ran it, and queried the results, and everything looks according to plan. The local files exactly mirror the server structure, and once raw profiles are cached locally they are not re-downloaded.
>>> from powersimdata.scenario.scenario import Scenario
>>> scenario = Scenario('')
>>> scenario.state.set_builder(["Texas"])
Reading bus.csv
Reading plant.csv
Reading gencost.csv
Reading branch.csv
Reading dcline.csv
Reading sub.csv
Reading bus2sub.csv
Reading zone.csv
Transferring ScenarioList.csv from server
100%|########################################| 235k/235k [00:00<00:00, 668kb/s]
--> Summary
# Existing study
base | test | Anchor | Julia
# Available profiles
demand: ercot
hydro: v1 | v2
solar: v2 | v4.1
wind: v1 | v2 | v5.1 | v5
>>> scenario.state.builder.set_base_profile("demand", "ercot")
>>> scenario.state.builder.set_base_profile("hydro", "v2")
>>> scenario.state.builder.set_base_profile("solar", "v4.1")
>>> scenario.state.builder.set_base_profile("wind", "v5.1")
>>> scenario.state.builder.set_name("test", "daniel_tests_jons_tree")
>>> scenario.state.builder.set_time("2016-01-01 00:00:00", "2016-01-03 23:00:00", "24H")
>>> scenario.state.builder.change_table.scale_plant_capacity("solar", zone_name={"West": 2})
>>> scenario.state.builder.change_table.scale_plant_capacity("wind", zone_name={"West": 2})
>>> scenario.state.builder.change_table.scale_plant_capacity("hydro", zone_name={"West": 2})
>>> scenario.state.builder.change_table.scale_demand(zone_name={"West": 2})
>>> scenario.state.create_scenario()
CREATING SCENARIO: test | daniel_tests_jons_tree
--> Generating scenario id
--> Adding entry in ScenarioList.csv on server
--> Writing change table on local machine
Writing C:\Users\DanielOlsen\ScenarioData\1722_ct.pkl
Transferring C:\Users\DanielOlsen\ScenarioData\1722_ct.pkl to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_ct.pkl on local machine
--> Adding entry in execute table on server
SCENARIO SUCCESSFULLY CREATED WITH ID #1722
State switching: create --> execute
>>> scenario.state.prepare_simulation_input()
Transferring ExecuteList.csv from server
100%|######################################| 20.8k/20.8k [00:00<00:00, 128kb/s]
---------------------------
PREPARING SIMULATION INPUTS
---------------------------
--> Creating temporary folder on server for simulation inputs
--> Loading demand
texas_demand_ercot.csv not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring texas_demand_ercot.csv from server
100%|#######################################| 959k/959k [00:00<00:00, 1.50Mb/s]
Multiply demand in West (#303) by 2.00
Writing scaled demand profile in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring C:\Users\DanielOlsen\ScenarioData\1722_demand.csv to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_demand.csv on local machine
--> Loading hydro
texas_hydro_v2.csv not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring texas_hydro_v2.csv from server
100%|#####################################| 2.62M/2.62M [00:01<00:00, 2.22Mb/s]
Writing scaled hydro profile in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring C:\Users\DanielOlsen\ScenarioData\1722_hydro.csv to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_hydro.csv on local machine
--> Loading solar
texas_solar_v4.1.csv not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring texas_solar_v4.1.csv from server
100%|#####################################| 3.61M/3.61M [00:01<00:00, 2.55Mb/s]
Writing scaled solar profile in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring C:\Users\DanielOlsen\ScenarioData\1722_solar.csv to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_solar.csv on local machine
--> Loading wind
texas_wind_v5.1.csv not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring texas_wind_v5.1.csv from server
100%|#####################################| 19.0M/19.0M [00:07<00:00, 2.68Mb/s]
Writing scaled wind profile in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring C:\Users\DanielOlsen\ScenarioData\1722_wind.csv to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_wind.csv on local machine
--> Preparing MPC file
Scaling grid
Building MPC file
Transferring C:\Users\DanielOlsen\ScenarioData\1722_case.mat to server
--> Deleting C:\Users\DanielOlsen\ScenarioData\1722_case.mat on local machine
--> Updating status in execute table on server
>>> scenario.state.launch_simulation(threads=8)
--> Launching simulation on server
PID: 25388
<subprocess.Popen object at 0x00000147D59C4E50>
>>> quit()
PS C:\Users\DanielOlsen\repos\bes\PowerSimData>
PS C:\Users\DanielOlsen\repos\bes\PowerSimData>
PS C:\Users\DanielOlsen\repos\bes\PowerSimData> python
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from powersimdata.scenario.scenario import Scenario
>>> scenario = Scenario(1722)
Transferring ScenarioList.csv from server
100%|########################################| 235k/235k [00:00<00:00, 701kb/s]
Transferring ExecuteList.csv from server
100%|######################################| 20.8k/20.8k [00:00<00:00, 169kb/s]
SCENARIO: test | daniel_tests_jons_tree
--> State
analyze
--> Loading grid
1722_grid.mat not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring 1722_grid.mat from server
100%|########################################| 191k/191k [00:00<00:00, 894kb/s]
Loading bus
Loading plant
Loading heat_rate_curve
Loading gencost_before
Loading gencost_after
Loading branch
Loading sub
Loading bus2sub
--> Loading ct
1722_ct.pkl not found in C:\Users\DanielOlsen\ScenarioData\ on local machine
Transferring 1722_ct.pkl from server
100%|#########################################| 206/206 [00:00<00:00, 1.45kb/s]
>>> scenario.state.get_lmp()
--> Loading LMP
C:\Users\DanielOlsen\ScenarioData\data/output\1722_LMP.pkl not found on local machine
Transferring 1722_LMP.pkl from server
100%|#######################################| 610k/610k [00:00<00:00, 2.20Mb/s]
3001001 3001002 ... 3008159 3008160
UTC ...
2016-01-01 00:00:00 24.806667 24.629782 ... 22.240349 22.243629
2016-01-01 01:00:00 24.739674 24.497225 ... 22.276453 22.279398
2016-01-01 02:00:00 24.489302 24.256945 ... 22.253412 22.256344
2016-01-01 03:00:00 24.783640 24.563858 ... 22.210369 22.213171
2016-01-01 04:00:00 25.037502 24.828499 ... 22.126270 22.128239
... ... ... ... ... ...
2016-01-03 19:00:00 23.253880 9.772205 ... 22.168762 22.172525
2016-01-03 20:00:00 23.256464 9.773940 ... 22.156517 22.160339
2016-01-03 21:00:00 23.889296 11.042955 ... 22.136557 22.140299
2016-01-03 22:00:00 22.933260 12.123874 ... 22.121342 22.124964
2016-01-03 23:00:00 22.650866 11.272489 ... 21.966211 21.968102
[72 rows x 2000 columns]
>>> scenario.state.get_pg()
--> Loading PG
C:\Users\DanielOlsen\ScenarioData\data/output\1722_PG.pkl not found on local machine
Transferring 1722_PG.pkl from server
100%|########################################| 183k/183k [00:00<00:00, 774kb/s]
12875 12876 ... 14048 14049
UTC ...
2016-01-01 00:00:00 93.895042 17.221796 ... 1.084221 40.035355
2016-01-01 01:00:00 113.781754 18.496275 ... 2.580470 68.812187
2016-01-01 02:00:00 117.523766 19.679838 ... 5.696044 72.946083
2016-01-01 03:00:00 117.236267 16.149055 ... 14.590879 75.538712
2016-01-01 04:00:00 117.423836 14.734890 ... 73.820221 66.276161
... ... ... ... ... ...
2016-01-03 19:00:00 5.199273 0.194389 ... 23.013470 18.864305
2016-01-03 20:00:00 3.665761 0.460093 ... 19.755426 13.048021
2016-01-03 21:00:00 17.572971 1.049806 ... 16.927418 14.015634
2016-01-03 22:00:00 16.981102 2.830077 ... 14.350667 6.151488
2016-01-03 23:00:00 0.000000 4.079185 ... 39.642094 117.152267
[72 rows x 598 columns]
>>> scenario.state.get_pf()
--> Loading PF
C:\Users\DanielOlsen\ScenarioData\data/output\1722_PF.pkl not found on local machine
Transferring 1722_PF.pkl from server
100%|#######################################| 976k/976k [00:00<00:00, 1.39Mb/s]
100915 100916 ... 104119 104120
UTC ...
2016-01-01 00:00:00 103.781952 103.781952 ... -49.854210 -49.854210
2016-01-01 01:00:00 99.933151 99.933151 ... -54.308800 -54.308800
2016-01-01 02:00:00 93.480057 93.480057 ... -50.943272 -50.943272
2016-01-01 03:00:00 105.065758 105.065758 ... -48.895115 -48.895115
2016-01-01 04:00:00 107.981354 107.981354 ... -49.638092 -49.638092
... ... ... ... ... ...
2016-01-03 19:00:00 109.480530 109.480530 ... -40.344730 -40.344730
2016-01-03 20:00:00 106.598625 106.598625 ... -39.177395 -39.177395
2016-01-03 21:00:00 104.269455 104.269455 ... -38.680443 -38.680443
2016-01-03 22:00:00 99.945633 99.945633 ... -37.980453 -37.980453
2016-01-03 23:00:00 81.679192 81.679192 ... -38.790810 -38.790810
[72 rows x 3206 columns]
>>> scenario.state.get_demand()
Reading bus.csv
Reading plant.csv
Reading gencost.csv
Reading branch.csv
Reading dcline.csv
Reading sub.csv
Reading bus2sub.csv
Reading zone.csv
--> Loading demand
Multiply demand in West (#303) by 2.00
301 302 ... 307 308
UTC ...
2016-01-01 00:00:00 2173.808470 868.636393 ... 10703.575700 1341.750663
2016-01-01 01:00:00 2206.959175 890.950782 ... 10743.720990 1449.531975
2016-01-01 02:00:00 2200.797530 873.920708 ... 10436.555930 1355.505824
2016-01-01 03:00:00 2179.300089 860.342391 ... 10117.139850 1314.445489
2016-01-01 04:00:00 2154.405686 840.608568 ... 9836.114934 1330.866082
... ... ... ... ... ...
2016-12-31 19:00:00 2047.459060 690.662724 ... 9651.340702 1142.804232
2016-12-31 20:00:00 2035.036805 675.302580 ... 9639.117255 1125.266863
2016-12-31 21:00:00 2026.887979 668.133067 ... 9637.841629 1124.054410
2016-12-31 22:00:00 2023.906935 674.449005 ... 9683.466597 1118.782875
2016-12-31 23:00:00 2032.573815 705.708943 ... 10066.856620 1176.701108
[8784 rows x 8 columns]
>>> scenario.state.get_hydro()
--> Loading hydro
12977 12978 ... 13463 13464
UTC ...
2016-01-01 00:00:00 14.526031 6.038912 ... 3.607026 7.834264
2016-01-01 01:00:00 12.650153 5.259052 ... 3.141218 6.822554
2016-01-01 02:00:00 6.301027 2.619528 ... 1.564637 3.398307
2016-01-01 03:00:00 2.821834 1.173122 ... 0.700703 1.521888
2016-01-01 04:00:00 2.292740 0.953162 ... 0.569321 1.236534
... ... ... ... ... ...
2016-12-31 19:00:00 10.123868 4.208799 ... 2.513904 5.460064
2016-12-31 20:00:00 10.159403 4.223572 ... 2.522728 5.479228
2016-12-31 21:00:00 10.290649 4.278135 ... 2.555319 5.550013
2016-12-31 22:00:00 15.623852 6.495309 ... 3.879631 8.426347
2016-12-31 23:00:00 24.787511 10.304920 ... 6.155101 13.368545
[8784 rows x 22 columns]
>>> scenario.state.get_solar()
--> Loading solar
12878 12896 ... 14005 14006
UTC ...
2016-01-01 00:00:00 0.000000 0.000000 ... 0.000000 0.000000
2016-01-01 01:00:00 0.000000 0.000000 ... 0.000000 0.000000
2016-01-01 02:00:00 0.000000 0.000000 ... 0.000000 0.000000
2016-01-01 03:00:00 0.000000 0.000000 ... 0.000000 0.000000
2016-01-01 04:00:00 0.000000 0.000000 ... 0.000000 0.000000
... ... ... ... ... ...
2016-12-31 19:00:00 4.038380 1.319879 ... 42.380628 15.867405
2016-12-31 20:00:00 3.682028 1.272921 ... 40.524097 13.642190
2016-12-31 21:00:00 3.247971 1.177243 ... 37.351799 10.046580
2016-12-31 22:00:00 2.301117 0.985453 ... 31.177795 4.248332
2016-12-31 23:00:00 0.932313 0.466155 ... 15.401827 0.205514
[8784 rows x 36 columns]
Should we rename the BACKUP_DATA_ROOT_DIR, DATA_ROOT_DIR and LOCAL_DIR variables in
Tests do not pass on Windows:
I don't feel a strong need to do so currently, but I'm not opposed to it.
Thanks, Windows. Looks like it's due to calling
The main code looks good and works to create, prepare, and run a scenario, and to load its inputs & outputs from within the scenario framework. Once the testing on Windows is worked out, I think it is ready to merge.
Purpose
Remove (most) dependence on the server root path by using relative paths to download/upload files and passing the source and destination roots at instantiation time, so they are reused throughout the object's lifespan. This moves the responsibility of parsing paths, creating folders, etc. away from calling code, and the assumption of two fixed "root" directories simplifies some logic.
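The relative-path idea can be illustrated with a toy sketch (assumed names, not the actual copy_from implementation): one relative path is resolved against both roots, so the local tree mirrors the server tree.

```python
import os

def copy_from_sketch(rel_path, src_root, dest_root, transfer):
    """Toy sketch: resolve rel_path against both roots, create the
    mirrored local folders, then hand off to a transfer callable
    (a stand-in for the actual download, e.g. an SFTP get)."""
    src = src_root + "/" + rel_path                       # server side (posix)
    dest = os.path.join(dest_root, *rel_path.split("/"))  # local side
    os.makedirs(os.path.dirname(dest), exist_ok=True)     # mirror the folders
    transfer(src, dest)
    return dest
```

Because the caller only ever hands over a path relative to the root, the same call works unchanged whichever root directory the access object was constructed with.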
What the code is doing
Chronologically, this involved first changing the semantics of copy_from to implement copying a node from a remote tree to a local tree while preserving the structure. This required a lot of path joins and such to be updated. Along the way I removed the data_access constructor parameter from a few classes, introduced a small PathConfig object that simplifies joins where still needed, and created the CommandBuilder class to reuse the logic for building command line invocations. The commit history may be useful to see this trajectory.
Testing
There are some new unit tests for building commands and file transfers, the latter using a semi-mock connection which operates locally on temp directories.
I also ran a texas simulation using this branch and successfully extracted/downloaded files to verify the end to end process.
Where to look
The
SSHDataAccess
class andserver_setup
module were the starting points, most everything else flows from that, but does include some opportunistic refactoring. TheInputData
andOutputData
classes are a good illustration of simplified logic, and as usual the unit tests are a (hopefully) good demo.Time estimate
30 min