Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap up initial data access refactor #570

Merged
merged 18 commits into from
Feb 9, 2022
Merged

Conversation

jenhagg
Copy link
Collaborator

@jenhagg jenhagg commented Jan 11, 2022

Purpose

Relates to #552

Based on the changes introduced in the data_access_refactor feature branch, this contains bugfixes and some code cleanup. Additionally it introduces blob storage as a read only source of scenario data, using fs-azureblob to maintain consistency across filesystems.

What the code is doing

  • Fix bugs encountered while launching simulations in a container and in preparing one on the server.
  • Remove the move_to method which is only being used for the scenario/execute list, and combine with push which is what is used for those files
  • use MultiFS to combine remote filesystems (represented by the DataAccess.fs attribute)
  • make code coverage report opt in
  • new unit test for creating/preparing a scenario using an ephemeral filesystem for local and "remote" storage. only external dependency is the test_usa_tamu profiles from blob storage
  • enable switch to delete state once scenario is created, and unit test for deletion

Testing

Ran a simulation in a container, and created/prepared one on the server. For the container, I removed the pip install . from the Dockerfile and mounted the PowerSimData repo on my local machine into /PowerSimData in the client container, so changes could be tested more quickly. However this is optional and shouldn't affect functionality.

Time estimate

40 min

@jenhagg jenhagg self-assigned this Jan 11, 2022
@jenhagg jenhagg force-pushed the data_access_refactor branch from 69a3a05 to d933773 Compare January 13, 2022 20:27
@jenhagg jenhagg force-pushed the data_access_refactor branch from d933773 to 85fe7f8 Compare January 14, 2022 22:14
@jenhagg jenhagg marked this pull request as draft January 21, 2022 01:11
@kasparm
Copy link
Contributor

kasparm commented Jan 28, 2022

When preparing the scenario from my machine I get:

>>> scenario.create_scenario()
CREATING SCENARIO: test | dummy1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/scenario/create.py", line 116, in create_scenario
    self._scenario_list_manager.add_entry(self._scenario_info)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/csv_store.py", line 18, in wrapper
    checksum = self.data_access.checksum(self._FILE_NAME)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/data_access.py", line 289, in checksum
    return self.fs.checksum(full_path)
AttributeError: 'MultiFS' object has no attribute 'checksum'

@kasparm
Copy link
Contributor

kasparm commented Jan 28, 2022

Now:

>>> scenario.create_scenario()
CREATING SCENARIO: test | dummy1

Transferring ScenarioList.csv from ssh_fs
100%|#####################################################################################| 669k/669k [00:00<00:00, 3.19Mb/s]
--> Adding entry in ScenarioList.csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/scenario/create.py", line 116, in create_scenario
    self._scenario_list_manager.add_entry(self._scenario_info)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/csv_store.py", line 20, in wrapper
    self.commit(table, checksum)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/csv_store.py", line 76, in commit
    self.data_access.push(tmp_name, checksum, rename=self._FILE_NAME)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/data_access.py", line 294, in push
    self._check_file_exists(backup, should_exist=False, mode="w")
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/data_access.py", line 183, in _check_file_exists
    raise OSError(f"{path} already exists on {location}")
OSError: ScenarioList.csv.temp already exists on ssh_fs

@jenhagg
Copy link
Collaborator Author

jenhagg commented Jan 28, 2022

Couple things:

  1. the .temp file should be deleted when we update the scenario list. not sure why that didn't happen so I did it manually.
  2. based on the stack trace, i think you are on a previous version of the branch before I force pushed (it wasn't up very long, thought I would get away with that). don't think it would cause this particular issue, but still recommend doing a pull

@kasparm
Copy link
Contributor

kasparm commented Jan 28, 2022

I had done a pull but maybe somehow the .temp got left behind.
Now I see the exec_command is missing for MultiFS.
Full stack trace:

>>> scenario.create_scenario()
CREATING SCENARIO: test | dummy2

Transferring ScenarioList.csv from ssh_fs
100%|#####################################################################################| 669k/669k [00:00<00:00, 3.04Mb/s]
--> Adding entry in ScenarioList.csv
Transferring ScenarioList.csv to server
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/scenario/create.py", line 116, in create_scenario
    self._scenario_list_manager.add_entry(self._scenario_info)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/csv_store.py", line 20, in wrapper
    self.commit(table, checksum)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/csv_store.py", line 76, in commit
    self.data_access.push(tmp_name, checksum, rename=self._FILE_NAME)
  File "/Users/kmueller/EGM/PowerSimData/powersimdata/data_access/data_access.py", line 313, in push
    _, _, stderr = self.fs.exec_command(command)
AttributeError: 'MultiFS' object has no attribute 'exec_command'

@jenhagg jenhagg force-pushed the jon/refactor branch 3 times, most recently from f60d26c to 2e5fa4f Compare January 29, 2022 22:50
@jenhagg jenhagg force-pushed the data_access_refactor branch from 85fe7f8 to a8d5d95 Compare January 31, 2022 19:55
@jenhagg jenhagg force-pushed the jon/refactor branch 2 times, most recently from 6efaba5 to 009c5b6 Compare January 31, 2022 20:46
@jenhagg jenhagg marked this pull request as ready for review January 31, 2022 20:59
@kasparm kasparm self-requested a review February 1, 2022 19:54
Copy link
Contributor

@kasparm kasparm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When preparing scenario in client server mode I can successfully prepare it but after a while I get the socket exception:

>>> scenario.prepare_simulation_input()
Transferring ExecuteList.csv from ssh_fs
100%|####################################################################################| 50.5k/50.5k [00:00<00:00, 692kb/s]
---------------------------
PREPARING SIMULATION INPUTS
---------------------------
--> Loading demand
Writing scaled demand profile to tmp/scenario_3911/demand.csv
Writing tmp/scenario_3911/demand.csv
--> Loading hydro
Writing scaled hydro profile to tmp/scenario_3911/hydro.csv
Writing tmp/scenario_3911/hydro.csv
--> Loading solar
Writing scaled solar profile to tmp/scenario_3911/solar.csv
Writing tmp/scenario_3911/solar.csv
--> Loading wind
Writing scaled wind profile to tmp/scenario_3911/wind.csv
Writing tmp/scenario_3911/wind.csv
Building MPC file
Writing tmp/scenario_3911/case.mat
Transferring ExecuteList.csv from ssh_fs
100%|####################################################################################| 50.5k/50.5k [00:00<00:00, 717kb/s]
--> Setting status=prepared in execute list
Transferring ExecuteList.csv to server
>>> Socket exception: Operation timed out (60)

and every consecutive call such as scenario.print_scenario_info() returns OSError: Socket is closed

It looks like the scenario was prepared properly but I still get the operation timed out error.

@jenhagg jenhagg force-pushed the data_access_refactor branch from a8d5d95 to fbb2718 Compare February 7, 2022 19:42
@jenhagg
Copy link
Collaborator Author

jenhagg commented Feb 8, 2022

Deleting a scenario is still extremely slow. I will look through pyfilesystem source to see how it's iterating glob matches.

Update: fixed.

@kasparm
Copy link
Contributor

kasparm commented Feb 9, 2022

Deleting is much faster now. Thank you!

Copy link
Contributor

@kasparm kasparm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Thank you for your work!

@jenhagg jenhagg merged commit ed1a3ab into data_access_refactor Feb 9, 2022
@jenhagg jenhagg deleted the jon/refactor branch February 9, 2022 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants