feat: add transmission to substation track selection #217

Merged on Sep 20, 2021 (1 commit)

rouille
Collaborator

@rouille rouille commented Sep 7, 2021

Pull Request doc

Purpose

Select HIFLD substations based on HIFLD lines

What the code is doing

  • map_lines_to_substations_using_coords: creates a data frame mapping lines to substations based on coordinates
  • assign_substations_to_lines: updates the lines and substations data frames. In particular:
    • updates the SUB_1 and SUB_2 columns in the lines data frame (the names of the selected substations are used)
    • adds new SUB_1_ID and SUB_2_ID columns giving the selected substation IDs
    • adds new OTHER_SUB_1_ID and OTHER_SUB_2_ID columns listing the other candidates at the same endpoint, each either a substation, tap, riser or dead end
  • build_transmission: updated to use the above method to get the lines/substations tuple
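
To make the coordinate-based matching concrete, here is a minimal pure-Python sketch of the idea, not the code in the transmission module. The helper name map_endpoints_to_substations, the input layout (plain dicts of coordinates) and the rule of picking the lowest ID as the selected substation are all assumptions made for illustration; only the rounding parameter and the SUB_n_ID / OTHER_SUB_n_ID column names come from this PR.

```python
from collections import defaultdict


def map_endpoints_to_substations(lines, substations, rounding=2):
    """Hypothetical helper: match line endpoints to substations that share
    the same coordinates once both are rounded to `rounding` decimals."""
    # Group substation IDs by their rounded location.
    by_location = defaultdict(list)
    for sub_id, (lat, lon) in substations.items():
        by_location[(round(lat, rounding), round(lon, rounding))].append(sub_id)

    mapping = {}
    for line_id, endpoints in lines.items():
        row = {}
        for n, (lat, lon) in enumerate(endpoints, start=1):
            key = (round(lat, rounding), round(lon, rounding))
            candidates = sorted(by_location.get(key, []))
            # Assumption: the lowest ID is "selected"; the real selection
            # rule in the PR may differ.
            row[f"SUB_{n}_ID"] = candidates[0] if candidates else None
            row[f"OTHER_SUB_{n}_ID"] = candidates[1:] or None
        mapping[line_id] = row
    return mapping


# Made-up coordinates: substations 1 and 2 collapse to the same rounded point.
substations = {1: (45.001, -93.002), 2: (45.004, -93.001), 3: (46.5, -94.5)}
lines = {10: [(45.002, -93.003), (46.501, -94.499)]}
result = map_endpoints_to_substations(lines, substations, rounding=2)
```

A lower rounding value merges more nearby points, which is consistent with the logs in this PR reporting more substations at the same location with rounding=2 than with the default.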

Testing

How did you test this change (unit/functional testing, manual testing, etc.)?

Where to look

All the code is in the transmission module

Usage Example/Visuals

>>> from prereise.gather.griddata.hifld.data_process.transmission import map_lines_to_substations_using_coords, build_transmission
>>> lines, substations  = build_transmission(method="line2sub", kwargs={"rounding": 2})
dropping 6892 substations of 70857 total due to LINES parameter equal to 0
filter substations based on lines
initial number of substations: 63965
substations with same location after rounding: 13769

initial number of lines: 71554
zero distance lines after rounding: 11589

there are 1117 lines with two substations missing
there are 1848 lines with one substation missing
there are 2042 unique orphan endpoint
there are 7622 substations unconnected

finding closest neighbor to unconnected lines
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2965/2965 [02:50<00:00, 17.43it/s]
2661 line voltages can't be found via neighbor consensus
238 line voltages can't be found via neighbor minimum
>>> lines
                TYPE      STATUS  NAICS_CODE                                    NAICS_DESC  ... SUB_1_ID OTHER_SUB_1_ID SUB_2_ID    OTHER_SUB_2_ID
ID                                                                                          ...                                                   
212144  AC; OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   200352           None   209798              None
212145  AC; OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   200356       [201930]   200356          [201930]
212146  AC; OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   200594           None   200593  [209809, 203487]
212147  AC; OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   200642       [209799]   200642          [209799]
212148  AC; OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   200656           None   209800              None
...              ...         ...         ...                                           ...  ...      ...            ...      ...               ...
175393      OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   173638           None   132385              None
175394      OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   141104           None   173640              None
175395      OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   172406           None   173641              None
175396      OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   173641           None   173640              None
175397      OVERHEAD  IN SERVICE      221121  ELECTRIC BULK POWER TRANSMISSION AND CONTROL  ...   173642           None   141147              None

[71554 rows x 19 columns]
>>> substations
                   X             Y            NAME            CITY STATE    ZIP        TYPE  ... VAL_METHOD             VAL_DATE  LINES MAX_VOLT  MIN_VOLT  MAX_INFER  MIN_INFER
ID                                                                                           ...                                                                                
131072 -1.069084e+07  5.898522e+06   UNKNOWN131072         AUDUBON    MN  56511  SUBSTATION  ...    IMAGERY  2015/08/28 00:00:00      1    115.0     115.0          N          N
131089 -9.311883e+06  5.323699e+06   UNKNOWN131089     GENESEE TWP    MI  48506  SUBSTATION  ...    IMAGERY  2015/08/17 00:00:00      1    138.0     138.0          Y          Y
131090 -1.049062e+07  5.162095e+06  GRAND JUNCTION  GRAND JUNCTION    IA  50107  SUBSTATION  ...    IMAGERY  2018/11/28 00:00:00      4    161.0     161.0          Y          Y
131091 -1.049057e+07  5.161922e+06   UNKNOWN131091  GRAND JUNCTION    IA  50107  SUBSTATION  ...    IMAGERY  2015/08/18 00:00:00      3    161.0 -999999.0          Y          N
131092 -1.047791e+07  5.136316e+06   UNKNOWN131092           PERRY    IA  50220  SUBSTATION  ...    IMAGERY  2015/09/11 00:00:00      1    161.0     161.0          Y          Y
...              ...           ...             ...             ...   ...    ...         ...  ...        ...                  ...    ...      ...       ...        ...        ...
131067 -1.034558e+07  5.161820e+06   UNKNOWN131067    MARSHALLTOWN    IA  50158  SUBSTATION  ...    IMAGERY  2015/09/10 00:00:00      2    161.0     161.0          Y          Y
131068 -1.034633e+07  5.144523e+06   UNKNOWN131068          LAUREL    IA  50141  SUBSTATION  ...    IMAGERY  2015/08/13 00:00:00      3    161.0 -999999.0          Y          N
131069 -1.035529e+07  5.125052e+06          JASPER          NEWTON    IA  50208  SUBSTATION  ...    IMAGERY  2018/11/28 00:00:00      3    161.0     161.0          Y          Y
131070 -1.071809e+07  5.914971e+06   UNKNOWN131070          HAWLEY    MN  56549  SUBSTATION  ...    IMAGERY  2015/08/28 00:00:00      1     69.0      69.0          Y          Y
131071 -1.048157e+07  5.915785e+06   UNKNOWN131071      PINE RIVER    MN  56474  SUBSTATION  ...    IMAGERY  2015/08/28 00:00:00      3     69.0 -999999.0          Y          N

[54492 rows x 24 columns]

Time estimate

30min

@rouille rouille self-assigned this Sep 7, 2021
@danielolsen
Contributor

Is there any path dependence between this PR and the #210-#218-#219 chain? I think they're all independent, but maybe we want to ensure that the others still work under both methods of coming up with lines/substations?

@rouille
Collaborator Author

rouille commented Sep 8, 2021

Yes, I believe they are all independent. We don't have to merge this PR now, we can go over it once we have the whole chain and we want to compare both tracks.

@danielolsen
Contributor

Running the latest code, I'm getting an error:

>>> from prereise.gather.griddata.hifld.data_process.transmission import build_transmission
>>> lines, substations = build_transmission(method="line2sub", kwargs={"rounding": 2})
dropping 6892 substations of 70857 total due to LINES parameter equal to 0
filter substations based on lines
---------------------------------
initial number of substations: 63965
substations with same location after rounding: 13769

initial number of lines: 71554
zero distance lines after rounding: 11589

there are 1117 lines with two substations missing
there are 1848 lines with one substation missing
there are 2042 unique orphan endpoint
there are 7622 substations unconnected

finding closest neighbor to unconnected lines
100%|██████████████████████████████████████| 2965/2965 [02:07<00:00, 23.31it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\DanielOlsen\repos\bes\PreREISE\prereise\gather\griddata\hifld\data_process\transmission.py", line 511, in build_transmission
    lines, substations = assign_substations_to_lines(
  File "C:\Users\DanielOlsen\repos\bes\PreREISE\prereise\gather\griddata\hifld\data_process\transmission.py", line 167, in assign_substations_to_lines
    line2sub.loc[i, f"OTHER_{endpoint}_SUB"] = list(
  File "C:\Python39\lib\site-packages\pandas\core\indexing.py", line 723, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "C:\Python39\lib\site-packages\pandas\core\indexing.py", line 1730, in _setitem_with_indexer
    self._setitem_with_indexer_split_path(indexer, value, name)
  File "C:\Python39\lib\site-packages\pandas\core\indexing.py", line 1785, in _setitem_with_indexer_split_path
    raise ValueError(
ValueError: Must have equal len keys and value when setting with an iterable

Maybe this is because of something upstream?

@rouille
Collaborator Author

rouille commented Sep 9, 2021

It works for me:

>>> from prereise.gather.griddata.hifld.data_process.transmission import map_lines_to_substations_using_coords, build_transmission
>>> lines, substations  = build_transmission(method="line2sub")
dropping 6892 substations of 70857 total due to LINES parameter equal to 0
filter substations based on lines
---------------------------------
initial number of substations: 63965
substations with same location after rounding: 4307

initial number of lines: 71554
zero distance lines after rounding: 3314

there are 1197 lines with two substations missing
there are 2413 lines with one substation missing
there are 2632 unique orphan endpoint
there are 9259 substations unconnected

finding closest neighbor to unconnected lines
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3610/3610 [04:10<00:00, 14.44it/s]
2467 line voltages can't be found via neighbor consensus
296 line voltages can't be found via neighbor minimum
>>> 

@rouille
Collaborator Author

rouille commented Sep 9, 2021

Using your exact same call (passing kwargs={"rounding": 2}), it still works for me:

>>> lines, substations  = build_transmission(method="line2sub", kwargs={"rounding": 2})
dropping 6892 substations of 70857 total due to LINES parameter equal to 0
filter substations based on lines
---------------------------------
initial number of substations: 63965
substations with same location after rounding: 13769

initial number of lines: 71554
zero distance lines after rounding: 11589

there are 1117 lines with two substations missing
there are 1848 lines with one substation missing
there are 2042 unique orphan endpoint
there are 7622 substations unconnected

finding closest neighbor to unconnected lines
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2965/2965 [02:52<00:00, 17.22it/s]
2661 line voltages can't be found via neighbor consensus
238 line voltages can't be found via neighbor minimum
>>> 

@danielolsen
Contributor

@rouille, there must be something funny about my environment: when I spin up a docker container and run pipenv sync, everything runs properly.

@danielolsen
Contributor

Oh, this is strange: using Python 3.9.6 on Windows with pandas 1.2.5, 1.3.1, or 1.3.2 (latest), I get the error shown. If I downgrade to pandas 1.1.5 (as specified in the Pipfile), it works properly. This seems potentially related to pandas issue #32372.

If I change l.167 from line2sub.loc[i, f"OTHER_{endpoint}_SUB"] = list( to line2sub.at[i, f"OTHER_{endpoint}_SUB"] = list(, and l.170 from line2sub.loc[i, f"{endpoint}_SUB"] = all2one[key] to line2sub.at[i, f"{endpoint}_SUB"] = all2one[key], then the example runs without error on both Pandas 1.1.5 and 1.3.2, so it seems like that change could help future-proof our code. .at seems to be preferred over .loc when you know you're setting just a single value.
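
As a small illustration of the suggestion above, the sketch below stores a list in a single cell with .at. The frame layout, the column names OTHER_END_SUB and END_SUB, and the index values are made up for the example and are not the actual line2sub frame; the columns are created with object dtype on the assumption that this is what allows a cell to hold a list.

```python
import pandas as pd

# Hypothetical stand-in for the line2sub frame: object-dtype columns so
# that a single cell can hold an arbitrary Python object such as a list.
line2sub = pd.DataFrame(
    {"END_SUB": [None, None], "OTHER_END_SUB": [None, None]},
    index=[212145, 212146],
    dtype=object,
)

# .at addresses exactly one cell, so the list is stored as one value
# rather than being aligned against the row/column indexer.
line2sub.at[212145, "OTHER_END_SUB"] = [201930]
line2sub.at[212146, "OTHER_END_SUB"] = [209809, 203487]

# Scalar assignment through .at works the same way.
line2sub.at[212145, "END_SUB"] = 200356
```

With .loc, the same list assignment can be interpreted as setting several values at once, which is one plausible reading of the ValueError in the traceback.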

@rouille
Collaborator Author

rouille commented Sep 9, 2021

Done

@danielolsen danielolsen added the hifld Related to ingestion of the HIFLD data label Sep 9, 2021
Contributor

@danielolsen danielolsen left a comment

I've gone through the code and the logic makes sense. Running the two methods with default parameters yields 71,554 lines for the new method, compared to 67,389 lines for the old method. If we trust the line coordinates more than the substation names, this method seems to avoid tossing lines unnecessarily, while only making fairly reasonable assumptions.

A test or two would be nice, but since this new functionality is not being activated by default yet, I don't think that should hold up enabling us to play around with the new method.

@rouille rouille force-pushed the ben/selection branch 2 times, most recently from 4e710cd to 3d4da0e Compare September 10, 2021 19:11
@rouille rouille force-pushed the ben/selection branch 2 times, most recently from 91e5c11 to 0528d6f Compare September 20, 2021 19:02
@rouille rouille merged commit 2631b89 into hifld Sep 20, 2021
@rouille rouille deleted the ben/selection branch September 20, 2021 23:45