Switch 2.0.5 (csv import/export) #119
Conversation
…bug fixes to rhosetter.py.
…ic examples that use the PySP module of the Pyomo package.
- Output files now obey the current format printed by the PySP module.
- The model-directory option was changed to model-location, since the former is deprecated.
- Updated console output files to reflect these changes.
- Replaced references to old names of various variables with the ones currently in use in Switch.
…sponding outputs folder).
There has been an extra 1 value in the Hydro_RoR row of upgrade_dat/hydro_simple/inputs/proj_build_costs.tab for a long time. In addition, the Hydro and Hydro_RoR rows were delimited with spaces (often multiple) instead of tab characters. These problems also occurred in the original examples/hydro_simple directory, which has been repeatedly upgraded, creating more malformed files along the way. Spaces were also used in some of the other example files, creating similar problems.

These were not caught previously because the following sequence happened:

1. Pyomo was happy to load the original version of these files (2.0.0b0), because it allows space or tab delimiters. It also ignores extra fields on each row, so it just used the first four fields (including the extra 1) and ignored the fifth field. So this example ran with the wrong values for Hydro_RoR in the third and fourth columns until now.
2. The upgrade scripts for later versions read these files using pandas.read_csv with the separator set to '\t'. For every row delimited with spaces, the upgrade script read the whole row into the first field and filled the other fields with nulls. When the script wrote the file back, it preserved the spaces between the values that had been pulled into the first field, and appended extra '.' values for the empty fields.
3. Subsequent upgrade scripts ran fine, because they still read all the space-delimited content into the first field and read the '.' values into the other fields.
4. Switch was also happy to run with these files at each step, because Pyomo used the spaces as field delimiters and ignored the extra '.' values at the end of the rows.

However, when we change to .csv inputs, this no longer works. The upgrade script could read these files with '\t' delimiters, but then most of the values end up in the first column, and they can't be read correctly by Pyomo when it reads these files expecting ',' delimiters.
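The failure mode described above can be reproduced in miniature (this is an illustrative sketch, not the actual Switch upgrade code, and the column names are made up rather than the real proj_build_costs.tab schema): a tab-delimited header plus a space-delimited data row, read with a strict tab separator.

```python
# Hypothetical miniature of a malformed .tab file: the header uses tabs,
# but the data row was delimited with spaces.
import io
import pandas as pd

content = "PROJECT\tperiod\tcost1\tcost2\nHydro_RoR 2020 50 60\n"

# Reading with a strict '\t' separator, as the old upgrade scripts did.
df = pd.read_csv(io.StringIO(content), sep="\t")

# The whole space-delimited row collapses into the first field,
# and the remaining columns are filled with nulls.
print(df.iloc[0, 0])                # Hydro_RoR 2020 50 60
print(df.iloc[0, 1:].isna().all())  # True

# Writing back with na_rep='.' reproduces the trailing '.' padding
# observed in the malformed upgraded files.
print(df.to_csv(sep="\t", index=False, na_rep="."))
```

Pyomo then re-reads such a file happily, splitting on the embedded spaces and discarding the trailing '.' fields, which is why the corruption went unnoticed until the switch to ',' delimiters.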
The solution is as follows:

1. Update the upgrade scripts to use r'\s+' (any whitespace) as the delimiter when reading pre-2.0.5 .tab files, so that they write out correctly formatted, \t-delimited .tab files without extra columns. This fixes the upgrade tests.
2. Remove the extra "1" value from the Hydro_RoR rows and update total_cost.txt to reflect this (in examples and tests).
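A minimal sketch of fix (1), assuming illustrative column names rather than the real .tab schema: read with an any-whitespace separator, then write back with an explicit tab separator.

```python
# Sketch of reading a legacy .tab file whose rows mix space and tab
# delimiters, then normalizing it to clean \t-delimited output.
import io
import pandas as pd

content = "PROJECT\tperiod\tcost1\tcost2\nHydro_RoR 2020 50 60\n"

# r'\s+' treats any run of spaces or tabs as a single delimiter,
# so mixed-delimiter rows parse into the correct fields.
df = pd.read_csv(io.StringIO(content), sep=r"\s+")

# Writing with an explicit '\t' separator normalizes the file.
print(df.to_csv(sep="\t", index=False))
```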
…i models (with warning)
This commit updates Switch to use two-row .csv files instead of .dat files for non-indexed parameters. The first row holds a list of parameter names and the second row holds their values. The upgrade script for 2.0.5 also converts .dat files into .csv files. This means non-indexed inputs can be viewed and edited with the same tools as all other inputs, and also means users don't need to learn the ampl/pyomo .dat data description language.

The following files have been converted to .csv: financials.dat, trans_params.dat, spillage_penalty.dat, spinning_reserve_params.dat, lost_load_cost.dat and hydrogen.dat. However, Switch still uses gen_multiple_fuels.dat, which defines indexed sets (future work will convert this to .csv too).
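The two-row layout can be parsed with nothing more than the standard csv module; a minimal sketch (the parameter names below are illustrative, not the actual contents of any converted file):

```python
# Sketch of reading a two-row .csv of non-indexed parameters:
# first row = parameter names, second row = values.
import csv
import io

text = "base_financial_year,interest_rate,discount_rate\n2020,0.07,0.05\n"

rows = list(csv.reader(io.StringIO(text)))
names, values = rows[0], rows[1]
params = dict(zip(names, values))  # values stay as strings here
print(params)
```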
I just finished testing & pushing the implementation. Although this discussion isn't fully resolved, it's thankfully a minor edit for modules to call load_key_value_inputfile directly, vs. load_aug(..., singleton=True) calls that call load_key_value_inputfile.

My point was that we're not using, nor planning to use, most of those behaviors for key/value files. If we were using those things, I'd agree with your general sentiment to avoid duplicating code. For .dat files, load_aug ignores all parameters except optional (i.e., the auto_select/params arguments, autodetection of optional params, and soon --input-aliases are all ignored). The index-column auto-selection is also irrelevant in this context. Given that 90% of those behaviors are irrelevant here, and the goal of load_aug is to provide a convenient wrapper for underlying calls to load, it seems a less-than-ideal entry point.

I agree that the interface to load_aug could use streamlining & updating. I like your idea for params to appear exactly once in either mandatory_params or optional_params, and to accept the same style of arguments for both. auto_select defaulting to true sounds nice in theory, but I'd want to double-check the codebase to look for any exceptions. I don't know if anyone else is using load_aug with custom modules not in our codebase. So, when we do those API changes to load_aug, I suggest implementing them in a new function with a better name, leaving load_aug the way it is, and having it issue a deprecation warning saying we'll be dropping load_aug in release 2.1.

Just a reminder, semantic versioning recommends bumping the minor version number when we add new features and the patch version number for bug fixes. Given what's going into this release, I'm on the fence about calling it 2.0.5 or 2.1.

Re: more automation of documentation & input file mapping
I started replying to your earlier comment on commit e1f7999, but then had to go into a meeting.

I think we want singleton parameter files to support the same capabilities as our other input files. It is only by accident that we didn't support those capabilities with .dat files, and I don't think there's any argument for why we should support them for most .csv files but not the singleton .csv files. I've certainly seen a mix of optional and mandatory arguments implemented in .dat files, and it's just as plausible for users to want to apply aliases to these files as to any others.

Really, the singleton .csv files are just data like the other .csvs, and should be treated the same way as the others. We're just requiring them to be rotated, and adding code to support that, to soothe your concern about viewing wide vs. tall csvs. It makes the code, documentation and learning curve a little messier, but I'm ok with that because we have to tell people something about singleton parameters anyway. So it might as well be "put singleton parameters in a vertical csv with headers of 'name' and 'value'" instead of "put singleton parameters in a csv with parameter names on the first row and values on the second row." But I don't want to go further and say, "by the way, you also have to use this different function to load them, and it doesn't accept the same arguments or offer the same options as the normal load_aug function."
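For comparison with the two-row layout, the vertical 'name'/'value' layout under discussion parses just as simply; a hedged sketch with made-up parameter names:

```python
# Sketch of reading a vertical singleton-parameter csv with
# headers 'name' and 'value', one parameter per row.
import csv
import io

text = "name,value\ninterest_rate,0.07\ndiscount_rate,0.05\n"

params = {row["name"]: row["value"]
          for row in csv.DictReader(io.StringIO(text))}
print(params)
```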
This is increasingly a departure from the original, which I was personally happy with. I see one-line-per-argument as a convenience to both end users and developers who have to read and debug input files.

For developers who are writing new modules, I think it would be clearest to either provide a transparent wrapper around the standard DataPortal.load() function (like your initial implementation), or write a separate function to do simple parsing (like my current implementation). Mashing up the concepts adds confusion and is unappealing.

If you feel really strongly about this, then we can just revert my contributions and go with Pyomo's crappy format for stuffing simple parameters into wide and shallow csv files. I've seen other pyomo models that abused this with dozens of parameters in an extremely awkward file, and I wished to avoid that slippery slope into lower usability. So far, the only one of our modules that does that is your hydrogen model, which I have yet to use, so it won't be a direct encumbrance to me or the users that I support. All of the core modules only have a handful of params, which aren't that bad in practice for wide format.

FTR, Pyomo supports usable and scalable specifications of simple parameters in
I don't mind departing from .dat files, because my goal here is to provide more simplicity, rather than more power. I want it to be as easy as possible to learn (and then predict) how to create and read input files. We are currently not there -- I've been using this for 4 years and I still have to copy and modify an existing load_aug() call every time I want a new one, and I still get it wrong half the time. And then I have to refresh my memory of ampl's .dat language anytime I want to create a file for a singleton param (OK, I remember that much, but I still have to double-check by looking at an existing file), then go and figure out what function to call to read those. So I am going for several principles here:
Those principles point to always using the same loading interface. They also lead pretty directly to wide csvs for singletons, but it's not a huge detour to use vertical csvs for singletons instead (or maybe even allow both, heaven help us). But using
I'm hearing that you've had a lot of chronic frustration with input files, and anything to improve that process would have significant value to you. I'm a little surprised that you are bothered by copying+pasting+editing prior examples; to me that's an invaluable technique for speeding up development as well as reducing cognitive load, repetitive stress from typing, and the rate of typo-based errors. It sounds like reverting to the wide and flat tables is probably the least bad solution, given how much this bothers you and your perspective on API calls. The more I think about overloading load_aug with a custom parser for key/value csv files, the less I like it. The goal of that was to provide a thin wrapper around

I appreciate your explanations of the principles you are working with. I have a different take on those, but it's helpful to understand your reasoning. For example, I appreciated that I didn't have to explicitly specify the contents of a key/value file when I wanted to load data from it, and specifying those (as in the wide format) is a little cumbersome. To me, key/value inputs are different than indexed tabular data; while mashing them together into a single format &

Anyway, the least bad compromise solution seems to be the wide csv format that DataPortal can directly parse. And moving forward, we'll be doing an overhaul of our DataPortal wrapper (
Yes, that sounds workable. I can't say the copy-paste caused me a whole lot of trouble, but having to do it was a sign that something wasn't right. Almost everything in these calls is invariant, so it should be as easy as

I currently create all the singleton input files from dictionaries with values assigned in my get_scenario_data.py scripts, so it doesn't make a difference to me whether they are vertical or wide. Eventually, some of these (e.g., the hydrogen parameters) might come from a database, where I might also have a wide table to hold them. Then it's an easy export to a wide .csv. Or I suppose people might want to create a tall table in the database, so I guess that's the same debate.

I can sort of see the argument for yaml or some other data format for key-value pairs, but then we're effectively back to the .dat files again. And I find it easier to say "if there's no index, just make a csv with one header row, one data row and no index column" rather than "if there's no index, put the key-value pairs in a .dat/.yaml/tall-csv file".

I think I'm going to release commit 6c3feb as 2.0.5 (after all my pre-release tests), so people can begin installing it today for the tutorial. I'll leave the tall-csv commits as a separate feature branch. And at some point we can rebase the

I know semantic versioning says you should update minor version numbers for feature releases, but pretty much everything we release is a feature upgrade. So we would quickly move up to 2.5, 2.23, 2.71, etc. That may be fine, but it would require us to talk about "Switch 2" instead of "Switch 2.0". This may just be a consequence of (hopefully) having a faster release process than Python (where we'll be talking about "3.7" for quite a while yet). On the other hand, I don't think anyone thinks about which version of pandas, numpy or matplotlib they're using -- they just want the latest.
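The dict-to-wide-csv export described here is a one-liner with pandas; a sketch under the assumption of hypothetical parameter names and a hypothetical get_scenario_data.py-style dict:

```python
# Illustrative export of a singleton-parameter dict to the wide layout:
# one header row of names, one data row of values, no index column.
import pandas as pd

params = {"interest_rate": 0.07, "discount_rate": 0.05}
wide = pd.DataFrame([params])  # one row, one column per parameter
print(wide.to_csv(index=False))
```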
Sounds good. I just pushed those rearrangements to various git branches.
I haven't deleted

Re: Semantic versioning

I reviewed the commits since 2.0.0, and most of them were bug fixes, although there were a few minor new features thrown in. My 5 outstanding pull requests do qualify as new features. We could fudge things and say those features are too minor to bump the minor version, but I don't recommend doing that indefinitely.

I just re-reviewed https://semver.org/ and the section about deprecating functionality. Given that we are changing the input file specifications and effectively deprecating the .tab & .dat functionality, I think setting this version to 2.1 instead of 2.0.5 would be a much clearer signal about the impacts, at least to people who understand semantic versioning and use it as a shorthand. At the moment, I think our audience of professional developers is small, so not following semantic versioning guidelines in this case probably isn't that big of a deal in practice.
PS. After the
This pull request bumps some of the ideas from #115 down the road to 2.0.6, and fast-tracks conversion to .csv as Switch 2.0.5. This is being moved ahead quickly because I will be doing training with significant new users on Aug. 22, and it was really messy to explain which inputs and outputs should be .tab vs. .csv, .tsv or .txt. Now all inputs and outputs are .csv.
This also includes some of the code previously slated for 2.0.5 (up through 7/1/19). However, commits to the `next_version` branch after that date will need to be merged later. This was done because some of those changes need further discussion before we can move ahead. Unfortunately, this leaves out some uncontroversial changes too. Maybe we can cherry-pick some of those into this release?