For the STAT-Analysis "filter" job type apply the -set_hdr job commands to the -dump_row output file. #1129

Open
JohnHalleyGotway opened this issue May 22, 2019 · 4 comments
Labels
priority: medium Medium Priority type: enhancement Improve something that it is currently doing

Comments

@JohnHalleyGotway
Collaborator

This issue arose when Jonathan ran SWPC data through MET. This resulted in long strings in the MET output columns. The strings were too long for METviewer and caused the loader to fail. To patch the data, Jonathan used sed to shorten the strings. We tried running a STAT-Analysis filter job with the -set_hdr option to do this, but it didn't work. The -set_hdr option currently only applies to the output of the aggregate or aggregate_stat job types.

This task is to enable the -set_hdr option for the filter job type when creating the -dump_row output. BUT it should only apply to the -dump_row output from filter, and not the -dump_row output of the other job types!

@JohnHalleyGotway added the type: enhancement and component: application code labels May 22, 2019
@JohnHalleyGotway added this to the MET 9.0 milestone May 22, 2019
@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented May 22, 2019

Also consider supporting the use of regular expressions for the -set_hdr option throughout stat-analysis.

Currently, the set_hdr value is applied to all of the output for the column. But this makes using the "-by" option less useful.

For example, let's say you have data with FCST_VAR = TMP and UGRD, you could define:

-job aggregate -line_type SL1L2 -by FCST_VAR
-set_hdr FCST_VAR TEMPERATURE 'T.'
-set_hdr FCST_VAR UWIND 'UG.'

So the set_hdr options would only be applied when the current string matches the regular expression listed. We'd still need to support the old logic when no regular expression is specified. So the default could be a regular expression of '.*'.
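The matching rule proposed above can be sketched in Python. This is only an illustration of the idea, not STAT-Analysis code: the function name `apply_set_hdr`, the rule tuple layout, the patterns used in the example, and the choice of full-string matching with "first matching rule wins" are all assumptions made for the sketch.

```python
import re


def apply_set_hdr(row, rules):
    """Return a copy of `row` with conditional header overrides applied.

    row   : dict mapping header column name -> current string value
    rules : list of (column, new_value, pattern) tuples; a pattern of
            None stands for the proposed default '.*' (always applies).

    Assumptions of this sketch: the pattern must match the whole current
    value, and the first matching rule for a column wins.
    """
    out = dict(row)
    done = set()  # columns that have already been overridden
    for column, new_value, pattern in rules:
        if column in done or column not in row:
            continue
        regex = pattern if pattern is not None else ".*"
        if re.fullmatch(regex, row[column]):
            out[column] = new_value
            done.add(column)
    return out


# Illustrative rules (the patterns here are hypothetical examples).
rules = [
    ("FCST_VAR", "TEMPERATURE", r"TMP"),
    ("FCST_VAR", "UWIND", r"UG.*"),
]

tmp_row = apply_set_hdr({"FCST_VAR": "TMP"}, rules)
ugrd_row = apply_set_hdr({"FCST_VAR": "UGRD"}, rules)
vgrd_row = apply_set_hdr({"FCST_VAR": "VGRD"}, rules)
legacy_row = apply_set_hdr({"FCST_VAR": "VGRD"}, [("FCST_VAR", "WIND", None)])
```

With a pattern of None the override applies to every value, which mirrors the existing behavior when no regular expression is given.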

@JohnHalleyGotway
Collaborator Author

Here's an email to Mallory Row describing the potential for related development:

Mallory,

When used for the aggregate or aggregate_stat job types, the intended purpose of the -dump_row option is for users to be able to see the actual input lines that were used when processing each job. So it's meant as a sanity check to double-check the filtering logic the user defined.

But for the filter job (which writes its output to the -dump_row file), the intention is a little different. For filter, stat_analysis is a fancy form of "grep", enabling the user to slice/dice their data however they'd like.

Just earlier today, we talked about enhancing the filter job type to support the "-set_hdr" option. For example, we have some data with a very long FCST_UNITS string and want to reset that to a shorter string. So we'd like to run a job like this:
stat_analysis -lookin stat_data -job filter -set_hdr FCST_UNITS TEC -dump_row short_units.txt

But this is not currently supported. Here's the GitHub issue for this:
#1129

If -set_hdr is used, this would require STAT-Analysis to actually parse the input lines, update the strings, and write them back out. As long as we're parsing the data anyway, we could also consider updating the version number before writing it to the output. And in that step, we would, for example, add FCST_UNITS and OBS_UNITS to the output of the filter job.

It seems to me like using "-dump_row" in both contexts is confusing. Instead, perhaps we should require that the "filter" job use the "-out_stat" job command option to specify its output file?

Would that be a useful solution? Of course, that would only fix .stat output files. There is no "filter" job for MODE or MTD output data.
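The parse, update, and rewrite step described in the email could look roughly like the following Python sketch. It is not how STAT-Analysis is implemented; the function name `set_header_columns` and the toy column layout are hypothetical, and the sketch assumes whitespace-delimited lines whose values contain no embedded spaces.

```python
def set_header_columns(header_line, data_lines, set_hdr):
    """Overwrite named header columns in whitespace-delimited data lines.

    header_line : line listing the column names
    data_lines  : iterable of data lines using the same column order
    set_hdr     : dict mapping column name -> replacement string

    Assumption of this sketch: fields never contain embedded whitespace,
    so a plain split() recovers the columns.
    """
    names = header_line.split()
    # Resolve each requested column name to its position once, up front.
    positions = {name: names.index(name) for name in set_hdr if name in names}
    out = []
    for line in data_lines:
        fields = line.split()
        for name, idx in positions.items():
            if idx < len(fields):
                fields[idx] = set_hdr[name]
        out.append(" ".join(fields))
    return out


# Toy data with a shortened, hypothetical column layout.
header = "VERSION MODEL FCST_UNITS LINE_TYPE"
data = [
    "V8.1 GFS a_very_long_units_string SL1L2",
    "V8.1 GFS another_long_units_string SL1L2",
]
shortened = set_header_columns(header, data, {"FCST_UNITS": "TEC"})
```

Because every line is parsed into fields anyway, this is also the point where a version update or extra columns could be applied before writing.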

@JohnHalleyGotway
Collaborator Author

Mallory confirms that this functionality would be useful. So the changes would be this:

(1) The -dump_row option remains as-is... whatever .stat lines are read as input should be written to the output -dump_row file. If we're writing the first line of output and the first line read is a header line... dump that to the output file. All future header lines should be ignored.

(2) For the filter job, make the -out_stat command line option required. Regardless of the version of the .stat lines read as input, the output will now be written for the current version number.
Should we buffer all of the lines in one ASCII table in memory to get the columns to line up? Maybe that's overkill and isn't worth the extra memory consumption.
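To make the two points above concrete, here is a small Python sketch of the header handling in (1) and of the buffered column alignment raised in (2). It is an illustration only: the function names are hypothetical, and it assumes that a header line can be recognized by its first token being VERSION.

```python
def drop_repeated_headers(lines, header_token="VERSION"):
    """Yield lines, keeping a header only if it is the first line written.

    Assumption of this sketch: a header line is one whose first
    whitespace-delimited token equals `header_token`.
    """
    wrote_any = False
    for line in lines:
        is_header = line.split()[:1] == [header_token]
        if is_header and wrote_any:
            continue  # ignore every header after output has started
        yield line
        wrote_any = True


def align_columns(lines):
    """Pad every column to its widest entry.

    This needs all rows in memory at once, which is the memory cost
    questioned above.
    """
    rows = [line.split() for line in lines]
    widths = {}
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths.get(i, 0), len(field))
    return [
        " ".join(field.ljust(widths[i]) for i, field in enumerate(row)).rstrip()
        for row in rows
    ]


lines = [
    "VERSION MODEL LINE_TYPE",
    "V8.1 GFS SL1L2",
    "VERSION MODEL LINE_TYPE",
    "V8.1 LONGMODELNAME SL1L2",
]
kept = list(drop_repeated_headers(lines))
aligned = align_columns(kept)
```

The first function streams line by line, while the second must hold every row before it can write anything, so the alignment step is where the extra memory would go.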

@TaraJensen
Contributor

Charge 277047
