Add the option to filter data entries according to multiple input values #59

Merged · 1 commit · Jan 27, 2025
6 changes: 6 additions & 0 deletions README.md
@@ -249,6 +249,12 @@ The data is divided into `job-metadata-inputs`: the properties of the workload t
once the workload completes (e.g., items 6-9 above). The inputs and outputs specification is provided in the
`job_spec.yaml` file. See [this example](examples/MLCommons/job_spec.yaml) of a job spec.

In your job spec, you can use the `job-entry-filter` key to filter out entries from the original data according to
specific input values. In [this example](examples/MLCommons/job_spec_with_value_filter.yaml), we filter out all entries
where the Processor is `2xAMD EPYC 9374F`, but we keep Processor as a data input. The entries specified in
`job-entry-filter` are combined with OR semantics: an entry matching any of the specified values is
filtered out.
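
To make the OR semantics concrete, here is a sketch of a filter with two entries; the keys (`name`, `excluded_values`, `keep_input`) are those introduced in this PR, but the `is_valid` entry is hypothetical and only the Processor entry appears in the repository example:

```yaml
job-entry-filter:
  # Entries are combined with OR: a row matching ANY entry below is removed.
  - name: Processor
    excluded_values: ["2xAMD EPYC 9374F"]
    keep_input: True            # Processor stays in the data after filtering
  - name: is_valid              # hypothetical bookkeeping column
    excluded_values: [False]
    keep_input: False           # drop the column once filtering is done
```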

If the format of your data requires special parsing to transform into a dataframe (i.e., beyond a simple csv file), you
can implement your own parser in [this class](arise_predictions/preprocessing/custom_job_parser.py). For example, the sentiment
analysis example ([here](examples/sentiment_analysis/data)) uses `SAJsonJobParser` as its parser, since its original
11 changes: 4 additions & 7 deletions arise_predictions/preprocessing/job_parser.py
@@ -132,10 +132,6 @@ def collect_jobs_history(data_dir, output_path, job_inputs, job_outputs, start_t
columns_with_derived = utils.adjust_columns_with_duration(job_inputs + job_outputs, start_time_field_name,
end_time_field_name)

# add columns to be filtered by (to be removed at the end of processing)
filter_columns = list(job_entry_filter.keys())
columns_with_derived = columns_with_derived + filter_columns

df = pd.DataFrame(columns=columns_with_derived)

if not os.path.exists(data_dir):
@@ -179,9 +175,10 @@ def collect_jobs_history(data_dir, output_path, job_inputs, job_outputs, start_t
return None, None
else:
if job_entry_filter:
for key, value in job_entry_filter.items():
df = df[df[key] != value]
df = df.drop(key, axis=1)
for entry in job_entry_filter:
df = df[~df[entry[constants.JOB_ENTRY_FILTER_NAME_COL]].isin(entry[constants.JOB_ENTRY_FILTER_VALUES_COL])]
if not entry[constants.JOB_ENTRY_FILTER_KEEP_COL]:
df = df.drop(entry[constants.JOB_ENTRY_FILTER_NAME_COL], axis=1)
Comment on lines 177 to +181
If I understand correctly, you can move the drop before the filtering and only do the filtering when keep is True.

Contributor Author
keep indicates whether we keep in the data the column we filter by. If it is False, we still filter according to its values, but after filtering we discard the column, as it is not needed anymore (e.g., an is_valid column indicating invalid configurations).

Understood now 👍

logger.info("Found {:d} executions in history".format(len(df)))

collect_and_persist_data_metadata(df, job_inputs, job_outputs, output_path)
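
For reference, a minimal standalone sketch of what the new filtering loop does on a toy DataFrame; the column names and values below are illustrative, not taken from the PR's test data:

```python
import pandas as pd

# Hypothetical filter entries, using the keys the PR defines in constants.py
# (name, excluded_values, keep_input).
job_entry_filter = [
    {"name": "Processor", "excluded_values": ["2xAMD EPYC 9374F"], "keep_input": True},
    {"name": "is_valid", "excluded_values": [False], "keep_input": False},
]

df = pd.DataFrame({
    "Processor": ["2xAMD EPYC 9374F", "2xIntel Xeon 8480+", "2xIntel Xeon 8480+"],
    "is_valid": [True, True, False],
    "tokens_per_second": [100.0, 250.0, 30.0],
})

for entry in job_entry_filter:
    # Remove rows whose value in the named column matches any excluded value (OR across entries).
    df = df[~df[entry["name"]].isin(entry["excluded_values"])]
    # If keep_input is False, discard the filter column once filtering is done.
    if not entry["keep_input"]:
        df = df.drop(entry["name"], axis=1)

# One row remains (the valid Intel entry), and the is_valid column has been dropped.
print(df)
```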
3 changes: 3 additions & 0 deletions arise_predictions/utils/constants.py
@@ -12,6 +12,9 @@
JOB_PARSER_CLASS_NAME_FIELD = 'job-parser-class-name'
METADATA_PARSER_CLASS_NAME_FIELD = 'metadata-parser-class-name'
JOB_ENTRY_FILTER_FIELD = 'job-entry-filter'
JOB_ENTRY_FILTER_NAME_COL = 'name'
JOB_ENTRY_FILTER_VALUES_COL = 'excluded_values'
JOB_ENTRY_FILTER_KEEP_COL = 'keep_input'
DUMMY_VARS_PREFIX = 'dummy_input_'
JOB_INPUTS_FEATURE_ENGINEERING = 'job-metadata-fe'
JOB_DATA_DIR = "data"
17 changes: 17 additions & 0 deletions examples/MLCommons/job_spec_with_value_filter.yaml
@@ -0,0 +1,17 @@
job-metadata-inputs:
- "# of Nodes"
- Processor
- Accelerator
- "# of Accelerators"
- "Model MLC"
- Scenario
- "Host Processor Core Count"

job-metadata-outputs:
- tokens_per_second

job-entry-filter:
- name: Processor
excluded_values: ["2xAMD EPYC 9374F"]
keep_input: True
