Skip to content

Latest commit

 

History

History
169 lines (144 loc) · 20.7 KB

metrics.rst

File metadata and controls

169 lines (144 loc) · 20.7 KB

Available metrics

This software is pre-production and should not be deployed to production servers.

Check out metrics sources documentation for more details how metrics are measured and about labels/levels.

For searchable list of metrics metrics as csv file .

  • Name: is the name of metric that will be exported to Prometheus by using Prometheus exposition format but also the name of the key in Measurements dict-like type used in Detector and Allocator plugins,

  • Help: information what metric represents and some details how metric was collected and known problems or limitations,

  • Unit: unit of the metric (usually seconds or bytes),

  • Type: only possible types are gauge and counter as described in Prometheus metric types.

  • Source: short description about mechanics that was used to collect metric, for more detailed information check out Metric sources documenation.

  • Enabled - column describes if metric is enabled by default and how to enable (option in MeasurementRunner responsible for configuring it. Please refer to metrics sources documentation for more details.)

  • Levels/Labels - some metrics have additional dimensions (more granularity than just Task or Platform) e.g. task_mem_numa_pages can be collected per NUMA node - in this case this metrics have attached additional label like numa_node=0 which creates new series in Prometheus nomenclature and represents more granular information about source of metric. When used in python API in Detector or Allocator classes this will be represented by nested dicts where each level have keys corresponding to "level" (order is important). For example doubly nested perf uncore metrics like: platform_cas_count_reads have two levels: socket and pmu_type (which physically represents memory controller) will be encoded as:

    platform_cas_count_reads{socket=0, pmu_type=17} 12345
    

    and represented in Python API as:

    measurements = {'platform_cas_count_reads': {0: {17: 12345}}}
    
Name Help Enabled Unit Type Source Levels/Labels
task_instructions Hardware PMU counter for number of instructions (PERF_COUNT_HW_INSTRUCTIONS). Fixed counter. Predefined perf PERF_TYPE_HARDWARE. Please man perf_event_open for more details. no (event_names) numeric counter perf subsystem with cgroups  
task_cycles Hardware PMU counter for number of cycles (PERF_COUNT_HW_CPU_CYCLES). Fixed counter. Predefined perf PERF_TYPE_HARDWARE. Please man perf_event_open for more details. no (event_names) numeric counter perf subsystem with cgroups  
task_cache_misses Hardware PMU counter for cache-misses (PERF_COUNT_HW_CACHE_MISSES).Predefined perf PERF_TYPE_HARDWARE. Please man perf_event_open for more details. no (event_names) numeric counter perf subsystem with cgroups  
task_cache_references Hardware PMU counter for number of cache references (PERF_COUNT_HW_CACHE_REFERENCES).Predefined perf PERF_TYPE_HARDWARE. Please man perf_event_open for more details. no (event_names) numeric counter perf subsystem with cgroups  
task_stalled_mem_loads Execution stalls while memory subsystem has an outstanding load.CYCLE_ACTIVITY.STALLS_MEM_ANYIntel SDM October 2019 19-24 Vol. 3B, Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_offcore_requests_l3_miss_demand_data_rd Increment each cycle of the number of offcore outstanding demand data read requests from SQ that missed L3.Counts number of Offcore outstanding Demand Data Read requests that miss L3 cache in the superQ every cycle.OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RDIntel SDM October 2019 19-24 Vol. 3B, Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_offcore_requests_demand_data_rd Counts the Demand Data Read requests sent to uncore. OFFCORE_REQUESTS.DEMAND_DATA_RD Intel SDM October 2019 19-24 Vol. 3B, Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_offcore_requests_demand_rfo Demand RFO read requests sent to uncore, including regular RFOs, locks, ItoM. OFFCORE_REQUESTS.DEMAND_RFO Intel SDM October 2019 19-24 Vol. 3B, Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_offcore_requests_outstanding_l3_miss_demand_data_rd Demand Data Read requests who miss L3 cache. OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD.Intel SDM October 2019 19-24 Vol. 3B, Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_mem_load_retired_local_pmm Retired load instructions with local Intel® Optane™ DC persistent memory as the data source and the datarequest missed L3 (AppDirect or Memory Mode), and DRAM cache (Memory Mode). MEM_LOAD_RETIRED.LOCAL_PMM (Mnemonic) For CLX, Intel SDM October 2019 19-24 Vol. 3B, Table 19-4 no (event_names) numeric counter perf subsystem with cgroups  
task_mem_load_retired_local_dram Retired load instructions which data sources missed L3 but serviced from local DRAM.MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM Intel SDM October 2019 Chapters 19-24 Vol. 3B Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_mem_load_retired_remote_dram Retired load instructions which data sources missed L3 but serviced from remote dram. MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAMIntel SDM October 2019 Chapters 19-24 Vol. 3B Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_mem_inst_retired_loads MEM_INST_RETIRED.ALL_LOADS All retired load instructions. Intel SDM October 2019 Chapters 19-24 Vol. 3B Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_mem_inst_retired_stores MEM_INST_RETIRED.ALL_STORES All retired store instructions. Intel SDM October 2019 Chapters 19-24 Vol. 3B Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_dtlb_load_misses DTLB_LOAD_MISSES.WALK_COMPLETEDCounts demand data loads that caused a completedpage walk of any page size (4K/2M/4M/1G). This impliesit missed in all TLB levels. The page walk can end withor without a faultIntel SDM October 2019 Chapters 19-24 Vol. 3B Table 19-3 no (event_names) numeric counter perf subsystem with cgroups  
task_scaling_factor_avg Perf subsystem metric scaling factor, averaged value of all events and cpus (value 1.0 is the best, meaning that there is no scaling at all for any metric). auto (depending on event_names) numeric gauge perf subsystem with cgroups  
task_scaling_factor_max Perf subsystem metric scaling factor, maximum value of all events and cpus (value 1.0 is the best, meaning that there is no scaling at all for any metric). auto (depending on event_names) numeric gauge perf subsystem with cgroups  
task_ips Instructions per second. no (enable_derived_metrics) numeric gauge derived from perf subsystem with cgroups  
task_ipc Instructions per cycle. no (enable_derived_metrics) numeric gauge derived from perf subsystem with cgroups  
task_cache_hit_ratio Cache hit ratio, based on cache-misses and cache-references. no (enable_derived_metrics) numeric gauge derived from perf subsystem with cgroups  
task_cache_misses_per_kilo_instructions Cache misses per kilo instructions. no (enable_derived_metrics) numeric gauge derived from perf subsystem with cgroups  
task_llc_occupancy_bytes LLC occupancy from resctrl filesystem based on Intel RDT technology. auto (rdt_enabled) bytes gauge resctrl filesystem  
task_mem_bandwidth_bytes Total memory bandwidth using Memory Bandwidth Monitoring. auto (rdt_enabled) bytes counter resctrl filesystem  
task_mem_bandwidth_local_bytes Total local memory bandwidth using Memory Bandwidth Monitoring. auto (rdt_enabled) bytes counter resctrl filesystem  
task_mem_bandwidth_remote_bytes Total remote memory bandwidth using Memory Bandwidth Monitoring. auto (rdt_enabled) bytes counter resctrl filesystem  
task_cpu_usage_seconds Time taken by task based on cpuacct.usage (total kernel and user space). yes seconds counter cgroup filesystem  
task_mem_usage_bytes Memory usage_in_bytes per tasks returned from cgroup memory subsystem. yes bytes gauge cgroup filesystem  
task_mem_max_usage_bytes Memory max_usage_in_bytes per tasks returned from cgroup memory subsystem. yes bytes gauge cgroup filesystem  
task_mem_limit_bytes Memory limit_in_bytes per tasks returned from cgroup memory subsystem. yes bytes gauge cgroup filesystem  
task_mem_soft_limit_bytes Memory soft_limit_in_bytes per tasks returned from cgroup memory subsystem. yes bytes gauge cgroup filesystem  
task_mem_numa_pages Number of used pages per NUMA node(key: hierarchical_total is used if available or justtotal with warning), from cgroup memory controller from memory.numa_stat file. yes numeric gauge cgroup filesystem numa_node
task_mem_page_faults Number of page faults for task. yes numeric counter cgroup filesystem  
task_wss_referenced_bytes Task referenced bytes during last measurements cycle based on /proc/smaps Referenced field, with /proc/PIDs/clear_refs set to after task gets stable.Warning: this is intrusive collection, because can influence kernel page reclaim policy and add latency.Refer to https://github.com/brendangregg/wss#wsspl-referenced-page-flag for more details. no (wss_reset_cycles) bytes gauge /proc/PIDS/smaps  
task_working_set_size_bytes Task referenced bytes during last stable measurements cycle based on /proc/smaps Referenced field, with /proc/PIDs/clear_refs set to after task gets stable.Warning: this is intrusive collection, because can influence kernel page reclaim policy and add latency.Refer to https://github.com/brendangregg/wss#wsspl-referenced-page-flag for more details. no (wss_reset_cycles) bytes gauge /proc/PIDS/smaps  
task_wss_measure_overhead_seconds Seconds that WCA agent spent (kernel time) waiting for /proc/smapsor reseting accessed_bits no (wss_reset_cycles) seconds counter /proc/PIDS/smaps /proc/PIDS/clear_refs  
task_sched_stat Aggregated statistics for all pids in task (sum from all pids) from /proc/PID/sched. Each field is represented by its own key label no (sched) None counter /proc/PIDS/sched key
task_sched_stat_numa_faults Aggregated statistics for all pids in task from /proc/PID/sched only but only numa_faults line (sum is used as default aggregation function). Different numa_fault fields are represented by fault_type and numa_node labels no (sched) None counter /proc/PIDS/sched numa_node, fault_type
task_requested_cpus Tasks resources cpus initial requests. yes numeric gauge orchestrator  
task_requested_mem_bytes Tasks resources memory initial requests. yes bytes gauge orchestrator  
task_last_seen Time the task was last seen. yes timestamp counter internal  
task_up Always returns 1 for running task. yes numeric counter internal  
task_subcontainers Returns number of Kubernetes Pod Containers or 0 for others. yes numeric gauge internal  
Name Help Enabled Unit Type Source Levels/Labels
platform_topology_cores Platform information about number of physical cores yes numeric gauge internal  
platform_topology_cpus Platform information about number of logical cpus yes numeric gauge internal  
platform_topology_sockets Platform information about number of sockets yes numeric gauge internal  
platform_dimm_count Number of RAM DIMM (all types memory modules) no (gather_hw_mm_topology) numeric gauge dmidecode binary output dimm_type
platform_dimm_total_size_bytes Total RAM size (all types memory modules) no (gather_hw_mm_topology) bytes gauge dmidecode binary output dimm_type
platform_mem_mode_size_bytes Size of RAM (Persistent memory) configured in memory mode. no (gather_hw_mm_topology) numeric gauge ipmctl binary output  
platform_dimm_speed_bytes_per_second Total platform DRAM speed no (gather_hw_mm_topology) bytes_per_second gauge dmidecode binary output  
platform_cpu_usage Logical CPU usage in 1/USER_HZ (usually 10ms).Calculated using values based on /proc/stat. yes numeric counter /proc filesystem cpu
platform_mem_usage_bytes Total memory used by platform in bytes based on /proc/meminfo and uses heuristic based on linux free tool (total - free - buffers - cache). yes bytes gauge /proc filesystem  
platform_mem_numa_free_bytes NUMA memory free per NUMA node based on /sys/devices/system/node/* (MemFree:) yes bytes gauge /sys filesystem numa_node
platform_mem_numa_used_bytes NUMA memory free per NUMA used based on /sys/devices/system/node/* (MemUsed:) yes bytes gauge /sys filesystem numa_node
platform_vmstat_numa_pages_migrated Virtual Memory stats based on /proc/vmstat for number of migrates pages (autonuma) yes numeric counter /proc filesystem  
platform_vmstat_pgmigrate_success Virtual Memory stats based on /proc/vmstat for number of migrates pages (succeed) yes numeric counter /proc filesystem  
platform_vmstat_pgmigrate_fail Virtual Memory stats based on /proc/vmstat for number of migrates pages (failed) yes numeric counter /proc filesystem  
platform_vmstat_numa_hint_faults Virtual Memory stats based on /proc/vmstat for pgfaults for migration hints yes numeric counter /proc filesystem  
platform_vmstat_numa_hint_faults_local Virtual Memory stats based on /proc/vmstat: pgfaults for migration hints (local) yes numeric counter /proc filesystem  
platform_vmstat_pgfaults Virtual Memory stats based on /proc/vmstat:number of page faults yes numeric counter /proc filesystem  
platform_vmstat Virtual Memory stats based on /proc/vmstat - all possible keys or matching regexp yes (vmstat) numeric counter /proc filesystem key
platform_node_vmstat Virtual Memory stats based on /sys/devices/system/node/nodeX/vmstat all keys or matching regexp yes (vmstat) numeric counter /proc filesystem numa_node, key
platform_pmm_bandwidth_reads Persistent memory module number of reads. no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_pmm_bandwidth_writes Persistent memory module number of writes. no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_cas_count_reads Column adress select number of reads no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_cas_count_writes Column adress select number of writes no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_upi_rxl_flits TBD no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_upi_txl_flits TBD no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_rpq_occupancy Pending queue occupancy no (uncore_event_names) numeric gauge perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_rpq_inserts Pending queue allocations no (uncore_event_names) numeric gauge perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_imc_clockticks IMC clockticks no (uncore_event_names) numeric counter perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_rpq_read_latency_seconds Read latency no (uncore_event_names: platform_imc_clockticks, platform_rpq_occupancy, platform_rpq_inserts and set enable_derived_metrics) seconds gauge derived from perf uncore socket
platform_pmm_reads_bytes_per_second TBD no (uncore_event_names: platform_pmm_bandwidth_reads and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_pmm_writes_bytes_per_second TBD no (uncore_event_names: platform_pmm_bandwidth_writes and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_pmm_total_bytes_per_second TBD no (uncore_event_names: platform_pmm_bandwidth_reads, platform_pmm_bandwidth_writes and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_dram_reads_bytes_per_second TBD no (uncore_event_names: platform_cas_count_reads and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_dram_writes_bytes_per_second TBD no (uncore_event_names: platform_cas_count_writes and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_dram_total_bytes_per_second TBD no (uncore_event_names: platform_cas_count_reads, platform_cas_count_writes and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_dram_hit_ratio TBD no (uncore_event_names: platform_cas_count_reads, platform_cas_count_writes and set enable_derived_metrics) numeric gauge derived from perf uncore socket, pmu_type
platform_upi_bandwidth_bytes_per_second TBD no (uncore_event_names: platform_upi_txl_flits, platform_upi_rxl_flits and set enable_derived_metrics) numeric counter derived from perf uncore socket, pmu_type
platform_scaling_uncore_factor Perf uncore subsystem metric scaling factor(value 1.0 is the best, meaning that there is no scaling at all for any uncore metric) auto, (depending on uncore_event_names) numeric gauge perf subsystem with dynamic PMUs (uncore) socket, pmu_type
platform_zoneinfo Dynamic metric with many keys based on fields from /proc/zoneinfo grouped by numa_node and zone (only Normal zone) yes (zoneinfo option) numeric gauge /proc filesystem numa_node, zone, key
platform_last_seen Timestamp the information about platform was last collected yes timestamp counter internal  
platform_capacity_per_nvdimm_bytes Platform capacity per NVDIMM yes bytes gauge internal  
platform_avg_power_per_nvdimm_watts Average power used by NVDIMM on the platform yes watts gauge internal  
platform_nvdimm_read_bandwidth_bytes_per_second Theoretical reads bandwidth per platform yes bytes_per_second gauge internal socket
platform_nvdimm_write_bandwidth_bytes_per_second Theoretical writes bandwidth per platform yes bytes_per_second gauge internal socket
Name Help Enabled Unit Type Source Levels/Labels
wca_up Health check for WCA returning timestamps of last iteration yes timestamp counter internal  
wca_information Special metric to cover some meta information like wca_version or cpu_model or platform topology (to be used instead of include_optional_labels) yes numeric gauge internal  
wca_tasks Number of discovered tasks yes numeric gauge internal  
wca_mem_usage_bytes Memory usage by WCA itself (getrusage for self and children). yes bytes gauge internal  
wca_duration_seconds Internal WCA function call duration metric for profiling yes numeric gauge internal  
wca_duration_seconds_avg Internal WCA function call duration metric for profiling (average from last restart) yes numeric gauge internal