
Input sqlserver - Query Stats #7842

Closed
wants to merge 11 commits

Conversation

Collaborator

@Trovalo Trovalo commented Jul 16, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

closes #7789

This PR adds a new query which gets data from sys.dm_exec_query_stats.

It fetches incremental data about the top 50 queries executed on the SQL Server instance (as long as a query plan is in the cache).
The information gathered is at the query and execution plan level, and is useful to check which queries have the longest duration, which often corresponds to the heaviest or most executed queries.
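
The underlying idea can be sketched with a minimal T-SQL query against the DMV; the column list and ordering below are illustrative only, not the PR's exact statement:

```sql
-- Minimal sketch of the idea (illustrative only): aggregate cached
-- plan statistics per query/plan hash and keep the heaviest 50.
SELECT TOP (50)
     qs.[query_hash]
    ,qs.[query_plan_hash]
    ,SUM(qs.[execution_count])           AS [execution_count]
    ,SUM(qs.[total_worker_time]) / 1000  AS [total_worker_time_ms]
    ,SUM(qs.[total_elapsed_time]) / 1000 AS [total_elapsed_time_ms]
FROM sys.dm_exec_query_stats AS qs
GROUP BY qs.[query_hash], qs.[query_plan_hash]
ORDER BY [total_worker_time_ms] DESC;
```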

The query has been tested on:

  • SQL on-prem from 2008 to 2019
  • Azure SQL DB

Notes:

  • The query is disabled in the proposed default configuration
  • It's pointless to run this every few seconds; it makes sense to run it every few minutes
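
Telegraf supports a per-plugin interval, so a collection cadence of a few minutes can be set on the input itself; a sketch (the connection string is a placeholder):

```toml
# Sketch only: run the sqlserver input every 5 minutes regardless of the
# agent-wide interval; the server string below is a placeholder.
[[inputs.sqlserver]]
  servers = ["Server=localhost;Port=1433;app name=telegraf;log=1;"]
  interval = "5m"
```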

My personal opinion is that the information returned by this query ("QueryStats") can complement and/or enrich what is already gathered by "SqlRequests". The difference is that "QueryStats" covers a larger timespan, giving a higher-level overview of the situation.

Any feedback is appreciated

@@ -82,7 +83,7 @@ query_version = 2
# include_query = []

## A list of queries to explicitly ignore.
exclude_query = [ 'Schedulers' , 'SqlRequests']
exclude_query = [ 'Schedulers' , 'SqlRequests','QueryStats']
Contributor

The issue with adding this to the exclude_query array is that people upgrading probably won't get the change, and so it'll start running by default for them. This is a problem in general with the exclude_query list being updated.

Contributor

If you upgrade, the config file isn't overwritten, so if you didn't have it excluded you will get it, AFAIK.

Contributor

Yeah, I'm not too happy about upgrading and suddenly getting new queries. Can we exclude it by default and make you add it to the include_query list instead?
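
The opt-in approach suggested here might look like the following in the config; the query name comes from this discussion, and the server string is a placeholder:

```toml
# Sketch: QueryStats stays off unless explicitly listed; with
# include_query set, only the named queries would run.
[[inputs.sqlserver]]
  servers = ["Server=localhost;Port=1433;app name=telegraf;log=1;"]
  include_query = ["QueryStats"]
```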

Collaborator Author

@Trovalo Trovalo Aug 10, 2020

Not sure I get it: are you proposing to populate both include_query and exclude_query in the default configuration file?
I don't see any cons in that.

Contributor

Well, I'm not sure I understand completely, but it seems like we get two different behaviors: users new to Telegraf get the query excluded, while upgrading users get the query added. Almost the opposite of what I would expect.

Contributor

I am not sure we should be adding everything "new" to the include list; that also means everyone who newly adopts the plugin won't get it and will have to explicitly include it, correct? Certain things like query-level details, being more expensive, can be excluded by default or relegated to the include list only, but there has to be a decision on those types of queries. I like a set of good default queries that are lighter weight, and then the include list for the rest, which are deemed heavier.

Contributor

we are of the same mind there. Do you want to make any changes or are we good to merge?

Collaborator Author

Ok, now I get it.
Well, the query itself can be somewhat expensive, like the "sql_requests" one; that's why it has been excluded by default: if you want it, you have to be explicit about it.
I don't see any real solution to this kind of problem (getting, or not getting, new queries on new setups or upgrades); I hope people read the release notes before upgrading.

Contributor

@ssoroka the problem in the v2 collector (what we are trying to address with #7934) is that all queries are run by default.
@Trovalo if ok we can close this; what I propose is the following:
a. We do not add this query at all for v2 (given it is heavy duty and would run by default)
b. We do not add this for database_type by default, given we don't want it to run by default
c. A user actually uses the include list if they want this. I think the include list as it stands today has a few issues (I ran into some when using it that I need to debug).

@Trovalo are you ok either revisiting this after #7934 is merged, or doing it as part of that PR as a specific commit?

Collaborator Author

I'm not aware of any issue with include_query; to me, we can postpone this until after #7934 (to keep things easier).
Solution "c" seems the best one to me; the point is how we can achieve that with the two existing parameters, include and exclude.

For now, let's postpone it, we will see later what and how to do it.

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

@denzilribeiro can you have a look at this?

Contributor

@denzilribeiro denzilribeiro left a comment

For Azure SQL DB can you also add total_page_server_reads?
https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-stats-transact-sql?view=sql-server-ver15

Also, generically, query_hash and query_plan_hash are useful (i.e., forms of the same query and plan)

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

> For Azure SQL DB can you also add total_page_server_reads?
> https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-stats-transact-sql?view=sql-server-ver15
>
> Also, generically, query_hash and query_plan_hash are useful (i.e., forms of the same query and plan)

I've just added the [total_page_server_reads] column to the Azure version of the query, though I haven't tested it yet. Will it also run on SQL Managed Instance?

As for [query_hash] and [query_plan_hash], they are already present in the query (both versions)

@denzilribeiro
Contributor

Yes total_page_server_reads will run on all versions of Azure SQL Database.

Contributor

@denzilribeiro denzilribeiro left a comment

Approving given the small change, assuming it was tested.

@ssoroka
Contributor

ssoroka commented Jul 28, 2020

@Trovalo You good for me to merge this?

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

Yes, it's all ok for me.

@Trovalo
Collaborator Author

Trovalo commented Aug 3, 2020

@ssoroka, @danielnelson can you merge this PR or is something more needed?

@ssoroka
Contributor

ssoroka commented Aug 14, 2020

Postponing this until after #7934 is merged as per @Trovalo

the actual database name will be fetched by the query. (to check if the same can be applied to azure)
@Trovalo
Collaborator Author

Trovalo commented Sep 3, 2020

The error relates to a different plugin; the SQL Server section works properly.
@denzilribeiro when we merge this, you can choose how to manage the "database_name" for the Azure SQL DB or Azure Managed Instance version of the query

fixed the truncation of the last char of the "statement_text" string
@ssoroka
Contributor

ssoroka commented Oct 13, 2020

has conflicts with master

@Trovalo Trovalo marked this pull request as draft October 26, 2020 09:39
@Trovalo
Collaborator Author

Trovalo commented Oct 26, 2020

Just to update you on the state of this PR.
I have a working version of the QueryStats query in a custom Telegraf build, but I've found some drawbacks.

  • fetching just the top 50 queries is not that useful (in general)
  • The best use I've found so far is to restrict the collection to 1-2 databases to identify troublesome queries
    • I've used a horrible way to specify a database filter (and I don't know if there is a nice way since you are injecting values in the query)
  • visualizing the data is not that straightforward, and the query might be slow
  • on the PRO side, the data are extremely useful and allow you to see what's wrong almost immediately

I put some screenshots of my dashboard in the "Queries" collapsed section below, just to give you an idea of how useful the data can be (at least for a DBA)

Doubts/proposal

What to fetch
To be honest, I'm not sure what the best way to go with this PR is, but as of now I've had the best results by keeping a decent amount of data for each database, so my proposal is to fetch the top X queries for each database.
The query will definitely be at least a bit more complex, but that shouldn't be an issue, since it has to run at a "low" frequency (every 5-10 minutes, maybe)

Statement Text problem
Another possible "problem" is the "statement_text" column, which by default is a tag (it should be a field, which as of now I accomplish with a processor plugin); it can weigh a lot and always has a constant value for a given query hash.
To me, excluding the statement text is bad, because you need to know which statement is causing issues in order to resolve the situation or point out what's wrong... A (horrible) idea I have is a separate query that just fetches the query hash and the statement text, filtered with the same logic as the query stats one, so you could run it:

  • at an even lower frequency (risking losing some data)
  • at the same frequency, still reducing the amount of data (since a query hash might have several plan hashes)
  • not at all (not having the data, but saving space and IO)
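
The tag-to-field conversion mentioned above can be done with Telegraf's converter processor; a sketch (the namepass scoping is an assumption):

```toml
# Sketch: convert the statement_text tag into a string field so it is
# no longer part of the series key; namepass limits the processor to
# this measurement (an assumed name).
[[processors.converter]]
  namepass = ["sqlserver_query_stats"]
  [processors.converter.tags]
    string = ["statement_text"]
```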

Documentation and default settings
I'd classify this as an advanced query, and I'd like people to know what to do and not do with it.

  • It's simply pointless to run this every few seconds
    • We should add a section specifying how to run it at a different interval
  • Should it be active by default? As of now it would be (if you don't use the query include/exclude lists)

Querying the data
About the weight/complexity of the queries on InfluxDB, which might look like the ones below: I'm not even sure this is an issue, as it depends on a lot of factors, and a CQ to pre-calculate the difference between points might help a lot.
There are also some screenshots of my current dashboard, so you have an idea of what you can see

Queries

Data per Plan Hash (standard performance)
[dashboard screenshot]

SELECT
	 non_negative_difference(last("execution_count")) AS "Execution Count"
	,non_negative_difference(last("total_worker_time_ms")) AS "Worker Time"
	,non_negative_difference(last("total_elapsed_time_ms")) AS "Total Time"
	,non_negative_difference(last("total_physical_reads")) AS "Physical Reads"
	,non_negative_difference(last("total_logical_reads")) AS "Logical Reads"
	,non_negative_difference(last("total_logical_writes")) AS "Logical Writes"
	,non_negative_difference(last("total_rows")) AS "Rows"
	,non_negative_difference(last("total_grant_kb")) AS "Memory"
	,non_negative_difference(last("total_used_grant_kb")) AS "Used Memory"
	,non_negative_difference(last("total_ideal_grant_kb")) AS "Ideal Memory"
FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
WHERE 
	("sql_instance" =~ /^$Var_Sql_Instance$/
	AND "query_hash" =~ /^$Var_Query_Hash$/) 
	AND $timeFilter
GROUP BY
	 time($__interval)
	,"query_hash"
	,"query_plan_hash"
	,"stmt_object_name"
	,"stmt_db_name"

Data per Query Hash, on the whole time interval (way slower performance)
I use this as an overview table, to then filter the performance of a single Query Hash

[dashboard screenshot]

SELECT
	 sum("execution_count") AS "Execution Count"
	,sum("total_worker_time_ms") AS "Worker Time"
	,sum("total_elapsed_time_ms") AS "Total Time"
	,sum("total_physical_reads") AS "Physical Reads"
	,sum("total_logical_reads") AS "Logical Reads"
	,sum("total_logical_writes") AS "Logical Writes"
	,sum("total_rows") AS "Rows"
	,sum("total_grant_kb") AS "Memory"
	,sum("total_used_grant_kb") AS "Used Memory"
	,sum("total_ideal_grant_kb") AS "Ideal Memory"
FROM (
	SELECT
		 non_negative_difference(LAST("execution_count")) AS "execution_count"
		,non_negative_difference(LAST("total_worker_time_ms")) AS "total_worker_time_ms"
		,non_negative_difference(LAST("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
		,non_negative_difference(LAST("total_physical_reads")) AS "total_physical_reads"
		,non_negative_difference(LAST("total_logical_reads")) AS "total_logical_reads"
		,non_negative_difference(LAST("total_logical_writes")) AS "total_logical_writes"
		,non_negative_difference(LAST("total_rows")) AS "total_rows"
		,non_negative_difference(LAST("total_grant_kb")) AS "total_grant_kb"
		,non_negative_difference(LAST("total_used_grant_kb")) AS "total_used_grant_kb"
		,non_negative_difference(LAST("total_ideal_grant_kb")) AS "total_ideal_grant_kb"
	FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
	WHERE 
		("sql_instance" =~ /^$Var_Sql_Instance$/) 
		AND $timeFilter
	GROUP BY
		 time(5m)
		,"query_hash"
		,"query_plan_hash"
) GROUP BY
	"query_hash"

Any feedback/opinions are welcome

@denzilribeiro
Contributor

@Trovalo give me a few days to have a thought about it; there may be a PR coming for Query Store (which is similar in nature) that will take deltas of what is different, and a similar approach could potentially be used here.

@sjwang90 sjwang90 added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Nov 23, 2020
@Trovalo
Collaborator Author

Trovalo commented Nov 24, 2020

@denzilribeiro, this PR has had no activity, but I'm still playing around with this query, and it has already gone through several iterations in my custom Telegraf build.

As of now, it looks like this (it still uses the "old" style):

const qmonitorQueryStats string = `
SET DEADLOCK_PRIORITY -10;
DECLARE
     @SqlStatement AS nvarchar(max)
    ,@EngineEdition AS tinyint = CAST(SERVERPROPERTY('EngineEdition') AS int)
    ,@MajorMinorVersion AS int = CAST(PARSENAME(CAST(SERVERPROPERTY('ProductVersion') as nvarchar),4) AS int)*100 + CAST(PARSENAME(CAST(SERVERPROPERTY('ProductVersion') as nvarchar),3) AS int)
    ,@Columns AS nvarchar(MAX) = ''
IF @MajorMinorVersion >= 1050 OR @EngineEdition IN (5,8) BEGIN
    SET @Columns += N',SUM(qs.[total_rows]) AS [total_rows]'
END
IF (
    @MajorMinorVersion >= 1100 AND EXISTS (SELECT * from sys.all_columns WHERE object_id = OBJECT_ID('sys.dm_exec_query_stats') AND [name] = 'total_dop')
) OR @EngineEdition IN (5,8)
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_dop]) AS [total_dop]
    ,SUM(qs.[total_grant_kb]) AS [total_grant_kb]
    ,SUM(qs.[total_used_grant_kb]) AS [total_used_grant_kb]
    ,SUM(qs.[total_ideal_grant_kb]) AS [total_ideal_grant_kb]
    ,SUM(qs.[total_reserved_threads]) AS [total_reserved_threads]
    ,SUM(qs.[total_used_threads]) AS [total_used_threads]'
END
IF (
    @MajorMinorVersion = 1300 AND EXISTS (SELECT * from sys.all_columns WHERE object_id = OBJECT_ID('sys.dm_exec_query_stats') AND [name] = 'total_columnstore_segment_reads')
) OR @EngineEdition IN (5,8)
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_columnstore_segment_reads]) AS [total_columnstore_segment_reads]
    ,SUM(qs.[total_columnstore_segment_skips]) AS [total_columnstore_segment_skips]'
END
IF @MajorMinorVersion >= 1500 OR @EngineEdition IN (5,8) 
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_spills]) AS [total_spills]'
END
SET @SqlStatement = N'
SELECT TOP(100)
     ''sqlserver_query_stats'' AS [measurement]
    ,REPLACE(@@SERVERNAME,''\'','':'') AS [sql_instance]
    ,pa.[database_name]
    ,CONVERT(varchar(20),qs.[query_hash],1) as [query_hash]
    ,CONVERT(varchar(20),qs.[query_plan_hash],1) as [query_plan_hash]
    ,QUOTENAME(OBJECT_SCHEMA_NAME(qt.objectid,qt.dbid)) + ''.'' +  QUOTENAME(OBJECT_NAME(qt.objectid,qt.dbid)) as stmt_object_name
    ,MIN(SUBSTRING(
        qt.[text],
        qs.[statement_start_offset] / 2 + 1,
        (CASE WHEN qs.[statement_end_offset] = -1
            THEN DATALENGTH(qt.[text])
            ELSE qs.[statement_end_offset]
        END - qs.[statement_start_offset]) / 2 + 1
    )) AS statement_text
    ,DB_NAME(qt.[dbid]) stmt_db_name
    ,COUNT(DISTINCT qs.[plan_handle]) AS [plan_count]
    ,SUM(qs.[execution_count]) AS [execution_count]
    ,SUM(qs.[total_physical_reads]) AS [total_physical_reads]
    ,SUM(qs.[total_logical_writes]) AS [total_logical_writes]
    ,SUM(qs.[total_logical_reads]) AS [total_logical_reads]
    ,SUM(qs.[total_clr_time]/1000) AS [total_clr_time_ms]
    ,SUM(qs.[total_worker_time]/1000) AS [total_worker_time_ms]
    ,SUM(qs.[total_elapsed_time]/1000) AS [total_elapsed_time_ms]
    ' + @Columns + N'
FROM sys.dm_exec_query_stats as qs
OUTER APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
CROSS APPLY (
    SELECT DB_NAME(CONVERT(int, value)) AS [database_name] 
    FROM sys.dm_exec_plan_attributes(qs.plan_handle)
    WHERE attribute = N''dbid''
) AS pa
--WHERE 
--   1 = 1
--	<DatabaseFilter>
GROUP BY 
     pa.database_name
    ,qs.query_hash
    ,qs.query_plan_hash
    ,qt.objectid
    ,qt.dbid
ORDER BY
	[total_worker_time_ms] DESC
'
EXEC sp_executesql @SqlStatement

Here are some points about it.

Performance
Queries on this data are heavy (even though the data is gathered every 5m); a table like the one below can take a few seconds to load

[dashboard screenshot]

and this is the query underneath

SELECT
	 sum("execution_count") AS "Execution Count"
	,sum("total_worker_time_ms") AS "Worker Time"
	,sum("total_elapsed_time_ms") AS "Total Time"
	,sum("total_physical_reads") AS "Physical Reads"
	,sum("total_logical_reads") AS "Logical Reads"
	,sum("total_logical_writes") AS "Logical Writes"
	,sum("total_rows") AS "Rows"
	,sum("total_grant_kb") AS "Memory"
    ,last("statement_text_preview") AS "Statement"
FROM (
	SELECT
		 non_negative_difference(LAST("execution_count")) AS "execution_count"
		,non_negative_difference(LAST("total_worker_time_ms")) AS "total_worker_time_ms"
		,non_negative_difference(LAST("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
		,non_negative_difference(LAST("total_physical_reads")) AS "total_physical_reads"
		,non_negative_difference(LAST("total_logical_reads")) AS "total_logical_reads"
		,non_negative_difference(LAST("total_logical_writes")) AS "total_logical_writes"
		,non_negative_difference(LAST("total_rows")) AS "total_rows"
		,non_negative_difference(LAST("total_grant_kb")) AS "total_grant_kb"
        ,last("statement_text_preview") AS "statement_text_preview"
	FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
	WHERE 
		("sql_instance" =~ /^$Var_Sql_Instance$/) 
		AND $timeFilter
	GROUP BY
		 time(5m)
		,"query_hash"
		,"query_plan_hash"
        ,"database_name"
) GROUP BY
	 "query_hash"
    ,"database_name"

Querying
It's not that easy to query; in fact, the query above is the minimum, since you must compute differences at the "query_plan_hash" level and then lower the aggregation level (if needed).
To simplify this, I've set up a continuous query to pre-calculate the differences as the data enters the system, and it does wonders for performance and queryability.
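
A continuous query of the kind described might look like this in InfluxQL; the database, retention policy, and field subset are assumptions, not the author's actual setup:

```sql
-- Sketch of a CQ that pre-computes per-interval deltas; names are
-- assumptions. Add the remaining counters following the same pattern.
CREATE CONTINUOUS QUERY "cq_query_stats_delta" ON "telegraf"
BEGIN
  SELECT
     non_negative_difference(last("execution_count"))       AS "execution_count"
    ,non_negative_difference(last("total_worker_time_ms"))  AS "total_worker_time_ms"
    ,non_negative_difference(last("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
  INTO "telegraf"."autogen"."sqlserver_query_stats_delta"
  FROM "sqlserver_query_stats"
  GROUP BY time(5m), "query_hash", "query_plan_hash", "database_name", "sql_instance"
END
```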

Result
The result itself is amazing, as you can troubleshoot queries and analyze workloads; it's even more useful from an analysis point of view when you specify which databases to monitor (I've added a sort of "db_include" and "db_exclude" to the config... even if, as of now, it's implemented in a horrible way)

Here are some visuals I've built with this data
[dashboard screenshot]

@sjwang90 sjwang90 added this to the Planned milestone Dec 9, 2020
@denzilribeiro
Contributor

@Trovalo here are my thoughts:
a. It isn't the cheapest query to run :) so I wouldn't enable it by default, and if enabled it should run at a lower frequency than 10 seconds for sure, as discussed before (15 min minimum, perhaps :) )
b. It is susceptible to plan cache clearing, etc., so it isn't necessarily representative of all queries.
c. From 2016 onwards, Query Store already has all this data and is way easier. Once the Query Store PR is in, I could see a "conditional" perhaps? I.e., if the version has Query Store, use that; if not, use this.

For query store there is a PR in flight.. #8465

Another comment: why order by worker time rather than elapsed duration?

ORDER BY
[total_worker_time_ms] DESC

@sjwang90 sjwang90 removed this from the Planned milestone Jan 29, 2021
@sjwang90
Contributor

sjwang90 commented Apr 5, 2021

Is this PR still active? Or are we considering #8465 as the latest?

@Trovalo
Collaborator Author

Trovalo commented Apr 5, 2021

> Is this PR still active? Or are we considering #8465 as the latest?

This is a different one, as the PR you mentioned is only about SQL on Azure.
With this one, my aim was to provide Query Stats data for any version of SQL Server.

I actually have this one up and running on my own, but it's kind of dangerous to put it live as it is; in fact, it would require a minimum time gap between executions, and as of now I have no idea how to enforce that.
I could actually "steal" something from the linked PR, since they are trying to do exactly that.

If it's spring cleaning time, I can just close this PR and create a new one once I have some significant breakthrough on how to make it safe.

@Trovalo Trovalo closed this Apr 5, 2021
@Trovalo Trovalo deleted the sqlserver--querystats branch December 1, 2022 13:09
Labels
area/sqlserver feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Input SQL Server - Add "Query Stats"
4 participants