
Input sqlserver - Query Stats #7842

Closed
wants to merge 11 commits

Conversation

Collaborator

@Trovalo Trovalo commented Jul 16, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

closes #7789

This PR adds a new query which gets data from sys.dm_exec_query_stats.

It fetches incremental data about the top 50 queries executed on the SQL Server instance (as long as a query plan is in the cache).
The information gathered is at the query and execution plan level, and is useful to check which queries have the longest duration, which often corresponds to the heaviest or most executed queries.
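
The underlying idea can be sketched with a minimal T-SQL query against the DMV; the column list and ordering below are illustrative only, not the PR's exact statement:

```sql
-- Minimal sketch of the idea (illustrative only): aggregate cached
-- plan statistics per query/plan hash and keep the heaviest 50.
SELECT TOP (50)
     qs.[query_hash]
    ,qs.[query_plan_hash]
    ,SUM(qs.[execution_count])           AS [execution_count]
    ,SUM(qs.[total_worker_time]) / 1000  AS [total_worker_time_ms]
    ,SUM(qs.[total_elapsed_time]) / 1000 AS [total_elapsed_time_ms]
FROM sys.dm_exec_query_stats AS qs
GROUP BY qs.[query_hash], qs.[query_plan_hash]
ORDER BY [total_worker_time_ms] DESC;
```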

The query has been tested on:

  • SQL on-prem from 2008 to 2019
  • Azure SQL DB

Notes:

  • The query is disabled in the proposed default configuration
  • It's pointless to run this every few seconds; it makes sense to run it every few minutes
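
Telegraf supports a per-plugin interval, so a collection cadence of a few minutes can be set on the input itself; a sketch (the connection string is a placeholder):

```toml
# Sketch only: run the sqlserver input every 5 minutes regardless of the
# agent-wide interval; the server string below is a placeholder.
[[inputs.sqlserver]]
  servers = ["Server=localhost;Port=1433;app name=telegraf;log=1;"]
  interval = "5m"
```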

My personal opinion is that the information returned by this query ("QueryStats") can complement and/or enrich what is already gathered by "SqlRequests". The difference is that "QueryStats" covers a larger timespan, giving a higher-level overview of the situation.

Any feedback is appreciated

@@ -82,7 +83,7 @@ query_version = 2
# include_query = []

## A list of queries to explicitly ignore.
exclude_query = [ 'Schedulers' , 'SqlRequests']
exclude_query = [ 'Schedulers' , 'SqlRequests','QueryStats']
Contributor

The issue with adding this to the exclude_query array is that people upgrading probably won't get the change, and so it'll start running by default for them. This is a problem in general with the exclude_query list being updated.

Contributor

If you upgrade, the config file isn't overwritten, so if you didn't have it excluded you will get it, AFAIK.

Contributor

Yeah, I'm not too happy about upgrading and suddenly getting new queries. Can we exclude it by default and make you add it to the include_query list instead?
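
The opt-in approach suggested here might look like the following in the config; the query name comes from this discussion, and the server string is a placeholder:

```toml
# Sketch: QueryStats stays off unless explicitly listed; with
# include_query set, only the named queries would run.
[[inputs.sqlserver]]
  servers = ["Server=localhost;Port=1433;app name=telegraf;log=1;"]
  include_query = ["QueryStats"]
```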

Collaborator Author

@Trovalo Trovalo Aug 10, 2020

Not sure I get it: are you proposing to populate both include_query and exclude_query in the default configuration file?
I don't see any cons in that.

Contributor

Well, I'm not sure I understand completely, but it seems like we get two different behaviors: users new to Telegraf get the query excluded, while upgrading users get the query added. Almost the opposite of what I would expect.

Contributor

I am not sure we should be adding everything "new" to the include list; that also means everyone who newly adopts the plugin won't get it and will have to explicitly include it, correct? Certain things like query-level details, being more expensive, can be excluded by default or relegated to the include list only, but there has to be a decision on those types of queries. I like a set of good default queries that are lighter weight, and then the include list for the rest, which are deemed heavier.

Contributor

we are of the same mind there. Do you want to make any changes or are we good to merge?

Collaborator Author

Ok, now I get it.
Well, the query itself can be somewhat expensive, like the "sql_requests" one; that's why it has been excluded by default: if you want it, you have to be explicit about it.
I don't see any real solution to this kind of problem (getting, or not getting, new queries on new setups or upgrades); I hope people read the release notes before upgrading.

Contributor

@ssoroka the problem in the v2 collector (what we are trying to address with #7934) is that all queries are run by default.
@Trovalo if ok we can close this; what I propose is the following:
a. We do not add this query at all for v2 (given it is heavy duty and would run by default)
b. We do not add this for database_type by default, given we don't want it to run by default
c. A user actually uses the include list if they want this. I think the include list as it stands today has a few issues (I ran into some when using it that I need to debug).

@Trovalo are you ok either revisiting this after #7934 is merged, or doing it as part of that PR as a specific commit?

Collaborator Author

I'm not aware of any issue with include_query; to me, we can postpone this until after #7934 (to keep things easier).
Solution "c" seems the best one to me; the point is how we can achieve that with the two existing parameters, include and exclude.

For now, let's postpone it, we will see later what and how to do it.

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

@denzilribeiro can you have a look at this?

Contributor

@denzilribeiro denzilribeiro left a comment

For Azure SQL DB can you also add total_page_server_reads?
https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-stats-transact-sql?view=sql-server-ver15

Also, generically, query_hash and query_plan_hash are useful (i.e., forms of the same query and plan)

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

> For Azure SQL DB can you also add total_page_server_reads?
> https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-stats-transact-sql?view=sql-server-ver15
>
> Also, generically, query_hash and query_plan_hash are useful (i.e., forms of the same query and plan)

I've just added the [total_page_server_reads] column to the Azure version of the query, though I haven't tested it yet. Will it also run on SQL Managed Instance?

As for [query_hash] and [query_plan_hash], they are already present in the query (both versions)

@denzilribeiro
Contributor

Yes total_page_server_reads will run on all versions of Azure SQL Database.

Contributor

@denzilribeiro denzilribeiro left a comment

Approving given the small change, assuming it was tested.

@ssoroka
Contributor

ssoroka commented Jul 28, 2020

@Trovalo You good for me to merge this?

@Trovalo
Collaborator Author

Trovalo commented Jul 28, 2020

Yes, it's all ok for me.

@Trovalo
Collaborator Author

Trovalo commented Aug 3, 2020

@ssoroka, @danielnelson can you merge this PR or is something more needed?

@ssoroka
Contributor

ssoroka commented Aug 14, 2020

Postponing this until after #7934 is merged as per @Trovalo

the actual database name will be fetched by the query. (to check if the same can be applied to azure)
@Trovalo
Collaborator Author

Trovalo commented Sep 3, 2020

The error relates to a different plugin; the SQL Server section works properly.
@denzilribeiro when we merge this, you can choose how to manage the "database_name" for the Azure SQL DB or Azure Managed Instance version of the query

fixed the truncation of the last char of the "statement_text" string
@ssoroka
Contributor

ssoroka commented Oct 13, 2020

has conflicts with master

@Trovalo Trovalo marked this pull request as draft October 26, 2020 09:39
@Trovalo
Collaborator Author

Trovalo commented Oct 26, 2020

Just to update you on the state of this PR.
I have a working version of the QueryStats query in a custom Telegraf build, but I've found some drawbacks.

  • fetching just the top 50 queries is not that useful (in general)
  • The best use I've found so far is to restrict the collection to 1-2 databases to identify troublesome queries
    • I've used a horrible way to specify a database filter (and I don't know if there is a nice way since you are injecting values in the query)
  • visualizing the data is not that straightforward, and the query might be slow
  • on the PRO side, the data are extremely useful and allow you to see what's wrong almost immediately

I put some screenshots of my dashboard in the "Queries" collapsed section below, just to give you an idea of how useful the data can be (at least for a DBA)

Doubts/proposal

What to fetch
To be honest, I'm not sure what the best way to go with this PR is, but as of now I've had the best results by keeping a decent amount of data for each database, so my proposal is to fetch the top X queries for each database.
The query will definitely be at least a bit more complex, but that shouldn't be an issue, since it has to run at a "low" frequency (every 5-10 minutes, maybe)

Statement Text problem
Another possible "problem" is the "statement_text" column, which by default is a tag (it should be a field, which as of now I accomplish with a processor plugin); it can weigh a lot and always has a constant value for a given query hash.
To me, excluding the statement text is bad, because you need to know which statement is causing issues in order to resolve the situation or point out what's wrong... A (horrible) idea I have is a separate query that just fetches the query hash and the statement text, filtered with the same logic as the query stats one, so you could run it:

  • at an even lower frequency (risking losing some data)
  • at the same frequency, still reducing the amount of data (since a query hash might have several plan hashes)
  • not at all (not having the data, but saving space and IO)
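
The tag-to-field conversion mentioned above can be done with Telegraf's converter processor; a sketch (the namepass scoping is an assumption):

```toml
# Sketch: convert the statement_text tag into a string field so it is
# no longer part of the series key; namepass limits the processor to
# this measurement (an assumed name).
[[processors.converter]]
  namepass = ["sqlserver_query_stats"]
  [processors.converter.tags]
    string = ["statement_text"]
```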

Documentation and default settings
I'd classify this as an advanced query, and I'd like people to know what to do and not do with it.

  • It's simply pointless to run this every few seconds
    • We should add a section specifying how to run it at a different interval
  • Should it be active by default? As of now it would be (if you don't use the query include/exclude lists)

Querying the data
About the weight/complexity of the queries on InfluxDB, which might look like the ones below: I'm not even sure this is an issue, as it depends on a lot of factors, and a CQ to pre-calculate the difference between points might help a lot.
There are also some screenshots of my current dashboard, so you have an idea of what you can see

Queries

Data per Plan Hash (standard performance)
[dashboard screenshot]

SELECT
	 non_negative_difference(last("execution_count")) AS "Execution Count"
	,non_negative_difference(last("total_worker_time_ms")) AS "Worker Time"
	,non_negative_difference(last("total_elapsed_time_ms")) AS "Total Time"
	,non_negative_difference(last("total_physical_reads")) AS "Physical Reads"
	,non_negative_difference(last("total_logical_reads")) AS "Logical Reads"
	,non_negative_difference(last("total_logical_writes")) AS "Logical Writes"
	,non_negative_difference(last("total_rows")) AS "Rows"
	,non_negative_difference(last("total_grant_kb")) AS "Memory"
	,non_negative_difference(last("total_used_grant_kb")) AS "Used Memory"
	,non_negative_difference(last("total_ideal_grant_kb")) AS "Ideal Memory"
FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
WHERE 
	("sql_instance" =~ /^$Var_Sql_Instance$/
	AND "query_hash" =~ /^$Var_Query_Hash$/) 
	AND $timeFilter
GROUP BY
	 time($__interval)
	,"query_hash"
	,"query_plan_hash"
	,"stmt_object_name"
	,"stmt_db_name"

Data per Query Hash, on the whole time interval (way slower performance)
I use this as an overview table, to then filter the performance of a single Query Hash

[dashboard screenshot]

SELECT
	 sum("execution_count") AS "Execution Count"
	,sum("total_worker_time_ms") AS "Worker Time"
	,sum("total_elapsed_time_ms") AS "Total Time"
	,sum("total_physical_reads") AS "Physical Reads"
	,sum("total_logical_reads") AS "Logical Reads"
	,sum("total_logical_writes") AS "Logical Writes"
	,sum("total_rows") AS "Rows"
	,sum("total_grant_kb") AS "Memory"
	,sum("total_used_grant_kb") AS "Used Memory"
	,sum("total_ideal_grant_kb") AS "Ideal Memory"
FROM (
	SELECT
		 non_negative_difference(LAST("execution_count")) AS "execution_count"
		,non_negative_difference(LAST("total_worker_time_ms")) AS "total_worker_time_ms"
		,non_negative_difference(LAST("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
		,non_negative_difference(LAST("total_physical_reads")) AS "total_physical_reads"
		,non_negative_difference(LAST("total_logical_reads")) AS "total_logical_reads"
		,non_negative_difference(LAST("total_logical_writes")) AS "total_logical_writes"
		,non_negative_difference(LAST("total_rows")) AS "total_rows"
		,non_negative_difference(LAST("total_grant_kb")) AS "total_grant_kb"
		,non_negative_difference(LAST("total_used_grant_kb")) AS "total_used_grant_kb"
		,non_negative_difference(LAST("total_ideal_grant_kb")) AS "total_ideal_grant_kb"
	FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
	WHERE 
		("sql_instance" =~ /^$Var_Sql_Instance$/) 
		AND $timeFilter
	GROUP BY
		 time(5m)
		,"query_hash"
		,"query_plan_hash"
) GROUP BY
	"query_hash"

Any feedback/opinions are welcome

@denzilribeiro
Contributor

@Trovalo give me a few days to have a thought about it; there may be a PR coming for Query Store (which is similar in nature) that will take deltas of what is different, and a similar approach could potentially be used here.

@sjwang90 sjwang90 added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Nov 23, 2020
@Trovalo
Collaborator Author

Trovalo commented Nov 24, 2020

@denzilribeiro, this PR has had no activity, but I'm still playing around with this query, and it has already gone through several iterations in my custom Telegraf build.

As of now, it looks like this (it still uses the "old" style):

const qmonitorQueryStats string = `
SET DEADLOCK_PRIORITY -10;
DECLARE
     @SqlStatement AS nvarchar(max)
    ,@EngineEdition AS tinyint = CAST(SERVERPROPERTY('EngineEdition') AS int)
    ,@MajorMinorVersion AS int = CAST(PARSENAME(CAST(SERVERPROPERTY('ProductVersion') as nvarchar),4) AS int)*100 + CAST(PARSENAME(CAST(SERVERPROPERTY('ProductVersion') as nvarchar),3) AS int)
    ,@Columns AS nvarchar(MAX) = ''
IF @MajorMinorVersion >= 1050 OR @EngineEdition IN (5,8) BEGIN
    SET @Columns += N',SUM(qs.[total_rows]) AS [total_rows]'
END
IF (
    @MajorMinorVersion >= 1100 AND EXISTS (SELECT * from sys.all_columns WHERE object_id = OBJECT_ID('sys.dm_exec_query_stats') AND [name] = 'total_dop')
) OR @EngineEdition IN (5,8)
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_dop]) AS [total_dop]
    ,SUM(qs.[total_grant_kb]) AS [total_grant_kb]
    ,SUM(qs.[total_used_grant_kb]) AS [total_used_grant_kb]
    ,SUM(qs.[total_ideal_grant_kb]) AS [total_ideal_grant_kb]
    ,SUM(qs.[total_reserved_threads]) AS [total_reserved_threads]
    ,SUM(qs.[total_used_threads]) AS [total_used_threads]'
END
IF (
    @MajorMinorVersion = 1300 AND EXISTS (SELECT * from sys.all_columns WHERE object_id = OBJECT_ID('sys.dm_exec_query_stats') AND [name] = 'total_columnstore_segment_reads')
) OR @EngineEdition IN (5,8)
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_columnstore_segment_reads]) AS [total_columnstore_segment_reads]
    ,SUM(qs.[total_columnstore_segment_skips]) AS [total_columnstore_segment_skips]'
END
IF @MajorMinorVersion >= 1500 OR @EngineEdition IN (5,8) 
BEGIN
    SET @Columns += N'
    ,SUM(qs.[total_spills]) AS [total_spills]'
END
SET @SqlStatement = N'
SELECT TOP(100)
     ''sqlserver_query_stats'' AS [measurement]
    ,REPLACE(@@SERVERNAME,''\'','':'') AS [sql_instance]
    ,pa.[database_name]
    ,CONVERT(varchar(20),qs.[query_hash],1) as [query_hash]
    ,CONVERT(varchar(20),qs.[query_plan_hash],1) as [query_plan_hash]
    ,QUOTENAME(OBJECT_SCHEMA_NAME(qt.objectid,qt.dbid)) + ''.'' +  QUOTENAME(OBJECT_NAME(qt.objectid,qt.dbid)) as stmt_object_name
    ,MIN(SUBSTRING(
        qt.[text],
        qs.[statement_start_offset] / 2 + 1,
        (CASE WHEN qs.[statement_end_offset] = -1
            THEN DATALENGTH(qt.[text])
            ELSE qs.[statement_end_offset]
        END - qs.[statement_start_offset]) / 2 + 1
    )) AS statement_text
    ,DB_NAME(qt.[dbid]) stmt_db_name
    ,COUNT(DISTINCT qs.[plan_handle]) AS [plan_count]
    ,SUM(qs.[execution_count]) AS [execution_count]
    ,SUM(qs.[total_physical_reads]) AS [total_physical_reads]
    ,SUM(qs.[total_logical_writes]) AS [total_logical_writes]
    ,SUM(qs.[total_logical_reads]) AS [total_logical_reads]
    ,SUM(qs.[total_clr_time]/1000) AS [total_clr_time_ms]
    ,SUM(qs.[total_worker_time]/1000) AS [total_worker_time_ms]
    ,SUM(qs.[total_elapsed_time]/1000) AS [total_elapsed_time_ms]
    ' + @Columns + N'
FROM sys.dm_exec_query_stats as qs
OUTER APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
CROSS APPLY (
    SELECT DB_NAME(CONVERT(int, value)) AS [database_name] 
    FROM sys.dm_exec_plan_attributes(qs.plan_handle)
    WHERE attribute = N''dbid''
) AS pa
--WHERE 
--   1 = 1
--	<DatabaseFilter>
GROUP BY 
     pa.database_name
    ,qs.query_hash
    ,qs.query_plan_hash
    ,qt.objectid
    ,qt.dbid
ORDER BY
	[total_worker_time_ms] DESC
'
EXEC sp_executesql @SqlStatement

Here are some points about it.

Performance
Queries on this data are heavy (even though the data is gathered every 5m); a table like the one below can take a few seconds to load

[dashboard screenshot]

and this is the query underneath

SELECT
	 sum("execution_count") AS "Execution Count"
	,sum("total_worker_time_ms") AS "Worker Time"
	,sum("total_elapsed_time_ms") AS "Total Time"
	,sum("total_physical_reads") AS "Physical Reads"
	,sum("total_logical_reads") AS "Logical Reads"
	,sum("total_logical_writes") AS "Logical Writes"
	,sum("total_rows") AS "Rows"
	,sum("total_grant_kb") AS "Memory"
    ,last("statement_text_preview") AS "Statement"
FROM (
	SELECT
		 non_negative_difference(LAST("execution_count")) AS "execution_count"
		,non_negative_difference(LAST("total_worker_time_ms")) AS "total_worker_time_ms"
		,non_negative_difference(LAST("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
		,non_negative_difference(LAST("total_physical_reads")) AS "total_physical_reads"
		,non_negative_difference(LAST("total_logical_reads")) AS "total_logical_reads"
		,non_negative_difference(LAST("total_logical_writes")) AS "total_logical_writes"
		,non_negative_difference(LAST("total_rows")) AS "total_rows"
		,non_negative_difference(LAST("total_grant_kb")) AS "total_grant_kb"
        ,last("statement_text_preview") AS "statement_text_preview"
	FROM "$InfluxDB_RetentionPolicy"."sqlserver_query_stats"
	WHERE 
		("sql_instance" =~ /^$Var_Sql_Instance$/) 
		AND $timeFilter
	GROUP BY
		 time(5m)
		,"query_hash"
		,"query_plan_hash"
        ,"database_name"
) GROUP BY
	 "query_hash"
    ,"database_name"

Querying
It's not that easy to query; in fact, the query above is the minimum, since you must compute differences at the "query_plan_hash" level and then lower the aggregation level (if needed).
To simplify this, I've set up a continuous query to pre-calculate the differences as the data enters the system, and it does wonders for performance and queryability.
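
A continuous query of the kind described might look like this in InfluxQL; the database, retention policy, and field subset are assumptions, not the author's actual setup:

```sql
-- Sketch of a CQ that pre-computes per-interval deltas; names are
-- assumptions. Add the remaining counters following the same pattern.
CREATE CONTINUOUS QUERY "cq_query_stats_delta" ON "telegraf"
BEGIN
  SELECT
     non_negative_difference(last("execution_count"))       AS "execution_count"
    ,non_negative_difference(last("total_worker_time_ms"))  AS "total_worker_time_ms"
    ,non_negative_difference(last("total_elapsed_time_ms")) AS "total_elapsed_time_ms"
  INTO "telegraf"."autogen"."sqlserver_query_stats_delta"
  FROM "sqlserver_query_stats"
  GROUP BY time(5m), "query_hash", "query_plan_hash", "database_name", "sql_instance"
END
```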

Result
The result itself is amazing, as you can troubleshoot queries and analyze workloads; it's even more useful from an analysis point of view when you specify which databases to monitor (I've added a sort of "db_include" and "db_exclude" to the config... even if, as of now, it's implemented in a horrible way)

Here are some visuals I've built with this data
[dashboard screenshot]

@sjwang90 sjwang90 added this to the Planned milestone Dec 9, 2020
@denzilribeiro
Contributor

@Trovalo here are my thoughts:
a. It isn't the cheapest query to run :) so I wouldn't enable it by default, and if enabled it should run at a lower frequency than 10 seconds for sure, as discussed before (15 min minimum, perhaps :) )
b. It is susceptible to plan cache clearing, etc., so it isn't necessarily representative of all queries.
c. From 2016 onwards, Query Store already has all this data and is way easier. Once the Query Store PR is in, I could see a "conditional" perhaps? I.e., if the version has Query Store, use that; if not, use this.

For query store there is a PR in flight.. #8465

Another comment: why order by worker time rather than elapsed duration?

ORDER BY
[total_worker_time_ms] DESC

@sjwang90 sjwang90 removed this from the Planned milestone Jan 29, 2021
@sjwang90
Contributor

sjwang90 commented Apr 5, 2021

Is this PR still active? Or are we considering #8465 as the latest?

@Trovalo
Collaborator Author

Trovalo commented Apr 5, 2021

> Is this PR still active? Or are we considering #8465 as the latest?

This is a different one, as the PR you mentioned is only about SQL on Azure.
With this one, my aim was to provide Query Stats data for any version of SQL Server.

I actually have this one up and running on my own, but it's kind of dangerous to put it live as it is; in fact, it would require a minimum time gap between executions, and as of now I have no idea how to enforce that.
I could actually "steal" something from the linked PR, since they are trying to do exactly that.

If it's spring cleaning time, I can just close this PR and create a new one once I have some significant breakthrough on how to make it safe.

@Trovalo Trovalo closed this Apr 5, 2021
@Trovalo Trovalo deleted the sqlserver--querystats branch December 1, 2022 13:09
Labels
area/sqlserver feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Input SQL Server - Add "Query Stats"
4 participants