Async stats update causing blocking

Page content

I recently encountered an issue where an index rebuild set to wait_at_low_priority ended up blocking an asynchronous statistics update. This interaction led to a large blocking chain where queries were waiting on the async stats update and started to timeout.

How did it happen? πŸ”—︎

We’ll need a basic understanding of a few concepts as well as database settings

  1. ‘RCSI’ and ‘Async stats update’ database configuration
  2. Auto stats update and how it’s triggered
  3. Lock partitioning
  4. Locking and blocking

Database configuration πŸ”—︎

  • RCSI (Read Committed Snapshot Isolation) prevents reading queries from blocking writers - this helps with the concurrency
  • Stats update:
    • Sync stats - the query will wait for the out-of-date statistics to update before generating a plan. This might be a problem if you have a short query with a high call frequency where waiting for the stats update would take much longer than the query itself
    • Async stats - when stats recomputation is needed it will start a background job that will update the stats later. The query is still using the out-of-date stats but at least it can continue

Let’s create a test database with RCSI and Async stats update:

CREATE DATABASE StatsUpdateAsync
ALTER DATABASE StatsUpdateAsync SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE
ALTER DATABASE StatsUpdateAsync SET AUTO_UPDATE_STATISTICS_ASYNC ON WITH ROLLBACK IMMEDIATE

Auto stats update πŸ”—︎

In one of my previous articles KEEP PLAN Demystified I’ve demonstrated how to monitor the stats update, what are the different thresholds etc., so if you’re not familiar with these concepts, I suggest you read it first.

For our demo, we’ll create a table with 500 rows which means that the threshold for auto stats update is also 500.

CREATE TABLE dbo.PermaSmall
(
    n int
    , CONSTRAINT PK_PermaSmall PRIMARY KEY (n)
)

;WITH
    L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
    L1   AS(SELECT 1 AS c FROM L0 CROSS JOIN L0 AS B),
    L2   AS(SELECT 1 AS c FROM L1 CROSS JOIN L1 AS B),
    L3   AS(SELECT 1 AS c FROM L2 CROSS JOIN L2 AS B),
    L4   AS(SELECT 1 AS c FROM L3 CROSS JOIN L3 AS B),
    L5   AS(SELECT 1 AS c FROM L4 CROSS JOIN L4 AS B),
    Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
    , tally AS (SELECT TOP (500) n FROM Nums ORDER BY n)
INSERT INTO dbo.PermaSmall WITH(TABLOCKX) (n)
SELECT
    n
FROM tally

Lock partitioning πŸ”—︎

This was the hardest piece of the puzzle because there is not a whole lot of documentation.
My main sources were:

I’ll summarize the key points:

  • Lock partitioning is enabled by default if you have 16+ CPUs
    • Startup Trace Flag 1228 can lower the requirement to 2 CPUs
    • You can check if it’s enabled by looking for the message Lock partitioning is enabled… in the Error Log. For example with this query exec sp_readerrorlog 0, 1, N'lock partitioning'
  • It’s a scalability feature when some types of lock can be split into partitions enabling higher concurrency for some operations
  • The number of partitions matches the number of schedulers
  • Acquiring shared access requires only the local partition to be acquired
  • Acquiring exclusive access requires all partitions to be acquired
  • Partition lock acquires always start at 0 and go up to the number of schedulers. Lock releases have reversed direction

Locking and blocking πŸ”—︎

This should not be a new concept to most DBAs. The main tool to help with this is the Lock compatibility matrix (the complete one).

In this demo - the most important locks will be Sch-M (schema modification) and Sch-S (schema stability).
Schema is an overloaded term that can mean multiple things. In this case, it refers to the “definition” of an object.

  • Sch-S lock is present in every query because you don’t want the schema to change inflight. This lock cannot be removed (yes, not even with the NOLOCK hint).
  • Sch-M lock on the other hand blocks everything and must wait on the release of all the Sch-S locks so you can for example add a new column or rebuild an index, etc.

Demo πŸ”—︎

Control the assigned scheduler πŸ”—︎

We’re almost ready to start the demo, but there is one more consideration that caused me a headache. In the Lock partitioning summary, we had this bullet point:

Acquiring shared access requires only the local partition to be acquired

The local partition matches the scheduler_id that the session is running on. If we want a repeatable demo, we must be able to control the session’s scheduler assignment. After some attempts with closing and opening new sessions hoping for the desired scheduler, I’ve opted to use the Resource Governor’s AFFINITY option.

This snippet will dynamically create Resource pools with workgroups per scheduler and a Classification function that will match the workgroup to the application name.

DECLARE @schedulerCount int

SELECT 
    @schedulerCount = MAX(dos.scheduler_id)
FROM sys.dm_os_schedulers AS dos
WHERE dos.scheduler_id < 1048576

DECLARE
    @i int = 0
    , @dynamicSql nvarchar(MAX) = N''

WHILE @i <= @schedulerCount
BEGIN
    SET @dynamicSql = N'
        CREATE RESOURCE POOL SchedulerPool' + CAST (@i AS varchar(2)) + N'
        WITH (AFFINITY SCHEDULER = (' + CAST (@i AS varchar(2)) + N'));'

    EXECUTE sp_executesql @dynamicSql

    SET @dynamicSql = N'
        CREATE WORKLOAD GROUP SchedulerGroup' + CAST (@i AS varchar(2)) + N'
        USING SchedulerPool' + CAST (@i AS varchar(2)) + ';'

     EXECUTE sp_executesql @dynamicSql

    SET @i = @i + 1;
END
GO

CREATE OR ALTER FUNCTION dbo.ClassifierFunction()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @WorkloadGroup SYSNAME;

    IF APP_NAME() LIKE 'SchedulerGroup%' /* example: SchedulerGroup15*/
    BEGIN
        SET @WorkloadGroup = APP_NAME()
    END
    ELSE
    BEGIN
        SET @WorkloadGroup = 'default';
    END

    RETURN @WorkloadGroup;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.ClassifierFunction);
GO

ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Now if I want my SSMS session to run on scheduler 6 I’ll add an Additional Connection parameter

Application Name=SchedulerGroup6;

AppName connection param showing SchedulerGroup6

And I can check that it worked with:

SELECT 
      dot.scheduler_id
    , dot.session_id
FROM sys.dm_os_tasks AS dot 
WHERE dot.session_id = @@spid

Plan of action πŸ”—︎

To demonstrate the blocking, I will:

  1. Start a reading query in a transaction with HOLDLOCK. I will only read the last row in the table
  2. Start an index rebuild with WAIT_AT_LOW_PRIORITY and a long timeout so we have time to test this
    1. This will be blocked because of the Sch-S lock from the reading query
    2. To demonstrate the blocking, this should be running on a scheduler with a high number (let’s say 10)
  3. Modify the table a couple of times to bring the stats modification counter over the threshold
    1. This is not blocked by the reading query as I’m only reading the last row
    2. It’s also not blocked by the index rebuild which waits in the low-priority queue
  4. Run a query reading from the table - this will trigger the Async update stats
  5. To demonstrate the blocking we can run the reading query again from a session lower than 10

Blocked rebuild πŸ”—︎

Let’s run the first few steps:

Run this query in the StatsUpdateAsync database - the scheduler_id doesn’t matter yet

USE StatsUpdateAsync
go
/* Blocker */
BEGIN TRAN
    SELECT 
        *
    FROM dbo.PermaSmall AS ps WITH (HOLDLOCK)
    WHERE ps.n = (SELECT 500)
-- ROLLBACK

In another session with the app name Application Name=SchedulerGroup10 run

SELECT 
    dot.scheduler_id
    , dot.session_id
FROM sys.dm_os_tasks AS dot 
WHERE dot.session_id = @@spid

USE StatsUpdateAsync
GO
/* Rebuild */
ALTER INDEX [PK_PermaSmall] ON [dbo].[PermaSmall] 
REBUILD WITH 
(
    ONLINE=ON 
    (
        WAIT_AT_LOW_PRIORITY
        (
            MAX_DURATION = 50 MINUTES /* I don't mind waiting */
            , ABORT_AFTER_WAIT = SELF
        )
    )
);

The query will be stuck in executing and we have 50 minutes to run our test.

Now we’ll run the update query twice and check the modification counter

USE StatsUpdateAsync
GO
/* Updater */
UPDATE p
    SET p.n = p.n /* fake modification */
FROM dbo.PermaSmall AS p
WHERE p.n <= 250
OPTION (KEEPFIXED PLAN)

SELECT
    obj.name AS tableName
    , stat.name AS statName
    , CAST(sp.last_updated AS time) AS Last_update_time
    , sp.rows
    , sp.steps
    , sp.modification_counter AS modCounter
    , d.compatibility_level AS CL
    , 500 AS threshold
FROM
    sys.objects AS obj
    JOIN sys.stats AS stat
        ON stat.object_id = obj.object_id
    JOIN sys.stats_columns AS sc
        ON sc.object_id = stat.object_id
        AND sc.stats_id = stat.stats_id
        AND sc.stats_column_id = 1
    JOIN sys.columns AS c
        ON c.object_id = obj.object_id
        AND c.column_id = sc.column_id
    CROSS APPLY sys.dm_db_stats_properties (stat.object_id, stat.stats_id) AS sp
    JOIN sys.databases AS d
        ON d.database_id = DB_ID()
WHERE
    obj.is_ms_shipped = 0
    AND obj.name LIKE N'Perma%'
ORDER BY sp.rows
OPTION (RECOMPILE, KEEPFIXED PLAN)

Stats modification counter showing 500

Now by running a reading query once we trigger the async auto stats update.

Let’s take a look at the locks:

SELECT 
    dtl.request_session_id AS spid
    , CASE 
        WHEN deib.event_info IS NOT NULL
            THEN LEFT(deib.event_info, 20)
        WHEN debjq.object_id1 IS NOT NULL
            THEN 'Async stats update'
        ELSE 
            'N/A'
    END AS Spid_Info
    , dtl.resource_lock_partition AS lock_partition
    , dot.scheduler_id
    , dtl.resource_type AS type
    , dtl.resource_subtype AS subtype
    , dtl.request_mode AS lock
    , dtl.request_status AS status
    , dtl.resource_description
    , dtl.resource_associated_entity_id
    , dtl.request_lifetime
    , dtl.request_owner_type
    , dtl.lock_owner_address
    , debjq.session_id
FROM sys.dm_tran_locks AS dtl 
JOIN sys.dm_os_tasks AS dot 
    ON dtl.request_session_id = dot.session_id
LEFT JOIN sys.dm_exec_background_job_queue AS debjq
    ON debjq.session_id = dtl.request_session_id
OUTER APPLY sys.dm_exec_input_buffer(dtl.request_session_id, 0) AS deib
WHERE
    dtl.resource_database_id = DB_ID('StatsUpdateAsync')
    AND dtl.request_mode LIKE 'Sch-%'
    AND dtl.resource_type = 'METADATA'
    AND dtl.resource_subtype = 'STATS'
ORDER BY dtl.request_session_id, dtl.resource_lock_partition

Taken locks output showing the async stats update being blocked

  1. The Rebuild task is running on scheduler 10 and is holding Sch-S lock only on the local partition (same as the scheduler_id)
  2. The Async stats update acquired Sch-M locks on partitions 0 - 9 but because modify has to lock all partitions, it’s blocked by the incompatible Sch-S held by the Rebuild
  3. Now any query running on scheduler_id less than 10 will be blocked by the most restrictive Sch-M lock of the async stats update
    1. But any query running on scheduler_id greater than 10 can still continue as usual
    2. You can test yourself by repeating the Reader or Updater query on any of the schedulers using the appropriate app_name

⚠️ I had inconsistent results when trying to run the Reader on the same scheduler as the index rebuild.
Sometimes it caused a deadlock where the Async stats update was a victim - thus releasing the Sch-M locks and unblocking everyone. But at the same time, it seemed like it queued another Async stats update causing another blocking chain for future queries

Solutions πŸ”—︎

Microsoft is aware of this problem and it has been fixed in the SQL 2022 (and the Azure offerings) as highlighted in this blog post Improving concurrency of asynchronous statistics update

You can enable this database-scoped configuration (it’s not enabled by default)

ALTER DATABASE SCOPED CONFIGURATION
SET ASYNC_STATS_UPDATE_WAIT_AT_LOW_PRIORITY = ON

Then the lock information looks like this

Async stats update has low priority Sch-M lock

This is not blocking anymore so queries can continue as usual.

As for those not on SQL 2022 yet - our solution was to update stats manually just before the planned index rebuild to minimize the chance of this happening.
It also helps if you don’t have long-running queries that can block the index rebuild and auto stats update.

Thank you for reading.