SQL Server Premier Field Engineer Blog

SQL 2012 AlwaysOn Availability groups Automatic Failover doesn’t occur or does it – A look at the logs


 

I was dealing with an issue where we had an AlwaysOn availability group with 2 replicas configured for Automatic failover. There was a 30 second glitch where folks could not connect to the Database on the primary and automatic failover did not kick in. Well, that was our initial impression at least. The purpose of this post is to expose you to the different logs available in troubleshooting AlwaysOn Availability group issues, not so much on this particular problem itself.

Symptoms on Primary:  Connections failed for a brief moment with the error below, and then everything was fine again.

Error: 983, Severity: 14, State: 1.

Unable to access database 'HADB' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

So there were 3 questions to answer:

a.      What was the reason for the error?

b.      Why didn’t automatic failover kick in? Or did it?

c.      Was it supposed to fail over to the other node?

First of all, we need to understand the FailureConditionLevel, which controls when failover occurs from both a SQL Server FCI (failover cluster instance) and an AlwaysOn Availability Group automatic failover perspective.  For detailed information regarding Failover Policies for Failover Cluster Instances, refer to this article:  http://msdn.microsoft.com/en-us/library/ff878664.aspx

 

clip_image002

 

In my case the FailureConditionLevel is set to 5 (the default is 3).  This setting can be altered with the following T-SQL script:

clip_image004
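Since the screenshot of the script may not render here, a minimal sketch of checking and changing the setting follows; the availability group name 2012AG is taken from the logs later in this post, so treat it as a placeholder for your own group:

-- Check the current failover policy settings for each availability group
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Raise the failure condition level to 5 (run on the primary replica)
ALTER AVAILABILITY GROUP [2012AG]
SET (FAILURE_CONDITION_LEVEL = 5);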

If I look at the article referenced above, I notice that the FailureConditionLevel has the following attributes:

5

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.

  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).

  • System stored procedure sp_server_diagnostics returns ‘system error’.

  • System stored procedure sp_server_diagnostics returns ‘resource error’.

  • System stored procedure sp_server_diagnostics returns ‘query_processing error’.

 

One thing to note here is that the cluster takes action only if one of the subsystems reports an "error"; no action is taken on a warning.

So effectively what happens is the following:

  •          Cluster service runs LooksAlive check

  •          Sp_server_diagnostics results sent to Resource Monitor DLL

  •          Resource Monitor DLL detects any error state and notifies the cluster service

  •          Cluster Service takes the resource offline

  •          Notifies SQL Server to issue an internal command to take the availability group offline.

  • There is also the whole concept of a lease that is explained here:

http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx

 

In order to understand this better I attempted and was able to reproduce the scenario.

I then looked at the Cluster Diagnostic extended event Log, the AlwaysOn extended event log, the cluster log, and the SQL Server error log to try to piece together what exactly happened.

Cluster Diagnostic Extended Event Log:

We see from this log that the system component did throw an error, which in this case equates to a number of dumps having been created and/or a spinlock orphaned after an access violation or a detected memory scribbler condition.

The Cluster Diagnostic logs are located in the Log directory as shown below and are different log files from the cluster log itself.

They are of the format: ServerName_InstanceName_SQLDIAG_*.xel

clip_image005
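If you prefer T-SQL to opening the files in SSMS, a sketch along these lines reads the SQLDIAG files directly; the path and file-name pattern are placeholders for your own Log directory:

;with diag_cte as
(
    select
        object_name,
        cast(event_data as xml) as event_data
    from sys.fn_xe_file_target_read_file
         ('<path to Log directory>\ServerName_InstanceName_SQLDIAG_*.xel', null, null, null)
)
select
    event_data.value('(/event/@timestamp)[1]', 'datetime2(3)') as event_time_utc,
    object_name as event_name,      -- e.g. component_health_result
    event_data
from diag_cte
order by event_time_utc;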

 

As we can see below, the component_health_result event indicates that the system component of sp_server_diagnostics returned an error, which the resource monitor then interpreted as a failure because of the FailureConditionLevel setting and propagated as a "NOTHEALTH" resource state to the cluster service, causing the LooksAlive check to return a "not alive" (false) status.

clip_image006

 

AlwaysOn Extended Event log

The AlwaysOn health Extended Event logs cover Availability Group related diagnostics such as state changes for the group, replicas, or databases; errors reported; lease expiration; and any Availability Group related DDL that is executed. The format of the logs is: AlwaysOn_health*.xel

 

clip_image008
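The same read pattern used for the SQLDIAG files works for these logs as well; a sketch (path is again a placeholder, and the event names in the comment are examples you should verify against your own output):

;with ao_cte as
(
    select
        object_name,
        cast(event_data as xml) as event_data
    from sys.fn_xe_file_target_read_file
         ('<path to Log directory>\AlwaysOn_health*.xel', null, null, null)
)
select
    event_data.value('(/event/@timestamp)[1]', 'datetime2(3)') as event_time_utc,
    object_name as event_name,      -- e.g. availability_replica_state_change, error_reported
    event_data
from ao_cte
order by event_time_utc;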

 

If we look at the log snippet below, we see that the AG lease expired, which triggered an attempted failover and in turn changed the state from PRIMARY_NORMAL to RESOLVING_NORMAL.

 

clip_image010

 

Cluster Log

Note: The times are in UTC so you have to convert them to match up with the other log files.

To generate the log: How to Generate a Cluster Log on Windows 2008 onwards

00006918.00015978::2013/04/03-18:54:37.251 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'system' health state has been changed from 'warning' to 'error' at 2013-04-03 11:54:37.247

00006918.00014ef4::2013/04/03-18:54:37.970 ERR   [RES] SQL Server Availability Group: [hadrag] Failure detected, the state of system component is error

00006918.00014ef4::2013/04/03-18:54:37.970 ERR   [RES] SQL Server Availability Group< 2012AG>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel

00006918.00014ef4::2013/04/03-18:54:37.970 ERR   [RES] SQL Server Availability Group< 2012AG>: [hadrag] Resource Alive result 0.

00006918.00014ef4::2013/04/03-18:54:37.970 ERR   [RES] SQL Server Availability Group< 2012AG>: [hadrag] Resource Alive result 0.

00006918.00014ef4::2013/04/03-18:54:37.970 WARN  [RHS] Resource 2012AG IsAlive has indicated failure.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for '2012AG', gen(0) result 1.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO  [RCM] TransitionToState(2012AG) Online-->ProcessingFailure.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (2012AG, Online --> Failed)

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO  [RCM] resource 2012AG: failure count: 1, restartAction: 2.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO  [RCM] Will restart resource in 500 milliseconds.

 

If you look at the "restart action" highlighted above, a restart is attempted on the current node first before failing over to the other node; in this case the restart succeeded, so the group never actually failed over to the other node. If we take a look at the cluster Availability Group resource properties, we can confirm that the restart policy does indicate that a restart will be attempted on the current node first.

clip_image012

 

00019d20.00019418::2013/04/03-18:55:06.079 INFO  [RCM] TransitionToState(2012AG) DelayRestartingResource-->OnlineCallIssued.

00019d20.00019418::2013/04/03-18:55:06.079 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for '2012AG', gen(1) result 997.

00019d20.00019418::2013/04/03-18:55:06.079 INFO  [RCM] TransitionToState(2012AG) OnlineCallIssued-->OnlinePending.

00006918.0001f1c0::2013/04/03-18:55:07.298 INFO  [RHS] Resource 2012AG has come online. RHS is about to report status change to RCM

 

SQL Server Errorlog

More details on Lease Expiration: http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx 

2013-04-03 11:54:43.59 Server      Error: 19407, Severity: 16, State: 1.

2013-04-03 11:54:43.59 Server      The lease between availability group '2012AG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

2013-04-03 11:54:43.64 Server      AlwaysOn: The local replica of availability group '2012AG' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

2013-04-03 11:54:43.64 Server      The state of the local availability replica in availability group '2012AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-04-03 11:54:43.84 spid31s     The availability group database "HADB" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

2013-04-03 11:54:43.84 spid27s     AlwaysOn Availability Groups connection with secondary database terminated for primary database 'HADB' on the availability replica with Replica ID: {89c5680c-371b-45b9-ae19-2042d8eec27b}. This is an informational message only. No user action is required.

Note: the error below can occur if the local log records are hardened but quorum is lost with the cluster, so the remote harden cannot be completed.

2013-04-03 11:54:45.16 spid58      Remote harden of transaction 'user_transaction' (ID 0x00000000001ee9e2 0000:000006eb) started at Apr  3 2013 11:54AM in database 'HADB' at LSN (37:28:204) failed.

2013-04-03 11:54:46.42 spid31s     Nonqualified transactions are being rolled back in database HADB for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

Note: this phase comes after the "restart action" seen in the cluster log, where a restart is attempted on the same node before failing over to the other node.

2013-04-03 11:55:06.25 spid58      AlwaysOn: The local replica of availability group '2012AG' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2013-04-03 11:55:07.27 spid58      The state of the local availability replica in availability group '2012AG' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-04-03 11:55:07.55 Server      The state of the local availability replica in availability group '2012AG' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

 

So, answering the three earlier questions with the help of the logs:

a.      The reason we got into this state was that the system component reported an error (a series of exceptions); we won't go into those here.

b.      Failover was attempted, but the initial action is to restart on the same node, and the availability group did come back online on that node.

c.      No, it should not have failed over to the other node.

Hope the exposure to these logs is helpful in troubleshooting AlwaysOn Availability Group issues.

 

-Denzil Ribeiro– Sr. SQL Premier Field Engineer

 

 


Flow Control in Always On Availability Groups – what does it mean and how do I monitor it?


A question was posed to me as to whether Flow Control, which existed in Database Mirroring, is still relevant for Availability Groups.

Flow Control is primarily a mechanism to gate or throttle messages to avoid excessive resource use on the primary or secondary. When we are in "Flow Control" mode, sending of log block messages from the primary to the secondary is paused until we leave flow control mode.

 

A Flow Control gate or threshold exists at 2 places:

- Availability Group Replica/Transport – 8192 messages

- Availability Group Replica Database – 112 * 16 = 1792 messages per database, subject to the 8192 total limit at the transport/replica level

 

When a log block is captured, every message sent over the wire has a sequence number, which is a monotonically increasing number. Each packet also includes an acknowledgement number, which is the sequence number of the last message received/processed at the other side of the connection. With these two numbers, the number of outstanding messages can be calculated to see whether there is a large number of unprocessed messages. The message sequence number is also used to ensure that messages are processed in sequence; if messages arrive out of sequence, the session is torn down and re-established.

 

From an Availability Replica perspective, either the Primary or the Secondary replica can signal that we are in Flow control mode.

 

On the Primary, when we send a message, we check for the number of UN-acknowledged messages that we have sent - which is the delta between Sequence Number of the message sent and last acknowledged message. If that delta exceeds a pre-defined threshold value, that replica or database is in flow control mode which means that no further messages are sent to the secondary until the flow control mode is reset. This gives the secondary some time to process and acknowledge the messages and allows whatever resource pressure that exists on the secondary to clear up.

 

On the secondary, when we reach a low threshold of log caches or when we detect memory pressure, the secondary passes a message to the primary indicating it is low on resources. When a SECONDARY_FLOW_CONTROL message is sent to the primary, a bit is set on the primary for the database in question indicating it is in flow control mode. That database is then skipped during the round-robin scan of databases to send data.

 

Once we are in "flow control" mode, no further messages are sent to the secondary until that mode is reset; instead, we check every 1000 ms for a change in flow control state. On the secondary, for example, if the log caches are flushed and additional buffers are available, the secondary will send a flow control disable message indicating we no longer need to be flow controlled. Once the primary gets this message, that bit is cleared out and messages will again flow for the database in question. On the transport or replica side, on the other hand, the flow control state is reset once the number of unacknowledged messages falls below the gated threshold.

While we are in Flow control mode, perfmon counters and wait types can give us the amount of time we are in flow control mode.

 

Wait Types:

http://msdn.microsoft.com/en-us/library/ms179984.aspx

 

HADR_DATABASE_FLOW_CONTROL

Waiting for messages to be sent to the partner when the maximum number of queued messages has been reached. Indicates that the log scans are running faster than the network sends. This is an issue only if network sends are slower than expected.

HADR_TRANSPORT_FLOW_CONTROL

Waiting when the number of outstanding unacknowledged AlwaysOn messages is over the out flow control threshold. This is on an availability replica-to-replica basis (not on a database-to-database basis).
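To see how much time an instance has accumulated in these waits since startup, a simple check against sys.dm_os_wait_stats (a sketch; column choice is mine) is:

select wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms
from sys.dm_os_wait_stats
where wait_type in ('HADR_DATABASE_FLOW_CONTROL', 'HADR_TRANSPORT_FLOW_CONTROL')
order by wait_time_ms desc;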

 

Perfmon counters:

http://msdn.microsoft.com/en-us/library/ff878472.aspx

 

Flow Control Time (ms/sec)

Time in milliseconds that log stream messages waited for send flow control, in the last second.

Flow Control/sec

Number of times flow-control initiated in the last second. Flow Control Time (ms/sec) divided by Flow Control/sec is the average time per wait.
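These counters can also be pulled from T-SQL via sys.dm_os_performance_counters; a sketch (the object name may carry a different prefix on named instances, so the LIKE filters are deliberately loose):

select object_name, counter_name, instance_name, cntr_value
from sys.dm_os_performance_counters
where object_name like '%Availability Replica%'
  and counter_name like 'Flow Control%';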

 

Extended Events

There are 2 Extended Events which will give us the relevant information when we are under the Flow control mode – note they are under the Debug Channel.

The action is basically a “set=0” or “cleared=1” bit.

clip_image002

 

Denzil Ribeiro– Sr Premier Field Engineer


 

Monitoring SQL Server 2012 AlwaysOn Availability Groups Worker Thread Consumption


In most cases, people utilize SQL Server features and capabilities in a common and typical usage pattern.  But there are some instances where environments do something just a little different, or fill the bucket to the absolute brim.  With the advent of AlwaysOn Availability Groups, it is such a robust high availability and disaster recovery solution that as time goes on database professionals are going to think of really cool and interesting ways to stretch it wide, solving problems that we may not have originally thought of.  What brought on the deep dive of this specific topic was that a customer was looking to see what potential CPU problems could arise with 40+ availability groups, and a multiple of that of availability group databases. One of the most devastating bottlenecks is when you start running into exhausted CPUs, but with proper planning and focused monitoring you can minimize that risk and ensure existing functionality.

Take an instance where you want to create and implement a large number of availability groups and/or availability databases.  You might be far exceeding what others have published and tested with.  When doing this, you will want to keep a close eye on what's called the HADR Worker Pool.  This is basically the shared pool of workers across HADR databases.  For more information on specifics of what these tasks would include, please see Bob Dorr's blog post on the topic. Regardless of what these workers are doing, the overconsumption of them could lead to less-than-optimal performance. As my peer, Denzil Ribeiro, previously wrote about in his blog post on actual worker threads exceeding max worker threads, the contention may not surface itself in the form of a hard cap (although max worker threads does contribute largely to the calculation of the worker limit, as explained below), but in other forms of CPU/resource contention.

When we run into resource contention, we instinctively do our basic troubleshooting on waits and queues, but the root of the problem may not be completely evident with traditional means.  So in order to fill in the blanks, we can look at a handful of corner strategies for troubleshooting HADR worker thread contention. 

SQL Server lends us a handful of ways to monitor and troubleshoot this, including XE events. The first one I want to look at is hadr_worker_pool_task.  Bob mentions this in his blog post, and it's not necessarily the actual data and task information I'm normally concerned with (although this information is valuable and vital in dissecting specific problems and abnormal behavior). It's the count of worker pool tasks that I will utilize, for a few reasons: baselining during normal operation, and troubleshooting during abnormal situations.  All I pull is the count of these events (utilizing the event_counter XE target), and a comparison of current vs. baseline counts can tell us whether or not HADR is simply doing more now than it typically does.

My first step is to create the XE session to start capturing this data:

create event session HadrWorkerPoolTask
on server
add event sqlserver.hadr_worker_pool_task
add target package0.event_counter
with
(
       max_memory = 4096 KB,
       event_retention_mode = allow_single_event_loss,
       max_dispatch_latency = 30 seconds,
       max_event_size = 0 KB,
       memory_partition_mode = none,
       track_causality = off,
       startup_state = on
);
go

 

Because the event_counter target is a running total, I want a way to have historical data and compare the deltas between two different times.  I accomplish this by creating a table in a utility database (in my example, the database is named DBA):

use DBA;
go

create table dbo.HadrWorkerPoolTask
(
       TaskLogId int identity(1, 1) not null
             constraint PK_HadrWorkerPoolTask_TaskLogId primary key clustered,
       LogDateTime datetime2(2) not null,
       TaskCount bigint not null
);
go

 

In order to pull the data from the event_counter, I wrote the following query:

;with cte as
(
       select
             convert(xml, t.target_data) as target_data
       from sys.dm_xe_session_targets t
       inner join sys.dm_xe_sessions s
       on t.event_session_address = s.address
       where s.name = 'HadrWorkerPoolTask'
       and t.target_name = 'event_counter'
)
select
       a.b.value('./@count', 'varchar(max)') as count
from cte
cross apply cte.target_data.nodes('CounterTarget/Packages/Package/Event') a(b);

 

This will give you the point-in-time count of this event, but I created the dbo.HadrWorkerPoolTask to routinely dump this count for tracking.  Modifying the above query slightly in order to persist the event counts, I end up with the following:

;with cte as
(
       select
             convert(xml, t.target_data) as target_data
       from sys.dm_xe_session_targets t
       inner join sys.dm_xe_sessions s
       on t.event_session_address = s.address
       where s.name = 'HadrWorkerPoolTask'
       and t.target_name = 'event_counter'
)
insert into dbo.HadrWorkerPoolTask (LogDateTime, TaskCount)
select
       getdate(),
       a.b.value('./@count', 'varchar(max)') as count
from cte
cross apply cte.target_data.nodes('CounterTarget/Packages/Package/Event') a(b);

 

In order to ensure this query is executed routinely, you can create a SQL Server Agent job to execute it every x minutes (or hours, or days, or whatever granularity you are looking to achieve; just remember that the less frequently you grab and store the count of this event, the more your points of reference will blur and linearize the times in between, possibly hiding trends).  A sample job definition is sketched after the chart below.  So at this point in monitoring you now have a running log of total event counts for hadr_worker_pool_task.  But we need to take this a step further and query for the deltas instead of the cumulative amount.  The below query will do just that: compare each time slot to the one sequentially before it and subtract the task counts to get the delta:

select
       t1.LogDateTime,
       t1.TaskCount - t2.TaskCount as HadrTaskCount
from dbo.HadrWorkerPoolTask t1
inner join dbo.HadrWorkerPoolTask t2
on t1.TaskLogId = t2.TaskLogId + 1;

 

You can take this query and dump the result set into Excel, create an SSRS report, or do whatever you typically do to consume diagnostic data.  Here is a little sample of what it may look like (I just copy-pasted into Excel and created a line chart):

clip_image002

Now I have a nice chart of the HADR activity for this time period, and if/when something goes awry I can execute the same query for the current time span and compare the two charts.
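As mentioned above, a SQL Server Agent job can persist the counter on a schedule. A minimal sketch is below; the job name and five-minute schedule are arbitrary choices of mine, and the @command placeholder stands in for the INSERT query shown earlier:

use msdb;
go

exec dbo.sp_add_job
     @job_name = N'Log HadrWorkerPoolTask count';

exec dbo.sp_add_jobstep
     @job_name = N'Log HadrWorkerPoolTask count',
     @step_name = N'Persist event_counter value',
     @subsystem = N'TSQL',
     @database_name = N'DBA',
     @command = N'/* paste the INSERT INTO dbo.HadrWorkerPoolTask query from above here */';

exec dbo.sp_add_jobschedule
     @job_name = N'Log HadrWorkerPoolTask count',
     @name = N'Every 5 minutes',
     @freq_type = 4,               -- daily
     @freq_interval = 1,
     @freq_subday_type = 4,        -- units of minutes
     @freq_subday_interval = 5;    -- every 5 minutes

exec dbo.sp_add_jobserver
     @job_name = N'Log HadrWorkerPoolTask count',
     @server_name = N'(local)';
go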

 

Another piece of data that I like to keep an eye on is the hadr_thread_pool_worker_start event. This event fires when SQL Server starts up a new worker if there are no idle workers, or if the amount of required workers exceeds the active worker count (all the while staying within the bounds of the worker limit count).  We can consume a couple of things from this event: (1) the recordable demand that HADR is putting on the system by requiring additional workers, and (2) we can watch for the event where we simply can’t create another HADR worker, possibly due to hitting the worker limit (that limit is going to be the effective max worker threads [minus] 40, as explained by Bob Dorr).  If SQL Server is unable to start a new HADR worker then it will conveniently tell you this in the event’s output.

In order to create this event, I have implemented the below session definition:

create event session HadrThreadPoolWorkerStart
on server
add event sqlserver.hadr_thread_pool_worker_start
add target package0.event_file
(
       set filename = N'<path to XEL file>\HadrThreadPoolWorkerStart.xel'
)
with
(
       max_memory = 4096 KB,
       event_retention_mode = allow_single_event_loss,
       max_dispatch_latency = 30 seconds,
       max_event_size = 0 KB,
       memory_partition_mode = none,
       track_causality = off,
       startup_state = on
);
go

 

To view this session data, you can use the GUI (accessible through viewing the target or live data), or you can use a query like the one below:

declare @top_count int;
set @top_count = 100;

;with xe_cte as
(
       select
             object_name,
             cast(event_data as xml) as event_data
       from sys.fn_xe_file_target_read_file
       (
             '<path to XEL file>\HadrThreadPoolWorkerStart*.xel',
             null,
             null,
             null
       )
)
select top (@top_count)
       event_data.value('(/event/@timestamp)[1]', 'varchar(32)') as time_stamp,
       event_data.value('(/event/data/value)[3]', 'int') as active_workers,
       event_data.value('(/event/data/value)[2]', 'int') as idle_workers,
       event_data.value('(/event/data/value)[1]', 'int') as worker_limit,
       event_data.value('(/event/data/value)[4]', 'varchar(5)') as worker_start_success
from xe_cte
order by time_stamp desc;

 

This query will return a similar output:

clip_image004

The notable results of this query are active_workers vs. worker_limit (are we getting close to, or have we reached, the worker limit?), and worker_start_success (were we able to start a new HADR worker when we needed one?).  If you are consistently seeing worker_start_success as false (indicating SQL Server’s inability to start a new HADR worker) then you could be running into HADR worker pool starvation.  In the event of excessive worker thread contention and long term failures to start a worker, after 15 minutes there will be a logged message in the error log.  It is message id 35217, and in your error log the message will read:

“The thread pool for AlwaysOn Availability Groups was unable to start a new worker thread because there are not enough available worker threads. This may degrade AlwaysOn Availability Groups performance.  Use the "max worker threads" configuration option to increase number of allowable threads.”

As an aside, where does worker_limit come from?  It is effective max worker threads – 40.  So on my single CPU VM, that’ll be 512 – 40 = 472.
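To check that number on your own instance, a quick sketch (the minus 40 comes from the formula above):

select max_workers_count,
       max_workers_count - 40 as hadr_worker_limit
from sys.dm_os_sys_info;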

There are other considerations when monitoring and troubleshooting HADR worker consumption and task quantity, but what this all boils down to is CPU contention.  Going back to the root of the problem, we can utilize other tried-and-true strategies to analyze this resource contention.  For instance, take note of the summary of workers and runnable tasks from the sys.dm_os_schedulers DMV:

select
       scheduler_id,
       status,
       current_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count,
       runnable_tasks_count,
       load_factor
from sys.dm_os_schedulers
where status = 'VISIBLE ONLINE';

 

Things to watch out for here are counts of work_queue_count and runnable_tasks_count consistently above zero.  Another query to have in your toolbox, looking for wait types associated with thread starvation:

select
       wait_type,
       count(*) as wait_type_count
from sys.dm_os_waiting_tasks
group by wait_type
order by wait_type_count desc;

 

This above query will give you a count of all current waiting tasks’ associated wait types.

 Lastly, when troubleshooting busy schedulers, I like to see what percentage of the total wait time the signal wait time is (signal wait time is the amount of time that tasks waited between being signaled and starting).  A sample query for this could be as follows:

;with cpu_wait_cte as
(
       select
             sum(wait_time_ms) as sum_wait_time_ms,
             sum(signal_wait_time_ms) as sum_signal_wait_time_ms
       from sys.dm_os_wait_stats
)
select
       convert(
             decimal(5, 2), sum_signal_wait_time_ms * 1.0 / sum_wait_time_ms * 100
       ) as signal_wait_time_percentage
from cpu_wait_cte;

 

This will be the cumulative stats from the last time the instance started (or this DMV was explicitly cleared), so you may need to mold this query into a delta query from a start time to an end time (or manually clear this DMV) instead of grabbing the wait stats in their entirety.  For all intents and purposes, the above query gives the percentage of the total time spent waiting that is signal wait time.  This value should be frequently baselined and compared when troubleshooting possible CPU contention.  A higher share of time spent waiting for the CPU to pick the task back up can indicate CPU pressure.

I have illustrated a few possible approaches to monitoring and troubleshooting possible HADR worker thread issues and tying it all back to generic CPU contention problems.  With the right approach, you can maximize visibility on a commonly foggy aspect of HADR worker thread contention.

 

Thomas Stringer – SQL Server Premier Field Engineer

Twitter: @SQLife

Forced Parameterization Can Lead to Poor Performance


Something that is a relatively common performance eye opener is when you have a large ad hoc workload and you're seeing a huge gap in plan reuse.  You talk to the application team that is responsible for this possibly dreaded scenario and relay your concerns about the high CPU caused by the ad hoc workload compilation rate, and with the plan cache bloat you would really like to see better client-side parameterization. We've all been there before, and we are all too familiar with the response that the application code will not be changing.  Work your magic on the database.

A commonly accepted action to resolve this plan reuse problem is to set a database to forced parameterization.  What this ultimately does is tell SQL Server to parameterize all DML queries (with a list of limitations that can be found in the above link).  Take, for example, a complex ad hoc query that is executed with up to 100 distinct parameter values.  With simple parameterization, that could lead to 100 different cached plans.  But by forcing parameterization that could be narrowed down to a single prepared plan that is parameterized.  You think, awesome, problem solved!  Maybe.  Maybe not.

By forcing parameterization and reusing complex queries’ plans you could potentially be hurting performance.  In order to illustrate this pitfall, I have provided an example below.  I have grown my AdventureWorks2012 database using a script provided by my good friend and SQL professional Adam Machanic (Thinking Big).  Take the following query, something that may be used for reporting:

 

use AdventureWorks2012;

go

select

       p.ProductID,

       p.Name as Product,

       th.ActualCost,

       th.Quantity,

       pm.Name as ProductModel

from dbo.bigTransactionHistory th

inner join dbo.bigProduct p

on th.ProductID = p.ProductID

left join Production.ProductModel pm

on p.ProductModelID = pm.ProductModelID

where p.ProductID < <ProductID VALUE HERE>;

 

As you can see in the above query, a user could filter by a range of ProductIDs.  Take a scenario where plan reuse (or the lack thereof) is causing plan cache bloat and a high rate of compilations.  So you set parameterization to be forced for the AdventureWorks2012 database:

alter database AdventureWorks2012

set parameterization forced;

go

 

Now say the first query for the cold cache is the following, specifying the ProductID to be a value of 1002:

select

       p.ProductID,

       p.Name as Product,

       th.ActualCost,

       th.Quantity,

       pm.Name as ProductModel

from dbo.bigTransactionHistory th

inner join dbo.bigProduct p

on th.ProductID = p.ProductID

left join Production.ProductModel pm

on p.ProductModelID = pm.ProductModelID

where p.ProductID < 1002;

 

We get the following post execution plan:

clip_image001

All looks well.  But the problem comes into play now when we execute with a different, larger value (50532):

 

select

       p.ProductID,

       p.Name as Product,

       th.ActualCost,

       th.Quantity,

       pm.Name as ProductModel

from dbo.bigTransactionHistory th

inner join dbo.bigProduct p

on th.ProductID = p.ProductID

left join Production.ProductModel pm

on p.ProductModelID = pm.ProductModelID

where p.ProductID < 50532;

 

We can see by looking at the plan cache that the desired plan reuse has been obtained, by using the below query:

select

       cp.objtype,

       cp.cacheobjtype,

       cp.usecounts,

       st.text,

       qp.query_plan

from sys.dm_exec_cached_plans cp

outer apply sys.dm_exec_sql_text(cp.plan_handle) st

outer apply sys.dm_exec_query_plan(cp.plan_handle) qp

where st.text like '%bigTransactionHistory%'

and st.text not like '%dm_exec_cached_plans%';

 

clip_image003

You will notice, as a side note, that there are still two Adhoc plans for this query.  These are simply shell plans that point to the prepared plan that has now been parameterized.  You can tell these are shell plans by looking at the query plan's XML: you'll see only a few elements there, with only the StmtSimple element containing data.  That data includes ParameterizedPlanHandle and ParameterizedText attributes that point to the prepared plan containing the full query plan details.  Think of these shell plans as pointers to the real plan.

Going back to the current post execution plan for parameter value 50532, you will be sorely disappointed:

 

clip_image004

As you can see above, because the initial plan was reused, the estimated number of rows was a little over 2000, but the actual number of rows was over 30 million! Taking a look at the XML of this post execution plan, we can see the following difference between the compiled value and the runtime value, explaining why this plan executes poorly for this run (for further information regarding parameter sniffing, please see Lisa Gardner's blog post, Back to Basics: SQL Parameter Sniffing due to Data Skews):

<ParameterList>

  <ColumnReference Column="@0" ParameterCompiledValue="(1002)" ParameterRuntimeValue="(50532)" />

</ParameterList>

 

There’s a chance this is not the optimal plan when users specify the upper bounds of ProductIDs.  For argument’s sake, let’s switch back to simple parameterization and execute the two queries again (ProductID equal to 1002 and 50532):
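The switch back is simply the counterpart of the earlier ALTER DATABASE statement:

alter database AdventureWorks2012
set parameterization simple;
go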

clip_image006

Looking at these plans side by side, we see that two relatively different plans were chosen for the different ProductID values (1002 is the top plan, 50532 the bottom).  And glancing at the plan cache we can further see that instead of preparing a single plan, the two full Adhoc plans were compiled and cached:

clip_image008

 

Looking at the query stats for the same query (passing parameter value 50532, utilizing forced parameterization for the first test and simple parameterization for the second test) during the different runs, we can see numerical proof that simple parameterization (and in turn an ad hoc plan) gives us better execution statistics at the cost of no plan reuse:

clip_image010

When you are dealing with these types of scenarios where plan cache bloat and large amounts of compilations due to ad hoc workloads are becoming issues, before you flip a switch like forced parameterization you need to do extensive testing and analysis.  In the end, fixing certain problems can lead to more disastrous ones (in this case, poor plan usage and degraded execution time).  Like many other things we do in our profession, this is one big “it depends”. I hope that this blog post has given some insight to a potential pitfall of forcing parameterization, and with proper planning you will be able to determine the best balance of plan reuse and best performance, even with an ad hoc workload.

 

Thomas Stringer – SQL Server Premier Field Engineer

Twitter: @SQLife

An Approach to SQL Server Index Tuning


Well-constructed indexes can greatly improve read performance in SQL Server, but they can be costly to maintain. There's the obvious cost of additional time for your periodic index maintenance (rebuilds, reorganization and updating statistics) and the cost of additional storage, but there's also a cost every time you make an update to indexed data.

 

Consider this small and poorly indexed table:

 

CREATE TABLE dbo.Person(

        CompanyID INT IDENTITY,

        NetworkId VARCHAR(20),

        FirstName VARCHAR(20),

        MiddleName VARCHAR(20),

        LastName VARCHAR(50),

        DateOfBirth DATE,

        SSN CHAR(9),

        EmailAddress VARCHAR(100),

        BusinessPhone VARCHAR(10),

        ModifiedDate DATETIME,

        CONSTRAINT PK_Person PRIMARY KEY CLUSTERED (CompanyID)

);

GO

CREATE INDEX ix_LastName ON dbo.person (LastName);

CREATE INDEX ix_LastFirstMiddle ON dbo.Person (LastName, FirstName, MiddleName);

CREATE INDEX ix_LastNameFirstName ON dbo.Person (LastName, FirstName)

INCLUDE (DateOfBirth);

GO

 

Every time we insert or delete a row from our table we must also insert or delete a row in each of its non-clustered indexes, and if we update a value in an indexed column (e.g., MiddleName) we must update any indexes that contain the column.  You can see that we could be incurring a lot of costly I/O – and we haven't even considered that each modification must also be written out to the transaction log.

 

Add to this a tendency to add every index suggested by tuning tools and wizards in hopes of (ironically) improving performance and we’ve got a mess.  We clearly need a more considered, holistic approach to our index tuning, so I’d like to share my approach to this task.

 

Let’s start by visualizing the index maintenance necessary when we modify our table.  We’ll turn on the option to “Include the Actual Query Plan” then run an insert and an update and look at the properties of the Insert and Update operators…

 

INSERT INTO dbo.Person ( NetworkId, FirstName, MiddleName, LastName, DateOfBirth,

                      SSN, EmailAddress, BusinessPhone, ModifiedDate )

VALUES  ( 'jroberts', 'Jonathan', 'Q', 'Roberts', '19700206', '123456789',

          'jroberts@somecompany.com', '9195559632', GETDATE());

 

 

clip_image002

If you highlight the Clustered Index Insert operator and hit F4 you can see its properties.  Scroll down to the Object node and expand it.  Here you can see that all of our non-clustered indexes are also being modified by the insert.

 

clip_image004

 

If we update MiddleName you can see there’s less work to do as it only appears in one of our indexes…

 

UPDATE dbo.Person

SET MiddleName = 'Quincy'

WHERE NetworkId = 'jroberts';

 

clip_image006

clip_image007

 

 

We can see that the number and design of our indexes will impact the performance of our server.  Our goal is to get the most use from the smallest number of indexes.  We'll do this by first reviewing all of the existing indexes on a table looking for opportunities to consolidate them, then making modifications to the remaining indexes to maximize their usage, and finally adding 1 or 2 thoughtfully built indexes and monitoring to see the impact of our changes. The approach laid out here does not absolve us from doing the preliminary bottleneck analysis and identifying our top contributing queries before we dive down into crafting indexes for them.

Index tuning efforts may start with:

1) Manually crafting an index to improve performance on a problematic SQL statement,

2) Implementing "missing indexes" identified with a DMV, or

3) A comprehensive workload evaluation using the Database Engine Tuning Advisor (DTA).

Wherever you start, I recommend you focus on one table at a time:

  1. If taking the manual approach, do a quick check for other expensive queries involving the same table.  If using “missing indexes” or the DTA, make note of the various index suggestions for the targeted table, and note where they overlap.  You want to discover as much as you can about your indexing needs so you can maximize the use of each index.  There will be opportunities where simply adding one column to an INCLUDE clause will cover an additional query.

    -- Gather missing index data for the current database
    SELECT  t.name AS 'table',
            ( avg_total_user_cost * avg_user_impact ) * ( user_seeks + user_scans )
            AS 'potential_impact',
            'CREATE NONCLUSTERED INDEX ix_IndexName ON ' + SCHEMA_NAME(t.schema_id)
            + '.' + t.name COLLATE DATABASE_DEFAULT + ' ('
            + ISNULL(d.equality_columns, '')
            + CASE WHEN d.inequality_columns IS NULL THEN ''
                   ELSE CASE WHEN d.equality_columns IS NULL THEN ''
                             ELSE ','
                        END + d.inequality_columns
              END + ') ' + CASE WHEN d.included_columns IS NULL THEN ''
                                ELSE 'INCLUDE (' + d.included_columns + ')'
                           END + ';' AS 'create_index_statement'
    FROM    sys.dm_db_missing_index_group_stats AS s
            INNER JOIN sys.dm_db_missing_index_groups AS g
    ON s.group_handle = g.index_group_handle
            INNER JOIN sys.dm_db_missing_index_details AS d
    ON g.index_handle = d.index_handle
            INNER JOIN sys.tables t WITH ( NOLOCK ) ON d.OBJECT_ID = t.OBJECT_ID
    WHERE   d.database_id = DB_ID()
            AND s.group_handle IN (
            SELECT TOP 500 group_handle
            FROM    sys.dm_db_missing_index_group_stats WITH ( NOLOCK )
            ORDER BY ( avg_total_user_cost * avg_user_impact ) *
                     ( user_seeks + user_scans ) DESC )
            AND t.name LIKE 'Person'
    ORDER BY ( avg_total_user_cost * avg_user_impact ) * ( user_seeks + user_scans ) DESC;

     

  2. Run an index usage query just for the table you're working with and save the output.  We want to know which indexes are being used and which aren't.  We'll also want to check back after our tuning session to see if usage patterns have changed. Keep in mind that the DMV counters are reset each time SQL is restarted, so the longer SQL's been up before you look for missing indexes or index usage the more accurate the values will be. Also consider cyclic usage patterns.  You may want to postpone data collection until after those big end of month (quarter, year) reports have been run.

    -- Index usage for tables having more than 10000 rows
    SELECT  t.name 'table', i.name 'index_name',
            ( u.user_seeks + u.user_scans + u.user_lookups ) 'reads',
            u.user_updates 'writes', ( SELECT   SUM(p.rows)
                                       FROM     sys.partitions p
                                       WHERE    p.index_id = u.index_id
                                                AND u.object_id = p.object_id
                                     ) 'rows', i.type_desc, i.is_primary_key,
            i.is_unique
    FROM    sys.dm_db_index_usage_stats u
            INNER JOIN sys.indexes i ON i.index_id = u.index_id
                                        AND u.object_id = i.object_id
            INNER JOIN sys.tables t ON u.object_id = t.object_id
            INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
    WHERE   OBJECTPROPERTY(u.object_id, 'IsUserTable') = 1
            AND ( SELECT    SUM(p.rows)
                  FROM      sys.partitions p
                  WHERE     p.index_id = u.index_id
                            AND u.object_id = p.object_id
                ) > 10000
            AND u.database_id = DB_ID()
            AND t.name LIKE 'Person'
    ORDER BY reads;

  3. Script out the DDL for the table, including all of its indexes and keys.  We need to see what we've already got to work with and the data types of the columns.
  4. Before adding new indexes we always want to optimize those we've already got.  Look for duplicate indexes we can eliminate or overlapping indexes that we can easily merge (a query for spotting overlapping indexes is sketched after the consolidation example below).  Approach changes to unique indexes (including your primary key index) and clustered indexes very cautiously as they have important roles in your table.  Check your index usage numbers.  You don't want to spend time figuring out the best way to merge 2 similar indexes, neither of which is ever used.

Looking at our simplistic example from above, we find that we can roll the functionality of all 3 indexes into 1 by simply adding MiddleName to the 3rd index below:

-- Lots of overlap and duplication

CREATE INDEX ix_LastName ON dbo.Person (LastName);

CREATE INDEX ix_LastFirstMiddle ON Dbo.Person (LastName, FirstName, MiddleName);

CREATE INDEX ix_LastNameFirstName ON Dbo.Person (LastName, FirstName)

        INCLUDE (DateOfBirth);

-- The functionality of the 3 can be combined into 1 index and

-- the other 2 can be dropped

CREATE INDEX ix_LastFirstMiddle ON Dbo.Person (LastName, FirstName, MiddleName)

        INCLUDE (DateOfBirth);
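On larger tables it helps to let a query flag these candidates for you. A sketch along the following lines lists nonclustered indexes whose key column list is a leading prefix of another index's keys; dbo.Person is this post's example table, so substitute your own:

;WITH index_keys AS
(
    SELECT  i.object_id,
            i.index_id,
            i.name AS index_name,
            -- Build a comma-separated list of key columns in key order
            key_columns = STUFF((
                SELECT ',' + c.name
                FROM sys.index_columns ic
                INNER JOIN sys.columns c
                    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
                WHERE ic.object_id = i.object_id
                  AND ic.index_id = i.index_id
                  AND ic.is_included_column = 0
                ORDER BY ic.key_ordinal
                FOR XML PATH('')), 1, 1, '')
    FROM sys.indexes i
    WHERE i.object_id = OBJECT_ID('dbo.Person')
      AND i.type_desc = 'NONCLUSTERED'
)
SELECT  k1.index_name, k1.key_columns,
        k2.index_name AS overlapping_index, k2.key_columns AS overlapping_key_columns
FROM index_keys k1
INNER JOIN index_keys k2
    ON  k1.object_id = k2.object_id
    AND k1.index_id < k2.index_id
    -- One index's key list is a leading prefix of the other's
    AND ( k2.key_columns LIKE k1.key_columns + '%'
       OR k1.key_columns LIKE k2.key_columns + '%' );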

5. After you’ve optimized your existing indexes, consider the indexes you’d like to add.  Can you make small changes to any of the existing indexes to accommodate your new index needs?

To a point we can extend our INCLUDE list to “cover” additional queries.  A covering index is one that contains all the columns needed for a query allowing us to avoid the additional IO of a Key or RID lookup in the base table. 

The value of the INCLUDE clause is that it allows us to create a covering index with a smaller footprint: the included columns are stored only at the leaf level of the index, not at the root or intermediate levels, so there's less impact to index size than if we were to add them as additional index keys.

Index to add:  Person (LastName, FirstName) INCLUDE (SSN)

-- Instead we can add SSN to the INCLUDE clause

CREATE INDEX ix_LastFirstMiddle ON Dbo.Person (LastName, FirstName, MiddleName)

        INCLUDE (DateOfBirth, SSN);  

 

Sometimes it’s more effective to add an index with some overlap to keep indexes narrow and I/O small:
Index to add:  Person (LastName, FirstName, NetworkId) INCLUDE (EmailAddress)

-- New index – option 1

CREATE INDEX ix_LastFirstNetworkId ON Dbo.Person (LastName, FirstName, NetworkId)   INCLUDE (EmailAddress);      

Or consider moving a column we aren't using in our WHERE clause into the INCLUDE clause, where it will take up less space.  Try different implementations of an index and test their effectiveness.

-- New index – option 2

CREATE INDEX ix_LastFirstNetworkId ON Dbo.Person (LastName, FirstName)   INCLUDE (EmailAddress, NetworkId);   

     

    6.    Limit your changes to no more than 1 or 2 indexes per table at a time, and keep a close eye on the usage statistics after implementation to see if they're being used.  Index tuning is an iterative process, so plan to do additional tuning and to check usage numbers on a periodic basis.

 

Useful Advice

 

 

 

•  Never implement a new index without careful consideration, evaluation and testing.  If using the Missing Indexes DMV, read Limitations of the Missing Indexes Feature.  The same applies when altering or dropping indexes. If index hints are present in code, disabling or removing an index will break the code.

•  Don't duplicate your table by creating an index with a lengthy INCLUDE clause.  Wider tables can justify wider indexes since the I/O savings can still be substantial. I try to INCLUDE no more than 1/3 of the table's columns.

•  I like to limit the actual index keys (stuff to the left of INCLUDE) to no more than 3 columns.  The key columns take up more space in an index than the INCLUDE columns, and I find 3 columns yields good selectivity.

•  I try to keep the number of indexes on tables in busy OLTP systems to no more than 5 (rule of thumb).  Six is OK, 29 is not!  More indexes are appropriate on an OLAP system.

•  Verify that the column order SQL's recommending is correct.  The choice of leading column drives statistics and is key to whether the optimizer chooses to use the index.  Ideally it will be selective and used in the WHERE clause of multiple queries (a quick selectivity check is sketched after this list). Additional guidelines are that columns used for equality comparisons should precede those used for inequality comparisons and that columns with greater selectivity should precede those with fewer distinct values.

•  Create indexes on columns used to JOIN tables.

•  Drop unused or very seldom used indexes after verifying they aren't used to generate a critical report for the CEO once a year.  Remember that the DMV counters are reset each time SQL is restarted.  Consider collecting data at intervals over a longer period of time to get a more accurate picture of index usage.  It's also a good practice to script out and save any indexes you plan to drop should you need to rebuild them in a hurry.
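As mentioned above, a quick way to sanity-check the selectivity of a candidate leading column is a sketch like this one; LastName and dbo.Person come from this post's example, so substitute your own column and table:

SELECT COUNT(DISTINCT LastName) AS distinct_values,
       COUNT(*) AS total_rows,
       CONVERT(decimal(10, 4),
               COUNT(DISTINCT LastName) * 1.0 / COUNT(*)) AS selectivity  -- closer to 1 = more selective
FROM dbo.Person;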

Additional Index design guidelines:

MSDN – Clustered index design guidelines

MSDN – General Index design Guidelines

 

Susan Van Eyck

SQL Server Premier Field Engineer

SecurityException / ‘The process was terminated’ errors installing SQL 2008 when .Net Framework 4.0 is installed


When installing SQL 2008 on a newer OS or a machine that has .NET Framework 4.0, you can encounter errors if the SQL installation media is run from a UNC path. The Application event log can show an error such as the one below:

 

Log Name:      Application

Source:        .NET Runtime

Date:          6/30/2013 4:20:04 PM

Event ID:      1026

Task Category: None

Level:         Error

Keywords:      Classic

User:          N/A

Computer:      Machine.Domain.com

Description:

Application: setup100.exe

Framework Version: v4.0.30319

Description: The process was terminated due to an unhandled exception.

Exception Info: System.Security.SecurityException

Stack:

  at Microsoft.SqlServer.Chainer.Setup.Setup.DebugBreak()

  at Microsoft.SqlServer.Chainer.Setup.Setup.Main()

 

Event Xml:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">

<System>

   <Provider Name=".NET Runtime" />

   <EventID Qualifiers="0">1026</EventID>

    <Level>2</Level>

    <Task>0</Task>

    <Keywords>0x80000000000000</Keywords>

   <TimeCreated SystemTime="2013-06-30T20:20:04.000Z" />

    <EventRecordID>33228</EventRecordID>

    <Channel>Application</Channel>

    <Computer>Machine.Domain.com</Computer>

   <Security />

</System>

<EventData>

   <Data>Application: setup100.exe

Framework Version: v4.0.30319

Description: The process was terminated due to an unhandled exception.

Exception Info: System.Security.SecurityException

Stack:

  at Microsoft.SqlServer.Chainer.Setup.Setup.DebugBreak()

  at Microsoft.SqlServer.Chainer.Setup.Setup.Main()

</Data>

</EventData>

</Event>

 

There are significant changes to Code Access Security (CAS) in .NET 4.0 – http://blogs.msdn.com/b/shawnfa/archive/2010/02/24/so-is-cas-dead-in-net-4-or-what.aspx – which result in this behavior. In the .NET Framework version 3.5 and earlier versions, if you loaded an assembly from a remote location, the assembly would run partially trusted with a grant set that depended on the zone in which it was loaded. For example, if you loaded an assembly from a website, it was loaded into the Internet zone and granted the Internet permission set; in other words, it executed in an Internet sandbox. If you try to run that assembly in the .NET Framework 4 and later versions, an exception is thrown; you must either explicitly create a sandbox for the assembly or allow it to run fully trusted.

More details on the .NET Framework 4.0 Security model: http://msdn.microsoft.com/en-us/magazine/ee677170.aspx

An article on the same issue was released very recently, but it doesn't list all the workarounds: http://support.microsoft.com/kb/971269

 

There are several workarounds here, any of which can help.

 

1.      Install the Media from a Local drive

 

2.      Remove the v4.0 supportedRuntime element from the setup configuration file (setup.exe.config) in the SQL setup directory (make a copy of the file before doing so):

<configuration>
  <startup useLegacyV2RuntimeActivationPolicy="true">
    <supportedRuntime version="v4.0"/>
    <supportedRuntime version="v2.0.50727"/>
  </startup>
</configuration>
 

3.      Turn off legacy CAS policy and allow remote assemblies to run by adding the elements below to the setup.exe.config file in the SQL setup directory. The <loadFromRemoteSources> element lets you specify that assemblies that would have run partially trusted in earlier versions of the .NET Framework are to be run fully trusted in the .NET Framework 4 and later versions. By default, remote assemblies do not run in the .NET Framework 4 and later (http://msdn.microsoft.com/en-us/library/dd409252.aspx).

 

<runtime>
  <legacyCasPolicy enabled="false" />
  <loadFromRemoteSources enabled="true" />
</runtime>

 

4.      Use CasPol to trust the UNC share (Using CasPol to Fully Trust a Share ). Please understand the security ramifications of doing this

C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319\caspol.exe -m -ag 1 -url "file:\\share\sqlinstall\*" FullTrust -exclusive on

 

5.      Uninstall Microsoft .NET Framework 4 / Microsoft .NET Framework 4 Client Profile ( more of a last resort unless you don’t need it).

Denzil Ribeiro – Sr. Premier Field Engineer

(@denzilribeiro)

Connection Pooling for the SQL Server DBA


 

There are a handful of questions that DBAs get in the wild that aren’t necessarily under the jurisdiction of the typical DBA.  One of those aspects is connection pooling.  Far too often application teams, or network admins, or <insert non-DBA professional here> approach the DBA with questions regarding connection pooling, and whether or not it is functioning correctly or even happening.

The answer that enterprise DBAs need to be giving to these inquiries is that it is provider-specific. In other words, it is on the client/application side that connection pooling is handled.  I will show an example of this below using the .NET Data Provider for SQL Server (the System.Data.SqlClient namespace), but the ideas proposed should propagate to other popular providers used today.

An application utilizes a provider to make connections to an instance of SQL Server.  In my case, I’m going to use a PowerShell process to mimic this behavior.  I’ll start off by creating a connection string that my application will use to connect to the default instance on SQLBOX1:

 

$ConnectionString = "data source=sqlbox1; initial catalog=master; trusted_connection=true; application name=ConnPoolTest"

 

The above connection string is rather simple, but I am specifying the Application Name parameter so that we can easily parse sys.dm_exec_sessions in a below query to further prove connection pooling.  Now what we’re going to do is create five System.Data.SqlClient.SqlConnection objects:

 

$SqlConnection1 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)

$SqlConnection2 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)

$SqlConnection3 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)

$SqlConnection4 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)

$SqlConnection5 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)

 

All this does is instantiate the SqlConnection objects utilizing the same connection string that we defined above.  Note: Connection Pools are going to be defined by the connection strings that are passed.  Different connection strings, different connection pools.  Now we want to open these connections.  I put a two second sleep in between each connection opening just to see a gradual increase in the pooled connection count with the PerfMon counter NumberOfPooledConnections for all instances of the .NET Data Provider for SqlServer object:

 

$SqlConnection1.Open()

Start-Sleep -Seconds 2

 

$SqlConnection2.Open()

Start-Sleep -Seconds 2

 

$SqlConnection3.Open()

Start-Sleep -Seconds 2

 

$SqlConnection4.Open()

Start-Sleep -Seconds 2

 

$SqlConnection5.Open()

 

Looking at the aforementioned counter in PerfMon, we see that the number of pooled connections goes from zero to five within the ten second duration.

 

clip_image001

 

We can also see this from sys.dm_exec_sessions, by filtering on the application name that I specified in the connection string:

 

select session_id, program_name

from sys.dm_exec_sessions

where program_name = 'ConnPoolTest';

 

clip_image003

 

Now, there is nothing new here.  There have been five connections, and there are five sessions that show this.  But connection pooling comes into play and really flexes its muscles when these connections are closed, and even disposed:

 

$SqlConnection1.Close()

$SqlConnection2.Close()

$SqlConnection3.Close()

$SqlConnection4.Close()

$SqlConnection5.Close()

 

$SqlConnection1.Dispose()

$SqlConnection2.Dispose()

$SqlConnection3.Dispose()

$SqlConnection4.Dispose()

$SqlConnection5.Dispose()

 

Write-Host "Connection1 State: $($SqlConnection1.State)" -ForegroundColor Green

Write-Host "Connection2 State: $($SqlConnection2.State)" -ForegroundColor Green

Write-Host "Connection3 State: $($SqlConnection3.State)" -ForegroundColor Green

Write-Host "Connection4 State: $($SqlConnection4.State)" -ForegroundColor Green

Write-Host "Connection5 State: $($SqlConnection5.State)" -ForegroundColor Green

 

The last five lines of code show the state of each of the connections, and should produce output like the following:

 

Connection1 State: Closed

Connection2 State: Closed

Connection3 State: Closed

Connection4 State: Closed

Connection5 State: Closed

 

But, re-executing the sys.dm_exec_sessions query above we still see that the same five sessions are alive and well:

select session_id, program_name
from sys.dm_exec_sessions
where program_name = 'ConnPoolTest';

 

clip_image005

 

This is connection pooling. PerfMon also shows us that these connections are indeed in the pool for later reuse:

 

clip_image006

 

Now when the provider needs to open up a new connection with a connection string correlated to an existing pool, it will simply take one of the inactive pooled connections so that the overhead of establishing a new connection to the instance isn’t incurred.
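
As a quick illustration of that reuse (a minimal sketch building on the objects created above; the variable name is just a placeholder), opening another connection with the original connection string is satisfied from the existing pool rather than by establishing a new physical connection:

# Served from the existing pool: NumberOfPooledConnections stays at five
# and no additional session shows up in sys.dm_exec_sessions.
$SqlConnectionReused = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
$SqlConnectionReused.Open()

# Return the connection to the pool again when finished.
$SqlConnectionReused.Close()
$SqlConnectionReused.Dispose()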

So throughout the above information, I’ve been hammering the point that connection pools are unique to the connection strings themselves.  Let’s see what it looks like with a different connection string. Modifying only the application name in the connection string, I create a new one and specify this for two more connections:

 

$ConnectionString2 = "data source=sqlbox1; initial catalog=master; trusted_connection=true; application name=ConnPoolTest2"

 

$SqlConnection6 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString2)

$SqlConnection7 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString2)

$SqlConnection6.Open()

Start-Sleep -Seconds 2

 

$SqlConnection7.Open()

 

$SqlConnection6.Close()

$SqlConnection7.Close()

 

$SqlConnection6.Dispose()

$SqlConnection7.Dispose()

 

Glancing back at the PerfMon counters (I have now added the NumberOfActiveConnectionPools counter), we see that these two new connections are in a separate connection pool due to a different connection string:

 

clip_image007

 

The green line is the NumberOfActiveConnectionPools counter.  It is one for the first five connections, as they all use the same connection string.  Specifying a different connection string for two additional connections shows us that the five original connections cannot be reused, so another connection pool is created for this new connection string, and it now holds the two new connections.

Looking again at what SQL Server sees (with a slightly modified WHERE clause to include the “new” application name):

 

select session_id, program_name

from sys.dm_exec_sessions

where program_name like ‘ConnPoolTest%’;

 

clip_image009

 

We can see above that the two connection pools’ connections are indeed there and pooled for later use, even though the connections that they originated with have been closed and disposed.

 

One of the biggest considerations with connection pooling is to ensure that proper disposal or closure of the SqlConnection object(s) is taking place.  When a client uses a connection object and doesn’t close or dispose of it, that object continues to consume a pooled connection instead of releasing it back to the pool for reuse.  Let’s take a look at a [scaled-down] example of this.  Notice in my connection string that I use the “Max Pool Size” parameter to cap this off at a value of five pooled connections, instead of using the default 100 max pool size.  A list of connection string parameters, including those related to connection pooling, can be found on the MSDN reference for the SqlConnection.ConnectionString property.  Below we specify “Max Pool Size” as five, and notice what happens when we attempt to open a sixth connection with the same connection string:

 

$ConnectionString3 = "data source=sqlbox1; initial catalog=master; trusted_connection=true; application name=ConnPoolTest; max pool size=5"

 

$SqlConnection1 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

$SqlConnection2 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

$SqlConnection3 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

$SqlConnection4 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

$SqlConnection5 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

$SqlConnection6 = New-Object System.Data.SqlClient.SqlConnection($ConnectionString3)

 

$SqlConnection1.Open()

$SqlConnection2.Open()

$SqlConnection3.Open()

$SqlConnection4.Open()

$SqlConnection5.Open()

 

# five connections opened with no issues

$SqlConnection6.Open()

# the above attempt for SqlConnection.Open() fails

 

When I made the call $SqlConnection6.Open(), I received the following error, describing quite well what probably happened (and in this case, did happen):

 

Exception calling "Open" with "0" argument(s): "Timeout expired.  The timeout period elapsed

prior to obtaining a connection from the pool.  This may have occurred because all pooled

connections were in use and max pool size was reached."

 

This is a common pitfall for web applications, and the above example illustrates why this can be a problem. Say a web application uses the default max pool size (100), and there are currently 100 active and consumed pooled connections.  Because those connections may not have been properly closed or disposed, the 101st attempt will result in a similar error and behavior.  Common ways to handle this are to close the connection in the finally block of a try/catch/finally, or, in C#, to use a using block, which automatically calls the IDisposable.Dispose() method at the end of the block.
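
For example, here is a minimal C# sketch of that using-block pattern (the connection string and query are placeholders, not code from the application above):

using System.Data.SqlClient;

class PooledConnectionExample
{
    static void Main()
    {
        // Placeholder connection string; adjust for your environment.
        string connectionString = "data source=sqlbox1; initial catalog=master; trusted_connection=true; application name=ConnPoolTest";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("select @@servername;", conn))
        {
            conn.Open();
            cmd.ExecuteScalar();
        } // Dispose() runs here even if an exception is thrown,
          // so the underlying connection goes back to the pool for reuse.
    }
}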

 

Even though this may very well be client-side technology and functionality, there are ways that we can keep an eye on connection pooling and “see” it happening within SQL Server. As with many diagnostic and monitoring requirements, we can observe connection pooling (or the lack thereof) with SQL Trace or Extended Events.  For SQL Trace, the events to capture are Audit Login, Audit Logout, and RPC:Completed (with a filter on ObjectName like “sp_reset_connection”).  You would see results similar to the following:

 

clip_image010

There are a few interesting aspects of the above example.  First off, the initial login for a different connection string and/or PID results in a nonpooled connection (notice the EventSubClass indicates whether it is Pooled [2] or Nonpooled [1]).  But when that connection is released back to the pool (logout), subsequent SqlConnection.Open() calls can reuse the pooled connection (provided there is one available).  Another thing to notice is that sp_reset_connection is called for pooled connections that are being reused.  This ensures that the connection’s context is reset (things like open transactions are rolled back) so that nothing unwanted propagates from the previous use of the connection.

As said above, this can also be monitored with Extended Events by capturing the login, logout, and the rpc_completed events (again, filtered on the object_name event field for the “sp_reset_connection” stored procedure).  This would look similar to the following:

clip_image012

 

We get virtually the same data we did in the SQL Trace, but in this case we have an event field named is_cached that will tell us whether the particular event is correlated to a pooled connection (much like above, the first login isn’t pooled and all subsequent logins are, with calls to sp_reset_connection in between).  Note: for both SQL Trace and XEvents, monitoring the login and logout events can be extremely noisy and chatty, so consider filtering events.
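
A minimal sketch of such an Extended Events session is below (the session name, application-name filter, and file target path are placeholders you would adjust for your environment):

create event session ConnPoolMonitoring
on server
add event sqlserver.login
(
    action (sqlserver.client_app_name)
    where (sqlserver.client_app_name = N'ConnPoolTest')
),
add event sqlserver.logout
(
    action (sqlserver.client_app_name)
    where (sqlserver.client_app_name = N'ConnPoolTest')
),
add event sqlserver.rpc_completed
(
    where (object_name = N'sp_reset_connection')
)
add target package0.event_file
(
    set filename = N'C:\Temp\ConnPoolMonitoring.xel'
);
go

alter event session ConnPoolMonitoring
on server
state = start;
go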

Another document to refer to is the MSDN reference on SQL Server Connection Pooling (ADO.NET). There is great information in that article that wasn’t expanded on in this blog post.

I hope this post has shed some light on what connection pooling is, where and how it is handled, and how it is seen from the client and the SQL Server instance.  Now the next time you, the DBA, are approached with questions about connection pooling, you will know the basic underlying constructs of how it works and can answer appropriately instead of assuming it is server-side functionality.

 

Thomas Stringer – SQL Server Premier Field Engineer

Twitter: @SQLife

SQL PFE @ PASS Summit

$
0
0

I have been a long time supporter and attendee of the PASS Summit.  Last year I even had the pleasure of being a first time speaker.  Unfortunately, this year I am unable to attend, but I wanted to take the time to tell you about a number of my fellow PFEs who will be attending and speaking at the summit in Charlotte.   The great thing about PFE is that we work with large scale customer environments every day.  We not only get to see a lot of strange issues, but we also help our customers with many strategic initiatives. A number of PFEs will be attending and helping out in the SQL Clinic alongside CTS and SQL CAT folks. Stop on by!

We have 3 SQL PFEs speaking this year! Their sessions are all going to be great and definitely worth attending:

Daniel Sol is presenting SQL Server Performance and Monitoring in Windows Azure at Scale

Tim Chapman ( @chapmandew ) and Denzil Ribeiro ( @denzilribeiro ) are giving an awesome session on Transaction Log Internals

Tim Chapman is teaching a session on Index Internals

Tim has also teamed up with Thomas LaRock (not a PFE, but no one is perfect) to present Query Performance Tuning: A 12 Step Method

These will all be great sessions! The PASS Summit is a great event for training, networking, and seeing how other companies solve similar problems.  It is a great way to get new ideas and find ways to improve performance and efficiency in your environment.

Here is more info on the PFE sessions:

SQL Server Performance and Monitoring in Windows Azure at Scale [CLD-307-M]

Speaker(s): Daniel Sol

Duration: 75 minutes

Track: Cloud Application Development & Deployment

This session will focus on lessons learned from performance tuning one of the largest Windows Azure SQL Database deployments in the world. We’ll work through the approach and methodology used in the process, especially in the non-functional testing and monitoring phase. We’ll also look at design decisions and match them to real examples, digging into how the monitoring requirements were defined and how the design was turned into reality.

SQL Server Transaction Log Internals [DBA-406-M]

Speaker(s): Tim Chapman, Denzil Ribeiro

Duration: 75 minutes

Track: Enterprise Database Administration & Deployment

The transaction log plays the most critical role in any SQL Server database. In this session, you’ll learn the importance of the transaction log and go inside the roles it plays inside the database engine. You’ll see how logging and recovery work, explore the checkpoint process and write-ahead logging, and walk through the steps you need to take as a DBA to ensure proper management of the transaction log for a SQL Server database.

SQL Server Index Internals: A Deep Dive [DBA-405-M]

Speaker(s): Tim Chapman

Duration: 75 minutes

Track: Enterprise Database Administration & Deployment

Have you ever wondered what an index actually looks like under the covers? Are you curious as to how SQL Server indexes retrieve data so quickly? In this session, we’ll discuss index internals and how SQL Server creates, maintains, and uses indexes internally for normal operations. We’ll cover topics such as index data structures, query optimization (briefly), and maintaining indexes and statistics. After this session, you’ll understand the DMVs available in SQL Server 2012 for viewing index internals, what data is available at each level of the index btree structure and how to make meaningful use of that data, and how data is retrieved through efficient index usage.

Query Performance Tuning: A 12-Step Method [DBA-316]

Speaker(s): Thomas LaRock, Tim Chapman

Duration: 75 minutes

Track: Enterprise Database Administration & Deployment

Performance tuning is hard; everyone knows that. But it can be faster and easier if you have a defined process to follow. This session breaks performance tuning down into 12 easy-to-follow steps to help you understand what actions to take (and when) to improve query performance. If you’ve ever been handed a query and told to “make it go,” these 12 steps are what you need to get the job done in the shortest amount of time.

I also want to give a shout out to fellow Microsoft support colleagues Bob Ward and Adam Saxton who will also be giving amazing sessions on SQL Server 2012 Memory and Power View Performance. Their sessions are always a must see.

Curt Mathews and Shon Hauck are giving a full day precon on Availability Groups as well. It is not too late to sign up for a precon, and this is a good one!

Since I won’t be there to cheer on my teammates, good luck guys!

Thanks,

Lisa Gardner

@SQLGardner


SQL PFE PASS Content Now Available!

$
0
0

First off, thanks to everyone who decided to attend a Microsoft SQL PFE session at the PASS Summit in Charlotte. We know you have a lot of great sessions to choose from, and we are thrilled if you happened to have chosen to listen in on one of ours. As promised, we are posting the demo code and slides from our sessions. The slides are already posted to the PASS site (the linkages have been corrected), so we will just post links below. The demos are also available for download.

SQL Server Performance and Monitoring in Windows Azure at Scale [CLD-307-M]
Speaker(s):
Daniel Sol
Duration: 75 minutes
Track: Cloud Application Development & Deployment
Slide Deck

This session will focus on lessons learned from performance tuning one of the largest Windows Azure SQL Database deployments in the world. We’ll work through the approach and methodology used in the process, especially in the non-functional testing and monitoring phase. We’ll also look at design decisions and match them to real examples, digging into how the monitoring requirements were defined and how the design was turned into reality.

SQL Server Transaction Log Internals [DBA-406-M]
Speaker(s): Tim Chapman, Denzil Ribeiro
Duration: 75 minutes
Track: Enterprise Database Administration & Deployment
Slide Deck , Demos 

The transaction log plays the most critical role in any SQL Server database. In this session, you’ll learn the importance of the transaction log and go inside the roles it plays inside the database engine. You’ll see how logging and recovery work, explore the checkpoint process and write-ahead logging, and walk through the steps you need to take as a DBA to ensure proper management of the transaction log for a SQL Server database.

SQL Server Index Internals: A Deep Dive [DBA-405-M]
Speaker(s): Tim Chapman
Duration: 75 minutes
Track: Enterprise Database Administration & Deployment
PASS TV Video
Slide Deck, Demos

Have you ever wondered what an index actually looks like under the covers? Are you curious as to how SQL Server indexes retrieve data so quickly? In this session, we’ll discuss index internals and how SQL Server creates, maintains, and uses indexes internally for normal operations. We’ll cover topics such as index data structures, query optimization (briefly), and maintaining indexes and statistics. After this session, you’ll understand the DMVs available in SQL Server 2012 for viewing index internals, what data is available at each level of the index btree structure and how to make meaningful use of that data, and how data is retrieved through efficient index usage.

Query Performance Tuning: A 12-Step Method [DBA-316]
Speaker(s): Thomas LaRock, Tim Chapman
Duration: 75 minutes
Track: Enterprise Database Administration & Deployment
Slide Deck , Demos

Performance tuning is hard; everyone knows that. But it can be faster and easier if you have a defined process to follow. This session breaks performance tuning down into 12 easy-to-follow steps to help you understand what actions to take (and when) to improve query performance. If you’ve ever been handed a query and told to “make it go,” these 12 steps are what you need to get the job done in the shortest amount of time.

Inadvertently Promoting a Local Transaction to a Distributed Transaction

$
0
0

I recently dealt with a customer issue where they were troubleshooting MSDTC, and upon hearing the explanation of exactly what they were doing, I was a bit surprised that a distributed transaction was being used.

Upon further investigation, they were unintentionally promoting a local transaction to a distributed transaction.  The reason behind this is that they were using the TransactionScope class to ensure the atomic nature of multiple operations.  Much like we already know about transactions, if operation 1 and operation 2 happen within a transaction, they either both succeed or they both fail, resulting in either a commit or a rollback.

The transaction’s inadvertent promotion to a distributed transaction came from their implementation, though.  Instead of reusing the same SqlConnection object for both of the operations’ SqlCommand objects, they created different ones. Even though both connections used the same connection string, this caused a distributed transaction where you wouldn’t normally expect one.

Before we dive into the actual code reproduction, I’m first going to create an Extended Events session so that we can monitor both the sql_transaction and the dtc_transaction events.  Below is my session definition:

create event session DtcMonitoring

on server

add event sqlserver.dtc_transaction

(

    action

       (

              sqlserver.server_principal_name,

              sqlserver.sql_text

       )

    where

       (

              sqlserver.server_principal_name = N'stringer\administrator'

              and sqlserver.client_app_name = N'DtcTesting'

       )

),

add event sqlserver.sql_transaction

(

    action

       (

              sqlserver.server_principal_name,

              sqlserver.sql_text

       )

    where

       (

              sqlserver.server_principal_name = N'stringer\administrator'

              and sqlserver.client_app_name = N'DtcTesting'

       )

)

add target package0.event_file

(

       set filename = N'\\<path to XEL file>\DtcMonitoring.xel'

)

with

(

       event_retention_mode = allow_single_event_loss,

       max_event_size = 0 KB,

       memory_partition_mode = none,

       track_causality = off,

       startup_state = off

);

go

 

-- start the session

alter event session DtcMonitoring

on server

state = start;

go

 

I am filtering this, as I know in my repro application I am setting the Application Name portion of my connection string to “DtcTesting”. This will keep the noise down and allow me to concentrate on the events I care about.

Now onto the code that reproduces this behavior.  Below I have a method that utilizes two SqlConnection objects, thereby promoting the transaction to a distributed transaction.  In the following code examples, I have placed the XE events that fired inline with the code, at the points where they were raised.

public void PerformDistributedTransaction()

{

    using (TransactionScope distrTrans = new TransactionScope())

    {

        SqlConnection dbConn1 = new SqlConnection(ConnectionString);

        SqlConnection dbConn2 = new SqlConnection(ConnectionString);

 

        SqlCommand sqlCmd1 = new SqlCommand();

        sqlCmd1.Connection = dbConn1;

        sqlCmd1.CommandText = "select * from humanresources.department;";

 

        SqlCommand sqlCmd2 = new SqlCommand();

        sqlCmd2.Connection = dbConn2;

        sqlCmd2.CommandText = "select * from humanresources.employee;";

 

        try

        {

            dbConn1.Open();

clip_image001

            dbConn2.Open();

clip_image002

 

            sqlCmd1.ExecuteNonQuery();

            sqlCmd2.ExecuteNonQuery();

        }

        catch (Exception ex)

        {

            throw ex;

        }

        finally

        {

            dbConn1.Dispose();

            dbConn2.Dispose();

            sqlCmd1.Dispose();

            sqlCmd2.Dispose();

        }

 

        distrTrans.Complete();

    }

clip_image003

}

Note:  In the above code, I am being explicit with my SqlConnection.Dispose() calls, placing them in the finally block of my try/catch.  I am by no means illustrating this as a best practice, but it keeps the code clutter down compared with wrapping everything in using blocks that call IDisposable.Dispose() at the end, allowing us to concentrate on the lesson at hand.

I have highlighted the important parts of the code in yellow.  By instantiating two SqlConnection objects even with the same connection string, and having these objects separately set as the SqlCommand.Connection properties for the individual operations, I am causing the transaction to get promoted to a distributed transaction. 

This is precisely the behavior I was seeing, and the behavior that wasn’t necessary or desired. The fix was to use a single SqlConnection object for both operations’ SqlCommand.Connection properties. Below is an example of how this is done:

public void PerformNonDistributedTransaction()

{

    using (TransactionScope nonDistrTrans = new TransactionScope())

    {

        SqlConnection dbConn1 = new SqlConnection(ConnectionString);

       

        SqlCommand sqlCmd1 = new SqlCommand();

        sqlCmd1.Connection = dbConn1;

        sqlCmd1.CommandText = "select * from humanresources.department;";

 

        SqlCommand sqlCmd2 = new SqlCommand();

        sqlCmd2.Connection = dbConn1;

        sqlCmd2.CommandText = "select * from humanresources.employee;";

 

        try

        {

            dbConn1.Open();

clip_image004

 

            sqlCmd1.ExecuteNonQuery();

            sqlCmd2.ExecuteNonQuery();

        }

        catch (Exception ex)

        {

            throw ex;

        }

        finally

        {

            dbConn1.Dispose();

            sqlCmd1.Dispose();

            sqlCmd2.Dispose();

        }

 

        nonDistrTrans.Complete();

    }

clip_image005

}

After this code logic runs, the logged events show us that we are no longer resorting to a distributed transaction for this operation.

In short, if you are seeing unintended distributed transactions, one place to look is how you are using your connection objects; you may be unintentionally promoting a local transaction to a distributed transaction.

 

Thomas Stringer – SQL Server Premier Field Engineer

Twitter: @SQLife

SQL Diagnostics Project Part 1 – Configuring Custom SQL Data Collections

$
0
0

This blog post will be part 1 in a multi-post series where I show you how to create your own custom SQL Server troubleshooting data collections, how to load the data into a SQL Server database, and how to create your own custom reports based on the data you collect. 

The first step in this process is becoming familiar with Diag Manager from CodePlex.  This tool is practically identical to the pssdiag tool that we SQL engineers use to capture data from a customer’s system.  The tool comes with a simple UI that allows you to pick and choose which data to capture from a given SQL Server machine. 

Note: You install and configure this tool from a client machine – not a target SQL Server machine.  The tool will create a .cab file that you place on a SQL machine to capture information.  The DiagManager tool is just used to decide which data you want to capture. It basically creates a configuration file for SQLDiag, a capture utility that comes with SQL Server – DiagManager is simply a UI used to create configuration files for SQLDiag.


This is a tool that has been out for a long, long time and I’m still amazed that it’s not used more. 

The benefits of using DiagManager as your tool of choice for capturing SQL Server diagnostic data include:

1.      It’s free. A great and flexible tool that won’t cost you a cent.  No marketing included.

2.      It’s easy to use.  Once you get a feel for the interface after this post, you’ll be able to create custom configurations to capture data from any server in your environment.  (More on this later)

3.      It collects perfmon data.  A lot of free tools out there don’t.  Data gathered from perfmon can help you solve a number of problems that aren’t possible through DMV collections.  Besides Wait Stats captured through DMVs, perfmon data is the first data I look at from a capture.

4.      It’s completely configurable.  You don’t like what data it captures by default?  Change it.  Under the covers this tool is basically creating configuration files for SQLDiag to use, so if you can imagine it – this tool can likely capture it.  Great for consultants that like to use their own scripts, but want their customers to capture the data.

5.      It collects SQL Trace data – which can be used in conjunction with perfmon data to correlate interesting events that occur inside the database engine.

6.      It is a consistent troubleshooting solution that will collect the same data from each server for evaluation purposes. 

Getting Started

Once you have downloaded and installed the tool from CodePlex onto a client machine somewhere, open the tool to have a look at it.  The screenshot below is what you’ll see.  Let’s go through the UI:

clip_image002

 

The default CPU architecture selected is 32-bit (which is the first choice in the Platform list).  More than likely you’ll be running this capture against a 64 bit server, so make sure to choose the AMD-64 button.  If the target server is 32-bit, stick with the default.  Also, make sure this is the first option you choose.  Switching from 32 to 64 bit will undo all previous choices you’ve made for the capture.  Not a huge deal – but can be frustrating.  If you’re running this capture on a clustered instance of SQL Server, you’ll want to specify the SQL Virtual Name as the machine name.  More information on this here:  http://diagmanager.codeplex.com/wikipage?title=RunningCluster

clip_image003

 

If you plan to capture diagnostic information against all of the instances on the machine where the diag is captured, then you can accept the defaults here.  Otherwise, specify an instance name if you only want to capture data against a single instance on the machine.  You’ll have to execute the pssdiag package this tool creates on the server itself, so leaving the Machine name input box as “.” should be fine.  (This tool doesn’t allow for executing the pssdiag capture remotely.)  Also, make sure you choose the correct version of SQL Server that you plan to be capturing diagnostic data from.  If you do not, the capture will error when you attempt to gather data, and you’ll have to start all over.

Unfortunately, at the time of this writing this tool doesn’t capture SQL Server 2012 information natively. I can’t say when the tool will be updated to allow for this through the UI.  The batch file this tool creates can be adjusted to allow for capturing data from a SQL Server 2012 instance.  I’ll outline how to do this in a future blog post.

clip_image004

One great asset of this tool is that it gives you the ability to capture perfmon data.  You are able to specify the objects that you want to capture along with the max file sizes and the capture interval (the capture interval is global across all perfmon counters).  I always capture perfmon data – the overhead of the collection is minimal and it can really give you a lot of insight as to what is going on for a given server.  It’s a wealth of information and can help you uncover problems that you wouldn’t be able to find with DMV information alone.  And, if you wanted to, you could feed the output from the perfmon capture to the Performance Analysis of Logs (PAL) tool. 

clip_image005

 

By default, DiagManager will capture SQL Trace information.  However, only on rare occasions do I actually collect it.  Capturing too many trace events can add a lot of overhead to an instance, and actually be detrimental to what you’re trying to accomplish overall.  Uncheck this box if you do not need to collect SQL Trace data.  If you do decide to collect Trace information, make sure to collect only the information you need.  If the capture runs for an extended duration, you’ll want to make sure you run the capture from a drive that has a lot of free space.

clip_image006

 

Now comes the fun part – the custom data collection.  From my perspective, this is the most powerful portion of this tool as it allows you to collect data from any custom SQL script you could imagine.  To enable it, right-click “_MyCollectors” and choose “Details”.  From there, a screen pops up that allows you to configure the collection to gather your personalized scripts.

clip_image008

I’ll say that again – whatever scripts you find handy, you can have pssdiag capture them for you.  Any script you want.  The output of these SQL scripts will be dumped to .OUT files, which can be analyzed by other tools, namely SQL Nexus (which I’ll cover in depth in my next blog post).

If you have a lot of scripts (and I do) then entering them in by hand may not be the most useful option for you.  So, one option is to modify the XML that holds this listing of scripts directly. These values are stored in the CustomDiag.XML file in the C:\Program Files (x86)\Microsoft\Pssdiag\CustomDiagnostics\_MyCollectors folder. 

To download the scripts that I send out for collection, you can grab them here.  My plan is to eventually have a communal location where people can post scripts that they find useful so others can make use of them as well.  Everyone loves useful scripts, and having a central location for them would be great for everyone.  If you have any comments on the scripts I’ve provided, please shoot me an email at timchap<at>Microsoft<dot>com.  I’d love to have feedback, and in the future I plan to put something together so people can post and review their own scripts for everyone to use.

clip_image010

 

You can edit this XML file directly to add your scripts. 

 

clip_image012

You can download my CustomDiag.XML file here.  You’ll just need to drop it into the location specified above.

You’ll also want to put your custom SQL scripts in the same location. 

clip_image013

 

If you look at the scripts I’ve made available, you’ll notice a PRINT statement at the beginning of each one (and intermingled in some others) that describes the type of data that script is collecting.  The output from the PRINT statement will be present in the output file that the pssdiag captures. 

Why should you care? I’m glad you asked.  Smile 

We will use these tags to identify data sets and load them into SQL Server tables automagically using SQL Nexus.  (I’ll focus on how to do that in my next post.)  If you set up a capture that gathers data at regular intervals during the process, it is also a good idea to inject a GETDATE() into the PRINT statement at the beginning, as shown below.
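
For example, a custom collector script might be tagged along these lines (a minimal sketch; the tag text and the DMV query are just placeholders for whatever your script actually collects):

PRINT '-- MyWaitStats --';
PRINT CONVERT(varchar(30), GETDATE(), 121);  -- timestamp each collection interval

SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;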

 

clip_image014

 

After you’ve made your choices for the data to gather, click the Save button and choose where to place the pssd.cab file.  It is this file that you will place on the SQL Server for capturing data.  By default, this file is saved to the C:\Program Files (x86)\Microsoft\Pssdiag\Customer directory on your client machine.  If you’ve chosen SQL 2008 as the instance you’ll be collecting from, you’ll get the following popup asking you whether you’re collecting from a SQL Server 2008 or SQL Server 2008 R2 instance.  Choose the correct option for the instance you’re collecting from.

clip_image015

Note:  Pay no attention to the typo in the screen above. It’s done on purpose for dramatic effect.  Smile

 

clip_image016

After you’ve placed the pssd.cab file onto your SQL Server machine (preferably in a dedicated folder – potentially with a lot of free disk space if you’re capturing SQL Trace information), double-click the cab file to extract the contents (copy all of the extracted files and paste them somewhere).

clip_image017

 

Double-click on the pssd.cmd file (shown below) to start the data capture.  When you’ve run the capture for a sufficient amount of time (that will depend on how long you think you need to capture the data), hit Ctrl+C to stop the capture.  If you just need to snapshot the server to get a few vital bits of information, starting the capture and immediately shutting it down may be fine.  However, if you need more information than that, running the capture longer may be necessary.  It isn’t unheard of for me to have clients run the capture for several hours. If you do this, just be careful if you’re capturing SQL Trace information, as it can result in a large data capture.

Note:  The capturing of this information can also be automated through the use of a SQL Agent job.

Once the capture shuts down, it saves the data to an output folder in the same location where you started the capture – in my case the C:\TimLovesPSSDiag folder.

clip_image018

 

Taking a look at the output folder we can see a glimpse of the breadth of data that this tool collects, both by default and through customization. 

clip_image020

In my next post in this series I will detail how you can make use of SQL Nexus to import the custom data you collected through your DMV scripts into SQL Server tables.  In the post after that I will introduce some custom reports that I have made from the data I typically collect from customer environments, and share a Powershell script that will export the data from those reports to an Excel file that will be very close to “Customer Ready”.  This report will allow you to show your customers the specific issues they may be experiencing on their systems.  This way you can focus on planning remediation steps with your servers or your customers rather than worrying about gathering the data. 

Here is the link to the downloads in case you missed the links above.

See you next time!

Tim Chapman
@
chapmandew

Generating a trusted TDE Certificate in the proper format from a Certificate Authority

$
0
0

I recently worked with a customer who was attempting to deploy Transparent Database Encryption using a trusted certificate which was generated by a certificate authority (CA). They were unable to import the certificate using the CREATE CERTIFICATE command as it kept failing with a 15468 error. This blog post attempts to explain this error and demonstrates a solution to the problem.

The process of creating a trusted certificate involves using a cryptographic tool to generate a private key which is then submitted to a certificate authority (CA), which will in turn generate a certificate. Microsoft offers a MAKECERT utility that is useful for testing but not recommended for production environments. Other cryptographic tools like the open source OPENSSL are useful for generating private keys that adhere to the strict x.509 cryptography formats.

As a general best practice using EKM is preferable to generating the keys manually as it makes a distinct separation between the key and the database being protected by making the key inaccessible by the SQL Server engine.

SQL Server is capable of using certificates which incorporate the .DER (Distinguished Encoding Rules) file format. These files are binary encoded certificates which can typically have a CER or CRT extension. While Certificate Authorities and cryptography tools like OpenSSL can encode in .DER file format, they can also encode certificates using .PEM or Privacy Enhanced Electronic Mail which uses Base64 formatting. Unfortunately the Base64 format is not compatible with SQL Server.

Some Certificate authorities store both the public and private keys in a personal exchange format or PFX format. SQL Server won’t be able to import these PFX files directly since the CREATE CERTIFICATE command is expecting a DER-encoded certificate (.CER). As a workaround, the Microsoft PVK converter utility can be used to convert the PFX certificates to PVK/DER format. More information can be found in KB Article 2914662, “How to use PFX-Formatted certificates in SQL Server”.

Any certificates that are used for encryption in SQL Server must use the DER formatting. Certificates coded in Base 64 format which are subsequently imported into SQL Server using the CREATE CERTIFICATE command will generate the following error:

Msg 15468, Level 16, State 6, Line 1

An error occurred during the generation of the certificate.

 

To correct this problem, the certificate needs to be converted into the DER format so that it can be read by SQL Server. This can be accomplished by having the Certificate Authority re-issue the certificate in the DER format, or optionally the certificate can be converted using the OpenSSL tool.
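
For example, using the file names that appear later in this walkthrough, a Base64 (PEM) encoded certificate can typically be converted to DER with OpenSSL like this (a sketch; adjust the paths for your environment):

openssl.exe x509 -in certificateb64.cer -inform PEM -out certificateDER.cer -outform DER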

I will demonstrate the process of creating the private key using OpenSSL, and then using the Windows Certification Authority tool on the Windows Server operating system to issue the certificates in both formats.

1. First we must install OpenSSL 1.x and the Visual C++ redistributables from http://slproweb.com/products/Win32OpenSSL.html

2. Open a command prompt, change to the c:\openssl-win32\bin directory, and execute:

set OPENSSL_CONF=c:\openssl-win32\bin\openssl.cfg

3. From the same directory, generate the private key

openssl.exe genrsa 2048 > private.key

4. Now we have to generate the certificate signing request file that will be used to request the certificate from the Certificate Authority or CA.

openssl.exe req -new -key private.key > certificate.csr

5. As part of the creation of the certificate request or CSR file, you will be prompted to answer several questions, where the Common Name will be the subject of the certificate.

6. Submit the CSR file to the Certificate Authority in order to request a certificate. I used the Windows Certification Authority tool to open the CSR file, then issue a certificate:

clip_image002 

I then exported it in two formats. The first file ‘certificateDER.cer’ was a DER encoded certificate while the second file ‘certificateb64.cer’ was in Base64 format.

clip_image004

7. When I attempt to import the base64 version of the certificate, it fails with a 15468 error:

CREATE CERTIFICATE My_New_Cert
FROM FILE = 'D:\Temp\CertTest\certificateb64.cer'
WITH PRIVATE KEY (FILE='D:\Temp\CertTest\private.pvk',
DECRYPTION BY PASSWORD = 'password');
Go
  

Msg 15468, Level 16, State 6, Line 1

An error occurred during the generation of the certificate.

8. If I switch to the DER encoded certificate, I am able to import the certificate successfully.

CREATE CERTIFICATE My_New_Cert
FROM FILE = 'D:\Temp\CertTest\certificateDER.cer'
WITH PRIVATE KEY (FILE='D:\Temp\CertTest\private.pvk',
DECRYPTION BY PASSWORD = 'password');
Go
  

Command(s) completed successfully.

9. I then confirm that the certificates are imported:

SELECT * FROM SYS.certificates

10. Finally, I use the DER encoded certificate to encrypt the database:

USE TDE
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE My_New_Cert
Go

ALTER DATABASE TDE
SET ENCRYPTION ON
GO
  

In summary, when importing a certificate into SQL Server from a certificate authority, be sure that the certificate is encoded in DER format. Otherwise the certificate will have to be converted to DER using third party tools like OpenSSL, or a DER version of the certificate will have to be requested from the Certificate Authority.

Greg Husemeier

SQL PFE has a Pre-Con at the PASS Summit!

$
0
0

At the 2014 PASS Summit in Seattle on Tuesday November 4, Denzil Ribeiro and I (Tim Chapman) will be giving a Pre-Conference seminar entitled “Troubleshoot Customer Performance Problems like a Microsoft Engineer.”  In this full-day session Denzil and I will cover a plethora of troubleshooting tools and methodologies that we use as field engineers when confronted with challenging performance related issues.  The first few hours of the day will be spent covering troubleshooting tools and methodologies such as wait statistics, extended events, PSSDiag, and SQL Nexus.  The rest of the day will be spent re-living some of these challenging performance issues and showing how to resolve them. 

This will be a day focused on SQL Server internals as they relate to performance troubleshooting.  It will be a fast-paced and fun day.  Expect a lot of interaction (and if we do our job correctly – a good deal of learning new things). 

We hope to see you there!

SSRS Subscriptions (What goes on “under the hood”)

$
0
0

One of the more difficult features to understand and troubleshoot that I’ve found in my experience working with SSRS customers is the SSRS subscription functionality, especially the subscription processing that happens before the report execution starts. Once we get to the report execution it’s fairly easy to figure out what is or isn’t happening from there, but until then all we see in the SSRS trace logs is a series of Event/Notification messages pertaining to a particular subscription, and not something that gives us an all-up view of what’s happening overall in SSRS.

To help with this I put together the flow chart below that shows the process a subscription goes through, whether it’s a standard subscription or a data driven subscription, inside of the Reporting Services catalog database (typically named ReportServer). Understanding this flow can be useful in determining if you have subscription events or notifications getting queued up in your database before the report execution is even attempted.

 

As you can see, the primary tables associated with subscription processing are the Event, Notifications, and ActiveSubscriptions tables.

  • The Event table will either hold a TimedSubscription event or a DataDrivenSubscription event. Doing a simple SELECT statement and looking at the EventType can show you if and what type of records are queuing up in there (if any); a quick aggregate query is shown after this list.
  • The Notifications table will have a single Notification record for a standard subscription (with an IsDataDriven value of 0) or potentially multiple Notification records for a data driven subscription (with an IsDataDriven value of 1). Most frequently when SSRS subscriptions get queued up I see the bottleneck in the Notifications table. You can use the following query to look at the queue inside the Notifications table to see it in the exact order that SSRS will process subscriptions:

    SELECT n.SubscriptionID, c.Name as ReportName, c.Path as ReportPath, u.UserName as SubscriptionOwner, n.ExtensionSettings,
    n.NotificationEntered as QueuedSinceTime, n.ProcessAfter, n.SubscriptionLastRunTime, n.DeliveryExtension
    from dbo.Notifications n with (nolock)
    inner join dbo.Catalog c with (nolock) on n.ReportID = c.ItemID
    inner join dbo.Users u with (nolock) on n.SubscriptionOwnerID = u.UserID
    WHERE n.ProcessStart is NULL and (n.ProcessAfter is NULL or n.ProcessAfter < GETUTCDATE())
    ORDER BY n.NotificationEntered
     

  • The ActiveSubscriptions table is used only for data driven subscriptions. This is where SSRS keeps track of how many Notifications for each data driven subscription have succeeded or failed (which is why the LastStatus column of the Subscriptions table for a data driven subscription will always say something like “Done: X Processed of Y total: Z errors”). The values for X, Y, and Z are pulled from the ActiveSubscriptions table.
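
As referenced in the first bullet above, a quick way to check whether anything is piling up in the Event table is a simple aggregate (a minimal sketch; only the EventType column mentioned above is assumed here):

    SELECT EventType, COUNT(*) AS QueuedEvents
    from dbo.[Event] with (nolock)
    GROUP BY EventType;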

You’ll also notice from the flow chart that the real “engine” behind subscription processing is the set of dedicated subscription threads within the ReportingServicesService.exe process. When bottlenecks occur, it may not be that we have too many subscriptions trying to process at once; it could be that our subscription threads are all tied up executing long running reports/subscriptions and we’ve hit the maximum number of threads we can spawn for subscription processing. We can configure this by modifying the MaxQueueThreads value inside of our RsReportServer.config configuration file. The default for this value is 0, which means it will be determined by the number of CPUs on the Report Server (usually two times the number of logical CPUs on the machine). It may be beneficial to turn this value up if subscriptions are getting queued up but the resources on the Report Server (and the SQL Server hosting the SSRS databases) are not being taxed.
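
For reference, the setting lives in the Service section of RsReportServer.config and looks roughly like this (a sketch only; the surrounding elements are omitted and can vary by SSRS version, and 0 means let SSRS decide based on the CPU count):

<Service>
    ...
    <MaxQueueThreads>0</MaxQueueThreads>
    ...
</Service>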

I’ve found in researching the web that there is little to no documentation out there for these particular tables. While that is partially by design, because we don’t want you making any modifications to records within these tables, it also ends up clouding our understanding of how those tables relate to subscription processing. Understanding how a subscription queue bottleneck happened and knowing how to identify it are one and the same, and both can be accomplished by understanding the flow “under the hood” of SSRS subscriptions and looking at the pertinent tables that give an idea of what the subscription queue looks like at any given point.

SSIS Tip: Using a Filepath Parameter for Multiple Flat Files

$
0
0

SSIS with SQL 2012 and above brought a lot of great enhancements that ease deployment and reconfiguration of packages. The project deployment model, SSIS Catalog, and parameters make it a lot easier to manage SSIS. It also helps decouple environmental configurations from the SSIS code. This way DBAs don’t need to be modifying packages or worrying about config files on the filesystem that may have passwords in plain text in them. It is awesome, and as someone who managed hundreds of SSIS packages in a previous role, this is a big win!

Note: I am going to assume that you readers are at least slightly aware of project and package parameters and how they work. If not, that’s ok, you will still get something out of this. You may just need to go do a little more research to better understand it.

Anyway, I have a package where I have 8 files I plan to load into 8 tables. This means that I have 8 flat file connection managers in my SSIS package. This is a quick and dirty package I am making for my internal team to use, which means we will be taking these 8 files that we have generated and loading them into a database on our local SQL instances to do some brief analysis. I want to parameterize the package so I can just hand over the dtsx file to my colleagues to reuse for themselves. This is not a traditional SSIS use case, but the problem is common….

I right click on the connection manager for one of the flat files and select “parameterize”. I create a new parameter with the file path and file name.

This is great! Just a few clicks and I created a parameter. Now here is the problem…. I create 8 of these for all the flat file connection managers.

Problem: Now when I hand the package off to a team member, he/she will need to modify 8 parameters for their environment. Since I know that all 8 files will be in the same directory, that just seems silly.

Solution: First of all, do not right click on the connection manager and select “parameterize”. Sounds silly but it will eventually be parameterized.

Create a parameter at the package or project level. For my use case, I am using package level for portability, but I highly recommend using the project deployment model and project level parameters. You can see my example here:

Now you go into each connection manager in the properties window and click on the ellipsis button next to Expressions (highlighted).

You will then select “Connection String” as the property and click on the ellipsis again to get into the expression builder.

For the expression, you drag and drop the InputFilePath parameter, then add the plus sign and the file name in quotes, as in the sketch below.
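
A rough example of what that expression ends up looking like (the file name here is just an example, not one of the actual 8 files):

@[$Package::InputFilePath] + "Customers.txt"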

Then click OK and notice that you will now see the f(x) marker on the connection manager icon. You see this same icon after you parameterize the connection manager via the GUI with right-click “parameterize”; this different method simply allows you to apply a parameter inside an expression.

Now simply repeat this process for all other connection managers. The only difference in the expression will be the file name.

Whether you run the package in SSDT or configure the package in the SSIS Catalog, you just need to modify the file path in one place to alter all 8 connection managers. You can see this in SSMS:

So when you are creating SSIS packages, think about ways to minimize the number of configurations needed. This will help minimize configurations needed during deployment.

The drawback to this approach is that you are now assuming that all those files will always be in the same directory. Make sure you consider whether or not that is what you want to require within your process. Rarely is there a solution that is ideal for every use case, but depending on your needs, this can be another solution to help simplify your package parameters.


Getting Started with Always Encrypted Part 1

$
0
0

New to SQL Server 2016 are several new security features, each aimed at protecting your data in a very specific way.  Dynamic Data Masking allows you to create rules to mask data that you choose so that lower-security users do not see the actual data in the table but rather a mask of it instead.  Row-level security  allows you to create security schemes so that predicate logic can be used to allow users to see only specific data in a table.  Always Encrypted, which I’ll be focusing on today, ensures that the data you choose is ALWAYS in an encrypted state – at rest and in transit.  Since the data is always in an encrypted state, you can prevent anyone from seeing the data – even the Database Administrators.  In fact, with this feature you can store your data in Azure and be assured that only your applications that retrieve data from the cloud can decrypt the data.

To begin the demo, I am going to expand my AdventureWorks2016 database and navigate to the Security folder in SSMS.  Inside the Security folder there is a folder for ‘Always Encrypted Keys’.  Expanding that folder I see two additional folders: Column Master Key and Column Encryption Key.

Column Master Key – a metadata entry that represents a key and is used to encrypt the Column Encryption Key.  The Column Master Key uses a certificate on the machine, Azure Key Vault, or a Hardware Security Module to encrypt the Column Encryption Key. The CMK must be created first.

Column Encryption Key – This is the key used to encrypt the data in the SQL Server tables.  This binary key is generated by using the Column Master Key.

To create a Column Master Key, you must have the thumbprint of a certificate available.  You DO NOT (and SHOULD NOT) have the certificate on the SQL machine itself.  What I did in my case was generate the certificate on my SQL machine through the ‘New Column Master Key’ UI, export the certificate to my client machine, and then delete the certificate on my SQL Server machine.  If the certificate is left on the SQL Server machine, then a user who has access to the certificate (the certificate would have to be stored in a User or Local machine store) could decrypt the data easily (more on this in a later post).

Recently released was a set of Powershell cmdlets that allow you to separate the roles of Security Administrator and DBA.  These cmdlets allow the Security Administrator to create and load the necessary certificate while allowing the DBA to create the necessary SQL Server objects (such as the Column Encryption Key) without the need of having the certificate physically on the SQL machine.  You can read more about how to use these cmdlets here.

So, in this case I choose to ‘Generate Certificate’.  The certificate is named “Always Encrypted Certificate”.  Here I choose to store it in the ‘Windows Certificate Store – Current User’.  I could also use Azure Key Vault to store the certificate; I’ll show how to do this in a later post.  However, for our purposes I am going to delete it very soon anyway.

Here I can see that the certificate was created in the Personal Certificates folder in the Current User store (in the Certificates snap-in in mmc).

If I were to go to my certificate that was created through the ‘Create Column Master Key’ wizard, I can see the thumbprint – which matches up to the KEY_PATH above (less the spaces):

The TSQL used to create the Column Master Key is below. The thumbprint for the certificate (highlighted) is what ties the Column Master Key to the certificate on the client machine when communication is initiated.
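
A minimal sketch of what that CREATE COLUMN MASTER KEY statement looks like (the key name and thumbprint below are placeholders, not the actual values from this walkthrough):

CREATE COLUMN MASTER KEY [AlwaysEncryptedCMK]
WITH
(
    KEY_STORE_PROVIDER_NAME = N'MSSQL_CERTIFICATE_STORE',
    KEY_PATH = N'CurrentUser/my/<certificate thumbprint>'
);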


Now that my Column Master Key has been created, I can use it to create (and encrypt) a Column Encryption Key.

The Column Encryption Key uses the Column Master Key to generate a varbinary key which can later be used to encrypt data in a SQL Server table.  Here I just give the Column Encryption Key a name and tell it to use the Column Master Key I created above to do the encrypting of the varbinary value.

Here is the TSQL to create the Column Encryption Key.  The value I have highlighted is the encryption key that was generated by using the Column Master Key.  It is abbreviated in this case (it’s too long to print).
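
A minimal sketch of that CREATE COLUMN ENCRYPTION KEY statement (the names are placeholders and the encrypted value is abbreviated):

CREATE COLUMN ENCRYPTION KEY [AlwaysEncryptedCEK]
WITH VALUES
(
    COLUMN_MASTER_KEY = [AlwaysEncryptedCMK],
    ALGORITHM = 'RSA_OAEP',
    ENCRYPTED_VALUE = 0x016E000001630075 /* ...abbreviated... */
);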

Since I used the method of creating the certificate on the SQL machine, at this point I can remove the certificate.  Had I gone the route of using Powershell to set up the necessary objects using role separation, I wouldn’t need to do this.  But, alas – I do.

I’ll first need to export it with the private key so that I can restore it to the client machine (I’ll cover this in my next post).  The main thing to make note of at this point is that the certificate is no longer needed on the SQL Server machine.  If I keep track of the TSQL statements generated from the above operations, I’ll never need to have that certificate on the SQL machine again.

Next up, I will create a table that will store encrypted data.  In the TSQL table definition, I need to specify the encryption algorithm to be used as well as if the comparisons against the column will be Deterministic or Randomized.  Deterministic encryption uses a method which always generates the same encrypted value for any given plain text value. Using deterministic encryption allows grouping, filtering by equality, and joining tables based on encrypted values, but can also allow unauthorized users to guess information about encrypted values by examining patterns in the encrypted column. This weakness is increased when there is a small set of possible encrypted values, such as True/False, or North/South/East/West region. Deterministic encryption must use a column collation with a binary2 sort order for character columns.  Randomized encryption uses a method that encrypts data in a less predictable manner. Randomized encryption is more secure, but prevents equality searches, grouping, indexing, and joining on encrypted columns.
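
A minimal sketch of such a table definition, assuming the column encryption key created above and the SSN/LastName columns referenced later in this series (the column sizes, which column gets which encryption type, the BIN2 collation choice, and the extra unencrypted FirstName column are illustrative):

CREATE TABLE dbo.People
(
    PersonID  int IDENTITY(1,1) PRIMARY KEY,
    SSN       char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = [AlwaysEncryptedCEK],
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
    LastName  nvarchar(50)
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = [AlwaysEncryptedCEK],
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
    FirstName nvarchar(50) NULL
);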


Now that my table is created, I can INSERT data into it.  However, since this table is encrypted the interaction must occur with an application that has access to the certificate I used to generate the Column Master Key.  I wrote an application to insert data into the table, which I will discuss in my next blog post. However, once the data has been inserted into the table, we can see that just viewing the data returns the encrypted form of the information.

Note:  I have sysadmin rights on this machine and the data is still returned in an encrypted manner.  More information on this coming…

SELECT *
FROM dbo.People

Thanks!

Tim Chapman

Getting Started with Always Encrypted Part 2

$
0
0

In this blog post I am going to continue discussing the new Always Encrypted feature in SQL Server 2016.  There are 2 main aspects to Always Encrypted – first is generating the Column Master Key and Column Encryption Keys in the database where the encrypted data will be stored.  Second is the usage of a specialized driver in the client applications so that the encrypted data can be used by the application.  This driver allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to SQL Server. It does this by automatically encrypting and decrypting sensitive data in the SQL Server client application. The driver encrypts the data in sensitive columns before passing the data to SQL Server, and automatically rewrites queries so that the semantics to the application are preserved. Similarly, the driver transparently decrypts data stored in encrypted database columns that are contained in query results.  To make use of the driver, you have to ensure that the .Net Framework version 4.6 or higher is installed on the client computer.

In this setup I have two separate machines – a server with SQL Server 2016 (named SQL2016A) installed and a client machine with Visual Studio 2013 installed – which I have named DevMachine.  On my DevMachine (which has the .Net 4.6 framework installed), I’ve written a small C# application that makes a connection to my SQL Server 2016 instance and calls a stored procedure to insert some data (the code of which I’ll reveal below).

Below is the CREATE TABLE statement I used in last week’s tip for creating the encrypted columns in the dbo.People table.

USE AdventureWorks2016

Here are the stored procedures that my application will call to INSERT and SELECT data from the dbo.People table.  Note that there is a SQL User in my database named Sam that the application will be running under.  So, I also give Sam the permission to EXECUTE these stored procedures.
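The procedures themselves were shown as screenshots in the original post; a minimal sketch of what they might look like (dbo.InsertPerson matches the name the application calls later; dbo.SelectPeople and the parameter sizes are assumptions based on the table above):

CREATE PROCEDURE dbo.InsertPerson
    @FirstName NVARCHAR(50),
    @LastName  NVARCHAR(50),
    @SSN       CHAR(11)
AS
BEGIN
    INSERT INTO dbo.People (FirstName, LastName, SSN)
    VALUES (@FirstName, @LastName, @SSN);
END
GO

CREATE PROCEDURE dbo.SelectPeople
AS
BEGIN
    SELECT PersonID, FirstName, LastName, SSN
    FROM dbo.People;
END
GO

GRANT EXECUTE ON dbo.InsertPerson TO Sam;
GRANT EXECUTE ON dbo.SelectPeople TO Sam;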


In the C# application I’ve written, I’ll set the connection string to my SQL2016A machine – in which I must specify a new option for making use of Always Encrypted.  This is the ‘Column Encryption Setting = Enabled’ option.  This allows DML operations against an encrypted table.

On my DevMachine (where the C# application will be running from) I have a certificate loaded that I originally created when setting up the Always Encrypted feature on my SQL2016A machine.  Once I had the feature set up, I exported the certificate with the private keys to a file.  I then imported that certificate on my DevMachine client machine.  Once I had done that, I deleted the certificate from the SQL2016A machine.  So, as of right now – the ONLY machine that has the certificate installed is my DevMachine client.  The certificate is not needed on my SQL Server machine (and in fact, should NOT be there).

It is worth noting at this point that this deployment model may not be advantageous for your environment – especially if you have a large number of client machines that will be interacting with SQL Server, because each client must have the certificate installed locally.  In that case, storing the certificate in Azure Key Vault or another Hardware Security Module may be a better choice.

Here is a snippet of code from the application that calls the dbo.InsertPerson stored procedure.  Notice that I have to use the Parameters object.  Behind the scenes the driver is encrypting and decrypting the parameter values as necessary.  The stored procedure sp_describe_parameter_encryption is used to do this.
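The original snippet was shown as an image; here is a sketch of the kind of call the application makes (the server and database names, the SQL login Sam's password, and the sample values are assumptions):

using System.Data;
using System.Data.SqlClient;

class InsertPersonDemo
{
    static void Main()
    {
        // "Column Encryption Setting=Enabled" tells the ADO.NET driver to transparently
        // encrypt parameters bound to encrypted columns; behind the scenes it calls
        // sp_describe_parameter_encryption to find out which parameters need it.
        string connStr = "Data Source=SQL2016A;Initial Catalog=AdventureWorks2016;" +
                         "User ID=Sam;Password=<password>;" +
                         "Column Encryption Setting=Enabled";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand("dbo.InsertPerson", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // Values must be passed as parameters; the driver cannot encrypt literals
            // embedded in ad hoc SQL text.
            cmd.Parameters.Add("@FirstName", SqlDbType.NVarChar, 50).Value = "Tim";
            cmd.Parameters.Add("@LastName", SqlDbType.NVarChar, 50).Value = "Chapman";
            cmd.Parameters.Add("@SSN", SqlDbType.Char, 11).Value = "555-55-5555";

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}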

Once my application has called the stored procedure and inserted the record on my SQL machine, I can run a normal SELECT statement against the table.  Here we can see that the SSN and LastName columns are both encrypted.  In fact, even though I am connected to the SQL instance as a sysadmin, I still cannot decrypt the data to see the actual contents.  This is because the certificate is NOT located on the SQL machine (or any client machine that the sysadmin may be running queries from) – so my sysadmin account has no access to it.

USE AdventureWorks2016
GO

SELECT *
FROM dbo.People

Taking things a step further, I can try to use the new Column Encryption setting when connected to the SQL instance through the ‘Additional Connection Parameters’ option inside of SSMS:
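The value to enter on that tab is the same keyword used in the application's connection string:

Column Encryption Setting=Enabled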

I am able to connect just fine, but when I run a query to return data from the People table, I get an error.  The reason is that this query was not executed from a machine where the certificate lives (the DevMachine box).  Since the certificate does not exist on the SQL Server instance, there is no way for me to decrypt the data.  The only way the encryption/decryption can happen is if the certificate is available.  This is why any client application that needs to use Always Encrypted must have the certificate loaded on that machine (or, in the case of web clients, on a central application server).

SELECT *
FROM dbo.People

Now, if I log onto my DevMachine and issue the same query while using the ‘Column Encryption Setting = Enabled’ option, I am able to return the data in a decrypted fashion (which is how the application would use it).  Very cool!!

SELECT *
FROM dbo.People

Load Testing SSAS with PowerShell


Over the years, there has been much confusion as to the best way to load test Analysis Services. I have seen solutions ranging from something as simple as using ascmd all the way to completely custom console apps in Visual Studio. Load and unit testing has long been one of the more challenging aspects of Business Intelligence solutions, and I believe one of the major barriers to adoption has been the complexity required to set things up. Projects always have very tight deadlines, and "who has the time?" is something I hear a lot. Fortunately, with PowerShell and the Invoke-ASCmd cmdlet this becomes a very direct exercise and removes that barrier. This solution can be used against either Multidimensional or Tabular databases, with MDX, DAX, or DMX queries.

The Invoke-ASCmd cmdlet is part of the sqlps module for PowerShell. If you do not already have this module imported into your environment, the first step is to bring it in so you can run scripts against Analysis Services.

## Import the SQL Server Module.

Import-Module "sqlps" -DisableNameChecking

Once this is done, we can start running queries with the Invoke-ASCmd cmdlet. This cmdlet allows us to execute MDX, XMLA, DAX, DMX, or (with Tabular 2016 only) TMSL scripts against an Analysis Services server remotely. A very straightforward run of the cmdlet would look something like the below:

#Issue a query command against SSAS and output result to disk
Invoke-ASCmd -Server localhost\sql2012tabular -Database "AdventureWorks Tabular Model SQL 2012" -Query "SELECT { [Measures].[Number of Orders], [Measures].[Sum of SalesAmount] } ON COLUMNS ,
NON EMPTY { [Date].[Calendar].[Year] } ON ROWS
FROM [Model]"

Where the -Server parameter denotes the server you want to run against, -Database is the database you want to hit, and -Query is the query you want to run. This is very handy, but it can be extended by swapping the -Query parameter for the -InputFile parameter. This allows us to write and manage our queries separately, and instead of a single query, the input file can contain multiple queries separated by the GO operator. Looking at the same query above but with the InputFile parameter, we get the syntax below:

Invoke-ASCmd -Server localhost\sql2012tabular -Database "AdventureWorks Tabular Model SQL 2012" -InputFile "C:\ASPowershellFolder\InputQueries\Query1.mdx"

And the queries that live in the Query1.mdx file are as follows:


SELECT NON EMPTY { [Measures].[Sum of SalesAmount], [Measures].[Number of Orders] } ON COLUMNS

, NON EMPTY { ([Customer].[Occupation].[Occupation].ALLMEMBERS ) } ON ROWS

FROM [Model]

GO

SELECT NON EMPTY { [Measures].[Sum of SalesAmount] } ON COLUMNS

, NON EMPTY { ([Date].[Calendar].[Month].ALLMEMBERS ) } ON ROWS

FROM [Model]

GO

Ok, we now have a simple, elegant way to run one or many queries against SSAS through PowerShell. We can now extend our solution via the use of PowerShell workflows. Workflows let PowerShell run the same work many times, retry on failure, and queue up requests rather than firing them all at once. Without the workflow, if we executed the above script with 1000 concurrent queries, all 1000 would fire off at once; some would succeed and the rest would fail once we hit our limit. A workflow takes the 1000 queries and sends them in as managed requests up to the maximum the server can handle. I found the code below from Jamie Thomson over on sqlblog, which is very close to what we are looking for. By changing the function he created to call our earlier Invoke-ASCmd instead of a URI, we can kick off a specified number of requests in parallel against our database! The code to create the workflow is below:

#used the workflow code from Jamie Thomson's parallel workflows in Powershell tip at http://sqlblog.com/blogs/jamie_thomson/archive/2014/12/09/parallel-foreach-loops-one-reason-to-use-powershell-workflow-instead-of-ssis.aspx

#given a specified input file path and the database that it should be run under, runs a simulated load test against an SSAS server with the specified number of parallel connections

workflow asloadtest{

    Param($NumberofConnections)

    $array = 1..$NumberofConnections

    function ASLoadTest($i){
        Invoke-ASCmd -Server localhost\sql2012tabular -Database "AdventureWorks Tabular Model SQL 2012" -InputFile "C:\ASPowershellFolder\InputQueries\Query1.mdx"
    }

    foreach -parallel ($i in $array) {ASLoadTest $i}
}

cls

asloadtest 5

Now the only thing left for us to do is to call the workflow to execute, and pass in the number of connections that we want to execute in parallel. This is a very straightforward command:


asloadtest 5

Where asloadtest is the name of the workflow and 5 is the number-of-connections parameter. That’s it! With about 20 lines of PowerShell we have a scalable SSAS load testing harness. What’s really cool is that we can simply change Invoke-ASCmd to Invoke-Sqlcmd and reuse this exact same approach for load testing the database engine as well. How cool is that!? Now that we have the hard part done, the next step is to configure Analysis Services to capture the performance monitor metrics that we want to see. Over on his blog, Bill Anton has done an excellent job detailing which perfmon metrics you should be capturing for SSAS and why they are important (as well as a super helpful summary list at the bottom).
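For example, a sketch of the database engine variant (the instance name, database, and input file path are placeholders, not part of the original scripts):

workflow sqlloadtest{

    Param($NumberofConnections)

    $array = 1..$NumberofConnections

    function SQLLoadTest($i){
        Invoke-Sqlcmd -ServerInstance "localhost" -Database "AdventureWorks2014" -InputFile "C:\SQLPowershellFolder\InputQueries\Query1.sql"
    }

    foreach -parallel ($i in $array) {SQLLoadTest $i}
}

sqlloadtest 5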

The complete PowerShell scripts are available here on the TechNet gallery for download.

Easier SQL Server Cluster Upgrades with Windows Server 2016!


One of the major complaints that I hear from the field has to do with how hard and time consuming it is to upgrade Windows Server versions and SQL Server versions in a clustered environment. In the entire history of clustering with SQL Server, there were very limited ways to accomplish an upgrade of the software, which would normally coincide with a hardware refresh cycle. This was a big problem, but now it’s something we can finally take control of to create our own migration and upgrade paths!

Introducing Windows Server 2016 (currently Tech Preview)!

Windows Server 2016 has many improvements over older versions; the one that many SQL Server DBAs will be happy about is the subject of today’s blog post. This new feature is called “Cluster Operating System Rolling Upgrade” and allows a Windows cluster currently running at least Windows Server 2012 R2 to have mixed versions of Windows Server in the cluster.

In the current version of Windows Server (2012 R2), when creating a cluster or adding a new node, one of the checks completed during these operations is the Windows version check, which requires all nodes to be running the same version (image below).

This in and of itself is extremely exciting – but let’s see it in action!

Upgrading Windows Server 2012R2 Cluster to Windows Server 2016 using AlwaysOn Availability Groups

What we’re going to walk through is upgrading the current Windows Server 2012 R2 cluster to Windows Server 2016. Additionally, we could also upgrade SQL Server along with the cluster upgrade, or at a later date.

Environment Overview

The current environment we’re going to upgrade is a two-node Windows cluster running SQL Server 2014. There is a single availability group that we want to keep running with as little downtime as possible. This environment starts out as a single subnet but will end up having multiple subnets. I’ll be showing this from the availability group perspective, but will also add in information for those of you who may be running FCIs.

Step 1 – Add in a new node to the existing cluster

The first step is to add a new node to the cluster, as this will keep cluster availability high. If possible, have this new node already running the new version of Windows Server, as this will save a step later. This node doesn’t have to live in the cluster the entire time, but it should be able to handle the workload of the node it is replacing (if repurposing the old node), or it should be the newer, upgraded hardware node (or Azure VM) running the latest Windows version. This new node should already have SQL Server installed on it if using Availability Groups. If using FCIs, add the node to the cluster and then choose the add-node option in the SQL Server installer.

Once the node is added, we’ll want to run the cluster validation wizard. Please note, if using FCIs *uncheck* the storage checks or the storage will be failed over to each node to test the infrastructure.

Note that there is a “warning” under system configuration.

This is a new warning message letting us know that the cluster can see that the nodes aren’t all running the same OS version. This would have caused an issue before; now the cluster will continue to run and operate as normal. This doesn’t mean the cluster should run in this down-level mode for very long, because new functionality cannot be used while the cluster functional level is not at the newest version.

Step 2 – Add Services to the New Node

After adding the node, I add a second subnet by creating the IP resource manually, so that the new node can be physically located anywhere. The cluster has now become a multi-subnet cluster.

Once the networking is setup, I can add the node to the current availability group and make sure everything is working before we start the node upgrade rotation.


Step 3 – Evict a Current Node and Upgrade Windows Server

The steps before this were done to make sure we don’t lose any availability (that’s why we build clusters!) by having only a single node in the cluster. If there are multiple nodes in your cluster already, the previous steps may not be needed and you can start at Step 3 (this step).

We’re going to evict one of the current Windows Server 2012 R2 nodes so that its operating system can be upgraded to Windows Server 2016 and the node can be introduced back into the cluster. In my environment I’m choosing the node that does not currently host the primary availability group replica; in this case, that’s WS12R2TO16N2. Before we evict the node from the cluster, we’ll want to be kind to our availability group and gracefully remove the replica from the AG. If using an FCI, this would be the time to use the remove-node feature of the FCI installer.
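On the primary replica, removing the secondary replica looks something like the following (the availability group name AG1 is a placeholder for your own AG name; WS12R2TO16N2 is the node being evicted):

-- Run on the primary replica
ALTER AVAILABILITY GROUP [AG1] REMOVE REPLICA ON N'WS12R2TO16N2';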

Once the node is evicted, upgrade the operating system (generally this is done with an image, which wipes the current server in the process). Once the node has the new version of Windows Server, we’ll repeat Step 1, which was to add a node into the cluster. In this case we’re going to add it back in, giving it the same node name that it had before.

Repeat Steps 2 and 3 until all but a single node remains.

Step 4 – Upgrading the Last Node

When we upgrade the last node, nothing different needs to be done. The only call-out is that this is the ONLY time in the entire process where our SQL Server services will encounter a small amount of downtime. Everything up to this point was done in the background without incurring any downtime; when we finally upgrade the last node running our services, there will be a very short disruption while we fail over the AG/FCI to one of the newer nodes in the cluster.

Again, don’t forget Step 2 after adding the node back in. When everything has been completed, go to step 5.

Step 5 – Upgrade the Cluster Functional Level

Until now, the cluster has been operating at the functional level of the lowest version across all nodes, which in this case was 2012 R2. Now that all of the nodes are on 2016 TP, the cluster needs to be told that it can operate at the higher functional level.

Before going any further, let’s run one more cluster validation wizard – again, if using SQL Server FCIs, please uncheck the disk checks.

You can see that running the cluster validation wizard again points out to us that we should update to the highest cluster functional level whereas before it was a warning that we were running a mixed version environment. So, let’s upgrade the cluster functional level!

Open a PowerShell prompt as a user that is a local admin on the cluster nodes.

Let’s check to see what the current cluster functional level is: (Get-Cluster).ClusterFunctionalLevel

Which we can see is currently set to “8”.

Now let’s update the functional level and check again. One last word of caution… once this is done, the cluster cannot be taken back down to a lower level (much like SQL Server and backups)! Please make sure you’re ready, as there is no going back. A similar warning will be shown upon invocation of the PowerShell cmdlet.
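The update itself is done with the Update-ClusterFunctionalLevel cmdlet; a sketch of the commands run here:

# One-way operation; the cmdlet asks for confirmation before making the change.
Update-ClusterFunctionalLevel

# Verify the new functional level.
(Get-Cluster).ClusterFunctionalLevel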

Great! Our update went well :) Let’s investigate what the functional level is now:

We can now tell that 8 = Windows Server 2012R2 and 9 = Windows Server 2016.

Our cluster is now running Windows Server 2016, and we were able to do the upgrade with very little downtime!

Step 6 – Final Cluster Validation Check

This is where one more cluster validation check should be run in order to “double check” the cluster. It doesn’t take very long or many resources to do this and I always like to double check all changes made to clusters.

In addition, take a look at the recent cluster events. This is the point where we should no longer be receiving event 1548:

Azure RM: SQL Server AlwaysOn Availability Groups Listener configuration with Azure External Load Balancer


I want to thank Ruben Gonzalez for his guidance.

 

This blog post explains how to configure a SQL Server AlwaysOn availability group with an external listener for Azure VMs running in the Resource Manager model.

Pre-requisites

Before starting, ensure that the environment is ready by deploying the VMs in the Azure RM model:

  • Create a Resource Group.
  • Create a Virtual Network.
  • Create an Availability Set. (Required)
  • Create a Network Security Group. (Optional)
  • Deploy Azure VMs:
    1. One VM for the Domain Controller with an Active Directory.
    2. Two SQL Server VMs deployed into the VN Subnet and joined to AD Domain.
    3. One VM to configure the File Share Witness Quorum Model.
    4. Two availability groups with two synchronous-commit replicas of an availability database.

01_AzureRM_SQLAG_External_listener

Concepts

Internet Facing load balancer

Azure load balancer maps the public IP address and port number of incoming traffic to the private IP address and port number of the virtual machine and vice versa for the response traffic from the virtual machine.

Azure Load Balancer contains the following child resources:

  • Front end IP configuration – contains public IP addresses for incoming network traffic.
  • Back end address pool – contains network interfaces (NICs) for the virtual machines to receive network traffic from the load balancer.
  • Load balancing rules – contains rules mapping a public port on the load balancer to a port in the back end address pool.
  • Inbound NAT rules – contains rules mapping a public port on the load balancer to a port for a specific virtual machine in the back end address pool.
  • Probes – contains health probes used to check availability of virtual machines instances in the back end address pool.

 

Configure the external load balancer.

With the following steps you will create and configure an Internet Facing Load Balancer and then you will configure the cluster to use the Public IP address from the load balancer for the AlwaysOn availability group listener.

 

1. Setup PowerShell to use Resource Manager

# To login to Azure Resource Manager
Login-AzureRmAccount
# To view all subscriptions for your account
Get-AzureRmSubscription
# To select a subscription for your current session
Get-AzureRmSubscription –SubscriptionName "Subscription Name" | Select-AzureRmSubscription

01_LoadSubscription

 

2. Create a Public IP address for the Front-End IP pool

Create an Azure Public IP address (PIP) resource, named MSPublicIP01, to be used as front-end with DNS name msagapp01.centralus.cloudapp.azure.com. The command below uses the static allocation type.

#Create a virtual network and a public IP address for the front-end IP pool

$publicIP = New-AzureRmPublicIpAddress -Name "MSPublicIP01" -ResourceGroupName "MSRGAlwaysON" -Location 'Central US' –AllocationMethod Static -DomainNameLabel "msagapp01"

02_CreatePIP_Out

 

3. Create Load Balancer with Front-End IP pool and a Back-End Address Pool

The following script will create a Load Balancer with these child items:

  • Front end IP configuration – the front-end IP of the load balancer will be the MSPublicIP01 resource.
  • Back end address pool – the child object that contains the NICs for the virtual machines to receive network traffic from the load balancer. In this case, the backend pool is the addresses of the two SQL Servers in your availability group.
  • Probes – the probe will be on port 59999 and will be validated every 5 seconds. The probe defines how Azure will verify which of the SQL Servers currently owns the availability group listener. Azure will probe the service based on IP address on a port that you define when you create the probe.
  • Load balancing rules – maps public port 2550 on the load balancer to port 2550 in the back end address pool. The load balancing rules configure how the load balancer routes traffic to the SQL Servers. For this load balancer you will enable direct server return, because only one of the two SQL Servers will ever own the availability group listener resource at a time.

 

##Create a Front-End IP pool and a Back-End Address Pool
#Front End IP
$frontendIP    =  New-AzureRmLoadBalancerFrontendIpConfig         -Name LB-MSFrontend -PublicIpAddress $publicIP

#BackEnd Adress Pool
$beaddresspool =  New-AzureRmLoadBalancerBackendAddressPoolConfig -Name LB-MSBackEnd

#Health Probe Port
$healthProbe   = New-AzureRmLoadBalancerProbeConfig -Name LB-MSHealthProbe -Protocol Tcp -Port 59999 -IntervalInSeconds 5 -ProbeCount 2

#Load Balancer Rule
#Important to note that for AlwaysOn Availability Group Listener the FrontEnd and BackEnd Port must be the same and EnableFloatingIP must be specified.
$lbrule        = New-AzureRmLoadBalancerRuleConfig  -Name LB-MSRuleSQLAG01 -FrontendIpConfiguration $frontendIP -BackendAddressPool  $beAddressPool -Probe $healthProbe -Protocol Tcp -FrontendPort 2550 -BackendPort 2550 -EnableFloatingIP -LoadDistribution SourceIPProtocol

#Create the Azure Load Balancer with the above configurations
#(no inbound NAT rules are defined in this script, so none are passed in)
$NRPLB         = New-AzureRmLoadBalancer -ResourceGroupName MSRGAlwaysON -Name MSLB -Location 'Central US' -FrontendIpConfiguration $frontendIP -LoadBalancingRule $lbrule -BackendAddressPool $beAddressPool -Probe $healthProbe

Note: In order to minimize complexity, the FrontendPort and the BackendPort in the load balancing rule are the same. It should also work with different ports, but the recommendation, as always, is to test, test and test.

 

4. Join the VMs’ NICs to the Backend Pool in the Load Balancer

Azure calls the backend address pool the backend pool. In this case, the backend pool is the addresses of the two SQL Servers in your availability group.

#Join the VMs' NICs to the Backend Pool in the Load Balancer
#Get NIC Name of VM1
$VM1= Get-AzureRmVM -ResourceGroupName MSRGAlwaysON -Name MSSQL01
$nic1Name=$VM1.NetworkProfile.NetworkInterfaces[0].Id
$nic1Name= $nic1Name.Substring(($nic1Name.LastIndexOf("/")+1) , $nic1Name.Length-($nic1Name.LastIndexOf("/")+1))
$nic1 = Get-AzureRmNetworkInterface -ResourceGroupName MSRGAlwaysON -Name $nic1Name

#Get NIC Name of VM2
$VM2= Get-AzureRmVM -ResourceGroupName MSRGAlwaysON -Name MSSQL02
$nic2Name=$VM2.NetworkProfile.NetworkInterfaces[0].Id
$nic2Name= $nic2Name.Substring($nic2Name.LastIndexOf("/")+1 , $nic2Name.Length-($nic2Name.LastIndexOf("/")+1))
$nic2 = Get-AzureRmNetworkInterface -ResourceGroupName MSRGAlwaysON -Name $nic2Name

# Join NICs to LB Backend Pools
$nic1.IpConfigurations[0].LoadBalancerBackendAddressPools.Add($NRPLB.BackendAddressPools[0]);
$nic2.IpConfigurations[0].LoadBalancerBackendAddressPools.Add($NRPLB.BackendAddressPools[0]);

$nic1 | Set-AzureRmNetworkInterface
$nic2 | Set-AzureRmNetworkInterface

 

5. Configure the Network Security Group Inbound Rules

A network security group (NSG) contains a list of Access Control List (ACL) rules that allow or deny network traffic to your VM instances in a Virtual Network. NSGs can be associated with either subnets or individual VM instances within that subnet. When an NSG is associated with a subnet, the ACL rules apply to all the VM instances in that subnet. In addition, traffic to an individual VM can be restricted further by associating an NSG directly to that VM.

 

# Configure the Network Security Group to Allow access over ports 1433 SQLSVC and 2550 SQL AG Listener
$nsg = Get-AzureRmNetworkSecurityGroup -ResourceGroupName MSRGAlwaysON -Name MSSG

$nsgrule1=Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg -Name sqlsvc -Description "Allow port 1433" -Access Allow -Protocol Tcp -Direction Inbound -Priority 1010 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 1433

$nsgrule2=Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg -Name sqlag01 -Description "Allow port 2550" -Access Allow -Protocol Tcp -Direction Inbound -Priority 1020 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 2550

Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg

Note: If you have a Network Security Group associated with each individual VM, then you have to execute the above script for every NSG.

 

6. Configure the cluster to use the load balancer IP address

The next step is to configure the listener on the cluster, and bring the listener online. To accomplish this, do the following:

  1. Create the availability group listener on the failover cluster
  2. Bring the listener online and configure the port number
  3. Open Firewall Ports

Create the availability group listener on the failover cluster
Go to the Failover Cluster Manager>Roles>SQLApp1 (AlwaysOn Availability Group)
On the Actions Pane click on Add Resource and then Client Access Point

03_Add_ClienAccessPoint

Add the Name > This will be the name of the Listener
04_CAP_Name

Next on the Confirmation Page
05_CAP_Confirmation

Next on the Summary Page
06_CAP_Summary

In the role, right-click the IP Address resource, open its properties, and set the resource name to IPListener1
07_RG_view

08_IPResourceName

On the cluster node that currently hosts the primary replica, open an elevated PowerShell ISE and paste the following commands into a new script.

# the cluster network name (Use Get-ClusterNetwork on Windows Server 2012 or higher to find the name)
$ClusterNetworkName = "SQLPublic"

# the IP Address resource name
$IPResourceName = "IPListener1"

# The IP address of the load balancer.
# This is the static public IP address (MSPublicIP01) of the external load balancer you configured in Azure RM.
$ILBIP = "40.83.10.48" #MSPublicIP01

Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{"Address"="$ILBIP";"ProbePort"="59999";"SubnetMask"="255.255.255.255";"Network"="$ClusterNetworkName";"EnableDhcp"=0}

Right-click the cluster resource SQLApp1 and take it offline
09_RG_AG_Offline

Then go to the resource properties > Dependencies tab and add the mssqlapp1 resource as a dependency.
10_Dependencies

Bring the listener online and configure the port number
Bring the resource SQLApp1 Online
11_RG_AG_Online

Verify that the Public IP is configured in the NIC
12_ValidateIP

Configure the listener port
Open SSMS, then go to Availability Groups > SQLApp1 > Listener > mssqlapp1, right-click, and select Properties
13_ListenerPort
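For reference, the same port change can presumably be made with T-SQL from the primary replica, using the names from this post:

ALTER AVAILABILITY GROUP SQLApp1 MODIFY LISTENER 'mssqlapp1' (PORT = 2550);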

Open Firewall Ports
Open a CMD prompt and create firewall rules to allow connections over ports 1433 (SQL Server service), 2550 (SQL AG listener), and 59999 (probe port)

netsh firewall add portopening TCP 1433 "Open Port 1433"
netsh firewall add portopening TCP 2550 "Open Port 2550"
netsh firewall add portopening TCP 59999 "Open Port 59999"
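Note: the netsh firewall context is deprecated on newer versions of Windows; if it is unavailable, the equivalent rules in the advfirewall context look like this:

netsh advfirewall firewall add rule name="Open Port 1433" dir=in action=allow protocol=TCP localport=1433
netsh advfirewall firewall add rule name="Open Port 2550" dir=in action=allow protocol=TCP localport=2550
netsh advfirewall firewall add rule name="Open Port 59999" dir=in action=allow protocol=TCP localport=59999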

 

7. Validate access over the listener inside the VMs and through the Internet

In the VM1

sqlcmd -S MSSQLAPP1,2550 -E -dAPP1 -Q"SELECT @@SERVERNAME"

14_Access1

In the VM2

sqlcmd -S MSSQLAPP1,2550 -E -dAPP1 -Q"SELECT @@SERVERNAME"

15_Access2

Over the Internet
Use the DNS name configured in the Public IP

sqlcmd -S msagapp01.centralus.cloudapp.azure.com,2550 -Usqladmin -dAPP1 -Q"SELECT @@SERVERNAME"

16_Access3

The objective is to test access through the listener from inside the VMs and through the external load balancer that uses the public IP with the DNS name and port 2550; both tests were a success!

 

8. Configure the ReadOnly routing list

Open SSMS, connect to the primary replica, and open a new query window

--SPECIFY TO ACCEPT READ-ONLY CONNECTIONS
ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL01' WITH (SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY))

ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL02' WITH (SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY))

--SPECIFY A READ_ONLY_ROUTING_URL
ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL01' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = 'tcp://mssql01.centralus.cloudapp.azure.com:1433'))

ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL02' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = 'tcp://mssql02.centralus.cloudapp.azure.com:1433'))

--SPECIFY A READ-ONLY ROUTING LIST
ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL01' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('MSSQL02','MSSQL01')))

ALTER AVAILABILITY GROUP SQLApp1
MODIFY REPLICA ON N'MSSQL02' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('MSSQL01','MSSQL02')))

 

9. Validate read-only access over the listener inside the VMs and through the Internet

In the VM1

sqlcmd -S MSSQLAPP1,2550 -Usqladmin -dAPP1 -Q"SELECT @@SERVERNAME" -KREADONLY

17_Access4

In the VM2

sqlcmd -S MSSQLAPP1,2550 -Usqladmin -dAPP1 -Q"SELECT @@SERVERNAME" -KREADONLY

18_Access5

Over the Internet
Use the DNS name configured in the Public IP

sqlcmd -S msagapp01.centralus.cloudapp.azure.com,2550 -Usqladmin -dAPP1 -Q"SELECT @@SERVERNAME" -KREADONLY

19_Access6
The objective is to test access through the listener with the ReadOnly option, so that the connection is routed to the secondary replica. These tests were executed inside the VMs and through the external load balancer that uses the public IP with the DNS name and port 2550; both tests were a success!

 

References

Azure Resource Manager Support for Load Balancer
From <https://azure.microsoft.com/en-us/documentation/articles/load-balancer-arm/>
 
Get started creating an Internet facing load balancer in Resource Manager using PowerShell
From <https://azure.microsoft.com/en-us/documentation/articles/load-balancer-get-started-internet-arm-ps/>
 
Internet Facing load balancer between multiple Virtual Machines or services
From <https://azure.microsoft.com/en-us/documentation/articles/load-balancer-internet-overview/>
 
Multi VIP Load balancer in ARM
From <https://blogs.technet.microsoft.com/espoon/2016/03/11/multi-vip-load-balancer-in-arm/>
 
What is a Network Security Group (NSG)?
From <https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-nsg/>
 
High availability and disaster recovery for SQL Server in Azure Virtual Machines
From <https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-sql-high-availability-dr/>
 
Configure Always On availability group in Azure VM manually – Resource Manager
From <https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-portal-sql-alwayson-availability-groups-manual/>
 
Configure an internal load balancer for an AlwaysOn availability group in Azure
From <https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-portal-sql-alwayson-int-listener/>
 
Azure ARM: SQL Server High-Availability and Multi-Datacenter Disaster Recovery with Internal Load Balancers (ILB)
From <https://blogs.msdn.microsoft.com/igorpag/2016/01/26/azure-arm-sql-server-high-availability-and-multi-datacenter-disaster-recovery-with-internal-load-balancers-ilb/>
 
 
If you are still reading this very long post, I want to say thank you!!! In part two, we are going to configure an additional availability group with a listener on the same external load balancer, but using a different public IP and port number.
Twitter @carlos_sfc
 
 
