One Experience Troubleshooting Transactions Over Linked Servers

The Backstory

This client had been plagued by poor table and index design, and regular application use would drive logical fragmentation of the indexes from under 5% to over 90% in a relatively short amount of time. So I built a solution using Service Broker to check index fragmentation weekly and REORGANIZE or REBUILD where necessary. Not long after this solution was in place, I got word that a few indexes were still getting too fragmented, even with the weekly maintenance. Before implementing the index solution, I would commonly rebuild the indexes on a table manually when performance degraded, and almost always that first step resolved the performance issue.
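The decision logic behind that maintenance job is the standard threshold approach. The sketch below is illustrative rather than the client's exact implementation: the Service Broker plumbing is omitted, and the 5%/30% cutoffs follow the commonly cited guideline.

```sql
-- Illustrative sketch of the REORGANIZE-vs-REBUILD decision.
-- Thresholds (5% / 30%) are the common guideline; the Service Broker
-- scheduling around this is omitted. Names here are not the client's.
DECLARE @tbl nvarchar(300), @idx nvarchar(130), @pct float, @sql nvarchar(max);

DECLARE frag CURSOR LOCAL FAST_FORWARD FOR
    SELECT QUOTENAME(s.name) + '.' + QUOTENAME(o.name),
           QUOTENAME(i.name),
           d.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') d
    INNER JOIN sys.indexes i ON d.[object_id] = i.[object_id] AND d.index_id = i.index_id
    INNER JOIN sys.objects o ON o.[object_id] = i.[object_id]
    INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
    WHERE d.index_id > 0                        -- skip heaps
      AND d.avg_fragmentation_in_percent > 5;   -- below 5%: leave it alone

OPEN frag;
FETCH NEXT FROM frag INTO @tbl, @idx, @pct;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'ALTER INDEX ' + @idx + N' ON ' + @tbl
             + CASE WHEN @pct > 30 THEN N' REBUILD;'      -- heavy fragmentation
                    ELSE N' REORGANIZE;' END;             -- light fragmentation
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM frag INTO @tbl, @idx, @pct;
END
CLOSE frag;
DEALLOCATE frag;
```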

The Event

Two principles come to mind when I recount this specific troubleshooting episode:

– Assume nothing
– Remember what tools are in your toolbox, and use them

Usually when we don’t adhere to these principles during troubleshooting, the effort takes much longer than necessary. Such was the case when I was tasked with troubleshooting application timeouts caused by long-running UPDATE and SELECT queries. The table in question is very small, with only 460 records. I checked BLOCKED_PROCESS_REPORT events and found hundreds of statements waiting on the table’s primary key. I’ve seen this a lot lately due to fragmentation, so I checked logical fragmentation on the table’s indexes. I like to use the following query to quickly check fragmentation on a specific table in a database.

-- NULL mode defaults to LIMITED, which is enough for avg_fragmentation_in_percent
SELECT i.name, d.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('<TABLE_NAME>'), NULL, NULL, NULL) d
INNER JOIN sys.indexes i ON d.[object_id] = i.[object_id] AND d.index_id = i.index_id
WHERE d.alloc_unit_type_desc = 'IN_ROW_DATA';
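As an aside on the BLOCKED_PROCESS_REPORT events mentioned above: SQL Server only emits them once the blocked process threshold has been configured. A minimal sketch (the 5-second value is illustrative; pick what suits your workload):

```sql
-- BLOCKED_PROCESS_REPORT events are only produced once this threshold is set.
-- 5 seconds is an illustrative value, not a recommendation.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold', 5;
RECONFIGURE;
-- The reports can then be captured with a server-side trace, Profiler,
-- or an Extended Events session on the blocked_process_report event.
```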

In this specific case, the fragmentation wasn’t nearly high enough to raise concerns. Even so, I had prematurely convinced myself that this was a fragmentation issue, despite logical fragmentation being under 10%. Next I used the Sysinternals tool Contig to check physical fragmentation of the database data file. Contig showed this file to be in 79 fragments, a nod to the days of 1 MB auto-growth. I got permission from the client to drop the database and restore it from backup. Once I did this, Contig showed the .MDF was in 1 fragment. The only problem was that this didn’t solve my performance issue. In fact, it had no noticeable effect. Next I thought that maybe the application code had gone awry, submitting the same or similar query against the same record multiple times from multiple SPIDs, something else I’ve seen before at this client. So I decided to fire up SQL Server Profiler to see if that would help figure out the issue.

Sure enough, once I dialed in the trace properties and started the trace, I could see plain as day what was going on. The table in question has a trigger on it that updates a table over a linked server. I started running some test update statements across the same linked server, but I didn’t get back any errors. This is when you need to remember that triggers fire in the same transaction as the statement that caused the trigger to fire. The key word here is “transaction.” My test updates were not running inside a transaction. As soon as I ran the updates inside a transaction, I got back errors. Furthermore, it was taking over a minute of execution time before encountering the error. The application timeout is set at 30 seconds, which is why the developers were getting a .NET timeout error instead of the SQL Server error messages I was getting.
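The reproduction looks something like this (server and table names are hypothetical placeholders, not the client’s): outside a transaction the linked-server update succeeds, but inside an explicit transaction the write gets promoted to a distributed (MSDTC) transaction, which is what the trigger experiences.

```sql
-- LINKEDSRV, RemoteDb, dbo.RemoteTable are placeholder names.
-- Run bare, this UPDATE succeeded. Wrapped in a transaction, the
-- linked-server write is promoted to a distributed transaction and,
-- with MSDTC traffic blocked, eventually errors out.
BEGIN TRANSACTION;

UPDATE LINKEDSRV.RemoteDb.dbo.RemoteTable
SET SomeColumn = 'test'
WHERE SomeKey = 1;

COMMIT TRANSACTION;
```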

Okay, so I know what’s wrong. It seems the firewall is blocking MSDTC traffic. I used Wireshark to look at the traffic going between the server and the linked server to be updated, filtering out anything that wasn’t MSDTC traffic or traffic to the linked server IP. This gave me a clear picture of what IP the MSDTC traffic was going out on. After checking the firewall on the linked server, I saw the IP where the traffic was coming from wasn’t in the allow list for the MSDTC rule. I added the IP, updated group policies, and BAM! the trigger succeeded.
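A quick way to verify a fix like this without touching the trigger is to force MSDTC involvement up front instead of relying on transaction promotion. Again, the names below are placeholders:

```sql
-- Forces MSDTC involvement immediately rather than via promotion.
-- LINKEDSRV, RemoteDb, dbo.RemoteTable are placeholder names.
BEGIN DISTRIBUTED TRANSACTION;

UPDATE LINKEDSRV.RemoteDb.dbo.RemoteTable
SET SomeColumn = SomeColumn
WHERE SomeKey = 1;

-- Commits only if MSDTC traffic clears the firewall.
COMMIT TRANSACTION;
```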

I came in the next day to reports of the SAME EXACT ISSUE! So what was going on? The server initiating the linked server update is actually a clustered SQL Server instance running on a two-node Windows cluster. In addition to this instance, the same Windows cluster hosts three more clustered SQL Server instances; this setup has 14 IPs in total. On that particular morning, the MSDTC traffic had switched to a different IP, one not even associated with the SQL Server instance initiating the update! Nor was it the IP associated with the clustered MSDTC service. So I ended up having to add all 14 IPs to the firewall rule. Problem solved. No new reports of the issue since. Whew.

Now I’m on a quest to understand why MSDTC traffic goes out any old IP! If I find the answer to this, you can bet there will be another blog post on that topic. Until then, enjoy!

One Response to “MSDTC, Triggers, and Firewalls. Oh my!”
  1. Bill

    There were a number of changes in Windows 2008 with respect to clustering (or not clustering) MS DTC – see http://technet.microsoft.com/en-us/library/cc730992%28v=WS.10%29.aspx.

    When MS DTC depends on the cluster service, if that dependency fails then MSDTC is expected to raise IDS_DTC_W_CLUSTER_SERVICE_NOT_RUNNING (http://technet.microsoft.com/en-us/library/dd300220%28v=ws.10%29.aspx). As its error message states, “if the cluster service is started [later] then please shut down this instance of MS DTC as it will become unusable,” with the message’s operative word being “will”. Transactions that MS DTC had started to distribute before the cluster service finally does start are still usable (and they may be ongoing, perhaps due to a misconfigured firewall…). Typically, a clustered resource that is a service should have its service set to start manually (not automatically), e.g. via services.msc. By using a manual service start, the cluster service should always start first.

    If a prior person (prior to a 2nd person’s engagement) were to have used cluadmin to delete a clustered MS DTC resource (as opposed to removing it), then the HKEY_LOCAL_MACHINE\Cluster key would have had its MS DTC entries deleted. But MS DTC would not be aware of that deletion, and will/should still be running – see the Note in http://technet.microsoft.com/en-us/library/cc771540%28v=WS.10%29.aspx. If that person had next recreated a clustered MS DTC resource, a 2nd person (who is oblivious to what the first person had done) would sniff unexpected MSDTC IP traffic. I suspect this is what had happened, but a timely presence of MS DTC’s event ID 4147 would suggest my prior paragraph is/was applicable (and suggests actions were taken by some prior person).

