Sunday, March 25, 2012

Change error logging for sys.sp_MSrepl_raiseerror

I have merge replication setup to a bunch of mobile subscribers. I want everything managed from the server and I want everything to run continuously. The reason we want this is that we want subscribers to automatically synchronize data within 60 seconds of plugging into the network from any location. We do not want to write anything to have to detect connectivity and initiate it. We do not want it initiated from a subscriber either. This creates a SEVERE logging issue and has halted a roll out at 2 different customers. We can't get beyond about 60 subscribers on a system that was running 3700 subscribers in SQL Server 2000. The server quite literally freezes and becomes non-responsive 1 - 2 times per day. We've tracked this back to the volume of logging into the Windows event log that is being caused by the replication engine.

When subscribers are disconnected, they throw constant errors, to the tune of 1 error every 2 minutes into the SQL Server error log as well as the Windows event log. This is because someone decided that these should be level 18 errors. There is no error. The publisher could not contact the subscriber, so I want it to simply log an error, shut up, go back to sleep, and then try again. I do NOT want a message in my SQL Server error log and I do NOT want a message in my Windows event log.

We are currently logging over 60,000 messages per day into both the Windows event log and the SQL Server error logs for something that we KNOW is a NORMAL operational state of the system. The merge agent doesn't have a parameter that I can feed it to ignore 14151 errors.

The culprit is in sys.sp_MSrepl_raiseerror. Since it is a system object, I can't override it and change the severity level of the error. So, right now we are stuck and the SQL Server error log as well as Windows event logs are being rendered quite useless on the system.

The offending code is:

raiserror (14151, 18, -1, @.agent, @.agent_name, @.message)

Does anyone have any idea how I can forcefully change sys.sp_MSrepl_raiseerror or in some other way suppress the logging of 14151 errors from the Windows and SQL Server logs?

More information.

We have continuous, push merge replication setup to a bunch of subscribers that routinely disconnect. It is merge replication, because we need to make changes at both publisher and subscriber in a disconnected model and this is exactly what merge replication was designed to do. It is setup as push, because we want all agents to be run from the distributor for centralized management and monitoring. It is continuous, because we want as low of latency as possible and for any subscriber that plugs in to begin synchronizing as soon as possible. In otherwords, we have a normal configuration, with very normal and reasonable requirements and we are using the merge engine in precisely the way it was designed and intended to be used.

However, it seems that running all of your agents at the distributor is absolutely incompatible with having disconnected subscribers with merge replication. It is quite literally impossible to run any architecture of any size with merge replication that has disconnected subscribers, unless you are running your agents at the subscriber and are then forced to manage them at the subscriber.

How did I reach this? Quite simply. replmerg.exe is the merge replication engine. It is coded to detect when a subscriber can not be contacted. (A normal state when a subscriber is disconnected.) When it detects one of these, it makes a call to sp_MSadd_merge_history90 and passes it a bunch of parameters, including a value of 6 for the run status. (This is coded directly into replmerg.exe and there is NO way to override this behavior.) When you pass a value of 6 for the run status to sp_MSadd_merge_history90, the last section of code makes a call to sp_MSrepl_raiserror and passes the value of 6 for the run status to it. When you pass a value of 6 to this procedure, it causes a level 18 error message of 14151 to be raised. Since anything level 16 and above is considered fatal, it writes to the SQL Server error log as well as the Windows event log.

Now, this is where the idiocy of this. The merge agent doesn't shut down when this error is thrown. It in fact goes to sleep, wakes up 1 minutes later, and goes through the same routine again. If the error was so fatal, why would the merge engine continue to run? Because, the NORMAL operational state of merge replication is to HAVE a subscriber disconnected.

So, we now have a situation where the merge engine is designed to run with disconnected subscribers as well as when those subscribers are disconnected, the merge engine throws level 18 error messages into the error log and Windows event log, and then goes to sleep so that it can continue to throw level 18 error messages for a situation that it was designed for. What does this mean? It means that as you add in several dozen subscribers, you get quite literally thousands of fatal error messages thrown into the event and error logs. Somewhere around the 60 subscriber mark, you throw fatal error messages at such as pace that the server itself quite literally freezes because of all of the error logging that it is forced to do based on a normal operational state.

So, from my testing thus far, using merge replication with push subscriptions is only suitable it you have 40 of fewer subscribers that are connected a majority of the time. If they are disconnected more than 50% of the time, I can only get about 30 subscribers running in a stable mode without causing the server to lock up. If I turn this around and run the agents at the subscriber where I lose any centralized management of replication, I can scale to well over 400 subscribers. Same publication, same subscription, same hosts. In both of these tests, I'm using multiple instances of SQL Server Express edition on a pool of 5 physical machines. I have ZERO data moving, the only activity happening is the replication engine simply trying to connect and poll for changes. In all of the push cases, the publisher locks up and becomes unresponsive requiring a reboot. The only discernable difference is that when agents are offloaded to subscribers, I don't throw level 18 errors into the event log and error log. Before someone says that it's also because I'm offloading the merge agent, I lock up the publisher in a push mode when I go beyond 30 subscribers who are mostly disconnected. However, I can also load up 100+ subscribers on a single machine in a pull mode (more than 3 times the number of agents as I could get in a push mode) without having a machine lock up. We've traced the cause of the server freeze back to the amount of error logging that is being forced on it and nearly 100% of the error logging is occuring from the single state of having a subscriber disconnected.

So, I've currently halted a couple of major customer upgrades of in excess of 5,000 processors, because this single issue would cause their environments to melt to the ground long before we ever go them upgraded. I have a deployment halted at another customer and one up in the air at a 4th customer. In all cases, I'm starting to evaluate 3rd party products to replace the replication engine, because it very clearly does NOT scale in a very normal configuration.

No comments:

Post a Comment