As some of you know, I have been fighting a bug on my system for well over a year which results in what appear to be random crashes in many plugins. I have reached out to many of you specifically trying to track this down (and probably drove a few of you crazy!), and have worked with HomeSeer and especially Rich in trying to find root cause. We were working under the assumption that this was a Mono issue, as the crash we see is deep in Mono remoting. Well turns out the issue is more generic than that, and I fully suspect happens on Windows as well. I wanted to document what is currently occurring, as it may be a bit of a surprise to you as I know we have all made certain assumptions in our coding that this bug renders incorrect.
Some data to start:
On the server side, HomeSeer sets an 8 minute timeout for RPC calls it makes to your plugin. Your plugin is linked to the HCSF.dll, and your plugin gets a 60 second API timeout value by default. On the server side there is a queue of pending RPC calls per plugin with ONE THREAD per plugin that dispatches them is sequential order. (Hopefully that line made you grimace!).
What this means is, say in your InitIO you launch a background thread (as the documentation urges you to do). That background thread calls GetDeviceEnumerator() (a call we will come back to a bit later, as its either our villain or our victim depending on your point of view). And then in your InitIO you then call RegisterWebPage. Most of us assumed that those calls, having come from separate threads, executed concurrently on the server. But in actuality its a race condition, whomever gets into the queue first executes first. So lets assume that GetDeviceEnumerator() is first in the queue and RegisterWebPage is second, and SetDeviceString is third (in your startup you call SetDeviceString() to set some device state to indicate you are running).
GetDeviceEnumerator, on larger systems can take over a minute to complete. Internally at 60 second (the default timeout) the thread waiting on the result from the server times out and throws an exception. That exception is caught and the code that process the server result continues. It tries to access a member on what should be a return message, but it's null. So a null pointer exception is thrown vs null being returned to the caller. That NPE propagates back to your plugin code.
But since the GetDeviceEnumerator thread sent the command to the server, its actually still executing. So the InitIO thread that called RegisterWebPage and SetDeviceString also timeout and throw null pointer exceptions. They would have normally complete in milliseconds, but they were never run yet. However, they are still queued on the server, so they do eventually run to completion.
Lets take another example, your InitIO starts the same background thread to call GetDeviceEnumerator() but also registers for HSEvent callbacks. In a device change callback you call GetDeviceString(). Well same issue occurs, now you wind up hanging the HSEvent callback for up to 60 seconds, and in this case you need the result (vs the sets above). Instead you get a null pointer exception.
Even if GetDeviceEnumerator() takes only 10 seconds, in the above example the HSEvent thread is hung waiting on the GetDeviceString() call for up to 10 seconds. This results in our fun 'Dropping event callbacks due to full queue' errors. You made what should have been a cheap call (GetDeviceString()) Nothing in the documentation tells you that call can take as long as the longest prior call on another thread in your plugin to complete.
On the receive side something similar happens, there is a single dispatch queue that receives messages from the server, deserializes them, and then returns the deserialized object to the caller. Since deserializing GetDeviceEnumerator() can take awhile (I see about 35 seconds on my machine) and other API's waiting for their result get held up until that is complete. If two threads in a plugin where to both call GetDeviceEnumerator the second thread would hit 70 seconds of delay (failing after 60), if the second thread simply was calling GetDeviceRef, it would also get held up for the entire deserialization time of the object.
I am literally working on the code to address this, but wanted to get my brain dumped into a message so you all had the context. The server side will move to a thread pool, so only the thread making a call to HS will hold up waiting on a result. It fixes the InitIO and HSEvent errors mentioned above. What it does not yet fix is the fact that GetDeviceEnumerator() is failing on some systems (on my system it can take 45-75 second JUST TO DESERIALIZE IT). I am also adding worker threads here to move deserialization off of the comm thread.
I am hoping Rich and I will find something else to explain why this is so slow and address it. I suspect this is slower for some reason on Mono vs Windows and that is part of the reason I have seen it more than others.
UPDATE: I have found a recent performance fix for the BinaryFormatter that was incorporated into .NET 4.7.2 and .NET 2, but is not in Mono. I opened a bug today at the Mono project asking for them to take the Microsoft fix. Further update: The BinaryFormatter is just slow as compared to other serialization choices and when multiple plugins call GetDeviceEnumerator on a SEL box which is a dual core 1.5ghz machine, times can be multiple minutes. I rewrote the code so the deserialization happens after the timeout and in the calling threads context. So the timeout is now just on the sender and for GetDeviceEnumerator it goes up to 7 minutes (the HS timeout is 8). This should resolve the issue for most cases, but for larger configs they may simply require a box with a bit more horsepower.
UPDATE2: I found a place in the code that put the RPC listener thread into a tight loop if it had no work to perform. Ironically the sequential nature of the older version meant this didn't happen often, but means that plugins that were idle would suddenly see a burst in CPU usage (something MattLo reported in one of my plugins I could never find). I fixed that tight loop today (thread sleeps if no work to do). The overall effect on the system was very very noticeable. Part of the reason seralization and deserialization was taking so long was that each plugin accounted for one such tight loop and a few idle running plugins could starve the system of CPU time.
But in the mean time we all have a long legacy trail of HS3 until a version with fixes for the above is released. This means we have to be a bit defensive in our coding.
Specifically:
Do not treat a null pointer exception from GetDeviceEnumerator() the same as an empty enumerator. If you receive a NPE, you must retry the call until it succeeds. It may take awhile. If you don't, and treat it as empty you may wind up creating new devices when you should not have. As an example, on my system one plugin which creates about 20 devices has actually has over 650 devices on my system (it keeps recreating the 20 at each startup if it receives the NPE). As you can imagine, if GetDeviceEnumerator() is already slow adding 650 devices makes the problem much worse and causes it to fail more often (vicious cycle).
Expect and handle NullPointerExceptions from HS calls, retry when appropriate. Know that RegisterWebPage is actually still queued on the server and will execute, but that calls you need the return value from will have to be retried. Do not let the NPE remain uncaught and cause your plugin to terminate, as then HS restarts it and you do your same Initialization including that which has already succeeded increasing load on the overall system.
If GetDeviceEnumeratorFails retry until you are asked to shutdown, but randomly delay between 30 seconds and 8 minutes (yes really). On a SEL box 2700 devices being returned to 4 callers can take 120 seconds to serialize and the same to deserialize. If you retry immediately you actually add load to the system as the call is still running in the background even though your thread is no longer waiting for it.
Cache the results from GetDeviceEnumerator(). Ideally call it no more than once in your logic (not including retries if that one time fails) and keep a hashes or dictionary of device you care about. Rich (I believe) is also considering adding a version where you can supply your INTERFACE_NAME and only get back your devices so you don't need to get a few thousand if your plugin only needs to see your own, move to this API when available.
Ideally do not register for HSEvent callbacks until you successfully finish your GetDeviceEnumerator call. Today the pattern is often InitIO starts your background thread and then does misc work like HSEvent registration. If you start processing HSEvents while the background thread is in GetDeviceEnumerator you will hold that thread until the call completes. If you don't need to make any calls into HS in your callback you are fine, but many callbacks call at least one HS API.
Do not due expensive work in your HS_Event callbacks. Use the thread pool for that work. Also use the HS_Event parameters to filter the work you consider doing. One plugin (author contacted) rebuilds his triggers on every CONFIG_CHANGE message, at startup there can be hundreds. This also directly leads to the queue overflowing errors.
Use a queue in your HSEvent callback, put the item in the queue and return immediately, if you do wind up hanging for a few seconds due to this bug, let your worker thread do it so the overall system is not effected.
I will update these recommendations as I learn more about why GetDeviceEnumerator() is so slow, and update with what version Rich expect any changes to this behavior to occur in. I suspect we will all need to urge our user base to upgrade, as it will likely be the only way to truly fix this completely.
Thanks for reading,
Bill
Some data to start:
On the server side, HomeSeer sets an 8 minute timeout for RPC calls it makes to your plugin. Your plugin is linked to the HCSF.dll, and your plugin gets a 60 second API timeout value by default. On the server side there is a queue of pending RPC calls per plugin with ONE THREAD per plugin that dispatches them is sequential order. (Hopefully that line made you grimace!).
What this means is, say in your InitIO you launch a background thread (as the documentation urges you to do). That background thread calls GetDeviceEnumerator() (a call we will come back to a bit later, as its either our villain or our victim depending on your point of view). And then in your InitIO you then call RegisterWebPage. Most of us assumed that those calls, having come from separate threads, executed concurrently on the server. But in actuality its a race condition, whomever gets into the queue first executes first. So lets assume that GetDeviceEnumerator() is first in the queue and RegisterWebPage is second, and SetDeviceString is third (in your startup you call SetDeviceString() to set some device state to indicate you are running).
GetDeviceEnumerator, on larger systems can take over a minute to complete. Internally at 60 second (the default timeout) the thread waiting on the result from the server times out and throws an exception. That exception is caught and the code that process the server result continues. It tries to access a member on what should be a return message, but it's null. So a null pointer exception is thrown vs null being returned to the caller. That NPE propagates back to your plugin code.
But since the GetDeviceEnumerator thread sent the command to the server, its actually still executing. So the InitIO thread that called RegisterWebPage and SetDeviceString also timeout and throw null pointer exceptions. They would have normally complete in milliseconds, but they were never run yet. However, they are still queued on the server, so they do eventually run to completion.
Lets take another example, your InitIO starts the same background thread to call GetDeviceEnumerator() but also registers for HSEvent callbacks. In a device change callback you call GetDeviceString(). Well same issue occurs, now you wind up hanging the HSEvent callback for up to 60 seconds, and in this case you need the result (vs the sets above). Instead you get a null pointer exception.
Even if GetDeviceEnumerator() takes only 10 seconds, in the above example the HSEvent thread is hung waiting on the GetDeviceString() call for up to 10 seconds. This results in our fun 'Dropping event callbacks due to full queue' errors. You made what should have been a cheap call (GetDeviceString()) Nothing in the documentation tells you that call can take as long as the longest prior call on another thread in your plugin to complete.
On the receive side something similar happens, there is a single dispatch queue that receives messages from the server, deserializes them, and then returns the deserialized object to the caller. Since deserializing GetDeviceEnumerator() can take awhile (I see about 35 seconds on my machine) and other API's waiting for their result get held up until that is complete. If two threads in a plugin where to both call GetDeviceEnumerator the second thread would hit 70 seconds of delay (failing after 60), if the second thread simply was calling GetDeviceRef, it would also get held up for the entire deserialization time of the object.
I am literally working on the code to address this, but wanted to get my brain dumped into a message so you all had the context. The server side will move to a thread pool, so only the thread making a call to HS will hold up waiting on a result. It fixes the InitIO and HSEvent errors mentioned above. What it does not yet fix is the fact that GetDeviceEnumerator() is failing on some systems (on my system it can take 45-75 second JUST TO DESERIALIZE IT). I am also adding worker threads here to move deserialization off of the comm thread.
I am hoping Rich and I will find something else to explain why this is so slow and address it. I suspect this is slower for some reason on Mono vs Windows and that is part of the reason I have seen it more than others.
UPDATE: I have found a recent performance fix for the BinaryFormatter that was incorporated into .NET 4.7.2 and .NET 2, but is not in Mono. I opened a bug today at the Mono project asking for them to take the Microsoft fix. Further update: The BinaryFormatter is just slow as compared to other serialization choices and when multiple plugins call GetDeviceEnumerator on a SEL box which is a dual core 1.5ghz machine, times can be multiple minutes. I rewrote the code so the deserialization happens after the timeout and in the calling threads context. So the timeout is now just on the sender and for GetDeviceEnumerator it goes up to 7 minutes (the HS timeout is 8). This should resolve the issue for most cases, but for larger configs they may simply require a box with a bit more horsepower.
UPDATE2: I found a place in the code that put the RPC listener thread into a tight loop if it had no work to perform. Ironically the sequential nature of the older version meant this didn't happen often, but means that plugins that were idle would suddenly see a burst in CPU usage (something MattLo reported in one of my plugins I could never find). I fixed that tight loop today (thread sleeps if no work to do). The overall effect on the system was very very noticeable. Part of the reason seralization and deserialization was taking so long was that each plugin accounted for one such tight loop and a few idle running plugins could starve the system of CPU time.
But in the mean time we all have a long legacy trail of HS3 until a version with fixes for the above is released. This means we have to be a bit defensive in our coding.
Specifically:
Do not treat a null pointer exception from GetDeviceEnumerator() the same as an empty enumerator. If you receive a NPE, you must retry the call until it succeeds. It may take awhile. If you don't, and treat it as empty you may wind up creating new devices when you should not have. As an example, on my system one plugin which creates about 20 devices has actually has over 650 devices on my system (it keeps recreating the 20 at each startup if it receives the NPE). As you can imagine, if GetDeviceEnumerator() is already slow adding 650 devices makes the problem much worse and causes it to fail more often (vicious cycle).
Expect and handle NullPointerExceptions from HS calls, retry when appropriate. Know that RegisterWebPage is actually still queued on the server and will execute, but that calls you need the return value from will have to be retried. Do not let the NPE remain uncaught and cause your plugin to terminate, as then HS restarts it and you do your same Initialization including that which has already succeeded increasing load on the overall system.
If GetDeviceEnumeratorFails retry until you are asked to shutdown, but randomly delay between 30 seconds and 8 minutes (yes really). On a SEL box 2700 devices being returned to 4 callers can take 120 seconds to serialize and the same to deserialize. If you retry immediately you actually add load to the system as the call is still running in the background even though your thread is no longer waiting for it.
Cache the results from GetDeviceEnumerator(). Ideally call it no more than once in your logic (not including retries if that one time fails) and keep a hashes or dictionary of device you care about. Rich (I believe) is also considering adding a version where you can supply your INTERFACE_NAME and only get back your devices so you don't need to get a few thousand if your plugin only needs to see your own, move to this API when available.
Ideally do not register for HSEvent callbacks until you successfully finish your GetDeviceEnumerator call. Today the pattern is often InitIO starts your background thread and then does misc work like HSEvent registration. If you start processing HSEvents while the background thread is in GetDeviceEnumerator you will hold that thread until the call completes. If you don't need to make any calls into HS in your callback you are fine, but many callbacks call at least one HS API.
Do not due expensive work in your HS_Event callbacks. Use the thread pool for that work. Also use the HS_Event parameters to filter the work you consider doing. One plugin (author contacted) rebuilds his triggers on every CONFIG_CHANGE message, at startup there can be hundreds. This also directly leads to the queue overflowing errors.
Use a queue in your HSEvent callback, put the item in the queue and return immediately, if you do wind up hanging for a few seconds due to this bug, let your worker thread do it so the overall system is not effected.
I will update these recommendations as I learn more about why GetDeviceEnumerator() is so slow, and update with what version Rich expect any changes to this behavior to occur in. I suspect we will all need to urge our user base to upgrade, as it will likely be the only way to truly fix this completely.
Thanks for reading,
Bill
Comment