Announcement

Collapse
No announcement yet.

Urgent: Defensive coding in your plugins (please read if you develop plugins for HS3)

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Urgent: Defensive coding in your plugins (please read if you develop plugins for HS3)

    As some of you know, I have been fighting a bug on my system for well over a year which results in what appear to be random crashes in many plugins. I have reached out to many of you specifically trying to track this down (and probably drove a few of you crazy!), and have worked with HomeSeer and especially Rich in trying to find root cause. We were working under the assumption that this was a Mono issue, as the crash we see is deep in Mono remoting. Well turns out the issue is more generic than that, and I fully suspect happens on Windows as well. I wanted to document what is currently occurring, as it may be a bit of a surprise to you as I know we have all made certain assumptions in our coding that this bug renders incorrect.

    Some data to start:

    On the server side, HomeSeer sets an 8 minute timeout for RPC calls it makes to your plugin. Your plugin is linked to the HCSF.dll, and your plugin gets a 60 second API timeout value by default. On the server side there is a queue of pending RPC calls per plugin with ONE THREAD per plugin that dispatches them is sequential order. (Hopefully that line made you grimace!).

    What this means is, say in your InitIO you launch a background thread (as the documentation urges you to do). That background thread calls GetDeviceEnumerator() (a call we will come back to a bit later, as its either our villain or our victim depending on your point of view). And then in your InitIO you then call RegisterWebPage. Most of us assumed that those calls, having come from separate threads, executed concurrently on the server. But in actuality its a race condition, whomever gets into the queue first executes first. So lets assume that GetDeviceEnumerator() is first in the queue and RegisterWebPage is second, and SetDeviceString is third (in your startup you call SetDeviceString() to set some device state to indicate you are running).

    GetDeviceEnumerator, on larger systems can take over a minute to complete. Internally at 60 second (the default timeout) the thread waiting on the result from the server times out and throws an exception. That exception is caught and the code that process the server result continues. It tries to access a member on what should be a return message, but it's null. So a null pointer exception is thrown vs null being returned to the caller. That NPE propagates back to your plugin code.

    But since the GetDeviceEnumerator thread sent the command to the server, its actually still executing. So the InitIO thread that called RegisterWebPage and SetDeviceString also timeout and throw null pointer exceptions. They would have normally complete in milliseconds, but they were never run yet. However, they are still queued on the server, so they do eventually run to completion.

    Lets take another example, your InitIO starts the same background thread to call GetDeviceEnumerator() but also registers for HSEvent callbacks. In a device change callback you call GetDeviceString(). Well same issue occurs, now you wind up hanging the HSEvent callback for up to 60 seconds, and in this case you need the result (vs the sets above). Instead you get a null pointer exception.

    Even if GetDeviceEnumerator() takes only 10 seconds, in the above example the HSEvent thread is hung waiting on the GetDeviceString() call for up to 10 seconds. This results in our fun 'Dropping event callbacks due to full queue' errors. You made what should have been a cheap call (GetDeviceString()) Nothing in the documentation tells you that call can take as long as the longest prior call on another thread in your plugin to complete.

    On the receive side something similar happens, there is a single dispatch queue that receives messages from the server, deserializes them, and then returns the deserialized object to the caller. Since deserializing GetDeviceEnumerator() can take awhile (I see about 35 seconds on my machine) and other API's waiting for their result get held up until that is complete. If two threads in a plugin where to both call GetDeviceEnumerator the second thread would hit 70 seconds of delay (failing after 60), if the second thread simply was calling GetDeviceRef, it would also get held up for the entire deserialization time of the object.

    I am literally working on the code to address this, but wanted to get my brain dumped into a message so you all had the context. The server side will move to a thread pool, so only the thread making a call to HS will hold up waiting on a result. It fixes the InitIO and HSEvent errors mentioned above. What it does not yet fix is the fact that GetDeviceEnumerator() is failing on some systems (on my system it can take 45-75 second JUST TO DESERIALIZE IT). I am also adding worker threads here to move deserialization off of the comm thread.

    I am hoping Rich and I will find something else to explain why this is so slow and address it. I suspect this is slower for some reason on Mono vs Windows and that is part of the reason I have seen it more than others.

    UPDATE: I have found a recent performance fix for the BinaryFormatter that was incorporated into .NET 4.7.2 and .NET 2, but is not in Mono. I opened a bug today at the Mono project asking for them to take the Microsoft fix. Further update: The BinaryFormatter is just slow as compared to other serialization choices and when multiple plugins call GetDeviceEnumerator on a SEL box which is a dual core 1.5ghz machine, times can be multiple minutes. I rewrote the code so the deserialization happens after the timeout and in the calling threads context. So the timeout is now just on the sender and for GetDeviceEnumerator it goes up to 7 minutes (the HS timeout is 8). This should resolve the issue for most cases, but for larger configs they may simply require a box with a bit more horsepower.

    UPDATE2: I found a place in the code that put the RPC listener thread into a tight loop if it had no work to perform. Ironically the sequential nature of the older version meant this didn't happen often, but means that plugins that were idle would suddenly see a burst in CPU usage (something MattLo reported in one of my plugins I could never find). I fixed that tight loop today (thread sleeps if no work to do). The overall effect on the system was very very noticeable. Part of the reason seralization and deserialization was taking so long was that each plugin accounted for one such tight loop and a few idle running plugins could starve the system of CPU time.

    But in the mean time we all have a long legacy trail of HS3 until a version with fixes for the above is released. This means we have to be a bit defensive in our coding.

    Specifically:

    Do not treat a null pointer exception from GetDeviceEnumerator() the same as an empty enumerator. If you receive a NPE, you must retry the call until it succeeds. It may take awhile. If you don't, and treat it as empty you may wind up creating new devices when you should not have. As an example, on my system one plugin which creates about 20 devices has actually has over 650 devices on my system (it keeps recreating the 20 at each startup if it receives the NPE). As you can imagine, if GetDeviceEnumerator() is already slow adding 650 devices makes the problem much worse and causes it to fail more often (vicious cycle).

    Expect and handle NullPointerExceptions from HS calls, retry when appropriate. Know that RegisterWebPage is actually still queued on the server and will execute, but that calls you need the return value from will have to be retried. Do not let the NPE remain uncaught and cause your plugin to terminate, as then HS restarts it and you do your same Initialization including that which has already succeeded increasing load on the overall system.

    If GetDeviceEnumeratorFails retry until you are asked to shutdown, but randomly delay between 30 seconds and 8 minutes (yes really). On a SEL box 2700 devices being returned to 4 callers can take 120 seconds to serialize and the same to deserialize. If you retry immediately you actually add load to the system as the call is still running in the background even though your thread is no longer waiting for it.

    Cache the results from GetDeviceEnumerator(). Ideally call it no more than once in your logic (not including retries if that one time fails) and keep a hashes or dictionary of device you care about. Rich (I believe) is also considering adding a version where you can supply your INTERFACE_NAME and only get back your devices so you don't need to get a few thousand if your plugin only needs to see your own, move to this API when available.

    Ideally do not register for HSEvent callbacks until you successfully finish your GetDeviceEnumerator call. Today the pattern is often InitIO starts your background thread and then does misc work like HSEvent registration. If you start processing HSEvents while the background thread is in GetDeviceEnumerator you will hold that thread until the call completes. If you don't need to make any calls into HS in your callback you are fine, but many callbacks call at least one HS API.

    Do not due expensive work in your HS_Event callbacks. Use the thread pool for that work. Also use the HS_Event parameters to filter the work you consider doing. One plugin (author contacted) rebuilds his triggers on every CONFIG_CHANGE message, at startup there can be hundreds. This also directly leads to the queue overflowing errors.

    Use a queue in your HSEvent callback, put the item in the queue and return immediately, if you do wind up hanging for a few seconds due to this bug, let your worker thread do it so the overall system is not effected.

    I will update these recommendations as I learn more about why GetDeviceEnumerator() is so slow, and update with what version Rich expect any changes to this behavior to occur in. I suspect we will all need to urge our user base to upgrade, as it will likely be the only way to truly fix this completely.

    Thanks for reading,
    Bill

    #2
    GetDeviceEnumerator, on larger systems can take over a minute to complete.
    Well, this makes me wonder:
    HomeSeers devices are stored in the database. A database lookup should take about a microsecond, even on "huge" systems with thousands of devices. Why is it taking ages?

    From what you are writing, I take that it is the deserialization that takes time. But when is the deserialization necessary?


    Rich (I believe) is also considering adding a version where you can supply your INTERFACE_NAME and only get back your devices so you don't need to get a few thousand if your plugin only needs to see your own, move to this API when available.
    YES! Good move.
    I think you can find a post from me requesting this a few years back.




    What I think we need is something similar GetDevices() function that returns a List<Device>, if possible. Then plugin (and script) development would be a little more modern and a lot easier.
    HSPro 3.0.0.458, Z-NET with Z-wave plugin 3.0.1.190, RFXCOM + 2x RFXtrx433E, HSTouch, Squeezebox plugin, iTach IP/WF2IR & GC-100-6 with UltraGCIR, BLDenon, NetcamStudio, Jon00s Webpage builder, Harmony Hub plugin, SCSIP (with FreePBX), Arduino plugin, IFTTT, Pushalot plugin, Device History plugin.
    Running on Windows 10 (64) virtualized
    on ESXi (Fujitsu Primergy TX150 S8).
    WinSeer (for Win10) - TextSeer - FitbitSeer - HSPI_MoskusSample

    Are you Norwegian (or Scandinavian) and getting started with HomeSeer? Read the "HomeSeer School"!

    Comment


      #3
      Id it possible that it's not GetDeviceEnumerator is slow - it's looping through the devices:
      Code:
      while (!deviceEnumeration.Finished)
      {
          HSDevice device = deviceEnumeration.GetNext();
      }
      if each GetNext() communicates with the server and if you have many devices - it takes long time

      Comment


        #4
        Originally posted by alexbk66 View Post
        Id it possible that it's not GetDeviceEnumerator is slow - it's looping through the devices:
        Code:
        while (!deviceEnumeration.Finished)
        {
        HSDevice device = deviceEnumeration.GetNext();
        }
        if each GetNext() communicates with the server and if you have many devices - it takes long time
        The payload returned doesn’t cause any more round trips unless you intentionally call a device api with the HS parameter not set to null. GetNext() does not perform any remoting. The issue is the serialization time of the GetDeviceEnumerator returned object in hsconsole and the deserialization time in your plugin in process. .

        Here are some real world timing examples on a SEL box. Notice that the deserialization was taking 68seconds. That is not even counting the serialization time. Given the API had a 60 second timeout, you can see how 'badness occurs'. In the latest version I actually cache overlapping calls to GetDeviceEnumerator (e.g. if plugin A and plugin B both call it with A being first. B will just get a copy of the serialized payload that A received if B's call occurs before A's completes. Given that A can take 60 seconds, it greatly reduces the serlializaiton burden on HSConsole and with deserialization now in the user thread instead of there com thread, even if it takes 3 minutes it won't cause a timeout).

        That said, I do want people to be aware that on lower class hardware some of these calls may be much more expensive than realized. We are also discussing changing the BinaryFormatter to another as things like SlimSerializer are at least twice as fast, and since HS continues to sell the SEL class and RaspberryPi class hardware in addition to beefier boxes, such a change will add quite a bit of life to that class of machine.

        Actual examples:
        Slow Deserialization Process: HSPI_Autelis Thread: 5: Message: ScsRemoteInvokeReturnMessage: Returns Scheduler.Classes.clsDeviceEnumeration, Exception = Time: 68.284992s
        Slow Deserialization Process: HSPI_ISYInsteon Thread: 18: Message: ScsRemoteInvokeReturnMessage: Returns Scheduler.Classes.clsDeviceEnumeration, Exception = Time: 68.052978s

        Comment


          #5
          Originally posted by Moskus View Post
          From what you are writing, I take that it is the deserialization that takes time. But when is the deserialization necessary?
          .
          All calls to HomeSeer by your plugin go through a serialization and deserialization process when the api is called then a similar process when the result is returned. This is how the RPC mechanism used works. It’s the Microsoft BinaryFormatter that is slow and in the GetDeviceEnumerator case at least on a SEL box which is a 2 core 1.5ghz 32bit system that serialization and deserialization time can reach multi-minutes.

          Further the implementation was effectively single threaded so all HS calls by plugins were done sequentially regardless of what thread made the call.

          I fixed this by moving the deserialization to the callers thread (so it’s no longer part of the communications timeout). Now just the serialization is, and for that call the timeout is greatly increased. I also fixed the single threaded nature, separate plugin threads no longer get held up by others. The overall effect on system speed is fairly pronounced. Rich should be testing all this next week and hopefully get into a release shortly.

          Comment


            #6
            Originally posted by bsobel View Post
            I fixed this by moving the deserialization to the callers thread (so it’s no longer part of the communications timeout). Now just the serialization is, and for that call the timeout is greatly increased. I also fixed the single threaded nature, separate plugin threads no longer get held up by others. The overall effect on system speed is fairly pronounced. Rich should be testing all this next week and hopefully get into a release shortly.
            I know HS4 is in heavy development, but it would be great to see this in HS3 soon too...
            HSPro 3.0.0.458, Z-NET with Z-wave plugin 3.0.1.190, RFXCOM + 2x RFXtrx433E, HSTouch, Squeezebox plugin, iTach IP/WF2IR & GC-100-6 with UltraGCIR, BLDenon, NetcamStudio, Jon00s Webpage builder, Harmony Hub plugin, SCSIP (with FreePBX), Arduino plugin, IFTTT, Pushalot plugin, Device History plugin.
            Running on Windows 10 (64) virtualized
            on ESXi (Fujitsu Primergy TX150 S8).
            WinSeer (for Win10) - TextSeer - FitbitSeer - HSPI_MoskusSample

            Are you Norwegian (or Scandinavian) and getting started with HomeSeer? Read the "HomeSeer School"!

            Comment


              #7
              Originally posted by Moskus View Post

              I know HS4 is in heavy development, but it would be great to see this in HS3 soon too...
              I very much doubt any further work will be done on HS3 now. Once HS4 is released, HS3 will be removed from sale and all programming efforts placed adding the new HS4 features promised for 2020.
              Jon

              Comment


                #8
                I get that. I was just hoping that it could "easily" be fixed in HS3 too. I will convert to HS4, of course, but I'm not first in line...
                HSPro 3.0.0.458, Z-NET with Z-wave plugin 3.0.1.190, RFXCOM + 2x RFXtrx433E, HSTouch, Squeezebox plugin, iTach IP/WF2IR & GC-100-6 with UltraGCIR, BLDenon, NetcamStudio, Jon00s Webpage builder, Harmony Hub plugin, SCSIP (with FreePBX), Arduino plugin, IFTTT, Pushalot plugin, Device History plugin.
                Running on Windows 10 (64) virtualized
                on ESXi (Fujitsu Primergy TX150 S8).
                WinSeer (for Win10) - TextSeer - FitbitSeer - HSPI_MoskusSample

                Are you Norwegian (or Scandinavian) and getting started with HomeSeer? Read the "HomeSeer School"!

                Comment


                  #9
                  At the end of life for HS2, Rich put in a lot of work to correct and finalize quite a number of bugs and annoyances. Perhaps he'll take pity again and do the same for HS3 holdouts (like me until there's some solution for an HS4 HSPhone plugin).
                  Real courage is not securing your Wi-Fi network.

                  Comment

                  Working...
                  X