Announcement

**spud** · August 29, 2019, 12:13 PM

The Autelis plugin targets .NET framework 4

I did answer your visitor message
https://forums.homeseer.com/member/7...ge-from-bsobel

**bsobel** · August 29, 2019, 04:21 PM

Spud, sorry somehow I completely missed that, and couldn't figure out how to get back to where I sent it.

Could I ask you to consider a change then, in your code I believe internally that FindDeviceById is calling GetDeviceEnumerator() or another HS API to do a lookup. In that code would you consider doing the below where your current call is? The second call, if it fails would be caught by your existing catch block that logs, but I believe the second call will work so the issue will be masked.

try
{
HS.Call()
}
catch (NullPointerException npe)
{
HS.Call()
}

What we are seeing on some systems on Mono is the first call to a HS API on a secondary thread, the first call fails. The remoting sets up the TCP channel but the first request fails, I still don't have root cause. It is MUCH more pronounced during startup and Rich has been doing some startup changes with me to test this (I suspect that since his STA thread wasn't pumping messages during startup, a bunch of async calls are getting queued and timing out, leading to the issue). Usually the same thread works after the first success (so guessing/hoping in the above example you fire of a new thread every minute or so to do work, and that's why we see the same error every minute). On 4.6 and above it seems to get the NPE on more than just the first call, but with the version you are on, it seems to be just the first call on a new thread.

It is ALWAYS a null pointer exception and always occurs in the depths of mono remoting, I've modified my plugin to retry and a few others (basically as I see the issue on my system I've been reaching out). Ideally Rich and I will get to root cause, but this so far has taken a year to get as close/far as we turn out to be at this stage. Im confident any really fix will be in the HS code not the plugins, but since you are likely to have to support back builds for some time the above try/catch/recall at least once should mask the issue for myself and other users who may or may not even know they are running into it (its not consistent, its been a FUN bug to track down...)

Thank you for considering, and happy to beta if you are willing to trust me and make the change.

Best,
Bill

**bsobel** · September 5, 2019, 01:58 PM

Spud,

More details on what is going on. These details are likely to change as I am working on code changes for HS to address this, however we all (plugin developers) have a backwards compatibility issue we need to deal with:

Every plugin talks to HomeSeer over a RPC framework. That RPC framework has one queue and one thread per plugin on the server side. So what this means is if say you start a background thread and call GetDeviceEnumerator() it gets sent to the server and dispatched. Now say you receive an HSEvent notification you care about, or your thread monitoring equipment notices a change and you call SetDeviceString() (really any HS.* api). That call even though its on its own thread in your plugin winds up in the shared queue on the server and won't be processed until its the top of the queue. So in this example it doesn't execute until GetDeviceEnumerator() completes. I have the HSCF code and am changing this to a thread pool to resolve this, but this is how HS is working today and I think we all coded assuming that our individual plugin threads were indeed not being serialized in this way.

I desperately need you to change the Autelis plugin, as here is what is occurring: The server sets an 8 minute timeout on API's, but the plugin defaults to 60 seconds. On my system GetDeviceEnumerator() often takes 45-75 seconds. In the >60 second case the call is aborted and winds up throwing a null pointer exception. The plugin needs to retry when you see that exception (as the call time is variable), perhaps even simply shutting down if it can never get one after a reasonable number of tries (my plugins do 20 tries, with a few second sleep between). Today when you receive the NullPointerException you move on and recreate your devices (you treat the NPE exception the same as an empty enumerator).

My system currently has about 650 extra Autelis plugin devices due to this (accounting for 19% of my total devices). It's a circle of death as each new device caused GetDeviceEnumerator() to take a bit longer. So as you see the exception occasionally before over time its becoming more and more often, resulting in more devices, resulting in it being more often (rinse/repeat...)

This is nothing you did wrong in your plugin, the contract with the HS API isn't working as we thought. But until everyone is on a fixed version, we will all need to be more defensive in our coding.

Yo may recall in the past I had the similar issue, its taken until now to really get HS to dig in to the level that we can fully explain what is occurring.

Can you kindly get me a build with those changes to try?

Best
Bill

**spud** · September 5, 2019, 07:53 PM

thanks for the detailed explanation. It's weird that it only happens on Linux though.

you can test beta version 3.0.0.29

here is the code I used:

Code:

        public clsDeviceEnumeration SafeGetDeviceEnumerator()
        {
            clsDeviceEnumeration de = null;
            int retry = 0;

            while(de == null && retry < 5)
            {
                try
                {
                    de = (clsDeviceEnumeration)m_hs.GetDeviceEnumerator();
                }
                catch (NullReferenceException)
                {
                    Thread.Sleep(500);
                    if (retry >= 4)
                        throw new HomeSeerRemoteCallException("GetDeviceEnumerator failed after 5 retries");
                }
                retry++;
            }

            return de;
        }

and when HomeSeerRemoteCallException is catched the devices won't be recreated.

**bsobel** · September 5, 2019, 08:06 PM

Originally posted by spud View Post

thanks for the detailed explanation. It's weird that it only happens on Linux though.

you can test beta version 3.0.0.29

here is the code I used:

Code:

 public clsDeviceEnumeration SafeGetDeviceEnumerator()
{
clsDeviceEnumeration de = null;
int retry = 0;

while(de == null && retry < 5)
{
try
{
de = (clsDeviceEnumeration)m_hs.GetDeviceEnumerator();
}
catch (NullReferenceException)
{
Thread.Sleep(500);
if (retry >= 4)
throw new HomeSeerRemoteCallException("GetDeviceEnumerator failed after 5 retries");
}
retry++;
}

return de;
}

and when HomeSeerRemoteCallException is catched the devices won't be recreated.

Hi, and thank you! After further digging, the reason I am seeing it on Linux is simply the SEL box is a 1.5ghz 2 core box and a single plugin takes about 40 seconds to get a device enumerator. When multiple plugins are running the serialization and deserialization threads starve and the time going through the roof and well past 60 seconds. Im sure the same issue exists on Windows, those just have traditionally been beefier boxes.

I am about done with the HCSF changes and will send them to Rich, it will extend the life of the SEL and similar boxes. I just think at about now 2700 devices I'm a bigger customer than its designed for. I think around 1000 devices it would be fine, and the system feels MUCH more responsive after the changes I did. Realized today the return path was also effectively being serialized. With this new version I've moved the deserialization out of the communication thread, so no more random 60 second delays wondering why a plugin seems unresponsive when its simply blocked on another of its own threads thats in either serialization or deserialization.

In the above code, I would recommend raising the count of retries and delaying about a minute between them. When it times out the old call still winds up running, so retiring immediately adds even more load... And thanks for fixing the device creation case, much much appreciated!

**bsobel** · September 6, 2019, 01:52 PM

Spud, an update. I changed my overall recommendation to:

Your GetDeviceEnumerator call should loop until it completes or ShutdownIO is called. If it fails, wait randomly 30s to 8 minutes to try again. On lower power machines trying again right away actually increases load making the problem worse. For example right now on my SEL box with a few plugins running its taking 85 seconds to deserialize the data. I also moved the deserialization into the callers thread, which is after the timeout. So now the timeout is JUST the server communications (and the server doing the serialization). This should resolve the issue, but you should ideally cache results and greatly limit the number of calls to GDE you make overall (you being plugin developers in this context).

**bsobel** · September 9, 2019, 05:18 PM

Spud, one last request (low priority bug). In GetSafeDeviceEnumerator, can you check for a flag and exit your loop if ShutdownIO has been called? Not a huge one but it causes you to throw exceptions if the system is shutdown in the first few minutes after startup and its not completed yet.

**spud** · September 14, 2019, 07:59 AM

Please test version 3.0.0.30. I added a few optimizations as per the timings you sent.

**bsobel** · September 14, 2019, 11:23 AM

Originally posted by spud View Post

Please test version 3.0.0.30. I added a few optimizations as per the timings you sent.

Installed, initial reaction is tremendous improvement. I will watch it over the next few days, but so far CPU usage is night and day better. Thank you!

**bsobel** · September 15, 2019, 10:46 AM

Originally posted by bsobel View Post

Installed, initial reaction is tremendous improvement. I will watch it over the next few days, but so far CPU usage is night and day better. Thank you!

Ok after letting it run over night I am hoping you can consider one more change. Its to cache the device #s you get back after your call to GetDeviceEnumerator() since startup 15 hours ago its been called 242 times and spend 108 cpu minutes in that call. With the fixed HSCF Im running GetDeviceEnumerator() no longer fails, so now that its always running to completion I have better data on just how much CPU that entails. It is by far the most expensive thing you do when interacting with HS. The only other high number is SetDeviceByRef() and you kind of need to do that one when you have data ,)

**bsobel** · September 17, 2019, 04:34 PM

Just an FYI here are the raw CPU times from my system running for a few days. Zwave being the performance killer (giving data to HS on that) and Autelis being my #2 heaviest CPU user. That said it is substantially faster than before, the only issue I still see is the hundred of calls to GetDeviceEnumertor(). Im confident if that was cached you'd fall into the background noise level, there is nothing else that seems to be happening thats expensive.

**spud** · September 17, 2019, 05:31 PM

It is already supposed to be cached, on my system there is only one call to GetDeviceEnumerator, so there must be another problem on your system, I added some logs to see when FindDeviceByID is called, please install version 3.0.0.31 and capture some debug logs.

**bsobel** · September 17, 2019, 06:47 PM

Originally posted by spud View Post

It is already supposed to be cached, on my system there is only one call to GetDeviceEnumerator, so there must be another problem on your system, I added some logs to see when FindDeviceByID is called, please install version 3.0.0.31 and capture some debug logs.

Out but will as soon as I get home. Thanks.

**bsobel** · September 18, 2019, 10:14 AM

Ok here we go. Is it possible the extra enumeration occur for devices that don't exist? E.g. you need to cache their non-existence as well? I see the chemchlrerr is looked up multiple times and thats a device not on my system...

Sep-18 08:04:59 INFO Autelis version 3.0.0.31
Sep-18 08:05:01 INFO CMD SENT: http://10.1.0.249:80/names.xml?
Sep-18 08:05:03 DEBUG <response>
<equipment>
<aux1>Cleaner</aux1>
<aux2>Low Speed</aux2>
<aux3>Overflow</aux3>
<aux4>Air Blower</aux4>
<aux5>Lights</aux5>
<aux6>Heat Pump</aux6>
<aux7>Gas Heater</aux7>
<aux8>AUX8</aux8>
<aux9>AUX9</aux9>
<aux10>AUX10</aux10>
<aux11>AUX11</aux11>
<aux12>AUX12</aux12>
<aux13>AUX13</aux13>
<aux14>AUX14</aux14>
<aux15>AUX15</aux15>
<aux16>AUX16</aux16>
<aux17>AUX17</aux17>
<aux18>AUX18</aux18>
<aux19>AUX19</aux19>
<aux20>AUX20</aux20>
<aux21>AUX21</aux21>
<aux22>AUX22</aux22>
<aux23>AUX23</aux23>
</equipment>
</response>
Sep-18 08:05:08 INFO CMD SENT: http://10.1.0.249:80/lights.xml?
Sep-18 08:05:08 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:05:08 INFO CMD SENT: http://10.1.0.249:80/status.xml?
Sep-18 08:05:08 DEBUG <response>
<system>
<runstate>8</runstate>
<model>6520</model>
<dip>11100000</dip>
<opmode>0</opmode>
<vbat>615</vbat>
<lowbat>0</lowbat>
<version>1.6.9</version>
<time>1568819108</time>
</system>
<equipment>
<pump>1</pump>
<pumplo>0</pumplo>
<spa>0</spa>
<waterfall>0</waterfall>
<cleaner>0</cleaner>
<poolht2></poolht2>
<poolht>0</poolht>
<spaht>1</spaht>
<solarht>0</solarht>
<aux1></aux1>
<aux2></aux2>
<aux3></aux3>
<aux4>0</aux4>
<aux5>0</aux5>
<aux6>0</aux6>
<aux7>0</aux7>
<aux8></aux8>
<aux9></aux9>
<aux10></aux10>
<aux11></aux11>
<aux12></aux12>
<aux13></aux13>
<aux14></aux14>
<aux15></aux15>
<aux16></aux16>
<aux17></aux17>
<aux18></aux18>
<aux19></aux19>
<aux20></aux20>
<aux21></aux21>
<aux22></aux22>
<aux23></aux23>
</equipment>
<temp>
<poolsp>90</poolsp>
<poolsp2>60</poolsp2>
<spasp>99</spasp>
<pooltemp>80</pooltemp>
<spatemp>0</spatemp>
<airtemp>69</airtemp>
<solartemp>65</solartemp>
<tempunits>F</tempunits>
</temp>
</response>

Sep-18 08:05:09 DEBUG FindDeviceByID id=root
Sep-18 08:06:15 DEBUG FindDeviceByID id=pump
Sep-18 08:06:16 DEBUG FindDeviceByID id=pumplo
Sep-18 08:06:16 DEBUG FindDeviceByID id=spa
Sep-18 08:06:16 DEBUG FindDeviceByID id=waterfall
Sep-18 08:06:16 DEBUG FindDeviceByID id=cleaner
Sep-18 08:06:16 DEBUG FindDeviceByID id=poolht
Sep-18 08:06:17 DEBUG FindDeviceByID id=spaht
Sep-18 08:06:17 DEBUG FindDeviceByID id=solarht
Sep-18 08:06:17 DEBUG FindDeviceByID id=aux4
Sep-18 08:06:17 DEBUG FindDeviceByID id=aux5
Sep-18 08:06:17 DEBUG FindDeviceByID id=aux6
Sep-18 08:06:18 DEBUG FindDeviceByID id=aux7
Sep-18 08:06:18 DEBUG FindDeviceByID id=pooltemp
Sep-18 08:06:18 DEBUG FindDeviceByID id=spatemp
Sep-18 08:06:18 DEBUG FindDeviceByID id=airtemp
Sep-18 08:06:19 DEBUG FindDeviceByID id=solartemp
Sep-18 08:06:19 DEBUG FindDeviceByID id=poolsp
Sep-18 08:06:20 DEBUG FindDeviceByID id=poolsp2
Sep-18 08:06:20 DEBUG FindDeviceByID id=spasp
Sep-18 08:06:20 INFO CMD SENT: http://10.1.0.249:80/pumps.xml?
Sep-18 08:06:20 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:06:20 INFO CMD SENT: http://10.1.0.249:80/chem.xml?
Sep-18 08:06:21 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:06:21 DEBUG FindDeviceByID id=chemchlrerr
Sep-18 08:09:21 INFO CMD SENT: http://10.1.0.249:80/status.xml?
Sep-18 08:09:21 DEBUG <response>
<system>
<runstate>8</runstate>
<model>6520</model>
<dip>11100000</dip>
<opmode>0</opmode>
<vbat>615</vbat>
<lowbat>0</lowbat>
<version>1.6.9</version>
<time>1568819361</time>
</system>
<equipment>
<pump>1</pump>
<pumplo>0</pumplo>
<spa>0</spa>
<waterfall>0</waterfall>
<cleaner>0</cleaner>
<poolht2></poolht2>
<poolht>0</poolht>
<spaht>1</spaht>
<solarht>0</solarht>
<aux1></aux1>
<aux2></aux2>
<aux3></aux3>
<aux4>0</aux4>
<aux5>0</aux5>
<aux6>0</aux6>
<aux7>0</aux7>
<aux8></aux8>
<aux9></aux9>
<aux10></aux10>
<aux11></aux11>
<aux12></aux12>
<aux13></aux13>
<aux14></aux14>
<aux15></aux15>
<aux16></aux16>
<aux17></aux17>
<aux18></aux18>
<aux19></aux19>
<aux20></aux20>
<aux21></aux21>
<aux22></aux22>
<aux23></aux23>
</equipment>
<temp>
<poolsp>90</poolsp>
<poolsp2>60</poolsp2>
<spasp>99</spasp>
<pooltemp>80</pooltemp>
<spatemp>0</spatemp>
<airtemp>69</airtemp>
<solartemp>65</solartemp>
<tempunits>F</tempunits>
</temp>
</response>

Sep-18 08:09:27 INFO CMD SENT: http://10.1.0.249:80/pumps.xml?
Sep-18 08:09:27 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:09:27 INFO CMD SENT: http://10.1.0.249:80/chem.xml?
Sep-18 08:09:27 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:09:27 DEBUG FindDeviceByID id=chemchlrerr
Sep-18 08:13:18 INFO CMD SENT: http://10.1.0.249:80/status.xml?
Sep-18 08:13:18 DEBUG <response>
<system>
<runstate>8</runstate>
<model>6520</model>
<dip>11100000</dip>
<opmode>0</opmode>
<vbat>615</vbat>
<lowbat>0</lowbat>
<version>1.6.9</version>
<time>1568819598</time>
</system>
<equipment>
<pump>1</pump>
<pumplo>0</pumplo>
<spa>0</spa>
<waterfall>0</waterfall>
<cleaner>0</cleaner>
<poolht2></poolht2>
<poolht>0</poolht>
<spaht>1</spaht>
<solarht>0</solarht>
<aux1></aux1>
<aux2></aux2>
<aux3></aux3>
<aux4>0</aux4>
<aux5>0</aux5>
<aux6>0</aux6>
<aux7>0</aux7>
<aux8></aux8>
<aux9></aux9>
<aux10></aux10>
<aux11></aux11>
<aux12></aux12>
<aux13></aux13>
<aux14></aux14>
<aux15></aux15>
<aux16></aux16>
<aux17></aux17>
<aux18></aux18>
<aux19></aux19>
<aux20></aux20>
<aux21></aux21>
<aux22></aux22>
<aux23></aux23>
</equipment>
<temp>
<poolsp>90</poolsp>
<poolsp2>60</poolsp2>
<spasp>99</spasp>
<pooltemp>80</pooltemp>
<spatemp>0</spatemp>
<airtemp>71</airtemp>
<solartemp>66</solartemp>
<tempunits>F</tempunits>
</temp>
</response>

Sep-18 08:13:24 INFO CMD SENT: http://10.1.0.249:80/pumps.xml?
Sep-18 08:13:24 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:13:24 INFO CMD SENT: http://10.1.0.249:80/chem.xml?
Sep-18 08:13:24 DEBUG WebException: ProtocolError The remote server returned an error: (404) Not found.
Sep-18 08:13:24 DEBUG FindDeviceByID id=chemchlrerr

Announcement

What version of .NET are you building with

What version of .NET are you building with

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment