
Command Failure When Controlling Large Number of Devices



    rjh, I have been seeing occasional command failures when controlling a large number of devices all at once, such as a large number of Z-Wave dimmers.

    I'm hoping a "fix" can be added to HS3 or, at the very least, included in the HS4 release (which, rumor has it, may be coming later this year).

    What's happening is that I often see command failures when using Easy Trigger's "Set Group" feature to command a large number of devices, e.g., to turn on / off all my lights (the group includes about 100 devices). When I send that command, I'd say on average a few percent fail. I think this error also happens when using the "regular" HomeSeer device control function, but I typically don't control as many devices at once when using HomeSeer's function. So even though I mostly see this when using Easy Trigger, my suspicion is that it's something that needs to be corrected at the Z-Wave plugin level.

    I believe what is happening is that the Z-Wave network is getting flooded with commands and some are getting lost in transmission. It seems that the plugin should be able to detect these failures and automatically check and correct (i.e., automatic re-transmission) but doesn't do so -- at least it doesn't seem to do so (please correct me if I'm wrong). I'd like to see this added to a future plugin release or in HS4.

    I'm thinking something along the following lines needs to be done (this seems relatively simple, at least on the surface) . . .

    1. Every time the plugin sends a command to a Z-Wave device, it adds an entry to a table indicating the device, the command sent, and the total number of tries made to execute the command.
    2. If the command succeeds, my understanding is that most Z-Wave devices (at least the "Plus" versions) will report back their new status.
    3. If the commanded device reports back its new status, indicating the command was received and processed, then delete the entry from the table. You may want to treat any report back as a "success" even if the new value doesn't match the commanded value (for example, for locks, if you commanded the lock to lock but got back a "jammed" report, you'd still treat the lock command as having been received and processed).
    4. If the commanded device doesn't report a new status after a few seconds (say, 2 seconds after all of the outgoing and incoming Z-Wave commands in the Z-Wave interface's command queue have been processed), then either re-try the command immediately, or poll the device to see whether it was just the report back that was lost and re-try the command only if needed.
    5. When re-trying the command, increment the value in the table indicating how many attempts were made. If you exceed a maximum, log the error and give up.
    6. If another command generated at the HomeSeer mobile or web interface "overrides" the "original" command, you would also want to clear the table entry regardless of whether the prior "failure" was corrected, and just re-start. E.g., if an attempt is made to set all devices to "off" and a few didn't report that they were set to off, but before you correct that using steps 1-5 a new command comes in to turn a "failed" device back on, then stop trying to complete the previously "failed" command and just start with the "new" command.
    7. Similar to #6, if the device was manually changed by a user (e.g., tapping the paddle on a light switch) while you were still trying to recover a prior command, you'd also want to "give up" trying to correct the prior command.
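
A minimal sketch of the bookkeeping in steps 1-7, in Python purely for illustration. Every name here (`RetryTracker`, `Pending`, `on_report`, `on_timeout`, the `send` callback) is hypothetical and is not a HomeSeer or Z-Wave plugin API. It assumes the caller invokes `on_timeout()` once the interface queue has been idle for the chosen delay, and it re-sends immediately rather than polling first.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """One table entry (step 1): the command sent and how many tries so far."""
    command: str
    attempts: int = 1


class RetryTracker:
    """Hypothetical retry table keyed by device id; not a real plugin class."""

    def __init__(self, send, max_attempts=3, log=print):
        self.send = send                  # callable(device, command), assumed
        self.max_attempts = max_attempts  # total tries before giving up
        self.log = log
        self.pending = {}                 # device id -> Pending

    def command(self, device, command):
        # Steps 1 and 6: record the command; a newer command for the same
        # device simply replaces any unconfirmed older entry.
        self.pending[device] = Pending(command)
        self.send(device, command)

    def on_report(self, device):
        # Steps 3 and 7: any status report (matching value or not, or one
        # caused by a manual change) clears the entry.
        self.pending.pop(device, None)

    def on_timeout(self):
        # Steps 4 and 5: re-send everything still unconfirmed, or give up
        # once the attempt limit has been reached.
        for device, entry in list(self.pending.items()):
            if entry.attempts >= self.max_attempts:
                self.log(f"device {device}: '{entry.command}' failed "
                         f"after {entry.attempts} attempts")
                del self.pending[device]
            else:
                entry.attempts += 1
                self.send(device, entry.command)
```

Keying the table by device rather than by individual command is what makes steps 6 and 7 cheap: an overriding command or a manual change only has to replace or remove a single entry.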

    Maybe this could be configurable (e.g., a check box for each device labeled "retry commands if report not received", with retrying enabled by default).

    Thanks for your consideration of and thoughts on this.

    #2
    Just a side note: When I am controlling numerous Z-Wave devices, I will place a Wait command after about every 6 device commands. This allows HomeSeer and Z-Wave to "catch up". I experimented to find how often to put wait statements in, and for how long to wait.
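
For illustration only, the same pacing idea written as a small Python sketch rather than as HomeSeer event actions. The function name, the `send` callback, and the 1-second default pause are assumptions; the post above gives the batch size of about 6 but not the wait length.

```python
import time


def send_in_batches(devices, send, batch_size=6, pause_s=1.0, sleep=time.sleep):
    """Send one command per device, pausing after every `batch_size` sends.

    `send` is a hypothetical callable(device). Returns the number of pauses
    taken, which makes the added delay easy to estimate.
    """
    devices = list(devices)
    pauses = 0
    for count, device in enumerate(devices, start=1):
        send(device)
        # Pause after each full batch, but not after the very last device.
        if count % batch_size == 0 and count < len(devices):
            sleep(pause_s)
            pauses += 1
    return pauses
```

With 100 devices and a batch size of 6 this takes 16 pauses, so the group takes at least 16 times the chosen pause length longer to complete.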



      #3
      This would be an excellent addition to HS processing. I also see these types of errors regularly, say, several times per day. On my system they occur not only when processing large groups of devices but also with individual commands. I don't know the reason, but sometimes Z-Wave failure errors are logged even though the device does in fact receive and process the command, albeit belatedly. A timing issue perhaps? Or is the system in fact logging an error and then retrying?

      As for problems with large groups of devices, I experienced them consistently back when trying to use the All Off Z-Wave command, to the point of the command being unusable. (I think I've read the All On / All Off functions may have been or will be deprecated?) I then moved to EasyTrigger groups but still have the issue of devices not receiving commands unless I use the more recent ET feature of inserting pauses between each device command, with a minimum of 1 second. This results in all commands being sent successfully perhaps 95% of the time, but in many cases the time it takes to cycle through a large number of devices is undesirable.

      rjh, I also would appreciate your consideration of jvm's thoughtful suggestion.
      -Wade



        #4
        Originally posted by aa6vh
        Just a side note: When I am controlling numerous Z-Wave devices, I will place a Wait command after about every 6 device commands. This allows HomeSeer and Z-Wave to "catch up". I experimented to find how often to put wait statements in, and for how long to wait.
        Thanks for the suggestion. Yes, that's one way to deal with it, but it highlights the problem: failure recovery is pushed to the user, whose ad-hoc solutions can slow the overall system (all those waits add up if you have 100 devices to command), rather than being handled in the plugin where possible. It seems to me the best "design philosophy" is to handle all predictable failure scenarios (and this is one) in the plugin rather than expecting each end-user to detect the error, diagnose why, and figure out an ad-hoc solution to address it.



          #5
          +1, jvm. Excellent post. There seems to be ongoing contention for HS3/4 development resources. One camp (largely driven by marketing considerations, I suspect) is focussed on mobile app feature enhancements, while another group (of which I am a member) is much more interested in rock solid engine performance. To me, an occasional failure to execute an event, or script, or device command is simply intolerable. And anything less than 100% logging (if desired) of hardware and detected system failures compounds the injury. When it comes to the engine, "mostly" is simply not good enough.

          A different but somewhat related request I would like to add is improved system performance monitoring. In the current context, the system should collect data on the device command failure rate (by device and in aggregate), average and maximum times for command completion, etc. Data values would be assigned to virtual devices, which could then drive user-written events. Implemented properly, performance monitoring functionality would help identify incipient hardware failure before it becomes catastrophic.
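
A rough Python sketch of the kind of counters this request describes, for illustration only. `CommandMonitor` and `DeviceStats` are hypothetical names, not HomeSeer features, and the sketch assumes something upstream can report each command's outcome and completion time. Publishing the values to virtual devices is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DeviceStats:
    sent: int = 0
    failed: int = 0
    total_s: float = 0.0   # summed completion time of successful commands
    max_s: float = 0.0     # slowest successful command

    @property
    def failure_rate(self):
        return self.failed / self.sent if self.sent else 0.0

    @property
    def average_s(self):
        completed = self.sent - self.failed
        return self.total_s / completed if completed else 0.0


class CommandMonitor:
    """Hypothetical per-device and aggregate command statistics."""

    def __init__(self):
        self.by_device = defaultdict(DeviceStats)

    def record(self, device, duration_s=None):
        # duration_s is the completion time in seconds; None marks a failure.
        stats = self.by_device[device]
        stats.sent += 1
        if duration_s is None:
            stats.failed += 1
        else:
            stats.total_s += duration_s
            stats.max_s = max(stats.max_s, duration_s)

    def aggregate(self):
        # Combine all per-device counters into one system-wide summary.
        total = DeviceStats()
        for stats in self.by_device.values():
            total.sent += stats.sent
            total.failed += stats.failed
            total.total_s += stats.total_s
            total.max_s = max(total.max_s, stats.max_s)
        return total
```

A rising `failure_rate` or `max_s` for a single device, while the aggregate stays flat, is the sort of signal that could flag a failing node before it stops responding entirely.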

