Announcement

Collapse
No announcement yet.

Queue depth and response time troubleshooting

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Queue depth and response time troubleshooting

    Hi Michael,

    not sure if this is related to my HS4 migration, changes to the plugin or an increase in my number of devices/polling, but I started having issues with HS/MCSMQTT response time.

    I did read your documentation + performance tips; made some changes and troubleshooted for a while and it's clearly related to the MCSMQTT queue filling up.
    switched some devices to Express mode, disabled local graphs/history (still use Influx) and tried a number of general/queue settings unsuccessfully (current queue settings below).

    Before I look for a CPU upgrade, I'm quite puzzled why the plugin gets hits by so many callbacks (?) for rather basic usage.

    Have a look below, the first spike is the result of triggering a single motion sensor + associated event (below)
    If I continue moving across the house, this gets ugly quite rapidly as you can see on the graph.
    And obviously HS response time deeply suffers while this is happening.

    Why so many callbacks (close to 600 for a single event!?) does this look normal?

    If so, (don't think so as I didn't have these issues before) what are my options to improve performance ; CPU upgrade or other?
    ​If not, how can I troubleshoot further to fix this ?

    Thanks again for your support on the recent testing/troubleshooting spree =)


    Click image for larger version  Name:	image.png Views:	0 Size:	40.8 KB ID:	1590087

    Click image for larger version  Name:	image.png Views:	0 Size:	90.4 KB ID:	1590089



    Click image for larger version  Name:	image.png Views:	0 Size:	18.8 KB ID:	1590088


    #2
    During some of my investigations I observed an issue where the MQTT broker was sending many messages when mcsMQTT first connects. I had the receive queue limit set to 100 so the plugin was discarding many of these messages, of which were some that I was trying to investigate. What I did to alleviate this was to increase the queue limit dynamically until the queue depth got back to 0. This is likely why you have a max value of 1779. It means you are subscribed to at least 1779 topics what have been sent to the MQTT Broker with the retain flag set.

    mcsMQTT does not know why a Topic has been delivered so it needs to decode the JSON data for each to assess what it needs to do with each key value pair in the payload. The last time I benchmarked was version 3.5.5 of the plugin and much has developed since then. During that investigation it was learned that communicating with HS was an expensive endeavor so Express mode was introduced to minimize that aspect of CPU usage.

    The 5C statistic is the one that tells how long mcsMQTT is taking to process the received messages. 5D is the average time between receipt of messages. If 5C is greater than 5D then the queue will grow. This data should be looked at in a steady state situation. When the Broker reconnects, it will provide a significant disturbance to data averages due to retain message provisions of the MQTT protocol.

    I do not think the lower two parameters in the screenshot make much difference. If data is coming too fast, then no matter what is done, the queue will fill, and bad things will result.

    For the next step what I suggest is that an understanding of what is causing the queue depth to increase. At the time it is going from near 0 to above 100 then look at mcsMQTT Debug.txt to see what messages are being delivered. Why are they being delivered?

    I am more than happy to address the issue, but the root cause of the issue needs to be understood.

    Comment


      #3
      Ok, sending you 2 debug files via email :


      1st one is a 1 minute interval which reflects just one user driven action: me triggering a motion sensor (same as above).
      the trigger is at line 18550686, that's when all hell breaks loose ; close to 50000 lines in a single minute... queue spikes to 1800.


      Second debug file is a "normal' minute ; no actions just the system idling; About 10k lines. the queue remains below 10.


      At first glance I see a lot of Influx activity but both files do.
      Let me know if you can figure out where the issue is.

      Open to more testing if needed.


      Comment


        #4
        The scenario is that mcsMQTT, for a 20 second monitor interval has repeated attempts to subscribe. Each attempt induces the MQTT broker to send all it's retained messages so this fills the queue very quickly.

        I need to look into why multiple subscribe attempts are being sent, but even one with 1700 retained messages being sent will adversely affect the real time response to messages until the queue is emptied. To help mitigate this I can modify the queue depth limit to be restored to the user-setting rather than the setting that was dynamically set at startup. This way the surplus of retained messages will be discarded on resubscribe attempts.

        At your end, can you remove the setting to automatically resubscribed after some number of minutes. If the problem goes away then this will address why the resubscribe is being attempted.

        Comment


          #5
          Originally posted by Michael McSharry View Post

          At your end, can you remove the setting to automatically resubscribed after some number of minutes. If the problem goes away then this will address why the resubscribe is being attempted.
          Thanks Michael , the resubscribe setting should be found on the client side/ ex: tasmota? or in your plug-in? Anything to do with the retain flags?

          Comment


            #6
            The setting I am referring to is on the General Tab of MQTT Page of the plugin. A textbox was added to restore a specific subscribe list after some number of minutes. If this textbox is left blank then it should defeat this logic. The bottom line is that there are multiple places where a resubscribe is requested. Most are explicit user action or as part of restoring a lost connection. This textbox is the only one that I see that is automatic without any surrounding debug. I want to confirm it is the culprit and there is not something else causing it.

            Comment


              #7
              Would it be "Publish Periodic Status (Mins)" ? Currrently set to 0mins

              See my general config attached, would you mind confirming and/or if anything doesn't seem right, thx​​
              Attached Files

              Comment


                #8
                It is the textbox shown below which you apparently do not have set. It could be that the text label for this box only shows when Discover All is selected. The logic was not obvious to me when doing my initial review as it was trying to persist the setting between HS / plugin restart sessions. I need to look deeper so nothing for you to do on this right now.

                Click image for larger version

Name:	0x.png
Views:	74
Size:	17.3 KB
ID:	1590304

                Comment


                  #9
                  Fyi I have tried both the internal broker & Mosquitto with similar results. Currently on Mosquitto. keep me posted. Thx

                  Comment


                    #10
                    Strickly a debug build to investigate the message in the debug log of ...

                    2/2/2023 11:33:11 PM 18550625 | MQTT Subscription Unsubscribe

                    This build will include a suffix of "from xx" where xx will point me to why resubscription is being done. There are 25 potential sources.

                    http://mcsSprinklers.com/MCSMQTTHS4_6092.zip

                    p.s. it also has the WLED White channel update for values under 16.

                    Comment


                      #11
                      Ok will install and run the same sequence/debug tomorrow am. Stay tuned.
                      thx

                      Comment


                        #12
                        Done. Debug sent via email;

                        The winner is.... no 21. Who is the luck winner ?

                        MQTT Subscription Unsubscribe from 21

                        Comment


                          #13
                          It is a commanded resubscribe from HS Event. It can be raised due to an event being added or removed from the path below

                          Code:
                              Public Overrides Sub HSEvent(ByVal EventType As HomeSeer.PluginSdk.Constants.HSEvent, ByVal parms() As Object)
                                  If EventType = HomeSeer.PluginSdk.Constants.HSEvent.CONFIG_CHANGE Then
                                      If parms(5) = "Event Trigger Removed" OrElse parms(5) = "Event Deleted" OrElse parms(5) = "Event Trigger Added" Then
                                          If parms(1) > 0 Then
                                              RebuildTriggers()
                                          End If
                                      End If
                                  Else
                                      plugin.HSEvent(EventType, parms)
                                  End If
                              End Sub​
                          It could also be raised due to the edit of a mcsMQTT event.

                          Were you changing events to cause the failure mode testing?

                          The resubscribe was unconditional on either of these two paths. What I did was inspected each mcsMQTT event and determined if a Topic had changed and only then resubscribe. The change was made in the .exe. I also updated the dll to only allow the max queue limit to be ignored during the first minute after startup. I also removed some of the debug as that investigation has passed.

                          Note that the .exe and corresponding pdb go into the HS folder while the others go into the \bin\mcsMQTT folder.

                          http://mcsSprinklers.com/MCSMQTTHS4_6093.zip

                          Comment


                            #14
                            Makes some sense as all of my motion events create delayed events (below); would that explain it?
                            testing in a sec.


                            Click image for larger version

Name:	image.png
Views:	63
Size:	117.1 KB
ID:	1590504

                            Comment


                              #15
                              Yes it would explain it as the removal could be considered a configuration change by HS which raises an event so plugins know that something has changed.

                              Comment

                              Working...
                              X