Connection Failed every few days-Arduino Hard Crash. Needs a power cycle to reset.


    I have had 3 boards in use for around a year, and an event that emails me every time I get a connection error.
    ---------------
    Board 1 is a genuine Uno with a clone Ethernet shield. It has:
    2 input pins, and
    4 PWM output pins driving FETs that control 12 V LED lighting strips.
    This board disconnects and immediately reconnects between 2 and 10 times per day. I could replace the shield with a genuine one and try that.
    ---------------
    Board 2 is a genuine EtherUno with POE module. It has:
    1 input,
    5 outputs driving heating control relays,
    1 PWM output which sets the flow temperature on my boiler, and
    1 OneWire pin with 3 temp sensors for my hot water tank.
    This board disconnects once or twice per week and then goes Connection FAILED. I cannot reset the board, and have to power it down and back up again.
    ---------------
    Board 3 is also a genuine EtherUno with POE module. It has:
    8 outputs driving a relay board, and
    1 OneWire pin with 6 temp sensors.
    This board works perfectly and never disconnects.
    ---------------
    In all cases, the relay boards are driven from a separate 5 V regulator, and therefore do not take power from the Arduino 5 V line.

    Board 2 is causing problems, as the heating stops working when it crashes! Any ideas on how to diagnose the issue? The EtherUno is actually a replacement board. Previously I had a normal genuine Uno with a genuine Ethernet shield. I have exactly the same problem on the replacement board.

    Here are screenshots of the board configs

    Thanks
    Attached Files

    #2
    I had some problems early on with intermittent disconnects. This was way back at version 1.0.0.16. I moved my port numbers into the 55000s, using the IP address for the last three digits. In your case they would be 55280, 55300 and 55320. That cleared up all of the intermittent disconnects on my genuine Mega with a genuine Ethernet shield. The other two boards, both with clone shields (one a genuine board and one a clone), still disconnected randomly. Switching to all genuine components stopped all of the random disconnects.

    Now I have 4 genuine Mega 2560 boards, all connected with genuine (R2) POE shields and I have not had a disconnect in at least a year.

    I also have events that trap disconnections or connection errors, send me a Pushover message, and automatically reset the boards if they remain disconnected for more than 30 seconds. None of the events has run in over a year.
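
The "reset if disconnected for more than 30 seconds" logic described above can be sketched roughly as follows. This is an illustration in Python, not a HomeSeer event: the class name `DisconnectWatchdog`, the `reset_action` callback and the injectable `clock` are all hypothetical, and how a board is actually reset is left to the caller.

```python
import time
from typing import Callable, Optional


class DisconnectWatchdog:
    """Trigger a reset once a board has stayed disconnected past a grace period."""

    def __init__(self, reset_action: Callable[[], None],
                 grace_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._reset_action = reset_action   # hypothetical: whatever resets the board
        self._grace = grace_seconds
        self._clock = clock
        self._down_since: Optional[float] = None
        self._fired = False

    def update(self, connected: bool) -> bool:
        """Feed the latest connection state; return True if a reset was triggered."""
        if connected:
            # Board is back: forget the outage and re-arm the watchdog.
            self._down_since = None
            self._fired = False
            return False
        now = self._clock()
        if self._down_since is None:
            # First time we see the board down: start timing the outage.
            self._down_since = now
            return False
        if not self._fired and now - self._down_since > self._grace:
            # Down for longer than the grace period: reset once per outage.
            self._fired = True
            self._reset_action()
            return True
        return False
```

Calling `update()` on every status poll keeps short disconnect-and-reconnect blips (like those on Board 1) from triggering a reset, while a board that stays down gets exactly one reset attempt per outage.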
    HS4 Pro, 4.2.19.16 Windows 10 pro, Supermicro LP Xeon



      #3
      Thanks Randy, I'll try that tomorrow.

      Out of interest, what caused you to try different port numbers, and why did that solve the problem?



        #4
        Originally posted by apluck
        Thanks Randy, I'll try that tomorrow.

        Out of interest, what caused you to try different port numbers, and why did that solve the problem?
        I could see no reason why the boards were disconnecting. I had very tight control of IP addresses on the network, and my DHCP range was well above where I was putting the Arduinos. At the time I used Netstat and found a lot of activity on lower port numbers around 8900, which the plug-in uses as a default. I saw nothing in the higher numbers. As a first step, I chose new port numbers, and the disconnects on the genuine board and shield stopped immediately. I have no answer as to why that solved the problem, as Netstat showed the specific ports (8901, 8902 and 8903) to be free. After replacing the other two clone shields and one clone board, my problems stopped. I never went back to the 8900s to see if the problems came back.
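
The Netstat check described above can be approximated with a short script. This is only a sketch under assumptions: the helper name `port_is_free` is made up for illustration, it probes TCP by default (the thread does not say which protocol the plug-in uses, so pass `socket.SOCK_DGRAM` to probe UDP), and it only reports whether something on the local machine is already bound to the port. It cannot see stray packets arriving from elsewhere on the network.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0",
                 kind: int = socket.SOCK_STREAM) -> bool:
    """Return True if this machine can bind the port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, kind) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Usually "address already in use"; can also be a permissions error.
            return False
    return True
```

Running `port_is_free(8901)`, `port_is_free(8902)` and `port_is_free(8903)` on the HomeSeer machine while the plug-in is stopped would show whether anything else has claimed those ports. A "free" result is consistent with what Netstat reported here, so it does not rule out other causes.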

        I then added a 4th board using the same port numbering convention, then switched all 4 shields out to POE shields.

        You will need to upload new sketches if you change port numbers.

        I would also recommend the 1.0.0.99 beta as Greig has addressed some connection issues, but they might be limited to USB connections.


          #5
          I had same issue for 2 weeks and pulled my hair out.

          I found it was a sign my Ethernet switch was failing.

          Replaced and all was fine.




            #6
            update:-

            I changed the clone ethernet shield for a genuine one on board 1, and it made no difference - I still had connection errors. Then I changed the port for a higher one yesterday, and I haven't had a connection error since.

            On Board 2, I changed the port to a high port. Will need to wait a week or so to see if the problem has gone away, but so far so good.

            Fingers crossed this stops the problem.

            I guess either the plugin or the Arduino is getting confused by unexpected packets that appear on those ports.



              #7
              Originally posted by apluck
              update:-

              I changed the clone ethernet shield for a genuine one on board 1, and it made no difference - I still had connection errors. Then I changed the port for a higher one yesterday, and I haven't had a connection error since.

              On Board 2, I changed the port to a high port. Will need to wait a week or so to see if the problem has gone away, but so far so good.

              Fingers crossed this stops the problem.

              I guess either the plugin or the Arduino is getting confused by unexpected packets that appear on those ports.
              I don't know what the problems are in the 8900 region. Like I said, I could never find any evidence of traffic on the assigned ports, but experience has shown me that high-numbered ports are the safest. HomeSeer tends to put a lot of activity up in the 49,000 region.


                #8
                I am still getting connection errors on board 1. I haven't yet had an error on board 2, although that can go for a week without errors. Board 3 is working perfectly as usual.

                I upgraded to v99, and still see the same issue with board 1 disconnecting.

                Any ideas?

                Is it worth changing the port the server is listening on from 8888? If so, how could I do that?

                cheers,
                Al



                  #9
                  I upgraded to v99, and I'm still getting connection errors on board 1 every few hours that immediately reset.

                  I think I had the board 2 hard crash again this afternoon. However on v99 the board seemed to stay "connected" but the onewire sensors read "error". Interestingly, I tried resetting the board with its reset button, but this did not recover it. I had to fully power cycle the board.

                  I have replaced all 3 onewire sensors and simplified the wiring.

                  I wonder if a onewire sensor is going short circuit or into a funny state causing the Arduino to go into a loop? Reset button presumably doesn't cut the 5v to the sensor, which is why it didn't recover?

                  Any other suggestions? I could leave the debug log running, but would have to do this for a week, causing a large file.



                    #10
                    Checking back in to see if there are any more ideas on how to resolve this. I'm still getting the hard crash on board 2 every couple of days. Board 3 is still working perfectly. Board 1 has not flagged any disconnects since I upgraded to this version of the plugin. Board 4 has an LCD on it, and has been working perfectly for 6 months. I'm now on version .122.

                    To recap, for board 2 I've tried using DHCP, tried a new board, tried using separate PSUs for the arduino and the relay board, tried a different ethernet cable, and tried a different ethernet switch.

                    Any suggestions?



                      #11
                      Originally posted by apluck
                      Checking back in to see if there are any more ideas on how to resolve this. I'm still getting the hard crash on board 2 every couple of days. Board 3 is still working perfectly. Board 1 has not flagged any disconnects since I upgraded to this version of the plugin. Board 4 has an LCD on it, and has been working perfectly for 6 months. I'm now on version .122.

                      To recap, for board 2 I've tried using DHCP, tried a new board, tried using separate PSUs for the arduino and the relay board, tried a different ethernet cable, and tried a different ethernet switch.

                      Any suggestions?
                      Al,
                      I think I am running out of ideas, like you. I have many other users with 3 or more boards who are not having any problems. The only thing I can suggest is to swap the setups of boards 2 and 3 and see if the problem moves. Swap the IP and port on each board to start with; then, if the fault does not change, swap the hardware.

                      Greig
                      Zwave = Z-Stick, 3xHSM100, 7xACT ZDM230, 1xEverspring SM103, 2xACT HomePro ZRP210.
                      X10 = CM12U, 2xAM12, 1xAW10, 1 x TM13U, 1xMS13, 2xHR10, 2xSS13
                      Other Hardware = ADI Ocelot + secu16, Global Cache GC100, RFXtrx433, 3 x Foscams.
                      Plugins = RFXcom, ActiveBackup, Applied Digital Ocelot, BLDeviceMatrix, BLGarbage, BLLAN, Current Cost, Global Cache GC100, HSTouch Android, HSTouch Server, HSTouch Server Unlimited, NetCAM, PowerTrigger, SageWebcamXP, SqueezeBox, X10 CM11A/CM12U.
                      Scripts =
                      Various



                        #12
                        Checking in again to see if any new ideas.

                        I've been using v129 for a while now.

                        I still get the board errors every few days on board 2, only now it doesn't seem to fully crash the board. I can still control outputs from HS.

                        Is it related to OneWire? This is what is in the logs repeated up to nine times each minute.

                        Error = Exception in RecievedRom Value > -100 And Value < 150 : Conversion from string "401792219606000070, value Error " to type 'Long' is not valid.

                        This carries on until I power cycle the board.
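
The log line suggests that a string containing an error marker reached a numeric conversion before the `Value > -100 And Value < 150` range check could apply. A minimal sketch of a more defensive parse is below, in Python and not taken from the plug-in. The function name `parse_temperature` is hypothetical; only the range limits come from the log message.

```python
from typing import Optional


def parse_temperature(raw: str, low: float = -100.0,
                      high: float = 150.0) -> Optional[float]:
    """Return the reading as a float, or None if it is not a number in (low, high)."""
    try:
        value = float(raw.strip())
    except ValueError:
        # Covers payloads such as "401792219606000070, value Error ".
        return None
    # Reject out-of-range values (this also rejects NaN and infinities).
    return value if low < value < high else None
```

With a check like this, a sensor that reports an error yields `None` for that reading, which the caller can skip or flag, instead of raising an exception on every poll.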

                        Just to add, at Greig's suggestion above I tried moving the OneWire sensors onto another board for a couple of weeks. This was before v129. That board did not crash, but the original one, which still had the relay outputs on it, did crash.

                        I also tried changing the power wiring around, so that the relay board (driven through optoisolators) had separate power from the Arduino.

                        No change. Still got the crashes.
                        Last edited by apluck; January 21, 2017, 05:38 AM.



                          #13
                          Originally posted by apluck
                          Checking in again to see if any new ideas.

                          I've been using v129 for a while now.

                          I still get the board errors every few days on board 2, only now it doesn't seem to fully crash the board. I can still control outputs from HS.

                          It seems to be related to OneWire. This is what is in the logs repeated up to nine times each minute.

                          Error = Exception in RecievedRom Value > -100 And Value < 150 : Conversion from string "401792219606000070, value Error " to type 'Long' is not valid.

                          This carries on until I power cycle the board.
                          Can you send me a debug log of when this happens, as I cannot see how this string is ending up in the Value?

                          Can you also try changing your OneWire resolution to 9-bit, as I think I have tracked down a bug in this that may be causing problems for some users.

                          Greig.


                            #14
                            Hi Greig,

                            Debug log attached.

                            The error occurs at 11 seconds past the minute so 10:42:11

                            thanks!
                            Attached Files



                              #15
                              Changed to 9-bit, and the OneWire devices are still in error.

                              I have not yet "reset" or power cycled the board, because if I do, then I can't debug further. I'll have to wait up to a week to see if it crashes again...
