One for the books

Steve Bailey · Apr 28, 2006

Today I was called in to a client to diagnose and fix a problem. Without getting into too much detail, I found that in one rung of the PLC program a normally open contact had been changed to normally closed. That was all. No address changes, no changes in timer presets or anything else. The machine was running fine before they shut it down for the night but when they started it up in the morning they had this problem.

This is a small outfit and there is nobody there that knows anything about PLCs. All they know is that when they have a problem with this machine that can't be explained by a blown fuse, a bent limit switch, or a broken wire, they call me. I can't picture any of this company's employees making the change. Their copy of the PLC programming software is on the same computer that runs their HMI software. The HMI software uses the only available serial port on the PC. That means that in order to use the PLC programming software to make any changes to the PLC program you must first shut down the HMI application before going online with the PLC programming software.

Furthermore, the stored version of the PLC program on the computer was correct. That means that whoever changed it either used a different computer, or went online to change the program in the PLC and then went offline and changed the PC's copy back to the way it was originally. That implies a better knowledge of how to work with the PLC than I'm prepared to attribute to any of the employees.

In the course of my career I've run into cases where a PLC program has become corrupted due to electrical spikes or a failed battery, but whenever that's happened the PLC doesn't keep running. I've never heard of a spontaneous change to a PLC program that resulted in a valid program.

I think I may have entered the Twilight Zone.

katratzi · Apr 28, 2006

OK, let's say no person changed the contact. Is it even remotely possible that the HMI or PLC software could cause something like this?

elevmike · Apr 28, 2006

Steve,

I'd change out the CPU if it happens again. I've seen this happen twice, with the same CPU and the exact same instruction. I figured it was a memory problem and ditched the CPU for a new one.

Steve Bailey · Apr 28, 2006

Katratzi,

The HMI software is launched automatically when the PC boots up. It is not capable of modifying the PLC program. The PLC programming software must be started by a person. Furthermore, both the HMI software and the PLC programming software communicate serially with the PLC. There is no way that both can communicate simultaneously.

Mike,

You're saying that you've seen a PLC program spontaneously change and continue running???? That's the biggest challenge to me. I'm sure there's not much difference in the executable code between a NO contact and a NC contact, but I would expect that the program checksum would be different between the version with the NO contact and the version with the NC contact. That would mean that at the same time that the NO got changed to NC, the program checksum also got changed, and to the correct value. There are a lot more incorrect checksum values than correct values. The odds are against it.

The CPU is a GE Fanuc 90-30 CPU363. When I went online with it, the programming software (VersaPro in this case) told me that the stored version in the PC was not the same as the version that was running in the PLC. Furthermore, it told me which subroutine was different.
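Steve's checksum argument can be illustrated with a small sketch. This is not how the 90-30 actually stores or verifies its program; the thread does not say which checksum the CPU uses. The example assumes a CRC-32 over a made-up eight-byte "program image" purely to show that a one-bit change leaves the stored check value stale, and that a random value matching a 16-bit checksum by chance is a 1-in-65,536 event.

```python
import zlib

# Hypothetical "program image": the byte values are invented for illustration
# and are NOT real GE Fanuc 90-30 machine code.
program = bytes([0x10, 0x01, 0x20, 0x02, 0x30, 0x03, 0x40, 0x04])


def flip_bit(image: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of image with exactly one bit inverted."""
    changed = bytearray(image)
    changed[byte_index] ^= 1 << bit
    return bytes(changed)


# Check value recorded when the program was stored (assumed CRC-32 here).
original_crc = zlib.crc32(program)

# Simulate a single flipped bit, e.g. a NO contact becoming NC.
corrupted = flip_bit(program, byte_index=2, bit=0)
corrupted_crc = zlib.crc32(corrupted)

# A CRC detects every single-bit error, so the old check value no longer matches.
checksum_still_valid = corrupted_crc == original_crc

# If the check value itself were also corrupted at random, the chance that a
# 16-bit checksum lands on the one correct value is 1 in 2**16.
odds_random_16bit_match = 1 / 2**16

print("checksum still valid after bit flip:", checksum_still_valid)
print("odds of a random 16-bit match:", odds_random_16bit_match)
```

Under these assumptions the corrupted image fails verification, which is why a silent, still-running change looks so improbable unless the check is not being applied to the memory that changed.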

elevmike · Apr 28, 2006

Steve Bailey said:
The CPU is a GE Fanuc 90-30 CPU363.

That's it!!! No kidding... On this unit we were also having intermittent problems with the counter. We changed counters & encoders many times. It got to the point that when cycling power the tech would pull and re-plug the counter with a HOT rack! to get the counter to work. Nobody had a PC anywhere near the site. OK, so the counter problem was going on for a long time but nobody told me about it. Then the CPU issue happened twice in one month, and then I found out about the entire story. We put the program in a new CPU and had no problems for over a year. The CPU went back to GE with a detailed note; never heard from them.

Steve Bailey · Apr 28, 2006

This CPU has been in service for over five years and this is the first time anything like this has happened. There are no problems with any of the I/O modules.

Do you remember any of the details about what changed in the program?

elevmike · Apr 28, 2006

Steve,

All I remember in detail is being pi$$ed about not being told about it after maybe a dozen shutdowns in 6 months with the same symptoms. After hours of poking around, the tech contacted me. I went to the site with the laptop and found the differences in the program. I checked it against another identical running unit to make sure it wasn't the saved code on the PC, then changed a single instruction. A few weeks later the whole thing played out again. But this time I noticed the blower on the drive making noise, so we shut down to take the fan out and give it a squirt. When the tech applied power, he nonchalantly pulled out the counter and plugged it back in... Then I started asking questions. We installed the new CPU and never had a problem after that.

This job was installed about 6-7 years back. The CPU issues were about 18 months ago. Winter/fall 2004. They stole one of our trucks parked in front of the building. That really made my day...

elevmike · Apr 28, 2006

BTW, on this same job we installed a Dupline system for hall & car calls. Over the course of the previous years about half the Dupline remotes were replaced also (including the master). That stopped also.

darrenj · Apr 29, 2006

In the last 6 years I have had to download to a processor twice for no apparent reason. Monday night all was well; Tuesday morning only some systems would work. I could not for the life of me figure out what was going on. The last course of action before starting to swap modules was a download. I did this and everything was perfect again. Don't know why... perhaps a transient or a spike corrupted the program? I would look at the logic and it said xxx output was on, but it wasn't. After I reloaded the program (and no, it didn't change, or at least I couldn't see if it did, but then again the program is the size of a major phone book!) everything was good.

I would do as Mike did: change the processor and see what happens.

D

OkiePC · Apr 29, 2006

While I agree that the processor should fault if something corrupted its program, I have seen this happen with a very old system (1977 ISSC PLC IPC210?).

These machines had a watchdog circuit in the last remote rack that was supposed to drop out control power if the program ever failed to complete its scan within 0.5 seconds. They did not use a checksum to verify the program. I don't know anything about the Fanuc.

I had 19 machines with this antique junk and it was so abused that I got to see this kind of thing many times in my two years in that department.

Several times I found bad ribbon cables connecting the I/O as the root cause of program corruption. Somehow a bad ribbon cable could affect the program area of memory in these controllers. At least 5% of those occurrences did not cause the program to hang, so it would continue to run with the corrupted program.

I once had a tire assembling machine that would dance around like it was possessed as soon as you reset the e-stop. All of the outputs would flicker off and on in a seemingly random pattern. Many times I could monitor logic and find corrupted sections that were still functional and hard to detect without great familiarity with the original code. It was very unsafe, and I am relieved that we finally got rid of that junk.

These occurrences usually followed thunderstorms or plant shutdowns during which the power may have spiked or browned out.

Anyway, it may be possible, however rare, that a real electrical problem corrupted your program.

I think it is also possible that any device connected to it with a communication cable could be the culprit. Even though the HMI is not intended to be a programming terminal, isn't it possible that noise on the cable changed an innocent read instruction into a write?

I am not at all familiar with the processor you are using... just throwing my 2 cents in. This is one of those strange and interesting things that may never happen again to provide you with closure on the subject.

Paul

tom_stalcup · Apr 30, 2006

Just a couple of ideas...

The PLC doesn't have the program stored in an EEPROM of some kind, does it? I have run into issues where the program running in the processor didn't match the one stored in the EEPROM... But 5 years is a long time for something like that to hide out, so it's not very likely.

Another WAG.... I went out to a machine one day that wouldn't run, and found out that one of the bits was backwards...
I did some investigating, and 6 months before that one of the switches had gone out, and someone edited the online program to get it running (because they couldn't figure out what the switch was supposed to do). Flash forward 6 months, and the sticky contact on the switch started working again, and now the machine mysteriously wouldn't run for no apparent reason.....

Sparkz · Apr 30, 2006

Is it correct to say that a checksum is performed on the program stored in (E)EPROM, not after the program is transferred to RAM at startup?

The difference in opcode between AND/NAND, OR/NOR, etc., is just 1 bit. If there's no RAM test at startup (???), you end up in situations like this.

Otherwise it's more likely that somebody's been messing with it, like Tom mentioned.
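The scenario Sparkz raises can be sketched as follows. Everything here is hypothetical: the opcode values, the four-byte program, and the additive 16-bit checksum are invented for illustration and do not describe any real PLC. The sketch only shows the gap he is asking about: if the checksum is verified against the EEPROM copy at power-up and never again against RAM, a later one-bit flip in RAM turns a normally open contact into a normally closed one without tripping any check.

```python
# Hypothetical opcodes that differ in a single bit (not real PLC encodings).
OP_NO = 0b0000_0100  # examine a normally open contact
OP_NC = 0b0000_0101  # examine a normally closed contact


def additive_checksum(image: bytes) -> int:
    """Simple 16-bit additive checksum, assumed purely for illustration."""
    return sum(image) & 0xFFFF


# Program as stored in non-volatile memory, with its recorded checksum.
eeprom = bytes([OP_NO, 0x21, 0x30, 0x07])
stored_checksum = additive_checksum(eeprom)

# Power-up: the EEPROM copy is verified once, then copied into RAM.
assert additive_checksum(eeprom) == stored_checksum
ram = bytearray(eeprom)

# Some time later, one bit in RAM flips (bad cell, spike, and so on).
ram[0] ^= 0b0000_0001

ram_opcode_is_nc = ram[0] == OP_NC
eeprom_still_passes = additive_checksum(eeprom) == stored_checksum
ram_would_fail_if_checked = additive_checksum(bytes(ram)) != stored_checksum

print("RAM contact is now NC:", ram_opcode_is_nc)
print("EEPROM copy still passes:", eeprom_still_passes)
print("RAM copy would fail a recheck:", ram_would_fail_if_checked)
```

In this sketch the corruption is detectable, but only by a check that is actually run against the RAM copy after the flip.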

Terry Woods · Apr 30, 2006

Very spooky, Steve. (Maybe...)

I ran into a situation that had very similar results, although the potential causes in my case were not nearly as restricted as you indicated in your case.

The situation involved an Input, namely a Float Switch. The original design and original code called for the switch to be a Normally Open type. One day the switch went bad - it's real easy to "know", without going into the program, when that switch goes bad. The switch was replaced with what was identified on the drawings as the proper type. Guess what... the process didn't work right.

Accessing the PLC, I studied the code... I found that the contact in the code was not as I expected. When I create names for Inputs I typically include the switch-type (N/O or N/C). Seeing that the name indicated Normally Open, and knowing that the switch-type was supposed to be Normally Open, I could see that the code could NOT work! I wondered... how could it have been working? I checked the saved code on the PC. The saved code was as I expected it to be - the names matched, however, the contact-types were opposite.

We then checked out the switch that was removed and found that it was indeed the wrong type. I returned the code to normal and all was well.

Apparently, at some point earlier, the switch had gone bad and was replaced with a Normally Closed type, by "whomever". Then the code had to have been changed to support that type of switch - otherwise it couldn't work. The change was not saved to the PC and the "whomever" didn't let anyone know what had been done.

A little research showed that the particular brand of float switch was available only as normally open or normally closed... no combination types were available. So, if you ran out of one type... you were kinda stuck with using the other. So, I reckon, there were no Normally Open switches available and so it was replaced, by "whomever", with a Normally Closed one. And, of course, the code had to be changed to support that type.

By the time the latest failure occurred, replacement parts had been ordered and stored on the shelf.

Sometimes, the simplest explanations are the best... Occam's razor

More often than not, nobody admits to doing anything. And it isn't worth the effort to find someone to blame - not that you are looking for someone to blame. Although it sure would be nice to know the truth!

Regarding the HMI question...

It occurred to me that, while maybe not the greatest way to handle it, the situation could be designed to use either switch-type without going into the code. Using an HMI, the switch-type could be identified as Normally Open or Normally Closed (i.e., NOT Normally Open). The code would then react to the input as the identified-type.

That is...
If the type is identified as Normally Open and the signal is ON...
-OR-
If the type is identified as NOT Normally Open (Normally Closed) and the signal is OFF...
...then the named condition (Tank Level is At, or Above, Level-X) is ASSERTED via Control Relay.

In the code, the process responds to the state of the Control Relay rather than the state of the Input. In that case, I would have to remove the N/O, N/C indication from the Input name.
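Terry's configurable switch-type logic can be written out as a small truth function. This is a sketch in Python rather than ladder logic, and the names `level_reached`, `switch_is_normally_open`, and `input_on` are made up for illustration; in the PLC the result would drive the Control Relay he describes.

```python
def level_reached(switch_is_normally_open: bool, input_on: bool) -> bool:
    """Return True when the tank level is at or above Level-X.

    Mirrors the two parallel branches described in the post:
      - type identified as Normally Open AND signal ON, or
      - type identified as NOT Normally Open (Normally Closed) AND signal OFF.
    """
    no_branch = switch_is_normally_open and input_on
    nc_branch = (not switch_is_normally_open) and (not input_on)
    return no_branch or nc_branch


# Print the full truth table for the four possible combinations.
for is_no in (True, False):
    for signal in (True, False):
        kind = "N/O" if is_no else "N/C"
        state = "ON" if signal else "OFF"
        print(f"{kind} switch, input {state}: asserted = {level_reached(is_no, signal)}")
```

The two branches reduce to a single comparison (the condition is asserted exactly when the configured type and the signal state agree), but keeping them separate matches how the rung would be drawn.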

danw · Apr 30, 2006

Mystery program changes can account for Mr. Bailey's situation, but I can totally believe it's a CPU failure.

Nowadays memory failures seem few and far between (I found a Compact Flash card 6 months ago, new out of the wrapper, that was bad) but it hasn't always been that way. In the late '70s I did a stint with Data General in Framingham, Mass. when minicomputers were wirewrapped backplanes, but memory had progressed to silicon dynamic RAM. Lots of problems with memory chips that wouldn't quite hold a bit until the next refresh cycle.

The worst cases were the ones that would initially pass, but fail over time in the field. It only takes one bad bit, whether in the op code or the operand, to screw things up.

Steve's experience sounds just like the flakey memory bit stories of the late '70s.

Dan

Peter Nachtwey · Apr 30, 2006

danw said:
Steve's experience sounds just like the flakey memory bit stories of the late '70s.

Dan

Yes, and the problem hasn't gone away. RoHS and the desire to make smaller and faster chips push the limits of technology. Samsung has made a faulty static RAM for a few years. They only admitted it to us around the beginning of the year. They posted a letter on the 1st of December of last year. The Samsung document is full of weasel words and doesn't admit to much. They blame the manufacturers for improper design that loops on the same memory location, so that memory locations fail after 1 to 3 years. Well, excuse me for applying power to the chip. I know we don't loop on a single memory location.

I wonder if Samsung uses their own chips. Can you imagine buying a LCD TV and it fails after a year or two because of a bad RAM? What if the RAM fails just out of warranty?

It is not unlikely that there are many chips that suffer from similar problems as the designs for memory chips get smaller and smaller and the chip substrates break down.

As a manufacturer we see many problems with chips, resistor packs, capacitors and oscillators. For the most part these manufacturers just make the parts and ship them and rely on those that put the boards together to be their quality control.

Some problems will not show up for years. It costs too much to check each part by itself. I would hope that they at least check a few parts out of a batch.
