Reducing the Chances of Human Error in Data Centers

By: Josh Taylor | Director – CABLExpress Product Management

 

It’s probably not a good sign when you see #DeltaDown, #DeltaOutage, and #DeltaMeltdown2016 as the top trending topics on Twitter as you’re about to board your plane. It’s even worse to see those if you are part of Delta’s IT team.

After Delta Air Lines canceled hundreds of flights, CNN followed up with an article about the catastrophic “computer” systems crash that led to Delta’s worldwide outage on Monday. The article, titled “The real reason airline computers crash,” reports that the massive outage was caused by “Human error. Mistakes. Good old fashioned screw ups.”

You can’t help but feel for the person or team held accountable for this crash. The weight of responsibility for grounding an entire fleet of planes and virtually slamming the brakes on thousands of people’s plans is haunting in itself. Most of those travelers undoubtedly have a different opinion. This outage caused a lot of distress for everyone involved. There are a few lessons we can all learn from this malfunction.

We’re All Human

If you have any involvement with a data center, you understand the challenges and nightmares that can send your team into a tailspin. At CABLExpress, we have spent a significant amount of time and effort researching potential causes for human error in relation to the product set CABLExpress has developed over the last 37 years.

We provide layer one structured fiber optic cabling solutions for enterprise-class data centers. Guided by our motto, Respect Layer One®, we understand that your layer one cabling is the lifeline of your technology infrastructure, and that disregarding it can cause a lot of different things to go wrong. That’s why we consistently improve the manufacturing processes for our cables, modules and enclosures to continue to set the standard and lead the industry in quality and performance.

We work with businesses just like Delta Air Lines and see the same type of “human error” incidents occur with fiber optic cabling and its related components. We have witnessed human error cause massive outages that shut down credit card transactions on a global scale, costing millions of dollars in penalties. We have seen human error crash hospital systems, creating a ripple effect of delays and confusion for both staff and patients. In organizations large and small, data center downtime caused by human error brings confusion, delay and major inconvenience to the business and all parties involved: customers, employees and stakeholders.

Port Replication to the Rescue

Observation is good, but acting on those observations is what brings forth innovation. CABLExpress did exactly that and developed a product set of port replication fiber optic enclosures.

The concept behind port replication is simple. Standard fiber optic patch panels are modular and scalable, which means you add and adapt the patch panels as the data center fiber cabling grows and changes. Port replication “mirrors” the ports of active fiber optic hardware in the fiber patch panels, creating a direct, one-to-one relationship between the active hardware ports and the passive structured cabling environment.

This allows switches to be cabled once and then replicated in a Main Distribution Area (MDA), simplifying the cabling process with all numbers on the hardware directly corresponding to the numbers on the patch panel. Then any moves, adds and changes (MACs) can be made at the MDA.
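To make the one-to-one idea concrete, here is a minimal Python sketch. The `Port` type and the `replicated_panel_port` function are hypothetical names invented for this illustration; they are not part of any CABLExpress product or tool. The sketch only shows that, under port replication, finding the patch panel position for a switch port requires no translation at all.

```python
from typing import NamedTuple


class Port(NamedTuple):
    """A position on either device (hypothetical model for illustration)."""
    slot: int  # blade slot on the switch, or the mirrored slot on the panel
    port: int  # port number within that slot


def replicated_panel_port(switch_port: Port) -> Port:
    """With port replication, the panel position mirrors the switch position.

    The mapping is the identity: the numbers printed on the hardware are the
    numbers printed on the patch panel.
    """
    return Port(slot=switch_port.slot, port=switch_port.port)


switch_port = Port(slot=4, port=18)
panel_port = replicated_panel_port(switch_port)
print(panel_port == switch_port)  # True: nothing to look up or calculate
```

Because the mapping is the identity, there is no table to maintain and no arithmetic to get wrong when a change is made at the MDA.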

Without Port Replication – Disorganized Mess

We learned very early on that using fiber optic patch panels without port replication can have some devastating effects at critical times. The patch panels are where MACs happen. These MACs typically happen under duress. After all, it is not acceptable for computer systems to be down, so even “scheduled downtime” is hard to come by. The work requires precise and timely execution.

Adding to this, MACs almost always happen during off-hours to minimize user impact (think late nights/weekends, with Mountain Dew and microwave burritos for the IT staff’s sustenance). During these outages, multiple changes from a core switch to a patch panel are made. A correlation needs to be made between the port number on the switch and the patch panel port that it is connected to. Complicating this correlation, these two devices are usually physically far apart.

This requires communication between two people working in concert OR one person walking back and forth. Neither of these is an ideal situation. Mistakes in the correlation between these numbers can certainly happen. It is very easy to make an arithmetic error when translating one port number into the other.

For example, port 18 on blade slot 4 of the core switch will be plugged into patch panel port 6 in module slot 14. Could the cacophony of server fans make that sound like port 16 on blade 1 if you’re yelling directions back and forth? Perhaps the patch panels are labeled. Maybe that information is correct, maybe it isn’t. Maybe someone counted wrong, or just simply made a mistake while writing it down. The potential for human error is quite large at this point and could easily lead to catastrophic downtime in the data center. It's no wonder we've seen and heard about it happening so many times.
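The translation in that example can be sketched in a few lines of Python. The sketch assumes 48-port blades cabled sequentially into 12-port patch panel modules; the article does not state these sizes, but they happen to reproduce the numbers in the example. The constant and function names are hypothetical, chosen only for illustration.

```python
from typing import Tuple

PORTS_PER_BLADE = 48   # assumption: not stated in the article
PORTS_PER_MODULE = 12  # assumption: not stated in the article


def panel_location(blade_slot: int, port: int) -> Tuple[int, int]:
    """Translate a switch port to (module slot, module port) on the panel.

    Assumes blades are cabled to modules sequentially, with both devices
    numbered starting at 1.
    """
    if blade_slot < 1 or not 1 <= port <= PORTS_PER_BLADE:
        raise ValueError("blade slot or port number out of range")
    index = (blade_slot - 1) * PORTS_PER_BLADE + (port - 1)  # zero-based
    return index // PORTS_PER_MODULE + 1, index % PORTS_PER_MODULE + 1


print(panel_location(4, 18))  # (14, 6): the connection from the example
print(panel_location(1, 16))  # (2, 4): what a misheard "port 16, blade 1" gives
```

Under these assumptions, a single misheard instruction lands twelve module slots away from the intended port, and nothing on the panel itself flags the mistake. This is the arithmetic that someone has to do in their head, or shout across the room, during a late-night change window.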

With Port Replication – Cabling Success

That’s why we developed a simple and effective structured cabling solution with superhero powers to eliminate all of these variables. Port replication alleviates human error concerns. When you organize your structured cabling with a one-to-one relationship between hardware ports and patch panel ports, you will reduce downtime, maximize efficiency, and avoid costly disruption and expenses in the future.

We designed a series of patch panels and modules that mirror all major core switch brands and types, from the port numbers to the physical layout and orientation. You can learn more about the port replication concept here.

When combined with a standards-based cabling design, port replication eliminates the need for complicated labeling schemes and databases (which all rely on human input and carry the data integrity issues that come with it). It’s important that we learn from these disastrous outages and realize that human error is a real threat and can cause massive disruptions on a global scale. It is only logical to reduce the potential for these errors as much as possible!