Originally Posted by Brian Reid
The Large Format Photography Forum is hosted on a server in a commercial data center. The server itself is maintained by a person (me) who has decades of experience at running data center operations and maintaining servers. This server is a freebie. I bought it and installed it. The data center does not charge for its presence there and its staff do not maintain the server. I do.
RAID sounds tempting, but after managing thousands of RAID-based servers over the years, I have concluded that RAID reduces reliability. A RAID controller is just one more component that can fail, and because RAID controllers are engineered to squeeze every last drop of performance out of those busy electrons, they are usually right at the edge of failure even when they are working perfectly. I've had vastly more downtime from failed RAID controllers than from failed disks.
Even with a RAID controller, there are other components in the mix. Disks do fail, yes, but so do other electronic components, especially power supplies and fans.
If you need ultra-high reliability, you need at least 2 identical servers, preferably 3. Tandem Computer was issued patents a long time ago on the algorithms needed to do "nonstop computing" on redundant systems. Tandem was bought by Compaq which was in turn bought by HP. Those "HP NonStop" systems are what you use when it really really can't go down.
To get high reliability without using custom nonstop systems, you need to have lots of redundant computers and you need to have people watching them. That's only feasible when you have a large installation.
I've managed the Leica User Group server for 20 years and the LF Photography server for a bit less than 10 years. I try hard not to have excessive downtime. The recent server failure happened while I was on an airplane to see my dying father-in-law. My friends at the data center tried to fix it for me, but the problem was beyond what they were willing to do. It turned out that there was some sort of power surge that damaged the boot disk in a way that also fried that channel of the SATA controller. (Even on RAID-based systems, it is unusual to boot from RAID). I couldn't really autopsy the machine because I wanted to get it back into service, but my diagnosis is that there was a broken conductor inside the red (5 volt) wire supplying power to the disk. The insulation was intact, but the wire was broken. It took me 4 days to find and fix the problem.
In a commercial server environment I would have dropped that server into the trash pile, spun up a new server, and been back on the air quickly. But I bought that server with $1500 of my own money and I wasn't about to throw it away. I was stubbornly determined to fix it. I can fix pretty much anything in the world of broken computers, but this one had me stumped for days.
My stubbornness and my unwillingness to give up and buy a new server caused you guys a long delay. I apologize.