Sorry, folks, unplanned site outage

Re: Sorry, folks, unplanned site outage

Great to see the site up again. I was starting to feel bewildered: whenever I made search engine queries about large format items that referenced the LFPF site, I kept seeing that the site was still down. You really miss something when you don't have it.

Re: Sorry, folks, unplanned site outage

Thanks for coming back!

Re: Sorry, folks, unplanned site outage

On the technical questions, we are happy with things as they are. More "features" would lead to expenses, which we'd prefer to avoid. The data is all backed up to another location daily (the last 7 days are kept in both places) and so far, after three other outages like this in the past decade, we haven't lost a thing. Our disaster recovery measures seem appropriate to the criticality of the data we store--no one is going to die or go to prison or go bankrupt if we lose all of it.
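For anyone curious what a "keep the last 7 days" rotation amounts to, here is a minimal sketch in Python. It is not the script actually used on the server; the function names and the idea of identifying each backup by its date are assumptions made purely for illustration.

```python
from datetime import date, timedelta

RETENTION_DAYS = 7  # from the post: the last 7 days of backups are kept


def backups_to_keep(backup_dates, today, retention_days=RETENTION_DAYS):
    """Return the backup dates inside the retention window, oldest first.

    The window covers `today` and the (retention_days - 1) days before it.
    """
    cutoff = today - timedelta(days=retention_days - 1)
    return sorted(d for d in backup_dates if cutoff <= d <= today)


def backups_to_prune(backup_dates, today, retention_days=RETENTION_DAYS):
    """Return the backup dates that fall outside the retention window."""
    keep = set(backups_to_keep(backup_dates, today, retention_days))
    return sorted(d for d in backup_dates if d not in keep)


if __name__ == "__main__":
    today = date(2024, 1, 10)  # hypothetical date, for the example only
    # Ten consecutive daily backups ending today.
    dailies = [today - timedelta(days=i) for i in range(10)]
    print("keep: ", backups_to_keep(dailies, today))
    print("prune:", backups_to_prune(dailies, today))
```

Running the same rotation at both locations would give the "last 7 days in both places" arrangement described above.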

Our generous server provider, Brian Reid, asked me to post his take on this:

Quote:

Originally Posted by Brian Reid

The Large Format Photography Forum is hosted on a server in a commercial data center. The server itself is maintained by a person (me) who has decades of experience at running data center operations and maintaining servers. This server is a freebie. I bought it and installed it. The data center does not charge for its presence there and its staff do not maintain the server. I do.

RAID sounds tempting, but after managing thousands of RAID-based servers over the years, I have concluded that RAID reduces reliability. A RAID controller is just one more component that can fail, and because RAID controllers are engineered to squeeze every last drop of performance out of those busy electrons, they are usually right at the edge of failure even when they are working perfectly. I've had vastly more downtime from failed RAID controllers than from failed disks.

Even with a RAID controller, there are other components in the mix. Disks do fail, yes, but so do other electronic components, especially power supplies and fans.

If you need ultra-high reliability, you need at least 2 identical servers, preferably 3. Tandem Computer was issued patents a long time ago on the algorithms needed to do "nonstop computing" on redundant systems. Tandem was bought by Compaq which was in turn bought by HP. Those "HP NonStop" systems are what you use when it really really can't go down.

To get high reliability without using custom nonstop systems, you need to have lots of redundant computers and you need to have people watching them. That's only feasible when you have a large installation.

I've managed the Leica User Group server for 20 years and the LF Photography server for a bit less than 10 years. I try hard not to have excessive downtime. The recent server failure happened while I was on an airplane to see my dying father-in-law. My friends at the data center tried to fix it for me, but the problem was beyond what they were willing to do. It turned out that there was some sort of power surge that damaged the boot disk in a way that also fried that channel of the SATA controller. (Even on RAID-based systems, it is unusual to boot from RAID.) I couldn't really autopsy the machine because I wanted to get it back into service, but my diagnosis is that there was a broken conductor inside the red (5 volt) wire supplying power to the disk. The insulation was intact, but the wire was broken. It took me 4 days to find and fix the problem.

In a commercial server environment I would have dropped that server into the trash pile, spun up a new server, and been back on the air quickly. But I bought that server with $1500 of my own money and I wasn't about to throw it away. I was stubbornly determined to fix it. I can fix pretty much anything in the world of broken computers, but this one had me stumped for days.

My stubbornness and my unwillingness to give up and buy a new server caused you guys a long delay. I apologize.

We will occasionally have outages of more than a day, but so far the ones we've had haven't been of sufficient impact to warrant charging for membership here or adding advertising, either of which would be required if we needed to invest in high-availability infrastructure. The infrastructure we have, thanks to Brian's generosity, works fine 99+% of the time.
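To put the "99+%" figure in perspective, here is a rough back-of-the-envelope calculation in Python. The only numbers taken from this thread are the four outages in about a decade and the 4 days the latest repair took; treating every outage as roughly 4 days long is an assumption, since the lengths of the earlier ones are not given here.

```python
def availability(downtime_days, period_days):
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_days / period_days


def allowed_downtime_days(target, period_days=365.0):
    """Days of downtime per period that a given availability target permits."""
    return (1.0 - target) * period_days


if __name__ == "__main__":
    # A 99% target leaves room for a few days of downtime every year.
    print(f"99% uptime allows {allowed_downtime_days(0.99):.2f} days down per year")

    # Assumption: four outages of about 4 days each over ten years.
    decade_days = 10 * 365.25
    assumed_downtime = 4 * 4
    print(f"assumed availability: {availability(assumed_downtime, decade_days):.2%}")
```

Under that assumption the site comes out comfortably above 99%, which is consistent with the figure quoted above.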

Re: Sorry, folks, unplanned site outage

An admirable sentiment, and amazing generosity on Brian's part. Respect.

Marc!

Re: Sorry, folks, unplanned site outage

Actually, it works pretty well, given the image-intensive nature of the forum and number of users. So thank you Brian for your generosity and hard work, and I'm very sorry to hear about your father-in-law. Given the completely non-critical nature of the forum, a few days away didn't hurt anyone or anything I'm sure.

Re: Sorry, folks, unplanned site outage

Thank you to Brian and Tom (and all concerned) for their tireless efforts in restoring the site. This community is greatly appreciated by me and is a credit to those who work so diligently behind the scenes to make it the success that it is. Kudos.

Re: Sorry, folks, unplanned site outage

thank you guys!!

Re: Sorry, folks, unplanned site outage

Glad to have the site back, thank you for all the work put into making this place work.
Now, remember to feed those LFPF hamsters!

Re: Sorry, folks, unplanned site outage

Great to have it back, folks. And thanks to the tech guys for their efforts!

Re: Sorry, folks, unplanned site outage

Didn't realize how much I would miss this site. Welcome back!