Friday, August 17, 2007

Hosting, Continued

So after spending the better part of my day (and night) working on getting my site back online, it was finally fully operational again around 7 this morning.

HostMySite's support was good - although I was talking to so many different people that it was hard to keep track, and I don't think they were talking much amongst themselves. At my pleading (around 4pm yesterday when it still wasn't online), they set me up with a new account on a different, CF 7 server. So I moved the site to that server, and everything worked fine, with the exception of a few settings, DSNs and the like, that had to be transferred too. Because they set up the new server with my same Control Panel login info, when I would log into the Control Panel to make changes, it would only affect the old site. I had to request all those updates directly from support, requests which they promptly (I was getting emails throughout the night) took care of.

However, the thing that has bothered me most about this ordeal is that with the last round of emails before I moved servers, support told me "it appears the timeout is happening at this CFLOOP tag, so why don't you try optimizing that section of the code and then it should run". Now, I hadn't uploaded anything to the site in days that should cause this sudden slowdown, and there is no part of my site that's complex enough (it's all basic database interaction, mostly selects, with small data sets) to bring a server to its knees like that. So after I took a few deep breaths, I emailed my rebuttal.

First off, the error I'm getting is inconsistent. It's always the same 500 - request timeout error. But the CFLOOP and line of code in the stack trace that it references varies between several different places. I've seen at least three different locations. They are all within the ModelGlue core, except for one that I saw within the ColdSpring core. Now I'm not saying anyone is exempt from bugs, but I just could not buy the idea that ModelGlue and ColdSpring are taking the server out with an infinite loop or some such nonsense. These frameworks are used by hundreds of developers in major applications without such issues. And with all the talent and brains that have gone into developing them, the chance that the framework code, of all the code on my site, is what's causing the problem is about...oh...well...IMPOSSIBLE! No, the problem is that the server chokes at the first somewhat intense processing it comes across while loading the site, and the frameworks just happen to load first.
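To make that reasoning concrete, here is a toy simulation. It is written in Python rather than ColdFusion, and nothing in it is measured from the HMS server: the step names, per-step costs, load factors, and 30-second timeout are all made-up assumptions. It only sketches why a fixed request timeout on a slow server would blame whichever startup step happens to be running when the clock runs out.

```python
def first_timed_out_step(steps, load_factor, timeout_s):
    """Return the name of the step that is running when the request
    timeout is exceeded, or None if the request finishes in time.

    steps       -- list of (name, cost in seconds on an idle server)
    load_factor -- hypothetical slowdown multiplier for a busy server
    timeout_s   -- request timeout in seconds
    """
    elapsed = 0.0
    for name, base_cost_s in steps:
        elapsed += base_cost_s * load_factor
        if elapsed > timeout_s:
            return name
    return None


# Hypothetical startup sequence; the costs are invented for illustration.
STARTUP_STEPS = [
    ("ColdSpring: parse bean XML", 2.0),
    ("ModelGlue: parse config XML", 3.0),
    ("ModelGlue: register event handlers", 4.0),
    ("Site code: run a small SELECT", 1.0),
]

if __name__ == "__main__":
    # Same code every time; only the (assumed) server load changes.
    for load in (1, 4, 8, 20):
        culprit = first_timed_out_step(STARTUP_STEPS, load, timeout_s=30.0)
        print(f"load x{load}: timeout reported in -> {culprit}")
```

In this sketch the reported location moves from one framework step to another as the assumed load changes, and the site's own query is never the one blamed, simply because it runs last. That is the pattern I would expect from an overloaded server rather than from an infinite loop at one particular line.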

This is essentially what I told HMS, along with providing some screenshots of the different stack traces I was getting. Then I begged them - if there was any way they could explain how my code is causing the problem, and provide me with some more details (using the handy dandy CF server monitor that they should have), then I would be more than happy to investigate and fix whatever the problem is. Because I certainly don't want the problem to come back after I switch servers.

However, the answer I got was "I can see what you are saying from the random errors however that was not what I experienced when attempting to troubleshoot the errors.

Please let us know when you have moved the site to the CF7 Server and we will do some comparing to see if we cannot narrow down the exact problem in the code and possibly even fix it."

So they've kindly offered to help, but still haven't provided me with any more details. If you ask me my personal opinion, I'd really like to stick with them on this. They have been (mostly) helpful and have catered to my demands. I'm sure that upgrading to CF8 has not been an easy thing for them, and I know they were offering beta hosting way back when, so they have probably tested the hell out of it. But then business decisions come into play, and my boss (CEO) and IT Director don't have the rapport with them that I do. So I have to make a tough decision to switch hosts.

I was not feeling very comfortable with the idea (I wanted to be sure the problem really was their server, and not my code or CF8). But I talked to Edge Web Hosting, and they helped me feel a little better about it. OK, I talked to a salesperson. But he definitely knew what he was talking about. They have an uptime guarantee that isn't limited to the network and hardware (after some investigation I saw that HMS limits their guarantee to only these items), and they have a maximum of 60 sites/users on any server at once with their Master CF plan. They seem more focused on quality than price, so it's more expensive. But it's still not a bad deal.

Is switching just showing "action" without purpose? Maybe. But I do feel like I'm getting a better plan with Edge. And it's not like I'll never go back to HMS. I'm still going to use CF 7 for the time being, just because I'm a little gun-shy about jumping on the upgrade bandwagon before the host can prove they're ready, and I don't have an immediate need to switch to CF8 on the website.

To be sure that I wasn't going to have these same problems with a new host, and to confirm in my own mind that I'm not holding the host responsible for my errors, I played around with the old site on HMS' server and tried running some CFDUMPs without loading a framework. Voila, they worked. I tried just loading the application and framework but then aborting...500 error/timeout. Now a couple of dumps are hardly a real test of a server under heavy load, but I found it sort of disturbing that I simply could not get a framework to load, but was able to run CF without the frameworks just fine. My best guess is that the server is just really overloaded, and it can handle a couple of variables but not, say, parsing an XML file. I am determined to find out why my ModelGlue site won't run on their CF8 server.

But I'll probably still spend my weekend moving hosts. Argh. I'll keep updating as I figure this out.


Anonymous said...

I am experiencing the same issue with HMS. I am curious to know how the new host is going for you and / or if you have found a good solution to this issue.

raelehman said...

@Anonymous, I haven't gone back to using HMS since this incident, but have had great results with Edge Web Hosting. Also I see the issue pop up from time to time with other HMS customers, and I think the problem has to do with the Java 6 object creation bug compounded by their servers being overloaded. So you might try asking to be moved to a different server?