Scanning...

There's nothing shiny in the solar system map. The solemn darkness eats away at your soul.

Monday, December 29, 2014

RLS: How a WiFi Thermostat took down an Exchange Server

RLS stands for "Real Life Stories" and I'll be posting them from time to time when the situation warrants it.  This real life story ruined my plans to play EVE over the weekend, and it's kind of obscure, so I figured it was worth posting.

Looking at the title, you might think this would be some elaborate tale of hacking and mystery.  Wouldn't that be exciting?  But no, that is not the case.  First you should know that in real life I'm an IT Professional.  My experience ranges across everything from software development to desktop support to server management.  I've been at my current job for nearly eight years and am responsible for a variety of things.  Maintaining the servers and network infrastructure is one of my primary functions, although on any given day you could also find me running a web meeting to train customers on our software product.  Such is the nature of working for a small business.

On Friday I expected a quiet day.  My youngest son came to work with me since he didn't have school.  Things went fine in the morning; he entertained himself with the new iPad Mini he got for Christmas.  He hadn't eaten much breakfast, so by 11:00am he was pestering me for lunch.  I gave him a bag of peanuts to buy some time.  He ate half of them and was good for a bit.

I had spent the morning prepping some changes to our Exchange server for a new domain we were adding.  Around noon I rebooted the server and that's when things went haywire.  Our Exchange server is several years old and requires some poking and prodding to come up properly after a reboot.  This time, though, it wasn't working.  The Transport Service wouldn't start, no matter what I did.  Something major was wrong and I snapped into triage mode.  Ignoring all around me, including the time, I started digging through server logs and googling potential problems.

For about an hour I hammered at it.  Some of that time was spent waiting on the slow server to reboot, but mostly I was just trying different tricks.  Enough things can break the Transport Service that the cause wasn't immediately obvious.  My son was getting hungry again, but being in triage mode I wasn't hearing it.  I told him to finish the bag of peanuts and kept working.

Time flies when you tune out the world and are lost in thought.  For nearly another two hours I dodged my son and rambled incessantly at no one in particular as I researched the problem and tried solutions.  Finally, as it approached 3:00pm, I dropped out of my triage bubble and decided I had to do something about my son.  It's hard to convince yourself to leave when a server is down and you're the only one there who can do anything about it.  Luckily my son wanted Taco Bell and it's right around the corner.  Ten minutes later we were back and eating.  I polished off a Mexican Pizza while continuing to research the problem.

About this time it was becoming clear that one of the things that can break the Transport Service is a broken domain replication scheme.  At first I didn't think that could be it, because while I had run into problems like that in the past, they were long since resolved.  At least that's what I thought...

In the past I've found that even though the master servers (the domain controllers) in a Forest (the Microsoft Organizational Entity) should always be able to see and talk to each other, it's incredibly easy to break that replication while the servers themselves still appear to be communicating.  Just because one server can see and access another doesn't mean the services that keep them synced can too.  This can be determined fairly easily by reviewing the logs, but unless something big is happening to get my attention I don't always go through the logs.

Over the next couple of hours I narrowed down the issues.  Was the domain/forest still configured correctly?  It appeared to be.  When was the last successful communication?  Five months ago.  Wait, what?!  How is that possible?  We moved into this building six months ago and everything was working fine at that point!  But the logs don't lie: the last successful replication was near the end of July.  I racked my brain to remember what could have happened around that time.

Still unsure of what to do next, I tested name resolution from the server and found my culprit.  Everything I tried to ping on our internal network was resolving to our outside IP, which was a huge red flag.  With a few more tests the answer became clear: somehow the server was using Google DNS (8.8.8.8 and 8.8.4.4) to resolve names instead of my internal DNS servers.  A little more digging found the reason: a spare network adapter on the server was set up for a completely different subnet and had the Google DNS addresses configured on it.  Somehow the server was defaulting to those DNS servers even though we weren't routing traffic to that subnet at all.  That's when it hit me.  Late July, a weird subnet: both clues pointed to a WiFi thermostat we had bought about a month after we moved in.
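
For anyone who wants to automate that kind of name-resolution check, here is a minimal Python sketch.  It is only an illustration, not what was run on the server that day.  It assumes the internal network uses private address ranges, and the function names and any host names you feed it are hypothetical.  It resolves each name and flags any that come back with a public address, which is the symptom described above.

```python
import ipaddress
import socket


def resolves_internally(address: str) -> bool:
    """Return True if the address falls in a private (non-public) range."""
    return ipaddress.ip_address(address).is_private


def check_names(names, resolve=socket.gethostbyname):
    """Resolve each name and label the result.

    `resolve` defaults to the system resolver; it is a parameter only so
    the check can be exercised without touching the network.
    """
    results = {}
    for name in names:
        try:
            address = resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError
            results[name] = "unresolved"
            continue
        # An internal host answering with a public address is the red flag.
        results[name] = "internal" if resolves_internally(address) else "EXTERNAL"
    return results
```

Run against a handful of internal host names from the affected machine, anything labeled EXTERNAL is a hint that the wrong DNS servers are answering.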

I wanted to look into that further, but getting the server back up was top priority.  It was closing in on 5:00pm and it was already clear I'd be staying late.  My son wasn't happy about that, but such is life.

First I wiped the settings from that adapter, and instantly the name resolution started working again.  That would have been a great solution if I'd noticed the replication problem within 60 days, during which time the server could have worked out all the discrepancies and gotten back to normal.  But past that you have to start over.  So I kicked the other server out of the forest, rebooted the Exchange server and voila, the Transport Service started working again and email was back up.  It was now about 5:30pm and I was mentally fried.
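
The 60-day window is simple date arithmetic, and a rough Python sketch shows how far past it things were.  The 60-day figure is the one from this story and may well differ in other environments.  The exact date of the last good replication is an assumption (the logs only put it near the end of July), and the function names are made up for illustration.

```python
from datetime import date

# Recovery window in days, as described in this post.
# Other environments may use a different value.
RECOVERY_WINDOW_DAYS = 60


def days_since(last_success: date, today: date) -> int:
    """Whole days between the last good replication and today."""
    return (today - last_success).days


def can_self_heal(last_success: date, today: date,
                  window_days: int = RECOVERY_WINDOW_DAYS) -> bool:
    """True if the gap is still inside the recovery window."""
    return days_since(last_success, today) <= window_days


# Assumed date: the post only says "near the end of July".
LAST_REPLICATION = date(2014, 7, 28)
# The Friday of the outage, three days before this post's date.
OUTAGE_DAY = date(2014, 12, 26)

gap = days_since(LAST_REPLICATION, OUTAGE_DAY)
print(f"{gap} days since last replication; "
      f"self-heal possible: {can_self_heal(LAST_REPLICATION, OUTAGE_DAY)}")
```

With those assumed dates the gap is well over twice the window, which is why starting over was the only option.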

I decided that was good enough for the day.  I'd come back on Saturday morning, put the other server back into the forest, and get it functioning.  As we drove home I analyzed how this had happened.

You see, these new WiFi thermostats have to be configured when you get them; they need to join your WiFi network to be useful.  In this case I had used a spare adapter on the server to connect to the thermostat for the initial setup.  I had considered using a spare PC for this, but the server was a quicker solution at the time.  I also could have cleared the settings or disabled the adapter afterwards, but I left it in place in case I needed to reset the thermostat later.  It seemed like a good idea at the time, but now it's just a reminder to avoid messing with anything that could break the servers.

By the time I got home Friday evening, any will I had to play EVE was gone.  I needed slow, easy entertainment at that point.  The challenges of EVE would have overwhelmed me; I'd had enough challenges for one day.  Plus I had to get up the next morning and dive into that mess again.  Luckily the remaining tasks wouldn't take long and I would only be at work for about an hour before returning home and enjoying a lazy Saturday.  So on Friday and for the rest of the weekend I played quite a bit of Blue Dragon with my son and ate yummy things like fresh kielbasa and cheesecake.

The clock is ticking, though, and towers don't fuel themselves.  So I'll be back at it this week, and the normal EVE-related posts will return soon.
