bad apples

So in my free time (HA!) I do some IT consulting work for a few small businesses.  Last month I ran into an interesting problem one of my clients was experiencing.  Felice emailed to report they could not send email from a particular account they have.  This account is set up to receive certain infrequent inquiries from the company’s web site.  They then check the inbox via Outlook, yes infrequently, and may choose to bestow a courteous reply upon the sender.  I was somewhat surprised to hear that she had called Microsoft about this problem sending email and was supposedly told the computer had a virus.  This makes a modicum of sense when you consider Microsoft does offer free consumer support for…. wait for it….. virus problems.  At any rate they did not actually help in any way whatsoever.

I replied with the standard line of interrogation.  Do you have the password for the account?  Do you have access to webmail?  Are the messages sitting in the outbox?  Did you try a reboot?

Felice: no, no, yes, yes of course.

So no way to check easily via another method and no password…… hmm this may complicate things.  I could have tried to obtain an error message, but I happened to be busy working my day job at the time and decided responsibly to table this for the evening.

I took a drive over the Tappan Zee Bridge depicted so beautifully at the top of this page, which incidentally is not AS beautiful during the day when you can see all the rust and wear and tear and doubly not AS beautiful when you’re sitting in traffic.  It IS being replaced.  And luckily I do have my entertaining podcasts to get me through commuting nightmares (yay This American Life!)…. but I digress.

I reached my destination before dusk and got right to business.  Well not RIGHT to business; I always chat with my clients.  I’m not particularly outgoing or skilled in small talk, but I’ve developed just enough rudimentary social graces to “get by”.  So after catching up with Felice I got RIGHT to business.  I checked the send/receive error in Outlook and the error indicated that the outgoing mail server was refusing the connection.  I saw the hostname of the server and it had the word “Barracuda” in it.  My familiarity with the Barracuda SPAM firewall is evidenced by my earlier post on the love/hate relationship of our Barracuda and mail server as well as my collection of black EAT SPAM t-shirts.  I wish all our hardware vendors packaged clothing with their products.  I’d never have to go to the mall again!

Now, I neglected to mention one crucial detail.  While chatting with Felice, she revealed that they’ve had this issue for a week or so before calling me (wow).  More significantly, the emergence of the problem immediately followed a long power outage.  Now this is a small business with a Verizon FiOS internet connection and a dynamic IP address.  So when I hear power outage I immediately think “new IP address”.  So now getting back to the Barracuda, I found the web page to check Barracuda’s SPAM blacklist.  I entered their current IP address and sure enough POSITIVE HIT.  So that seals it.  Another FiOS customer, who had this IP in the past, somehow got themselves on a SPAM blacklist.  Now poor Felice ended up with it by chance and inherited its ugly reputation.
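
For the curious, the web page lookup I did can also be done over plain DNS, which is how mail servers consult these blacklists.  Below is a rough sketch of the usual DNSBL convention: reverse the octets of the IP, append the blacklist's zone, and see whether the name resolves.  The function names are my own, the zone in the usage note (b.barracudacentral.org) is an assumption on my part about Barracuda's public list, and 192.0.2.1 is a documentation address, not Felice's.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone answers with an address for this IP.

    Any resolution failure is treated as "not listed" here, which is a
    simplification: a timeout and a clean "no such name" look the same.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Calling something like is_listed("192.0.2.1", "b.barracudacentral.org") needs a working resolver, and some list operators restrict who may query them, so treat a negative answer with suspicion.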

Thinking back a bit, I do have to give Microsoft some credit.  They may have indeed asked Felice for the error and surmised that HER computer was infected, zombified and spewing out SPAM.  This would be a logical line of deduction.  But A) why didn’t they help her remove the infection?  B) they were wrong anyway, and C) I’m giving Microsoft too much credit.  Oh and D) suck it Microsoft, computers under my care do not become SPAM zombies.

So I could either try to get their email / web host to whitelist their IP in the Barracuda or get them a new IP.  Seeing as their email host is a mom and pop shop with no after-hours support, I opted for the latter option.  I rebooted the FiOS router, checked WhatIsMyIP.com, and same damn IP.  I got Verizon FiOS support on the horn and the technician had me turn the router off and on several times while he “tried something”.  No dice.  After succumbing to the reality that he lacked the power to do something so simple as get us a new IP address, we settled on the cop out of leaving the router turned off overnight.  Surely after 12 hours it would receive a different address.  And it did.  Problem solved.

The sad part about all of this IS……. what if Felice and her small business didn’t have an IT genius like myself?  She wasn’t even clear on who their email provider was and could not find a bill.  Microsoft blamed her.  And there’s no way Verizon would have helped if she simply explained she couldn’t send email.  They would have tested her internet connection and told her to talk to her email provider, after trying to sell her on some unnecessary “business services” including “advanced email hosting” or some such useless crap.  This type of thing should not just happen to people.  They’re just trying to run a business like good Americans, and they innocently inherit a tainted IP address from some porn addict who can’t stop clicking everything bouncing around the screen.  Anyway I feel sorry for the next customer to get that IP.  Maybe I should track them down and send them a business card……..

relationship problems: barracuda and exchange have “communication issues”

I considered for a moment whether to ascribe gender roles to the enterprise tech involved in the spat.  I decided it might make this short story more entertaining.  And let’s face it, people just love anthropomorphizing their technology.  Forgive me for the clichéd gender stereotypes.

Meet the couple

The Barracuda stands guard and provides security so we’ll make him the dude (Arthur).  The Exchange server is far more complex and does the bulk of the work.  Definitely the chick (Anna).  Anna and Arthur have been together for a little over a year.  The relationship has been going well.  But they work together, so that can be trouble.

The fight

It’s hard to know how this particular fight started.  Anna and Arthur would tell completely different stories.  Like most tiffs between young lovers it probably started over something silly.  Anna forgot to lock the door in the morning or Arthur drunk-posted something stupid to Facebook.  I was made aware of the trouble by my boss.  He informed me that people were reporting to him that they were not receiving emails they’d been expecting.  As the tech relationship counselor of the office I sprang into action.

My preconceptions were incorrect

I immediately suspected Arthur was to blame.  He’d been acting erratically as of late.  Translation: We have a 2-node Barracuda cluster and the first node has been flaky.  It’s getting old and needs to be replaced.  I have the replacement sitting in my office, but I need to plan a trip to the datacenter to install it.  Given this knowledge I accessed the management interface of Barracuda 1 (B1) and right away saw something disturbing on the Status page.  The status of the Energize Updates and Instant Replacement subscriptions had a red error code with a message to contact Barracuda support.  Uh oh!  It also had ~500 messages queued up for delivery to our Exchange server.  I rebooted B1.  While it was rebooting I accessed B2.  It had no errors but it DID have ~500 messages queued up just like B1.  So the errors on B1 are a red herring!  Yes B1 is screwed up in general.  But Arthur may not be to blame after all.

Back pressure

My boss had mentioned to me something about “insufficient resources” messages in the Barracuda logs.  Indeed messages were being rejected by our Exchange CAS array with an SMTP code indicating this.  I checked our two Exchange CAS / Hub Transport servers.  I looked through Event Viewer on CAS1.  All clean.  CAS2 was a different story.  I found two events which indicated Exchange was refusing to accept incoming SMTP messages.  This was triggered by a feature called Back Pressure.  Exchange 2010 tracks a bunch of system metrics to determine whether or not it is at risk of serious impairment.  When it detects a dangerous state it backs off on its processing load.  In this case CAS2 decided it was getting too low on disk space on the drive containing its log files.  Never mind there were a few GB free.  Exchange feels that’s not enough.  So it stopped accepting incoming SMTP messages.  The services keep running.  It just writes 2 events to the Application log and sits there silently.  Anna decided to give Arthur the cold shoulder.  So mail was still being delivered by CAS1.  But any mail which happened to hit CAS2 would be rejected.  And the Barracudas would queue it up for a later redelivery attempt.  Since these are virtual machines I simply expanded the disk and restarted the Exchange Transport service.  CAS2 resumed service.
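
To make the "a few GB free is not enough" part less mysterious, here is a minimal sketch of a percentage-based disk check.  I'm assuming the threshold works on percent of disk used, and the 90% figure and function names are made up for illustration.  They are not Exchange's actual values or code.

```python
import shutil


def disk_pressure(free_bytes: int, total_bytes: int, max_used_percent: float) -> bool:
    """Return True when the used share of the disk reaches the threshold."""
    used_percent = 100.0 * (total_bytes - free_bytes) / total_bytes
    return used_percent >= max_used_percent


def path_under_pressure(path: str, max_used_percent: float = 90.0) -> bool:
    """Apply the same check to the volume that holds `path`.

    The 90.0 default is a hypothetical threshold, not an Exchange default.
    """
    usage = shutil.disk_usage(path)
    return disk_pressure(usage.free, usage.total, max_used_percent)


GB = 1024 ** 3
# A 40 GB volume with 3 GB free is 92.5% used, so it trips a 90% threshold
# even though 3 GB sounds like plenty for log files.
tripped = disk_pressure(3 * GB, 40 * GB, 90.0)
```

With a check shaped like this, expanding the disk fixes things for the same reason it did for me: the used percentage drops even though nothing was deleted.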

Not ready to make up

I checked the queues on the Barracudas.  They were starting to go down.  Then I witnessed them bump back up.  Wth?  Back to Event Viewer on the CAS boxes.  Lo and behold they are both dropping connection attempts from both Barracudas.  The reason is that the Barracudas are trying to establish more simultaneous connections than the Receive Connectors will allow.  Argh!  I didn’t find any obvious way to limit the SMTP sessions on the Barracudas.  So I increased the maximum number of sessions allowed on the CAS servers.  The default is 20.  I changed it to 50, which seemed like a reasonable number to me.  This got the couple communicating again.

Lessons learned

Keep plenty of free space on Exchange drives containing DBs or log files.  OR tweak the back pressure disk space thresholds as described at the bottom of the page here.  It involves some simple edits to the EdgeTransport.exe.config file.  Microsoft doesn’t recommend it.  But I don’t trust them anyway (see my previous post involving Network Load Balancing).

I’ll very soon be replacing Barracuda 1, which means Anna will be getting a new boyfriend.  I really hope he’s not a jerk.  But at least in this relationship there’s always Instant Replacement!

why you shouldn’t decommission exchange 2003 in the middle of the day

After reading the post title I know what you’re thinking.  I won’t try to justify the ill-advised nature of this decision.  Suffice it to say if I only scheduled production changes for after-hours I’d either fall hopelessly behind or never sleep.

We have these two old Exchange 2003 servers configured in a cluster.  Let’s call the hostname for this cluster OldMailServer.bruteforce.local.  We already migrated all mailboxes to our 2010 cluster (NewMailServer.bruteforce.local) many months ago.  We were fully aware that many network devices, including scanners, were still pointing to OldMailServer.bruteforce.local or its IP address (2003 IP).  I decided on the following steps to complete the decommission of Exchange 2003.

The plan:

  1. Re-point OldMailServer.bruteforce.local in DNS to 2010 IP.
  2. Assign a new temporary IP address to the 2003 cluster.
  3. Since devices are still sending mail to 2003 IP, we assign this as a secondary IP on the 2010 CAS array.
  4. We must add a static ARP table entry in our data center switches for 2003 IP since it is being shared by two CAS servers in a Windows NLB cluster.
  5. Stop the 2003 cluster group (basically shut down Exchange services).
  6. Wait a week and then uninstall Exchange 2003.

How did this blow up in our faces?

We completed the first 5 steps and the calls started hitting our homicidal Help Desk.  There were two problems being reported:

  1. Many users were getting username / password prompts from Outlook purporting to be from OldMailServer.bruteforce.local!!  Yes the server that is OFF.  The only place that host name exists now is in DNS.
  2. No devices configured to send mail to 2003 IP are able to hit it.

Here we go.  It took a couple hours but we eventually got this ironed out.

So what went wrong with our poor innocent coworkers?

The cause of the username / password prompts from OldMailServer was some sort of reverse DNS function of Outlook.  When I pointed OldMailServer.bruteforce.local to 2010 IP I foolishly allowed it to update the PTR record.  So now 2010 IP has two PTR records.  So if one were to be conversing with NewMailServer, and one was so inclined to do a reverse DNS lookup on its IP, in reply one might get NewMailServer and one might get OldMailServer.  I’m not exactly sure what Outlook was doing here.  But the solution was to kill the PTR record pointing to OldMailServer.  Can someone explain this to me?
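
If you want to poke at this yourself, here is a small sketch of what a reverse lookup involves.  The first function builds the PTR name that gets queried, and the second asks the local resolver for whatever names come back.  The function names and the 10.0.0.25 address are placeholders of mine, not our real 2010 IP, and which name you get when an IP has two PTR records depends on the resolver.

```python
import ipaddress
import socket
from typing import List


def ptr_name(ip: str) -> str:
    """Return the in-addr.arpa (or ip6.arpa) name queried for a reverse lookup."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str) -> List[str]:
    """Return the host names the local resolver reports for an IP.

    An empty list means the lookup failed.  With multiple PTR records the
    resolver may report only one of them, in no guaranteed order.
    """
    try:
        primary, aliases, _addresses = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return []
    return [primary, *aliases]
```

Running reverse_lookup against the mail server's IP a few times before and after deleting the extra PTR record would at least show whether clients could have been handed the old name.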

So what went wrong with the scanners?

Now why was nothing able to hit NewMailServer using the 2003 IP you ask?  If you remember we had configured this as a secondary IP on the 2010 CAS array.  The CAS array is configured to listen and accept traffic on all IPs.  Ok great.  The problem was with step 4 in our plan.  The dreaded static ARP entry.  I noticed that when I added the secondary IP in Windows Network Load Balancing Manager, it created a new MAC address for it (MAC 2).  I asked our Network Engineer to edit the static ARP entry for 2003 IP to point to MAC 2.  This had the effect of making 2003 IP virtually unreachable.

Why I now hate Windows Network Load Balancing (more than before)

I found a server in the data center which could successfully ping 2003 IP.  I checked its ARP table and observed that 2003 IP was associated with MAC 1.  This is the MAC address of 2010 IP, the first IP configured in NLB.  Huh?  OK, let’s change the static ARP entry in the switches to point 2003 IP to MAC 1.  Now both 2003 IP and 2010 IP have static ARP entries pointing to MAC 1.  Guess what?  It works.  Thank you Microsoft.  I should send a handwritten letter to the Windows Server product manager to thank him for this useless new MAC address created for NO REASON.
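
Eyeballing ARP tables gets old fast, so here is a rough sketch that groups IPs by MAC address from pasted ARP output, which makes it obvious when two IPs share one MAC.  The sample text is invented for illustration (made-up addresses, laid out roughly like Windows arp -a output), and the regex is only an assumption about that layout.

```python
import re
from collections import defaultdict
from typing import Dict, List

# An IPv4 address, whitespace, then a MAC written with '-' or ':' separators.
ARP_LINE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})\s+((?:[0-9a-fA-F]{2}[-:]){5}[0-9a-fA-F]{2})"
)


def ips_by_mac(arp_output: str) -> Dict[str, List[str]]:
    """Map each MAC address (lowercase, colon-separated) to the IPs using it."""
    table = defaultdict(list)
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match:
            ip, mac = match.groups()
            table[mac.lower().replace("-", ":")].append(ip)
    return dict(table)


# Hypothetical sample: two cluster IPs answering on the same MAC.
SAMPLE = """
  Internet Address      Physical Address      Type
  10.0.0.25             02-bf-0a-00-00-19     static
  10.0.0.30             02-bf-0a-00-00-19     static
  10.0.0.1              00-11-22-33-44-55     dynamic
"""

shared = {mac: ips for mac, ips in ips_by_mac(SAMPLE).items() if len(ips) > 1}
```

In the sample, shared ends up holding the one MAC that both cluster IPs map to, which is the situation I found on the server that could still ping 2003 IP.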

What did I learn?

Two things really

  1. If you add additional IPs to Windows NLB, they just use the original MAC.  The new MAC addresses created for them are meaningless.  Software Developers call this a bug.
  2. Don’t turn off your old Exchange server in the middle of the day.

Strike that second one.  I looked at my calendar and you wouldn’t believe what I’ve scheduled for this week.