Impress Your Friends with Wireshark and TCPDump!

For IT generalists like me, who work in a wide breadth of disciplines and tackle different types of challenges day to day, Wireshark is kind of like the “Most Interesting Man in the World” from the Dos Equis beer commercials.  Remember how he doesn’t usually drink beer, but when he does it’s Dos Equis?  Well, I don’t usually need to resort to network packet captures to solve problems, but when I do I always use Wireshark!  Dos Equis is finally dumping that ad campaign, by the way.

The ability to capture raw network traffic and perform analysis on the data captured is an absolutely vital skill for any experienced IT engineer.  Sometimes log files, observation and research aren’t sufficient.  There is always blind guessing and intuition, but at some point a deep dive is needed.

This tale started amidst a migration of all our VMs – around 130 – from one vSphere cluster to another.  We have some colo space at a data center and we’ve been moving our infrastructure from our colo cabinets to a “managed” environment in the same data center.  In this new environment the data center staff are responsible for the hardware and the hypervisor.  In other words, it’s an Infrastructure as a Service offering.  Over the course of a couple months we worked with the data center staff to move all the VMs using a combination of Veeam and Zerto replication software.  One day early in the migration, our Help Desk started receiving reports from remote employees that they could not VPN in.  What we found was that for periods of time anyone trying to establish a new VPN connection could not.  It would just time out.  However, if the person kept trying and trying (and trying and trying) it would eventually work.  Whenever I get reports of a widespread infrastructure problem I always first suspect any changes we’ve recently made.  Certainly the big one at the time was the VM migrations, though it wasn’t immediately obvious how one might be related to the other.

Our remote access VPN utilizes an old Juniper SA4500 appliance in the colo space.  Employees use either the Junos Pulse desktop application or a web-based option to connect.  I turned on session logging on the appliance and reproduced the issue myself.  Here are excerpts from the resulting log.

VPN Log 1

VPN Log 2

The first highlighted line shows that I was authenticated successfully to a domain controller.  The second highlighted line reveals the problem.  My user account did not map to any roles.  Roles are determined by Active Directory group membership.  There are a couple of points in the log where the timestamp jumps a minute or more.  Both occurrences were immediately preceded by the line “Groups Search with LDAP Disabled”.

A later log, when the problem was not manifesting itself, yielded this output.

VPN Log 3

There are many lines enumerating my user account group membership.  After this, it maps me to my proper roles and completes the login.  So it appears that the VPN appliance is intermittently unable to enumerate AD group membership.
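If you want to sanity-check that same lookup outside the appliance, a few lines of Python do it.  This is a minimal sketch using the third-party ldap3 library; the DC hostname, service account, and username are placeholders, not our real values.

```python
# Sanity-checking the group lookup by hand with the third-party ldap3
# library.  DC hostname, service account and username are placeholders.
from ldap3 import Server, Connection, SUBTREE

server = Server("dc1.therapy.sacnrm.local", port=389)
conn = Connection(server, user="svc-vpn@therapy.sacnrm.local",
                  password="********", auto_bind=True)

# Pull the connecting user's group memberships, as the appliance would.
conn.search("dc=therapy,dc=sacnrm,dc=local",
            "(sAMAccountName=jsmith)",
            search_scope=SUBTREE,
            attributes=["memberOf"])

for entry in conn.entries:
    for group_dn in entry.memberOf:
        print(group_dn)         # these DNs are what map to VPN roles
```

If this query hangs or fails intermittently while a plain bind succeeds, you’re seeing the same symptom the appliance logged.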

We had migrated two domain controllers recently to the managed environment.  I made sure the VPN appliance had good network access to them.  We extended our layer 2 network across to the managed environment, so the traffic would not traverse a firewall or even a router.  No IPs changed.  I could not find any issue with the migrated DCs.  Unfortunately the VPN logs did not provide enough detail to determine the root cause of the problem.

As I poked around, I noticed that the VPN appliance had a TCPDump function.  TCPDump is a popular open source packet analyzer with a BSD license.  It utilizes the libpcap library for network packet capture.  I experimented with the TCPDump function by turning it on and reproducing the problem.  The appliance produces a capture file when the capture is stopped.  This is when I enlisted Wireshark – to open and interrogate the TCPDump output file.  The TCPDump file, as expected, contained all the network traffic to and from the VPN appliance.  It should be noted that I could have achieved a similar result by mirroring the switch port connected to the internal port of the VPN appliance and sending the traffic to a machine running Wireshark.  Having the capture functionality integrated right into the VPN appliance GUI was just more convenient.  Thanks Juniper!
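If you do go the mirror-port route, the capture side is simple.  Here’s a rough sketch driving tcpdump from Python on the capture box; the interface name and appliance IP are placeholders for this example.

```python
# Capture everything to/from the appliance on the mirror port and write
# a pcap that Wireshark can open.  Interface and IP are placeholders.
import subprocess

APPLIANCE_IP = "10.1.1.25"    # hypothetical internal IP of the SA4500

subprocess.run([
    "tcpdump",
    "-i", "eth0",              # NIC cabled to the mirrored switch port
    "-s", "0",                 # grab full packets, not just headers
    "-w", "sa4500.pcap",       # raw output file for Wireshark
    "host", APPLIANCE_IP,      # filter: only the appliance's traffic
], check=True)
```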

I was able to follow the sequence I had observed in the VPN client connection log, but at the network packet level.  Hopefully this level of detail would reveal something I couldn’t see in the other log.  As I scrolled along, lo and behold, I saw the output in the excerpt below.

SA4500 TCPDump Excerpt

The VPN appliance is sending and re-transmitting unanswered SYN packets to two IPs on the 172.22.54.x segment.  “What is this segment?”  I thought to myself.  Then it hit me.  This is the new management network segment.  Every VM we migrate over gets a new virtual NIC added on this management segment.  I checked the two migrated domain controllers, and their management NICs indeed were configured with these two IPs.  And there is no way the VPN appliance would be able to reach these IPs, as there is no route from our production network to the management network.  The new question was WHY was it reaching out to these IPs?  How did it know about them?  And that’s when I finally checked DNS.
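A quick aside before the DNS reveal: picking this pattern out of a large capture by eye is tedious.  A few lines of Python using the scapy library can flag unanswered SYNs automatically; the capture filename here is a placeholder.

```python
# Sketch: flag TCP connections in the capture that sent SYNs but never
# got a SYN-ACK back.  Requires scapy; "sa4500.pcap" is a placeholder.
from collections import Counter
from scapy.all import rdpcap, IP, TCP

syns, synacks = Counter(), Counter()

for pkt in rdpcap("sa4500.pcap"):
    if IP in pkt and TCP in pkt:
        flags = pkt[TCP].flags
        if (flags & 0x02) and not (flags & 0x10):          # bare SYN
            syns[(pkt[IP].src, pkt[IP].dst, pkt[TCP].dport)] += 1
        elif (flags & 0x12) == 0x12:                       # SYN+ACK
            synacks[(pkt[IP].dst, pkt[IP].src, pkt[TCP].sport)] += 1

for (src, dst, port), count in syns.items():
    if synacks[(src, dst, port)] == 0:
        print(f"{count} unanswered SYN(s): {src} -> {dst}:{port}")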

Bad Facility LDAP Entry (Redacted)

This is the zone corresponding to one of the migrated DCs.  I’ve redacted server names.  As you can see, the highlighted entry is the domain controller’s management IP.  The server registered it in DNS as a host (A) record for the domain name itself.  Any host doing a query for the domain name, in this case therapy.sacnrm.local, has a one in three chance of resolving to that unreachable management IP.  Then I found this.

Bad GC Entry

The servers were also registering the management IPs as global catalogs for the forest.  This was what was tripping up the VPN appliance.  It was performing a DNS lookup for global catalogs to interrogate for group membership.  The DNS server would round-robin through the list and at times return the management IPs.  The VPN appliance would then cache the bad result for a time and no one could connect because their group membership could not be enumerated and their roles could not be determined.  This is a good point in the series of events to share a dark secret.  When I’m working hard troubleshooting an issue for hours or days, there is a small part of me that worries that it’s really something very simple.  And due to tunnel vision or me being obtuse, I’m simply missing the obvious.  I would feel pretty embarrassed if I worked on an issue for two days and it turned out to be something simple.  It has happened before and this is where a second or third set of eyes helps.  At any rate, this was the point when I realized that the issue was actually somewhat complicated and not obvious.  What a relief!
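You can actually watch the round-robin hand out the bad address with a few lines of Python.  This sketch resolves the domain name repeatedly (from a host pointed at the AD DNS servers) and flags any answer that lands in the unroutable management subnet.

```python
# Resolve the domain name a few times and flag answers that fall in the
# unreachable management segment from the capture (172.22.54.0/24).
import socket
from ipaddress import ip_address, ip_network

MGMT_NET = ip_network("172.22.54.0/24")

for i in range(10):
    _, _, addrs = socket.gethostbyname_ex("therapy.sacnrm.local")
    first = addrs[0]            # what a naive client would connect to
    if ip_address(first) in MGMT_NET:
        print(f"try {i}: {first}  <- unreachable management IP")
    else:
        print(f"try {i}: {first}")
```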

For my next move I tried deleting the offending DNS records, but they would magically reappear before long.  Having now played DNS whack-a-mole, I do not think I would do well at the county fair.  I’d rather shoot the water guns at the targets or lob ping pong balls into glasses to win goldfish.  My research revealed that the NetLogon service on domain controllers registers these entries and will replace them if they disappear.  Here’s a Microsoft KB article on the issue.  There is a registry change that prevents this behavior.  We had to manually make this change on our DCs to permanently resolve the issue.
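For the curious, here’s roughly what that registry change looks like scripted with Python’s built-in winreg module.  The value name and mnemonics below are the ones documented for NetLogon’s DNS registrations; verify them against the KB before touching your own DCs, because this stops the DC from registering those records at all.

```python
# Run on each DC as an administrator.  Value name and mnemonics are
# from the Microsoft KB on NetLogon DNS registration; double-check
# them against the article before using this sketch.
import winreg

KEY = r"SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY, 0,
                    winreg.KEY_SET_VALUE) as key:
    # Stop NetLogon from registering the domain-apex and GC A records.
    winreg.SetValueEx(key, "DnsAvoidRegisterRecords", 0,
                      winreg.REG_MULTI_SZ,
                      ["LdapIpAddress", "GcIpAddress"])
```

After making the change, restart the NetLogon service and delete the stale records one last time; this round of whack-a-mole should be the final one.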

So this was a couple days of my life earlier this year.  I was thrilled to figure this out and restore consistent remote access.  Of course, in hindsight, I wish I had checked DNS earlier.  And I was a bit disappointed that our managed infrastructure team was not familiar with this behavior.  But it was a great learning experience and Wireshark surely saved my bacon.  Time for a much-deserved Dos Equis!  Stay thirsty my friends.

bad apples

So in my free time (HA!) I do some IT consulting work for a few small businesses.  Last month I ran into an interesting problem one of my clients was experiencing.  Felice emailed to report they could not send email from a particular account they have.  This account is set up to receive certain infrequent inquiries from the company’s web site.  They then check the inbox via Outlook, yes infrequently, and may choose to bestow a courteous reply upon the sender.  I was somewhat surprised to hear that she had called Microsoft about this problem sending email and was supposedly told the computer had a virus.  This makes a modicum of sense when you consider Microsoft does offer free consumer support for…. wait for it….. virus problems.  At any rate they did not actually help in any way whatsoever.

I replied with the standard line of interrogation.  Do you have the password for the account?  Do you have access to webmail?  Are the messages sitting in the outbox?  Did you try a reboot?

Felice: no, no, yes, yes of course.

So no way to check easily via another method and no password…… hmm, this may complicate things.  I could have tried to obtain an error message, but I happened to be busy working my day job at the time and responsibly decided to table this for the evening.

I took a drive over the Tappan Zee Bridge depicted so beautifully at the top of this page, which incidentally is not AS beautiful during the day when you can see all the rust and wear and tear and doubly not AS beautiful when you’re sitting in traffic.  It IS being replaced.  And luckily I do have my entertaining podcasts to get me through commuting nightmares (yay This American Life!)…. but I digress.

I reached my destination before dusk and got right to business.  Well not RIGHT to business; I always chat with my clients.  I’m not particularly outgoing or skilled in small talk, but I’ve developed just enough rudimentary social graces to “get by”.  So after catching up with Felice I got RIGHT to business.  I checked the send/receive error in Outlook; it indicated that the outgoing mail server was refusing the connection.  I saw the hostname of the server and it had the word “Barracuda” in it.  My familiarity with the Barracuda SPAM firewall is evidenced by my earlier post on the love/hate relationship of our Barracuda and mail server, as well as my collection of black EAT SPAM t-shirts.  I wish all our hardware vendors packaged clothing with their products.  I’d never have to go to the mall again!

Now, I neglected to mention one crucial detail.  While chatting with Felice, she revealed that they’d had this issue for a week or so before calling me (wow).  More significantly, the emergence of the problem immediately followed a long power outage.  This is a small business with a Verizon FiOS internet connection and a dynamic IP address.  So when I hear power outage I immediately think “new IP address”.  Getting back to the Barracuda, I found the web page to check Barracuda’s SPAM blacklist.  I entered their current IP address and sure enough POSITIVE HIT.  So that seals it.  Another FiOS customer, who had this IP in the past, somehow got themselves on a SPAM blacklist.  Poor Felice ended up with it by chance and inherited its ugly reputation.
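For the record, a blacklist check is nothing more than a DNS lookup: reverse the IP’s octets, append the list’s zone, and query.  An answer means listed; NXDOMAIN means clean.  A minimal sketch using Barracuda’s public reputation zone and a documentation-range placeholder IP:

```python
# Check an IP against a DNS blacklist.  The zone is Barracuda's public
# reputation list; the IP below is a placeholder from TEST-NET-3.
import socket

def is_listed(ip: str, zone: str = "b.barracudacentral.org") -> bool:
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        socket.gethostbyname(f"{reversed_ip}.{zone}")
        return True        # any A record back means "listed"
    except socket.gaierror:
        return False       # NXDOMAIN means "not listed"

print(is_listed("203.0.113.45"))
```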

Thinking back a bit, I do have to give Microsoft some credit.  They may have indeed asked Felice for the error and surmised that HER computer was infected, zombified and spewing out SPAM.  This would be a logical line of deduction.  But A) why didn’t they help her remove the infection?  B) they were wrong anyway and C) I’m giving Microsoft too much credit.  Oh and D) suck it Microsoft, computers under my care do not become SPAM zombies.

So I could either try to get their email / web host to whitelist their IP in the Barracuda or get them a new IP.  Seeing as their email host is a mom and pop shop with no after-hours support, I opted for the latter.  I rebooted the FiOS router, checked WhatIsMyIP.com, and same damn IP.  I got Verizon FiOS support on the horn and the technician had me turn the router off and on several times while he “tried something”.  No dice.  After succumbing to the reality that he lacked the power to do something as simple as getting us a new IP address, we settled on the cop-out of leaving the router turned off overnight.  Surely after 12 hours it would receive a different address.  And it did.  Problem solved.

The sad part about all of this IS……. what if Felice and her small business didn’t have an IT genius like myself?  She wasn’t even clear on who their email provider was and could not find a bill.  Microsoft blamed her.  And there’s no way Verizon would have helped if she simply explained she couldn’t send email.  They would have tested her internet connection and told her to talk to her email provider, after trying to sell her on some unnecessary “business services” including “advanced email hosting” or some such useless crap.  This type of thing should not just happen to people.  They’re just trying to run a business like good Americans, and they innocently inherited a tainted IP address from some porn addict who can’t stop clicking everything bouncing around the screen.  Anyway I feel sorry for the next customer to get that IP.  Maybe I should track them down and send them a business card……..

the reluctant print server

I’m not a big fan of printing and working with paper.  I like to keep my desk and drawers free of clutter.  A number of years ago I worked for a municipal government.  I’ll never forget the first time I laid eyes on the desk of one of our building inspectors.  It was entirely covered in a disorganized blob of papers several inches tall.  Whatever was on the bottom of that mess is likely still there.  Obviously the experience stuck with me.

A few weeks ago I was VERY surprised to find out one of our print servers decided, in spite of its nature, that my way is the better way, the way of the future.  It respectfully refused to honor the requests of our staff to use consumables and contribute to deforestation.  While the print server earned my respect and admiration for its idealistic stand, I had to dash its dreams.  For it is the sysadmin’s responsibility to keep the toner and paper fusing, as it were.

I was first informed of the issue by the IT Support team late one afternoon.  People were reporting that their printers had disappeared in Citrix.  I took a look at their print server (called PS01), which is used by around 1,500 people.  When I first terminal’d in, the spooler was running.  But upon inspection of the server logs I could see that the spooler was crashing every 1 to 5 minutes.  After each crash our monitoring software would dutifully detect the failure and restart the spooler.  Then it would again crash.  Windows Event Viewer unfortunately did not provide any specifics regarding the cause.
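For the curious, that dutiful monitoring behavior boils down to a loop like this sketch, polling the service with the built-in sc utility; the 30-second interval is made up for the example.

```python
# Watchdog sketch: poll the Spooler service and restart it if it dies.
# Uses the built-in Windows 'sc' utility; run with admin rights.
import subprocess
import time

def spooler_running() -> bool:
    out = subprocess.run(["sc", "query", "Spooler"],
                         capture_output=True, text=True)
    return "RUNNING" in out.stdout

while True:
    if not spooler_running():
        print("Spooler down - restarting")
        subprocess.run(["sc", "start", "Spooler"])
    time.sleep(30)
```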

My first instinct was to clear all files out of the spool folder (C:\Windows\System32\spool\PRINTERS).  Often these repeated spooler crashes are caused by a print job which bugs out the driver.  Not this time.  My next move was to reboot the server and reach out to Support to see if anyone had installed a new printer that day.  The spooler continued to crash after the reboot.  But I got a response back from one of the Support Analysts, who had installed a printer earlier in the afternoon.  I had noticed in Event Viewer that she had logged onto the server a few hours earlier.  I deleted the printer she had installed, but again no joy.  By this time it was well after 5:00.  I turned on logging of all print events, including informational events, in the print server properties.  I figured I could then check the logs to see which print job was crashing the spooler.  I soon came to the realization that the spooler would crash regardless of print activity.  It would crash every few minutes even if no jobs were being processed.
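Incidentally, that spool-folder purge from my first step is worth scripting if you end up doing it often.  A minimal sketch (run as administrator, using the built-in net commands):

```python
# Stop the spooler, purge stuck jobs from the spool folder, restart it.
# Run as administrator.
import subprocess
from pathlib import Path

SPOOL = Path(r"C:\Windows\System32\spool\PRINTERS")

subprocess.run(["net", "stop", "Spooler"], check=False)
for f in SPOOL.glob("*"):
    if f.is_file():
        f.unlink()             # .SPL/.SHD files from stuck jobs
subprocess.run(["net", "start", "Spooler"], check=True)
```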

At this point I got desperate and resorted to attaching a debugger to the spoolsv.exe (print spooler) process.  I used adplus from the Windows Debugging Tools as described in this article.  I waited a couple minutes for the spooler to crash and then examined the log file.  Unfortunately the file made no reference to a print driver or any cause of the crash.  The adplus tool also left a bunch of dump (.dmp) files behind.  I fed some of these into WinDbg.  No cause identified.  I repeated this process again with another crash.  The resulting log files were similarly unhelpful.  This is getting ridiculous, I thought to myself.

I had taken a break to drive home and eat dinner.  So now it was quite late in the evening.  I decided to call and open a ticket with Microsoft and continue to work on the problem while waiting for a callback from an engineer.  After getting off the phone I finally turned to my #1 favorite Windows diagnostic tool….. Process Monitor.  This tool was developed by Mark Russinovich and the Sysinternals team, which was acquired by Microsoft a couple years ago.  Process Monitor captures all file system, registry, network and process activity on a system.  I ran it until the spooler crashed.  I then filtered to show only events from the spoolsv.exe process.  I scrolled to the time of the crash and here’s what I saw.

Process Monitor – Spooler Process Exiting

You can see here where the process is exiting.  Now all I had to do was scroll up and see what it was doing just prior to the crash.  Lo and behold I observed MANY lines referencing a particular printer and driver.

Process Monitor – Activity Referencing the HP Driver

I saw hundreds of lines just prior to the crash which referenced a particular HP LaserJet 2030 series printer.  In Server Manager I sorted the printer list by driver and observed that this was the only printer on the server using this particular driver.  So I assume someone had installed this printer earlier in the day and loaded the new driver.  I tried to look at the properties of the printer, but the window would freeze when I brought them up.  So I deleted the printer and removed the driver through the Print Server Properties.  The crashes ceased.
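I did this through the GUI, but the same cleanup can be scripted through printui.dll if it ever needs repeating across servers.  Here’s a sketch with placeholder printer and driver names; verify the PrintUIEntry switches against Microsoft’s documentation before leaning on them.

```python
# Scripted equivalent of the GUI cleanup.  Printer and driver names are
# placeholders; /dl deletes a local printer, /dd /m deletes a driver.
import subprocess

PRINTER = "HP LaserJet 2030 - 4th Floor"     # hypothetical queue name
DRIVER = "HP LaserJet 2030 Series PCL6"      # hypothetical driver name

subprocess.run(["rundll32", "printui.dll,PrintUIEntry",
                "/dl", "/n", PRINTER])
subprocess.run(["rundll32", "printui.dll,PrintUIEntry",
                "/dd", "/m", DRIVER])
```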

Apparently the print spooler is very active in the background.  It seems to cycle through the registry looking for printers and re-enumerates the printers on the machine.  Every time it hit this particular printer, CRASH!!!  In hindsight I should have thought to use Process Monitor before going through the trouble of attaching a debugger.  But hey, it was an educational experience.  Needless to say this incident DID NOT improve my relationship with printers.  One of these days I will write a post about our myriad issues surrounding printing in Citrix……

relationship problems: barracuda and exchange have “communication issues”

I considered for a moment whether to ascribe gender roles to the enterprise tech involved in the spat.  I decided it might make this short story more entertaining.  And let’s face it, people just love anthropomorphizing their technology.  Forgive me for the clichéd gender stereotypes.

Meet the couple

The Barracuda stands guard and provides security so we’ll make him the dude (Arthur).  The Exchange server is far more complex and does the bulk of the work.  Definitely the chick (Anna).  Anna and Arthur have been together for a little over a year.  The relationship has been going well.  But they work together, so that can be trouble.

The fight

It’s hard to know how this particular fight started.  Anna and Arthur would tell completely different stories.  Like most tiffs between young lovers it probably started over something silly.  Anna forgot to lock the door in the morning or Arthur drunk-posted something stupid to Facebook.  I was made aware of the trouble by my boss.  He informed me that people were reporting to him that they were not receiving emails they’d been expecting.  As the tech relationship counselor of the office I sprang into action.

My preconceptions were incorrect

I immediately suspected Arthur was to blame.  He’d been acting erratically of late.  Translation: We have a 2-node Barracuda cluster and the first node has been flaky.  It’s getting old and needs to be replaced.  I have the replacement sitting in my office, but I need to plan a trip to the datacenter to install it.  Given this knowledge I accessed the management interface of Barracuda 1 (B1) and right away saw something disturbing on the Status page.  The status of the Energize Updates and Instant Replacement subscriptions had a red error code with a message to contact Barracuda support.  Uh oh!  It also had ~500 messages queued up for delivery to our Exchange server.  I rebooted B1.  While it was rebooting I accessed B2.  It had no errors but it DID have ~500 messages queued up just like B1.  So the errors on B1 were a red herring!  Yes, B1 is screwed up in general.  But Arthur may not be to blame after all.

Back pressure

My boss had mentioned to me something about “insufficient resources” messages in the Barracuda logs.  Indeed messages were being rejected by our Exchange CAS array with an SMTP code indicating this.  I checked our two Exchange CAS / Hub Transport servers.  I looked through Event Viewer on CAS1.  All clean.  CAS2 was a different story.  I found two events which indicated Exchange was refusing to accept incoming SMTP messages.  This was triggered by a feature called Back Pressure.  Exchange 2010 tracks a bunch of system metrics to determine whether or not it is at risk of serious impairment.  When it detects a dangerous state it backs off on its processing load.  In this case CAS2 decided it was getting too low on disk space on the drive containing its log files.  Never mind there were a few GB free.  Exchange feels that’s not enough.  So it stopped accepting incoming SMTP messages.  The services keep running.  It just writes 2 events to the Application log and sits there silently.  Anna decided to give Arthur the cold shoulder.  So mail was still being delivered by CAS1.  But any mail which happened to hit CAS2 would be rejected.  And the Barracudas would queue it up for a later redelivery attempt.  Since these are virtual machines I simply expanded the disk and restarted the Exchange Transport service.  CAS2 resumed service.
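Since back pressure is otherwise silent, it’s easy to probe for: the rejection happens at MAIL FROM with a 452 response, so you can test each CAS over plain SMTP.  A quick sketch (the hostnames are placeholders):

```python
# Probe each CAS for back pressure: a healthy server accepts MAIL FROM
# with 250; one shedding load answers 452.  Hostnames are placeholders.
import smtplib

for host in ("cas1.example.local", "cas2.example.local"):
    try:
        with smtplib.SMTP(host, 25, timeout=10) as smtp:
            smtp.ehlo()
            code, resp = smtp.mail("probe@example.local")
            status = "OK" if code == 250 else f"{code} {resp.decode()}"
    except OSError as exc:
        status = f"connection failed: {exc}"
    print(f"{host}: {status}")
```

Pointed at both CAS boxes that day, a probe like this would have called out CAS2 immediately instead of leaving it to Event Viewer spelunking.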

Not ready to make up

I checked the queues on the Barracudas.  They were starting to go down.  Then I witnessed them bump back up.  Wth?  Back to Event Viewer on the CAS boxes.  Lo and behold, they were both dropping connection attempts from both Barracudas.  The reason: the Barracudas were trying to establish more simultaneous connections than the Receive Connectors would allow.  Argh!  I didn’t find any obvious way to limit the SMTP sessions on the Barracudas.  So I increased the maximum number of sessions allowed on the CAS servers.  The default is 20.  I changed it to 50, which seemed like a reasonable number to me.  This got the couple communicating again.

Lessons learned

Keep plenty of free space on Exchange drives containing DBs or log files.  OR tweak the back pressure disk space thresholds as described at the bottom of the page here.  It involves some simple edits to the EdgeTransport.exe.config file.  Microsoft doesn’t recommend it.  But I don’t trust them anyway (see my previous post involving Network Load Balancing).

I’ll very soon be replacing Barracuda 1, which means Anna will be getting a new boyfriend.  I really hope he’s not a jerk.  But at least in this relationship there’s always Instant Replacement!