Investigations

Oracle Application that hangs and crashes

This week we had an interesting and challenging problem for one of our customers.

Our customers were reporting that one of our mission-critical applications had started to hang sporadically and stopped working properly.
This was unexpected, as there were no planned changes in our infrastructure affecting this application.

My colleague who initially received the support request ran some queries and did an initial investigation before he called in our Problem team to look at the case.
He had found that the problem was limited to a single location, and to this one Oracle application.
It was pretty weird, but the problem was easy to reproduce on one or more of the clients.

My colleague had taken a process dump of the hanging application as it crashed, and had well-documented times for when this was happening.

I started looking at the process dump from the latest crash.

!analyze -v -hang showed me that it was waiting on the driver orantcp11.

orantcp11!nttini+fe7
06cb9d77 85c0 test eax,eax
SYMBOL_STACK_INDEX: 5
SYMBOL_NAME: orantcp11!nttini+fe7
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: orantcp11
IMAGE_NAME: orantcp11.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4bb3466a
STACK_COMMAND: ~0s ; kb
BUCKET_ID: HANG_orantcp11!nttini+fe7
FAILURE_BUCKET_ID: APPLICATION_HANG_cfffffff_orantcp11.dll!nttini

We didn’t have too much to go on, and WinDbg is not my strongest side, so I just had to trust what it was telling me.
If the driver is hanging, what could be wrong?

At our team meeting we decided to draw up the client <-> server communication and found one deviation for the location in question.
All the traffic has to pass through an extra firewall, as this customer is a private institution connected to us.

We decided to reproduce the issue and take a network dump on the client.
What we found was that the communication from the server suddenly just stopped. Our client sent an ACK packet to the server for the last response,
and after that it ended. When we killed the application on the client, we saw a RST, ACK packet going from the client to the server.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10691 None 2015-11-05T14:56:30.9055960 10.0.0.50 10.0.0.10 TCP Flags: …A.R.., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 8222720(scale factor: 8)

We decided to do this again and this time dump the traffic on the server too.
Looking at the network dumps we could see the server and client connecting and communicating perfectly until it just died.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10688 None 2015-11-05T14:52:41.5605470 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535
10690 None 2015-11-05T14:52:41.7774920 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 32120
10689 Application 2015-11-05T14:52:41.7722210 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535

On the server we noticed that the last packet, message number 10689, is a retransmission of packet 10688.
When we looked at the client side of things, we noticed that something had been changed.

115591 None 2015-11-05T14:52:41.3666688 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 56, Seq Range: 3748805685 – 3748805741, Ack: 752756193, Win: 32120
115601 None 2015-11-05T14:52:41.5770435 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748805741, Win: 16229120(scale factor: 8)

The packet length had changed on its way from the server to our client, from 396 to 56, and the retransmitted packet from the server never arrived at its destination.
Looking through our change log, we found that our networking department had replaced the firewall towards this location and upgraded its software around the date the issue started to occur.
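The comparison we did by eye can also be sketched in a few lines of Python. This is only an illustration, not what we ran at the time: the Segment type and the function name are made up for the example, and the values are typed in by hand from the two dumps above.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    seq_start: int  # first sequence number of the TCP segment
    length: int     # TCP payload length in bytes


def find_length_mismatches(sent, received):
    """Return (seq_start, sent_len, received_len) for every segment whose
    payload length differs between the sender-side and receiver-side capture."""
    received_by_seq = {seg.seq_start: seg.length for seg in received}
    mismatches = []
    for seg in sent:
        got = received_by_seq.get(seg.seq_start)
        if got is not None and got != seg.length:
            mismatches.append((seg.seq_start, seg.length, got))
    return mismatches


# Hand-copied from the dumps: message 10688 on the server, 115591 on the client.
server_side = [Segment(3748805685, 396)]
client_side = [Segment(3748805685, 56)]

print(find_length_mismatches(server_side, client_side))
```

A segment that leaves the server with one length and arrives with another has been touched by something in the path, which is what pointed us at the firewall.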

We checked all the settings on our firewall, and after a lot of research we found a document describing an Application Layer Inspection feature that was on by default in the firewall’s OS.
According to the document, the feature is needed if you have Oracle clients earlier than version 10, but for newer Oracle clients the recommendation is to turn it off.

We got our networking department on the phone, logged an emergency change, and carried it out on the firewall, turning off this inspection feature that was specific to Oracle SQL.
Since then we haven’t been able to reproduce the issue on our clients at that location.

Case closed 😀


The case of the high amount of Broadcast traffic

I was looking into a different kind of problem the other day when I had to do a network dump on one of our servers.

As I was watching the network dump scroll down my screen, I kept noticing all the DHCP Requests that were flying by.
Something wasn’t right.
 
I contacted our networking department, and they already had a case open on that subnet for high amounts of broadcasts.
 
I decided to take the first DHCP Request and see what was happening.
I filtered the requests on the EthernetAddress and started looking for a pattern.
 
 
After studying the DHCP Requests I found that the requesting client didn’t get any answers, and it kept sending requests pretty often.
Looking at the time stamps, it sends out a Request, then sends a new one after 2 seconds, doubling the timeout each time.
It sends in the following pattern: 0 - 2 - 4 - 8 and finally 16 seconds, before it starts the same procedure again.
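To make the pattern concrete, here is a small Python sketch that builds the expected waits and checks a list of request times against them. The function names and the sample time stamps are hypothetical (they are not taken from the real dump), and the half-second tolerance is just an assumption.

```python
def retry_waits(first_wait=2, max_wait=16):
    """Return the wait before each request in one cycle: 0, then first_wait,
    doubling until max_wait has been used."""
    waits = [0]
    wait = first_wait
    while wait <= max_wait:
        waits.append(wait)
        wait *= 2
    return waits


def follows_backoff(timestamps, expected_waits, tolerance=0.5):
    """Check that the gaps between consecutive request times match the
    expected waits (ignoring the leading 0) within the given tolerance."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    wanted = expected_waits[1:]
    return len(gaps) == len(wanted) and all(
        abs(gap - want) <= tolerance for gap, want in zip(gaps, wanted)
    )


waits = retry_waits()
print(waits)

# Hypothetical request times in seconds, for illustration only.
observed = [0.0, 2.0, 6.1, 14.0, 30.0]
print(follows_backoff(observed, waits))
```

A client that keeps repeating this cycle without ever getting an answer is a good sign that nothing on the segment is serving it DHCP.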
 
Since the EthernetAddress didn’t get any address on our network (there is no DHCP on our server scopes),
I had to get our networking department to find the port the MAC was sitting on.
After they directed me to the correct network port, I could log on to the attached server and check its hardware.
 
When I examined our Windows 2008 R2 server, I couldn’t find the MAC address with the getmac command on the server.
 
What could it be?
 
After looking into the packet a bit more, I saw that it had a VendorClassIdentifier called: brcmftsk.
brcmtfsk
 
I tried to google it, but it didn’t return too many good results.
 
In the middle of the lunch break, as I was discussing the matter with a colleague, he tipped me off about an article[1] regarding Broadcom and DHCP.
 
 
The author of the article had exactly the same problem as me, but he solved it years before I knew it was a problem 🙂
 
As it turns out, all our Windows servers were requesting an address for their iSCSI adapters on our network.
This was the default configuration on the iSCSI adapter, and according to the article it had to be turned off manually.
 
Since we didn’t use iSCSI or DHCP on the server segment, we didn’t notice any depletion of the IP scope or disruption of the iSCSI service.
But with 200+ servers in the same segment sending DHCP requests for both of their iSCSI adapters, the amount of broadcast traffic was questionable.
 
But how do you configure iSCSI on 200+ servers? Not manually, at least as far as I know.
 
The best way is to do it properly: install the Broadcom Managed Applications Control Suite and configure it there.
But not all our servers have this suite installed, and that would require extra downtime and planning.
 
So our quick and dirty workaround for this issue was to disable the driver on our servers.
 
The driver service for the iSCSI adapter is named BXOIS in the registry and is a kernel-loaded driver.
We decided to simply disable this service, which in turn disables the driver.[3]
As we have an Active Directory environment, we added this to the default configuration for all our Windows servers.
 
After we applied the Group Policy, we could see that the registry had changed properly.
 
PS C:\> reg query  HKLM\System\CurrentControlSet\Services\BXOIS
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\BXOIS
    Start    REG_DWORD    0x4
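For reference, the Start value of a service maps to a start type, and 0x4 means Disabled. The Python sketch below decodes the reg query output shown above. The function name is made up for the example and the sample text is simply the output pasted in.

```python
import re

# Start types used under HKLM\System\CurrentControlSet\Services\<name>\Start
START_TYPES = {0: "Boot", 1: "System", 2: "Automatic", 3: "Manual", 4: "Disabled"}


def parse_start_type(reg_output):
    """Find the 'Start REG_DWORD 0x..' line in reg query output and return
    the matching start type name, or None if no such line exists."""
    match = re.search(
        r"^\s*Start\s+REG_DWORD\s+(0x[0-9a-fA-F]+)\s*$", reg_output, re.MULTILINE
    )
    if match is None:
        return None
    return START_TYPES.get(int(match.group(1), 16), "Unknown")


sample = r"""HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\BXOIS
    Start    REG_DWORD    0x4
"""

print(parse_start_type(sample))
```

Something like this is handy when you collect the reg query output from many servers and want a quick list of the ones where the policy has not landed yet.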
 
After rebooting our server, the iSCSI devices were listed like this in Devmgmt.msc:
 
DeviceDriver disabled
 
The warning sign shows the following on the driver when you open it:
A driver (service) for this device has been disabled. An alternate driver may be providing this functionality. (Code 32).
 
After we deployed this to all our servers, we saw the broadcasts on our network drop dramatically.
 
Sources:
 
 
 
 

DHCP and Failing Dynamic DNS Update

The Case:

We were having an issue with some of our clients not getting updated properly in our DNS.
Our clients somehow had the wrong IP registered in our DNS server.
We knew that our clients were accessing an 802.1x network and spent some time getting authenticated.
Since this is a newly acquired network, it had its own DHCP server, and that DHCP server was enabled to
do DDNS updates for our clients. We were only experiencing the issue when the clients were successfully
authenticated on 802.1x.

The investigation:

So where to start? Is it the new DHCP server that is overwriting the entries after the client has successfully authenticated?
Is it the client that is not able to update its own records due to the DNS No-Refresh interval on the DNS server?
Is it our primary DNS server that is not able to update the records due to DNS No-Refresh?
Anyway, I started by comparing the DNS settings of our DHCP servers.
Both of them were configured identically for the DNS settings and were using the same DNS update credentials.

PS C:\> Get-DhcpServerv4DnsSetting -ComputerName dns01.contoso.com
  DynamicUpdates             : Always
  DeleteDnsRROnLeaseExpiry   : True
  UpdateDnsRRForOlderClients : True
  DnsSuffix                  :
  DisableDnsPtrRRUpdate      : False
  NameProtection             : False

After checking all the scopes on the server, we could see that every scope had the same settings as the server-level settings on the DHCP server.
Here is a PowerShell script that lists all the scopes in a table with their DNS settings (not perfect, but it gives you an overview):

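# Query the local DHCP server; replace "." with a server name if needed
$srv = "."
$scopelist = Get-DhcpServerv4Scope -ComputerName $srv
foreach ($item in $scopelist) {
    # Print the scope name and ID, then that scope's DNS settings
    Write-Host $item.Name, $item.ScopeId.IPAddressToString
    Get-DhcpServerv4DnsSetting -ComputerName $srv -ScopeId $item.ScopeId.IPAddressToString | Format-Table
}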

Well, everything seemed to be as it should be... but why didn’t our primary DNS server update the latest DHCP lease in DNS with the proper IP?
I decided to check the DHCP logs and look for our client’s name. After opening the log file I could see straight away a bunch of error messages
related to DNS updates.
The log file was filled with ID 30 and ID 31 entries. They look something like this:
11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,
31,07/16/14,11:16:00,DNS Update Failed,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,
30,07/16/14,11:16:00,DNS Update Request,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,
11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,

After counting all the ID 31 events in Notepad++, it was clear we had too many failed updates (approx. 60000 a day).
All kinds of clients failed to get updated, but all kinds of clients also succeeded. What could be causing this?
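I did the counting in Notepad++, but the same thing can be sketched in Python. This is a minimal example that assumes the log lines are comma-separated with the event ID in the first field, as in the excerpt above. The function name is made up and the sample is just those four lines.

```python
import csv
from collections import Counter


def count_event_ids(lines):
    """Count DHCP log entries per event ID (the first comma-separated field).
    Lines whose first field is not a number, such as headers, are skipped."""
    counts = Counter()
    for row in csv.reader(lines):
        if row and row[0].strip().isdigit():
            counts[int(row[0])] += 1
    return counts


sample = [
    "11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,",
    "31,07/16/14,11:16:00,DNS Update Failed,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,",
    "30,07/16/14,11:16:00,DNS Update Request,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,",
    "11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,",
]

counts = count_event_ids(sample)
print("DNS Update Failed (ID 31):", counts[31])
```

To run it on a real log you would pass an open file object instead of the sample list.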

After discussing with Microsoft, we decided to increase the queue limit for DDNS updates on the DHCP server. [1]
Since we have Windows 2008 R2 DNS and DHCP servers, we increased DynamicDNSQueueLength to 65536 as described in the article[2].

This didn’t actually help, as our queue was still too big. So I started to investigate which records were failing.
I was a bit surprised, but apparently we have a lot of units reporting “invalid FQDN hostnames” to the DHCP server.
As our DHCP server is set to “DynamicUpdates : Always” and “UpdateDnsRRForOlderClients : True”, it will update any type of client
that reports its hostname to our DHCP server. The invalid FQDNs looked like this: AP30f7.0d92.5ea1.
That is two letters (AP) followed by a MAC address, for our Cisco access points. One of the issues with this is the use of “.” in the hostname.
If we were to allow the update to happen, we would have to create the 0d92.5ea1 zone in our DNS server, and then the DHCP server would create
a record for AP30f7 inside that zone. But we would have to create a zone for every access point, because the last 9 characters are unique to every access point.
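A short Python sketch of why the dots are a problem: everything after the first dot is read as the zone. The function names are made up for the example, and the second access point name is hypothetical, only there to show that each access point ends up in its own zone.

```python
def split_reported_name(name):
    """Split a reported name at the first dot into (host label, zone)."""
    host, _, zone = name.partition(".")
    return host, zone


def zones_needed(names):
    """Return the set of zones that would have to exist for these names."""
    return {split_reported_name(name)[1] for name in names}


print(split_reported_name("AP30f7.0d92.5ea1"))

# The second name is hypothetical, for illustration only.
access_points = ["AP30f7.0d92.5ea1", "AP30f7.0d92.5ea2"]
print(len(zones_needed(access_points)))
```

With several thousand access points, that set grows to several thousand zones.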

So that is not an option. How can we fix this, or at least reduce the problem?
There are three ways of doing this:
1. Rename all our access points to a fitting standard
2. Adjust the settings for our DHCP server globally
3. Adjust just the scopes in question

Our network admin wasn’t too glad about renaming several thousand access points, so I decided to explore the other options.
If we are adjusting the DHCP DNS Update settings, what settings should we have?
Should we just adjust the scope settings or could we do it globally on the DHCP Server?

To figure this out I had to read up [3][4] on the DHCP protocol and on how clients tell the DHCP server whether or not to do DDNS.
In the article “Using DNS with DHCP” [4] there is a nice diagram of what it looks like when the client communicates with the DHCP server:

DHCP and DNS Update interaction

This is the case where the client itself updates the record via DDNS, not where the DHCP server does it.
The setting on the DHCP server is then “DynamicUpdates: OnClientRequest”, which is the default setting for DDNS updates from DHCP.
But ours has this set to “Always”, and we also have the option “UpdateDnsRRForOlderClients: True”.

This means that all clients that are getting a lease from our DHCP Server will be updated in our DNS Server by our DHCP server.

So how does this work? DHCP goes through four steps to hand out a lease, and in a default DHCP setup those steps decide whether the client
or the DHCP server should do the DNS update. This is shown in the picture:

DHCP Message Exchange

First the client sends a DHCPDISCOVER and includes the HostName (ID 12) in the packet.
It then receives a DHCPOFFER packet from the DHCP server with an IP address, subnet, DNS server and other DHCP options.
Then the client sends a DHCPREQUEST packet. This packet contains RequestIP (ID 50) and HostName (ID 12), and it may contain FQDN (ID 81).
This last option is optional for clients to send, but all Windows XP/2003 and newer clients send it.
The DHCPACK packet is then returned from the DHCP server, and it contains the DHCP DNS settings in the FQDN field (ID 81).
If the DNS settings are default, the client will update itself in the DNS server.

So why does our access point not have an FQDN when we look in the DHCP scope and in our logs?
As I was inspecting the packet dump from one of the access points, I noted that the access points didn’t send
the FQDN option (ID 81) in the DHCPREQUEST packet to the DHCP server. So this client is actually not requesting to
be updated in our DNS, but our DHCP server still does it.

So this means we either adjust the scope or change the settings globally on the DHCP server.
If we change the settings to DynamicUpdates: OnClientRequest and UpdateDnsRRForOlderClients: False,
we will avoid update requests for clients whose names would need a non-existent zone on our DNS server.
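The effect on a client like our access points, which sends HostName (ID 12) but no FQDN option (ID 81), can be written down as a tiny Python model. This is a simplified sketch of my understanding, not the actual DHCP server logic: it assumes that for clients without option 81 the UpdateDnsRRForOlderClients setting is what decides, and that DynamicUpdates: Never switches updates off entirely. The function name is made up for the example.

```python
def server_updates_client_without_option_81(dynamic_updates, update_older_clients):
    """Simplified model (an assumption, not the real server code): does the
    DHCP server try to register a client that did not send option 81?"""
    if dynamic_updates == "Never":
        return False
    # For clients that do not send option 81, this model assumes only the
    # UpdateDnsRRForOlderClients setting decides.
    return bool(update_older_clients)


# Our original settings: the access points get pushed to DNS and fail.
print(server_updates_client_without_option_81("Always", True))

# The proposed settings: the access points are left alone.
print(server_updates_client_without_option_81("OnClientRequest", False))
```

Clients that do send option 81, such as the Windows clients, are outside this little model and continue to be handled according to what they request.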

Conclusion:

So we have learned two things in this scenario.
1. Having a good naming standard for all your devices helps a lot and keeps you out of unexpected trouble like this.
2. Configuring the DHCP server to update everything to DNS might sound like a good idea, but only if you have done point 1 properly.

We have adjusted our DHCP server globally to DynamicUpdates: OnClientRequest and UpdateDnsRRForOlderClients: False.
This is done at the root level of the IPv4 settings, but afterwards you have to check every scope in case it has been changed from the standard config.
The setting will not propagate to the scopes. You can check with the earlier PowerShell script,
and if the settings are inconsistent, it is easy to rewrite it so that it sets the correct settings on the scopes that are not following the global config.
The DhcpServer module in PowerShell is available in Windows Server 2012 and newer, so you need that OS for the PowerShell script to work.
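The consistency check itself is simple once the settings have been collected. Here is a Python sketch of the comparison, assuming the server-level and per-scope settings have already been exported into dictionaries. The function name and the scope IDs are hypothetical; the setting names are the ones shown in the output earlier.

```python
def deviating_scopes(global_settings, scope_settings):
    """Return {scope_id: {setting: scope_value}} for every scope where a
    setting differs from the server-level (global) value."""
    deviations = {}
    for scope_id, settings in scope_settings.items():
        diff = {
            key: settings.get(key)
            for key, wanted in global_settings.items()
            if settings.get(key) != wanted
        }
        if diff:
            deviations[scope_id] = diff
    return deviations


global_settings = {"DynamicUpdates": "OnClientRequest", "UpdateDnsRRForOlderClients": False}

# Hypothetical scopes, for illustration only.
scope_settings = {
    "10.0.0.0": {"DynamicUpdates": "OnClientRequest", "UpdateDnsRRForOlderClients": False},
    "10.84.149.0": {"DynamicUpdates": "Always", "UpdateDnsRRForOlderClients": True},
}

print(deviating_scopes(global_settings, scope_settings))
```

Any scope that shows up in the result is one that still has its own settings and needs to be corrected by hand or by script.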

Hope you feel this was worth your time reading and that you enjoyed it.

Regards,
Kenneth

Sources:
[3]Windows Server 2008 TCP/IP Protocols and Services