
Oracle Application that hangs and crashes

This week we had an interesting and challenging problem at one of our customers.

Our customers were reporting that one of our mission-critical applications had started to sporadically hang and stop working properly.
This was pretty unexpected, as there were no planned changes in our infrastructure affecting this application.

My colleague who initially got this support request did some initial queries and investigation before he called in our Problem team to look at the case.
What he had found out was that the problem was confined to a single location, and only to this Oracle application.
It was pretty weird, but it was easy to reproduce the problem on one or more of the clients.

My colleague had made a process dump of the hanging application as it crashed, and had well-documented timestamps for when this was happening.

I started looking at the process dump from the latest crash.

!analyze -v -hang showed me that the application was waiting on the driver orantcp11, Oracle's TCP/IP protocol adapter:

orantcp11!nttini+fe7
06cb9d77 85c0 test eax,eax
SYMBOL_STACK_INDEX: 5
SYMBOL_NAME: orantcp11!nttini+fe7
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: orantcp11
IMAGE_NAME: orantcp11.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4bb3466a
STACK_COMMAND: ~0s ; kb
BUCKET_ID: HANG_orantcp11!nttini+fe7
FAILURE_BUCKET_ID: APPLICATION_HANG_cfffffff_orantcp11.dll!nttini
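
For anyone who wants to repeat this kind of triage, the same hang analysis can be scripted against a dump file with cdb.exe, the console version of WinDbg. A minimal sketch, assuming cdb.exe is on the PATH and using a hypothetical dump file name:

```python
import subprocess

# Hypothetical path to the process dump; substitute your own.
DUMP = r"C:\dumps\oracle_app_hang.dmp"

# cdb.exe drives the same engine as WinDbg: -z loads a dump file,
# -c runs a command string; q quits when the analysis is done.
proc = subprocess.run(
    ["cdb", "-z", DUMP, "-c", "!analyze -v -hang; q"],
    capture_output=True,
    text=True,
)

# Pick out the bucketing lines that pointed us at orantcp11.dll.
keys = ("SYMBOL_NAME", "MODULE_NAME", "IMAGE_NAME", "FAILURE_BUCKET_ID")
for line in proc.stdout.splitlines():
    if any(key in line for key in keys):
        print(line)
```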

We didn't have too much to go on, and WinDbg is not my strongest side, so I just had to trust what it was telling me.
If the driver was hanging, what could be wrong?

At our team meeting we decided to draw up the client <-> server communication, and found one deviation for the location in question:
all the traffic has to pass through an extra firewall, as this customer is a private institution connected to us.

We decided to reproduce the issue and take a network dump on the client.
What we found was that the communication from the server suddenly just stopped: our client sent an ACK packet to the server for the last response,
and after that it ended. As we killed the application on the client, we saw an RST, ACK packet going from the client to the server.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10691 None 2015-11-05T14:56:30.9055960 10.0.0.50 10.0.0.10 TCP Flags: ...A.R.., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 - 752756193, Ack: 3748806081, Win: 8222720 (scale factor: 8)
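
As an aside, the capture filter itself is simple, since the Oracle listener defaults to port 1521. A rough sketch of what our capture tool was doing, written with the Python scapy library (assuming it is installed and the script runs with capture privileges):

```python
from scapy.all import IP, TCP, sniff

def show(pkt):
    # One line per packet: endpoints, TCP flags, payload length, seq/ack.
    tcp = pkt[TCP]
    print(f"{pkt[IP].src}:{tcp.sport} -> {pkt[IP].dst}:{tcp.dport} "
          f"flags={tcp.flags} len={len(tcp.payload)} "
          f"seq={tcp.seq} ack={tcp.ack}")

# Watch only the Oracle listener traffic on this client.
sniff(filter="tcp port 1521", prn=show, store=False)
```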

We decided to do this again, and this time dump the traffic on the server too.
Looking at the network dumps we could see the server and client connecting and communicating perfectly until it just died.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10688 None 2015-11-05T14:52:41.5605470 10.0.0.10 10.0.0.50 TCP Flags: ...AP..., SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 - 3748806081, Ack: 752756193, Win: 65535
10690 None 2015-11-05T14:52:41.7774920 10.0.0.50 10.0.0.10 TCP Flags: ...A...., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 - 752756193, Ack: 3748806081, Win: 32120
10689 Application 2015-11-05T14:52:41.7722210 10.0.0.10 10.0.0.50 TCP Flags: ...AP..., SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 - 3748806081, Ack: 752756193, Win: 65535

We noticed on the server that the last packet, message number 10689, was a retransmission of packet 10688.
When we were looking into the client side of things, we noticed that something had changed.

115591 None 2015-11-05T14:52:41.3666688 10.0.0.10 10.0.0.50 TCP Flags: ...AP..., SrcPort: 1521, DstPort: 58006, Length: 56, Seq Range: 3748805685 - 3748805741, Ack: 752756193, Win: 32120
115601 None 2015-11-05T14:52:41.5770435 10.0.0.50 10.0.0.10 TCP Flags: ...A...., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 - 752756193, Ack: 3748805741, Win: 16229120 (scale factor: 8)

The packet length had changed on its way from the server to our client, from a length of 396 down to 56, and the retransmitted packet from the server never arrived at its destination.
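
Comparing the two captures segment by segment is what made the truncation jump out. A small sketch of that comparison, again with scapy and the hypothetical file names server.pcap and client.pcap:

```python
from scapy.all import TCP, rdpcap

def payload_lengths(pcap_file, src_port):
    # Map each TCP sequence number sent from src_port to its payload length.
    lengths = {}
    for pkt in rdpcap(pcap_file):
        if TCP in pkt and pkt[TCP].sport == src_port:
            lengths[pkt[TCP].seq] = len(pkt[TCP].payload)
    return lengths

# Hypothetical capture file names; 1521 is the server-side Oracle port.
sent = payload_lengths("server.pcap", 1521)
seen = payload_lengths("client.pcap", 1521)

for seq, sent_len in sorted(sent.items()):
    seen_len = seen.get(seq)
    if seen_len is None:
        print(f"seq {seq}: {sent_len} bytes sent, never reached the client")
    elif seen_len != sent_len:
        # e.g. the 396-byte reply that arrived as only 56 bytes
        print(f"seq {seq}: {sent_len} bytes sent, {seen_len} bytes received")
```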
Looking through our change log, we found that our networking department had replaced the firewall towards this location, and upgraded its software, around the date this issue had started to occur.

We checked all the settings on our firewall, and after a lot of research we found a document that described an Application Layer Inspection feature on the firewall that was on by default in the OS.
The feature was needed if you had Oracle clients earlier than version 10, but for newer Oracle clients it was recommended to turn it off.
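
To illustrate with a product I can show config for: on a Cisco ASA, for example, this kind of Oracle-aware inspection is the SQL*Net inspection engine, and turning it off in the default inspection policy looks something like this (a sketch only; our firewall may use different syntax):

```
! Cisco ASA example - the feature and class names on other
! firewall products will differ.
policy-map global_policy
 class inspection_default
  no inspect sqlnet
```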

We got our networking department on the phone, and we logged and carried through an emergency change on the firewall, turning off this Oracle SQL-specific inspection feature.
Since then we haven't been able to reproduce the issue on our clients at that location.

Case closed 😀
