Investigations

Oracle Application that hangs and crashes

This week we had an interesting and challenging problem for one of our customers.

Our customers were reporting that one of our mission-critical applications had started to sporadically hang and stop working properly.
This was pretty unexpected, as there were no planned changes in our infrastructure towards this application.

My colleague who initially got the support request did some initial queries and investigation before he called our Problem team in to look at the case.
What he had found was that the problem was limited to a single location, and only to this Oracle application.
It was pretty weird, but it was easy to reproduce the problem on one or more of the clients.

My colleague had taken a process dump of the hanging application as it crashed and had well-documented timestamps of when this was happening.

I started looking at the process dump from the latest crash.

!analyze -v -hang showed me that the application was waiting in orantcp11, the Oracle TCP/IP adapter DLL.

orantcp11!nttini+fe7
06cb9d77 85c0 test eax,eax
SYMBOL_STACK_INDEX: 5
SYMBOL_NAME: orantcp11!nttini+fe7
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: orantcp11
IMAGE_NAME: orantcp11.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4bb3466a
STACK_COMMAND: ~0s ; kb
BUCKET_ID: HANG_orantcp11!nttini+fe7
FAILURE_BUCKET_ID: APPLICATION_HANG_cfffffff_orantcp11.dll!nttini
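
For reference, a hang dump like the one above can be captured with Sysinternals ProcDump and then opened in WinDbg; a minimal sketch, where the process name and the output path are assumptions:

procdump -ma -h oracleapp.exe C:\dumps\oracleapp_hang.dmp

The -ma switch writes a full memory dump and -h triggers on a hung window; in WinDbg the relevant commands are the ones already visible above, !analyze -v -hang for the bucket analysis and ~0s ; kb for the stack of thread 0.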

We didn’t have too much to go on, and WinDbg is not my strongest skill, so I just had to trust what it was telling me.
If the Oracle network layer is hanging, what could be wrong?

At our team meeting we decided to draw up the client <-> server communication and found one deviation for the location in question.
All the traffic has to pass through an extra firewall, as this customer is a private institution connected to our network.

We decided to reproduce the issue and take a network dump on the client.
What we found was that the communication from the server suddenly just stopped. Our client sent an ACK packet to the server for the last response,
and after that nothing more arrived. When we killed the application on the client, we saw an RST, ACK packet going from the client to the server.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10691 None 2015-11-05T14:56:30.9055960 10.0.0.50 10.0.0.10 TCP Flags: …A.R.., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 8222720(scale factor: 8)

We decided to do this again and dump the traffic on the server too.
Looking at the network dumps we could see the server and client connecting and communicating perfectly until it just died.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10688 None 2015-11-05T14:52:41.5605470 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535
10690 None 2015-11-05T14:52:41.7774920 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 32120
10689 Application 2015-11-05T14:52:41.7722210 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535

We noticed on the server that the last packet, message number 10689, is a resend of packet 10688.
When we were looking into the client side of things, we noticed that something had changed.

115591 None 2015-11-05T14:52:41.3666688 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 56, Seq Range: 3748805685 – 3748805741, Ack: 752756193, Win: 32120
115601 None 2015-11-05T14:52:41.5770435 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748805741, Win: 16229120(scale factor: 8)

The packet length had changed on its way from the server to our client, from 396 bytes down to 56, and the retransmitted packet from the server never arrived at its destination.
Looking through our change log we found that our networking department had replaced the firewall towards this location, and upgraded its software, around the date this issue started to occur.
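
The captures above come from our standard tooling, but this kind of side-by-side comparison can be reproduced with nothing more than the built-in netsh trace on Windows; a minimal sketch, where the file path and size limit are assumptions:

netsh trace start capture=yes tracefile=C:\temp\client-side.etl maxsize=200
netsh trace stop

Start the trace, reproduce the hang, stop the trace, do the same on the server side, and then compare the Seq ranges and packet lengths for the session on port 1521 in the two traces.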

We checked all the settings on our firewall, and after a lot of research we found a document describing an Application Layer Inspection feature on the firewall that was enabled by default in its OS.
The feature was needed if you had Oracle clients older than version 10, but for newer Oracle clients the recommendation was to turn it off.

We got our networking department on the phone, and we logged and carried through an emergency change on the firewall, turning off this Oracle SQL-specific inspection feature.
Since then we haven’t been able to reproduce the issue on our clients at that location.
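
Our firewall vendor isn’t named here, so purely as an illustration of what such a change looks like: on a Cisco ASA, for example, SQL*Net inspection is part of the default global policy, and turning it off boils down to this (assuming the default policy and class names):

policy-map global_policy
 class inspection_default
  no inspect sqlnet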

Case closed 😀

Uncategorized

Finding the performance issue on Isilon

So where do you look when you have a performance issue on your Isilon?

There are three main areas to look at when narrowing down where you should put your resources.
I’m going to cover the following areas and show some examples from our system.
General overview
To get a general overview of where to begin the hunt for our performance daemon,
the following command gives a good indication of where to start.
bgo-isilon01-1# isi statistics pstat
___________________________NFS3 Operations Per Second__________________________
null               0.00/s  getattr            0.35/s  setattr            0.00/s
lookup             0.00/s  access             0.00/s  readlink           0.00/s
read               0.00/s  write              0.00/s  create             0.00/s
mkdir              0.00/s  symlink            0.00/s  mknod              0.00/s
remove             0.00/s  rmdir              0.00/s  rename             0.00/s
link               0.00/s  readdir            0.00/s  readdirplus        0.00/s
statfs             0.00/s  fsinfo             0.00/s  pathconf           0.00/s
commit             0.00/s  noop               0.00/s
TOTAL              0.35/s

___CPU Utilization___                                     _____OneFS Stats_____
user             4.8%                                     In         90.63 MB/s
system          20.9%                                     Out        14.48 MB/s
idle            77.1%                                     Total     105.11 MB/s

____Network Input____        ____Network Output___        _______Disk I/O______
MB/s            79.94        MB/s           107.76        Disk     9979.20 iops
Pkt/s        63373.50        Pkt/s        70165.20        Read       25.98 MB/s
Errors/s         2.07        Errors/s         0.00        Write     163.05 MB/s

The interesting part is the lower half, which gives you a performance overview.
CPU Resources
After reading a lot of articles on the EMC Support Community, I’ve summed up some of the experiences discussed there.
High CPU usage is not always a problem, but it can be.
To get an overview of the whole cluster’s CPU usage you can use the following command:
bgo-isilon01-1# isi statistics system --nodes --top
isi statistics: Mon Dec  8 14:36:05 2014
------------------------------------------------------------
Node   CPU   SMB   FTP  HTTP ISCSI   NFS  HDFS Total NetIn NetOut DiskIn DiskOut
LNN %Used   B/s   B/s   B/s   B/s   B/s   B/s   B/s   B/s    B/s    B/s     B/s
All  15.4   11M   0.0 918.3   0.0  2.7M   0.0   14M   13M    18M    62M     31M
1  22.2  596K   0.0 918.3   0.0  25.5   0.0  597K  1.5M   818K   5.8M    815K
2  16.1  2.0M   0.0   0.0   0.0  2.7M   0.0  4.7M  3.8M    16M   3.8M    1.7M
3  31.8  8.7M   0.0   0.0   0.0   0.0   0.0  8.7M  7.8M   1.6M   7.4M    1.2M
4   7.7   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    14M    6.9M
5   9.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    11M    7.1M
6   9.9   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    10M    6.8M
7  10.5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0   9.9M    6.6M
From there you can dig deeper into the node in question; for example, the following is from my node 1:
bgo-isilon01-1# ps -fwulp `pgrep lwio`
bgo-isilon01-1: USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
bgo-isilon01-1: root 13355 38.3  2.7 482600 336012  ??  S    24Oct14 24257:14.77 lw-container lwi     0 69387   0  96  0 ucond
This shows that there is some significant CPU use on this one node. If you check the threads, which you can do by adding the -H option,
you will see that in this case the workload is spread equally over the threads.
bgo-isilon01-1# ps -fwulHp `pgrep lwio`
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 864:03.41 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.03 lw-container lwi     0 69387   0  20  0 sigwait
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 292:37.04 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1014:51.42 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 167:07.17 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.11 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.21 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14  74:34.93 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:15.84 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:34.30 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:41.60 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:59.26 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:10.67 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1367:52.95 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.46 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:23.53 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:52.03 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:56.50 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:53.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:06.38 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:52.35 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:25.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:50.94 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1366:15.88 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:23.59 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.71 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.61 lw-container lwi     0 69387   0  96  0 ucond
If you have one thread running at 100% you are looking at a potential problem.
A quote from Tim Wright at EMC:
“It is entirely normal and expected to see multiple threads consuming 15, 20, 25% cpu at times. *If* you see one or more threads that are consistently and constantly consuming 100% cpu, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming >100% cpu, that is not likely to be a problem. Certain operations including auth can be somewhat cpu-intensive. Imagine hundreds of users connecting to the cluster in the morning.”
Network and Protocol Performance
Here we are going to check the physical links first and then go on to check the protocols.
Your main command for checking the interfaces for Ierrs and Oerrs is netstat.
It is possible to run this across the whole cluster with the isi_for_array command, but that output is a bit too distorted to include here (a one-line sketch follows after the netstat output below).
If you run netstat -i on one of the nodes, it looks pretty much like this:
bgo-isilon01-1# netstat -i
Name    Mtu Network       Address                      Ipkts Ierrs    Opkts Oerrs  Coll
cxgb0  1500 <Link#1>      00:07:43:09:90:bc         31448503823 25069 46349885580     0     0
cxgb0  1500 10.0.0.0      10.0.0.131                31448503823     - 46349885580     -     -
cxgb0  1500 10.0.0.0      10.0.0.123                31448503823     - 46349885580     -     -
cxgb0  1500 10.0.0.0      10.0.0.126                31448503823     - 46349885580     -     -
cxgb1  1500 <Link#2>      00:07:43:09:90:bd         58378279537 44629 79115225564     0     0
cxgb1  1500 10.0.0.0      10.0.0.128                58378279537     - 79115225564     -     -
cxgb1  1500 10.0.0.0      10.0.0.121                58378279537     - 79115225564     -     -
cxgb1  1500 10.0.0.0      10.0.0.112                58378279537     - 79115225564     -     -
em0    1500 <Link#3>      00:25:90:98:ee:82          8995333     0        0     0     0
em1    1500 <Link#4>      00:25:90:98:ee:83          8995333     0        0     0     0
lo0   16384 <Link#5>                                39779509     0 39778192     0     0
lo0   16384 your-net      localhost                  39779509     - 39778192     -     -
lo0   16384 128.221.254.0 bgo-isilon01-1             39779509     - 39778192     -     -
ib0    2004 <Link#6>      00:15:1b:00:10:82:02:52   296657289     0 302119159     0     0
ib0    2004 128.221.253.0 128.221.253.1              296657289     - 302119159     -     -
ib1    2004 <Link#7>      00:15:1b:00:10:82:02:53   325413619     0 327997270     0     0
ib1    2004 128.221.252.0 128.221.252.1              325413619     - 327997270     -     -
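As a side note, if you want these interface counters from every node at once, isi_for_array can run the same command cluster-wide; a one-line sketch (the output gets long, which is why only a single node is shown above):

bgo-isilon01-1# isi_for_array 'netstat -i'
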
As we can see here, we actually have a network issue on this node, on both of our interfaces,
cxgb0 (10gige-2) and cxgb1 (10gige-1).
It is not possible to reset these counters; the only way is to do an ifconfig cxgb0 down ; ifconfig cxgb0 up, but that is not recommended.
At this point our chase for the error is actually finished on the Isilon side, but it is a good idea to get an overview of the protocol performance counters at the same time.
First off, get an overview of the active clients on our system:
bgo-isilon01-1# isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded
  NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2
       1                            540                            3                           37
       2                            697                            1                           27
       3                            928                            0                           25
       4                              0                            0                            0
       5                              0                            0                            0
       6                              0                            0                            0
       7                              0                            0                            0
 average                            309                            1                           13
This will repeat 12 times with a 5 second interval.
Since we only have 3 access nodes, the clients are distributed only across nodes 1 to 3.
If we run this command:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded
Ops    In   Out TimeAvg TimeStdDev Node Proto Class Op
N/s   B/s   B/s      us         us
6.9 880.2  1.7K   791.8     1102.1    1  smb1     *  *
369.6   74K   71K 31426.6   731763.0    1  smb2     *  *
0.2  29.7  43.9    63.0        0.0    2  smb1     *  *
2.0K  374K  3.0M  2207.0    83113.5    2  smb2     *  *
0.2  34.8  51.5   240.0        0.0    3  smb1     *  *
686.5  2.8M  1.0M  4543.0    40379.7    3  smb2     *  *
 
You get to see the network bandwidth each protocol is using on the nodes, as well as the average time spent per operation.
The next place to look is:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded
Ops    In   Out    TimeAvg TimeStdDev Node Proto           Class                    Op
N/s   B/s   B/s         us         us
192.0   70K   37K     2128.4     1693.9    1  smb2          create                create
31.0   12K  6.1K     8277.1    45135.5    2  smb2          create                create
158.9   68K   32K     1727.6     4361.0    3  smb2          create                create
0.3  30.4   0.0       86.5        7.8    1  smb1      file_state nttrans:notify_change
4.7 463.9 362.8 27473836.0 62147152.0    1  smb2      file_state         change_notify
185.0   17K   24K     2418.5     6679.5    1  smb2      file_state                 close
0.2  22.0  22.0      100.0        0.0    1  smb2      file_state          oplock_break
4.0 400.1 310.1  6400240.5  9013317.0    2  smb2      file_state         change_notify
28.8  2.7K  3.7K      887.8     5004.6    2  smb2      file_state                 close
15.6  1.8K  1.1K      146.1       90.3    2  smb2      file_state                  lock
0.2  20.8  20.8      125.0        0.0    2  smb2      file_state          oplock_break
4.1 407.4 336.3  3423766.0  6155676.0    3  smb2      file_state         change_notify
118.4   11K   15K      156.3       83.2    3  smb2      file_state                 close
11.5  1.3K 831.1      133.8      127.7    3  smb2      file_state                  lock
1.4 141.2 141.2      149.2       61.5    3  smb2      file_state          oplock_break
1.7 299.2  67.4     1916.0      800.9    1  smb1  namespace_read      trans2:findfirst
1.6 241.1 161.7     1707.0      531.9    1  smb1  namespace_read      trans2:qpathinfo
98.2   10K   58K      651.4     1182.7    1  smb2  namespace_read       query_directory
12.7  1.4K  1.1K      924.5     1640.6    1  smb2  namespace_read            query_info
13.2  1.4K  115K      936.7     1717.0    2  smb2  namespace_read       query_directory
5.8 626.3 507.3      246.4      348.3    2  smb2  namespace_read            query_info
121.3   12K  274K     9975.6    64047.6    3  smb2  namespace_read       query_directory
56.4  6.1K  5.6K      363.1      444.9    3  smb2  namespace_read            query_info
18.4  2.0K  1.3K      840.3     1904.4    1  smb2 namespace_write              set_info
5.2 556.1 364.1      128.0       40.0    2  smb2 namespace_write              set_info
5.2  1.6K 364.4     4016.6     6241.8    3  smb2 namespace_write              set_info
0.6  45.7   0.0       24.0        1.7    1  smb2           other                cancel
0.6  58.4  45.7     1944.7      503.8    1  smb2           other                 flush
0.8  57.6   0.0       35.2       11.8    2  smb2           other                cancel
1.6 114.1   0.0       30.4        2.6    3  smb2           other                cancel
2.4K  285K  524K    15549.1    11171.1    1  smb2            read                  read
180.8   21K  9.4M      346.8      112.4    2  smb2            read                  read
62.7  7.3K  1.0M      514.0     1628.2    3  smb2            read                  read
0.4  56.0  82.8      100.0       39.6    3  smb1   session_state               negprot
0.7  73.3 159.6       64.7       10.0    3  smb2   session_state             negotiate
0.7  1.8K 162.5   110202.7     8156.3    3  smb2   session_state         session_setup
0.7  96.4  57.0      782.3      214.5    3  smb2   session_state          tree_connect
820.6   54M   69K     2023.7     2729.1    1  smb2           write                 write
170.2   11M   14K     3056.0     3288.4    2  smb2           write                 write
145.3  9.5M   12K      763.7      406.8    3  smb2           write                 write

The following command is great for finding the clients that are causing the load:
bgo-isilon01-1# isi statistics client --orderby=Ops --top
isi statistics: Mon Dec  8 15:10:49 2014
------------------------------------------------------------
Ops         In        Out    TimeAvg       Node      Proto            Class   UserName        LocalName       RemoteName
N/s        B/s        B/s         us
2.4K       278K       842K    11973.7          1       smb2             read DOMAIN\user7    10.0.0.131 app112.domain.com
491.6        58K        14M     7045.0          2       smb2             read DOMAIN\user6    10.0.0.129 clu086.domain.com
181.2        68K        35K     1294.1          1       smb2           create    DOMAIN\user5    10.0.0.131 73602.domain.com
181.2        17K        23K     2443.8          1       smb2       file_state    DOMAIN\user5    10.0.0.131 73602.domain.com
145.4       9.5M        12K     2204.3          1       smb2            write    UNKNOWN    10.0.0.131 app112.domain.com
134.8       8.8M        11K     4910.0          2       smb2            write    UNKNOWN    10.0.0.129 clu086.domain.com
120.0        13K        35K      286.8          2       smb2   namespace_read    DOMAIN\user4   10.0.0.132 83267.domain.com
99.2        10K        49K      314.3          2       smb2   namespace_read    DOMAIN\user3    10.0.0.129 74892.domain.com
95.4        11K       5.3M      434.6          2       smb2             read DOMAIN\user2    10.0.0.129 54392.domain.com
94.8       6.0M       8.0K      886.9          2       smb2            write    UNKNOWN    10.0.0.129 54392.domain.com
90.0        34K        16K     1476.5          2       smb2           create    DOMAIN\user    10.0.0.129 69728.domain.com
88.2       5.7M       7.4K      714.4          3       smb2            write    UNKNOWN    10.0.0.133 clu017.domain.com
83.2       8.4K        56K      197.1          3       smb2   namespace_read  DOMAIN\user    10.0.0.133 68602.domain.com
 
Disk IO
 
To look at Disk IO you have the following command:
bgo-isilon01-1# isi statistics drive -nall -t --long --orderby=OpsOut
 
This lists all the drives on all nodes and orders them by OpsOut.
A small snippet from the output of that command:
isi statistics: Mon Dec  8 14:39:17 2014
------------------------------------------------------------
Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay        N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
2:10 SATA  25.0    545K    22K   49.6     230K    4.6K     0.1  0.0    11.9    0.9 12.1 21.0   2.1M
3:11 SATA  29.0    486K    17K   46.8     230K    4.9K     0.1  0.0    15.6    1.0 10.8 21.0   2.1M
2:6 SATA  34.2    899K    26K   46.6     290K    6.2K     0.1  0.0    11.0    1.0 11.7 26.6   2.0M
3:4 SATA  15.4     65K   4.2K   45.8     651K     14K     0.1  0.0     7.7    1.0 10.4 26.6   2.0M
2:4 SATA  20.4    256K    13K   44.8     427K    9.5K     0.1  0.0    10.6    1.1 10.7 26.6   2.0M
2:3 SATA  15.6    150K   9.6K   44.0     603K     14K     0.1  0.0    10.3    1.0 12.1 26.6   2.0M
2:11 SATA  17.8    219K    12K   43.4     200K    4.6K     0.1  0.0    17.5    1.2 14.1 21.0   2.1M
3:7 SATA  19.4    128K   6.6K   41.0     202K    4.9K     0.1  0.0    21.3    1.5  7.8 21.0   2.1M
3:1 SATA  36.0    902K    25K   40.8     555K     14K     0.1  0.0     9.9    1.1 11.2 26.6   2.0M
1:6 SATA  14.2    130K   9.2K   40.2     736K     18K     0.1  0.0    17.7    1.5  6.8 26.6   2.0M
 
Here you can identify whether any of your drives are having trouble, or whether whole nodes in your cluster are not scaled enough
for the load you have in your environment.
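
If one node's drives dominate that list, the same command can be narrowed down to that node; a sketch, assuming --nodes accepts a single node number the same way it accepts all:

bgo-isilon01-1# isi statistics drive --nodes=2 -t --long --orderby=OpsOut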
Investigations

The case of high amount of Broadcast traffic

I was looking into a different kind of problem the other day when I had to do a network dump on one of our servers.

As I was watching the network dump scrolling down my screen, I kept noticing all the DHCP Requests that were flying by.
Something wasn’t right.
 
I contacted our networking department, and they already had a case open on that subnet for high amounts of broadcast traffic.
 
I decided to take the first DHCP Request and see what was happening.
I filtered the requests on the EthernetAddress and started looking for a pattern.
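
We used our usual capture tooling for this, but for illustration, the same filtering in Wireshark would look something like the following display filter, where the MAC address is a placeholder:

eth.addr == 00:10:18:aa:bb:cc && bootp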
 
[Screenshot: the filtered DHCP Requests in the network capture]
 
After studying the DHCP Requests I found that the requesting client didn’t get any answers, and it kept sending requests pretty often.
Looking at the timestamps, it sends out a Request and then, after 2 seconds, sends a new one, doubling the number of seconds before the next timeout.
It sends them in the following pattern: 0 - 2 - 4 - 8 and finally 16 seconds, before it starts the same procedure over again.
 
Since the EthernetAddress didn’t get any address on our network (we have no DHCP scopes on our server segment),
I had to get our networking department to find the switch port the MAC address was sitting on.
After they directed me to the correct port I could log on to the attached server and check its hardware.
 
When I was studying our Windows 2008 R2 server I couldn’t find the MAC address with the getmac command on the server.
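
For reference, this is roughly how I was checking; the /v and /fo list switches just make the output easier to read, and getmac only lists the adapters that Windows itself exposes as network adapters:

PS C:\> getmac /v /fo list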
 
What could it be?
 
After looking into the packet a bit more I saw that it had a VendorClassIdentifier called: brcmftsk.
 
I tried to google it, but it didn’t return too many good results.
 
In the middle of the lunch break, as I was discussing the matter with a colleague, he tipped me off about an article[1] regarding Broadcom and DHCP.
 
 
The author of the article had exactly the same problem as me, but he solved it years before I knew it was a problem 🙂
 
As it turns out, all our Windows servers were requesting an address for their iSCSI adapters on our network.
This was the default configuration on the iSCSI adapter, and according to the article it had to be turned off manually.
 
Since we don’t use iSCSI or DHCP on the server segment, we didn’t notice any depletion of the IP scope or any disruption of an iSCSI service.
But with 200+ servers in the same segment sending DHCP Requests for both of their iSCSI adapters, the amount of broadcast traffic was considerable.
 
But how do you configure the iSCSI adapters on 200+ servers? Not manually, at least not that I know of.
 
The best way of doing it properly is to install the Broadcom Management Applications Control Suite and configure it there.
But not all our servers have this suite installed, and that would require extra downtime and planning.
 
So our quick and dirty workaround for this issue was to disable the driver on our servers.
 
The driver service for the iSCSI adapter is named BXOIS in the registry and is a kernel-mode driver.
We decided that we would simply disable this service, which in turn disables the driver.[3]
As we have an Active Directory environment, we added this to the default configuration for all our Windows servers through Group Policy.
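
The Group Policy simply pushes a Start value of 4 (disabled) for that service; if you want to test the effect on a single server first, the equivalent one-liner would be something like this, run from an elevated prompt:

PS C:\> reg add HKLM\System\CurrentControlSet\Services\BXOIS /v Start /t REG_DWORD /d 4 /f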
 
After we applied the Group Policy we could see that the registry had been changed properly.
 
PS C:\> reg query  HKLM\System\CurrentControlSet\Services\BXOIS
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\BXOIS
    Start    REG_DWORD    0x4
 
After rebooting our server, the iSCSI devices were listed like this in devmgmt.msc:
 
[Screenshot: the iSCSI devices in Device Manager, shown with their drivers disabled]
 
The warning sign on the device says the following when you open it:
A driver (service) for this device has been disabled. An alternate driver may be providing this functionality. (Code 32).
 
After we deployed this to all our servers, we saw that the broadcast traffic on our network dropped dramatically.
 
Sources: