Uncategorized

I’ve updated for better readability

I’ve tried to update so it is easier to read my posts.
In some cases or web browsers this blog looks horrible,
but IE is not the most commonly used browser out there.

Hope you survive ūüôā

Advertisements
Investigations

Oracle Application that hangs and crashes

This week we had an interesting and challenging problem for one of our customers.

Our customers were reporting that one of our mission critical applications started to sporadicly hang and stopped working properly.
This was pretty unexpected as there was no planned changes in our infrastructure towards this application.

My college that initially got this support request did some initial queries and investigation, before he called our Problem team in to look at this case.
What he had found out was that the problem was located only to one location, and only this Oracle application.
It was pretty weired, but it was easy to reproduce the problem on one or more of the client.

My college had made a process dump from the hanging application as it crashed and had some well documented times this was happening.

I started looking at the process dump from the latest crash.

!analzye -v -hang was showing me that it was waiting for my driver orantcp11 .

orantcp11!nttini+fe7
06cb9d77 85c0 test eax,eax
SYMBOL_STACK_INDEX: 5
SYMBOL_NAME: orantcp11!nttini+fe7
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: orantcp11
IMAGE_NAME: orantcp11.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4bb3466a
STACK_COMMAND: ~0s ; kb
BUCKET_ID: HANG_orantcp11!nttini+fe7
FAILURE_BUCKET_ID: APPLICATION_HANG_cfffffff_orantcp11.dll!nttini

We didn’t have to much to go on, and WinDBG is not my strongest side, I’ll just have to trust what it was telling me.
If the driver is hanging, what could be wrong?

At our team meeting we decided to draw up the client <-> server communications and found one deviation for the location in question.
All the traffic has to pass through an extra firewall as this customer is was a private institution connected to us.

We decided to reproduce the issue and do a network dump on the client.
What we found in this was that the communication suddenly just stopped from server, our client sendt an ACK packet to the server on the last response,
and after that the it ended, as we killed the application on the client we saw that there was an RST, ACK packet going from the client to the server.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10691 None 2015-11-05T14:56:30.9055960 10.0.0.50 10.0.0.10 TCP Flags: …A.R.., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 8222720(scale factor: 8)

We decided to do this again and then dump traffic on the server to.
Looking at the network dumps we could see the server and client connecting and communication perfectly until it just died.

MessageNumber DiagnosisTypes Timestamp TimeElapsed Source Destination Module Summary
10688 None 2015-11-05T14:52:41.5605470 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535
10690 None 2015-11-05T14:52:41.7774920 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748806081, Win: 32120
10689 Application 2015-11-05T14:52:41.7722210 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 396, Seq Range: 3748805685 – 3748806081, Ack: 752756193, Win: 65535

We noticed on the server that the last packet , Message number 10689, is a resend of the packet 10688.
When we where looking into the client side of things we noticed that something had been changed.

115591 None 2015-11-05T14:52:41.3666688 10.0.0.10 10.0.0.50 TCP Flags: …AP…, SrcPort: 1521, DstPort: 58006, Length: 56, Seq Range: 3748805685 – 3748805741, Ack: 752756193, Win: 32120
115601 None 2015-11-05T14:52:41.5770435 10.0.0.50 10.0.0.10 TCP Flags: …A…., SrcPort: 58006, DstPort: 1521, Length: 0, Seq Range: 752756193 – 752756193, Ack: 3748805741, Win: 16229120(scale factor: 8)

The packet length had changed on it’s way from the server to our client from a length of 396 to 56, and that the retransmitted packet from the server didn’t arrive at it’s destination.
Looking through out change log we found that our networking department had replaced and software upgraded the firewall towards this location around the date this issue had started to occur.

We checked all the settings on our firewall and after a lot of research, we found a document that described a Application Layer Inspection feature on the firewall that by default was on i the OS.
The stated feature was a needed feature to have on if you had Oracle clients earlier than version 10, but for newer Oracle clients it was recommended to turn off.

We got our networking department on the phone and we logged and carried through with an emergency change on the firewall turning off this feature that was specific to Oracle SQL on the firewall.
Since then we haven’t been able to reproduce the issues on our client at that location.

Case closed ūüėÄ

Uncategorized

Finding the performance issue on Isilon

So where do you look when you have an performance issue on your Isilon?

There are three main areas to look for issues to narrow it down to where you should put your resources.
I’m going to cover the following areas and show some examples from our system.
General overview
To get an general overview of where to begin the hunt for our performance deamon,
using this command gives you an general and good indication of where to begin.
bgo-isilon01-1# isi statistics pstat
___________________________NFS3 Operations Per Second__________________________
null               0.00/s  getattr            0.35/s  setattr            0.00/s
lookup             0.00/s  access             0.00/s  readlink           0.00/s
read               0.00/s  write              0.00/s  create             0.00/s
mkdir              0.00/s  symlink            0.00/s  mknod              0.00/s
remove             0.00/s  rmdir              0.00/s  rename             0.00/s
link               0.00/s  readdir            0.00/s  readdirplus        0.00/s
statfs             0.00/s  fsinfo             0.00/s  pathconf           0.00/s
commit             0.00/s  noop               0.00/s
TOTAL              0.35/s

___CPU Utilization___                                     _____OneFS Stats_____
user             4.8%                                     In         90.63 MB/s
system          20.9%                                     Out        14.48 MB/s
idle            77.1%                                     Total     105.11 MB/s

____Network Input____        ____Network Output___        _______Disk I/O______
MB/s            79.94        MB/s           107.76        Disk     9979.20 iops
Pkt/s        63373.50        Pkt/s        70165.20        Read       25.98 MB/s
Errors/s         2.07        Errors/s         0.00        Write     163.05 MB/s

The interesting parts are the lower half that gives you an performance overview.
CPU Resources
After reading a lot of articles on EMC Support Community I’ve summed up some experiences that have been discussed there.
High CPU usage is not always a problem, but it can be.
To get an overview over the whole clusters CPU usage you can use the following command:
bgo-isilon01-1#¬† isi statistics system –nodes –top
isi statistics: Mon Dec  8 14:36:05 2014
————————————————————
Node   CPU   SMB   FTP  HTTP ISCSI   NFS  HDFS Total NetIn NetOut DiskIn DiskOut
LNN %Used   B/s   B/s   B/s   B/s   B/s   B/s   B/s   B/s    B/s    B/s     B/s
All  15.4   11M   0.0 918.3   0.0  2.7M   0.0   14M   13M    18M    62M     31M
1  22.2  596K   0.0 918.3   0.0  25.5   0.0  597K  1.5M   818K   5.8M    815K
2  16.1  2.0M   0.0   0.0   0.0  2.7M   0.0  4.7M  3.8M    16M   3.8M    1.7M
3  31.8  8.7M   0.0   0.0   0.0   0.0   0.0  8.7M  7.8M   1.6M   7.4M    1.2M
4   7.7   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    14M    6.9M
5   9.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    11M    7.1M
6   9.9   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    10M    6.8M
7  10.5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0   9.9M    6.6M
From there you can dig deeper on the node in question, for example following is an example from my node 1:
bgo-isilon01-1# ps -fwulp `pgrep lwio`
bgo-isilon01-1: USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
bgo-isilon01-1: root 13355 38.3  2.7 482600 336012  ??  S    24Oct14 24257:14.77 lw-container lwi     0 69387   0  96  0 ucond
Shows that there is some significant use of CPU on this one node. If you are checking the treads which can be used by adding a -H option,
you will see in this case that it the treads has the workload equally spread over the threads.
bgo-isilon01-1# ps -fwulHp `pgrep lwio`
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 864:03.41 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.03 lw-container lwi     0 69387   0  20  0 sigwait
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 292:37.04 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1014:51.42 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 167:07.17 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.11 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.21 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14  74:34.93 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:15.84 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 4
82600 336012  ??  S    24Oct14 1364:34.30 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:41.60 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:59.26 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:10.67 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1367:52.95 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.46 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:23.53 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:52.03 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:56.50 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:53.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:06.38 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:52.35 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:25.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:50.94 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1366:15.88 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:23.59 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.71 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.61 lw-container lwi     0 69387   0  96  0 ucond
If you have one thread running at 100% you are looking at a potential problem.
A quote from Tim Wright at EMC:
“It is entirely normal and expected to see multiple threads consuming 15, 20, 25% cpu at times. *If* you see one or more threads that are consistently and constantly consuming 100% cpu, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming¬† >100% cpu, that is not likely to be a problem. Certain operations including auth can be somewhat cpu-intensive. Imagine hundreds of users connecting to the cluster in the morning.”
Network and Protocol Performance.
Here we are going to check the physical links first and then go on to check the protocols.
Your main command to check interfaces for Ierrs and Oerrs is netstat.
It is possible to use the isi_for_array command to issue this command but the output is sort of distorted to put here.
But if you run netstat -i on one of the nodes is looks pretty much like this:
bgo-isilon01-1# netstat -i
Name    Mtu Network       Address                      Ipkts Ierrs    Opkts Oerrs  Coll
cxgb0  1500 <Link#1>      00:07:43:09:90:bc         31448503823 25069 46349885580     0     0
cxgb0¬† 1500 10.0.0.0¬†¬† 10.0.0.131¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 31448503823¬†¬†¬†¬† – 46349885580¬†¬†¬†¬† –¬†¬†¬†¬† –
cxgb0¬† 1500 10.0.0.0¬†¬† 10.0.0.123¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 31448503823¬†¬†¬†¬† – 46349885580¬†¬†¬†¬† –¬†¬†¬†¬† –
cxgb0¬† 1500 10.0.0.0¬†¬† 10.0.0.126¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 31448503823¬†¬†¬†¬† – 46349885580¬†¬†¬†¬† –¬†¬†¬†¬† –
cxgb1  1500 <Link#2>      00:07:43:09:90:bd         58378279537 44629 79115225564     0     0
cxgb1¬† 1500 10.0.0.0¬†¬† 10.0.0.128¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 58378279537¬†¬†¬†¬† – 79115225564¬†¬†¬†¬† –¬†¬†¬†¬† –
cxgb1¬† 1500 10.0.0.0¬†¬† 10.0.0.121¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 58378279537¬†¬†¬†¬† – 79115225564¬†¬†¬†¬† –¬†¬†¬†¬† –
cxgb1¬† 1500 10.0.0.0¬†¬† 10.0.0.112¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 58378279537¬†¬†¬†¬† – 79115225564¬†¬†¬†¬† –¬†¬†¬†¬† –
em0    1500 <Link#3>      00:25:90:98:ee:82          8995333     0        0     0     0
em1    1500 <Link#4>      00:25:90:98:ee:83          8995333     0        0     0     0
lo0   16384 <Link#5>                                39779509     0 39778192     0     0
lo0¬†¬† 16384 your-net¬†¬†¬†¬†¬† localhost¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 39779509¬†¬†¬†¬† – 39778192¬†¬†¬†¬† –¬†¬†¬†¬† –
lo0¬†¬† 16384 128.221.254.0 bgo-isilon01-1¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 39779509¬†¬†¬†¬† – 39778192¬†¬†¬†¬† –¬†¬†¬†¬† –
ib0    2004 <Link#6>      00:15:1b:00:10:82:02:52   296657289     0 302119159     0     0
ib0¬†¬†¬† 2004 128.221.253.0 128.221.253.1¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 296657289¬†¬†¬†¬† – 302119159¬†¬†¬†¬† –¬†¬†¬†¬† –
ib1    2004 <Link#7>      00:15:1b:00:10:82:02:53   325413619     0 327997270     0     0
ib1¬†¬†¬† 2004 128.221.252.0 128.221.252.1¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† 325413619¬†¬†¬†¬† – 327997270¬†¬†¬†¬† –¬†¬†¬†¬† –
As we can see here we are actually having a network issue on this node on both of our interfaces.
cxgb0 (10gige-2) and cxgb1 (10gige-1).
It is not possible to reset these counters, the only way of doing this is to do an ifconfig cxgb0 down ; ifconfig cxgb0 up.
But that is not recommended.
Actually here our chase for the error is finished on the Isilon. But it is a good idea to get an overview of the protocol performance counters at the same time.
First of, get an overview over active clients on our system:
bgo-isilon01-1# isi statistics query –nodes=all –stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 –interval 5 –repeat 12 –degraded
  NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2
       1                            540                            3                           37
       2                            697                            1                           27
       3                            928                            0                           25
       4                              0                            0                            0
       5                              0                            0                            0
       6                              0                            0                            0
       7                              0                            0                            0
 average                            309                            1                           13
This will repeat 12 times with 5 seconds interval.
Since we only have 3 access nodes the clients are distributed only on node 1 to 3.
If we run this command:
bgo-isilon01-1# isi statistics protocol –nodes=all –protocols=smb1,smb2 –total –interval 5 –repeat 12 –degraded
Ops    In   Out TimeAvg TimeStdDev Node Proto Class Op
N/s   B/s   B/s      us         us
6.9 880.2  1.7K   791.8     1102.1    1  smb1     *  *
369.6   74K   71K 31426.6   731763.0    1  smb2     *  *
0.2  29.7  43.9    63.0        0.0    2  smb1     *  *
2.0K  374K  3.0M  2207.0    83113.5    2  smb2     *  *
0.2  34.8  51.5   240.0        0.0    3  smb1     *  *
686.5  2.8M  1.0M  4543.0    40379.7    3  smb2     *  *
 
You get to see Network bandwidth that each protocol is using on the nodes. You can also see how the Time Average spent on the number of Ops.
Next place to look is:
bgo-isilon01-1# isi statistics protocol –nodes=all –protocols=smb1,smb2 –orderby=Class –interval 5 –repeat 12 –degraded
Ops    In   Out    TimeAvg TimeStdDev Node Proto           Class                    Op
N/s   B/s   B/s         us         us
192.0   70K   37K     2128.4     1693.9    1  smb2          create                create
31.0   12K  6.1K     8277.1    45135.5    2  smb2          create                create
158.9   68K   32K     1727.6     4361.0    3  smb2          create                create
0.3  30.4   0.0       86.5        7.8    1  smb1      file_state nttrans:notify_change
4.7 463.9 362.8 27473836.0 62147152.0    1  smb2      file_state         change_notify
185.0   17K   24K     2418.5     6679.5    1  smb2      file_state                 close
0.2  22.0  22.0      100.0        0.0    1  smb2      file_state          oplock_break
4.0 400.1 310.1  6400240.5  9013317.0    2  smb2      file_state         change_notify
28.8  2.7K  3.7K      887.8     5004.6    2  smb2      file_state                 close
15.6  1.8K  1.1K      146.1       90.3    2  smb2      file_state                  lock
0.2  20.8  20.8      125.0        0.0    2  smb2      file_state          oplock_break
4.1 407.4 336.3  3423766.0  6155676.0    3  smb2      file_state         change_notify
118.4   11K   15K      156.3       83.2    3  smb2      file_state                 close
11.5  1.3K 831.1      133.8      127.7    3  smb2      file_state                  lock
1.4 141.2 141.2      149.2       61.5    3  smb2      file_state          oplock_break
1.7 299.2  67.4     1916.0      800.9    1  smb1  namespace_read      trans2:findfirst
1.6 241.1 161.7     1707.0      531.9    1  smb1  namespace_read      trans2:qpathinfo
98.2   10K   58K      651.4     1182.7    1  smb2  namespace_read       query_directory
12.7  1.4K  1.1K      924.5     1640.6    1  smb2  namespace_read            query_info
13.2  1.4K  115K      936.7     1717.0    2  smb2  namespace_read       query_directory
5.8 626.3 507.3      246.4      348.3    2  smb2  namespace_read            query_info
121.3   12K  274K     9975.6    64047.6    3  smb2  namespace_read       query_directory
56.4  6.1K  5.6K      363.1      444.9    3  smb2  namespace_read            query_info
18.4  2.0K  1.3K      840.3     1904.4    1  smb2 namespace_write              set_info
5.2 556.1 364.1      128.0       40.0    2  smb2 namespace_write              set_info
5.2  1.6K 364.4     4016.6     6241.8    3  smb2 namespace_write              set_info
0.6  45.7   0.0       24.0        1.7    1  smb2           other                cancel
0.6  58.4  45.7     1944.7      503.8    1  smb2           other                 flush
0.8  57.6   0.0       35.2       11.8    2  smb2           other                cancel
1.6 114.1   0.0       30.4        2.6    3  smb2           other                cancel
2.4K  285K  524K    15549.1    11171.1    1  smb2            read                  read
180.8   21K  9.4M      346.8      112.4    2  smb2            read                  read
62.7  7.3K  1.0M      514.0     1628.2    3  smb2            read                  read
0.4  56.0  82.8      100.0       39.6    3  smb1   session_state               negprot
0.7  73.3 159.6       64.7       10.0    3  smb2   session_state             negotiate
0.7  1.8K 162.5   110202.7     8156.3    3  smb2   session_state         session_setup
0.7  96.4  57.0      782.3      214.5    3  smb2   session_state          tree_connect
820.6   54M   69K     2023.7     2729.1    1  smb2           write                 write
170.2   11M   14K     3056.0     3288.4    2  smb2           write                 write
145.3  9.5M   12K      763.7      406.8    3  smb2           write                 write

The following command is great to find the clients that is causing the load:
bgo-isilon01-1#¬† isi statistics client –orderby=Ops –top
isi statistics: Mon Dec  8 15:10:49 2014
————————————————————
Ops         In        Out    TimeAvg       Node      Proto            Class   UserName        LocalName       RemoteName
N/s        B/s        B/s         us
2.4K       278K       842K    11973.7          1       smb2             read DOMAIN\user7    10.0.0.131 app112.domain.com
491.6        58K        14M     7045.0          2       smb2             read DOMAIN\user6    10.0.0.129 clu086.domain.com
181.2        68K        35K     1294.1          1       smb2           create    DOMAIN\user5    10.0.0.131 73602.domain.com
181.2        17K        23K     2443.8          1       smb2       file_state    DOMAIN\user5    10.0.0.131 73602.domain.com
145.4       9.5M        12K     2204.3          1       smb2            write    UNKNOWN    10.0.0.131 app112.domain.com
134.8       8.8M        11K     4910.0          2       smb2            write    UNKNOWN    10.0.0.129 clu086.domain.com
120.0        13K        35K      286.8          2       smb2   namespace_read    DOMAIN\user4   10.0.0.132 83267.domain.com
99.2        10K        49K      314.3          2       smb2   namespace_read    DOMAIN\user3    10.0.0.129 74892.domain.com
95.4        11K       5.3M      434.6          2       smb2             read DOMAIN\user2    10.0.0.129 54392.domain.com
94.8       6.0M       8.0K      886.9          2       smb2            write    UNKNOWN    10.0.0.129 54392.domain.com
90.0        34K        16K     1476.5          2       smb2           create    DOMAIN\user    10.0.0.129 69728.domain.com
88.2       5.7M       7.4K      714.4          3       smb2            write    UNKNOWN    10.0.0.133 clu017.domain.com
83.2       8.4K        56K      197.1          3       smb2   namespace_read  DOMAIN\user    10.0.0.133 68602.domain.com
 
Disk IO
 
To look at Disk IO you have the following command:
bgo-isilon01-1# isi statistics drive -nall -t –long –orderby=OpsOut
 
This lists all the drives on all nodes and orders it by OpsOut.
A small snipet from the output of that command:
isi statistics: Mon Dec  8 14:39:17 2014
————————————————————
Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay        N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
2:10 SATA  25.0    545K    22K   49.6     230K    4.6K     0.1  0.0    11.9    0.9 12.1 21.0   2.1M
3:11 SATA  29.0    486K    17K   46.8     230K    4.9K     0.1  0.0    15.6    1.0 10.8 21.0   2.1M
2:6 SATA  34.2    899K    26K   46.6     290K    6.2K     0.1  0.0    11.0    1.0 11.7 26.6   2.0M
3:4 SATA  15.4     65K   4.2K   45.8     651K     14K     0.1  0.0     7.7    1.0 10.4 26.6   2.0M
2:4 SATA  20.4    256K    13K   44.8     427K    9.5K     0.1  0.0    10.6    1.1 10.7 26.6   2.0M
2:3 SATA  15.6    150K   9.6K   44.0     603K     14K     0.1  0.0    10.3    1.0 12.1 26.6   2.0M
2:11 SATA  17.8    219K    12K   43.4     200K    4.6K     0.1  0.0    17.5    1.2 14.1 21.0   2.1M
3:7 SATA  19.4    128K   6.6K   41.0     202K    4.9K     0.1  0.0    21.3    1.5  7.8 21.0   2.1M
3:1 SATA  36.0    902K    25K   40.8     555K     14K     0.1  0.0     9.9    1.1 11.2 26.6   2.0M
1:6 SATA  14.2    130K   9.2K   40.2     736K     18K     0.1  0.0    17.7    1.5  6.8 26.6   2.0M
 
Here you can identify if any of your drives are having trouble or if whole nodes in your cluster is not scaled enough
for the load you are having in your environment.
Sources:
Uncategorized

Finding the faulting Printer Driver

We ha just had a case where our print spooler didn’t recover properly after it Crashed.

As we restarted the spooler, the spooler ran for a while and then crashed again.
We did this a couple of times.
As the Print Spooler was pending in our Cluster we could see that the spoolsv.exe process
was jumping up and down in memory usage and that the our E:\Spooler folder was get less objects and at the same time receiving new ones.
After the second time, of trying to restart the spooler service we started to look for the oldest document.
Our FP00160.SHD and FP00160.SPL was a clear candidate for inspection as they where the oldest documents in the print que.
Both files are possible to open in Notepad for inspection.
In the SHD file, if you check the first line it will be filled with cryptic signs but somewhere it will say something like this:
M Y P R I N T E R \ s 0 1 9 \ M Y P R I N T E R
This is the printer you should check out for a faulty driver or maybe update it.
But there is more information available in the SPL file.
From the top of the SPL file:
@PJL COMMENT
@PJL SET USERNAME = “username”
@PJL SET KMCOETYPE = 2
@PJL SET DRIVERJOBID = “ACDE480057D1091E0A110100BE”
@PJL SET STRINGCODESET = UTF8
@PJL SET BITSPERPIXEL = 8
@PJL SET RESOLUTION = 600
@PJL SET JOBNAME = “Microsoft Word – My Document”
Now you can check the Spooler for the printer MYPRINTER and see where the document came from (computer),
and cancel the document in the Spooler if you want.
Hope this helps in troubleshooting and getting you closer to finding the faulty print driver.
kF
Investigations

The case of high amount of Broadcast traffic

I was looking into a different kind of problem the other day when I had to do a network dump on one of our servers.

As I was watching the network dump scrolling down the my screen I kept noticing all the DHCP Requests that was flying by my screen.
Something wasn’t right.
 
I contacted our networking department and they had already a case on that subnet with high amounts of Broadcasts. 
 
I decided to take the first DHCP Request and see what was happening.
I filtered the requests on the EthernetAddress and started looking for a pattern.
 
Image
 
After studying the DHCP Request I found that the client requesting didn’t get any¬†answers. And it kept sending request pretty often.¬†
After looking at the time stamp it sends out an Request and then after 2 seconds, it sends a new one doubling the number of seconds to time out. 
It is sending out in the following patteren: 0 Р2 Р4 Р8  and last 16 seconds before it started the same procedure again. 
 
Since the EthernetAddress didn’t get any address on our network, no DHCP on our Server scopes,¬†
I had to get our networking to find the port the MAC was sitting on. 
After they directed me to the correct networking port I could log on to the attached server and check it’s hardware.
 
When I was studying our Windows 2008 R2 server I couldn’t find any mac address with the getmac command on the server.
 
What could it be?
 
After looking into the packet abit more I saw that it had a VendocClassIdentifier called: brcmftsk.
brcmtfsk
 
Tried to google it but it didn’t return to many good results.
 
In the middle of the lunch break, as I was discussing the matter with an college, he tipped me of with an article[1] regarding Broadcom and DHCP.
 
 
The author of the article has¬†exactly¬†the same problem as me, but he solved it years before I knew it was a problem ūüôā
 
As it turns out, all our Windows servers are requesting an address for its iSCSI adapter on our network.
This was the default configuration on the iSCSI Adapter and according to the article it had to be turned of manually.
 
Since we didn’t use iSCSI or DHCP on the server segment we didn’t notice any depletion in the IP Scope or disruption on the iSCSI service.
But after having 200+ servers in the same segment requesting DHCP request for both of their iSCSI adapters, the amount of Broadcasts was questionable.
 
But how do you turn configure the iSCSI on 200+ servers? It is not manually at least that I know.
 
The best way of doing it is, doing it properly, install the DroadCom Managed Applications Control Suite and configure it there. 
But not all our servers has this suite installed and that would require extra downtime and planing. 
 
So our quick and dirty workaround for this issue was to disable the driver on our servers.
 
The driver service for the iSCSI Adapter is named BXOIS in the registry and is a Kernel loaded driver.
We decided that we should just disable this service and then the driver will be disabled.[3] 
As we have an Active Directory environment, we added this to the default configuration for all our Windows Servers.
 
After we applied the Group Policy we could se the registry had changed properly.
 
PS C:\> reg query  HKLM\System\CurrentControlSet\Services\BXOIS
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\BXOIS
    Start    REG_DWORD    0x4
 
After rebooting our server the following iSCSI devices was listed like this in Devmgmt.msc
 
DeviceDriver disabeld
 
The warnings sign says the following on the driver when you open it:
A driver (service) fir this device has been disabled. An alternate driver may be providing this functionality. (Code 32).
 
After we deployed this to all our servers we saw that the Broadcasts on our network dramaticly dropped.
 
Sources:
 
 
 
 
Investigations

DHCP and Failing Dynamic DNS Update

The Case:

We where having an issue with some of our clients not getting updated properly in our DNS.
Our clients somehow had the wrong IP registered in our DNS server.
We knew that our clients was accessing a 802.1x network and spent some time getting authenticated.
Since this is a newly¬†acquired¬†network it had it’s own DHCP server and that DHCP server was enabled to
DDNS update for our clients, we where only experiencing the issue when the clients where successfully
authenticated on 802.1x.

The investigation:

So where to start? Is it the new DHCP that is overwriting the entries after the client successfully authenticated?
Is it the client that is not able to update it’s own records due to DNS No-Refresh intervall on the DNS server?
Is our primary DNS Server that is not able to update the records due to DNS No-Refresh?
Anyways, I started comparing our DHCP Servers DNS Settings.
Both of them where configured identically for the DNS Settings and was using the same DNS Update Credentials.

PS C:\> Get-DhcpServerv4DnsSetting -ComputerName dns01.contoso.com
  DynamicUpdates             : Always
  DeleteDnsRROnLeaseExpiry   : True
  UpdateDnsRRForOlderClients : True
  DnsSuffix                  :
  DisableDnsPtrRRUpdate      : False
  NameProtection             : False

After checking all the scopes on the server, we could see that all the scopes had the same settings as our primary settings on the DHCP Server.
Here is a PowerShell Script to list all the scopes in a table with their DNS Settings (not perfect but gives you an overview,):

$srv = "."
$scopelist = Get-DHCPServerv4Scope -computername $srv
foreach ($item in $scopelist) {
Write-host $item.Name , $item.ScopeId.IPAddressToString
Get-DHCPServerv4DNSsetting -ComputerName $srv -ScopeId $item.Scoped.IPAddressToString | Format-Table
}

Well everything seamed to be as it should be.. but why didn’t our primary DNS Server update the latest DHCP Lease in DNS with the proper IP?
I decided to check out the DCHPLogs.. And check our client’s name. After opening the logfile I could see straigt away a bunch of error messages
related to DNS Updates.
The log file was filled up with ID 30 and ID 31 entries. They look something like this:
11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,
31,07/16/14,11:16:00,DNS Update Failed,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,
30,07/16/14,11:16:00,DNS Update Request,10.84.149.38,AP30f7.0d92.5ea1,,,0,6,,,
11,07/16/14,11:16:00,Renew,10.84.149.38,AP30f7.0d92.5ea1,30F70D925EA1,,1276772352,0,,,

After counting all the 31 Events in Notepad++ we had to many failed updates (aprox 60000 a day).
All kinds of clients failed to get updated. But also all kinds of clients succeed also. What could be doing this?

Discussing with Microsoft, we decided to increase the Que limit for DDNS updates on the DHCP server. [1]
Since we have Windows 2008 R2 DNS and DHCP Servers we increased the DynamicDNSQueueLength to 65536 as described in the article[2].

This didn’t actually help, as our que is still too big. So I stared to investigate what records are failing
and I was a bit surprised, but apparently we have a lot of units that have “invalid FQDN hostnames” reporting to the DHCP Server.¬†
As our DHCP Server is set to update “DynamicUpdates : Always and¬†UpdateDnsRRForOlderClients : True” it will update any type of client
that is reporting it’s hostname to our DHCP server. The¬†invalid¬†FQDN’s was like this:¬†AP30f7.0d92.5ea1.
It is two letters¬†(AP)¬†followed by an MAC address for our Cisco accesspoints. One of the issues with this is the use of “.” in the hostname.
If we where to allow the update to happen, we have to create the 0d92.5ea1 Zone in our DNS server and then the DHCP server will create
a record for AP30f7 inside that Zone. But we have to create a Zone for every Access Point because the last 9 characters is uniq to every Access Point.

So that is not an option. So how can we fix this or reduce the problem?
There is three ways of doing this:
1. Either we rename all our Access Points to a fitting standard
2. We adjust our settings for our DHCP Server globally
3. We adjust just the Scopes in question.

Our network admin wasn’t too glad about renaming several thousands of access points. So I decided to explore the other options.
If we are adjusting the DHCP DNS Update settings, what settings should we have?
Should we just adjust the scope settings or could we do it globally on the DHCP Server?

To figure our this I had to read up [3][4] on the DHCP protocol and how clients send information to the DHCP Server to do or not do DDNS.
In article “Using DNS with DHCP” [4] there is a nice schema of how it looks when the client is communicating with the DHCP Server:

DHCP and DNS Update interaction
DHCP and DNS Update interaction

This is when the client it self is updating the record via DDNS and not when the DHCP Server is doing it.
The setting in the DHCP Server is then “Dynamic Updates: OnClientRequest” which is default settings DDNS Updates from DHCP.
But ours has this set to “Always” and we also have the option “UpdateDNSRRForOlderClients: TRUE” also.

This means that all clients that are getting a lease from our DHCP Server will be updated in our DNS Server by our DHCP server.

So how does this work, DHCP has four steps that it goes through to give a lease and in an default DHCP setup decide if the client
or the DHCP server should do the DNS Update. Shown in the picture:

DHCP Message Exchange
DHCP Message Exchange

First the client does an DHCPDiscover and send the Hostname (ID 12) in the packet.
It then receives an packet from the DHCP Server with an IP Address, Subnet,DNSServer and other DHCPOptions in the DHCPOffer packet from the DHCP Server.
Then the Client sends a DHCPREQUEST packet, this packet contains RequestIP (ID 50), HostName (ID 12) and it may send FQDN (ID 81).
This last option is optional to send for the clients, but all the Windows XP/2003 and newer clients sends it.
The DHCPACK packet is the returned from the DHCP Server and it contains the DHCP DNS Settings in the FQDN Field (ID 81).
If the DNS Settings are default the client will update it self to the DNS Server.

So why is our Access Point not having a FQDN when we are looking in the DHCP Scope and in our logs?
As I was inspecting the packet dump from one of the Access Points i noted that the Access Points didn’t send
the option FQDN (ID 81) in the DHCPREQUEST packet to the DHCP server. So this client is actually not requesting to
be Updated in our DNS, but our DHCP Server still does it.

So this means we either adjust the Scope or set the settings globally on the DHCP Server.
If we change the settings to DynamicUpdates: OnClientRequest and set the option UpdateDNSRRForOlderClients: False.
We will avoid updates requests from clients that are trying to update to an non existing Zone on our DNS Server.

Conclution:

So we have experienced two things in this scenario.
1. Having a good naming standard for all your devices will help a lot and not get you into unexpected trouble like this.
2. Configuring the DHCP server to update everything to DNS might sound like a good idea, but only if you have done point 1 properly.

We have adjusted our DHCP Server globally to have the DynamicUpdates: OnClientRequest and UpdateDNSRRForFolderClients: False.
This is done at the root level of the IPv4 Settings, but you have to check all the scopes after as if it has been changed from the standard config.
The setting will not propegate to the Scopes. You can check with the earlier Powershell script,
and if the settings are inconsistent, it is easy to rewrite it so it sets the correct settings on the scopes that are not following the global config.
The DHCPServer module in Powershell is available in Windows 2012 and newer OSes. So for the Powershell script to work you have to have this OS.

Hope you feel this was worth your time reading and that you enjoyed it.

Regards,
Kenneth

Sources:
[3]Windows Server 2008 TCP/IP Protocols and Services