So where do you look when you have a performance issue on your Isilon?

There are three main areas to look at when narrowing down where you should put your resources.
I’m going to cover the following areas and show some examples from our system.
General overview
To begin the hunt for our performance demon, this command gives you a good general indication of where to start.
bgo-isilon01-1# isi statistics pstat
___________________________NFS3 Operations Per Second__________________________
null               0.00/s  getattr            0.35/s  setattr            0.00/s
lookup             0.00/s  access             0.00/s  readlink           0.00/s
read               0.00/s  write              0.00/s  create             0.00/s
mkdir              0.00/s  symlink            0.00/s  mknod              0.00/s
remove             0.00/s  rmdir              0.00/s  rename             0.00/s
link               0.00/s  readdir            0.00/s  readdirplus        0.00/s
statfs             0.00/s  fsinfo             0.00/s  pathconf           0.00/s
commit             0.00/s  noop               0.00/s
TOTAL              0.35/s

___CPU Utilization___                                     _____OneFS Stats_____
user             4.8%                                     In         90.63 MB/s
system          20.9%                                     Out        14.48 MB/s
idle            77.1%                                     Total     105.11 MB/s

____Network Input____        ____Network Output___        _______Disk I/O______
MB/s            79.94        MB/s           107.76        Disk     9979.20 iops
Pkt/s        63373.50        Pkt/s        70165.20        Read       25.98 MB/s
Errors/s         2.07        Errors/s         0.00        Write     163.05 MB/s

The interesting part is the lower half, which gives you a performance overview.
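
If the problem comes and goes, it can be useful to capture these snapshots over time instead of looking at a single sample. A minimal sketch of that, where the log path under /ifs is just an example:

# Take a pstat snapshot every 10 seconds during the problem window and
# append it, with a timestamp, to a log file (the path is only an example).
while true; do
    date
    isi statistics pstat
    sleep 10
done >> /ifs/data/perf/pstat.log
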
CPU Resources
After reading a lot of articles on EMC Support Community I’ve summed up some experiences that have been discussed there.
High CPU usage is not always a problem, but it can be.
To get an overview of the whole cluster's CPU usage you can use the following command:
bgo-isilon01-1# isi statistics system --nodes --top
isi statistics: Mon Dec  8 14:36:05 2014
————————————————————
Node   CPU   SMB   FTP  HTTP ISCSI   NFS  HDFS Total NetIn NetOut DiskIn DiskOut
LNN %Used   B/s   B/s   B/s   B/s   B/s   B/s   B/s   B/s    B/s    B/s     B/s
All  15.4   11M   0.0 918.3   0.0  2.7M   0.0   14M   13M    18M    62M     31M
1  22.2  596K   0.0 918.3   0.0  25.5   0.0  597K  1.5M   818K   5.8M    815K
2  16.1  2.0M   0.0   0.0   0.0  2.7M   0.0  4.7M  3.8M    16M   3.8M    1.7M
3  31.8  8.7M   0.0   0.0   0.0   0.0   0.0  8.7M  7.8M   1.6M   7.4M    1.2M
4   7.7   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    14M    6.9M
5   9.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    11M    7.1M
6   9.9   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    10M    6.8M
7  10.5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0   9.9M    6.6M
From there you can dig deeper into the node in question. For example, here is what I found on node 1:
bgo-isilon01-1# ps -fwulp `pgrep lwio`
bgo-isilon01-1: USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
bgo-isilon01-1: root 13355 38.3  2.7 482600 336012  ??  S    24Oct14 24257:14.77 lw-container lwi     0 69387   0  96  0 ucond
This shows that there is some significant CPU use on this one node. If you check the threads, which you can do by adding the -H option,
you will see that in this case the workload is spread equally over the threads.
bgo-isilon01-1# ps -fwulHp `pgrep lwio`
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 864:03.41 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.03 lw-container lwi     0 69387   0  20  0 sigwait
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 292:37.04 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1014:51.42 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 167:07.17 lw-container lwi     0 69387   0   4  0 kqread
root 13355  0.0  2.7 482600 336012  ??  I    24Oct14   0:00.11 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.21 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14  74:34.93 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:15.84 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:34.30 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:41.60 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:59.26 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:10.67 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1367:52.95 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.46 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:23.53 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:52.03 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:56.50 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1363:53.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:06.38 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:52.35 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:25.73 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1364:50.94 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1366:15.88 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14 1365:23.59 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:28.71 lw-container lwi     0 69387   0  96  0 ucond
root 13355  0.0  2.7 482600 336012  ??  S    24Oct14   0:30.61 lw-container lwi     0 69387   0  96  0 ucond
If you have one thread running at 100% you are looking at a potential problem.
A quote from Tim Wright at EMC:
“It is entirely normal and expected to see multiple threads consuming 15, 20, 25% cpu at times. *If* you see one or more threads that are consistently and constantly consuming 100% cpu, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming  >100% cpu, that is not likely to be a problem. Certain operations including auth can be somewhat cpu-intensive. Imagine hundreds of users connecting to the cluster in the morning.”
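
A quick way to check for that pattern is to sort the per-thread listing by the %CPU column; a minimal sketch:

# List the lwio threads, sort them by %CPU (column 3 in the ps output above)
# and keep the five busiest. One thread pinned near 100% is the warning sign;
# many moderately busy threads usually are not.
ps -fwulHp `pgrep lwio` | sort -rnk 3 | head -5
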
Network and Protocol Performance
Here we are going to check the physical links first and then go on to check the protocols.
Your main command to check interfaces for Ierrs and Oerrs is netstat.
It is possible to use the isi_for_array command to run this on every node at once, but that output is a bit too messy to paste here (there is a sketch of that approach further down).
If you run netstat -i on one of the nodes, it looks pretty much like this:
bgo-isilon01-1# netstat -i
Name    Mtu Network       Address                      Ipkts Ierrs    Opkts Oerrs  Coll
cxgb0  1500 <Link#1>      00:07:43:09:90:bc         31448503823 25069 46349885580     0     0
cxgb0  1500 10.0.0.0   10.0.0.131             31448503823     - 46349885580     -     -
cxgb0  1500 10.0.0.0   10.0.0.123             31448503823     - 46349885580     -     -
cxgb0  1500 10.0.0.0   10.0.0.126             31448503823     - 46349885580     -     -
cxgb1  1500 <Link#2>      00:07:43:09:90:bd         58378279537 44629 79115225564     0     0
cxgb1  1500 10.0.0.0   10.0.0.128             58378279537     - 79115225564     -     -
cxgb1  1500 10.0.0.0   10.0.0.121             58378279537     - 79115225564     -     -
cxgb1  1500 10.0.0.0   10.0.0.112             58378279537     - 79115225564     -     -
em0    1500 <Link#3>      00:25:90:98:ee:82          8995333     0        0     0     0
em1    1500 <Link#4>      00:25:90:98:ee:83          8995333     0        0     0     0
lo0   16384 <Link#5>                                39779509     0 39778192     0     0
lo0   16384 your-net      localhost                 39779509     - 39778192     -     -
lo0   16384 128.221.254.0 bgo-isilon01-1            39779509     - 39778192     -     -
ib0    2004 <Link#6>      00:15:1b:00:10:82:02:52   296657289     0 302119159     0     0
ib0    2004 128.221.253.0 128.221.253.1             296657289     - 302119159     -     -
ib1    2004 <Link#7>      00:15:1b:00:10:82:02:53   325413619     0 327997270     0     0
ib1    2004 128.221.252.0 128.221.252.1             325413619     - 327997270     -     -
As we can see, we actually have a network issue on this node, on both of our interfaces:
cxgb0 (10gige-2) and cxgb1 (10gige-1).
It is not possible to reset these counters; the only way to do so would be an ifconfig cxgb0 down ; ifconfig cxgb0 up, and that is not recommended.
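
To sweep the whole cluster for interface errors instead of logging in to each node, isi_for_array can wrap the same netstat call. A rough sketch, where the egrep pattern just keeps the header and the physical link lines:

# Run netstat -i on every node; isi_for_array prefixes each output line with
# the node name. Keep only the header and the <Link#> lines, which is where
# the Ierrs and Oerrs counters live.
isi_for_array 'netstat -i | egrep "Name|Link"'
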
At this point our chase for the error is actually finished on the Isilon side, but it is a good idea to get an overview of the protocol performance counters at the same time.
First off, get an overview of the active clients on the system:
bgo-isilon01-1# isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded
  NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2
       1                            540                            3                           37
       2                            697                            1                           27
       3                            928                            0                           25
       4                              0                            0                            0
       5                              0                            0                            0
       6                              0                            0                            0
       7                              0                            0                            0
 average                            309                            1                           13
This will repeat 12 times with a 5-second interval.
Since we only have three access nodes, the clients are distributed across nodes 1 to 3 only.
If we run this command:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded
Ops    In   Out TimeAvg TimeStdDev Node Proto Class Op
N/s   B/s   B/s      us         us
6.9 880.2  1.7K   791.8     1102.1    1  smb1     *  *
369.6   74K   71K 31426.6   731763.0    1  smb2     *  *
0.2  29.7  43.9    63.0        0.0    2  smb1     *  *
2.0K  374K  3.0M  2207.0    83113.5    2  smb2     *  *
0.2  34.8  51.5   240.0        0.0    3  smb1     *  *
686.5  2.8M  1.0M  4543.0    40379.7    3  smb2     *  *
 
This shows the network bandwidth each protocol is using on each node, as well as the average time (TimeAvg) spent per operation.
The next place to look is:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded
Ops    In   Out    TimeAvg TimeStdDev Node Proto           Class                    Op
N/s   B/s   B/s         us         us
192.0   70K   37K     2128.4     1693.9    1  smb2          create                create
31.0   12K  6.1K     8277.1    45135.5    2  smb2          create                create
158.9   68K   32K     1727.6     4361.0    3  smb2          create                create
0.3  30.4   0.0       86.5        7.8    1  smb1      file_state nttrans:notify_change
4.7 463.9 362.8 27473836.0 62147152.0    1  smb2      file_state         change_notify
185.0   17K   24K     2418.5     6679.5    1  smb2      file_state                 close
0.2  22.0  22.0      100.0        0.0    1  smb2      file_state          oplock_break
4.0 400.1 310.1  6400240.5  9013317.0    2  smb2      file_state         change_notify
28.8  2.7K  3.7K      887.8     5004.6    2  smb2      file_state                 close
15.6  1.8K  1.1K      146.1       90.3    2  smb2      file_state                  lock
0.2  20.8  20.8      125.0        0.0    2  smb2      file_state          oplock_break
4.1 407.4 336.3  3423766.0  6155676.0    3  smb2      file_state         change_notify
118.4   11K   15K      156.3       83.2    3  smb2      file_state                 close
11.5  1.3K 831.1      133.8      127.7    3  smb2      file_state                  lock
1.4 141.2 141.2      149.2       61.5    3  smb2      file_state          oplock_break
1.7 299.2  67.4     1916.0      800.9    1  smb1  namespace_read      trans2:findfirst
1.6 241.1 161.7     1707.0      531.9    1  smb1  namespace_read      trans2:qpathinfo
98.2   10K   58K      651.4     1182.7    1  smb2  namespace_read       query_directory
12.7  1.4K  1.1K      924.5     1640.6    1  smb2  namespace_read            query_info
13.2  1.4K  115K      936.7     1717.0    2  smb2  namespace_read       query_directory
5.8 626.3 507.3      246.4      348.3    2  smb2  namespace_read            query_info
121.3   12K  274K     9975.6    64047.6    3  smb2  namespace_read       query_directory
56.4  6.1K  5.6K      363.1      444.9    3  smb2  namespace_read            query_info
18.4  2.0K  1.3K      840.3     1904.4    1  smb2 namespace_write              set_info
5.2 556.1 364.1      128.0       40.0    2  smb2 namespace_write              set_info
5.2  1.6K 364.4     4016.6     6241.8    3  smb2 namespace_write              set_info
0.6  45.7   0.0       24.0        1.7    1  smb2           other                cancel
0.6  58.4  45.7     1944.7      503.8    1  smb2           other                 flush
0.8  57.6   0.0       35.2       11.8    2  smb2           other                cancel
1.6 114.1   0.0       30.4        2.6    3  smb2           other                cancel
2.4K  285K  524K    15549.1    11171.1    1  smb2            read                  read
180.8   21K  9.4M      346.8      112.4    2  smb2            read                  read
62.7  7.3K  1.0M      514.0     1628.2    3  smb2            read                  read
0.4  56.0  82.8      100.0       39.6    3  smb1   session_state               negprot
0.7  73.3 159.6       64.7       10.0    3  smb2   session_state             negotiate
0.7  1.8K 162.5   110202.7     8156.3    3  smb2   session_state         session_setup
0.7  96.4  57.0      782.3      214.5    3  smb2   session_state          tree_connect
820.6   54M   69K     2023.7     2729.1    1  smb2           write                 write
170.2   11M   14K     3056.0     3288.4    2  smb2           write                 write
145.3  9.5M   12K      763.7      406.8    3  smb2           write                 write
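
Instead of eyeballing a table like this, you can also filter on the TimeAvg column to flag slow operations automatically. A small sketch run as a single sample, where 10 ms is an arbitrary threshold and long-lived requests like change_notify will always exceed it by design:

# Print only the protocol operations whose average latency exceeds 10 ms
# (10000 us). Column 4 is TimeAvg in the layout shown above; the banner,
# header and unit lines evaluate to 0 and are skipped automatically.
isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --degraded | awk '$4+0 > 10000'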

The following command is great for finding the clients that are causing the load:
bgo-isilon01-1# isi statistics client --orderby=Ops --top
isi statistics: Mon Dec  8 15:10:49 2014
————————————————————
Ops         In        Out    TimeAvg       Node      Proto            Class   UserName        LocalName       RemoteName
N/s        B/s        B/s         us
2.4K       278K       842K    11973.7          1       smb2             read DOMAIN\user7    10.0.0.131 app112.domain.com
491.6        58K        14M     7045.0          2       smb2             read DOMAIN\user6    10.0.0.129 clu086.domain.com
181.2        68K        35K     1294.1          1       smb2           create    DOMAIN\user5    10.0.0.131 73602.domain.com
181.2        17K        23K     2443.8          1       smb2       file_state    DOMAIN\user5    10.0.0.131 73602.domain.com
145.4       9.5M        12K     2204.3          1       smb2            write    UNKNOWN    10.0.0.131 app112.domain.com
134.8       8.8M        11K     4910.0          2       smb2            write    UNKNOWN    10.0.0.129 clu086.domain.com
120.0        13K        35K      286.8          2       smb2   namespace_read    DOMAIN\user4   10.0.0.132 83267.domain.com
99.2        10K        49K      314.3          2       smb2   namespace_read    DOMAIN\user3    10.0.0.129 74892.domain.com
95.4        11K       5.3M      434.6          2       smb2             read DOMAIN\user2    10.0.0.129 54392.domain.com
94.8       6.0M       8.0K      886.9          2       smb2            write    UNKNOWN    10.0.0.129 54392.domain.com
90.0        34K        16K     1476.5          2       smb2           create    DOMAIN\user    10.0.0.129 69728.domain.com
88.2       5.7M       7.4K      714.4          3       smb2            write    UNKNOWN    10.0.0.133 clu017.domain.com
83.2       8.4K        56K      197.1          3       smb2   namespace_read  DOMAIN\user    10.0.0.133 68602.domain.com
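
To see which machines dominate overall rather than per row, the same output can be aggregated on the RemoteName column. A small sketch, run as a single sample, that simply counts how many of the busiest rows each remote host owns:

# Group the per-client rows by remote host (last column, RemoteName) and count
# how many rows each host owns. The awk filter keeps only rows whose last field
# looks like a hostname or IP, skipping the banner and header lines.
isi statistics client --orderby=Ops | awk '$NF ~ /\./ {print $NF}' | sort | uniq -c | sort -rn | head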
 
Disk IO
 
To look at Disk IO you can use the following command:
bgo-isilon01-1# isi statistics drive -nall -t --long --orderby=OpsOut
 
This lists all the drives on all nodes, ordered by OpsOut.
A small snippet from the output of that command:
isi statistics: Mon Dec  8 14:39:17 2014
————————————————————
Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay        N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
2:10 SATA  25.0    545K    22K   49.6     230K    4.6K     0.1  0.0    11.9    0.9 12.1 21.0   2.1M
3:11 SATA  29.0    486K    17K   46.8     230K    4.9K     0.1  0.0    15.6    1.0 10.8 21.0   2.1M
2:6 SATA  34.2    899K    26K   46.6     290K    6.2K     0.1  0.0    11.0    1.0 11.7 26.6   2.0M
3:4 SATA  15.4     65K   4.2K   45.8     651K     14K     0.1  0.0     7.7    1.0 10.4 26.6   2.0M
2:4 SATA  20.4    256K    13K   44.8     427K    9.5K     0.1  0.0    10.6    1.1 10.7 26.6   2.0M
2:3 SATA  15.6    150K   9.6K   44.0     603K     14K     0.1  0.0    10.3    1.0 12.1 26.6   2.0M
2:11 SATA  17.8    219K    12K   43.4     200K    4.6K     0.1  0.0    17.5    1.2 14.1 21.0   2.1M
3:7 SATA  19.4    128K   6.6K   41.0     202K    4.9K     0.1  0.0    21.3    1.5  7.8 21.0   2.1M
3:1 SATA  36.0    902K    25K   40.8     555K     14K     0.1  0.0     9.9    1.1 11.2 26.6   2.0M
1:6 SATA  14.2    130K   9.2K   40.2     736K     18K     0.1  0.0    17.7    1.5  6.8 26.6   2.0M
 
Here you can identify whether any of your drives are having trouble, or whether whole nodes in your cluster are not scaled for the load in your environment.
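
Since OpsOut only tells part of the story, it can also be worth re-sorting the same view by the latency-related columns. A sketch, assuming --orderby accepts the TimeInQ and Busy column names:

# Re-sort the drive statistics by time spent in queue and by busy percentage;
# a drive that sits far above its peers on these columns is a better indicator
# of trouble than raw operation counts.
# Assumes --orderby accepts the TimeInQ and Busy column names.
isi statistics drive -nall -t --long --orderby=TimeInQ
isi statistics drive -nall -t --long --orderby=Busy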