So where do you look when you have a performance issue on your Isilon?
There are three main areas to check in order to narrow down where you should put your resources.
I’m going to cover the following areas and show some examples from our system.
General overview
To get a general overview of where to begin the hunt for our performance demon, this command gives a good indication of where to start.
bgo-isilon01-1# isi statistics pstat
___________________________NFS3 Operations Per Second__________________________
null 0.00/s getattr 0.35/s setattr 0.00/s
lookup 0.00/s access 0.00/s readlink 0.00/s
read 0.00/s write 0.00/s create 0.00/s
mkdir 0.00/s symlink 0.00/s mknod 0.00/s
remove 0.00/s rmdir 0.00/s rename 0.00/s
link 0.00/s readdir 0.00/s readdirplus 0.00/s
statfs 0.00/s fsinfo 0.00/s pathconf 0.00/s
commit 0.00/s noop 0.00/s
TOTAL 0.35/s
___CPU Utilization___ _____OneFS Stats_____
user 4.8% In 90.63 MB/s
system 20.9% Out 14.48 MB/s
idle 77.1% Total 105.11 MB/s
____Network Input____ ____Network Output___ _______Disk I/O______
MB/s 79.94 MB/s 107.76 Disk 9979.20 iops
Pkt/s 63373.50 Pkt/s 70165.20 Read 25.98 MB/s
Errors/s 2.07 Errors/s 0.00 Write 163.05 MB/s
The interesting part is the lower half, which gives you a performance overview.
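The NFS3 table at the top is nearly idle here because our workload is mostly SMB. On our cluster, pstat can be pointed at another protocol and sampled over time; the options below are what I would expect to work, so verify them with isi statistics pstat --help on your OneFS version first:
bgo-isilon01-1# isi statistics pstat --protocol=smb2 --interval 5 --repeat 12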
CPU Resources
After reading a lot of articles on the EMC Support Community, I've summed up some of the experiences discussed there.
High CPU usage is not always a problem, but it can be.
To get an overview of CPU usage across the whole cluster, you can use the following command:
bgo-isilon01-1# isi statistics system --nodes --top
isi statistics: Mon Dec 8 14:36:05 2014
------------------------------------------------------------
Node CPU SMB FTP HTTP ISCSI NFS HDFS Total NetIn NetOut DiskIn DiskOut
LNN %Used B/s B/s B/s B/s B/s B/s B/s B/s B/s B/s B/s
All 15.4 11M 0.0 918.3 0.0 2.7M 0.0 14M 13M 18M 62M 31M
1 22.2 596K 0.0 918.3 0.0 25.5 0.0 597K 1.5M 818K 5.8M 815K
2 16.1 2.0M 0.0 0.0 0.0 2.7M 0.0 4.7M 3.8M 16M 3.8M 1.7M
3 31.8 8.7M 0.0 0.0 0.0 0.0 0.0 8.7M 7.8M 1.6M 7.4M 1.2M
4 7.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14M 6.9M
5 9.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11M 7.1M
6 9.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10M 6.8M
7 10.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9.9M 6.6M
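If you would rather capture a few samples than sit and watch the interactive --top view, the --interval, --repeat and --degraded options used with the other isi statistics subcommands in this post should, as far as I know, work here as well (check isi statistics system --help if in doubt):
bgo-isilon01-1# isi statistics system --nodes --interval 5 --repeat 12 --degraded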
From there you can dig deeper into the node in question. For example, the following is from my node 1:
bgo-isilon01-1# ps -fwulp `pgrep lwio`
bgo-isilon01-1: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND UID PPID CPU PRI NI MWCHAN
bgo-isilon01-1: root 13355 38.3 2.7 482600 336012 ?? S 24Oct14 24257:14.77 lw-container lwi 0 69387 0 96 0 ucond
This shows that there is significant CPU use on this one node. If you check the threads, which you can do by adding the -H option,
you will see that in this case the workload is spread evenly across the threads.
bgo-isilon01-1# ps -fwulHp `pgrep lwio`
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND UID PPID CPU PRI NI MWCHAN
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 864:03.41 lw-container lwi 0 69387 0 4 0 kqread
root 13355 0.0 2.7 482600 336012 ?? I 24Oct14 0:00.03 lw-container lwi 0 69387 0 20 0 sigwait
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 292:37.04 lw-container lwi 0 69387 0 4 0 kqread
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1014:51.42 lw-container lwi 0 69387 0 4 0 kqread
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 167:07.17 lw-container lwi 0 69387 0 4 0 kqread
root 13355 0.0 2.7 482600 336012 ?? I 24Oct14 0:00.11 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 0:30.21 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 74:34.93 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:15.84 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:34.30 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1363:41.60 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:59.26 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:10.67 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1367:52.95 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 0:28.46 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:23.53 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:52.03 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:56.50 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1363:53.73 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:06.38 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:52.35 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:25.73 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1364:50.94 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1366:15.88 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 1365:23.59 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 0:28.71 lw-container lwi 0 69387 0 96 0 ucond
root 13355 0.0 2.7 482600 336012 ?? S 24Oct14 0:30.61 lw-container lwi 0 69387 0 96 0 ucond
If you have one thread running at 100% you are looking at a potential problem.
A quote from Tim Wright at EMC:
“It is entirely normal and expected to see multiple threads consuming 15, 20, 25% cpu at times. *If* you see one or more threads that are consistently and constantly consuming 100% cpu, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming >100% cpu, that is not likely to be a problem. Certain operations including auth can be somewhat cpu-intensive. Imagine hundreds of users connecting to the cluster in the morning.”
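To run the same thread check on every node at once instead of ssh'ing around, isi_for_array can wrap the ps command; this is just a sketch of the idea (the -s flag sorts the output per node):
bgo-isilon01-1# isi_for_array -s 'ps -fwulHp `pgrep lwio`'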
Network and Protocol Performance
Here we are going to check the physical links first and then go on to check the protocols.
Your main command to check interfaces for Ierrs and Oerrs is netstat.
It is possible to use the isi_for_array command to run this on all nodes at once, but that output is a bit too messy to include here (a sketch of it follows further down).
If you run netstat -i on one of the nodes, it looks pretty much like this:
bgo-isilon01-1# netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
cxgb0 1500 <Link#1> 00:07:43:09:90:bc 31448503823 25069 46349885580 0 0
cxgb0 1500 10.0.0.0 10.0.0.131 31448503823 - 46349885580 - -
cxgb0 1500 10.0.0.0 10.0.0.123 31448503823 - 46349885580 - -
cxgb0 1500 10.0.0.0 10.0.0.126 31448503823 - 46349885580 - -
cxgb1 1500 <Link#2> 00:07:43:09:90:bd 58378279537 44629 79115225564 0 0
cxgb1 1500 10.0.0.0 10.0.0.128 58378279537 - 79115225564 - -
cxgb1 1500 10.0.0.0 10.0.0.121 58378279537 - 79115225564 - -
cxgb1 1500 10.0.0.0 10.0.0.112 58378279537 - 79115225564 - -
em0 1500 <Link#3> 00:25:90:98:ee:82 8995333 0 0 0 0
em1 1500 <Link#4> 00:25:90:98:ee:83 8995333 0 0 0 0
lo0 16384 <Link#5> 39779509 0 39778192 0 0
lo0 16384 your-net localhost 39779509 - 39778192 - -
lo0 16384 128.221.254.0 bgo-isilon01-1 39779509 - 39778192 - -
ib0 2004 <Link#6> 00:15:1b:00:10:82:02:52 296657289 0 302119159 0 0
ib0 2004 128.221.253.0 128.221.253.1 296657289 - 302119159 - -
ib1 2004 <Link#7> 00:15:1b:00:10:82:02:53 325413619 0 327997270 0 0
ib1 2004 128.221.252.0 128.221.252.1 325413619 - 327997270 - -
As we can see here, we actually have a network issue on this node, on both of our interfaces:
cxgb0 (10gige-2) and cxgb1 (10gige-1).
It is not possible to reset these counters; the only way of doing so is an ifconfig cxgb0 down ; ifconfig cxgb0 up,
and that is not recommended.
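To check the error counters on all nodes in one go, the isi_for_array approach mentioned above can be used here too. A rough sketch, where the egrep pattern simply keeps the header line and the link-level rows for the 10 GbE interfaces:
bgo-isilon01-1# isi_for_array -s 'netstat -i | egrep "Ierrs|cxgb.*Link"'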
At this point our chase for the error is actually finished on the Isilon side. But it is a good idea to get an overview of the protocol performance counters at the same time.
First off, get an overview of the active clients on our system:
bgo-isilon01-1# isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded
NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2
1 540 3 37
2 697 1 27
3 928 0 25
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
average 309 1 13
This will repeat 12 times with a 5 second interval.
Since we only have 3 access nodes, the clients are distributed across nodes 1 to 3 only.
If we run this command:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded
Ops In Out TimeAvg TimeStdDev Node Proto Class Op
N/s B/s B/s us us
6.9 880.2 1.7K 791.8 1102.1 1 smb1 * *
369.6 74K 71K 31426.6 731763.0 1 smb2 * *
0.2 29.7 43.9 63.0 0.0 2 smb1 * *
2.0K 374K 3.0M 2207.0 83113.5 2 smb2 * *
0.2 34.8 51.5 240.0 0.0 3 smb1 * *
686.5 2.8M 1.0M 4543.0 40379.7 3 smb2 * *
You get to see the network bandwidth that each protocol is using on each node, as well as the average time spent per operation (TimeAvg).
The next place to look is:
bgo-isilon01-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded
Ops In Out TimeAvg TimeStdDev Node Proto Class Op
N/s B/s B/s us us
192.0 70K 37K 2128.4 1693.9 1 smb2 create create
31.0 12K 6.1K 8277.1 45135.5 2 smb2 create create
158.9 68K 32K 1727.6 4361.0 3 smb2 create create
0.3 30.4 0.0 86.5 7.8 1 smb1 file_state nttrans:notify_change
4.7 463.9 362.8 27473836.0 62147152.0 1 smb2 file_state change_notify
185.0 17K 24K 2418.5 6679.5 1 smb2 file_state close
0.2 22.0 22.0 100.0 0.0 1 smb2 file_state oplock_break
4.0 400.1 310.1 6400240.5 9013317.0 2 smb2 file_state change_notify
28.8 2.7K 3.7K 887.8 5004.6 2 smb2 file_state close
15.6 1.8K 1.1K 146.1 90.3 2 smb2 file_state lock
0.2 20.8 20.8 125.0 0.0 2 smb2 file_state oplock_break
4.1 407.4 336.3 3423766.0 6155676.0 3 smb2 file_state change_notify
118.4 11K 15K 156.3 83.2 3 smb2 file_state close
11.5 1.3K 831.1 133.8 127.7 3 smb2 file_state lock
1.4 141.2 141.2 149.2 61.5 3 smb2 file_state oplock_break
1.7 299.2 67.4 1916.0 800.9 1 smb1 namespace_read trans2:findfirst
1.6 241.1 161.7 1707.0 531.9 1 smb1 namespace_read trans2:qpathinfo
98.2 10K 58K 651.4 1182.7 1 smb2 namespace_read query_directory
12.7 1.4K 1.1K 924.5 1640.6 1 smb2 namespace_read query_info
13.2 1.4K 115K 936.7 1717.0 2 smb2 namespace_read query_directory
5.8 626.3 507.3 246.4 348.3 2 smb2 namespace_read query_info
121.3 12K 274K 9975.6 64047.6 3 smb2 namespace_read query_directory
56.4 6.1K 5.6K 363.1 444.9 3 smb2 namespace_read query_info
18.4 2.0K 1.3K 840.3 1904.4 1 smb2 namespace_write set_info
5.2 556.1 364.1 128.0 40.0 2 smb2 namespace_write set_info
5.2 1.6K 364.4 4016.6 6241.8 3 smb2 namespace_write set_info
0.6 45.7 0.0 24.0 1.7 1 smb2 other cancel
0.6 58.4 45.7 1944.7 503.8 1 smb2 other flush
0.8 57.6 0.0 35.2 11.8 2 smb2 other cancel
1.6 114.1 0.0 30.4 2.6 3 smb2 other cancel
2.4K 285K 524K 15549.1 11171.1 1 smb2 read read
180.8 21K 9.4M 346.8 112.4 2 smb2 read read
62.7 7.3K 1.0M 514.0 1628.2 3 smb2 read read
0.4 56.0 82.8 100.0 39.6 3 smb1 session_state negprot
0.7 73.3 159.6 64.7 10.0 3 smb2 session_state negotiate
0.7 1.8K 162.5 110202.7 8156.3 3 smb2 session_state session_setup
0.7 96.4 57.0 782.3 214.5 3 smb2 session_state tree_connect
820.6 54M 69K 2023.7 2729.1 1 smb2 write write
170.2 11M 14K 3056.0 3288.4 2 smb2 write write
145.3 9.5M 12K 763.7 406.8 3 smb2 write write
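Once a class or operation stands out, it can be useful to narrow the view to just that class on the suspect node. I believe isi statistics protocol accepts a --classes option (verify with --help on your version), so something along these lines should work:
bgo-isilon01-1# isi statistics protocol --nodes=1 --protocols=smb2 --classes=write --interval 5 --repeat 12 --degraded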
The following command is great for finding the clients that are causing the load:
bgo-isilon01-1# isi statistics client --orderby=Ops --top
isi statistics: Mon Dec 8 15:10:49 2014
------------------------------------------------------------
Ops In Out TimeAvg Node Proto Class UserName LocalName RemoteName
N/s B/s B/s us
2.4K 278K 842K 11973.7 1 smb2 read DOMAIN\user7 10.0.0.131 app112.domain.com
491.6 58K 14M 7045.0 2 smb2 read DOMAIN\user6 10.0.0.129 clu086.domain.com
181.2 68K 35K 1294.1 1 smb2 create DOMAIN\user5 10.0.0.131 73602.domain.com
181.2 17K 23K 2443.8 1 smb2 file_state DOMAIN\user5 10.0.0.131 73602.domain.com
145.4 9.5M 12K 2204.3 1 smb2 write UNKNOWN 10.0.0.131 app112.domain.com
134.8 8.8M 11K 4910.0 2 smb2 write UNKNOWN 10.0.0.129 clu086.domain.com
120.0 13K 35K 286.8 2 smb2 namespace_read DOMAIN\user4 10.0.0.132 83267.domain.com
99.2 10K 49K 314.3 2 smb2 namespace_read DOMAIN\user3 10.0.0.129 74892.domain.com
95.4 11K 5.3M 434.6 2 smb2 read DOMAIN\user2 10.0.0.129 54392.domain.com
94.8 6.0M 8.0K 886.9 2 smb2 write UNKNOWN 10.0.0.129 54392.domain.com
90.0 34K 16K 1476.5 2 smb2 create DOMAIN\user 10.0.0.129 69728.domain.com
88.2 5.7M 7.4K 714.4 3 smb2 write UNKNOWN 10.0.0.133 clu017.domain.com
83.2 8.4K 56K 197.1 3 smb2 namespace_read DOMAIN\user 10.0.0.133 68602.domain.com
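The same command can be sorted on latency instead of operation count, which is handy when the complaint is one slow client rather than overall load. This assumes --orderby accepts any of the displayed column names, as it does for the drive statistics below:
bgo-isilon01-1# isi statistics client --orderby=TimeAvg --top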
Disk IO
To look at disk I/O, you have the following command:
bgo-isilon01-1# isi statistics drive -nall -t --long --orderby=OpsOut
This lists all the drives on all nodes, ordered by OpsOut.
A small snippet from the output of that command:
isi statistics: Mon Dec 8 14:39:17 2014
------------------------------------------------------------
Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay N/s B/s B N/s B/s B ms N/s ms % %
2:10 SATA 25.0 545K 22K 49.6 230K 4.6K 0.1 0.0 11.9 0.9 12.1 21.0 2.1M
3:11 SATA 29.0 486K 17K 46.8 230K 4.9K 0.1 0.0 15.6 1.0 10.8 21.0 2.1M
2:6 SATA 34.2 899K 26K 46.6 290K 6.2K 0.1 0.0 11.0 1.0 11.7 26.6 2.0M
3:4 SATA 15.4 65K 4.2K 45.8 651K 14K 0.1 0.0 7.7 1.0 10.4 26.6 2.0M
2:4 SATA 20.4 256K 13K 44.8 427K 9.5K 0.1 0.0 10.6 1.1 10.7 26.6 2.0M
2:3 SATA 15.6 150K 9.6K 44.0 603K 14K 0.1 0.0 10.3 1.0 12.1 26.6 2.0M
2:11 SATA 17.8 219K 12K 43.4 200K 4.6K 0.1 0.0 17.5 1.2 14.1 21.0 2.1M
3:7 SATA 19.4 128K 6.6K 41.0 202K 4.9K 0.1 0.0 21.3 1.5 7.8 21.0 2.1M
3:1 SATA 36.0 902K 25K 40.8 555K 14K 0.1 0.0 9.9 1.1 11.2 26.6 2.0M
1:6 SATA 14.2 130K 9.2K 40.2 736K 18K 0.1 0.0 17.7 1.5 6.8 26.6 2.0M
Here you can identify whether any of your drives are having trouble, or whether whole nodes in your cluster are not scaled for the load in your environment.
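If you suspect the spindles are queueing up rather than just being busy, ordering by the TimeInQ or Busy column instead of OpsOut should make struggling drives stand out (same assumption as above, that --orderby takes the column names shown in the output):
bgo-isilon01-1# isi statistics drive -nall -t --long --orderby=TimeInQ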
Sources: