
OpenIndiana 10Gb Ethernet performance

I've really only used the Intel ixgbe 10 Gb cards with Solaris up to this point (or with Linux, for that matter). A couple of months ago I discovered the source of a long-standing problem we'd been having; I'd had some of the clues for a while, but I only put them together earlier this year.

Using a vnic or VLAN device will kill your network performance at 10 Gb speeds, even if it's not active or configured.
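To be concrete, I mean something created the way you'd normally put a VLAN on top of a data link. A sketch with stock dladm (the link and vnic names are from my setup; the VLAN id here is made up):

# dladm create-vnic -l aggr1 -v 100 nfs1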

It all comes down to the use of hardware rings on the card. If you have a vnic installed above your physical or link-aggregate interface and your card has a number of hardware rings, you will see something like this:

# dlstat show-link
        LINK  TYPE      ID  INDEX     PKTS    BYTES
       aggr1    rx   local     --        0        0
       aggr1    rx   bcast     --    1.74M  104.35M
       aggr1    rx      sw     --  118.75M   48.20G
       aggr1    tx   bcast     --    1.83M   76.95M
       aggr1    tx      hw      0   22.68M    5.74G
       aggr1    tx      hw      1   22.65M    5.77G
       aggr1    tx      hw      2  830.57K  225.81M
       aggr1    tx      hw      3  226.90K  171.57M
       aggr1    tx      hw      4    1.20M  197.86M
       aggr1    tx      hw      5   21.25M    6.41G
       aggr1    tx      hw      6  989.00K  598.06M
       aggr1    tx      hw      7  606.00K  368.05M
       aggr1    tx      hw      8    1.09M  849.04M
             [snip]
        nfs1    rx   local     --        0        0
        nfs1    rx   bcast     --  939.90K   56.39M
        nfs1    rx      sw     --  135.70G  378.82T
        nfs1    tx   bcast     --  604.81K   27.82M
        nfs1    tx      hw      0  375.13M  531.59G
        nfs1    tx      hw      1  379.57M  540.94G
        nfs1    tx      hw      2    1.25G  921.20G
        nfs1    tx      hw      3  229.59M  331.10G
        nfs1    tx      hw      4  125.08M  172.25G
        nfs1    tx      hw      5  266.42M  375.07G
        nfs1    tx      hw      6    7.21G    8.74T
        nfs1    tx      hw      7    8.24G    1.70T
        nfs1    tx      hw      8  786.11M    1.09T
             [snip]

So: I snipped some of the transmit rings from that output, because I configured so many of them in ixgbe.conf (probably way more than is required or recommended). These boxes are HP DL360 Gen8s.
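For reference, the ring counts come from /kernel/drv/ixgbe.conf. A sketch of the sort of thing I mean, using what I believe are the illumos ixgbe property names (double-check them against your driver before copying anything; the values are illustrative, not a recommendation):

# number of transmit/receive rings the driver should use
tx_queue_number=16;
rx_queue_number=16;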

Most importantly, you can see that both aggr1 and nfs1 are receiving essentially all of their traffic on a single software (sw) rx lane. nfs1 existed in order to put NFS traffic on a separate VLAN; not my idea, just something we were asked to do for security reasons.
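Incidentally, if you want to see how the card's rings are grouped and which clients they're assigned to, dladm can show you directly; a sketch, assuming the -H (hardware resource usage) option I remember from illumos:

# dladm show-phys -H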

If you look in the illumos or OpenIndiana code long enough, you can determine that the vnic is the root cause of that.
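Getting rid of it is quick; a sketch with the usual dladm commands (nfs1 being the vnic name from my setup):

# dladm show-vnic
# dladm delete-vnic nfs1

Once the vnic is gone, the same command will output something like this: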

# dlstat show-link
         LINK  TYPE      ID  INDEX     PKTS    BYTES
        aggr1    rx   local     --        0        0
        aggr1    rx   bcast     --        0        0
        aggr1    rx      hw      0   13.70M   65.01G
        aggr1    rx      hw      1   10.84M   57.96G
        aggr1    rx      hw      2   12.64M   64.92G
        aggr1    rx      hw      3    8.79M   46.62G
        aggr1    rx      hw      4    8.78M   46.61G
        aggr1    rx      hw      5   11.24M   57.29G
        aggr1    rx      hw      6   10.84M   57.96G
        aggr1    rx      hw      7   84.71M  311.88G
        aggr1    rx      hw      8   21.28M   90.69G
        aggr1    rx      hw      9   10.57M   53.57G
        aggr1    rx      hw     10    8.80M   46.62G
        aggr1    rx      hw     11   12.67M   64.92G
        aggr1    rx      hw     12   14.06M   59.05G
        aggr1    rx      hw     13    8.79M   46.62G
        aggr1    rx      hw     14    1.97M    7.71G
        aggr1    rx      hw     15   22.24M  101.22G
        aggr1    tx   bcast     --        0        0
        aggr1    tx      hw      0    7.23M  763.33M
        aggr1    tx      hw      1    6.74M    7.56G
        aggr1    tx      hw      2    1.73M  463.86M
        aggr1    tx      hw      3  420.66K   80.03M
        aggr1    tx      hw      4  392.61K   74.03M
        aggr1    tx      hw      5    5.83M  543.86M
        aggr1    tx      hw      6    5.65M    7.46G
        aggr1    tx      hw      7    7.45M   14.39G
        aggr1    tx      hw      8    7.24M    7.62G
        aggr1    tx      hw      9    6.20M    7.46G
        aggr1    tx      hw     10    1.29M  380.93M
        aggr1    tx      hw     11    2.61M  220.95M
        aggr1    tx      hw     12   53.14M   17.73G
        aggr1    tx      hw     13  876.49K  108.12M
        aggr1    tx      hw     14    6.81M    1.50G
        aggr1    tx      hw     15    6.27M    1.43G

OK, so nfs1 has been removed, and now you can see hardware rx and tx lanes/rings in use. If the box is also configured correctly, so that packets continue to be processed on the CPU that initially serviced the interrupt (ip:ip_squeue_fanout? ip:tcp_squeue_wput = 2?), you can see a radical increase in performance.
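For reference, both of those are /etc/system settings; a sketch using the tunable names I've seen in Solaris/illumos tuning guides (verify them against your release, and remember they only take effect after a reboot; the right values depend on your workload):

* TCP/IP squeue tuning for 10 Gb interfaces
set ip:ip_squeue_fanout=1
set ip:tcp_squeue_wput=2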

Here is the box before the change (and this is an unexpectedly good run):

# iperf -s -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (default)
------------------------------------------------------------
[  4] local x.x.x.x port 5001 connected with x.x.x.x port 53698
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 1.0 sec   353 MBytes  2.96 Gbits/sec
[  4]  1.0- 2.0 sec   499 MBytes  4.19 Gbits/sec
[  4]  2.0- 3.0 sec   500 MBytes  4.19 Gbits/sec
[  4]  3.0- 4.0 sec   504 MBytes  4.23 Gbits/sec
[  4]  4.0- 5.0 sec   500 MBytes  4.20 Gbits/sec
[  4]  5.0- 6.0 sec   500 MBytes  4.19 Gbits/sec
[  4]  6.0- 7.0 sec   487 MBytes  4.08 Gbits/sec
[  4]  7.0- 8.0 sec   492 MBytes  4.13 Gbits/sec
[  4]  8.0- 9.0 sec   498 MBytes  4.17 Gbits/sec
[  4]  9.0-10.0 sec   502 MBytes  4.21 Gbits/sec
[  4]  0.0-10.0 sec  4.72 GBytes  4.05 Gbits/sec
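The sending side of these runs was nothing exotic; just a single plain stream from the client, something like:

# iperf -c x.x.x.x -t 10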

And here is the after:

[  5] local x.x.x.x port 5001 connected with x.x.x.x port 58292
[  5]  0.0- 1.0 sec   888 MBytes  7.45 Gbits/sec
[  5]  1.0- 2.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  2.0- 3.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  3.0- 4.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  4.0- 5.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  5.0- 6.0 sec  1.09 GBytes  9.39 Gbits/sec
[  5]  6.0- 7.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  7.0- 8.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  8.0- 9.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  9.0-10.0 sec  1.07 GBytes  9.17 Gbits/sec
[  5]  0.0-10.0 sec  10.7 GBytes  9.14 Gbits/sec

It's not always quite that fast for a single stream, which I suspect has more to do with which CPUs are handling the interrupts; I have a bit more tuning to do on this test cluster. With the LACP aggregate (2 x 10 Gb ixgbe 82599, built into the motherboard), I have seen 18.9 Gb/s with multiple iperf connections.
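Those multi-stream numbers are easy to reproduce with iperf's parallel-client flag; a sketch (the stream count is arbitrary, but you need more than one stream for the LACP hash to spread traffic across both links):

# iperf -c x.x.x.x -P 8 -t 10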

In case someone is curious, I'm using OpenIndiana Hipster on HP DL360 Gen8s: 24 cores, 128 GB of RAM, two onboard 82599 ixgbe Ethernet ports, and two Emulex boards with 8 Gb optics for the Fibre Channel attachment to Brocade switches. The storage is driven by an EMC VPLEX (VS2), but we do use some direct attachment to flash arrays for ZILs (to get the absolute minimum latency). The clustering I'll detail in another post.
