Up to this point, I've really only used the Intel ixgbe 10 Gb cards with Solaris (or Linux, for that matter). A couple of months ago I discovered the source of a long-standing problem we were having; I'd had some of the clues for a while, but I only put it all together earlier this year.
Using a vnic or vlan device will kill your network performance at 10 Gb speeds, even if the vnic itself is not active or configured.
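A quick way to see whether anything is layered above your links in the first place is dladm:

# dladm show-link
# dladm show-vnic
# dladm show-vlan

show-link lists every data link and its class (phys, aggr, vnic, vlan, and so on), while show-vnic and show-vlan list only the layered devices, so empty output from those two means you're in the clear.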
It all comes down to the use of hardware rings on the card. If you have a vnic installed above your physical or link-aggregate interface and your card has a number of hardware rings, you will see something like this:
# dlstat show-link
LINK    TYPE  ID     INDEX       PKTS      BYTES
aggr1   rx    local     --          0          0
aggr1   rx    bcast     --      1.74M    104.35M
aggr1   rx    sw        --    118.75M     48.20G
aggr1   tx    bcast     --      1.83M     76.95M
aggr1   tx    hw         0     22.68M      5.74G
aggr1   tx    hw         1     22.65M      5.77G
aggr1   tx    hw         2    830.57K    225.81M
aggr1   tx    hw         3    226.90K    171.57M
aggr1   tx    hw         4      1.20M    197.86M
aggr1   tx    hw         5     21.25M      6.41G
aggr1   tx    hw         6    989.00K    598.06M
aggr1   tx    hw         7    606.00K    368.05M
aggr1   tx    hw         8      1.09M    849.04M
[snip]
nfs1    rx    local     --          0          0
nfs1    rx    bcast     --    939.90K     56.39M
nfs1    rx    sw        --    135.70G    378.82T
nfs1    tx    bcast     --    604.81K     27.82M
nfs1    tx    hw         0    375.13M    531.59G
nfs1    tx    hw         1    379.57M    540.94G
nfs1    tx    hw         2      1.25G    921.20G
nfs1    tx    hw         3    229.59M    331.10G
nfs1    tx    hw         4    125.08M    172.25G
nfs1    tx    hw         5    266.42M    375.07G
nfs1    tx    hw         6      7.21G      8.74T
nfs1    tx    hw         7      8.24G      1.70T
nfs1    tx    hw         8    786.11M      1.09T
[snip]
You can see that I snipped some of the transmit rings from the output; I configured a lot of them in ixgbe.conf (probably way more than is required or recommended). These boxes are HP DL360 Gen8s.
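For reference, the ring counts come from driver properties in /kernel/drv/ixgbe.conf. A minimal sketch, assuming the rx_queue_number/tx_queue_number property names from the illumos ixgbe driver (check ixgbe(7D) on your build; the values here are illustrative, not a recommendation):

# /kernel/drv/ixgbe.conf
# Number of receive and transmit rings per port
rx_queue_number = 16;
tx_queue_number = 16;

The driver only reads this file at attach time, so in practice a reboot is needed for changes to take effect.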
Most importantly, you can see that all receive (rx) traffic for both aggr1 and nfs1 lands on a single software (sw) lane; no rx hardware rings are in use at all. nfs1 existed to put NFS traffic on a separate VLAN. Not my idea, just something we were asked to do for security reasons.
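For context, a vnic like nfs1 gets created over the aggregate with a VLAN tag, and removed the same way; a sketch, with a made-up VLAN id (plus whatever ipadm work puts addresses on it):

# dladm create-vnic -l aggr1 -v 100 nfs1
# dladm delete-vnic nfs1

As it turns out, the delete-vnic is the important part here.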
If you dig through the illumos or OpenIndiana code long enough, you can determine that the vnic is the root cause. Once it's removed, the same command will output something like this:
# dlstat show-link
LINK    TYPE  ID     INDEX       PKTS      BYTES
aggr1   rx    local     --          0          0
aggr1   rx    bcast     --          0          0
aggr1   rx    hw         0     13.70M     65.01G
aggr1   rx    hw         1     10.84M     57.96G
aggr1   rx    hw         2     12.64M     64.92G
aggr1   rx    hw         3      8.79M     46.62G
aggr1   rx    hw         4      8.78M     46.61G
aggr1   rx    hw         5     11.24M     57.29G
aggr1   rx    hw         6     10.84M     57.96G
aggr1   rx    hw         7     84.71M    311.88G
aggr1   rx    hw         8     21.28M     90.69G
aggr1   rx    hw         9     10.57M     53.57G
aggr1   rx    hw        10      8.80M     46.62G
aggr1   rx    hw        11     12.67M     64.92G
aggr1   rx    hw        12     14.06M     59.05G
aggr1   rx    hw        13      8.79M     46.62G
aggr1   rx    hw        14      1.97M      7.71G
aggr1   rx    hw        15     22.24M    101.22G
aggr1   tx    bcast     --          0          0
aggr1   tx    hw         0      7.23M    763.33M
aggr1   tx    hw         1      6.74M      7.56G
aggr1   tx    hw         2      1.73M    463.86M
aggr1   tx    hw         3    420.66K     80.03M
aggr1   tx    hw         4    392.61K     74.03M
aggr1   tx    hw         5      5.83M    543.86M
aggr1   tx    hw         6      5.65M      7.46G
aggr1   tx    hw         7      7.45M     14.39G
aggr1   tx    hw         8      7.24M      7.62G
aggr1   tx    hw         9      6.20M      7.46G
aggr1   tx    hw        10      1.29M    380.93M
aggr1   tx    hw        11      2.61M    220.95M
aggr1   tx    hw        12     53.14M     17.73G
aggr1   tx    hw        13    876.49K    108.12M
aggr1   tx    hw        14      6.81M      1.50G
aggr1   tx    hw        15      6.27M      1.43G
OK, so nfs1 has been removed, and now both rx and tx hardware lanes/rings are visible and in use. If the box is also configured correctly, so that packets continue to be processed on the CPU that initially serviced the interrupt (ip:ip_squeue_fanout? ip:tcp_squeue_wput = 2?), you can see a radical increase in performance.
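For what it's worth, those squeue tunables go in /etc/system. A sketch, assuming the variable names above are right for the illumos ip module (verify against the source for your build before setting them; a reboot is needed to apply):

* Candidate squeue tunables for the fanout behavior discussed above;
* semantics and defaults vary by release, so check the ip module first.
set ip:ip_squeue_fanout = 1
set ip:tcp_squeue_wput = 2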
Here is the box before the change (and this is an unexpectedly good run):
# iperf -s -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (default)
------------------------------------------------------------
[  4] local x.x.x.x port 5001 connected with x.x.x.x port 53698
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 1.0 sec   353 MBytes  2.96 Gbits/sec
[  4]  1.0- 2.0 sec   499 MBytes  4.19 Gbits/sec
[  4]  2.0- 3.0 sec   500 MBytes  4.19 Gbits/sec
[  4]  3.0- 4.0 sec   504 MBytes  4.23 Gbits/sec
[  4]  4.0- 5.0 sec   500 MBytes  4.20 Gbits/sec
[  4]  5.0- 6.0 sec   500 MBytes  4.19 Gbits/sec
[  4]  6.0- 7.0 sec   487 MBytes  4.08 Gbits/sec
[  4]  7.0- 8.0 sec   492 MBytes  4.13 Gbits/sec
[  4]  8.0- 9.0 sec   498 MBytes  4.17 Gbits/sec
[  4]  9.0-10.0 sec   502 MBytes  4.21 Gbits/sec
[  4]  0.0-10.0 sec  4.72 GBytes  4.05 Gbits/sec
And here is the box after the change:
[  5] local x.x.x.x port 5001 connected with x.x.x.x port 58292
[  5]  0.0- 1.0 sec   888 MBytes  7.45 Gbits/sec
[  5]  1.0- 2.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  2.0- 3.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  3.0- 4.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  4.0- 5.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  5.0- 6.0 sec  1.09 GBytes  9.39 Gbits/sec
[  5]  6.0- 7.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  7.0- 8.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  8.0- 9.0 sec  1.09 GBytes  9.40 Gbits/sec
[  5]  9.0-10.0 sec  1.07 GBytes  9.17 Gbits/sec
[  5]  0.0-10.0 sec  10.7 GBytes  9.14 Gbits/sec
It's not always quite that fast for a single stream, which I suspect has more to do with which CPUs are handling the interrupts; I have a bit more tuning to do on this test cluster. Across the LACP aggregate (2x 10 Gb ixgbe 82599, built into the motherboard), I have seen 18.9 Gb/s with multiple iperf connections.
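The multi-connection numbers come from running the client side with parallel streams, something like this (the host name is a placeholder):

# iperf -c nfs-server -P 8 -i 1 -t 30

You need multiple streams to exceed 10 Gb/s here because LACP hashes each flow onto a single physical port, so one stream can only ever use one of the two links.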
In case someone is curious, I'm running OpenIndiana Hipster on HP DL360 Gen8s: 24 cores, 128 GB of RAM, two onboard 82599 ixgbe Ethernet ports, and two Emulex boards with 8 Gb optics for the Fibre Channel attachment to Brocade switches. The storage is driven by an EMC VPLEX (VS2), but we do use some direct attachment to flash arrays for ZILs (to get the absolute minimum latency). The clustering I'll detail in another post.