Apr 12 2014
 

Recently I decided it was time to beef up my storage link between my demonstration vSphere environment and my storage system. My existing setup included a single HP DL360p Gen8, connected to a Synology DS1813+ via NFS.

I went out and purchased the appropriate (and compatible) HP quad-port 1Gb server NIC (Broadcom based), and connected the Synology device directly to the new NIC (all 4 ports). I configured an iSCSI target using a File LUN with ALUA (advanced LUN features), set up the NICs on both the vSphere side and the Synology side, and enabled jumbo frames with a 9000 byte MTU.
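
For reference, the ESXi end of that networking can also be done from the command line. This is just a rough sketch of what I describe above; the vSwitch, vmkernel port, and software iSCSI adapter names (vSwitch1, vmk1/vmk2, vmhba33) are placeholders and will differ in your environment:

    # Set a 9000 byte MTU on the iSCSI vSwitch and on each iSCSI vmkernel port
    # (vSwitch1, vmk1, and vmk2 are placeholders for your own names)
    esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
    esxcli network ip interface set --interface-name=vmk1 --mtu=9000
    esxcli network ip interface set --interface-name=vmk2 --mtu=9000

    # Bind the iSCSI vmkernel ports to the software iSCSI adapter for multipathing
    # (vmhba33 is a placeholder for your software iSCSI adapter)
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2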

I connected to the iSCSI LUN and created a VMFS volume. I then configured Round Robin MPIO on the vSphere side of things (as always, I made sure to enable “Multiple iSCSI initiators” on the Synology side).
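
If you'd rather do the Round Robin piece from the ESXi command line instead of the client, it looks something like this (the naa identifier is a placeholder for the Synology LUN):

    # List devices and confirm all paths to the Synology LUN are active
    esxcli storage nmp device list

    # Set the path selection policy for the LUN to Round Robin
    # (substitute your own LUN's naa identifier)
    esxcli storage nmp device set --device=naa.xxxxxxxxxxxxxxxx --psp=VMW_PSP_RR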

I started to migrate some VMs over to the iSCSI LUN. At first I noticed it was going extremely slowly. I confirmed that traffic was being passed across all NICs (and also verified that all paths were active). After the migration completed, I decided to shut down the VMs and restart them to compare boot times. Booting from the iSCSI LUN was absolutely horrible; the VMs took forever to boot up. Keep in mind I’m very familiar with vSphere (my company is a VMware partner), so I know how to properly configure Round Robin, iSCSI, and MPIO.

I then decided to tweak some settings on the ESXi side of things. I configured the Round Robin policy to IOPS=1, which helped a bit. I then changed the RR policy to bytes=8800, which (after numerous other tweaks) I determined achieved the highest performance to the storage system over iSCSI.
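
For anyone wanting to try the same tweaks, these are roughly the commands I’m talking about (again, substitute your own LUN’s naa identifier):

    # Rotate to the next path after every single I/O
    esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=1

    # Or rotate paths every 8800 bytes (the setting that ended up working best for me)
    esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=bytes --bytes=8800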

This config was used for a couple of weeks, but ultimately I was very unsatisfied with the performance. I know it’s not very accurate, but looking at the Synology resource monitor, each gigabit link over iSCSI was only achieving 10-15MB/sec under high load (single contiguous copies), when each link should have been capable of 100MB/sec and higher. The combined LAN throughput reported by the Synology device across all 4 gigabit links never exceeded 80MB/sec, and file transfers inside the virtual machines couldn’t get higher than 20MB/sec.
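
I was mostly going off the Synology resource monitor for these numbers, but you can sanity-check throughput on the ESXi side too, for example with esxtop:

    # Run esxtop on the ESXi host, then press 'n' for the network view
    # and watch MbTX/s and MbRX/s per vmnic while a copy is running
    esxtop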

I have a VMware vDP (VMware Data Protection) test VM configured, which includes a performance analyzer inside of the configuration interface. I decided to use this to test some specs (I’m too lazy to actually configure a real I/O and throughput test since I know I won’t be continuing to use iSCSI on the Synology with the horrible performance I’m getting). The performance analyzer tests run for 30-60 minutes, and measure reads and writes in MB/sec and seeks per second. I tested 3 different datastores.

Synology DS1813+ NFS over 1 x Gigabit link (1500 MTU):

  • Read 81.2MB/sec, Write 79.8MB/sec, 961.6 Seeks/sec

Synology DS1813+ iSCSI over 4 x Gigabit links configured in MPIO Round Robin BYTES=8800 (9000 MTU):

  • Read 36.9MB/sec, Write 41.1MB/sec, 399.0 Seeks/sec

Custom-built 8-year-old computer running Linux MD RAID 5, serving NFS over 1 x Gigabit NIC (1500 MTU):

  • Read 94.2MB/sec, Write 97.9MB/sec, 1431.7 Seeks/sec

Can someone say WTF?!?!?!?! As you can see, it appears there is a major performance hit with the DS1813+ using 4 x Gigabit MPIO iSCSI with Round Robin. It’s half the speed of a single 1 x Gigabit NFS connection. Keep in mind I purchased the extra memory module for my DS1813+, so it has 4GB of memory.

I’m kind of choked I spent the money on the extra server NIC (as it was over $500.00), and I’m also surprised that my custom-built NFS server from 8 years ago (with 4-year-old drives) and only 5 drives is performing better than my 8-drive DS1813+. All drives used in both the Synology and the custom-built NFS box are Seagate Barracuda 7200RPM drives (the custom box has 5 x 1TB drives configured in RAID 5, the Synology has 8 x 3TB drives configured in RAID 5).

I won’t be using iSCSI or iSCSI MPIO again with the DS1813+ and actually plan on retiring it as my main datastore for vSphere. I’ve finally decided to bite the bullet and purchase an HP MSA2024 (dual controller with 4 x 10Gb SFP+ ports) to provide storage for my vSphere test/demo environment. I’ll keep the Synology DS1813+ online as an NFS vDP backup datastore.

Feel free to comment and let me know how your experience with the Synology devices using iSCSI MPIO is/was. I’m curious to see if others are experiencing the same results.

UPDATE – June 6th, 2014

The other day, I finally had time to play around and do some testing. I created a new FileIO iSCSI target, connected it to my vSphere test environment, and configured Round Robin. While doing some tests on the newly created datastore, the iSCSI connections kept disconnecting, to the point where it wasn’t usable.

I scratched that, and tried something else.

I deleted the existing RAID volume, created a new RAID 5 volume, and dedicated it to a Block I/O iSCSI target. I connected it to my vSphere test environment and configured Round Robin MPIO.

At first all was going smoothly until, again, connection drops started occurring. Logging in to the DSM, absolutely no errors were being reported and everything looked fine, yet all of the iSCSI connections to the ESXi host were down.
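
On the ESXi side, this is roughly how I’d confirm the sessions were actually dropping (just the built-in tools, nothing fancy):

    # List the active iSCSI sessions on the software iSCSI adapter
    esxcli iscsi session list

    # Check the vmkernel log for recent iSCSI connection and path errors
    grep -i iscsi /var/log/vmkernel.log | tail -n 50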

I shut down the ESXi host, and then shut down and restarted the DS1813+. I waited for it to come back up, however it wouldn’t. I let it sit there and waited 2 hours for the IP to finally become pingable. I tried to connect to the web interface, however it would only load portions of the page over extended amounts of time (it took 4 hours to load the interface). Once inside, it was EXTREMELY slow. However, it was reporting that all was fine, everything was up, and the disks were fine as well.

I booted the ESXi host and tried to connect, however it couldn’t make the connection to the iSCSI targets. Finally, the Synology unit became unresponsive.

Since I only had a few test VMs loaded on the Synology device, I decided to just go ahead and do a factory reset on the unit (I noticed new firmware was available as of that day). I downloaded the firmware and started the factory reset (which, again, took forever since the web interface was crawling along).

After restarting the unit, it was not responsive. I waited a couple hours and again, the web interface finally responded but was extremely slow. It took a couple hours to get through the setup page, and a couple more hours for the unit to boot.

Something was wrong, so I restarted the unit yet again, and again, and again.

This time, the alarm light was illuminated on the unit, and one of the drive lights wouldn’t come on. Again, extreme unresponsiveness. I finally got access to the web interface and it was reporting the temperature of one of the drives as critical, but it said the drive was still functioning and all drives were OK. I shut off the unit, removed the drive, and restarted it again; all of a sudden it was extremely responsive.

I hooked the removed drive up to another computer and confirmed that it had in fact failed.
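
If you want to check a pulled drive the same way on a Linux box, smartmontools does the job; something along these lines, assuming the drive shows up as /dev/sdb (a placeholder):

    # Dump SMART health, attributes, and the error log for the suspect drive
    smartctl -a /dev/sdb

    # Kick off a long self-test, then re-run the command above to see the result
    smartctl -t long /dev/sdb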

I replaced the drive with a new one (same model) and did three tests: one with NFS, one with FileIO iSCSI, and one with BlockIO iSCSI. All of a sudden the unit was working fine, and there were absolutely NO iSCSI connection drops. I tested the iSCSI targets under load for some time, and noticed considerable performance increases with iSCSI and no connection drops.

Here are some thoughts:

  • Two possible things fixed the connection drops: either the drive was acting up all along, or the new version of DSM fixed them.
  • While FileIO performance has increased from ~50MB/sec to ~120-160MB/sec, I’m still not even close to maxing out the 4 x 1Gb interfaces.
  • I also noticed a significant performance increase with NFS, so I’m leaning towards the drive having been acting up since day one (seeks per second increased three-fold after replacing the drive and re-testing NFS). I/O wait has also been significantly reduced.
  • Why did the Synology unit just freeze up once this drive really started dying? The drive should have been marked as failed instead of taking the entire Synology unit down.
  • Why didn’t the drive get marked as failed at all? I regularly performed SMART tests and checked drive health, and there were absolutely no errors. Even when the unit was at a standstill, it still reported the drive as working fine.

Either way, the iSCSI connection drops aren’t occurring anymore, and performance with iSCSI is significantly better. However, I wish I could hit 200MB+/sec.

At this point it is usable for iSCSI using FileIO; however, I was disappointed with BlockIO performance (BlockIO should be faster, and I have no idea why it isn’t).

For now, I have an NFS datastore configured (using this for vDP backup), although I will be creating another FileIO iSCSI target and will do some more testing.

Update – August 16, 2019: Please see these additional posts regarding performance and optimization:

Apr 11 2014
 

Earlier today I was doing some work in my demonstration vSphere environment, when I had to modify some settings on one of my VMs that is set up with the latest virtual hardware version (which means you can only edit the settings inside of the vSphere Web Client).

To my surprise, immediately upon logging in I received an error: “ManagedObjectReference: type = Datastore, value = datastore-XXXX, serverGuid = XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXX refers to a managed object that no longer exists or has never existed”. Also, after clicking OK, I noticed that a lot of the information being presented inside of the vSphere Web Client was inaccurate. Some virtual machines were being reported as sitting on different datastores (they had been at one point weeks ago, but had since been moved). It was also reporting that some virtual machines were off, when in fact they were on and running.

Symptoms:

-Errors about missing datastores when logging on to the vSphere Web Client.

-Virtual Machines were being reported as powered off even though they were running.

-Viewing VMs in the vSphere Client reported them as being stored on a different datastore than they actually were.

-Disconnecting and (re)connecting hosts had no effect on the issue.

 

This freaked me out; it was a true “Uhh Ohh” moment. Something was corrupt. Keep in mind that ALL information in the vSphere Client was correct and accurate; it was only the vSphere Web Client that was having issues.

 

Anyways, I tried a bunch of things to fix it, and spent hours working on the problem. FINALLY I came up with a fix. If you are running into this issue, PLEASE take a snapshot of your vCenter Server before attempting the fix, so that you can roll back if you screw anything up (which I had to do multiple times, lol).

The Fix:

1) Stop the “VMware vCenter Inventory Service”.

2) Delete the “data” folder inside of “Program Files\VMware\Infrastructure\Inventory Service”.

3) Open a Command Prompt with elevated privileges. Change your working directory to “Program Files\VMware\Infrastructure\Inventory Service\scripts”.

4) Run “createDB.bat”; this will reset and create a new Inventory Service database.

5) Run: is-change-sso.bat https://computername.domain.com:7444/lookupservice/sdk “[email protected]” “SSO_PASSWORD”. Change computername.domain.com to the FQDN of your vCenter server, and change SSO_PASSWORD to your Single Sign-On admin password.

6) Start the “VMware vCenter Inventory Service”. At this point, if you try to log on to the vSphere Web Client, it will error with: “Client is not authenticated to VMware Inventory Service”. We’ve already won half the battle.

7) We now need to register the vCenter Server with the newly reset Inventory Service. In the elevated Command Prompt (that we opened above), change the working path to: “Program Files\VMware\Infrastructure\VirtualCenter Server\isregtool”.

8) Run: register-is.bat https://computername.domain.com:443/sdk https://computername.domain.com:10443 https://computername.domain.com:7444/lookupservice/sdk. Change computername.domain.com to the FQDN of your vCenter server.

9) Restart the “VMware VirtualCenter Server” service. This will also restart the Management Web services.
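
Putting it all together, the whole fix from an elevated Command Prompt looks something like this (computername.domain.com, the SSO account, and SSO_PASSWORD are the same placeholders from the steps above, and the paths assume a default install location on C:):

    :: Steps 1-2: stop the Inventory Service and remove its data folder
    net stop "VMware vCenter Inventory Service"
    rmdir /s /q "C:\Program Files\VMware\Infrastructure\Inventory Service\data"

    :: Steps 3-5: recreate the Inventory Service database and point it at SSO
    cd /d "C:\Program Files\VMware\Infrastructure\Inventory Service\scripts"
    createDB.bat
    is-change-sso.bat https://computername.domain.com:7444/lookupservice/sdk "[email protected]" "SSO_PASSWORD"

    :: Steps 6-8: start the service again and re-register vCenter with it
    net start "VMware vCenter Inventory Service"
    cd /d "C:\Program Files\VMware\Infrastructure\VirtualCenter Server\isregtool"
    register-is.bat https://computername.domain.com:443/sdk https://computername.domain.com:10443 https://computername.domain.com:7444/lookupservice/sdk

    :: Step 9: restart the vCenter Server service
    net stop "VMware VirtualCenter Server"
    net start "VMware VirtualCenter Server"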

 

BAM, it’s fixed! I went ahead and restarted the entire server that the vCenter Server was running on. After this, all was good, and everything looked great inside of the vSphere Web Client. I’m actually noticing it’s running WAY faster and isn’t as glitchy as it was before.

Happy Virtualizing! 🙂