Nov 17, 2015

I recently had a reader reach out to me for assistance with an issue they were having with a VMware implementation. They were experiencing slow file uploads and poor I/O performance on their Linux-based virtual machines.

Originally it was believed to be a networking issue, since the performance problems were only in one direction (when uploading/writing to storage) and weren't experienced on all virtual machines. Other behaviours noticed were slow upload speeds to the vSphere client file browser and slow physical-to-virtual (P2V) migrations.

After troubleshooting and exploring the issue with them, we found that cache was not enabled on the RAID array providing the storage for the vSphere implementation.

Please note: in virtual environments where storage sits on RAID arrays, RAID cache is a must for performance, and battery-backed RAID cache is a must for protection and data integrity. Write caching lets the controller acknowledge writes immediately, batch and optimize them as they are processed, and commit them to multiple disks at once. Observed performance increases dramatically because the ESXi hosts and virtual machines aren't waiting for each write operation to commit before proceeding to the next.
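To make that concrete, here's a rough sketch (Python, purely illustrative; the file names, block size, and write count are made up) of why waiting on every commit hurts: the first loop calls fsync() after every block, the way a write-through setup forces each write to reach stable storage before the next, while the second lets writes accumulate and flushes once, much like a controller acknowledging writes out of its cache.

```python
# Illustrative only: compares per-write commits ("write-through"-like)
# against deferred commits ("write-back"-like) using plain file I/O.
# File names, block size and block count are arbitrary assumptions.
import os
import time

BLOCK = b"\0" * (1024 * 1024)   # 1 MiB of data per write
COUNT = 256                     # 256 MiB total

def timed_write(path, flush_every_write):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.time()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
        if flush_every_write:
            os.fsync(fd)        # wait for this block to hit stable storage
    os.fsync(fd)                # final flush so both runs end fully committed
    os.close(fd)
    os.unlink(path)
    return time.time() - start

if __name__ == "__main__":
    wt = timed_write("wt_test.bin", flush_every_write=True)
    wb = timed_write("wb_test.bin", flush_every_write=False)
    print(f"per-write commit : {wt:.1f}s ({COUNT / wt:.0f} MiB/s)")
    print(f"deferred commit  : {wb:.1f}s ({COUNT / wb:.0f} MiB/s)")
```

With write-back cache enabled on the array, even workloads that insist on per-write commits stay fast, because the controller acknowledges each write as soon as it lands in (battery-backed) cache rather than when it reaches the disks.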

You'll notice that under Windows virtual machines this issue won't be observed on writes, since Windows VMs typically cache file transfers in RAM and then flush them to disk. This can give the impression that there are no storage issues at all while troubleshooting, leading one to believe the problem lies with the Linux VMs, the ESXi hosts themselves, or some odd networking issue.
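If you want to see what the underlying storage is actually doing from inside a Linux guest, take the guest's own page cache out of the picture. Here's a minimal sketch (Linux-only, illustrative; the file name and sizes are assumptions) that writes with O_DIRECT so the data goes straight to the virtual disk instead of sitting in RAM:

```python
# Linux-only sketch: write with O_DIRECT so the guest page cache can't
# mask slow storage. File name and sizes below are arbitrary assumptions.
import mmap
import os
import time

PATH = "direct_io_test.bin"        # run from a directory on the disk under test
BLOCK_SIZE = 1024 * 1024           # must be a multiple of the sector size
COUNT = 256                        # 256 MiB total

# O_DIRECT needs a sector-aligned buffer; anonymous mmap is page-aligned.
buf = mmap.mmap(-1, BLOCK_SIZE)
buf.write(b"\xab" * BLOCK_SIZE)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
start = time.time()
for _ in range(COUNT):
    os.write(fd, buf)              # bypasses the guest page cache entirely
os.close(fd)
elapsed = time.time() - start

os.unlink(PATH)
buf.close()
print(f"{COUNT} MiB written directly in {elapsed:.1f}s "
      f"({COUNT / elapsed:.0f} MiB/s)")
```

Run it from a directory on the virtual disk backed by the array in question; if the numbers come back far below what the array should deliver, the problem is in the storage path rather than the guest.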

 

Again, I cannot stress enough that you should have a battery-backed cache module, or a capacitor-backed flash module, providing the cache.

If you do enable cache without battery (or flash) backing, corruption can occur on the RAID array if there is a power failure or if the RAID controller freezes. Battery-backed cache allows the cached writes to be committed to disk on the next restart of the storage unit/controller, providing that protection.

  5 Responses to “RAID Cache considerations for Storage (VMWare, HyperV, etc…) – Quick note on performance”

  1. Stephen,

    This is exactly what I am trying to implement right now on my MSA 2040.
    I have 2 clustered Hyper-V servers with 2 hosts and a large MSSQL DB.
    The DB sits on a RAID 10 SSD array and has a lot of reads and writes performed to it over the day.

Do you think Write-Through is enough for this setup, specifically with SSDs, or should I be considering something else?
    I also have dual controllers, by default does the cache span across both controllers, or is this something I need to define?

  2. Hi Dave,

On the MSA 2040, I would specifically recommend not touching any of the cache settings. You'll also find some info at the links below on best practices for virtualization on the MSA 2040.

    There used to be a document specifically for “MSA 2040 Best Practices for Virtualization”, but I can’t find it at the moment.

    http://h20565.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=5386548&docId=emr_na-c03983029&docLocale=en_US

    http://h20195.www2.hp.com/V2/getpdf.aspx/4AA4-6892ENW.pdf?ver=Rev%206

    http://h50146.www5.hpe.com/products/storage/whitepaper/pdfs/4AA4-6892ENW.pdf

    As for your other questions:

The default on the array is "Write Back"; I'm assuming you changed it to "Write Through", which will result in a performance decrease.

By default, on an MSA 2040 with dual controllers, the cache is mirrored between both controllers, which protects against data loss in the event of a controller failure. You can change these settings, however I strongly advise NOT doing so. By default, the unit is configured so that in a dual-controller environment with multipathing, no data loss will occur due to a power failure, controller failure, cache module failure, or path failure.

Once you start changing the settings, you remove the redundancy mechanisms in place and open the door to data loss and/or corruption in the event of certain types of failures.

    I hope this helps, let me know if you have any other questions!

    Cheers,
    Stephen

  3. Stephen,
    Thanks for the response and links.

    I actually have Write-Back enabled at the moment.
I was merely curious, as write-back caching seems primarily aimed at slow platter disks, where it was meant to improve performance and efficiency.
    I also could not find any tests or benchmarks comparing the two cache settings on an enterprise SSD setup in RAID 10.
    But then again with a 4GB cache it seems like Write-Back is the logical choice.

On another note, I noticed that mappings in v3 of the interface appear as 1, 2, 3, 4, but in v2 of the interface they appear as A1, A2, B1, etc. Is this a bug? It seems v3 is a bit limited in its settings.
    I'm not sure how the numbering works in v3; what's your take on this?

    Thanks,
    Dave

  4. Hi Dave,

No worries! Always glad if I can help! 🙂 If it were mostly just read operations, it might be worthwhile to investigate changing to write-through, but since writes are being performed I'd recommend staying with the defaults.

And you're right! There were, and still are, some differences between versions 2 and 3 of the web-based interface. Some functions can still only be performed in v2, so I regularly find myself switching back and forth.

Originally, the port mapping was based on the controllers and their ports (as you mentioned: A1, B1, A2, B2). However, since the second controller is there for redundancy, the naming changed to ports 1, 2, 3, 4, where each number refers to the corresponding port on both controller A and controller B.

There are some notes I've come across in HPE docs specifying that the matching ports on the two controllers have a relationship, and that there are special guidelines for multipathing when using the same or multiple subnets. I believe this either caused, or played a significant role in, the move to the 1, 2, 3, 4 naming.

  5. Stephen,

Ahh, that explains it; I didn't notice this when going through the documentation.
    Appreciate the explanation.

    Thanks
