Aug 12 2019
 
DS1813+

Around a month ago I decided to turn on and start utilizing NFS v4.1 in DSM on my Synology DS1813+ NAS. As most of you know, I have a vSphere cluster with 3 ESXi hosts, backed by an HPe MSA 2040 SAN and my Synology DS1813+ NAS.

The reason why I did this was to test the new version out, and attempt to increase both throughput and redundancy in my environment.

If you’re a regular reader, you know from my original plans (post here), and then from my later issues with iSCSI (post here), that I ultimately set up my Synology NAS to act as an NFS datastore. At the moment I use my HPe MSA 2040 SAN for my hot storage, and the Synology DS1813+ for cold storage. I’ve been running this way for a few years now.

Why NFS?

Some of you may ask why I chose to use NFS. Well, I’m an iSCSI kind of guy, but I’ve had tons of issues with iSCSI on DSM, especially with MPIO on the Synology NAS. The overhead on the unit was horrible (a result of the NAS’s limited hardware specs) for both block and file access to iSCSI targets (a block target vs. a virtualized (fileio) target).

I also found a major issue: if one of the drives was dying or dead, the NAS wouldn’t report it as failed, and the iSCSI target would grind to a complete halt. That meant days spent figuring out what was going on before finally replacing the drive once it turned out to be the culprit.

After spending forever trying to tweak and optimize, I found that NFS worked best on my Synology NAS unit.

What’s this new NFS v4.1 thing?

Well, it’s not actually that new! NFS v4.1 was released in January 2010 and aimed to support clustered environments (such as virtualized environments, vSphere, ESXi). It includes a session trunking mechanism, also known as NFS Multipathing.

We all love the word multipathing, don’t we? As most of you iSCSI and virtualization people know, we want multipathing on everything. It provides redundancy as well as increased throughput.

How do we turn on NFS Multipathing?

According to the VMware vSphere product documentation (here)

While NFS 3 with ESXi does not provide multipathing support, NFS 4.1 supports multiple paths.


NFS 3 uses one TCP connection for I/O. As a result, ESXi supports I/O on only one IP address or hostname for the NFS server, and does not support multiple paths. Depending on your network infrastructure and configuration, you can use the network stack to configure multiple connections to the storage targets. In this case, you must have multiple datastores, each datastore using separate network connections between the host and the storage.


NFS 4.1 provides multipathing for servers that support the session trunking. When the trunking is available, you can use multiple IP addresses to access a single NFS volume. Client ID trunking is not supported.

So it is supported! Now what?

In order to use NFS multipathing, the following must be present:

  • Multiple NICs configured on your NAS with functioning IP addresses
  • A gateway is only configured on ONE of those NICs
  • NFS v4.1 is turned on inside of the DSM web interface
  • An NFS export exists on your DSM
  • You have a version of ESXi that supports NFS v4.1

So let’s get to it! Enabling NFS v4.1 Multipathing

  1. First log in to the DSM web interface, and configure your NIC adapters in the Control Panel. As mentioned above, only configure the default gateway on one of your adapters. (Screenshot: Synology Multiple NICs Configured)
  2. While still in the Control Panel, navigate to “File Services” on the left, expand NFS, and check both “Enable NFS” and “Enable NFSv4.1 support”. You can leave the NFSv4 domain blank. (Screenshot: Enabling NFSv4.1 on Synology DSM)
  3. If you haven’t already configured an NFS export on the NAS, do so now. No special configuration is required for v4.1 beyond the norm.
  4. Log on to your ESXi host, go to storage, and add a new datastore. Choose to add an NFS datastore.
  5. On the “Select NFS version” page, select “NFS 4.1”, and click Next. (Screenshot: Selecting the NFS version on the Add Datastore dialog box on ESXi)
  6. Enter the datastore name, the folder on the NAS, and the Synology NAS IP addresses, separated by commas. Example below: (Screenshot: New NFS Datastore details and configuration on the ESXi dialog box)
  7. Press the green “+” and you’ll see it spreads them to the “Servers to be added” list, with each server entry reflecting an IP on the NAS. (Please note I made a typo on one of the IPs.) (Screenshot: List of servers/IPs for NFS Multipathing on the ESXi Add Datastore dialog box)
  8. Follow through with the wizard, and it will be added as a datastore.

That’s it! You’re done and are now using NFS Multipathing on your ESXi host!
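
If you prefer the command line, a roughly equivalent mount can be done with esxcli on the host. A quick sketch, assuming the export lives at /volume1/nfsdatastore and the NAS answers on 10.0.1.50 and 10.0.1.51 (substitute your own share path, IPs, and datastore name):

esxcli storage nfs41 add --hosts=10.0.1.50,10.0.1.51 --share=/volume1/nfsdatastore --volume-name=Synology-NFS41
esxcli storage nfs41 list

The second command should list the new datastore along with the server IPs it was mounted against, which is a handy way to confirm multiple paths were registered.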

In my case, I have all 4 NICs in my DS1813+ configured and connected to a switch. My ESXi hosts have 10Gb DAC connections to that switch, and can now utilize it at faster speeds. During intensive I/O loads, I’ve seen the full aggregated network throughput hit and sustain around 370MB/s.

After resolving the issues mentioned below, I’ve been running for weeks with absolutely no problems, and I’m enjoying the increased speed to the NAS.

Additional Important Information

After enabling this, I noticed that memory usage had drastically increased on the Synology NAS. This would peak when my ESXi hosts restarted. The issue escalated to the NAS running out of memory (both physical and swap) and ultimately crashing.

After weeks of troubleshooting I found the processes that were causing this. While the processes themselves were unrelated, the issue would only occur when using NFS Multipathing and NFS v4.1. To resolve it, I had to remove the “pkgctl-SynoFinder” package and disable its services. I could do this in my environment because I only use the NAS for NFS and iSCSI. This resolved the issue. I created a blog post here to outline how to resolve this. I also further optimized the NAS and its memory usage by disabling other unneeded services in a post here, targeted at other users like myself who only use the unit for NFS/iSCSI.

Leave a comment and let me know if this post helped!

Jul 31 2019
 

If you’re like me and use a Synology NAS as an NFS or iSCSI datastore for your VMware environment, you want to optimize it as much as possible to reduce any hardware resource utilization.

Specifically we want to disable any services that we aren’t using which may use CPU or memory resources. On my DS1813+ I was having issues with a bug that was causing memory overflows (the post is here), and while dealing with that, I decided to take it a step further and optimize my unit.

Optimize the NAS

In my case, I don’t use any file services, and only use my Synology NAS (Synology DS1813+) as an NFS and iSCSI datastore. Specifically I use multipath for NFSv4.1 and iSCSI.

If you don’t use SMB (Samba / Windows File Shares), you can make some optimizations which will free up substantial system resources.

Disable and/or uninstall unneeded packages

First step, open up the “Package Center” in the web GUI and either disable, or uninstall all the packages that you don’t need, require, or use.

To disable a package, select the package in Package Center, then click on the arrow beside “Open”. A drop down will open up, and “Disable” or “Stop” will appear if you can turn off the service. This may or may not be persistent on a fresh boot.

To uninstall a package, select the package in Package Center, then click on the arrow beside “Open”. A drop down will open up, and “Uninstall” will appear. Selecting this will uninstall the package.
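
If you’d rather do this over SSH instead of the web GUI, the synopkg utility can do the same (run it from a root shell via “sudo su”). This is a sketch based on my DSM 6 unit, so double-check the subcommands on your version; “PackageName” below is just a placeholder, and must match the name shown by the list command exactly:

synopkg list
synopkg stop PackageName
synopkg uninstall PackageName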

Disable the indexing service

As mentioned here, the indexing service can consume quite a bit of RAM/memory and CPU on your Synology unit.

To stop this service, SSH in to the unit as admin, then use the command “sudo su” to get a root shell, and finally run this command:

synoservice --disable pkgctl-SynoFinder

The above command will probably not persist on boot, and needs to be run on each fresh boot. You can, however, completely remove the package by uninstalling it with the command below.

synopkg uninstall SynoFinder

Doing this will free up substantial resources.

Disable SMB (Samba), and NMBD

I noticed that both smbd and nmbd (Samba/Windows File Share Services) were consuming quite a bit of CPU and memory as well. I don’t use these, so I can disable them.

To disable them, I ran the following command in an SSH session (remember to “sudo su” from admin to root).

synoservice --disable nmbd
synoservice --disable samba

Keep in mind that while this should be persistent on boot, it wasn’t on my system. Please see the section below on how to make it persistent on boot.

Disable thumbnail generation (thumbd)

When viewing processes on the Synology NAS and sorting by memory, there are numerous “thumbd” processes (sometimes over 10). These processes handle thumbnail generation for the File Station viewer.

Since I’m not using this, I can disable it. To do this, we either have to rename or delete the following file. I do recommend making a backup of the file.

/var/packages/FileStation/target/etc/conf/thumbd.conf

I’m going to rename it so that the service daemon can’t find it when it initializes, which causes the process not to start on boot.

cd /var/packages/FileStation/target/etc/conf/
mv thumbd.conf thumbd.conf.bak

Doing the above will stop it from running on boot.

Make the optimizations persistent on boot

In this section, I will show you how to make all the settings above persistent on boot. Even though I have removed the SynoFinder package, I will still create a startup script on the Synology NAS to “disable” it, just to be safe.

First, SSH in to the unit, and run “sudo su” to get a root shell.

Run the following commands to change to the startup script directory, and open a text editor to create the startup script.

cd /usr/local/etc/rc.d/
vi speedup.sh

While in the vi file editor, press “i” to enter insert mode. Copy and paste the code below:

case "$1" in
    start)
                echo "Turning off memory garbage"
                        synoservice --disable nmbd
                        synoservice --disable samba
                        synoservice --disable pkgctl-SynoFinder
                        ;;
    stop)
                        echo "Pertend we care and are turning something on"
                        ;;
        *)
        echo "Usage: $1 {start|stop}"
                exit 1
esac
exit 0

Now press escape, then type “:wq” and hit enter to save and close the vi text editor. Run the following command to make the script executable.

chmod 755 speedup.sh
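
Before rebooting, you can give the script a quick manual run to make sure it behaves as expected (it simply re-issues the same disable commands):

./speedup.sh start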

That’s it!

Conclusion

After making the above changes, you should see a substantial performance increase and a reduction in system resource usage!

In the future I plan on digging deeper into optimization, as I still see other services I may be able to trim down, after confirming they aren’t essential to the function of the NAS.

Feel like you can add anything? Leave a comment!

Jul 31 2019
 

Once I upgraded my Synology NAS to DSM 6.2, I started to experience frequent lockups and freezes on my DS1813+. The Synology DS1813+ would become unresponsive and I wouldn’t be able to SSH in or use the web GUI to access it. In this state, NFS would sometimes become unresponsive as well.

When this occurred, I would need to press and hold the power button to force a shutdown, or pull the power. This is extremely risky as it can cause data corruption.

I’m currently running DSM 6.2.2-24922 Update 2.

The cause

This went on for over a month until it started to interfere with my ESXi hosts. I noticed that the issue would occur when restarting any of my 3 ESXi hosts, and would definitely occur if I restarted more than one.

During the restarts, while logged in to the web GUI and SSH, I was able to see the memory (RAM) usage skyrocket. Eventually the kernel would panic and attempt to reduce memory usage once the swap file had filled up (keep in mind my DS1813+ has 4GB of memory).

Analyzing “top” as well as looking at processes, I noticed the Synology index service was causing excessive memory and CPU usage. On a fresh boot of the NAS, it would consume over 500MB of memory.
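
If you want to check this on your own unit, something along these lines over SSH (as root) will show the overall memory/swap state and the biggest memory consumers; exact output formatting may differ slightly on DSM:

# overall memory and swap usage
grep -E 'MemTotal|MemFree|SwapTotal|SwapFree' /proc/meminfo
# processes sorted by %MEM (the 4th column of ps aux), top 15
ps aux | sort -rn -k4 | head -15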

The fix

In my case, I only use my Synology NAS for an NFS/iSCSI datastore for my ESXi environment, and do not use it for SMB (Samba/File Shares), so I don’t need the indexing service.

I went ahead and SSH’ed in to the unit, and ran the following commands to turn off the service. Please note, this needs to be run as root (use “sudo su” to elevate from admin to root).

synoservice --disable pkgctl-SynoFinder

While it did work and the memory was instantly freed, the setting did not stay persistent on boot. To uninstall the indexing service entirely, run the following command.

synopkg uninstall SynoFinder

Doing this resolved the issue and freed up tons of memory. The unit is now stable.

Update – August 16th, 2019

My Synology NAS has been stable since I applied the fix, however after an uptime of a few weeks, I noticed that when restarting servers, the memory usage does hike up (for example, from 6% to 46%). However, with the fixes applied above, the unit is stable and no longer crashes.

Jan 18 2018
 

The Problem

I run a Sophos UTM firewall appliance in my VMware vSphere environment and noticed the other day that I was getting warnings on the space used on the ESXi host for the thin-provisioned vmdk file for the guest VM. I thought “Hey, this is weird”, so I enabled SSH and logged in to check my volumes. Everything looked fine and my disk usage was great! So what gives?

After spending some more time troubleshooting and not finding much, I thought to myself, “What if it’s not unmapping unused blocks from the vmdk to the host ESXi machine?”. What is unmapping, you ask? When files get deleted in a guest VM, the freed blocks aren’t automatically “unmapped” and released back to the host hypervisor in some cases.

Two things need to happen:

  1. The guest VM has to release these blocks (notify the hypervisor that it’s not using them, making the vmdk smaller)
  2. The host has to reclaim these and issue the unmap command to the storage (freeing up the space on the SAN/storage itself)

On a side note: In ESXi 6.5 and when using VMFS version 6 (VMFS6), the datastores can be configured for automatic unmapping. You can still kick it off manually, but many administrators would prefer it to happen automatically in the background with low priority (low I/O).

Most of my guest VMs automatically perform the first step (releasing the blocks back to the host). On Windows this occurs with the defrag utility, which issues trim commands and “trims” the volumes. On Linux this occurs with the fstrim command. All my guest VMs do this automatically, with the exception being the Sophos UTM appliance.

The fix

First, a warning: Enable SSH on the Sophos UTM at your own risk. You need to know what you are doing; this may also pose a security risk and should be disabled once you are finished. You’ll need to “su” to root once you log in with the “loginuser” account.

This fix not only applies to the Sophos UTM, but most other linux based guest virtual machines.

Now, to fix the issue, I used the “df” command, which provides a list of the filesystems, their mount points, and the free storage for those filesystems. I’ve included an example below (this wasn’t the full list):

hostname:/root # df
Filesystem                       1K-blocks     Used Available Use% Mounted on
/dev/sda6                          5412452  2832960   2281512  56% /
udev                               3059712       72   3059640   1% /dev
tmpfs                              3059712      100   3059612   1% /dev/shm
/dev/sda1                           338875    15755    301104   5% /boot
/dev/sda5                         98447760 13659464  79513880  15% /var/storage
/dev/sda7                        129002700  4624468 117474220   4% /var/log
/dev/sda8                          5284460   274992   4717988   6% /tmp
/dev                               3059712       72   3059640   1% /var/storage/chroot-clientlessvpn/dev


You’ll need to run the fstrim command on every mount point for the “/dev/sdaX” filesystems (X meaning you’ll be doing this for multiple mount points). In the example above, you’ll need to run it on “/”, “/boot”, “/var/storage”, “/var/log”, “/tmp”, and any other mount points that use “/dev/sdaX” filesystems.

Two examples:

fstrim / -v


fstrim /var/storage -v


Again, you’ll repeat this for all mount points on your /dev/sdaX storage (X is replaced with the partition number). The command above only works with mount points, not the actual device mappings.
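
If you’d rather not type each one out, a rough one-liner can loop over every /dev/sdaX mount point for you. This is just a sketch (it scrapes the mount point column out of df), so eyeball the list of mount points before trusting it:

# trim every mounted /dev/sdaX filesystem in one pass
for mp in $(df | awk '$1 ~ /^\/dev\/sda/ {print $NF}'); do fstrim -v "$mp"; done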

Time to release the unused blocks to the SAN:

The above completes the first step of releasing the storage back to the host. Now you can either let the automatic unmap occur slowly overtime if you’re using VMFS6, or you can manually kick it off. I decided to manually kick it off using the steps I have listed at: https://www.stephenwagner.com/2017/02/07/vmfs-unmap-command-on-vsphere-6-5-with-vmfs-6-runs-repeatedly/

You’ll need to use esxcli to do this. I simply enabled SSH on my ESXi hosts temporarily.

Please note: Using the unmap command on ESXi hosts is very storage I/O intensive. Do this during a maintenance window, or at a time of low I/O, as this will generate MAJOR I/O on your hosts…

I issued the command (replace “DATASTORENAME” with the name of your datastore):

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=8

This could run for hours, possibly days, depending on your “reclaim-unit” size (this is the block size of the unit you’re trying to reclaim from the VMFS file-system). In this example I chose 8, but most people use something larger like 100 or 200 to reduce the load and the time for the command to complete (lower values look for smaller chunks of free space, so the command takes longer to execute).

I let this run for 2 hours on a 10TB datastore, however it may take way longer (possibly 6+ hours, to days).
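
If you want to keep an eye on progress while it runs, you can tail the hostd log in a second SSH session on the host; the reclaim shows up as repeated “Async Unmapped” entries:

tail -f /var/log/hostd.log | grep -i unmap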

Finally, not only are we left with a smaller vmdk file, but we’ve released the space back to the SAN as well!

Feb 14 2017
 

Years ago, HPe released the GL200 firmware for their HPe MSA 2040 SAN that allowed users to provision and use virtual disk groups (and virtual volumes). This firmware came with a whole bunch of features such as Read Cache, performance tiering, thin provisioning of virtual disk group based volumes, and being able to allocate and commission new virtual disk groups as required.

(Please Note: On virtual disk groups, you cannot add a single disk to an already created disk group; you must either create another disk group (best practice is to create it with the same number of disks, same RAID type, and same disk type), or migrate data, delete, and re-create the disk group.)

The biggest thing with virtual storage, was the fact that volumes created on virtual disk groups, could span across multiple disk groups and provide access to different types of data, over different disks that offered different performance capabilities. Essentially, via an automated process internal to the MSA 2040, the SAN would place highly used data (hot data) on faster media such as SSD based disk groups, and place regularly/seldom used data (cold data) on slower types of media such as Enterprise SAS disks, or archival MDL SAS disks.

(Please Note: Using the performance tier either requires the purchase of a performance tiering license, or it is bundled if you purchase an HPe MSA 2042, which additionally comes with SSD drives for use as “Read Cache” or a “Performance tier”.)

 

When the firmware was first released, I had no impulse to try it out since I have 24 x 900GB SAS disks (only one type of storage), and of course everything was running great, so why change it? With that being said, I had wanted and planned to one day kill off my linear storage groups and implement virtual disk groups. The key reasons for me were thin provisioning (the MSA 2040 supports the “DELETE” VAAI function) and virtual-based snapshots (in my environment, I require over-commitment of the volume). As a side note, as of ESXi 6.5, ESXi now regularly unmaps unused blocks when using the VMFS-6 filesystem (if left enabled), which is great for thin-provisioned SANs that support the “DELETE” VAAI function.

My environment consisted of 2 linear disk groups, 12 disks in RAID5 owned by controller A, and 12 disks in RAID5 owned by controller B (24 disks total). Two weekends ago, I went ahead and migrated all my VMs to the other datastore (on the other volume), deleted the linear disk group, created a virtual disk group, and then migrated all the VMs back, deleted my second linear volume, and created a virtual disk group.

Overall the process was very easy and fast. No downtime is required for this operation if you’re licensed for Storage vMotion in your vSphere environment.

During testing, I’ve noticed absolutely no performance loss using virtual vs linear, except for some functions that utilize the VAAI storage providers, which of course run faster on the virtual disk groups since the work is offloaded to the SAN. This was a major concern for me, as linear block-based storage is accessed more directly than virtual disk groups, which add an extra level of software involvement between the controllers and disks (block-based access vs file-based access for the iSCSI targets being provided by the controllers).

Unfortunately since I have no SSDs and no extra room for disks, I won’t be able to try the performance tiering, but I’m looking forward to it in the future.

I highly recommend implementing virtual disk groups on your HPe MSA 2040 SAN!

Feb 8 2017
 

When running vSphere 6.5 or 6.7 and utilizing a VMFS-6 datastore, we now have access to automatic LUN reclaim, which automatically unmaps unused blocks on your LUNs. This is very handy for thin-provisioned storage.

Essentially, when you unmap blocks, it “tells” the storage (SAN) that unused blocks (from deleted or moved data) aren’t being used anymore and can be unmapped, which decreases the allocated size on the storage layer. Your storage LUN must support VAAI and the “Delete” function.

Now taking this a step further: most of you will have noticed that storage reclaim in the vSphere web client has only two priority settings: none, or low.

For those of you who feel daring or want to spice life up a bit, you can manually increase the priority through the esxcli command. While I can’t recommend this (obviously VMware chose to hide these options due to performance considerations), you can follow these instructions to change the priority higher.

Manually Configure Storage Reclaim (UNMAP) Priority

To view the current settings:

esxcli storage vmfs reclaim config get --volume-label=DATASTORENAME

To set ESXi reclaim/unmap priority to medium:

esxcli storage vmfs reclaim config set --volume-label=DATASTORENAME --reclaim-priority=medium

To set ESXi reclaim/unmap priority to high:

esxcli storage vmfs reclaim config set --volume-label=DATASTORENAME --reclaim-priority=high

You can confirm these settings took effect by running the first “get” command, or by viewing the datastore in the storage section of the vSphere client. While the vSphere client will reflect the higher priority setting, if you later change it to a lower value and then want to raise it again, you’ll need to use the esxcli command to bring it back up to a higher priority.

Happy Virtualizing! Leave a comment!

Feb 7 2017
 

With vSphere 6.5 came VMFS 6, and with VMFS 6 came the auto unmap feature. This is a great feature, and very handy for those of you using thin provisioning on your datastores hosted on storage that supports VAAI.

I noticed something interesting when running the manual unmap command for the first time. It isn’t well documented, but I thought I’d share for those of you who are doing a manual LUN unmap for the first time. This document will also provide you with the command to perform a manual unmap on a VMFS datastore.

Reason:

Automatic unmap (auto space reclamation) is on, however you want to speed it up or have a large chunk of blocks you want unmapped immediately, and don’t want to wait for the auto feature.

Problem:

I wasn’t noticing any unmaps were occurring automatically and I wanted to free up some space on the SAN, so I decided to run the old command to forcefully run the unmap to free up some space:

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=200

(The above command runs a manual unmap on a datastore)

After kicking it off, I noticed it wasn’t completing as fast as I thought it should. I decided to enable SSH on the host and took a look at the /var/log/hostd.log file. To my surprise, it wasn’t stopping at a 200-block reclaim; it just kept cycling, running over and over (repeatedly doing 200 blocks):

2017-02-07T14:12:37.365Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:37.978Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:38.585Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.191Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.808Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:40.426Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.050Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.659Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.275Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-9XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.886Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX

That’s just a small segment of the logs, but essentially it just kept repeating the unmap/reclaim over and over in 200 block segments. I waited hours, tried to issue a “CTRL+C” to stop it, however it kept running.

I left it to run overnight and it did eventually finish while I was sleeping. I’m assuming it attempted to unmap everything it could across the entire datastore. Initially I thought this command would only unmap the specified block size.

When running this command, it will continue to cycle in the block size specified until it goes through the entire LUN. Be aware of this when you’re planning on running the command.

Essentially, I would advise not to manually run the unmap command unless you’re prepared to unmap and reclaim ALL your unused allocated space on your VMFS 6 datastore. In my case I did this because I had 4TB of deleted data that I wanted to unmap immediately, and didn’t want to wait for the automatic unmap.

I thought this may have been occurring because the automatic unmap function was on, so I tried it again after disabling auto unmap. The behavior was the same and it just kept running.

 

If you are tempted to run the unmap function, keep in mind it will continue to scan the entire volume (regardless of what block count you set). With this being said, if you are set on running this, choose a larger block count (200 or higher), since smaller blocks will take forever (I tested with a block size of 1, and after analyzing the logs and the rate of unmaps, it would have taken over 3 months to complete on a 9TB array).

Update – May 11th, 2018: When running the manual unmap command with smaller “reclaim-unit” values (such as 1), your host may become unresponsive due to a memory overflow. vMotions will cease to function, and your ESXi host may need a restart to become fully functional. I’ve experienced this behavior twice. I highly suggest that if you run this command, you do so while the host is in maintenance mode, and that you restart the host after a successful unmap sweep.

Nov 5 2016
 

Yesterday, I had a reader (Nicolas) leave a comment on one of my previous blog posts bringing my attention to the MTU for Jumbo Frames on the HPe MSA 2040 SAN.

(Screenshot: MSA 2040 MTU Comment)

Since I first started working with the MSA 2040, I’ve looked at numerous HPe documents outlining configuration and best practices, and those documents did confirm that the unit supports Jumbo Frames. However, the MTU itself was never clearly stated in the documentation and can be confusing. I was under the assumption that the unit supported an MTU of 9000 while reserving 100 bytes for overhead. This is not necessarily the case.

Nicolas chimed in and provided details on his tests, which confirmed the HPe MSA 2040 actually has a working MTU of 8900. In my configuration I ran the tests (that Nicolas outlined) and confirmed that packets would fragment if the MTU was greater than 8900.

ESXi vmkping usage: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003728
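
If you want to test this yourself from an ESXi host, vmkping with the don’t-fragment flag is the quickest check. With an MTU of 8900, the largest ICMP payload that fits in a single frame is 8872 bytes (8900 minus 20 bytes of IP header and 8 bytes of ICMP header), so the first ping below should succeed and the second should fail. Replace vmk1 and the IP with your own storage vmkernel port and your SAN’s address (these are just placeholders):

vmkping -I vmk1 -d -s 8872 10.0.1.50
vmkping -I vmk1 -d -s 8873 10.0.1.50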

This is a big discovery, because packet fragmentation will not only degrade performance, but also flood the links with fragmented packets.

I went ahead and re-configured my ESXi hosts to use an MTU of 8900 on the network used with my SAN. This immediately created a MASSIVE performance increase (both speed, and IOPS). I highly recommend that users of the MSA 2040 SAN confirm this on their own, and update the MTUs as they see fit.
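
For reference, this is roughly what the change looks like from the ESXi command line if you’re using standard vSwitches (vSwitch1 and vmk1 below are placeholders; use your own storage vSwitch and vmkernel port, and remember the physical switch ports and the SAN need matching MTUs too):

esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=8900
esxcli network ip interface set --interface-name=vmk1 --mtu=8900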

Also, this brings up another consideration. Ideally, on a single network, you want all devices to be running the same MTU. If your MSA 2040 SAN is on a storage network with other SAN devices (or any other device), you may want to configure all of them to use the MTU of 8900 if possible (and of course, don’t forget your servers).

A big thank you to Nicolas for pointing this out!

Sep 23 2016
 

Well, recently one of the servers I monitor and maintain in a remote oil town started throwing out a Windows event log warning:

Event ID: 129

Source: HpCISSs2

Description: Reset to device, \Device\RaidPort0, was issued.

 

The server is an HP ML350p Gen8 (Windows Server 2008 R2) running latest firmware and management software. It has 2 RAID Arrays (RAID1, and RAID5), and a total of 6 disks.

Researching this error, I read that most people had this occur when running the latest HP WBEM providers, as well as anti-virus software. In our case, I actually tried downgrading to an older version, but noticed the warning still occurred. While we do have anti-virus, it’s not actively scanning (only weekly scheduled scans).

In the process of troubleshooting, I noticed that under the HP Systems Management Homepage, one of the drives in the RAID1 array, had the following stats:

Hard Read Errors:  150
Recovery Read Errors:  7
Total Seeks:  0
Seek Errors:  0

I found these numbers to be very high in my experience. None of the other drives had anything close to this (in 4 years of running, only one other disk had a single read error); this disk, however, had tons. For some reason the drive was still reporting as operational, when I’d expect it to be marked as a predicted failure, or failed.

While all the online documentation pointed towards locks on the array by software, from my own experience I think it was actually the array waiting on a read operation, and this single disk was causing a threshold to be hit in the driver, which triggered a retry to recover the read operation.

I called up HPe support and mentioned I’d like to have the drive replaced. The support engineer consulted her senior engineer and reviewed the evidence I presented (along with ADU reports and Active Monitoring health reports), and the senior engineer concurred that the drive should be replaced.

Replacing the drive resolved the issue. I’m also noticing a performance increase on the array as well.

Make sure to always check the stats on the individual components of your RAID arrays, even if everything appears to be operating soundly.

Sep 10 2016
 

When initiating manual backups or occasionally when automatic/scheduled backups run, a user may notice that Windows Server Backup may appear to “hang” when the status is reporting: “Preparing media to store backups…”.

In some rare cases it may actually be in a hung state; however, most of the time it’s actually consolidating and/or checking previous backups on the destination media.

To Confirm this:

Open the Task Manager as Administrator, click on the “Performance” tab, then click “Open Resource Monitor”. Flip over to the “Disk” tab, expand “Disk Activity”, and sort by name. You should see the read requests on the destination media; you’ll also notice that it is slowly progressing consecutively through each backup set (in increments of 1, accessing multiple at a time).

This confirms that the Windows Server Backup services are functioning and it is in fact running. In one case, I had 723 previous backups, and it took around 50 minutes to count from 1 to 723, and then the backup finally proceeded.

I have also seen this occur when a previous backup failed or was cancelled. This occurs with Windows Server Backup on Windows Server 2008, Windows Server 2008 R2, and Windows Server 2012 R2.