Apr 292018
 
Directory Services Restore Mode

Running Veeam Backup and Replication, a Microsoft Windows Server Domain Controller may boot in to safe mode and directory services restore mode.

About a week ago, I loaded up Veeam Backup and Replication in to my test environment. It’s a fantastic product, and it’s working great, however today I had a little bit of an issue with a DC running Windows Server 2016 Server Core.

I woke up to a notification that the backup failed due to a VSS snapshot issue. Now I know that VSS can be a little picky at times, so I decided to restart the guest VM. Upon restarting, she came back up, was pingable, and appeared to be running fine, however the backup kept failing with new errors, the event log was looking very strange on the server, and numerous services that were set to automatic were not starting up.

This specific server was installed using Server Core mode, so it has no GUI and is administered via command prompt over RDP, or via remote management utilities. Once RDP’ing in to the server, I noticed the “Safe Mode” branding on each corner of the display, this was very odd. I restarted the server again, this time manually trying to start Active Directory Services manually via services.msc.

This presented:

Event ID: 16652
Source: Directory-Services-SAM
General Description: The domain controller is booting to directory services restore mode.

Screenshot:

Directory Services Restore Mode

The domain controller is booting to directory services restore mode.

 

This surprised me (and scared me for that matter). I immediately started searching the internet to find out what would have caused this…

To my relief, I read numerous sites that advise that when an active backup is running on a guest VM which is a domain controller, Veeam activates directory services restore mode temporarily, so in the event of a restore, it will boot to this mode automatically. In my case, the switch was not changed back during the backup failure.

Running the following command in a command prompt, verifies that the safeboot switch is set to dsrepair enabled:

bcdedit /v

To disable directory services restore mode, type the following in a command prompt:

bcdedit /deletevalue safeboot

Restart the server and the issue should be resolved!

Apr 172018
 

With the news of VMware vSphere 6.7 being released today, a lot of you are looking for the download links for the 6.7 download (including vSphere 6.7, ESXi 6.7, etc…). I couldn’t find it myself, but after doing some scouring through alternative URLs, I came across the link.

VMware vSphere 6.7 Download

VMware vSphere 6.7 Download Link

Here’s the link: https://my.vmware.com/web/vmware/info/slug/datacenter_cloud_infrastructure/vmware_vsphere/6_7

HPe Specific (HPe Customization for ESXi) Version 6.7 is available at: https://www.hpe.com/us/en/servers/hpe-esxi.html

Unfortunately the page is blank at the moment, however you can bet the download and product listing will be added shortly!

UPDATE 10:15AM MST: The Download link is now live!

More information on the release of vSphere 6.7 can be found here, here, here, here, here, and here.

An article on the upgrade can be found at: https://blogs.vmware.com/vsphere/2018/05/upgrading-vcenter-server-appliance-6-5-6-7.html

Happy Virtualizing!

Jan 202018
 
10ZiG 5948q Zero Client

As promised in my previous post which covered my first impressions, here are some pictures and video of the 10ZiG 5948q zero client in action! In the videos I demonstrate video playback as well as the USB redirection capabilities of the 10ZiG Zero Client and VMware Horizon View. Scroll down to the bottom for videos!

Pictures

10ZiG 5948q Zero Client

10ZiG 5948q Zero Client

 

10ZiG 5948qv Zero Client VMware Horizon View

10ZiG 5948qv Zero Client VMware Horizon View

 

10ZiG 5900 Series Zero Client VMware Horizon View Login

10ZiG 5900 Series Zero Client VMware Horizon View Login

 

10ZiG 5948q Series Zero Client Configuration Menu

10ZiG 5948q Series Zero Client Configuration Menu

 

Videos

In this video, I demonstrate the capability of the 10ZiG 5948q zero client connected to a VMware Horizon View server (via a Unified Access Gateway) playing a video from YouTube. Please note that the ESXi server does not have a GPU and 3D rendering is disabled for the test (this is as low performance as it gets).

In this video, I demonstrate the capability of the 10ZiG 5948q zero client using USB redirection on a live VDI session.

And finally, here’s a video of a 10ZiG zero client cold boot for those that are interested.

And remember, my company Digitally Accurate Inc. is a VMware Solutions Provider and 10ZiG Partner. I’m also regularly posting content on these on the corporate blog as well!

Jan 182018
 

The Problem

I run a Sophos UTM firewall appliance in my VMware vSphere environment and noticed the other day that I was getting warnings on the space used on the ESXi host for the thin-provisioned vmdk file for the guest VM. I thought “Hey, this is weird”, so I enabled SSH and logged in to check my volumes. Everything looked fine and my disk usage was great! So what gives?

After spending some more time troubleshooting and not finding much, I thought to myself “What if it’s not unmapping unused blocks from the vmdk to the host ESXi machine?”. What is unampping you ask? When files get deleted in a guest VM, the free blocks aren’t automatically “unmapped” and released back to the host hypervisor in some cases.

Two things need to happen:

  1. The guest VM has to release these blocks (notify the hypervisor that it’s not using them, making the vmdk smaller)
  2. The host has to reclaim these and issue the unmap command to the storage (freeing up the space on the SAN/storage itself)

On a side note: In ESXi 6.5 and when using VMFS version 6 (VMFS6), the datastores can be configured for automatic unmapping. You can still kick it off manually, but many administrators would prefer it to happen automatically in the background with low priority (low I/O).

Most of my guest VMs automatically do the first step (releasing the blocks back to the host). On Windows this occurs with the defrag utility which issues trim commands and “trims” the volumes. On linux this occurs with the fstrim command. All my guest VMs do this automatically with the exception being the Sophos UTM appliance.

The fix

First, a warning: Enable SSH on the Sophos UTM at your own risk. You need to know what you are doing, this also may pose a security risk and should be disabled once your are finished. You’ll need to “su” to root once you log in with the “loginuser” account.

This fix not only applies to the Sophos UTM, but most other linux based guest virtual machines.

Now to fix the issue, I used the “df” command which provides a list of the filesystems, their mount points, and storage free for those fileystems. I’ve included an example below (this wasn’t the full list):

hostname:/root # df
Filesystem                       1K-blocks     Used Available Use% Mounted on
/dev/sda6                          5412452  2832960   2281512  56% /
udev                               3059712       72   3059640   1% /dev
tmpfs                              3059712      100   3059612   1% /dev/shm
/dev/sda1                           338875    15755    301104   5% /boot
/dev/sda5                         98447760 13659464  79513880  15% /var/storage
/dev/sda7                        129002700  4624468 117474220   4% /var/log
/dev/sda8                          5284460   274992   4717988   6% /tmp
/dev                               3059712       72   3059640   1% /var/storage/chroot-clientlessvpn/dev


You’ll need to run the fstrim command on every mountpoint for file systems “/dev/sdaX” (X means you’ll be doing this for multiple mountpoints). In the example above, you’ll need to run it on “/”, “/boot”, “/var/storage”, “/var/log”, “/tmp”, and any other mountpoints that use “/dev/sdaX” filesytems.

Two examples:

fstrim / -v

fstrim / -v

 

 

fstrim /var/storage -v

fstrim /var/storage -v

 

 

Again, you’ll repeat this for all mount points for your /dev/sdaX storage (X is replaced with the volume number). The command above only works with mountpoints, and not the actual device mappings.

Time to release the unused blocks to the SAN:

The above completes the first step of releasing the storage back to the host. Now you can either let the automatic unmap occur slowly overtime if you’re using VMFS6, or you can manually kick it off. I decided to manually kick it off using the steps I have listed at: https://www.stephenwagner.com/2017/02/07/vmfs-unmap-command-on-vsphere-6-5-with-vmfs-6-runs-repeatedly/

You’ll need to use esxcli to do this. I simply enabled SSH on my ESXi hosts temporarily.

Please note: Using the unmap command on ESXi hosts is very storage I/O intensive. Do this during maintenance window, or at a time of low I/O as this will perform MAJOR I/O on your hosts…

I issue the command (replace “DATASTORENAME” with the name of your datastore):

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=8

This could run for hours, possibly days depending on your “reclaim-unit” size (this is the block size of the unit you’re trying to reclaim from the VMFS file-system). In this example I choose 8, but most people do something larger like 100, or 200 to reduce the load and time for the command to complete (lower values look for smaller chunks of free space, so the command takes longer to execute).

I let this run for 2 hours on a 10TB datastore, however it may take way longer (possibly 6+ hours, to days).

Finally, not only are we are left with a smaller vmdk file, but we’ve released the space back to the SAN as well!

Jan 062018
 

Last night I updated my VMware VDI envionrment to VMware Horizon 7.4.0. For the most part the upgrade went smooth, however I discovered an issue (probably unrelated to the upgrade itself, and more so just previously overlooked). When connecting with Google Chrome to  VMware Horizon HTML Access via the UAG (Unified Access Gateway), an error pops up after pressing the button saying “Failed to connected to the connection server”.

The Problem:

This error pops up ONLY when using Chrome, and ONLY when connecting through the UAG. If you use a different browser (Firefox, IE), this issue will not occur. If you connect using Chrome to the connection server itself, this issue will not occur. It took me hours to find out what was causing this as virtually nothing popped up when searching for a solution.

Finally I stumbled across a VMware document that mentions on View Connection Server instances and security servers that reside behind a gateway (such as a UAG, or Access Point), the instance must be aware of the address in which browsers will connect to the gateway for HTML access.

The VMware document is here: https://docs.vmware.com/en/VMware-Horizon-7/7.0/com.vmware.horizon-view.installation.doc/GUID-FE26A9DE-E344-42EC-A1EE-E1389299B793.html

To resolve this:

On the view connection server, create a file called “locked.properties” in “install_directory\VMware\VMware View\Server\sslgateway\conf\”.

If you have a single UAG/Access Point, populate this file with:

portalHost=view-gateway.example.com

If you have multiple UAG/Access Points, populate the file with:

portalHost.1=view-gateway-1.example.com
portalHost.2=view-gateway-2.example.com

Restart the server

The issue should now be resolved!

On a side note, I also deleted my VMware Unified Access Gateways VMs and deployed the updated version that ship with Horizon 7.4.0. This means I deployed VMware Unified Access Gateway 3.2.0. There was an issue importing the configuration from the export backup I took from the previous version, so I had to configure from scratch (installing certificates, configuring URLs, etc…), be aware of this issue importing configuration.

 

Oct 272017
 

I went to re-deploy some vDP appliances today and noticed a newer version was made available a few months ago (vSphere Data Protection 6.1.5). After downloading the vSphereDataProtection-6.1.5.ova file, I went to deploy it to my vSphere cluster and it failed due to an invalid certificate and a message reading “The OVF package is signed with an invalid certificate”.

I went ahead and downloaded the certificate to see what was wrong with it. While the publisher was valid, the certificate was only valid from September 5th, 2016 to September 7th, 2017, and today was October 27th, 2017. It looks like the guys at VMware should have generated a new cert before releasing it.

 

 

To resolve this, you need to repackage the OVA file and skip the certificate using the VMware Open Virtualization Format Tool (ovftool) available at https://code.vmware.com/tool/ovf/4.1.0

Once you download and install this, the executable can be found in your Program Files\VMware\VMware OVF Tool folder.

Open a command prompt and change to the above directory and run the following:

ovftool.exe --skipManifestCheck c:\folder\vSphereDataProtection-6.1.5.ova c:\folder\vdpgood.ova

This command will repackage and remove the certificate from the OVA and save it as the new file named vdpgood.ova above. Afterwards deploy it to your vSphere environment and all should be working!

 

Feb 182017
 

This is an issue that effects quite a few people and numerous forum threads can be found on the internet by those searching for the solution.

This can occur both when taking manual snapshots of virtual machines when one chooses “Quiesce guest filesystem”, or when using snapshot based backup applications such as vSphere Data Protection (vSphere vDP).

 

For the last couple days, one of my test VMs (Windows Server 2012 R2) has been experiencing this issue and the snapshot has been failing with the following errors:

An error occurred while taking a snapshot: Failed to quiesce the virtual machine.
An error occurred while saving the snapshot: Failed to quiesce the virtual machine.

As always with standard troubleshooting, I restarted the VM, checked for VSS provider errors, and insured that the Windows Services involved with snapshots were in their correct state and configuration. Unfortunately this had no effect, and everything was configured the way it should be.

I also tried to re-install VMWare tools, which had no effect.

PLEASE NOTE: If you experience this issue, you should confirm the services are in their correct state and configuration, as outlined in VMware KB: 1007696. Source: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007696

 

The Surprise Fix:

In the days leading up to the failure when things were running properly, I did notice that the quiesced snapshots for that VM were taking a long time process, but were still functioning correctly before the failure.

This morning during troubleshooting, I went ahead and deleted all the Windows Volume Shadow Copies which are internal and inside of the Virtual Machine itself. These are the shadow copies that the Windows guest operating system takes on it’s own filesystem (completely unrelated to VMware).

To my surprise after doing this, not only was I able to create a quiesced snapshot, but the snapshot processed almost instantly (200x faster than previously when it was functioning).

I’m assuming this was causing a high load for the VMware snapshot to process and a timeout was being hit on snapshot creation which caused the issue. While Windows volume shadow copies are unrelated to VMware snapshots, they both utilize the same VSS (Volume Shadow Copy Service) system inside of windows to function and process. One must also keep in mind that the Windows volume shadow copies will of course be part of a VMware snapshot.

PLEASE NOTE: Deleting your Windows Volume Shadow copies will delete your Windows volume snapshots inside of the virtual machine. You will lose the ability to restore files and folders from previous volume shadow copy snapshots. Be aware of what this means and what you are doing before attempting this fix.

Feb 142017
 

Years ago, HPe released the GL200 firmware for their HPe MSA 2040 SAN that allowed users to provision and use virtual disk groups (and virtual volumes). This firmware came with a whole bunch of features such as Read Cache, performance tiering, thin provisioning of virtual disk group based volumes, and being able to allocate and commission new virtual disk groups as required.

(Please Note: On virtual disk groups, you cannot add a single disk to an already created disk group, you must either create another disk group (best practice to create with the same number of disks, same RAID type, and same disk type), or migrate data, delete and re-create the disk group.)

The biggest thing with virtual storage, was the fact that volumes created on virtual disk groups, could span across multiple disk groups and provide access to different types of data, over different disks that offered different performance capabilities. Essentially, via an automated process internal to the MSA 2040, the SAN would place highly used data (hot data) on faster media such as SSD based disk groups, and place regularly/seldom used data (cold data) on slower types of media such as Enterprise SAS disks, or archival MDL SAS disks.

(Please Note: To use the performance tier either requires the purchase of a performance tiering license, or is bundled if you purchase an HPe MSA 2042 which additionally comes with SSD drives for use with “Read Cache” or “Performance tier.)

 

When the firmware was first released, I had no impulse to try it out since I have 24 x 900GB SAS disks (only one type of storage), and of course everything was running great, so why change it? With that being said, I’ve wanted and planned to one day kill off my linear storage groups, and implement the virtual disk groups. The key reason for me being thin provisioning (the MSA 2040 supports the “DELETE” VAAI function), and virtual based snapshots (in my environment, I require over-commitment of the volume). As a side-note, as of ESXi 6.5, ESXi now regularly unmaps unused blocks when using the VMFS-6 filesystem (if left enabled), which is great for SANs using thin provision that support the “DELETE” VAAI function.

My environment consisted of 2 linear disk groups, 12 disks in RAID5 owned by controller A, and 12 disks in RAID5 owned by controller B (24 disks total). Two weekends ago, I went ahead and migrated all my VMs to the other datastore (on the other volume), deleted the linear disk group, created a virtual disk group, and then migrated all the VMs back, deleted my second linear volume, and created a virtual disk group.

Overall the process was very easy and fast. No downtime is required for this operation if you’re licensed for Storage vMotion in your vSphere environment.

During testing, I’ve noticed absolutely no performance loss using virtual vs linear, except for some functions that utilize the VAAI storage providers which of course run faster on the virtual disk groups since it’s being offloaded to the SAN. This was a major concern for me as block linear based storage is accessed more directly, then virtual disk groups which add an extra level of software involvement between the controllers and disks (block based access vs file based access for the iSCSI targets being provided by the controllers).

Unfortunately since I have no SSDs and no extra room for disks, I won’t be able to try the performance tiering, but I’m looking forward to it in the future.

I highly recommend implementing virtual disk groups on your HPe MSA 2040 SAN!

Feb 082017
 

When running vSphere 6.5 and utilizing a VMFS-6 datastore, we now have access to automatic LUN reclaim (this unmaps unused blocks on your LUN), which is very handy for thin provisioned storage LUNs.

Essentially when you unmap blocks, it “tells” the storage that unused (deleted or moved data) blocks aren’t being used anymore and to unmap them (which decreases the allocated size on the storage layer). Your storage LUN must support VAAI and the “Delete” function.

Most of you have noticed that storage reclaim in the vSphere client has two settings for priority; none, or low.

For those of you who feel daring or want to spice life up a bit, you can increase the priority through the esxcli command. While I can’t recommend this (obviously VMware chose to hide these options due to performance considerations), you can follow these instructions to change the priority higher.

 

To view current settings:

esxcli storage vmfs reclaim config get –volume-label=DATASTORENAME

To set reclaim priority to medium:

esxcli storage vmfs reclaim config set –volume-label=DATASTORENAME –reclaim-priority=medium

To set reclaim priority to high:

esxcli storage vmfs reclaim config set –volume-label=DATASTORENAME –reclaim-priority=high

 

You can confirm these settings took effect by running the command to view settings above, or view the datastore in the storage section of the vSphere client. While the vSphere client will reflect the higher priority setting, if you change it lower and then want to change it back higher, you’ll need to use the esxcli command to bring it up to a higher priority again.

Feb 072017
 

With vSphere 6.5 came VMFS 6, and with VMFS 6 came the auto unmap feature. This is a great feature, and very handy for those of you using thin provisioning on your datastores hosted on storage that supports VAAI.

I noticed something interesting when running the manual unmap command for the first time. It isn’t well documented, but I thought I’d share for those of you who are doing a manual LUN unmap for the first time.

Reason:

Automatic unmap (auto space reclamation) is on, however you want to speed it up or have a large chunk of block’s you want unmapped immediately, and don’t want to wait for the auto feature.

Problem:

I wasn’t noticing any unmaps were occurring automatically and I wanted to free up some space on the SAN, so I decided to run the old command to forcefully run the unmap to free up some space:

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=200

After kicking it off, I noticed it wasn’t completing as fast as I thought it should be. I decided to enable SSH on the host and took a look at the /var/log/hostd.log file. To my surprise, it wasn’t stopping at a 200 block reclaim, it just kept cycling running over and over (repeatedly doing 200 blocks):

2017-02-07T14:12:37.365Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:37.978Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:38.585Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.191Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.808Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:40.426Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.050Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.659Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.275Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-9XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.886Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX

That’s just a small segment of the logs, but essentially it just kept repeating the unmap/reclaim over and over in 200 block segments. I waited hours, tried to issue a “CTRL+C” to stop it, however it kept running.

I left it to run overnight and it did eventually finish while I was sleeping. I’m assuming it attempted to unmap everything it could across the entire datastore. Initially I thought this command would only unmap the specified block size.

When running this command, it will continue to cycle in the block size specified until it goes through the entire LUN. Be aware of this when you’re planning on running the command.

Essentially, I would advise not to manually run the unmap command unless you’re prepared to unmap and reclaim ALL your unused allocated space on your VMFS 6 datastore. In my case I did this because I had 4TB of deleted data that I wanted to unmap immediately, and didn’t want to wait for the automatic unmap.

I thought this may have been occurring because the automatic unmap function was on, so I tried it again after disabling auto unmap. The behavior was the same and it just kept running.

 

If you are tempted to run the unmap function, keep in mind it will continue to scan the entire volume (despite what block count you set). With this being said, if you are firm on running this, choose a larger block count (200 or higher) since smaller blocks will take forever (tested with a block size of 1 and after analyzing the logs and rate of unmaps, it would have taken over 3 months to complete on a 9TB array).

Update May 11th 2018: When running the manual unmap command with smaller “reclaim-unit” values (such as 1), your host may become unresponsive due to a memory overflow. vMotion’s will cease to function, and your ESXi host may need a restart to become fully functional. I’ve experienced this behavior twice. I highly suggest that if you perform this command, you do so while the host is in maintenance mode, and that your restart the host after a successful unmap sweep.