Jun 09 2012
 

Recently, I’ve started to have some issues with the HP MSA20 units attached to my SAN server at my office. These MSA20 units stored all my virtual machines inside a VMFS filesystem, which was presented to my vSphere cluster hosts over iSCSI using Lio-Target. Lately, the logical drives have been randomly disappearing, causing my 16+ virtual machines to halt. Every time, I have to shut off the physical hosts, the SAN server, and the MSA20s, then bring everything all the way back up. This causes huge amounts of downtime, and it’s just a pain in the butt…

I decided it was time for me to re-do my storage system. Preferably, I would have purchased a couple of HP MSA60s and P800 controllers to hook them up to my SAN server, but unfortunately right now it’s not in the budget.

A few years ago, I started using software RAID. In the past I was absolutely scared of it, thought it was complete crap, and would never have touched it, but my opinion drastically changed after playing with it and using it regularly. While I still recommend that businesses use hardware-based RAID systems, especially for mission-critical applications, I felt I could try out software RAID for the above situation since it’s more of a “hobby” setup.

I read that most storage enthusiasts use either the Super Micro AOC-SASLP-MV8 or the LSI SAS 9211-8i. The two are based on different chipsets (both of which are widely used in other well-known cards), and each has its own pros and cons.

During my research, I noticed a lot of people who run Windows Home Server were using the Super Micro AOC card. Most WHS users reported no issues whatsoever, but it was a different story when reading posts/blog articles from people using Linux. I don’t know how accurate this was, but apparently a lot of people had issues with this card under heavy load, and some just couldn’t get it running under Linux at all.

Then there is the LSI 9211-8i (which is the same as the extremely popular IBM M1015). This bad boy supports basic RAID operations (1, 0, 10), but most people use it with JBOD and simply use Linux MD software RAID. There were numerous complaints from users whose systems wouldn’t even detect the card, and other users reported issues caused by the BIOS of this card (too much memory for the system to boot). When people did get this card working though, I read of mostly NO issues under Linux. I spent a few days confirming what I had already read and finally decided to make the purchase.

Both cards support SAS/SATA, but only the LSI card supports 6Gb/sec SAS/SATA. Both also have 2 internal SFF8087 Mini-SAS connectors to hook up a total of 8 drives directly, or more using a SAS expander. The LSI card uses a PCIe (V.2) 8x slot, vs the AOC-SASLP which uses a PCIe (V.1) 4x slot.
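To put rough numbers on the slot difference, here is a minimal sketch. It assumes the commonly quoted per-lane figures of about 250 MB/s for PCIe 1.x and 500 MB/s for PCIe 2.0 (per direction, after 8b/10b encoding overhead). The function name `pcie_bandwidth_mbs` is made up for illustration, and real-world throughput will be lower than these theoretical ceilings.

```shell
#!/bin/sh
# Approximate theoretical PCIe bandwidth per direction, in MB/s.
# Assumed per-lane figures: 250 MB/s (gen 1), 500 MB/s (gen 2).
pcie_bandwidth_mbs() {
    gen=$1
    lanes=$2
    case "$gen" in
        1) per_lane=250 ;;
        2) per_lane=500 ;;
        *) echo "unsupported generation: $gen" >&2; return 1 ;;
    esac
    echo $((per_lane * lanes))
}

pcie_bandwidth_mbs 1 4   # AOC-SASLP-MV8: PCIe 1.x, 4 lanes
pcie_bandwidth_mbs 2 8   # LSI 9211-8i: PCIe 2.0, 8 lanes
pcie_bandwidth_mbs 1 8   # LSI 9211-8i sitting in a PCIe 1.x slot
```

Even when the LSI card lands in a PCIe 1.x slot, its 8 lanes still give it about twice the theoretical ceiling of the 4-lane card.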

I went to NCIX.com and ordered the LSI 9211-8i along with 2 breakout cables (Card Part#: LSI00194, Cable Part#: CBL-SFF8087OCF-06M). This would allow me to hook up a total of 8 drives (even though I only plan to use 5). I plan on installing the card in an old computer that I already use for NFS, etc…, which connects over eSATA to a Sans Digital SATA expander. I also have an old StarTech SATABAY5BK enclosure which will hold the drives and connect to the controller. Finished case:

Server with disk enclosures (StarTech SATABAY5BK)

(At this point I have the enclosure installed along with 5 X 1TB Seagate 7200.12 Barracuda drives)

Finally the controller showed up from NCIX:

LSI SAS 9211-8i

I popped this card in the computer (which unfortunately only had PCIe V1), and connected the cables! This is when I ran into a few issues…

-If no drives were connected, the system would boot and I could successfully boot to CentOS 6.

-If I pressed CTRL+C at all to get into the card’s interface, the system would freeze during BIOS POST.

-If any drives were connected and detected by the card’s BIOS, the system would freeze during BIOS POST.

I went ahead and booted into CentOS 6, downloaded the updated firmware and BIOS, and flashed the card. The flashing manual was insane, but I had to read it all to make sure I didn’t break anything. First I updated both the firmware and BIOS (which went OK), however I couldn’t convert the card from IR firmware to IT firmware due to errors. I googled this and came up with a bunch of articles, but this one: http://brycv.com/blog/2012/flashing-it-firmware-to-lsi-sas9211-8i/ was the only one that helped and pointed me in the right direction. Essentially, you have to use the DOS flasher, erase the card (MAKING SURE NOT TO REBOOT OR YOU’D BRICK IT), and then flash the IT firmware. This worked for me, check out his post! Thanks Bryan!

Anyways, after updating the card and converting it to the IT firmware, I still had the BIOS issue. I tried the card in another system, and still had a bunch of issues. I finally removed 1 of 2 video cards, installed the card in a video card slot, and could finally get into the card’s BIOS. First I enabled staggered spin-up (to make sure I don’t blow the PSU on the computer with a bunch of drives starting up at once), changed some other settings to optimize, and finally disabled the boot BIOS and set the adapter to be disabled for boot and only available to the OS. After removing the card and putting it in the target computer, this worked. I also noticed that the staggered spin-up started during the Linux kernel startup when initializing the card. Here’s a copy of the kernel log:

mpt2sas version 08.101.00.00 loaded
mpt2sas 0000:06:00.0: PCI INT A -> Link[LNKB] -> GSI 18 (level, low) -> IRQ 18
mpt2sas 0000:06:00.0: setting latency timer to 64
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (3925416 kB)
mpt2sas 0000:06:00.0: irq 24 for MSI/MSI-X
mpt2sas0: PCI-MSI-X enabled: IRQ 24
mpt2sas0: iomem(0x00000000dfffc000), mapped(0xffffc900110f0000), size(16384)
mpt2sas0: ioport(0x000000000000e000), size(256)
mpt2sas0: sending message unit reset !!
mpt2sas0: message unit reset: SUCCESS
mpt2sas0: Allocated physical memory: size(7441 kB)
mpt2sas0: Current Controller Queue Depth(3305), Max Controller Queue Depth(3432)
mpt2sas0: Scatter Gather Elements per IO(128)
mpt2sas0: LSISAS2008: FWVersion(13.00.57.00), ChipRevision(0x03), BiosVersion(07.25.00.00)
mpt2sas0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
mpt2sas0: sending port enable !!
mpt2sas0: host_add: handle(0x0001), sas_addr(0x5000000080000000), phys(8)
mpt2sas0: port enable: SUCCESS

SUCCESS! Lots of SUCCESS! Just the way I like it! Haha, the card initialized, had access to the drives, etc…
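If you want to confirm which firmware and BIOS versions the driver actually sees after a flash, they can be pulled out of log lines like the ones above. This is only a sketch: the helper name `mpt2sas_field` is made up for illustration, and it assumes the `Name(value)` format shown in this particular log, which may differ with other driver versions.

```shell
#!/bin/sh
# Print the value inside NAME(...) from the first matching log line on stdin.
# Assumes the "Name(value)" layout seen in the mpt2sas log above.
mpt2sas_field() {
    sed -n "s/.*$1(\([^)]*\)).*/\1/p" | head -n 1
}

# Sample line copied from the kernel log above.
sample='mpt2sas0: LSISAS2008: FWVersion(13.00.57.00), ChipRevision(0x03), BiosVersion(07.25.00.00)'

printf '%s\n' "$sample" | mpt2sas_field FWVersion
printf '%s\n' "$sample" | mpt2sas_field BiosVersion
```

On a live system the same helper could be fed from the kernel log instead of the sample line, for example `dmesg | grep mpt2sas | mpt2sas_field FWVersion`.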

I configured the RAID 5 array using a 256KB chunk size. I also changed the “stripe_cache_size” to 2048 (the system has 4GB of RAM) to increase the RAID 5 performance.

cd /sys/block/md0/md/

echo 2048 > stripe_cache_size
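The stripe cache is counted in pages per member device, so the RAM it takes is roughly stripe_cache_size × page size × number of drives. Here is a rough sketch of that arithmetic, assuming 4 KiB pages and the 5 drives in this array (the function name `stripe_cache_bytes` is just for illustration):

```shell
#!/bin/sh
# Estimate memory used by the md stripe cache, in bytes.
# Arguments: cache entries, number of member drives, page size (default 4096).
stripe_cache_bytes() {
    entries=$1
    disks=$2
    page_size=${3:-4096}
    echo $((entries * disks * page_size))
}

bytes=$(stripe_cache_bytes 2048 5)
echo "$((bytes / 1024 / 1024)) MiB"
```

With these numbers it works out to 40 MiB, which is tiny next to 4GB of RAM. Keep in mind that a value written through sysfs does not survive a reboot, so it has to be re-applied at startup.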

At this point I simply formatted the array using EXT4, configured some folders and NFS exports, and then used Storage vMotion to migrate the virtual machines from the iSCSI target to the new RAID5 array (currently using NFS). The main priority right now was to get the VMs off the MSA20 so I could at least create a backup after they had been moved. As a next step, I’ll be re-doing the RAID5 array, configuring the md0 device as an iSCSI target using Lio-Target, and formatting it with VMFS. The performance of this software RAID5 array is already blowing the MSA20 out of the water!
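For anyone who wants to align EXT4 with the RAID layout instead of formatting with defaults, the usual arithmetic is stride = chunk size / filesystem block size, and stripe width = stride × number of data disks. The sketch below is not what was run for this setup. It assumes a 4 KiB filesystem block size and 4 data disks (5 drives in RAID 5, minus one drive’s worth of parity), and the function name `ext4_raid_opts` is made up for illustration.

```shell
#!/bin/sh
# Build the mkfs.ext4 -E option string for a striped md array.
# Arguments: chunk size in KiB, filesystem block size in KiB, data disks.
ext4_raid_opts() {
    chunk_kib=$1
    block_kib=$2
    data_disks=$3
    stride=$((chunk_kib / block_kib))
    stripe_width=$((stride * data_disks))
    echo "stride=${stride},stripe-width=${stripe_width}"
}

# 256 KiB chunk, 4 KiB blocks, 4 data disks.
ext4_raid_opts 256 4 4

# Hypothetical invocation (this would destroy any data on /dev/md0):
# mkfs.ext4 -b 4096 -E "$(ext4_raid_opts 256 4 4)" /dev/md0
```

The function only prints the option string, so it is safe to run anywhere. The actual format command is left commented out on purpose.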

So there you have it! Feel free to post a comment if you have any questions or need any specifics. This setup is rocking away now under high I/O with absolutely no problems whatsoever. I think I may go purchase another 1-2 of these cards!

Jun 06 2012
 

So, as most of you know, I have TONS of articles pertaining to getting Lio-Target running on CentOS. In the beginning, things seemed rather “hit or miss” due to weird errors when building either lio-target or lio-utils…

Turns out, most of the issues I’ve had are related to Python and the version currently installed. Recently I updated one of my storage boxes using yum, and it completely broke Lio-Target and Lio-Utils when I had to rebuild them for the new kernel. In a panic, I mounted an old CentOS 6 ISO to get one of the first Python versions shipped for CentOS 6. After downgrading, I was able to build and install both.

 

Just a heads up for you people getting weird python errors.

Jun 03 2012
 

Well, for the longest time I have been running a vSphere 4.x cluster (1 X ML350 G5, 2 X DL360 G5) off of a pair of HP MSA20s connected to a SuperMicro server running Lio-Target as an iSCSI target on CentOS 6.

This configuration has worked perfectly for me for almost a year, requiring absolutely no maintenance/work at all.

Recently I moved, so I had to move all my servers, storage units, etc… When I got into the new place and went to power everything up, I noticed that the first drive in one of the MSA20 units had failed upon initializing. I replaced this drive, let it rebuild, and thought this would be the end of the issue, but I was incorrect. (Just so everyone knows, these units had been on continuously for 8+ months before I turned them off to move.)

In the months since the successful rebuild, at times of high I/O (backing up to an NFS share using GhettoVCB), the logical drive in the array just disappears. I have each MSA20 connected to its own HP SmartArray 6400/6402 controller. When the logical drive disappears, I notice that the “Drive Failure” LED on the SA 640x controller illuminates. When this happens, I have to shut off all physical servers, the storage server (running Lio-Target), and the MSAs, and restart everything.

Sometimes it is worse than others (for example, I’ve been dealing with this issue non-stop all weekend with no sleep). Even under low I/O, I’ll be starting the VMs and it will just lose the logical drive again. Other times, I can run it for weeks as long as I keep the I/O to a minimum.

I’ve read numerous other articles and posts of other people having the same issue. These people have had HP replace every single item inside of the MSA20 unit (except the drives) and the issue still occurs. This made me start thinking.

This weekend, I NEEDED to get a backup done. While doing this, the issue occurred, and it got to the point where I couldn’t even start the VMs, let alone back them up. I figured that since other people have had this issue, and since replacing all hardware hasn’t fixed it (I even moved a RAID array from one MSA20 to the other one I have, with no effect), it has to be the drives themselves.

There are two possibilities: either the drives have failed and the MSA20 isn’t reporting them as failed (which I’ve seen happen), or the way the MSA20 creates the RAID array has issues. I ran an ADU report and carefully read the entire report. There were absolutely no issues reading/writing to the drives, and the MSA20 has no events in its log. It HAS to be the way the RAID array is created on the disks.

Please do not try this unless you know exactly what you’re doing. This is very dangerous and you can lose all your data. This is just me reporting on my experience and usage.

In desperation, I thought to myself that this all started when a drive failed and I put a new disk in the array. Since I couldn’t back any of my data up, let alone start the VMs, I decided to start behaving dangerously. I kept all my vSphere hosts offline, and just turned on the MSA20 units and the SuperMicro server that is attached to them. I then proceeded to remove a drive, re-insert it, let it rebuild over 3 hours, and then, once the array was healthy and rebuilt, do the same to the next drive. I have a RAID 5 array containing 4 X 500GB disks, so this actually took me a day to do (remove/re-insert, rebuild, then on to the next drive).

After removing and rebuilding each drive in the array, I finally decided to boot up the vSphere servers and run a backup of my 12 VMs. Not only did everything seem faster, but I completed the backup without any problems. This suggests there’s a very high chance that the RAID data on the drives was either corrupt, damaged, or just wasn’t implemented very nicely. Rebuilding each drive seemed to fix this. I’ll report in a few weeks to let you know for sure that it’s resolved!

 

Hope this helps someone out there!

May 11 2012
 

For the longest time I’ve been dealing with a server that hasn’t been playing nice. Regularly the server will freeze when either creating VSS snapshots, or deleting them!

These freezes usually happen at 6:00AM or 12:00PM (when I have the snapshots scheduled) and can sometimes lock the server up for close to 30 minutes. I’ve spent HOURS investigating this and found absolutely no errors, nothing, except that some services might fail due to the freeze if I’m actually logged in to the server.

Typically, this behavior only starts happening 1-2 weeks after a fresh reboot. Rebooting the server stops this issue for 1-2 weeks. And keep in mind, as I said absolutely no errors in the event log that point to what is causing this.

The Server runs fully updated/patched Windows Server 2008, has 16GB of RAM, 2 X 6-core processors and SAS disks, so it’s nothing related to performance.

Finally, after months, I have found out what the culprit is in my case: it turns out that Symantec Endpoint Protection Manager (not the anti-virus, but the management software) was actually causing or aggravating this issue. When logging in, I noticed that Symantec Endpoint Protection Manager was somewhat sluggish and not functioning properly. I restarted the services, and BAM, out of nowhere the VSS process decided to delete the oldest snapshot for C:. When this happened, the server froze. I repeated this 4 times to confirm, all in the same morning. I’m not sure why it was triggering snapshot removal, but it was odd.

I proceeded to upgrade Symantec Endpoint Protection Manager on that server later that week. During the upgrade (I upgraded to a newer 11.x release, then later to 12.x), I noticed that every time the services were restarted automatically as part of the database upgrade process, the VSS issue would occur and the server would become unresponsive.

We are now running at 12.x on that system, and have not had any reported freeze-ups. It’s been over a week and a half, and it looks like the issue is resolved.

Apr 14 2012
 

The other day I received a notification that one of my clients was running out of space on their SAS RAID array, which contained their Exchange 2007 mailbox database. While I have every plan to increase the size of this partition, I still have to temporarily fix things so we don’t run out of space. As a temporary fix, I had to move the Exchange Server data to another partition on the server which had plenty of space. Typically, this is very easy on Microsoft Small Business Server 2008. However, in this specific scenario we were getting an error when trying to run the wizard to move the data:

 

Move Exchange Data Error Message

You cannot use the Windows SBS Console to move the Exchange Server data. – You may have used the Exchange Server Management Console to perform advanced configuration tasks. For information about how to reconfigure move your data using the Exchange Server Management Console, see the documentation for Microsoft Exchange Server

After receiving this error, I went ahead and looked for the logs pertaining to the move wizard. The error log mentioned that the configuration was altered from the default (which is acceptable since we have done some modifications to our Exchange config), and I also believe this occurred due to both our “First Storage Group” and “Second Storage Group” already being hosted on different logical partitions. From what I have read, the wizard won’t work if you have modified your Exchange configuration too heavily or have your storage groups on different partitions.

Because of this, we have to move the Exchange data manually using the Exchange Management Console. These instructions will work for both Microsoft Windows Small Business Server 2008 and Microsoft Exchange 2007 running on a standard Microsoft Windows Server (only if you’re not using any replication to other Exchange Servers). Please note that during this move, all move functions will require the database to be dismounted from the information store. Only Exchange 2010 (or later) supports live moving.

Instructions to move the Exchange database (First Storage Group – Mailbox Database):

Important: Always back up your server before doing heavy operations like this in case something goes wrong. To back Microsoft Exchange up, you have to have backup software that is “Exchange Aware” and can properly back it up.

 

1) Launch the Microsoft Exchange Management Console and locate the Database Management information – You should be able to find the Exchange Management Console in your start menu. When opening, it should prompt for UAC (run as Administrator) privileges; grant them. If it does not prompt you to run as Administrator, right click on “Exchange Management Console” and select “Run as Administrator”. Once you have opened the console, expand “Server Configuration” and select “Mailbox”.

Exchange Server 2007 Management Console

Server Configuration -> Mailbox

2) Move Storage Group Path – First we need to move the “Storage Group Path” for the “First Storage Group” (which contains our Exchange mailboxes). This will move the files that are related to logs, transaction files, etc… To do this, right click on “First Storage Group”, and select “Move Storage Group Path…”. Follow the wizard. Inside of the wizard, you will choose the new location in both the “Log files path” and “System files path”. Finally, after you have specified the location, it will dismount the database and perform the move function.

Move Storage Group Path Wizard

3) Move Database Path – Now we need to move the actual database path of the “Mailbox Database”. This will actually move the Exchange mailboxes on our server to a new location. To do this, right click on “Mailbox Database” and select “Move database path…”. Follow the wizard. Inside of the wizard, you will choose the new location for the “Database file path”. Finally after you have specified the location, it will dismount the database and perform the move function.

Move Database Path Wizard

4) Move Public Folders (If desired) – If you desire, you can also move your “Public Folders” by performing the same steps for the “Second Storage Group” and the “Public Folder Database”. In my case, our public folders are very small, so I didn’t bother.

 

You have now moved your Exchange 2007 mailbox database.

If you need any assistance or help with SBS, please don’t hesitate to reach out. I provide SBS Consulting Services, more information can be found here: https://www.stephenwagner.com/2020/02/28/microsoft-small-business-server-migration-upgrade/.