Nathan

My Azure Stack HCI Home Lab - Part 2

Updated: 2 days ago

If you haven't already, please check out Part 1 of this series first. In part 1, I discuss the physical infrastructure being used in the cluster.


Let's start the process of actually building it. The Microsoft documentation is quite good, and this is exactly what I followed in order to build my cluster.


Simply retyping those instructions here would add no real value. Instead, I'll go over the steps at a high level and point out the notes and 'gotchas' that I learned along the way.


 

Deciding on an Architecture


The first step was to figure out what kind of cluster I was going to build. This choice is driven mainly by how many physical nodes you have and how many physical switches you have. Based on my setup, I went with the "2-node, storage switchless, single switch" option. The full architectural details of this option can be found in the Microsoft docs here.


Below is a diagram showing how I tried to architect my cluster (spoiler alert: it doesn't quite work in the end). I intended to put all 3 types of traffic (management, compute, and storage) onto their own dedicated NICs. The cluster management traffic would be on the Intel 2.5Gb NIC, the compute traffic would be on the Intel 10Gb NIC, and the storage traffic would be on the Mellanox 25Gb NIC.



I'll describe later why this didn't work out and why I had to tweak things slightly. But, for now, let's continue on with the cluster build.


 

Active Directory Domain Services


Clusters running Azure Stack HCI 23H2 require a Windows AD DS domain in order to function. While this requirement may be lifted in future versions, for now it's still needed.


I won't go into full details here. But, I spun up a new VM running an evaluation copy of Windows Server 2022. I promoted the VM into a Domain Controller, and I built a brand new AD DS forest & domain.
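For readers who want a starting point, promoting a fresh Windows Server VM into the first Domain Controller of a new forest can be sketched roughly as follows. The domain and NetBIOS names are hypothetical placeholders, not the values used in this lab.

```powershell
# Hypothetical values; replace with your own.
$domainName  = 'hcilab.local'
$netbiosName = 'HCILAB'

# Install the AD DS role together with its management tools.
Install-WindowsFeature -Name AD-Domain-Services -IncludeManagementTools

# Create a brand new forest and domain, with DNS on the same VM.
# The VM reboots automatically once the promotion completes.
Install-ADDSForest `
    -DomainName $domainName `
    -DomainNetbiosName $netbiosName `
    -InstallDns `
    -SafeModeAdministratorPassword (Read-Host -AsSecureString -Prompt 'DSRM password')
```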


The domain needs to be properly prepared. To do that, I downloaded and ran a special PowerShell module. This module does the following:

  • Creates a new Organizational Unit (OU)

  • Creates a new user account, and places it inside the new OU

  • Grants the new user account special permissions to the new OU

  • Blocks Group Policy inheritance on the new OU


Again, this is all clearly laid out in the docs, so please read that if you would like more information.
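After the preparation module has run, the four results listed above can be spot-checked from the Domain Controller. This is only a verification sketch using the standard ActiveDirectory and GroupPolicy modules; the OU path is a hypothetical placeholder, so substitute the one you passed to the module.

```powershell
Import-Module ActiveDirectory
Import-Module GroupPolicy

# Hypothetical OU path; use the one you gave to the preparation module.
$ouPath = 'OU=HCI,DC=hcilab,DC=local'

# 1. The new OU exists.
Get-ADOrganizationalUnit -Identity $ouPath

# 2. The new user account sits inside that OU.
Get-ADUser -Filter * -SearchBase $ouPath | Select-Object Name, Enabled

# 3. List the access control entries on the OU, so you can confirm
#    the new account has been granted permissions there.
(Get-Acl -Path "AD:\$ouPath").Access |
    Select-Object IdentityReference, ActiveDirectoryRights, AccessControlType

# 4. Group Policy inheritance is blocked on the OU.
Get-GPInheritance -Target $ouPath | Select-Object Name, GpoInheritanceBlocked
```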


 

Installing the Stack HCI Operating System


The next step was to install the Azure Stack HCI operating system onto both of my physical nodes. The installation ISO file can be downloaded from the Azure Portal quite easily. The OS is a special, customized version of Windows Server Core. Just like Server Core editions, it comes with only a command-line interface and no GUI.


The install is straightforward. Afterwards, I did some typical configuration steps, such as changing the computer name, setting time zone settings, enabling remote desktop, and configuring drivers.
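Since there is no GUI, these post-install steps can all be done from PowerShell. The following is a minimal sketch, assuming a hypothetical computer name, time zone, and driver folder; the built-in sconfig menu covers some of the same settings if you prefer it.

```powershell
# Hypothetical values; replace with your own.
$newName    = 'HCI-NODE1'
$timeZoneId = 'Eastern Standard Time'
$driverDir  = 'C:\Drivers'

# Set the time zone (Get-TimeZone -ListAvailable shows the valid IDs).
Set-TimeZone -Id $timeZoneId

# Enable Remote Desktop and open the matching firewall rules.
Set-ItemProperty -Path 'HKLM:\System\CurrentControlSet\Control\Terminal Server' `
    -Name 'fDenyTSConnections' -Value 0
Enable-NetFirewallRule -DisplayGroup 'Remote Desktop'

# Stage and install every driver package (.inf) found under the folder.
pnputil /add-driver "$driverDir\*.inf" /subdirs /install

# Rename the computer last, because this step reboots the node.
Rename-Computer -NewName $newName -Restart
```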


Note: do not join your servers to the domain; leave them unjoined for now. Later in the process, they'll be automatically added to the domain for you.


Note: the disks that will be used for cluster storage must be left completely empty. That means no drive letters and no partitions; they must be completely wiped and not initialized in any way. If needed, you can use diskpart to clean a disk.
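As an alternative to diskpart, a disk can be wiped from PowerShell. This is a cautious sketch that assumes a hypothetical disk number; it is destructive, so check the disk list carefully before running the second half.

```powershell
# List all disks so you can identify the ones intended for cluster storage.
Get-Disk | Sort-Object Number |
    Format-Table Number, FriendlyName, Size, PartitionStyle, IsBoot, IsSystem

# Hypothetical disk number; double-check it against the table above.
$diskNumber = 1

$disk = Get-Disk -Number $diskNumber
if ($disk.IsBoot -or $disk.IsSystem) {
    throw "Disk $diskNumber is a boot or system disk; refusing to wipe it."
}

# A disk that already reports RAW is uninitialized and needs no cleaning.
if ($disk.PartitionStyle -ne 'RAW') {
    $disk | Set-Disk -IsOffline $false
    $disk | Set-Disk -IsReadOnly $false
    # Removes all partitions and data, leaving the disk uninitialized.
    $disk | Clear-Disk -RemoveData -RemoveOEM -Confirm:$false
}
```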


I got stuck here for a bit when I was trying to install drivers. I'm so accustomed to the Device Manager GUI, and honestly, I didn't want to bother installing drivers via the command line. So, I took the easy way out and installed an optional "AppCompatibility" pack that installs a few GUI-based programs, such as Windows Explorer and Device Manager. This made my life much easier. It can be installed with the following command: Add-WindowsCapability -Online -Name ServerCore.AppCompatibility~~~~0.0.1.0


Important: when you're ready, you must install the following onto each system:


  • The Hyper-V feature

  • 4 different PowerShell modules

  • Azure Arc

    • This will connect each machine to your Azure Subscription

    • Inside your Azure Resource Group, you'll see new resources created for each of your machines. The resource types will be "Machine - Azure Arc"

      • You'll also see 4 different Arc Extensions added to each machine

    • Azure Stack HCI only supports a small subset of Azure Regions, so it is important that you pick one of those regions in this step.
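As a partial sketch of the first item in the list above, the Hyper-V feature can be enabled and verified from PowerShell. The PowerShell modules and the Arc registration are left to the Microsoft docs, since the exact module names and parameters are not listed here. The final command assumes the Azure Connected Machine agent is already installed and that azcmagent is on the PATH.

```powershell
# Enable Hyper-V along with all of its parent features. A reboot is required.
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All

# After the reboot, confirm the feature state.
Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V |
    Select-Object FeatureName, State

# Once the Arc step from the docs is complete, the agent can report its own
# connection status (assumes azcmagent is installed and on the PATH).
azcmagent show
```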


 

Intel i226 Problems - Drivers, VLANs, RSS


The drivers for the Intel 2.5Gb adapters did not come preinstalled with the OS, so I had to manually install these drivers. And, boy, what a pain these adapters and their drivers were. They caused me multiple problems over the course of my deployment.


Quick note: the 2.5Gb ports on the MS-01 are slightly different from each other (one is the Intel i226-V, the other is the i226-LM).


My first instinct was to try the latest version of the driver, which I downloaded directly from Intel. After the install, I noticed that one of the NICs had a driver installed, while the other did not. After a bit of Googling, it appeared that the i226-V does not have its own unique driver. I saw references to people using the i226-LM driver on the i226-V adapter, so that's what I did. I just had to accept a warning message stating that the driver might not be compatible. After that, it installed successfully.


However, after the drivers were installed I was immediately hit with a "Code 39" error on both NICs, as you can see below. I did some Googling, but quickly realized I was getting nowhere fast. I needed another solution to fix this problem.
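Without Device Manager, the same error can be spotted from PowerShell. This is a small diagnostic sketch; it only reports device state and driver versions and does not fix anything.

```powershell
# Network devices that report a problem. A ConfigManagerErrorCode of 39
# corresponds to the "Code 39" error shown in Device Manager.
Get-CimInstance -ClassName Win32_PnPEntity |
    Where-Object { $_.PNPClass -eq 'Net' -and $_.ConfigManagerErrorCode -ne 0 } |
    Select-Object Name, Status, ConfigManagerErrorCode

# Driver provider, version, and date for each working network adapter.
Get-NetAdapter |
    Format-Table Name, InterfaceDescription, DriverProvider, DriverVersion, DriverDate

# Every third-party driver package currently staged in the driver store.
pnputil /enum-drivers
```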



I tried to use the same version of the driver again. But, instead of using the Windows Server driver I tried the Windows 11 driver. Guess what? It actually worked, and I was not getting the Code 39 errors anymore. Problem solved, right? Nope.


Unfortunately, the Windows 11 version of the driver removes the VLAN capability and settings. Jumping ahead a bit, my first Azure Stack HCI deployment failed because the Win11 version of the driver doesn't support VLANs. After realizing this, I was determined to get the Windows Server version of the driver working. I decided to test out previous versions. I tried each version, one by one, until I found one that actually worked (version 28.2.1). This one finally allowed me to use the Windows Server version of the driver without the Code 39 errors.
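One way to check whether an installed driver exposes a VLAN setting, before attempting a deployment, is to look for the standard VlanID advanced property. This is a sketch that assumes a hypothetical adapter name; a missing property suggests, but does not prove, that the driver lacks VLAN support.

```powershell
# Hypothetical adapter name; run Get-NetAdapter to find yours.
$adapterName = 'Ethernet 2'

# Look for the standardized VlanID keyword among the driver's advanced properties.
$vlan = Get-NetAdapterAdvancedProperty -Name $adapterName `
    -RegistryKeyword 'VlanID' -ErrorAction SilentlyContinue

if ($vlan) {
    $vlan | Format-Table Name, DisplayName, DisplayValue, RegistryKeyword
} else {
    Write-Warning "No VlanID property found on '$adapterName'; this driver may not expose VLAN settings."
}
```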


My problems didn't end there, however. At this point, I tried a second deployment of Azure Stack HCI. But, it failed with yet another networking-related error tied to the i226 adapters. This time the error said that Receive-Side Scaling (RSS) was not enabled.


I went back to Google once again, where I found myself in yet another rabbit hole. I could not find any hard evidence telling me whether the i226-V/LM adapters supported RSS. All I could find was an Intel product brief that made a quick mention of i226 supporting RSS. But that doesn't really mean much: there are multiple versions of the i226 adapter (even standalone PCI-Express cards), and the brief doesn't specify which version it is referring to. I also found that starting with driver version 28.2, Intel decided to remove Windows RSS support for many of their devices, including the i226-V and i226-LM. The readme for driver 28.2 states "End of support for RSS on Microsoft Windows operating systems." I quickly tried driver version 28.1.1, which is the version right before they removed Windows RSS support. Unfortunately, it did not help.
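The RSS state of each adapter can be checked directly on the node, which is quicker than searching product briefs. This is a sketch that assumes a hypothetical adapter name; an adapter whose driver does not expose RSS will typically be missing from the first listing, and the enable command will fail for it.

```powershell
# Hypothetical adapter name; run Get-NetAdapter to find yours.
$adapterName = 'Ethernet 2'

# Show RSS state for every adapter whose driver exposes RSS.
Get-NetAdapterRss | Format-Table Name, InterfaceDescription, Enabled

# Try to enable RSS on one adapter. This only works if the driver supports it.
Enable-NetAdapterRss -Name $adapterName
```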


At this point, I had wasted enough time on these i226 adapters, and I gave up on using them. I changed my architecture to use the Intel 10Gb adapters for combined management & compute traffic. The newly revised architecture looks like this:



 

One final networking 'gotcha' that I will discuss is that you can only configure 1 Default Gateway for your system. Even though it is configured on the per-adapter level, the Default Gateway is a global configuration. If you have more than 1 Default Gateway configured on a system, then you will get an error during the validation phase.


How do you remove a default gateway from the command line? Use this command: Remove-NetRoute -InterfaceAlias "YourAdapterAlias" -NextHop x.x.x.x


Where "YourAdapterAlias" is the alias of the adapter you'd like to modify. And "x.x.x.x" is the IP address of the Default Gateway that you'd like to remove.
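Before removing anything, it helps to list every default route on the system so you can see which adapters have a gateway configured. This is a minimal sketch; the alias and gateway values are hypothetical placeholders, and the removal is scoped to the default route only.

```powershell
# List all IPv4 default routes. More than one row here means more than
# one Default Gateway is configured on the system.
Get-NetRoute -DestinationPrefix '0.0.0.0/0' -AddressFamily IPv4 -ErrorAction SilentlyContinue |
    Format-Table InterfaceAlias, NextHop, RouteMetric

# Hypothetical values; take them from the table above.
$alias   = 'YourAdapterAlias'
$gateway = '192.168.1.1'

# Remove only the default route that points at the unwanted gateway.
Remove-NetRoute -InterfaceAlias $alias -DestinationPrefix '0.0.0.0/0' `
    -NextHop $gateway -Confirm:$false
```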


 

We're almost done with the prerequisites. There's one more thing you'll need to do. You'll need to grant Azure RBAC roles to the account that is going to create the cluster. The account will need certain roles at the Subscription level as well as the Resource Group level. Check the docs for full details.
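Role assignments can be made in the Azure Portal or with the Az PowerShell module. The sketch below assumes the Az.Accounts and Az.Resources modules are installed. Every identifier is a hypothetical placeholder, and the role names are deliberately left for you to fill in from the docs.

```powershell
Connect-AzAccount

# Hypothetical identifiers; replace with your own.
$subscriptionId = '00000000-0000-0000-0000-000000000000'
$resourceGroup  = 'rg-hci-lab'
$signInName     = 'deployer@contoso.com'

# Fill these in with the role names listed in the Microsoft docs.
$subscriptionRoles  = @('<subscription-level role from the docs>')
$resourceGroupRoles = @('<resource-group-level role from the docs>')

$subScope = "/subscriptions/$subscriptionId"
$rgScope  = "$subScope/resourceGroups/$resourceGroup"

foreach ($role in $subscriptionRoles) {
    New-AzRoleAssignment -SignInName $signInName -RoleDefinitionName $role -Scope $subScope
}
foreach ($role in $resourceGroupRoles) {
    New-AzRoleAssignment -SignInName $signInName -RoleDefinitionName $role -Scope $rgScope
}

# Review what the account now holds.
Get-AzRoleAssignment -SignInName $signInName | Format-Table RoleDefinitionName, Scope
```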


 

In Part 3 of the series, I will finally deploy the Azure Stack HCI cluster.
