MakeItZone ESXi Setup

ESXi Server

Goal: one server-class machine providing multiple student seats, with a dedicated GPU for each seat.

We also tried hand-rolled Linux + KVM, Unraid, and Proxmox. Several worked, but not reliably. ESXi, once working, has been solid. We also had issues with USB passthrough setup.

Hardware

ACS & PCI Lanes

  • be careful with Intel processors, including Xeons: many have relatively few PCIe lanes, so lanes end up shared between devices. Devices on shared PCIe lanes can only be passed through to the same VM
  • you also need ACS (PCIe Access Control Services) working and set up correctly

USB

  • most virtualization systems have quirks with USB, e.g. they can't allocate devices dynamically, or they pass through everything except HID devices (keyboard, mouse)
  • many people end up using VirtualHere
  • XHCI (USB 1, 2, 3) is (probably) your enemy if you want to passthrough chipset/motherboard USB to VMs
  • make sure to configure USB to be EHCI (i.e. USB 1, 2)
  • reasons:
    • our X99 chipset-based motherboard presents only one XHCI controller, and it's ineligible for passthrough. Set up as EHCI, it presents two controllers, both of which can be passed through independently to different VMs
    • XHCI typically supports far fewer endpoints (remember one USB device can have multiple endpoints)
    • USB 3 endpoints require twice the memory of USB 2 endpoints
    • this really affects larger multiseat setups
    • lots of interesting details in this discussion thread

ESXi Setup

  • using version 6.5 due to problems passing through USB cards to VMs

  • enable swap

  • May need to revert SATA AHCI driver if disk performance is bad

    • Article with instructions; VMware KB on disabling native drivers
    • steps from admin shell:
      1. verify available AHCI packages: esxcli software vib list | grep ahci
      2. disable the ESXi native AHCI driver: esxcli system module set --enabled=false --module="vmw_ahci"
      3. verify it’s disabled in config: esxcli system module list | less
      4. reboot ESXi server, verify it’s not loaded (repeat last command)
  • to get USB PCI Card passthrough to fully work (not just be indicated as working in the WebGUI):

    • disable XHCI, enable EHCI
    • may need to disable the native vmkusb USB driver and use the legacy driver. The steps are similar to the AHCI driver steps above. KB article
    • Note: tried switching to the legacy driver; it didn't work. Disabled XHCI and everything started working. Tried re-enabling vmkusb but the system refused. Don't know if vmkusb + EHCI will work.
    • Don't need to fiddle with passthru.map. That's for indicating whether individual functions of multifunction cards can be assigned to separate VMs, and how to reset them. E.g. a NIC with two ports: can you assign each port to a separate VM such that traffic to one is not visible to the other, changing the mode of one doesn't affect the other, etc. Details.
  • don't extend the local datastore across the two SSDs by adding the second one as an extent; there is no load balancing across extents. Instead, create a second datastore and take your best guess as to which drive should hold each VM's backing files.

    • e.g. assign workstations 1 & 3 to one drive, and 2 & 4 to the other.
  • if a datastore can't be removed, check whether swap is enabled and set to that datastore. Also search the advanced settings for scratch and log and make sure they're not set to the datastore. If scratch is, you have to set it to a valid path, e.g. /tmp. Reboot.
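The AHCI driver steps earlier in this list can be collected into one script for the admin shell. This is a minimal sketch rather than a tested procedure: the `module_state` helper is a hypothetical name, and it assumes `esxcli system module list` prints the module name, the loaded flag, and the enabled flag as its first three columns.

```shell
#!/bin/sh
# Sketch: disable the native vmw_ahci driver so ESXi falls back to the legacy one.
MODULE="vmw_ahci"

# Hypothetical helper: reads `esxcli system module list` output on stdin and
# prints "<is_loaded> <is_enabled>" for the named module.
# Assumes the columns are: Name, Is Loaded, Is Enabled.
module_state() {
    awk -v m="$1" '$1 == m { print $2, $3 }'
}

if command -v esxcli >/dev/null 2>&1; then
    # 1. verify available AHCI packages
    esxcli software vib list | grep -i ahci

    # 2. disable the ESXi native AHCI driver
    esxcli system module set --enabled=false --module="$MODULE"

    # 3. verify it's disabled in config (second value should be "false")
    esxcli system module list | module_state "$MODULE"

    # 4. the module stays loaded until the host restarts
    echo "Reboot the host, then repeat step 3; the first value should be false too."
else
    echo "esxcli not found; run this from the ESXi admin shell" >&2
fi
```

Changing `MODULE` to `vmkusb` gives the equivalent of steps 2 to 4 for the native USB driver, though as noted above that route did not help on this hardware.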

How to Upgrade

Rolling Back

  • Reboot and hit Shift-R, enter recovery mode, follow prompts. From virtubytes.

VM Management and Setup

Editing Config Files

Finding the VMware Tools ISO

Moving and Cloning VMs

Removing VMware SVGA Adapter

  • once GPU passthrough is working, we have seen some big performance hits when the VMware video adapter driver is sending a copy of the screen content to the ESXi console. The hits are sometimes random, and happen even if there is no remote console active.
  • disabling in device manager doesn’t always fix this
  • removing the driver reverts the adapter to the generic Microsoft driver, but we still see random performance hits.
  • VMware Tools updates may reinstall the driver
  • most effective fix is to remove the adapter from the VM. Shut down the VM and, using the advanced config, set svga.present to FALSE. Reference.
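If you would rather edit the VM's .vmx file from the admin shell than use the advanced config dialog, the change could look like the sketch below. `set_vmx_option` is a hypothetical helper, not a VMware tool; it assumes the usual `key = "value"` layout of .vmx files and that the VM is shut down before the file is touched.

```shell
#!/bin/sh
# Hypothetical helper: set_vmx_option FILE KEY VALUE
# Replaces any existing line for KEY and appends KEY = "VALUE".
# Assumes the VM is powered off and FILE uses the key = "value" format.
set_vmx_option() {
    file=$1
    key=$2
    value=$3
    tmp="${file}.tmp.$$"

    # Keep a backup of the original config next to the file.
    cp "$file" "${file}.bak"

    # Copy every line whose key (text before "=", blanks removed) differs.
    awk -F= -v k="$key" '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"

    # Append the new setting.
    printf '%s = "%s"\n' "$key" "$value" >> "$tmp"

    # Write back in place so the file keeps its permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Usage would be along the lines of `set_vmx_option /vmfs/volumes/datastore1/ws1/ws1.vmx svga.present FALSE`, where the path is only an example. ESXi may not notice a hand-edited file until the VM's configuration is reloaded, for instance with `vim-cmd vmsvc/reload` and the VM's ID.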

VMs Randomly Lock Up

  • USB PCI passthrough appears to fail once the VM is asleep, so there's no way to wake up the VM(!)
  • check the power management settings in the VM
  • Windows defaults to going to sleep after ~15 minutes
  • solution is to use the High Performance power plan in Windows, or adjust the active power plan so the VM never goes to sleep.