Run VH Sandbox Example

Disclaimer: Preview version; content and features subject to change. 


The VH Sandbox is the main example project of VHToolkit 2.0, showcasing how characters can be configured, instantiated, and edited.
This tutorial is geared toward researchers and developers who want to gain familiarity with VHToolkit 2.0.

VH Sandbox Example


Shows how Virtual Humans use a flexible set of backend solutions to demonstrate conversational behavior.

Leverages RIDE’s generic C# interfaces for Speech Recognition (ASR), Natural Language Processing (NLP), Text-To-Speech (TTS) and Non-verbal Behavior Generation (NVBG). Optionally includes sentiment analysis and entity analysis for NLP and Sensing (computer vision) with support for emotion and landmark recognition.

Initially supported services:

  • ASR
    • Windows Dictation Recognition (Native Windows service)
    • Azure Speech Recognition (Google Speech SDK)
    • Mobile Speech Recognition (KKSpeech Recognizer Asset)
  • NLP
    • Azure QnA
    • AWS Lex v1 & v2
    • OpenAI GPT3, Turbo GPT (3.5), and GPT4
  • TTS and TTS Voice
    • AWS Polly Text To Speech and TTS Voice
    • Eleven Labs Text To Speech and TTS Voice
    • Windows TTS (External Windows Process)
  • NVBG
    • NVBG System (External Windows Process)
    • NVBG System (Local Integrated Class Library, Windows Only)
    • NVBG System (Remote RESTful Service)
  • Sensing
    • Azure Face
    • AWS Rekognition

Supplemental Virtual Human Conversational Behavior:

  • Gaze, Saccade
  • Listening
  • Emotion Mirroring

Review the Supported Services by Character and Platform section for which combinations of the above are currently expected to function.

Additionally, this scenario can use custom, user-created backend implementations for any of the above systems; doing so requires familiarity with RIDE’s generic C# interfaces.

How to Use

Run the scene and use the sandbox menu to control the scenario parameters.

Environment Tab

  • Choose between six flat background images, including the USC ICT logo, a 3D lobby scene, and landscape photos.
  • Choose among three locations within the 3D SeaView environment.

VH Config Tab

  • Create VH: In this sub-tab, you can configure the backends you want to test before creating the Virtual Human by selecting the Create button.
    • Character ID is prepopulated for each VH and is customizable; note, each VH must have a unique name.
    • TTS Voice list is based on the selected TTS backend.
    • Select Compare & Contrast to randomize the parameters prior to creation.
    • Virtual Humans are created and placed within player FOV.

  • Edit VH: In this sub-tab, you can swap the backends for already created Virtual Humans.
    • VH model template cannot be modified after creation.
    • VH position and rotation can be modified here.
    • Select between instanced VH characters from the dropdown. 
    • Delete problematic or unwanted VH characters.

  • Save/Load VH: In this sub-tab, you can save and load configured Virtual Humans to/from local storage.
    • Saved VH configurations are identified by the VH name.
    • VH configurations are saved as JSON. If in-scene backend objects have their name changed, the previously saved configuration will not load properly.
    • If you load a VH configuration for an already created or loaded VH, the existing VH config will be overwritten by the loaded configuration.
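As an illustration of why renaming backend objects breaks loading: saved configurations reference backends by their in-scene object names. The sketch below is hypothetical — the field names are assumptions, so inspect a file actually saved by the sandbox for the real schema:

```json
{
  "characterId": "Kevin Civilian",
  "modelTemplate": "Rocketbox – Male",
  "position": { "x": 0.0, "y": 0.0, "z": 1.5 },
  "rotation": { "x": 0.0, "y": 180.0, "z": 0.0 },
  "backends": {
    "asr": "Azure Speech Recognition",
    "nlpMain": "OpenAI GPT4",
    "tts": "AWS Polly TTS",
    "ttsVoice": "AWS – Kevin",
    "nvbg": "NVBG System (Remote RESTful Service)",
    "sensing": "Azure Face"
  }
}
```

If a backend GameObject named in such a file no longer exists under that name, the loader cannot resolve the reference, which is the failure mode described above.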

Conversation Tab

  • This tab provides conversation analysis and debugging.
  • Tracks the current VH the user is having a conversation with.
  • Has two sub-tabs, Analysis and Debug Info, with more in-depth information about the current conversation.

User Input

  • Text Input: Use the designated input field and the Submit button to send a query to the current conversational VH.
  • Audio Input: Press and hold the Push-To-Talk button and begin speaking.
    • Input field will display detected partial speech (if supported by the current ASR backend).
    • Query will be sent either once you release the push-to-talk button, or automatically if complete speech is detected, based on the current ASR system.

Toggle Webcam

  • Turns on the webcam and begins sensing for the backend of the current conversational VH.
  • Requires a created VH with an assigned Sensing backend.
  • Webcam overlay displays the following information:
    • Detected face rectangle
    • Detected landmarks
    • Detected head orientation
    • Detected emotion

Key bindings:

  • Toggle Sandbox Menu
  • Push-to-talk shortcut
  • Toggle mouse cursor control
  • Toggle webcam shortcut
  • Cycle Sandbox Menu tab
  • Create Virtual Human shortcut
Scene Location & Name


Setup Requirements 

The main script is located at Assets/Scripts/VHSandboxExample.cs.

Below are the steps to add a custom VH backend implementation to the sandbox:

  1. Create your custom backend script that implements one of the following RIDE C# interfaces:
    1. ASR – ISpeechRecognitionSystem
    2. NLP – INLPQnASystem
    3. TTS – ILipsyncedTextToSpeechSystem
    4. NVBG – INonverbalGeneratorSystem
    5. Sensing – ISensingSystem
  2. Create a Unity GameObject with a user-friendly name and attach the created script.
    1. The GameObject should be placed in the scene under the “VH Backends” object and should not have the same name as another backend GameObject.
  3. To confirm the previous steps have been completed correctly, play the scene and check that the created GameObject’s name is displayed under the relevant backend dropdown.
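The steps above can be sketched as follows. This is a minimal illustration, not toolkit code: the interface implementation is commented out because the actual members of INLPQnASystem are not reproduced here — consult the interface definition in the RIDE source for the real namespace and signatures before implementing.

```csharp
using UnityEngine;

// Hypothetical sketch of a custom NLP backend for the sandbox.
// The member below is a placeholder, not the real INLPQnASystem contract.
public class EchoQnASystem : MonoBehaviour /*, Ride.INLPQnASystem */
{
    // Placeholder: answer a user query by echoing it back.
    public void AskQuestion(string question, System.Action<string> onAnswer)
    {
        onAnswer?.Invoke($"You said: {question}");
    }
}
```

In the editor, attach this script to a uniquely named GameObject under the “VH Backends” object; per step 3, that name should then appear in the NLP backend dropdown at runtime.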

Known Issues and Troubleshooting


Windows speech recognizer unavailable
  1. Ensure that under Settings > Speech, the Online speech recognition option is switched On


App dependencies blocked from running by versions of macOS

Due to macOS security features, the OS may block certain dependencies at run time. Use the following workaround for any prompts received:

  1. Open a terminal window at the IVA project folder
  2. Edit the file attributes: xattr -cr "Assets/Ride_Dependencies (local)/SpeechSDK/Plugins/MacOS/libMicrosoft.CognitiveServices.Speech.core.dylib"
  3. Edit the file attributes: xattr -cr "Assets/Ride_Dependencies (local)/Oculus/Oculus/LipSync/Plugins/MacOSX/OVRLipSync.bundle"
  4. Play the VHSandbox – Minimal scene again

Speech input and camera feed not available
  1. If prompted by the OS, enable access to your microphone and webcam


Config file out of date and errors when running any example scene 
  1. Update the config file by launching LevelSelect > ExampleLLM scene
  2. Open the debug menu:
    • Windows, press F11 key
    • Mac, press (Command +) F11 key; note, may need to enable the option, System Preferences > Keyboard > “Use F1… as standard function keys”
  3. Config menu appears; if not, click header or arrow of debug menu
  4. Click Reset to Defaults button
  5. Return to LevelSelect and choose desired scene again
Services unavailable and various errors when running any example scene 
  1. Ensure the PC has an active Internet connection; most services require Internet, except for “local” or “external” Microsoft/Windows services

Example Scene

  • AWS Lex v1 selected as both NLP Main and NLP Sentiment will fail and soft-lock input section of UI; as workaround, create new VH, then delete the problematic VH
    • Also occurs with NLP Main as AWS Lex v1 and NLP Sentiment as AWS Lex v2; as workaround, do not use AWS Lex as options in both fields
      • Best practice if using AWS Lex, populate both fields with AWS Lex v2
  • Character utterance audio may repeat and not match text for ElevenLabs voices
  • Delay from click & hold until mic input status changes to “listening”
  • Mic input may fail to function with initial click & hold
  • SeaView environment locations run very slow on lower-end hardware
  • Rocketbox characters do not display face mirroring
  • RenderPeople Test character does not display lipsync
  • Mobile Speech Recognizer does not function on desktop systems
  • Webcam view may fail to update with creation and/or deletion of multiple characters
  • AWS Polly TTS, corresponding TTS Voice options may fail to populate
  • Windows TTS (External Windows Process) may cause strong head nod at start of utterance for characters
  • NLP Sentiment as AWS Lex v1 or v2 and TTS as ElevenLabs TTS causes VH unresponsiveness and soft-lock input section of UI
  • Rocketbox characters with NLP Main set as AWS Lex v1 or v2 and TTS  as ElevenLabs TTS may cause first utterance to repeat twice
  • Binary only: NVBG System (Local Integrated Class Library, Windows Only) causes VH unresponsiveness and soft-lock input section of UI

Supported Services by Character and Platform

Note: * denotes a known issue.

| Category | Service | Rocketbox – Male / Kevin Civilian | Rocketbox – Female / Davis OCP | Rocketbox – Male_2 / Render People Test | Windows 10/11 | MacOS |
|---|---|---|---|---|---|---|
| Speech Recognizer | Windows Speech Recognizer | | | | | |
| | Azure Speech Recognizer | N/A | N/A | N/A | Yes | Yes |
| | Mobile Speech Recognizer | N/A | N/A | N/A | N/A | N/A |
| NLP Main | OpenAI TurboGPT | Yes | Yes | Yes | Yes | Yes |
| | Azure QnA | Yes | Yes | Yes | Yes | Yes |
| | AWS Lex v1 | Yes* | Yes* | Yes* | Yes* | Yes* |
| | AWS Lex v2 | Yes* | Yes* | Yes* | Yes* | Yes* |
| | OpenAI GPT3 | Yes | Yes | Yes | Yes | Yes |
| | OpenAI GPT4 | Yes | Yes | Yes | Yes | Yes |
| NLP Sentiment | Azure TA Sentiment | Yes | Yes | Yes | Yes | Yes |
| | AWS Lex v1 | Yes* | Yes* | Yes* | Yes* | Yes* |
| | AWS Lex v2 | Yes* | Yes* | Yes* | Yes* | Yes* |
| NLP Entities | Azure TA Entities | Yes | Yes | Yes | Yes | Yes |
| Nonverbal Generation | NVBG System (External Windows Process) | Yes | Yes | Yes | Yes | N/A |
| | NVBG System (Local Integrated Class Library, Windows Only) | No* | No* | No* | No* | N/A |
| | NVBG System (Remote RESTful Service) | Yes | Yes | Yes | Yes | Yes |
| TTS | AWS Polly TTS | Yes | Yes | Yes* | Yes | Yes |
| | ElevenLabs TTS – Proxy | Yes | Yes | Yes* | Yes | Yes |
| | ElevenLabs TTS – Auto | Yes | Yes | Yes* | Yes | Yes |
| | ElevenLabs v2 TTS – Auto | Yes | Yes | Yes* | Yes | Yes |
| | Windows TTS (External Windows Process) | Yes* | Yes* | Yes* | Yes | N/A |
| TTS Voice | AWS – Kevin | Yes | Yes | Yes | Yes | Yes |
| | AWS – Salli | | | | | |
| | Eleven Labs P – Rachel | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs P – Clyde | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs P – Arno | | | | | |
| | Eleven Labs P – Barack_Obama | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs A – Rachel | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs A – Clyde | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs A – Arno | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs A – Barack_Obama | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs v2 A – Rachel | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs v2 A – Clyde | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs v2 A – Arno | Yes | Yes | Yes | Yes | Yes |
| | Eleven Labs v2 A – Barack_Obama | Yes | Yes | Yes | Yes | Yes |
| | Microsoft – David | Yes | Yes | Yes | Yes | N/A |
| | Microsoft – Zira | Yes | Yes | Yes | Yes | N/A |
| Sensing | AWS Rekognition | | | | | |
| | Azure Face | Yes | No* | No* | Yes | Yes |