CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA with coordination from Johns Hopkins. This work was intended to demonstrate efficient GPU utilization across a range of NVIDIA hardware from SM_35 onward. The following guide describes how to use and tune the decoder for your models.

A single speech-to-text decode does not provide enough work to fully saturate a modern NVIDIA GPU. To saturate the GPU we need to decode many audio files concurrently. The solution provided here does this through a combination of batching many audio files into a single speech pipeline, running multiple pipelines in parallel on the device, and using multiple CPU threads to perform feature extraction and determinization.

Users of the decoder need a high-level understanding of the underlying implementation in order to tune it. The interface to the decoder is defined in "batched-threaded-cuda-decoder.h". A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc". Below is a simple usage example.

/*
 *  BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 *  batchedDecoderConfig.Register(&po);
 *  po.Read(argc, argv);
 *  ...
 *  BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 *  CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 *  ...
 *  std::vector<std::string> processed;
 *  for (; !wav_reader.Done(); wav_reader.Next()) {
 *    std::string key = wav_reader.Key();
 *    CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *    processed.push_back(key);
 *    ...
 *  }
 *
 *  while (!processed.empty()) {
 *    std::string key = processed.back();
 *    processed.pop_back();
 *    CompactLattice clat;
 *    CudaDecoder.GetLattice(key, &clat);
 *    CudaDecoder.CloseDecodeHandle(key);
 *    ...
 *  }
 *
 *  CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig and register its options. This enables us to tune the configuration options. Next we declare the CudaDecoder with that configuration. Before we can use the CudaDecoder we need to initialize it with an FST, an AmNnetSimple, and a TransitionModel. Next we iterate through the waves and enqueue them into the decoder by calling OpenDecodeHandle. Note that the key must be unique for each decode. Once we have enqueued work we can query the results by calling GetLattice with the same key we used to open the handle. GetLattice will automatically wait for processing to complete before returning. The key to good performance is to have many decodes active at the same time by opening many decode handles before querying for the lattices.

PERFORMANCE TUNING:

The CudaDecoder has a number of tuning parameters which can be used to increase performance on various models and hardware. Note that the optimal parameters are expected to vary according to the hardware, the model, and the data being decoded. Each parameter is briefly described below; an illustrative command line using these options follows the parameter descriptions.

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks such as determinization and feature extraction. For best performance this should use all spare CPU threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work. Draining the batch allows nnet3 to be better batched. Testing has indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice. If this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be queued but not yet assigned before OpenDecodeHandle automatically stalls the submitting thread. Raising this increases CPU resource usage. This should be set to at least a few thousand.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens preallocated in host buffers. If this size is exceeded the buffers are reallocated larger, consuming more resources.
  max-tokens-per-frame: Maximum number of tokens in GPU memory per frame. If this value is exceeded the beam tightens and accuracy may decrease.
  max-active: At the end of each frame's computation, only the best max-active tokens (arc instantiations) are kept.

Device Options:
  use-tensor-cores: Enables tensor core (fp16) math for GEMMs. This is faster but less accurate. For inference the loss of accuracy is marginal.
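As an illustration of how these options fit together, the sketch below passes them as flags to the batched-wav-nnet3-cuda binary mentioned in the introduction, assuming the flag names follow the option names listed above (Kaldi's option parser exposes a registered option such as cuda-control-threads as --cuda-control-threads). The values and file names are placeholders for the sake of the example, not tuned recommendations; feature-extraction options are omitted, and the positional arguments should be checked against the binary's usage message for your Kaldi version.

  batched-wav-nnet3-cuda \
    --cuda-control-threads=4 --cuda-worker-threads=20 \
    --max-batch-size=50 --batch-drain-size=10 \
    --max-outstanding-queue-length=4000 \
    --beam=15.0 --lattice-beam=8.0 --max-active=10000 \
    final.mdl HCLG.fst scp:wav.scp ark:lat.ark

With 4 control threads and a batch size of 50, roughly 200 decodes can be in flight on the GPU at once, matching the 16 GB GPU example discussed in the memory section below.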
GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory; consumer GPUs have much less. For best performance users should run as many concurrent decodes as possible, so users should choose GPUs with as much memory as possible. GPUs with less memory may have to sacrifice either performance or accuracy.

On 16 GB GPUs, for example, we are able to support around 200 concurrent decodes at a time. This translates into 4 cuda-control-threads and a max-batch-size of 50 (4x50). If your model is larger or smaller than the models we used in testing, you may have to lower or raise these values accordingly.

There are a number of parameters which can be used to control GPU memory usage. How they impact memory usage and accuracy is discussed below:

  max-tokens-per-frame: Controls how many tokens can be stored on the GPU for each frame. This buffer cannot be reallocated, so its size cannot be exceeded. As the buffer gets closer to being exhausted the beam is reduced, possibly reducing quality. This should be tuned according to the model and data. For example, a highly accurate model could set this value smaller to enable more concurrent decodes.

  cuda-control-threads: Each control thread is a concurrent pipeline, so GPU memory usage scales linearly with this parameter. This should always be at least 2, but should probably not be higher than 4, as more concurrent pipelines lead to more driver contention, reducing performance.

  max-batch-size: The number of concurrent decodes in each pipeline. Memory usage also scales linearly with this parameter. Setting this smaller reduces kernel runtime but increases launch latency overhead. Ideally this should be as large as possible while still fitting into memory. Note that the maximum currently allowed is 200.

== Acknowledgement ==

We would like to thank Daniel Povey, Zhehuai Chen, and Daniel Galvez for their help and expertise during the review process.