ONNX, ONNX Runtime, and TensorRT
What is ONNX?
ONNX (Open Neural Network Exchange) defines a common set of operators – the building blocks of machine learning and deep learning models – and a common file format, enabling AI developers to use models across a variety of frameworks, tools, runtimes, and compilers.
ONNX Design Principles
- Supports DNNs but also allows for traditional ML
- Flexible enough to keep up with rapid advances
- Compact and cross-platform representation
- Standardized list of well-defined operators informed by real-world usage
Export to ONNX
- TensorFlow to ONNX: via the tf2onnx converter (see the first sketch below)
- PyTorch to ONNX: via the built-in torch.onnx exporter (see the second sketch below)
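A minimal sketch of the TensorFlow path, assuming TensorFlow 2.x with tf.keras and the tf2onnx package; the toy model and the model.onnx path are placeholders:

```python
import tensorflow as tf
import tf2onnx

# Hypothetical toy model; replace with your own trained model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3), name="input"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert the Keras model to ONNX and write it to disk
onnx_model, _ = tf2onnx.convert.from_keras(
    model, opset=13, output_path="model.onnx"
)
```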
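And a minimal sketch of the PyTorch path; the ResNet-18 model (from torchvision), input shape, and file name are illustrative:

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)
model.eval()  # switch to inference mode before exporting

# A dummy input with the expected shape is used to trace the graph
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```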
Load ONNX model
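Loading a saved model back with the onnx Python package is straightforward, and checking it is good practice; model.onnx is a placeholder path:

```python
import onnx

# Load the serialized model and verify it is a well-formed ONNX graph
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

# Print a human-readable summary of the graph
print(onnx.helper.printable_graph(model.graph))
```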
What is ONNX Runtime?
It is a high-performance inference engine for ONNX models, created and open-sourced by Microsoft under the MIT License. It is designed to accelerate machine learning across a wide range of frameworks, operating systems, and hardware platforms.
ONNX Runtime was designed with a focus on performance and scalability in order to support heavy workloads in high-scale production scenarios, and it runs on a range of operating systems and hardware platforms. Its Execution Providers enable easy integration with hardware accelerators.
Installation
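ONNX Runtime is distributed as pip packages; a typical install (CPU-only or CUDA-enabled) looks like:

```
pip install onnxruntime       # CPU build
pip install onnxruntime-gpu   # GPU (CUDA) build
```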
Create an inference session to run an ONNX model in ONNX Runtime
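A minimal sketch of creating a session; the providers list here is an assumption and falls back to CPU when CUDA is unavailable:

```python
import onnxruntime as ort

# Execution Providers are tried in priority order
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```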
Run the session
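Continuing from the session above; the input shape (1, 3, 224, 224) is an assumption for this sketch:

```python
import numpy as np

# Query the model's declared input name, then feed a dummy tensor
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Passing None for the output names returns every model output
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```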
ONNX Tool (Netron)
Netron is an open-source, multi-platform visualizer for saved models. It supports many file formats for deep learning, machine learning, and neural network models.
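Besides the desktop and browser apps, Netron can be launched from Python, for example:

```python
import netron

# Serves the visualization at a local address and opens the browser
netron.start("model.onnx")
```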
NVIDIA TensorRT
It is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
TensorRT-based applications can run up to 40 times faster than CPU-only platforms during inference. With it, we can optimize the performance of neural network models trained in all major frameworks.
Features
Precision Calibration
- Maximizes throughput with FP16 or INT8 by quantizing models while preserving accuracy
- Quantization is an optimization method in which model parameters and activations are converted from a floating-point representation to a lower-precision one, e.g., from FP32 to FP16 or INT8.
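As a sketch, requesting FP16 through the TensorRT Python API might look like the following (assuming TensorRT 8.x and a model.onnx file; INT8 additionally requires a calibrator):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX model into a TensorRT network definition
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Let TensorRT choose FP16 kernels where they preserve accuracy
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

serialized_engine = builder.build_serialized_network(network, config)
```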
Layer & Tensor Fusion
- It combines several kernels so that they execute as one, which is why it is also called kernel fusion
- Kernel fusion is further classified into two types: Vertical Fusion and Horizontal Fusion
- In Vertical Fusion, layers with unused output are eliminated to avoid unnecessary computation
- In Horizontal Fusion, layers that take the same source tensor and apply the same operations with similar parameters are combined into a single larger layer for higher computational efficiency.
Kernel Auto-Tuning
- Selects the best data layouts and algorithms based on the target GPU platform
Multi-Stream Execution
- Processes multiple input streams in parallel
Dynamic Tensor Memory
- Memory is allocated for each tensor and only for the duration of its usage.
TensorRT is also integrated with application-specific SDKs such as NVIDIA DeepStream, Riva, Merlin™, Maxine™, and Broadcast Engine to provide developers with a unified path to deploy intelligent video analytics, conversational AI, recommender systems, video conferencing, and streaming apps in production.