Nvidia has announced that it has accelerated Microsoft’s new Phi-3 Mini open language model with Nvidia TensorRT-LLM, an open-source library for optimising large language model inference on Nvidia GPUs, from PC to cloud.

Phi-3 Mini packs the capability of models 10 times its size and is licensed for both research and broad commercial use, advancing Phi-2 from its research-only roots. Workstations with Nvidia RTX GPUs or PCs with GeForce RTX GPUs have the performance to run the model locally using Windows DirectML or TensorRT-LLM.

The model has 3.8 billion parameters and was trained on 3.3 trillion tokens in only seven days on 512 Nvidia H100 Tensor Core GPUs.
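To put those training figures in perspective, the short Python sketch below works out the implied GPU-hours and average throughput. It is a back-of-the-envelope estimate only: it assumes all 512 GPUs were busy for the full seven days and that each token was counted once, neither of which is confirmed here. The function and constant names are illustrative.

```python
# Back-of-the-envelope estimate from the quoted training figures.
# Assumption: all GPUs ran for the full seven days with no downtime.

TRAINING_TOKENS = 3.3e12   # tokens, as quoted
NUM_GPUS = 512             # Nvidia H100 GPUs, as quoted
TRAINING_DAYS = 7          # as quoted


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours consumed if every GPU runs for the whole period."""
    return num_gpus * days * 24


def tokens_per_gpu_second(tokens: float, num_gpus: int, days: float) -> float:
    """Average tokens processed per GPU per second over the whole run."""
    seconds = days * 24 * 60 * 60
    return tokens / (num_gpus * seconds)


total_gpu_hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
throughput = tokens_per_gpu_second(TRAINING_TOKENS, NUM_GPUS, TRAINING_DAYS)

print(f"GPU-hours: {total_gpu_hours:,.0f}")
print(f"Average tokens per GPU per second: {throughput:,.0f}")
```

Under these assumptions the run amounts to 86,016 GPU-hours, or somewhat more than 10,000 tokens per GPU per second on average.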

Phi-3 Mini has two variants, one supporting 4,000 tokens and the other 128,000 tokens; the latter is the first model in its class to support such long contexts. Developers can therefore use up to 128,000 tokens (the atomic parts of language that the model processes) when asking the model a question, which results in more relevant responses.

Developers working on autonomous robotics and embedded devices can learn to create and deploy generative AI through community-driven tutorials, such as those on Jetson AI Lab, and deploy Phi-3 on Nvidia Jetson.

With only 3.8 billion parameters, the Phi-3 Mini model is compact enough to run efficiently on edge devices. Parameters are like knobs held in memory that have been precisely tuned during the model training process so that the model can respond with high accuracy to input prompts.
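Because parameters live in memory, the parameter count translates directly into a minimum memory footprint. The Python sketch below estimates the space needed for the weights alone at a few common numeric precisions. It is a rough guide rather than a measurement: it ignores the key-value cache, activations and runtime overhead, and the precisions listed are illustrative, not a statement of which formats any given runtime ships for Phi-3 Mini.

```python
# Rough weight-only memory estimate for a 3.8-billion-parameter model.
# Assumption: footprint = parameter count x bytes per parameter; the
# KV cache, activations and framework overhead are not included.

PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted in the article

# Storage width of one parameter at each precision, in bytes.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "FP8": 1.0,
    "INT4": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, width in BYTES_PER_PARAM.items():
    size = weight_memory_gb(PHI3_MINI_PARAMS, width)
    print(f"{precision}: {size:.1f} GB")
```

On this estimate the weights take about 15.2 GB at FP32, 7.6 GB at FP16, 3.8 GB at FP8 and 1.9 GB at 4-bit precision, which illustrates why lower-precision formats matter on memory-limited edge devices.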

Phi-3 can assist in cost- and resource-constrained use cases, especially for simpler tasks. The model can outperform some larger models on key language benchmarks while delivering results within latency requirements.

TensorRT-LLM will support Phi-3 Mini’s long context window and uses many optimisations and kernels, such as LongRoPE, FP8 and in-flight batching, which improve inference throughput and latency.

The TensorRT-LLM implementations will soon be available in the examples folder on GitHub. There, developers can convert to the TensorRT-LLM checkpoint format, which is optimised for inference and can be easily deployed with Nvidia Triton Inference Server.