NVIDIA Teams Up With Google on Gemma Open Language Models

<p>NVIDIA, in collaboration with Google, has launched optimizations across all NVIDIA AI platforms for <a href="https://blog.google/technology/developers/gemma-open-models/">Gemma</a> — Google’s state-of-the-art new lightweight <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/gemma-2b">2 billion</a>- and <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/gemma-7b">7 billion</a>-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.</p>

<p>Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create the Gemini models — with <a href="https://github.com/NVIDIA/TensorRT-LLM">NVIDIA TensorRT-LLM</a>, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud and on PCs with <a href="https://www.nvidia.com/en-us/geforce/rtx/">NVIDIA RTX</a> GPUs.</p>

<p>This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.</p>

<p>Developers can also run Gemma on NVIDIA GPUs in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU and, soon, NVIDIA’s <a href="https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform">H200 Tensor Core GPUs</a> — featuring 141GB of HBM3e memory at 4.8 terabytes per second — which Google will deploy this year.</p>

<p>Enterprise developers can additionally take advantage of NVIDIA’s rich ecosystem of tools — including <a href="https://www.nvidia.com/en-us/data-center/products/ai-enterprise/">NVIDIA AI Enterprise</a> with the <a href="https://github.com/NVIDIA/NeMo">NeMo framework</a> and <a href="https://github.com/NVIDIA/TensorRT-LLM">TensorRT-LLM</a> — to fine-tune Gemma and deploy the optimized model in their production applications.</p>

<p>Learn more about how <a href="https://developer.nvidia.com/blog/nvidia-tensorrt-llm-revs-up-inference-for-google-gemma/">TensorRT-LLM is revving up inference for Gemma</a>, along with additional information for developers, including several Gemma model checkpoints and an FP8-quantized version of the model, all optimized with TensorRT-LLM.</p>

<p>Experience <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/gemma-2b">Gemma 2B</a> and <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/gemma-7b">Gemma 7B</a> directly from your browser on the NVIDIA AI Playground.</p>
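<p>For developers who want a feel for that workflow, the sketch below shows roughly how a Gemma checkpoint might be served through TensorRT-LLM’s high-level Python API. It is a minimal illustration, not NVIDIA’s reference setup: it assumes a recent TensorRT-LLM release that ships the <code>LLM</code> and <code>SamplingParams</code> convenience classes, and exact class and argument names can vary between versions.</p>

```python
# Minimal sketch: serving Gemma through TensorRT-LLM's high-level Python API.
# Assumes a recent TensorRT-LLM release exposing the LLM/SamplingParams
# convenience classes; names and arguments may differ across versions.
from tensorrt_llm import LLM, SamplingParams

# Point at the Gemma 2B weights (downloaded locally or pulled from a model
# hub); TensorRT-LLM builds or loads an optimized engine for the target GPU.
llm = LLM(model="google/gemma-2b")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Summarize what FP8 quantization does."], params)

for output in outputs:
    print(output.outputs[0].text)
```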
<h2><b>Gemma Coming to Chat With RTX</b></h2>

<p>Adding support for Gemma soon is <a href="https://blogs.nvidia.com/blog/chat-with-rtx-available-now/">Chat with RTX</a>, an NVIDIA tech demo that uses <a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/">retrieval-augmented generation</a> and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs.</p>

<p>Chat with RTX lets users personalize a chatbot with their own data by easily connecting local files on a PC to a large language model.</p>

<p>Since the model runs locally, it delivers results quickly, and user data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without needing to share it with a third party or have an internet connection.</p>
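<p>Conceptually, the retrieval step works like any retrieval-augmented generation pipeline: index the user’s local files, find the passages most relevant to a question, and prepend them to the prompt so the model answers from that material. The sketch below is a generic, simplified version of that loop, not Chat with RTX’s actual implementation; the <code>generate</code> call at the end is a hypothetical stand-in for whatever locally running LLM produces the answer.</p>

```python
# Generic RAG loop over local text files: chunk, retrieve by TF-IDF
# similarity, and ground the prompt in the retrieved passages. This is an
# illustrative sketch, not Chat with RTX's actual pipeline.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def load_chunks(folder: str, size: int = 500) -> list[str]:
    """Split every .txt file under `folder` into fixed-size character chunks."""
    chunks = []
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        chunks += [text[i:i + size] for i in range(0, len(text), size)]
    return chunks

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question under TF-IDF."""
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    doc_vecs = vectorizer.transform(chunks)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vecs)[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Ground the model's answer in the retrieved local passages."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

# Example usage:
# chunks = load_chunks("./my_documents")
# question = "What does the Q3 report say about margins?"
# prompt = build_prompt(question, retrieve(question, chunks))
# answer = generate(prompt)  # hypothetical call into a locally running LLM
```

<p>Because every step above runs on the user’s machine, nothing about the documents or the question ever leaves the device, which is the core privacy point of the local-first design.</p>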
