Apple Partners with NVIDIA to Explore Enhanced LLM Performance


In a recent blog post, Apple engineers shared fresh insights into their partnership with NVIDIA, which is aimed at accelerating text generation with large language models.

This year, Apple introduced and open-sourced its Recurrent Drafter (ReDrafter) technique. This approach to text generation with LLMs is notably faster and, Apple says, delivers “state of the art performance.” It combines two strategies: beam search, to explore multiple candidate continuations, and dynamic tree attention, to handle those candidates efficiently.
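ReDrafter belongs to the broader family of speculative (draft-and-verify) decoding methods, where a cheap draft model proposes several tokens that the expensive target model then verifies in one pass. The following is a minimal toy sketch of that general loop, not Apple's actual implementation: `draft_model` and `target_model` are hypothetical stand-ins using trivial deterministic rules, and the real ReDrafter additionally uses an RNN draft head, beam search, and dynamic tree attention.

```python
def draft_model(context, k=4):
    """Cheaply propose k candidate next tokens (toy deterministic rule)."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def target_model(context):
    """The expensive model's greedy next token (again, a toy rule)."""
    return (context[-1] + 1) % 100

def speculative_decode(context, num_tokens, k=4):
    """Draft-and-verify loop: accept drafted tokens while they match
    the target model; on the first mismatch, take the target's token
    and draft again from the corrected context."""
    out = list(context)
    while len(out) - len(context) < num_tokens:
        for tok in draft_model(out, k):
            if target_model(out) == tok:
                out.append(tok)          # draft verified, accepted for free
            else:
                out.append(target_model(out))  # mismatch: correct and redraft
                break
            if len(out) - len(context) >= num_tokens:
                break
    return out[len(context):]

print(speculative_decode([0], 6))  # → [1, 2, 3, 4, 5, 6]
```

Because every accepted token is one the target model would have produced greedily anyway, the output is identical to ordinary greedy decoding; the speedup comes from verifying several drafted tokens per expensive model call instead of one.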

While initial research yielded promising outcomes, Apple joined forces with NVIDIA to implement ReDrafter in a real-world setting. This collaboration involved integrating ReDrafter into NVIDIA TensorRT-LLM, a tool designed to increase the speed of LLM operations on NVIDIA GPUs.

Here are the findings:

To facilitate the integration of ReDrafter, NVIDIA introduced new operators and enhanced existing ones, significantly elevating TensorRT-LLM’s ability to support complex models and decoding techniques. Machine learning developers utilizing NVIDIA GPUs can now effortlessly take advantage of ReDrafter’s expedited token generation for their production LLM applications via TensorRT-LLM.

Benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, showed a 2.7x speedup in tokens generated per second for greedy decoding. These benchmark results suggest the technology could meaningfully reduce user-perceived latency while requiring fewer GPUs and consuming less power.
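To make the latency implication of that 2.7x figure concrete, here is a small back-of-the-envelope calculation. The baseline throughput of 100 tokens/second and the 540-token response length are hypothetical numbers chosen for illustration; only the 2.7x multiplier comes from the reported benchmark.

```python
SPEEDUP = 2.7                 # reported tokens/sec improvement (greedy decoding)
baseline_tps = 100.0          # hypothetical baseline throughput, tokens/sec
redrafter_tps = SPEEDUP * baseline_tps

response_tokens = 540         # hypothetical response length

baseline_latency = response_tokens / baseline_tps    # 5.4 s
redrafter_latency = response_tokens / redrafter_tps  # 2.0 s

print(f"baseline: {baseline_latency:.1f} s, with ReDrafter: {redrafter_latency:.1f} s")
# → baseline: 5.4 s, with ReDrafter: 2.0 s
```

Equivalently, serving the same token throughput needs roughly 1/2.7 of the GPU time, which is where the claimed GPU-count and power savings come from.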

“As large language models become integral to production applications, enhancing inference efficiency has the potential to lower computational costs and diminish latency for users,” conclude Apple’s machine learning researchers. “With ReDrafter’s innovative approach to speculative decoding now integrated within the NVIDIA TensorRT-LLM framework, developers can enjoy quicker token generation on NVIDIA GPUs for their production LLM applications.”

For further details, read the full post on Apple’s website and the companion blog post on NVIDIA’s website.


