Google's New TASP Method Speeds Up Long-Context Large Language Models

TASP's topology-aware sequence partitioning boosts speed by up to 3.58x. It minimizes communication overhead and maximizes performance, paving the way for more efficient large language models.

Researchers at Google have developed a new method, TASP, to address communication bottlenecks in large language models dealing with long contexts. This innovation promises substantial speedups and enhanced scalability.

TASP, developed by Google researchers, introduces topology-aware sequence partitioning for parallelizing Transformer training on long sequences. Unlike previous methods, which split sequences into contiguous chunks, TASP allows non-contiguous partitioning, giving it more flexibility to balance the workload across devices.
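To give a concrete sense of what non-contiguous partitioning can buy, the sketch below splits a sequence so that each device receives one chunk from the start and one from the end. Under a causal attention mask, later tokens attend to far more positions than earlier ones, so pairing chunks this way roughly equalizes per-device work. The chunking scheme is an illustrative assumption, not TASP's actual algorithm.

```python
def noncontiguous_partition(seq_len: int, num_devices: int) -> list[list[range]]:
    """Split the sequence into 2 * num_devices equal chunks and give device i
    chunk i (cheap, early tokens) plus chunk 2*num_devices-1-i (expensive,
    late tokens), roughly balancing causal-attention FLOPs across devices."""
    num_chunks = 2 * num_devices
    chunk = seq_len // num_chunks
    chunks = [range(i * chunk, (i + 1) * chunk) for i in range(num_chunks)]
    return [[chunks[i], chunks[num_chunks - 1 - i]] for i in range(num_devices)]


if __name__ == "__main__":
    for dev, parts in enumerate(noncontiguous_partition(seq_len=32_000, num_devices=4)):
        spans = [(r.start, r.stop) for r in parts]
        print(f"device {dev}: token ranges {spans}")
```

A contiguous split, by contrast, would hand the last device only the most expensive tokens, leaving the other devices idle while it finishes.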

To minimize communication overhead, TASP optimizes the partitioning based on the network topology and interconnect bandwidth. With a batch size of 48, it achieves speedups ranging from 1.3x to 2.4x for sequence lengths of 10K to 50K tokens. In experiments on NVIDIA H100 and AMD MI300X systems, TASP achieves up to a 3.58x speedup over Ring Attention and its variant.
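The article does not spell out how TASP's topology-aware optimization works, but a back-of-the-envelope model shows why it matters. The sketch below compares two assumed ways of circulating key/value blocks across two 4-GPU nodes: a flat ring that is gated by the slow inter-node link on every step, and a hierarchical schedule that keeps most traffic on the fast intra-node links. All bandwidths and sizes are illustrative assumptions, not figures from the paper.

```python
INTRA_BW_GBPS = 300.0   # assumed NVLink-class bandwidth inside a node
INTER_BW_GBPS = 25.0    # assumed network bandwidth between nodes
GPUS_PER_NODE = 4
NODES = 2
BLOCK_GB = 1.0          # KV block held by each GPU


def flat_ring_time() -> float:
    """8-GPU flat ring: 7 steps, each step gated by the slowest link it uses,
    and a flat ring unavoidably includes the slow inter-node hops."""
    total_gpus = GPUS_PER_NODE * NODES
    return (total_gpus - 1) * (BLOCK_GB / INTER_BW_GBPS)


def hierarchical_time() -> float:
    """Topology-aware decomposition: an intra-node ring over fast links, then a
    single exchange of each node's aggregated blocks over the slow link."""
    intra = (GPUS_PER_NODE - 1) * (BLOCK_GB / INTRA_BW_GBPS)
    inter = (GPUS_PER_NODE * BLOCK_GB) / INTER_BW_GBPS
    return intra + inter


print(f"flat ring:    {flat_ring_time() * 1e3:.0f} ms")
print(f"hierarchical: {hierarchical_time() * 1e3:.0f} ms")
```

With these assumed numbers the hierarchical schedule finishes in roughly 170 ms versus 280 ms for the flat ring, purely because it moves fewer bytes over the slow links.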

TASP fully utilizes the communication capacity of accelerators by decomposing both the interconnect topology and the communication primitives, which also yields significant memory savings. It improves the compute-to-communication ratio, balancing the workload and enhancing performance. Combining TASP with computation-oriented optimizations, such as sparse attention, could yield further gains.
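The compute-to-communication ratio mentioned above can be estimated with simple arithmetic: if the attention FLOPs a device performs on its local chunk take at least as long as fetching the next key/value block, the transfer can be overlapped and effectively hidden. The constants below (FLOP rate, link bandwidths, half-precision activations) are illustrative assumptions, not measurements from the paper.

```python
def compute_comm_ratio(chunk_tokens: int, block_tokens: int, d_model: int,
                       flops_per_sec: float, link_gbps: float,
                       bytes_per_elem: int = 2) -> float:
    """Ratio of attention compute time to KV-block transfer time for one step;
    a ratio >= 1 means the communication can be hidden behind computation."""
    attn_flops = 4 * chunk_tokens * block_tokens * d_model   # QK^T + AV, forward only
    compute_time = attn_flops / flops_per_sec
    kv_bytes = 2 * block_tokens * d_model * bytes_per_elem   # one K block + one V block
    comm_time = kv_bytes / (link_gbps * 1e9)
    return compute_time / comm_time


# Same workload, two assumed links: fast intra-node vs. slower inter-node.
for name, gbps in [("intra-node (300 GB/s)", 300.0), ("inter-node (25 GB/s)", 25.0)]:
    r = compute_comm_ratio(chunk_tokens=8192, block_tokens=8192, d_model=8192,
                           flops_per_sec=400e12, link_gbps=gbps)
    print(f"{name}: compute/comm = {r:.1f}")
```

Under these assumptions the fast link comfortably hides the transfer (ratio well above 1) while the slow link does not, which is exactly the kind of gap a topology-aware schedule is meant to close.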

TASP, a novel approach by Google researchers, significantly boosts the speed and scalability of long-context large language models. By optimizing partitioning based on network topology and allowing non-contiguous partitioning, TASP minimizes communication overhead and maximizes performance. Future integration with computation-oriented optimizations promises even greater efficiency.
