Abstract
We show that a constant number of self-attention layers can efficiently simulate—and be simulated by—a constant number of communication rounds of Massively Parallel Computation, a popular model of distributed computing with wide-ranging algorithmic results. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
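As a rough illustration of the attention-to-MPC connection claimed above, the NumPy sketch below shows how a single near-hard attention layer can perform the message-routing step of one MPC communication round, with token positions playing the role of machines. This is an assumption-laden toy, not the paper's construction; the names `dest` and `beta`, the payload dimension, and the one-hot query/key encoding are all illustrative choices.

```python
import numpy as np

# Toy sketch: one attention layer routes messages between n "machines"
# (token positions), the core operation of a single MPC communication round.
# All dimensions and names here are illustrative assumptions.

n = 8                                  # number of tokens / machines
rng = np.random.default_rng(0)

values = rng.standard_normal((n, 4))   # payload each machine wants to send
dest = rng.integers(0, n, size=n)      # dest[i]: machine that token i sends to

# Queries: receiver j asks "who is sending to me?"  -> one-hot of its own index.
# Keys:    sender i advertises its destination      -> one-hot of dest[i].
Q = np.eye(n)                          # query of token j is e_j
K = np.eye(n)[dest]                    # key of token i is e_{dest[i]}
V = values

beta = 50.0                            # large inverse temperature -> near-hard attention
scores = beta * (Q @ K.T)              # scores[j, i] is large iff dest[i] == j
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Token j's output is (approximately) the average of the messages addressed to j;
# receivers with no incoming message fall back to a uniform average in this toy.
received = attn @ V
print(received.round(2))
```

In this sketch a single attention layer delivers all messages in parallel, which mirrors why a constant number of layers can stand in for a constant number of MPC rounds; the paper's actual simulation handles bounded machine memory and collisions, which this toy does not.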
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 43276-43327 |
| Number of pages | 52 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 235 |
| State | Published - 2024 |
| Event | 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria. Duration: Jul 21 2024 → Jul 27 2024 |
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability