Abstract
Recent machine translation models are mostly based on attention-based neural machine translation (NMT), and many well-known models such as the Transformer or bidirectional encoder representations from Transformers (BERT) have been proposed. Along with these algorithmic advancements, hardware acceleration methods for attention-based NMT models have also been introduced. However, the parameter size of attention-based NMT models keeps growing to guarantee satisfactory translation quality. Among the various weights, the linearization weights ($W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$) account for a non-negligible portion (up to 30%) of the entire parameters in modern NMT models. In this paper, we propose a linearization weight compression method and a near-memory hardware decoder for fast, in-situ weight decompression. Our compression method exploits fixed-point quantization along with Huffman coding, which is selectively applied depending on the weight value distribution. Our hardware decoder decompresses the Huffman-coded weights near memory to minimize the weight decoding latency. Our compression method achieves compression ratios of 4.9-10.0 with small NMT score drops across five widely used attention-based NMT models (Transformer, Transformer-XL-base, Transformer-XL-large, BERT-base, and BERT-large). In addition, owing to the reduced linearization weight size, our proposed method with near-memory decoding reduces the multi-head attention (MHA) execution latency by 11.8% on average, compared to the baseline, when weight loading and initialization are considered. In terms of memory data transfer energy, our proposed method saves 16.1% of the memory energy, on average, compared to the baseline.
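To make the compression pipeline described above concrete, the following is a minimal Python sketch of fixed-point quantization followed by Huffman coding of one linearization weight matrix. It is not the paper's implementation: the bit widths (8-bit values with 6 fractional bits), the matrix size (a hypothetical Transformer-base $W^{Q}$ with $d_{model}=512$), and the helper names (`quantize_fixed_point`, `huffman_code`, `compress`) are illustrative assumptions, and the selective application of Huffman coding based on the weight value distribution, as well as the near-memory hardware decoder, are not modeled here.

```python
import heapq
from collections import Counter

import numpy as np


def quantize_fixed_point(weights: np.ndarray, frac_bits: int = 6, total_bits: int = 8) -> np.ndarray:
    """Quantize float weights to signed fixed-point integers (assumed Q-format)."""
    scale = 1 << frac_bits
    qmin, qmax = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(weights * scale), qmin, qmax).astype(np.int32)


def huffman_code(symbols: np.ndarray) -> dict[int, str]:
    """Build a Huffman code book (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols.ravel().tolist())
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: [frequency, unique tie-breaker, {symbol: partial code}]
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], counter, merged])
        counter += 1
    return heap[0][2]


def compress(weights: np.ndarray) -> tuple[str, dict[int, str]]:
    """Quantize, then entropy-code one linearization weight matrix."""
    q = quantize_fixed_point(weights)
    book = huffman_code(q)
    bitstream = "".join(book[v] for v in q.ravel().tolist())
    return bitstream, book


if __name__ == "__main__":
    d_model = 512  # assumed Transformer-base hidden size
    w_q = (np.random.randn(d_model, d_model) * 0.02).astype(np.float32)
    bits, book = compress(w_q)
    print(f"compressed: {len(bits) / 8 / 1024:.1f} KiB vs fp32: {w_q.nbytes / 1024:.1f} KiB")
```

Because trained linearization weights are typically concentrated around zero, a few quantized symbols dominate the distribution, which is what makes the Huffman stage effective; in the paper's scheme this stage is applied selectively depending on how skewed that distribution is.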
Original language | English (US) |
---|---|
Pages (from-to) | 42751-42763 |
Number of pages | 13 |
Journal | IEEE Access |
Volume | 11 |
DOIs | |
State | Published - 2023 |
Keywords
- hardware-based near-memory decoding
- Huffman coding
- multi-head attention
- neural machine translation
- quantization
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering