Linearization Weight Compression and In-Situ Hardware-Based Decompression for Attention-Based Neural Machine Translation

Mijin Go, Joonho Kong, Arslan Munir

Research output: Contribution to journal › Article › peer-review


Most recent machine translation models are based on attention-based neural machine translation (NMT), and many well-known models such as Transformer and bidirectional encoder representations from Transformers (BERT) have been proposed. Alongside these algorithmic advances, hardware acceleration methods for attention-based NMT models have also been introduced. However, the parameter size of attention-based NMT models keeps growing in order to maintain satisfactory translation quality. Among the various weights, the linearization weights ($W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$) account for a non-negligible portion (up to 30%) of all parameters in modern NMT models. In this paper, we propose a linearization weight compression method and a near-memory hardware decoder for fast, in-situ weight decompression. Our weight compression method combines fixed-point quantization with Huffman coding, which is applied selectively depending on the weight value distribution. Our hardware decoder decompresses the Huffman-coded weights near memory to minimize weight decoding latency. Our compression method achieves compression ratios of 4.9 to 10.0 with small NMT score drops across five widely used attention-based NMT models (Transformer, Transformer-XL-base, Transformer-XL-large, BERT-base, and BERT-large). In addition, owing to the reduced linearization weight size, our proposed method with near-memory decoding reduces multi-head attention (MHA) execution latency by 11.8% on average compared to the baseline when weight loading and initialization are taken into account. In terms of memory data transfer energy, our proposed method saves 16.1% on average compared to the baseline.
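The compression pipeline described in the abstract (fixed-point quantization followed by selectively applied Huffman coding) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 8-bit format with 6 fractional bits, and the selection rule (use Huffman coding only when it yields fewer bits than plain fixed-point storage) are all assumptions made for illustration, and the codebook storage overhead is ignored.

```python
import heapq
from collections import Counter


def quantize_fixed_point(weights, total_bits=8, frac_bits=6):
    """Round each weight to a signed fixed-point integer (assumed format)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    # Saturate values that fall outside the representable range.
    return [max(lo, min(hi, round(w * scale))) for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def compress_selectively(weights, total_bits=8, frac_bits=6):
    """Quantize, then apply Huffman coding only if it reduces the bit count.

    The selection rule here is a hypothetical stand-in for the paper's
    distribution-dependent criterion.
    """
    q = quantize_fixed_point(weights, total_bits, frac_bits)
    lengths = huffman_code_lengths(q)
    huffman_bits = sum(lengths[s] for s in q)
    fixed_bits = total_bits * len(q)
    use_huffman = huffman_bits < fixed_bits
    stored_bits = huffman_bits if use_huffman else fixed_bits
    return {
        "use_huffman": use_huffman,
        "stored_bits": stored_bits,
        # Ratio relative to storing each weight as a 32-bit float.
        "ratio_vs_fp32": 32 * len(q) / stored_bits,
    }


if __name__ == "__main__":
    # Skewed distribution: most weights are zero, so Huffman coding pays off.
    skewed = [0.0] * 90 + [0.5] * 5 + [-0.5] * 5
    print(compress_selectively(skewed))
    # Uniform distribution over all 256 levels: Huffman coding gives no gain.
    uniform = [v / 64 for v in range(-128, 128)]
    print(compress_selectively(uniform))
```

In this sketch, a sharply peaked weight distribution benefits from Huffman coding, whereas a near-uniform one is left in plain fixed-point form, which is one plausible reading of applying the coding selectively by distribution.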

Original language: English (US)
Pages (from-to): 42751-42763
Number of pages: 13
Journal: IEEE Access
State: Published - 2023


Keywords

  • hardware-based near-memory decoding
  • Huffman coding
  • multi-head attention
  • neural machine translation
  • quantization

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering

