Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: Exhibiting the carry save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.