Matrix Factorization (MF) is a popular algorithm used to power many recommender systems. Efficient and scalable MF algorithms are essential in order to train on the massive datasets that large scale recommender systems utilize. Graphics Processing Unit (GPU) technology has become very popular in recent years and has become widely used in machine learning. The massive parallelism GPUs offer creates an opportunity to develop an accelerated MF algorithm. This paper presents cu2rec, a matrix factorization algorithm written in CUDA. cu2rec implements a parallel version of Stochastic Gradient Descent (SGD) to solve large scale MF problems. cu2rec utilizes multiple advanced techniques to harness better performance from a GPU. These include aggressive use of constant memory for hyper-parameters and registers for heavily reused values, a sparse matrix data structure, a reduction sum total loss kernel, a novel approach to parallel lock-free updating of feature weights with minimized global memory writes, and fairness across weight updates using user index striding. With a single NVIDIA GPU, cu2rec can be l0x faster than state of the art sequential algorithms while reaching similar error metrics.