Language modelling is a fundamental building block of natural language processing. However, in practice the size of the vocabulary limits the distributions applicable for this task: specifically, one has to either resort to local optimization methods, such as those used in neural language models, or work with heavily constrained distributions. In this work, we take a step towards overcoming these difficulties. We present a method for global-likelihood optimization of a Markov random field language model exploiting long-range contexts in time independent of the corpus size. We take a variational approach to optimizing the likelihood and exploit underlying symmetries to greatly simplify learning. We demonstrate the efficiency of this method both for language modelling and for part-of-speech tagging.