The emergence of digital news provides new opportunities for information extraction. Proper characterization of unstructured news can help identify signals that may drive variations in many observable phenomena, such as disease outbreaks. In this paper, we propose a method to extract such signals from a large corpus of news events and to identify a subset of signals that are closely related to the observed phenomenon. We show how words appearing in a large news corpus can be represented and how latent features can be extracted from them to build predictive models. We build and evaluate such a system specifically for characterizing and predicting disease outbreaks in India. We focus on five diseases prevalent in India, and our experiments show that the model can predict disease outbreaks 2 to 4 weeks in advance, with an average precision of around 0.80 and a recall of around 0.65. We also compare our model with an LDA-based baseline, over which it demonstrates an improvement of around 5-14% across the different diseases.