Selection of a small subset of informative features from data is a basic technique in signal processing, machine learning, and statistics. Joint selection of entire groups of features is desirable if the data features exhibit shared grouping structures. Bilevel feature selection constitutes a refinement of these ideas, producing a small subset of data features that themselves belong to a small number of feature groups. However, algorithms for bilevel feature selection suffer a computational cost that can be cubic in the size of the data, hence impeding their utility. In this paper, we propose an approach for bilevel feature selection that resolves this computational challenge. The core component of our approach is a novel fast algorithm for bilevel hard thresholding for a specific non-convex, discrete optimization problem. Our algorithm produces an approximate solution to this problem, but only incurs a nearly-linear running time. We extend this algorithm into a two-stage thresholding method that performs statistically as well as the best available methods for bilevel feature selection, but that also scales extremely well to massive dataset sizes.