Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.