Cache coherence is one of the main factors complicating the design of a Network-on-Chip (NoC), because of the large volume of control traffic it generates. With the growing demand for scalable multicore systems, this problem has become more urgent. The snooping-based approach, despite its simplicity, does not scale, while the directory-based approach scales better but requires synchronization and message exchange, which increases latency. This paper describes an efficient directory-based cache coherence protocol for multicore systems built on a multistage network. The protocol decouples the directories from the level-2 caches and places them in the multistage network. It offers two advantages over the traditional directory-based protocol: the placement of the directories lowers latency by almost half, and multipath routing balances the traffic load more evenly and thus reduces congestion. Control message multicasting and acknowledgment message aggregation are used in the network to speed up the completion of cache coherence operations. We have applied the proposed protocol to a fat-tree network topology. Simulation results based on synthetic traffic and SPLASH-2 trace traffic show that our NBC scheme outperforms the traditional directory-based approach.