In this work, we propose a novel parallel-pipeline traversal unit for hardware-based ray tracing that reduces latency and improves cache locality. Because ray-tracing operations such as traversal and intersection tests demand high memory bandwidth and computation, recent studies have focused on developing hardware-based traversal and intersection-test units [Nah et al. 2011][Lee et al. 2012]. Existing hardware engines are based on a single deep pipeline, which increases ray-processing throughput. However, traversal involves non-deterministic changes in the state of a ray, so in some cases a ray is transferred between pipeline stages unnecessarily, which increases the overall latency. To solve this problem, we propose a parallel traversal unit with one pipeline per state. Our results show that the proposed system is up to 30% more efficient than a single-pipeline system because it decreases the average latency per ray and increases cache efficiency.
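The latency argument can be illustrated with a minimal toy model. This is only a sketch under stated assumptions, not the authors' hardware design. The state names (`node`, `leaf`, `pop`), the cycle counts, and both function names are hypothetical and do not come from the paper. The model also assumes the worst case, in which every traversal step in the single deep pipeline passes through all stages, whereas the text says this happens only in some cases.

```python
from typing import Dict, Sequence

# Hypothetical per-state stage latencies in cycles (illustrative, not from the paper).
STAGE_CYCLES: Dict[str, int] = {"node": 4, "leaf": 6, "pop": 2}


def single_pipeline_latency(states: Sequence[str]) -> int:
    """Worst-case toy model: each step flows through every stage of one deep pipeline."""
    full_pass = sum(STAGE_CYCLES.values())
    return full_pass * len(states)


def per_state_latency(states: Sequence[str]) -> int:
    """Toy model: each step enters only the pipeline that matches the ray's current state."""
    return sum(STAGE_CYCLES[s] for s in states)


# An arbitrary sequence of state changes for one ray (assumed for illustration).
ray = ["node", "node", "leaf", "pop", "node", "leaf"]
single = single_pipeline_latency(ray)
parallel = per_state_latency(ray)
```

With these made-up numbers, `single` is 72 cycles and `parallel` is 26 cycles. The gap only shows the direction of the effect. It says nothing about the reported 30% figure, which depends on the actual stage latencies, scheduling, and cache behavior.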