15618 Project Milestone -- Ray Tracing in CUDA

Jiaqi Song (jiaqison@andrew.cmu.edu)

Xinping Luo (xinpingl@andrew.cmu.edu)

URL: Ray Tracing in CUDA (raytracingcuda.github.io)

Summary of Work Completed So Far

We have made substantial progress on our project, achieving several key milestones:

  1. CPU Implementation:

  2. CUDA Implementation:

Progress Towards Goals and Deliverables

Based on our progress so far, we believe that we can achieve the goals and deliverables outlined in the proposal. Here’s a summary of our current status:

  1. Implemented Functionality on CPU:

We have successfully developed a fully functional ray tracing engine with both single-core and multi-core CPU versions. The engine supports:

Additionally, the engine is capable of rendering more complex scenes and supports animations with moving cameras and moving objects.

  2. CUDA Implementation Status:

We have established a functional CUDA implementation that matches the CPU version’s basic features. Tasks remaining for the CUDA version:

These are the final components needed to achieve feature completeness in this project.

  3. Goals and Plan for the Poster Session and Final Report:

Our objective is to complete the full CUDA implementation with all functionalities described in the proposal. Beyond the baseline, we aim to explore and integrate additional optimization techniques to further enhance rendering performance. In the Poster Session, we will present a detailed rendering time comparison table, showcasing the performance differences between the single-core CPU, multi-core CPU, and GPU implementations of the ray tracer. The results will highlight the speedup achieved through parallelism and GPU acceleration. We also plan to include visual demonstrations of rendered scenes and animations to illustrate the quality and performance of our ray tracing engine.

Preliminary Results

In the milestone report, we conducted a speed test based on the current implementation. All rendered images are 600 × 600 in size, with 200 samples per pixel and a maximum depth of 20. This preliminary experiment was performed on GHC machines equipped with an Intel Core i7-9700 8-Core CPU and an NVIDIA GeForce RTX 2080 8GB GPU. In future experiments, we will test our ray tracer on different hardware to evaluate its performance across various configurations.

BVH Effectiveness on CPU

First, we evaluated the effectiveness of BVH on the CPU using an OpenMP-enabled multi-core implementation running on 8 threads.

Render Time on CPU with and without BVH Optimization (ms)

| Scene | Number of Objects | No BVH | BVH |
| --- | --- | --- | --- |
| First Scene | 488 | 2,742,760 | 532,700 |
| Cornell Box | 13 | 231,167 | 237,790 |
| Final Scene | 3,409 | 10,638,643 | 364,221 |

As the results show, BVH optimization significantly improves performance in scenes with a large number of objects: the final scene rendered nearly 30x faster with BVH than without. In the 13-object Cornell Box, by contrast, BVH is marginally slower, because the traversal overhead outweighs the savings from culling so few objects. This demonstrates that BVH is most effective in complex scenes with numerous objects. Therefore, for all subsequent CPU-based experiments, we used BVH optimization.

Multi-Core CPU Performance

Next, we tested the performance of the single-core and multi-core CPU implementations. The speedup achieved by increasing the number of threads is shown below.

Render Time on Single-Core and Multi-Core CPU (ms)

| Scene | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
| --- | --- | --- | --- | --- |
| First Scene | 1,588,860 | 855,407 | 542,649 | 532,700 |
| Cornell Box | 772,366 | 434,136 | 261,806 | 237,790 |
| Final Scene | 1,127,800 | 605,119 | 382,038 | 364,221 |

The results indicate that performance improves as the number of threads increases. However, the speedup is not perfectly linear; in particular, going from 4 to 8 threads yields only a marginal improvement. This is likely due to workload imbalance between pixels, which reduces the efficiency of multi-core CPU parallelism.

Comparing CPU and GPU Performance

Finally, we compared the rendering times of the single-core CPU, multi-core CPU, and CUDA (GPU) implementations across all three scenes.

Render Time on CPU and GPU (ms)

| Method | First Scene | Cornell Box | Final Scene |
| --- | --- | --- | --- |
| Baseline + BVH (CPU) | 1,588,860 | 772,366 | 1,127,800 |
| OpenMP (8 threads) + BVH (CPU) | 532,700 | 237,790 | 364,221 |
| CUDA (no BVH) (GPU) | 19,054 | 11,096 | 334,867 |

Even without BVH optimization, the CUDA implementation outperforms the multi-core CPU implementation, achieving a speedup of over 20x in the two simpler scenes. However, in the final scene with approximately 3,500 objects, the lack of BVH on CUDA significantly impacts performance, making it only slightly faster than the BVH-enabled multi-core CPU version. This highlights the importance of BVH for complex scenes.

The superior performance of the CUDA version stems from the highly parallelizable nature of ray tracing, which is well-suited to the GPU’s architecture. In future work, we will implement BVH on CUDA to further enhance its performance in complex scenes.

Visual Results

Below are the rendered images for all three scenes on both CPU and GPU implementations.

  1. First scene:
first_scene.png
  2. Cornell box:
cornell_box.png
  3. Final scene:
final_scene.png

Issues Encountered

  1. Dynamic Memory Allocation in CUDA:

Both Monte Carlo Sampling and BVH require dynamic memory allocation during the rendering process. In CUDA, this presents a significant challenge, as dynamic memory allocation on the GPU can lead to unintended memory access issues and segmentation faults if not handled properly. Efficient memory management strategies, such as pre-allocating memory buffers or using custom memory pools, are essential to mitigate these risks and maintain stability.

  2. Implementing BVH on CUDA:

The BVH structure heavily relies on recursion for tree traversal, which can lead to stack overflows in a GPU environment due to limited stack size. To address this, recursive operations must be replaced with iterative approaches using explicit stacks or queues that GPU threads can manage. Designing and optimizing such iterative implementations is complex and requires careful attention to memory usage and thread synchronization to ensure both performance and correctness.

Revised Schedule

Based on our current progress, we have updated the schedule to ensure the completion of the project. The revised timeline is as follows: