Docs: Replace terms.md page with page that provides example of API syntax mapping #3726

Merged 1 commit on Feb 5, 2025.
35 changes: 2 additions & 33 deletions docs/faq.rst
@@ -65,39 +65,8 @@ platforms.
Additional porting might be required to deal with architecture feature
queries or CUDA capabilities that HIP doesn't support.

How does HIP compare with OpenCL?
---------------------------------

HIP offers several benefits over OpenCL:

* Device code can be written in modern C++, including templates, lambdas,
  classes and so on.
* Host and device code can be mixed in the source files.
* The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
* Porting from CUDA to HIP is significantly easier than from CUDA to OpenCL.
* HIP uses development tools specialized for each platform: :doc:`amdclang++ <llvm-project:index>`
  for AMD GPUs or `nvcc <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html>`_
  for NVIDIA GPUs, and profilers like :doc:`ROCm Compute Profiler <rocprofiler-compute:index>` or
  `Nsight Systems <https://developer.nvidia.com/nsight-systems>`_.
* HIP provides:

  * pointers and host-side pointer arithmetic.
  * device-level control over memory allocation and placement.
  * an offline compilation model.

How does porting CUDA to HIP compare to porting CUDA to OpenCL?
---------------------------------------------------------------

OpenCL differs from HIP and CUDA in its host runtime API, and even more so
in its kernel code. HIP device code is a C++ dialect, while OpenCL kernel
code is based on C99. OpenCL also does not support single-source compilation,
so host and kernel code must live in separate translation units.

As a result, the OpenCL syntax differs significantly from HIP, and porting tools
must perform complex transformations, especially regarding templates or other
C++ features in kernels.

-To better understand the syntax differences, see :doc:`here<reference/terms>` or
-the :doc:`HIP porting guide <how-to/hip_porting_guide>`.
+To better understand the syntax differences, see :doc:`CUDA to HIP API Function Comparison <reference/api_syntax>`
+or the :doc:`HIP porting guide <how-to/hip_porting_guide>`.

Can I install CUDA and ROCm on the same machine?
------------------------------------------------
2 changes: 1 addition & 1 deletion docs/index.md
@@ -45,7 +45,7 @@ The HIP documentation is organized into the following categories:
* [HSA runtime API for ROCm](./reference/virtual_rocr)
* [HIP math API](./reference/math_api)
* [HIP environment variables](./reference/env_variables)
-* [Comparing syntax for different APIs](./reference/terms)
+* [CUDA to HIP API Function Comparison](./reference/api_syntax)
* [List of deprecated APIs](./reference/deprecated_api_list)
* [FP8 numbers in HIP](./reference/fp8_numbers)
* {doc}`./reference/hardware_features`
176 changes: 176 additions & 0 deletions docs/reference/api_syntax.rst
@@ -0,0 +1,176 @@
.. meta::
:description: Maps CUDA API syntax to HIP API syntax with an example
:keywords: AMD, ROCm, HIP, CUDA, syntax, HIP syntax

********************************************************************************
CUDA to HIP API Function Comparison
********************************************************************************

This page introduces key syntax differences between CUDA and HIP APIs with a focused code
example and comparison table. For a complete list of mappings, visit :ref:`HIPIFY <HIPIFY:index>`.
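
In practice, these mappings are applied automatically rather than by hand. For
example, HIPIFY's ``hipify-perl`` script translates the CUDA runtime calls in a
single source file to HIP (a sketch; ``reduction.cu`` is a placeholder file name,
and the script must be available from a HIPIFY installation):

.. code-block:: shell

   # Translate CUDA runtime calls in reduction.cu to HIP,
   # writing the converted source to stdout.
   hipify-perl reduction.cu > reduction_hip.cpp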

The following CUDA code example illustrates several commonly used CUDA runtime API functions.

.. code-block:: cpp

   #include <iostream>
   #include <vector>
   #include <cstdlib>

   #include <cuda_runtime.h>

   __global__ void block_reduction(const float* input, float* output, int num_elements)
   {
       extern __shared__ float s_data[];

       int tid = threadIdx.x;
       int global_id = blockDim.x * blockIdx.x + tid;

       if (global_id < num_elements)
       {
           s_data[tid] = input[global_id];
       }
       else
       {
           s_data[tid] = 0.0f;
       }
       __syncthreads();

       for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
       {
           if (tid < stride)
           {
               s_data[tid] += s_data[tid + stride];
           }
           __syncthreads();
       }

       if (tid == 0)
       {
           output[blockIdx.x] = s_data[0];
       }
   }

   int main()
   {
       int threads = 256;
       const int num_elements = 50000;

       std::vector<float> h_a(num_elements);
       std::vector<float> h_b((num_elements + threads - 1) / threads);

       for (int i = 0; i < num_elements; ++i)
       {
           h_a[i] = rand() / static_cast<float>(RAND_MAX);
       }

       float *d_a, *d_b;
       cudaMalloc(&d_a, h_a.size() * sizeof(float));
       cudaMalloc(&d_b, h_b.size() * sizeof(float));

       cudaStream_t stream;
       cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

       cudaEvent_t start_event, stop_event;
       cudaEventCreate(&start_event);
       cudaEventCreate(&stop_event);

       cudaMemcpyAsync(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice, stream);

       cudaEventRecord(start_event, stream);

       int blocks = (num_elements + threads - 1) / threads;
       block_reduction<<<blocks, threads, threads * sizeof(float), stream>>>(d_a, d_b, num_elements);

       cudaMemcpyAsync(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);

       cudaEventRecord(stop_event, stream);
       cudaEventSynchronize(stop_event);

       float milliseconds = 0.0f;
       cudaEventElapsedTime(&milliseconds, start_event, stop_event);
       std::cout << "Kernel execution time: " << milliseconds << " ms\n";

       cudaFree(d_a);
       cudaFree(d_b);

       cudaEventDestroy(start_event);
       cudaEventDestroy(stop_event);
       cudaStreamDestroy(stream);

       return 0;
   }

The following table maps CUDA API functions to corresponding HIP API functions, as demonstrated in the
preceding code example.

.. list-table::
   :header-rows: 1
   :name: syntax-mapping-table

   *
     - CUDA
     - HIP

   *
     - ``#include <cuda_runtime.h>``
     - ``#include <hip/hip_runtime.h>``

   *
     - ``cudaError_t``
     - ``hipError_t``

   *
     - ``cudaEvent_t``
     - ``hipEvent_t``

   *
     - ``cudaStream_t``
     - ``hipStream_t``

   *
     - ``cudaMalloc``
     - ``hipMalloc``

   *
     - ``cudaStreamCreateWithFlags``
     - ``hipStreamCreateWithFlags``

   *
     - ``cudaStreamNonBlocking``
     - ``hipStreamNonBlocking``

   *
     - ``cudaEventCreate``
     - ``hipEventCreate``

   *
     - ``cudaMemcpyAsync``
     - ``hipMemcpyAsync``

   *
     - ``cudaMemcpyHostToDevice``
     - ``hipMemcpyHostToDevice``

   *
     - ``cudaMemcpyDeviceToHost``
     - ``hipMemcpyDeviceToHost``

   *
     - ``cudaEventRecord``
     - ``hipEventRecord``

   *
     - ``cudaEventSynchronize``
     - ``hipEventSynchronize``

   *
     - ``cudaEventElapsedTime``
     - ``hipEventElapsedTime``

   *
     - ``cudaFree``
     - ``hipFree``

   *
     - ``cudaEventDestroy``
     - ``hipEventDestroy``

   *
     - ``cudaStreamDestroy``
     - ``hipStreamDestroy``

In summary, most CUDA runtime symbols map to HIP by replacing the ``cuda``
prefix with ``hip``; only the header include changes its path. This is what
makes porting CUDA runtime code to HIP largely mechanical.
44 changes: 0 additions & 44 deletions docs/reference/terms.md

This file was deleted.

3 changes: 1 addition & 2 deletions docs/sphinx/_toc.yml.in
@@ -111,8 +111,7 @@ subtrees:
- file: reference/virtual_rocr
- file: reference/math_api
- file: reference/env_variables
-  - file: reference/terms
-    title: Comparing syntax for different APIs
+  - file: reference/api_syntax
- file: reference/deprecated_api_list
title: List of deprecated APIs
- file: reference/fp8_numbers