Accurate depth perception is crucial for patient outcomes in endoscopic surgery, yet it is compromised by image distortions common in surgical settings. As shown in the image below, the depth estimation model's performance degrades significantly when the input image is corrupted.
Our study introduces a benchmark for evaluating the robustness of endoscopic depth estimation models. We've compiled a dataset with synthetically induced corruptions at different intensities. We also present the Depth Estimation Robustness Score (DERS), a new metric combining error, accuracy, and robustness measures. This metric and benchmark aim to improve model refinement and reliability under adverse conditions. Our findings highlight the need for robust algorithms, contributing to surgical precision and patient safety.
We utilize the SCARED dataset as the base dataset and introduce a range of synthetic corruptions to it to create a new dataset, which we refer to as the SCARED-C dataset. This expanded dataset serves as a rigorous evaluation platform for the accuracy of depth estimation in endoscopic imagery and therefore plays a pivotal role in our robustness benchmarking. SCARED-C contains 551 images, originating from the test split of AF-SfMLearner. In total, 16 corruptions are applied to the images at 5 intensity levels.
The corruptions applied to the images are as follows:
For more details on the corruptions, please refer to Sec. 2.1 of the paper.
The SCARED-C dataset is structured as follows:
Note that `clean` refers to the original image without any corruptions applied.
The dataset is available at OneDrive.
- Install basic packages, including `torch`, `torchvision`, `numpy`, etc.
- Refer to AF-SfMLearner to download and process the SCARED dataset.
- Go to the `corruptions` folder and run `create.py` to apply corruptions to the images. For example, to apply the brightness corruption:
```bash
python create.py --image_list <path_to_image_list> --save_path <path_to_save_corrupted_images> --if_brightness
```
Similarly, you can apply other corruptions by passing the respective flags (a batched example is sketched below).
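If you want to generate several corruptions in one go, a small Python driver can loop over the flags. In the sketch below, only `--if_brightness` is confirmed by the example above; the other flag names are hypothetical placeholders, so check `create.py` for the flags it actually exposes.

```python
import subprocess

# Only "--if_brightness" is confirmed above; the remaining flag names are
# hypothetical placeholders -- check create.py for the flags it actually accepts.
corruption_flags = ["--if_brightness", "--if_darkness", "--if_motion_blur"]

image_list = "path/to/image_list.txt"  # corresponds to <path_to_image_list>
save_path = "path/to/SCARED-C"         # corresponds to <path_to_save_corrupted_images>

for flag in corruption_flags:
    subprocess.run(
        ["python", "create.py",
         "--image_list", image_list,
         "--save_path", save_path,
         flag],
        check=True,  # stop if a corruption fails to generate
    )
```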
Commonly used error and accuracy metrics are employed to evaluate the performance of depth estimation models. The metrics used are as follows:
- Error metrics
  - Absolute Relative Difference ($AbsRel$)
  - Squared Relative Difference ($SqRel$)
  - Root Mean Squared Error ($RMSE$)
  - Root Mean Squared Error in Logarithmic Scale ($LogRMSE$)
- Accuracy metrics (Thresholded Accuracy)
  - $a1$ ($\delta < 1.25$)
  - $a2$ ($\delta < 1.25^2$)
  - $a3$ ($\delta < 1.25^3$)
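These are the standard depth evaluation metrics from the monocular depth estimation literature. For reference, a minimal NumPy sketch of the usual formulations is shown below; it assumes `gt` and `pred` are flattened arrays of valid (masked), positive depth values on the same scale, and the function name is ours rather than part of this repository.

```python
import numpy as np

def compute_depth_metrics(gt, pred):
    """Standard error and accuracy metrics between ground-truth and predicted depth."""
    # Thresholded accuracy: fraction of pixels whose ratio error is below 1.25^k
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    # Error metrics
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    # Column order matches the metrics_array expected by calculate_ders below
    return np.array([abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3])
```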
DERS is purposefully devised to combine three pivotal components, namely error, accuracy, and robustness, into a comprehensive composite index.
The DERS is calculated as follows:
Here, E, A, and R are the Error Component, Accuracy Component, and Robustness Component, respectively, and can be calculated as follows:
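The exact formulas are given in the paper; as a rough reconstruction read off the reference implementation below (so the normalization, the default weights $w = (0.5, 0.3, 0.2)$, and the averaging choices are assumptions, not the paper's definitive notation):

$$
\mathrm{DERS} = \frac{E}{A}\cdot e^{-\lambda R}
$$

$$
E = \sum_{m \in \{AbsRel,\, SqRel,\, RMSE,\, LogRMSE\}} \frac{\bar{e}_m}{e_m^{\,clean}}, \qquad
A = \sum_{k=1}^{3} w_k\, \bar{a}_k, \qquad
R = \frac{1}{7} \sum_{j=1}^{7} \operatorname{std}_{l=1..5}\!\left(m_{j,l} - m_{j}^{\,clean}\right)
$$

where $\bar{e}_m$ is the error metric averaged over the five corruption intensities, $e_m^{\,clean}$ its value on the clean images, $\bar{a}_k$ the thresholded accuracies $a1$-$a3$ averaged over all levels (clean included), and the robustness term averages, over the seven metrics, the standard deviation across intensities of the deviation from the clean value.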
For more details on the DERS metric, please refer to Sec. 2.2 of the paper.
The DERS metric builds on the error and accuracy metrics above. To compute it, first obtain those metrics on both the clean and the corrupted images. The following code snippet demonstrates how to calculate DERS from them.
```python
import numpy as np


def calculate_ders(metrics_array, accuracy_weights=None, lambd=1.0):
    """
    Calculate the DERS (Depth Estimation Robustness Score) from the metrics measured
    for a specific corruption.

    Parameters:
    - metrics_array (numpy.ndarray): Array of metric values (6 rows x 7 columns).
      Each row represents a corruption level 0-5 (level 0 is the clean images) and
      each column a different metric.
      Metric order: abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3.
    - accuracy_weights (numpy.ndarray, optional): Weights for the accuracy component.
      Defaults to [0.5, 0.3, 0.2].
    - lambd (float, optional): Lambda parameter for the robustness component. Defaults to 1.0.

    Returns:
    - ders_score (float): The calculated DERS score.
    """
    if accuracy_weights is None:
        accuracy_weights = np.array([0.5, 0.3, 0.2])

    # Error component: mean corrupted errors normalized by the clean-image errors
    error_norms = metrics_array[0, :4]                # error metrics on clean images
    mean_errors = metrics_array[1:, :4].mean(axis=0)  # error metrics averaged over intensities
    normalized_errors = mean_errors / error_norms

    # Accuracy component: weighted sum of the mean thresholded accuracies (a1, a2, a3)
    mean_accuracies = metrics_array[:, 4:].mean(axis=0)
    weighted_accuracies = mean_accuracies * accuracy_weights
    accuracy_component = np.sum(weighted_accuracies)

    # Robustness component: per-metric standard deviation of the deviations from the
    # clean values across corruption levels, averaged over all seven metrics
    deviations = metrics_array[1:, :] - metrics_array[0, :]
    robustness = np.mean(np.std(deviations, axis=0))

    # Final DERS calculation
    ders_score = np.sum(normalized_errors) / accuracy_component * np.exp(-lambd * robustness)
    # Alternative formulation (unused):
    # ders_score = lambd * robustness * np.sum(normalized_errors) / accuracy_component
    return ders_score
```
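As a usage sketch, with the function above in scope, `calculate_ders` expects one 6x7 array per corruption type. The numbers below are made-up illustrative values, not results from the benchmark.

```python
import numpy as np

# Hypothetical metrics for one corruption type: row 0 is the clean split,
# rows 1-5 are the five corruption intensities.
# Columns: abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3.
metrics_array = np.array([
    [0.060, 0.50, 4.8, 0.085, 0.97, 0.99, 1.00],  # clean
    [0.065, 0.55, 5.0, 0.090, 0.96, 0.99, 1.00],  # intensity 1
    [0.072, 0.62, 5.4, 0.098, 0.94, 0.98, 0.99],  # intensity 2
    [0.085, 0.75, 6.1, 0.110, 0.91, 0.97, 0.99],  # intensity 3
    [0.105, 0.95, 7.0, 0.130, 0.86, 0.95, 0.98],  # intensity 4
    [0.140, 1.30, 8.2, 0.160, 0.78, 0.92, 0.97],  # intensity 5
])

print(f"DERS: {calculate_ders(metrics_array):.4f}")
```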