The Problems & Solutions

  1. Operate on important voxels instead of all non-empty voxels [sparser than SSC]

    1. proportion of foreground points in the entire scene is extremely low (around 5%)


    (a) Comparison of the foreground and background ratios. (b) By replacing regular sparse convolution with its spatially pruned counterpart (SPRS-Conv), unnecessary positions are effectively suppressed from being activated.

  2. Dilate only the important voxels during downsampling, instead of applying no dilation at all [denser than SSC]

    1. Dilated features during downsampling


    (a) SSC (submanifold sparse convolution, 2017): outputs are computed only at non-empty input positions, so the receptive field is limited.

    (b) Regular sparse convolution [14]: computes features for adjacent empty voxels ⇒ dilated features ⇒ effectively expands the receptive field ⇒ but increases the computation burden.

    Consequently, after strided (stride > 1) convolutional down-sampling, the number of non-empty voxels may increase rather than decrease, as shown in Fig. 1(b): the count even doubles at stage 2 compared to the input at stage 1.
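A toy 2D sketch of this dilation effect (my own NumPy illustration, not the paper's code): submanifold convolution keeps the active set fixed, while regular sparse convolution activates every position touched by a non-empty neighbor.

```python
# Toy 2D illustration: count active sites after one 3x3 convolution under
# the two sparse-convolution rules.
import numpy as np

def dilate_active(mask):
    """Regular sparse conv rule: any position with a non-empty 3x3 neighbor
    becomes active, so the active set dilates."""
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

# A sparse "line" of 5 active voxels in a 9x9 grid (kept away from borders
# so np.roll's wrap-around does not matter here).
mask = np.zeros((9, 9), dtype=bool)
mask[4, 2:7] = True

submanifold = mask             # submanifold rule: active set unchanged
regular = dilate_active(mask)  # regular rule: active set dilates

print(submanifold.sum(), regular.sum())  # 5 21
```

One layer already quadruples the active-site count here, which is the growth the figure illustrates across stages.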

==================

Solution: SPS-Conv

Previous solutions

In the 2D image domain [29, 11], an auxiliary learnable module is added to predict a soft mask [21, 28] that locates areas to be skipped for computational efficiency.

Such a module often requires additional post-finetuning and auxiliary integration costs, and incurs non-negligible computational overhead.

spatial pruned sparse convolution (SPS-Conv)


Illustration of magnitude-guided spatial sampling and spatial pruned submanifold sparse convolution (SPSS-Conv)


Illustration of spatial pruned regular sparse convolution (SPRS-Conv); the figure shows the stride = 2 case. (Note: I didn't fully understand this figure.)
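My reading of the magnitude-guided sampling idea, as a hedged NumPy sketch (the function name, the L1 score, and `keep_ratio` are my assumptions, not the authors' code): rank voxels by feature magnitude and keep only the top fraction as "important"; the rest are skipped (SPSS-Conv) or excluded from dilation (SPRS-Conv).

```python
import numpy as np

def magnitude_guided_mask(features, keep_ratio=0.5):
    """features: (N, C) array of per-voxel features.
    Returns a boolean mask marking the top `keep_ratio` fraction of voxels
    by L1 feature magnitude as "important"."""
    magnitude = np.abs(features).sum(axis=1)           # per-voxel importance score
    k = max(1, int(round(keep_ratio * len(magnitude))))
    threshold = np.sort(magnitude)[-k]                 # k-th largest magnitude
    return magnitude >= threshold

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))                    # 8 toy voxels, 4 channels
mask = magnitude_guided_mask(feats, keep_ratio=0.25)
print(mask.sum())  # 2 of 8 voxels marked important
```

The point is that importance comes for free from feature magnitudes, with no auxiliary learnable module or extra finetuning as in the 2D soft-mask approaches above.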

Preliminaries of Sparse Convolution

I won't write this part out in text; it's easier to understand from the figure.


  1. $x_p$: input feature with $c_{in}$ channels at position $p$
  2. $w\in \mathbb{R}^{K^d\times c_{in}\times c_{out}}$: weight of the convolution kernel, where $d$ and $K^d$ refer to the dimension of the spatial space and the spatial size of the kernel, respectively
  3. $K^d(p, P_{in})$: the subset of $K^d$ that leaves out the empty positions
    1. $k$: a kernel offset corresponding to a valid/non-empty location in kernel space $K^d$
    2. $\bar{p}_k = p + k$: a non-empty neighbor voxel around the center $p$

regular sparse convolution [14]

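In the notation above, regular sparse convolution computes $x'_p = \sum_{k\in K^d(p, P_{in})} w_k\, x_{\bar{p}_k}$, producing outputs also at empty positions reachable from a non-empty voxel (the dilation effect). A hedged gather-scatter toy in plain NumPy (my own re-implementation, not the paper's code):

```python
import numpy as np

def regular_sparse_conv2d(feats, weights):
    """feats:   dict (y, x) -> (c_in,) feature at a non-empty voxel
    weights: dict offset (dy, dx) -> (c_out, c_in) kernel slice"""
    # Output positions: every p reachable as p = q - k from a non-empty q
    # (a submanifold variant would instead use out_positions = set(feats)).
    out_positions = {(q[0] - k[0], q[1] - k[1])
                     for q in feats for k in weights}
    out = {}
    for p in out_positions:
        acc = None
        for k, w_k in weights.items():
            q = (p[0] + k[0], p[1] + k[1])   # neighbor p̄_k = p + k
            if q in feats:                    # K^d(p, P_in): skip empty positions
                contrib = w_k @ feats[q]
                acc = contrib if acc is None else acc + contrib
        out[p] = acc
    return out

# A single non-empty voxel with a 3x3 all-ones kernel dilates to 9 outputs.
weights = {(dy, dx): np.ones((1, 2)) for dy in (-1, 0, 1) for dx in (-1, 0, 1)}
feats = {(0, 0): np.array([1.0, 2.0])}
out = regular_sparse_conv2d(feats, weights)
print(len(out))      # 9 active output positions from 1 active input
print(out[(1, 1)])   # [3.]  (1 + 2 through the (-1, -1) kernel slice)
```

The dict-based gather over `K^d(p, P_in)` is only for readability; real libraries precompute these index pairs (a "rulebook") and run batched gather-GEMM-scatter on GPU.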