In this article, we apply the state-of-the-art deep learning framework Mask R-CNN in a novel way. To obtain more refined results and faster training, we modified Mask R-CNN to adapt it to our dataset: we changed the loss function of the class branch and adopted atrous convolution [12] in the backbone network. To obtain better segmentation, we tried several mainstream backbone networks, and R101FA delivered the best segmentation results.
Data set
From December 2017 to June 2018, we worked with the burn department of Wuhan Hospital No. 3. Ethics approvals were granted by Wuhan Hospital No. 3 and Tongren Hospital of Wuhan University. All patients involved in this research signed informed consent forms.
To obtain enough data, we used smartphones to collect images of fresh burn wounds in the hospital every day. We then used our own software to annotate the burn images and saved the annotations in the Common Objects in Context (COCO) dataset format. Figure 2 shows the annotation software. To ensure the accuracy of the framework, we annotated the burn images carefully under the guidance of professional doctors and avoided mistaking confusing regions, such as gauze and blood stains, for wounds. With the help of doctors and nurses, we finally annotated 1000 burn images for training and another 150 for evaluation.
Network architecture
As shown in Fig. 3, our framework contains three parts. The first part is the backbone network, which extracts the feature maps. The second part is the RPN [9], which generates the RoIs [9]. Finally, object detection and mask prediction are performed on each RoI. Because there is only one category (here, we do not consider the depth of the burn wound), we changed the loss functions of the mask branch and the classification branch to fit our dataset. For training, we collected images covering almost all kinds of burn wounds, totaling 1000 after filtering. At the same time, to achieve faster training and shorter evaluation time, we tried different backbone networks in our framework. Finally, we used R101FA as the backbone network of our framework.
In this article, our backbone network is based on the R101FA. ResNet101 consists of 101 layers, and we use C1, C2, C3, C4, and C5 to denote its output feature maps. As shown in Fig. 4, we obtain the final feature maps P2, P3, P4, and P5.
Here, we apply a 1 × 1 convolution to the output of C5 to obtain the first feature map P5. We then upsample P5 to get P*, apply a 3 × 3 convolution to C4 to produce C*, and merge C* with P* to obtain P4. Iterating this procedure over the remaining C maps yields P2, P3, P4, and P5.
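As a concrete illustration of this top-down construction, here is a minimal PyTorch sketch; PyTorch itself, the standard ResNet101 channel counts (256/512/1024/2048), and the nearest-neighbor upsampling are our assumptions, since the paper does not specify an implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the top-down pathway: P5 from a 1x1 convolution on C5;
    each lower P merges a 3x3 convolution of the corresponding C with the
    upsampled higher-level P."""

    def __init__(self, c_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.p5_conv = nn.Conv2d(c_channels[3], out_channels, kernel_size=1)
        # 3x3 lateral convolutions for C4, C3, C2
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1)
            for c in (c_channels[2], c_channels[1], c_channels[0])
        )

    def forward(self, c2, c3, c4, c5):
        p5 = self.p5_conv(c5)
        p4 = self.lateral[0](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[2](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return p2, p3, p4, p5
```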
Atrous
In the convolutional neural network, we employ atrous convolution [12] in ResNet. A traditional convolution kernel is a dense N × N matrix, whereas the kernel of an atrous convolution is no longer dense; as shown in Fig. 5, different rates correspond to different convolution kernels. Compared with traditional kernels, a larger atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales. This structure suits our burn dataset, which contains wounds of varying depths and sizes. In this research, we set the rate to 2.
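For reference, a 3 × 3 atrous (dilated) convolution with rate 2 can be written in PyTorch as follows; the channel count and input size here are illustrative only:

```python
import torch
import torch.nn as nn

# 3x3 convolution with atrous (dilation) rate 2; padding=2 keeps the
# spatial size unchanged for stride 1, while the field-of-view grows
# to that of a 5x5 kernel.
atrous_conv = nn.Conv2d(in_channels=256, out_channels=256,
                        kernel_size=3, stride=1, padding=2, dilation=2)

x = torch.randn(1, 256, 64, 64)   # dummy feature map
y = atrous_conv(x)
print(y.shape)                    # torch.Size([1, 256, 64, 64])
```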
RPN in FPN
We adopt the RPN within the FPN to propose candidate region proposals. The details differ from the original RPN: the original RPN uses a single feature map, whereas our network builds several. To make the images easier to handle, we resized them to 1024 × 1024 and padded them with zeros to prevent distortion. To cover all possible rectangular boxes, we defined five scales, 32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512, each with three aspect ratios, 0.5, 1, and 2. It is not necessary to define all scales on every feature map; we assign one scale per feature map. To match the five scales, we added P6, which is the output of a max-pooling layer applied to P5. In this way, we can generate all possible rectangular boxes (anchors [9]) on the original image.
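The following sketch shows how such anchors could be generated for one pyramid level; the (y1, x1, y2, x2) box convention, the interpretation of the ratio as width/height, and the assignment of scale 128 to P4 (stride 16) follow the standard FPN setup and are our assumptions:

```python
import numpy as np

def generate_anchors(scale, ratios, feature_shape, feature_stride):
    """One anchor scale per pyramid level, three aspect ratios per
    location; boxes are (y1, x1, y2, x2) in the 1024x1024 input image."""
    heights = scale / np.sqrt(ratios)          # ratio interpreted as w / h
    widths = scale * np.sqrt(ratios)

    # sliding-window centers mapped back to image coordinates
    ys = (np.arange(feature_shape[0]) + 0.5) * feature_stride
    xs = (np.arange(feature_shape[1]) + 0.5) * feature_stride
    cx, cy = np.meshgrid(xs, ys)

    boxes = [np.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], axis=-1)
             for w, h in zip(widths, heights)]
    return np.concatenate([b.reshape(-1, 4) for b in boxes], axis=0)

# e.g. scale 128 on P4 (stride 16 -> 64x64 feature map for a 1024x1024 input)
anchors_p4 = generate_anchors(128, np.array([0.5, 1.0, 2.0]), (64, 64), 16)
```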
In the RPN, a small convolutional network filters out N RoIs. This small network estimates the probability that each anchor contains an object, which we call the anchor score. We sort all anchors by this score and take the top N as RoIs. To adjust the position of each anchor, the small network also predicts per-anchor regression offsets. In the FPN there are several feature maps, and this small network is shared across all of them; the details are shown in Fig. 6.
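A simplified sketch of this top-N selection, including the application of the predicted offsets (the inverse of the encoding in Eq. 4 below), is given here; the value of N and the column ordering of the offsets are hypothetical:

```python
import numpy as np

def select_proposals(anchors, scores, offsets, top_n=2000):
    """Keep the top-N anchors by objectness score and refine them with the
    predicted offsets (t_x, t_y, t_w, t_h); anchors are (y1, x1, y2, x2)."""
    order = np.argsort(scores)[::-1][:top_n]
    boxes, t = anchors[order], offsets[order]

    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h
    cx = boxes[:, 1] + 0.5 * w

    cx = cx + t[:, 0] * w          # shift center by t_x * w_a
    cy = cy + t[:, 1] * h          # shift center by t_y * h_a
    w = w * np.exp(t[:, 2])        # rescale width by exp(t_w)
    h = h * np.exp(t[:, 3])        # rescale height by exp(t_h)

    return np.stack([cy - 0.5 * h, cx - 0.5 * w,
                     cy + 0.5 * h, cx + 0.5 * w], axis=-1)
```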
RPN training
As shown in Fig. 6, the outputs of the RPN are the score and the regression offsets of each anchor. We define two loss functions to train the RPN: the score loss \( L_{rpnScore} \) and the regression loss \( L_{rpnReg} \).
To calculate \( L_{rpnScore} \), we assign each anchor one of two labels, positive or negative. An anchor with an intersection-over-union (IoU) overlap higher than 0.7 with any ground-truth bounding box is labeled positive, and an anchor with an IoU overlap lower than 0.3 with all ground-truth boxes is labeled negative. To ensure that every ground-truth box corresponds to at least one anchor, we also label the anchor with the highest IoU for each ground-truth box as positive. We thus obtain all positive and negative anchors, which we encode as a sequence of 0s and 1s; this sequence is the target output of the RPN objectness judgment. As shown in Fig. 6, we apply the softmax function to the RPN output to obtain the objectness probability of every anchor, and we then use the cross-entropy function to calculate \( L_{rpnScore} \).
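The labeling rule can be summarized by the following NumPy sketch (an illustration of the rule above, not the actual implementation):

```python
import numpy as np

def compute_iou(anchors, gt_boxes):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes,
    both as (y1, x1, y2, x2)."""
    y1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    x1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    y2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    x2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """IoU > 0.7 with any ground-truth box -> positive (1); IoU < 0.3 with
    all boxes -> negative (0); everything else is ignored (-1). The best
    anchor for each ground-truth box is always positive."""
    ious = compute_iou(anchors, gt_boxes)
    labels = np.full(len(anchors), -1, dtype=np.int8)
    max_iou = ious.max(axis=1)
    labels[max_iou < neg_thresh] = 0
    labels[max_iou > pos_thresh] = 1
    labels[ious.argmax(axis=0)] = 1   # best anchor for each ground-truth box
    return labels
```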
We then apply a linear function to the RPN output to predict the regression parameters \( t^{\ast} \), and we calculate the regression offsets \( t \) of each positive anchor. The regression offsets are the same as in [8] and contain four values (x, y, w, h): x and y are the normalized offsets between the center of the associated ground-truth box and the center of the positive anchor, and w and h are the logarithms of the width and height ratios between the ground-truth box and the positive anchor. Finally, we use the smooth-L1 function to calculate \( L_{rpnReg} \), shown in Eq. 1. Only positive anchors contribute to \( L_{rpnReg} \).
$$ L_{rpnReg}(t_i) = \frac{1}{N_{reg}} \sum_i p_i^{\ast} L_{reg}\left(t_i, t_i^{\ast}\right) $$
(1)
Here, i is the index of an anchor in the mini-batch, and \( p_i^{\ast} \) is 1 if the anchor is positive and 0 otherwise. \( t_i \) and \( t_i^{\ast} \) are four-dimensional vectors of regression offsets: \( t_i \) is the regression offset of a positive anchor based on the associated ground-truth box, and \( t_i^{\ast} \) is the predicted regression offset. The regression loss function is shown in Eq. 2, and the smooth-L1 function is defined in Eq. 3.
$$ L_{reg}\left(t, t^{\ast}\right) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i - t_i^{\ast}\right) $$
(2)
$$ \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} $$
(3)
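Equations 1–3 can be written compactly as follows (a NumPy sketch; `positive` stands for the vector of \( p_i^{\ast} \) values):

```python
import numpy as np

def smooth_l1(x):
    """Eq. 3: 0.5 x^2 where |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def rpn_reg_loss(t, t_pred, positive, n_reg):
    """Eqs. 1-2: smooth-L1 summed over the four offsets per anchor, with
    only positive anchors (p_i* = 1) contributing, divided by N_reg."""
    per_anchor = smooth_l1(t - t_pred).sum(axis=1)   # L_reg(t_i, t_i*)
    return (positive * per_anchor).sum() / n_reg
```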
Equation 4 gives the detailed definition of the regression offsets.
$$ \begin{aligned} t_x &= \frac{x - x_a}{w_a}, & t_y &= \frac{y - y_a}{h_a}, \\ t_w &= \log\left(\frac{w}{w_a}\right), & t_h &= \log\left(\frac{h}{h_a}\right), \\ t_x^{\ast} &= \frac{x^{\ast} - x_a}{w_a}, & t_y^{\ast} &= \frac{y^{\ast} - y_a}{h_a}, \\ t_w^{\ast} &= \log\left(\frac{w^{\ast}}{w_a}\right), & t_h^{\ast} &= \log\left(\frac{h^{\ast}}{h_a}\right) \end{aligned} $$
(4)
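For illustration, the offset encoding of Eq. 4 for a batch of boxes and their anchors might look like this; the (center x, center y, width, height) array format is our assumption:

```python
import numpy as np

def encode_offsets(boxes, anchors):
    """Eq. 4: offsets of boxes relative to their anchors; both arrays are
    (N, 4) in (center x, center y, width, height) format."""
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = np.log(boxes[:, 2] / anchors[:, 2])
    th = np.log(boxes[:, 3] / anchors[:, 3])
    return np.stack([tx, ty, tw, th], axis=1)
```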
After choosing the RoIs from the anchors, we map them onto a feature map for the subsequent stages of the framework. In our framework, however, there are four feature maps. Unlike anchor generation, we do not assign each RoI to a different feature map; since P2 contains all image features, we map all RoIs to P2. After mapping, the three parallel branches process the mapped RoIs.
Loss function
In our framework, the loss has five components. The RPN contributes two losses, and the three parallel branches contribute three more, which we denote \( L_{mCls} \), \( L_{mBReg} \), and \( L_{mMask} \). The final loss is therefore \( L = L_{rpnScore} + L_{rpnReg} + L_{mCls} + L_{mBReg} + L_{mMask} \).
Class loss
In Mask R-CNN, the authors apply softmax to the output of the fully connected layer and use the cross-entropy function to calculate the class loss, which handles multi-class classification tasks. In our task, however, the goal is simply to segment burn wounds, so we replace the multi-class classifier with a binary one: we apply the sigmoid function to the output and use the cross-entropy function to calculate the loss. Let y denote the ground-truth labels of the N RoIs and y* the sigmoid output. Then \( L_{mCls} \) is given by Eq. 5.
$$ L_{mCls} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i \left(-\log y_i^{\ast}\right) + \left(1 - y_i\right)\left(-\log\left(1 - y_i^{\ast}\right)\right) \right) $$
(5)
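Equation 5 is the standard binary cross-entropy averaged over the N RoIs; a NumPy sketch (the epsilon clipping is ours, added only for numerical stability):

```python
import numpy as np

def class_loss(y_true, y_sigmoid, eps=1e-7):
    """Eq. 5: binary cross-entropy between RoI labels y and sigmoid
    outputs y*, averaged over the N RoIs."""
    y_sigmoid = np.clip(y_sigmoid, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(y_sigmoid)
                     + (1.0 - y_true) * np.log(1.0 - y_sigmoid)))
```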
Bounding-box loss
As mentioned above, the RPN predicts the regression offsets of each anchor. The input to the box branch is the coordinates of the RoIs, obtained by applying the RPN's regression offsets to the anchors. We then calculate \( L_{mBReg} \) in the same way as \( L_{rpnReg} \).
Mask loss
In Mask R-CNN, the authors apply a small FCN [14] to each RoI, and the mask branch predicts an m × m mask, obtained by applying a sigmoid to each pixel. The mask loss is then calculated according to the class of the mask, to avoid competition between classes, and is defined with binary cross-entropy.
In this article, however, we only calculate the mask loss of the positive RoIs and do not use the idea of competition between classes. Moreover, we set the size of both the ground-truth masks and the predicted masks to 28 × 28 to reduce memory consumption. Each ground-truth RoI mask is therefore scaled to 28 × 28 and padded with zeros to avoid distortion. At the output of the mask branch, each RoI is scaled to the same size before the mask loss is calculated.
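A possible form of this mask loss, restricted to positive RoIs and computed on 28 × 28 masks, is sketched below; the per-pixel averaging and the `positive` indicator vector are our assumptions:

```python
import numpy as np

def mask_loss(gt_masks, pred_masks, positive, eps=1e-7):
    """Per-pixel binary cross-entropy on (N, 28, 28) masks, averaged over
    pixels and over the positive RoIs only."""
    pred_masks = np.clip(pred_masks, eps, 1.0 - eps)
    bce = -(gt_masks * np.log(pred_masks)
            + (1.0 - gt_masks) * np.log(1.0 - pred_masks))
    per_roi = bce.mean(axis=(1, 2))
    return (positive * per_roi).sum() / max(positive.sum(), 1)
```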
Regularization loss
As mentioned above, our dataset is small. To prevent the model from over-fitting, we add a regularization term to the overall loss function, with the details given in Eq. 6.
$$ L_{regLoss} = \lambda \sum_{i=1}^{n} \left( W_i^2 \cdot \frac{1}{N_{w_i}} \right) $$
(6)
This is the L2 regularization loss, which corresponds to weight decay and penalizes large weight values so the model fits the data without over-fitting. In the formula, \( W_i \) denotes the weights of the i-th layer and \( N_{w_i} \) is the size of \( W_i \). The hyper-parameter λ is set to 0.0001 here.
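Equation 6 amounts to summing, over layers, the squared weights divided by the layer size (i.e. the mean squared weight) and scaling by λ; a minimal sketch:

```python
import numpy as np

def reg_loss(layer_weights, lam=1e-4):
    """Eq. 6: for each layer, the mean squared weight (sum of W_i^2
    divided by N_wi), summed over layers and scaled by lambda."""
    return lam * sum(np.sum(w ** 2) / w.size for w in layer_weights)
```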
Training detail
To obtain better training results, we did not initialize the weights randomly. The weight initialization has two parts: in the convolutional backbone, we used a model pre-trained on COCO, and in the network head, we initialized the weights from a Gaussian distribution. Similar to transfer learning, we then fine-tuned the convolutional backbone of our framework on our collected data.
Moreover, we tried several convolutional networks for extracting feature maps from the original image: Residual Network-101 with Atrous Convolution (R101A), Residual Network-101 with Atrous Convolution in a Feature Pyramid Network (R101FA), and InceptionV2-ResNet with Atrous Convolution (IV2RA). Through experiments, we found that the R101FA backbone gives the best segmentation results. Before training, the images are resized to a width of 1024 for a proper network input. Then, similar to [10], the input passes through five convolution stages C1, C2, C3, C4, and C5, which have strides of 2, 4, 8, 16, and 32 relative to the input image.
After feature extraction, the RPN processes the output of the backbone. First, the RPN generates N anchors (with anchor scales defined on the original image) centered on the sliding-window positions. Then, we calculate the IoU of each anchor to decide whether it is positive or negative. As in [8, 9], each image has N sampled RoIs with a 1:3 ratio of positives to negatives. Each positive anchor is then pooled to a fixed size, after which a fully connected network extracts a 2048-dimensional feature vector used by the classifier and the box regressor. At the same time, the RoIs pass through two convolution layers, from which we predict the image mask.
Burn area calculation
Burn area calculation is an important part of burn diagnosis. The framework described above is an auxiliary technique for calculating the burn area and is of great significance for fast, convenient, and accurate burn area calculation. As shown in Fig. 1, the second method, for example, requires manually marking the edge of the burn wound during calculation; like the 3D application, this is not conducive to the rapid treatment of patients. However, if we combine our segmentation framework with this software, we obtain a more efficient and convenient area calculation tool. In this sense, our framework can be applied to the calculation of burn wound area.
In our plan, we intend to combine 3D modeling and mesh parameterization technology with our segmentation framework to calculate the burn wound area. The calculation consists of three main steps:
- Step-1: Build a 3D model of the patient pictures through 3D reconstruction technology.
- Step-2: Map the 3D model to a planar domain using a mesh parameterization algorithm.
- Step-3: Segment the burn regions with our framework and calculate the total body surface area (TBSA) value.
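The paper does not give the exact formula for Step-3; under the assumption that the mesh parameterization is area-preserving, a rough sketch of the final ratio computation could be (`body_mask` is a hypothetical mask of the visible body surface in the planar domain):

```python
import numpy as np

def estimate_burned_fraction(burn_mask, body_mask):
    """Hypothetical Step-3 computation: the burned fraction of the body
    surface, taken as the ratio of burned pixels to body pixels in the
    parameterized planar domain (assumed area-preserving)."""
    body = body_mask.sum()
    burned = np.logical_and(burn_mask, body_mask).sum()
    return 100.0 * burned / body if body > 0 else 0.0
```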
Some 3D reconstruction technologies are already mature, such as the BodyTalk reconstruction system [15] and the Kinect reconstruction system [16], which makes building the 3D model easier. Mesh parameterization algorithms such as Ricci flow [17] and authalic parameterization [18] make the second step easier to implement. Hence, our segmentation framework can support faster, easier, and more accurate area calculation.