Xu et al. [DIM] proposed an auto-encoder architecture to predict the alpha matte from an RGB image and a trimap. Applying trimap-based methods in practice requires an additional step to obtain the trimap, which is commonly implemented with a depth camera, e.g., ToF [ToF].

Classic interactive techniques work differently: with GrabCut, we draw a rectangle over the object of interest (the foreground) and iteratively improve the result by marking the parts the algorithm got wrong, adding pixels to the foreground or removing pixels from it. But the results are not so great when we do not have access to a green screen.

In MODNet, the fusion branch is simply a CNN module that combines the semantics and the details, where an upsampling step is needed to align the accurate details with the coarse semantics. Since s_p is supposed to be smooth, an L2 loss is used for the semantic supervision: L_s = (1/2) ||s_p − G(α_g)||_2, where α_g is the ground-truth matte and G stands for 16× downsampling followed by Gaussian blur.

Comparisons of model size and execution efficiency: Table 3 shows the quantitative results on the aforementioned benchmark. We measure the model size by the total number of parameters, and we reflect the execution efficiency by the average inference time over PHM-100 on an NVIDIA GTX 1080Ti GPU (input images are cropped to 512×512). MODNet outperforms even the trimap-based DIM, which reveals the superiority of its network architecture. Moreover, MODNet can be easily optimized end-to-end since it is a single well-designed model instead of a complex pipeline.
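As a rough sketch of the semantic supervision above, the following NumPy-only snippet approximates G with 16×16 average pooling in place of true 16× downsampling plus Gaussian blur; the function and variable names are mine, not the paper's.

```python
import numpy as np

def G(alpha, factor=16):
    """Stand-in for the paper's G (16x downsampling followed by Gaussian
    blur), approximated here by 16x16 average pooling for brevity."""
    h, w = alpha.shape
    h, w = h // factor * factor, w // factor * factor
    blocks = alpha[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def semantic_loss(s_p, alpha_g):
    """L_s = 1/2 * L2 distance between the coarse semantic prediction s_p
    and the downsampled/blurred ground-truth matte G(alpha_g)."""
    return 0.5 * np.sum((s_p - G(alpha_g)) ** 2)

# A perfect coarse prediction yields zero loss.
alpha_g = np.ones((32, 32))
s_p = np.ones((2, 2))
print(semantic_loss(s_p, alpha_g))  # 0.0
```

The pooling stand-in keeps the sketch dependency-free; a real implementation would use a proper Gaussian filter before subsampling.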
Consistency is one of the most important assumptions behind many semi-/self-supervised [semi_un_survey] and domain adaptation [udda_survey] algorithms. The supervised approach takes an input and learns to remove the background based on a corresponding ground truth, just like usual networks. Real-world data can be divided into multiple domains according to different device types or diverse imaging methods. We regard small objects held by people as a part of the foreground since this is more in line with practical applications.

Compared with prior approaches, MODNet is lightweight in terms of both input and pipeline complexity. As you can see, the network is basically composed of downsampling, convolutions, and upsampling. Since the decomposed sub-objectives are correlated and help strengthen each other, MODNet can be optimized end-to-end. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objectives consistency (SOC) to adapt MODNet to real-world data and (2) a one-frame delay (OFD) trick to smooth the results when applying MODNet to video human matting.

Coming back to the full architecture, we can see that a one-frame delay is applied for video. The flicker criterion C indicates that if the values of i_{t−1} and i_{t+1} are close, and i_t is very different from both i_{t−1} and i_{t+1}, a flicker appears in i_t. As shown in Fig. 9, when a moving object suddenly appears in the background, the result of BM is affected, but MODNet is robust to such disturbances.

Wang et al. [net_hrnet] proposed to keep high-resolution representations throughout the model and exchange features between different resolutions, which induces huge computational overheads.

In this section, we first introduce the PHM-100 benchmark for human matting. We first pick the portrait foregrounds from AMD.
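The one-frame-delay rule above can be sketched as follows. This is a minimal NumPy illustration with my own names; the tolerance parameter `tol` is a hypothetical stand-in for whatever threshold the paper uses to decide that two values are "close".

```python
import numpy as np

def ofd(prev, cur, nxt, tol=0.1):
    """One-frame-delay smoothing: a pixel of the middle frame flickers
    when prev and nxt agree but cur differs from both; replace it with
    the average of its neighbours in time."""
    close = np.abs(prev - nxt) <= tol
    far = (np.abs(cur - prev) > tol) & (np.abs(cur - nxt) > tol)
    flicker = close & far
    out = cur.copy()
    out[flicker] = 0.5 * (prev[flicker] + nxt[flicker])
    return out

prev = np.zeros((2, 2))
nxt = np.zeros((2, 2))
cur = np.array([[1.0, 0.0], [0.0, 0.0]])  # one flickering pixel at (0, 0)
print(ofd(prev, cur, nxt))  # the flicker is reset to 0.0
```

Because the rule needs frame t+1, the output for frame t is delayed by one frame, which is negligible at 30 fps or more.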
Based on the low-resolution semantic branch, a high-resolution branch (supervised by the transition region (α ∈ (0, 1)) in the ground-truth matte) is introduced to focus on the human boundaries. Moreover, MODNet suffers less from the domain shift problem in practice thanks to the proposed SOC and OFD. In contrast to trimap-based methods, MODNet avoids the trimap issue entirely by decoupling from the trimap input.

Intuitively, semantic estimation outputs a coarse foreground mask while detail prediction produces fine foreground boundaries, and semantic-detail fusion aims to blend the features from the first two sub-objectives. MODNet is easy to train in an end-to-end style. First, semantic estimation becomes more efficient since it is no longer done by a separate model that contains a decoder.

To guarantee sample diversity, we define several classifying rules to balance the sample types in PHM-100. So, we argue that PHM-100 is a more comprehensive benchmark. For example, Shen et al. [SHM] assembled a trimap generation network before the matting network. Attention [attention_survey] for deep neural networks has been widely explored and proven to boost performance notably. Here we only provide visual results (refer to our online supplementary video for more results).

[DIM] suggested using background replacement as a data augmentation to enlarge the training set, and it has become a typical setting in image matting. If the fps is greater than 30, the delay caused by waiting for the next frame in OFD is negligible. We process the transition region around the foreground human with a high-resolution branch D, which takes I, S(I), and the low-level features from S as inputs.
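The background-replacement augmentation mentioned above boils down to the standard compositing equation I = αF + (1 − α)B. A minimal sketch (all names are mine):

```python
import numpy as np

def replace_background(fg, alpha, bg):
    """Composite a foreground over a new background using its alpha matte:
    I = alpha * F + (1 - alpha) * B."""
    a = alpha[..., None]  # broadcast the (H, W) matte over the RGB channels
    return a * fg + (1.0 - a) * bg

fg = np.ones((4, 4, 3))            # white foreground
bg = np.zeros((4, 4, 3))           # black background
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0              # a small square of pure foreground
out = replace_background(fg, alpha, bg)
print(out[2, 2])  # [1. 1. 1.] -> foreground kept where alpha = 1
```

Repeating this with many random backgrounds (e.g., from OpenImage) turns one labeled foreground into many training samples.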
To obtain better results, some matting models [GCA, IndexMatter] combined spatial-based attention mechanisms, arguing that attention could help improve matting performance; spatial attention is, however, time-consuming. In MODNet, we instead integrate channel-based attention so as to balance performance and efficiency.

As shown in Fig. 3, M has three outputs for an unlabeled image Ĩ. We force the semantics in α̃_p to be consistent with s̃_p and the details in α̃_p to be consistent with d̃_p by: L_cons = (1/2) ||G(α̃_p) − s̃_p||_2 + m̃_d ||α̃_p − d̃_p||_1, where m̃_d indicates the transition region in α̃_p, and G has the same meaning as in the semantic loss.

Although the training images have monochromatic or blurred backgrounds, the labeling process still needs to be completed by experienced annotators with a considerable amount of time and the help of professional tools. The classifying rules include, for example: (1) whether the whole human body is included; (2) whether the image background is blurred; and (3) whether the person holds additional objects.

There is a low-resolution branch which estimates the human semantics. For human matting without a green screen (also known as the blue screen technology), existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive: two powerful models are needed if you would like to achieve somewhat accurate results. For each foreground, we generate 5 samples by random cropping and 10 samples by compositing backgrounds from the OpenImage dataset [openimage].

Another contribution of this work is a carefully designed validation benchmark for human matting. We also briefly discuss some other techniques related to the design and optimization of our method. In the segmentation step, we produce a mask where the pixels belonging to the person are set to 1 and the rest of the image is set to 0.
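Assuming the consistency loss has the form given above (a semantic L2 term on the blurred thumbnail plus a detail L1 term inside the transition region), it can be sketched as follows; G is again approximated by average pooling, and all names are mine.

```python
import numpy as np

def G(x, factor=16):
    """Stand-in for the paper's G (16x downsampling + Gaussian blur),
    approximated by average pooling for this sketch."""
    h, w = (d // factor * factor for d in x.shape)
    return x[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def soc_loss(alpha_p, s_p, d_p, m_d):
    """Sub-objectives consistency: the fused matte alpha_p should agree
    with the semantic output s_p (L2 on the blurred thumbnail) and with
    the detail output d_p (L1 inside the transition region m_d)."""
    semantic = 0.5 * np.sum((G(alpha_p) - s_p) ** 2)
    detail = np.sum(m_d * np.abs(alpha_p - d_p))
    return semantic + detail

# When the three outputs already agree, the consistency loss is zero.
alpha_p = np.ones((32, 32))
print(soc_loss(alpha_p, G(alpha_p), alpha_p, np.zeros((32, 32))))  # 0.0
```

Note that no ground-truth matte appears anywhere in this loss: the three predictions supervise each other, which is what makes SOC usable on unlabeled real-world data.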
No ground-truth mattes are available for this self-supervised fine-tuning. To prevent the predictions from degrading, we duplicate M to M′ and fix the weights of M′ before performing SOC.

When the background is not a green screen, the problem is ill-posed, since all variables on the right-hand side of the compositing equation are unknown. Cho et al. [NIMUDCNN] and Shen et al. [DAPM] combined classic algorithms with CNNs for alpha matte refinement. So, is a green screen really necessary for real-time human matting? For example, background matting [BM] replaces the trimap with a separate background image.

Finally, a fusion branch, also supervised by the whole ground-truth matte, is added to predict the final alpha matte, which is then used to remove the background from the input image. We also compare MODNet against the background matting (BM) approach proposed by [BM]. The inference time of MODNet is 15.8 ms (63 fps), which is twice the frame rate of the previously fastest method, FDMPA (31 fps). You can see how much computing power such techniques need.

We use Mean Squared Error (MSE) and Mean Absolute Difference (MAD) as quantitative metrics. Note that fewer parameters do not imply faster inference speed, because of large feature maps or time-consuming mechanisms, e.g., attention, that a model may have.

A version of this model is currently used by most websites that automatically remove the background from your pictures. From the segmentation, we can then generate the trimap through dilation and erosion. Moreover, to alleviate the flicker between video frames, we apply the one-frame delay trick as post-processing (Sec. 4.1).

Most existing matting methods take a pre-defined trimap as an auxiliary input, which is a mask containing three regions: absolute foreground (α = 1), absolute background (α = 0), and unknown area (α = 0.5). Specifically, MODNet has a low-resolution branch (supervised by the thumbnail of the ground-truth matte) to estimate human semantics.
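Generating a trimap from a binary segmentation through dilation and erosion, as described above, can be sketched with plain NumPy. The naive 4-neighbour morphology below is mine; in practice one would use OpenCV's `cv2.dilate`/`cv2.erode` with a proper structuring element.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Naive binary dilation with a 4-neighbour cross structuring element."""
    out = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(out, 1)
        out = p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    return out

def erode(mask, iterations=1):
    """Erosion as the dual of dilation."""
    return ~dilate(~mask.astype(bool), iterations)

def make_trimap(seg, iterations=1):
    """Eroded foreground -> 1, complement of dilated foreground -> 0,
    and the band in between -> 0.5 (the unknown region)."""
    tri = np.full(seg.shape, 0.5)
    tri[erode(seg, iterations)] = 1.0
    tri[~dilate(seg, iterations)] = 0.0
    return tri

seg = np.zeros((7, 7), dtype=bool)
seg[2:5, 2:5] = True                 # a solid foreground square
tri = make_trimap(seg)
print(tri[3, 3], tri[0, 0], tri[2, 2])  # 1.0 0.0 0.5
```

The number of erosion/dilation iterations controls the width of the unknown band that the matting model must resolve.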
This strategy is called self-supervised because the network does not have access to the ground truth of the videos it is trained on. Human matting is an extremely interesting task where the goal is to find any human in a picture and remove the background from it. Methods based on multiple models [SHM, BSHM, DAPM] have shown that regarding trimap-free matting as a trimap prediction (or segmentation) step plus a trimap-based matting step can achieve better performance. However, these methods consist of multiple models and constrain the consistency among their predictions. The main problem of all these methods is that they cannot be used in interactive applications, since (1) the background images may change from frame to frame and (2) using multiple models is computationally expensive.

As for the advantages of MODNet over trimap-based methods: MODNet has better generalization ability thanks to our SOC strategy, and the result of assembling SE-Block proves the effectiveness of re-weighting the feature maps. To adapt to real-world data, MODNet is fine-tuned on the unlabeled data by using the consistency between sub-objectives. In Fig. 7, we composite the foreground over a green screen to emphasize that SOC is vital for generalizing MODNet to real-world data.

I strongly recommend reading the paper [1] for a deeper understanding of this new technique. Now, do you really need a green screen for real-time human matting?

[2] (2020), https://github.com/ZHKKKe/MODNet
[3] Xu, N. et al., Deep Image Matting, Adobe Research (2017), https://sites.google.com/view/deepimagematting
[4] GrabCut algorithm by OpenCV, https://docs.opencv.org/3.4/d8/d83/tutorial_py_grabcut.html
In computer vision, we can divide attention mechanisms into spatial-based and channel-based according to their operating dimension. To successfully remove the background with the Deep Image Matting technique, we need a powerful network able to localize the person somewhat accurately. The compositional loss calculates the absolute difference between the input image and the image re-composited from the ground-truth foreground and the ground-truth background.

Unfortunately, the method is not able to handle strange costumes and strong motion blurs that are not covered by the training set. Many techniques use basic computer vision algorithms to achieve this task, such as the GrabCut algorithm, which is extremely fast but not very precise.

We use MobileNetV2 pre-trained on the Supervisely Person Segmentation (SPS) [SPS] dataset as the backbone of all trimap-free models. Instead of a trimap, MODNet only applies an independent high-resolution branch to handle the foreground boundaries, whose supervision mask m_d is generated through dilation and erosion on the ground-truth matte α_g. Hence, the PHM-100 benchmark can reflect the matting performance more comprehensively.

However, a trimap is costly for humans to annotate, or suffers from low precision if captured via a depth camera. Given a trimap, matting algorithms only have to estimate the foreground probability inside the unknown area, based on the prior from the other two regions. Sengupta et al. [BM] proposed to capture a less expensive background image as a pseudo green screen to alleviate this issue; however, this is a more complicated approach compared to MODNet. The background replacement of [DIM] is applied to extend our training set. For example, Ke et al. [GCT] designed a consistency-based framework that could be used for semi-supervised matting.
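The compositional loss described above can be sketched as follows. This is a plain L1 version with my own names; DIM actually uses a smoothed variant of the absolute difference.

```python
import numpy as np

def compositional_loss(image, alpha_p, fg_g, bg_g):
    """Absolute difference between the input image and the image
    re-composited from the ground-truth foreground/background using
    the *predicted* alpha matte."""
    a = alpha_p[..., None]
    recomposited = a * fg_g + (1.0 - a) * bg_g
    return np.mean(np.abs(image - recomposited))

fg = np.ones((4, 4, 3))
bg = np.zeros((4, 4, 3))
alpha_true = np.zeros((4, 4))
alpha_true[1:3, 1:3] = 1.0
image = alpha_true[..., None] * fg + (1.0 - alpha_true[..., None]) * bg

print(compositional_loss(image, alpha_true, fg, bg))  # 0.0 for a perfect matte
```

A wrong matte re-composites the wrong image, so the loss grows with the matting error even in regions where the matte itself is hard to supervise directly.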