1. Introduction
Maritime video surveillance has become increasingly important for traffic situational awareness in a range of maritime applications (Bloisi et al., Reference Bloisi, Previtali, Pennisi, Nardi and Fiorini2017). In particular, the widely-used surveillance cameras provide indispensable visual information for video surveillance to enhance transportation safety and security. As an essential function of maritime surveillance, visual data-based vessel detection has attracted considerable attention due to its practical importance. Accurate vessel detection offers excellent potential for promoting navigational safety monitoring, collision avoidance, vessel traffic services, etc. However, the quality of visual information obtained from surveillance cameras will directly affect the efficiency of vessel detection. With the rapid rise of deep learning, it is feasible to build learning-based methods that significantly enhance visual information quality in real time, potentially leading to improved vessel detection in maritime supervision. Maritime safety and security can thus be guaranteed in ports and other maritime infrastructures.
In the current literature, many vision-based vessel detection and tracking technologies have been developed (Zhang et al., Reference Zhang, Kopca, Tang, Ma and Wang2017; Yang et al., Reference Yang, Park and Rashid2018; Liu et al., Reference Liu, Nie, Garg, Xiong, Zhang and Hossain2021a, Reference Liu, Yuan, Chen and Lu2021b) for effective maritime surveillance. These methods have been verified to generate satisfactory vessel detection results under normal lighting conditions. Under low-light imaging conditions, however, much valuable visual information is buried in the dark, especially around the moving objects of interest. To overcome these environmental effects, Nie et al. (Reference Nie, Yang and Liu2019) designed a vessel detection method for different weather conditions. However, that method did not enhance the essential information hidden in the dark; it only expanded the dataset by simulating various environments to improve the robustness of the vessel detection network. Therefore, it is necessary to develop an efficient low-visibility enhancement network as a preprocessing step. The advantage of this strategy is that it can directly serve any detection network under low-light environments without retraining.
Low-visibility enhancement is a topic that has attracted sustained attention. Traditional model-based methods often design special hand-crafted priors to obtain enhanced results. For instance, Fu et al. (Reference Fu, Zeng, Huang, Ding and Zhang2013) introduced the bright channel prior (BCP), different from the popular dark channel prior originally proposed by He et al. (Reference He, Sun and Tang2010), to perform luminance estimation. Fu et al. (Reference Fu, Zeng, Huang, Zhang and Ding2016) introduced a weighted variational model, which is significantly different from previous models. It is capable of separately estimating reflection and illumination components. Subsequently, Guo et al. (Reference Guo, Li and Ling2017) constructed a structure prior model to optimise the illumination component. These methods have been verified to produce satisfactory results on traditional images. However, due to the significant differences in textural structures, these methods often fail to enhance maritime images accurately under poor imaging conditions.
Furthermore, many studies on the enhancement of low-visibility maritime images have been carried out to obtain satisfactory visual performance. Yang et al. (Reference Yang, Nie and Liu2019a, Reference Yang, Liu, Yang and Guo2019b) proposed a coarse-to-fine luminance estimation-based marine low-visibility enhancement method. Inspired by the convolutional neural network (CNN), Guo et al. (Reference Guo, Lu, Liu, Yang and Chui2020) constructed a low-light maritime image enhancement framework by combining traditional model estimation and deep learning. Although these methods can obtain satisfactory visual effects, it is challenging to complete real-time processing tasks due to the high computational complexity. To eliminate these potential limitations, a lightweight neural network for real-time low-visibility enhancement is proposed. Compared with previous studies, the proposed low-visibility enhancement network (LVENet) differs from other competing methods as follows.
• A CNN-enabled LVENet is designed to enhance the visual qualities of images captured under low-light imaging conditions. In particular, the network is designed based on the Retinex theory. Therefore, high-quality images can be obtained quickly with fewer calculations.
• To guarantee real-time maritime surveillance, it is proposed to replace ordinary convolutions with depthwise separable convolutions to significantly reduce network parameters and increase computational speed.
• A synthetically-degraded image generation method and a hybrid loss function are proposed to further enhance the robustness and generalisation capacities of the lightweight neural network.
• Image enhancement and vessel detection experiments under low-visibility conditions have demonstrated that the method has superior image enhancement results and can improve detection accuracy. In addition, the network obtains the best running time compared with other competing image enhancement methods.
The rest of this paper is categorised into the following sections. Section 2 presents a review of related works. In Section 3, the LVENet is introduced in detail. Network training details and numerical experiments are presented in Section 4. Finally, Section 5 summarises the main contributions and future works.
2. Related works
2.1 Maritime video surveillance system
Both traditional remote sensing-based technologies, for example, synthetic aperture radar (Yang et al., Reference Yang, Park and Rashid2018; Chaturvedi, Reference Chaturvedi2019; Liu et al., Reference Lu, Yang and Liu2021), and the automatic identification system (AIS) (Zhang et al., Reference Zhang, Goerlandt, Kujala and Wang2016, Reference Zhang, Kopca, Tang, Ma and Wang2017) have achieved salient results in maritime surveillance. However, these methods require special equipment to be installed on both vessels and observation stations. Unfortunately, some vessels are not equipped with these devices, which leads to the failure of maritime supervision. Meanwhile, some illegal vessels attempt to escape detection and surveillance by intentionally shutting down the related equipment. Therefore, the maritime video surveillance system is crucial for further improving maritime supervision and emergency rescue capabilities. Though videos and images can provide more intuitive information for managers, long-term observation will cause visual fatigue and eventually vital information may be overlooked. With the emergence of deep learning technology and target detection, many high-quality vessel detection methods have been proposed, making it possible to build an autonomous maritime video surveillance system.
As shown in Figure 1, the workflow of the proposed deep learning-based maritime video surveillance system consists of three components: visual sensing data acquisition (VSDA), visual data processing and analysis (VDPA), and maritime applications. VSDA can obtain video and image information collected by various devices in the ocean, on land, and in the sky. The VDPA receives and processes visual information through special hardware, assisting supervisors to complete different maritime surveillance and management tasks comprised in maritime applications. However, the visual data collected by VSDA is easily affected by weather conditions which can directly affect the normal operation of VDPA. In previous studies, the step of visual data preprocessing is usually ignored, which will cause serious problems. For example, directly detecting the target on the video/image captured under low-light conditions will significantly reduce the accuracy of vessel detection and even result in detection failure. For the sake of better supervision, it is important to develop a low-visibility enhancement preprocessing network to promote the robustness and accuracy of vessel detection.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig1.png?pub-status=live)
Figure 1. The workflow of the proposed deep learning-based maritime video surveillance system. Note that UAV and USV denote unmanned aerial vehicle and unmanned surface vehicle, respectively
2.2 Low-visibility enhancement methods
In the current literature, low-visibility enhancement methods can be divided into three broad categories: plain methods, model-based methods, and deep learning-based methods.
2.2.1 Plain methods
Histogram equalisation (Pisano et al., Reference Pisano, Zong, Hemminger, DeLuca, Johnston, Muller, Braeuning and Pizer1998) and its improved versions (Kim, Reference Kim1997; Chen and Ramli, Reference Chen and Ramli2003; Tan et al., Reference Tan, Sim and Tso2012) belong to the classic low-light enhancement methods that can force the greyscale histogram to the full range by contrast stretching. In particular, histogram equalisation-based methods have the capacity of guaranteeing contrast enhancement. However, the contrast stretching-based methods easily cause over- and under-enhancement. To further improve image quality, gamma correction individually conducts a non-linear operation on each pixel to enhance illumination. Because the relationship between adjacent pixels is ignored, the restored images may be inconsistent with the latent sharp versions.
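For concreteness, the sketch below applies the two plain operations just described, histogram equalisation and gamma correction, with OpenCV and NumPy; the input path and the gamma value of 0·5 are illustrative choices rather than settings from the cited works.

```python
import cv2
import numpy as np

# Read a low-light image (path is illustrative).
img = cv2.imread('low_light.jpg')

# Histogram equalisation: stretch the luminance histogram over the full range.
# Equalising only the luma channel avoids distorting colours.
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
he_result = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# Gamma correction: a pixel-wise non-linear mapping; gamma < 1 brightens.
gamma = 0.5  # illustrative value
norm = img.astype(np.float32) / 255.0
gc_result = np.clip((norm ** gamma) * 255.0, 0, 255).astype(np.uint8)
```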
2.2.2 Model-based methods
Unlike plain methods, model-based methods focus on decomposing the image and then processing the components separately. Inspired by the popular Retinex theory (Land, Reference Land1977), the colour image I with red, green, and blue (RGB) channels can be decomposed as follows:
(1)\begin{equation}I(x )= \tilde{I}(x )\ast L(x ),\end{equation}
where x is the pixel index, ${\ast}$ denotes the element-wise multiplication operator, I, $\tilde{I}$
, and L are the captured low-visibility image, reflection map, and illumination map, respectively. It is well known that the reflection map contains textural details and rich colour. In contrast, the illumination map contains only single-channel luminance information. Previous attempts estimated the smoothed illumination using a Gaussian filter, for example, single-scale Retinex (Jobson et al., Reference Jobson, Rahman and Woodell1997a, Reference Jobson, Rahman and Woodell1997b), multi-scale Retinex (MSR) (Jobson et al., Reference Jobson, Rahman and Woodell1997a, Reference Jobson, Rahman and Woodell1997b), and MSR with colour restoration (Jiang et al., Reference Jiang, Woodell and Jobson2015). Though these methods can enhance the image illumination, direct reduction of illumination will cause unnatural effects and over-enhancement. Wang et al. (Reference Wang, Zheng, Hu and Li2013) thus developed the naturalness-preserved enhancement method, which can significantly improve the image contrast and make the illumination more natural. However, under certain circumstances (excessively dark regions), the enhancement results can appear as an unnatural grey colour. Fu et al. (Reference Fu, Zeng, Huang, Ding and Zhang2013) presented the BCP-regularised method to generate smooth illumination. However, the image details are over-smoothed because a variational approach is implemented to process the reflection. Fu et al. (Reference Fu, Zeng, Huang, Zhang and Ding2016) used a novel weighted variational model (SRIE) for estimating the reflection and illumination and found that logarithmic transformation can significantly highlight the details of dark regions. SRIE can enhance the target version by controlling the illumination and suppressing unwanted noise. To further improve the enhancement results, the low-light image enhancement framework (LIME) (Guo et al., Reference Guo, Li and Ling2017) was presented to refine the illumination by introducing a structural prior.
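As a concrete illustration of Equation (1) and the Gaussian-filter-based estimators mentioned above, the following minimal single-scale Retinex sketch approximates the illumination by Gaussian smoothing and recovers the reflection in the log domain; the smoothing scale and the channel-wise rescaling are illustrative choices, not settings from the cited works.

```python
import cv2
import numpy as np

def single_scale_retinex(img_bgr, sigma=80.0):
    """Minimal single-scale Retinex: R = log(I) - log(Gaussian(I))."""
    img = img_bgr.astype(np.float32) + 1.0                # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)   # smoothed illumination estimate
    reflection = np.log(img) - np.log(illumination)       # log-domain division of Eq. (1)
    # Linearly rescale each channel to [0, 255] for display.
    out = np.zeros_like(reflection)
    for c in range(3):
        ch = reflection[:, :, c]
        out[:, :, c] = 255.0 * (ch - ch.min()) / (ch.max() - ch.min() + 1e-6)
    return out.astype(np.uint8)
```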
2.2.3 Deep learning-based methods
Driven by deep learning, CNN has been widely used in image processing, for example, in denoising (Lu et al., Reference Lu, Yang and Liu2021), dehazing (Li et al., Reference Li, Peng, Wang, Xu and Feng2017), deraining (Yang et al., Reference Yang, Nie and Liu2019a, Reference Yang, Liu, Yang and Guo2019b), and super-resolution (Dong et al., Reference Dong, Chen, He and Tang2016). Due to the strong non-linear learning capability of CNN, deep learning-based methods have been widely presented to perform low-visibility image enhancement. For instance, Lore et al. (Reference Lore, Akintayo and Sarkar2017) constructed an encoder-decoder structure (LLNet) to optimise the illumination without over-amplifying the lighter regions. Inspired by MSR (Jobson et al., Reference Jobson, Rahman and Woodell1997a, Reference Jobson, Rahman and Woodell1997b), Shen et al. (Reference Shen, Yue, Fen, Chen, Liu and Ma2017) proposed an end-to-end MSR-based network (MSRNet) to directly learn the projection between low-light and latent sharp images. However, the restored results generated by MSRNet often appear unnatural in practice. In 2018, a Retinex-enabled network (RetinexNet) (Wei et al., Reference Wei, Wang, Yang and Liu2018) was proposed, which introduced a component separation method and employed a decomposition network. However, the images enhanced by RetinexNet often suffer from serious distortion problems due to inaccurate component estimation. To solve this problem, Zhang et al. (Reference Zhang, Zhang and Guo2019) proposed a strong low-light image enhancer, which can correctly deal with the issue of image distortion.
Though current studies on plain, model-, and learning-based methods have made breakthroughs, these methods fail to directly serve maritime supervision due to the enormous difference between maritime and traditional imaging scenarios. In addition, most strategies require expensive calculations and fail to perform real-time video monitoring in maritime applications.
2.3 Automatic vessel detection methods
Vessel detection is important in maritime video surveillance, aiming to automatically detect the moving vessels of interest in specific waters. It makes it possible to identify abnormal vessel behaviours in a timely manner, leading to improved maritime traffic safety. In the literature, several vessel detection methods have been implemented to enable intelligent maritime surveillance. For instance, Zhu et al. (Reference Zhu, Zhou, Wang and Guo2010) proposed an intact hierarchy method to detect vessels from spaceborne optical images. A visual monitoring strategy for cage culture (Hu et al., Reference Hu, Yang and Huang2011) was proposed to detect and track vessels automatically. More recently, Chen et al. (Reference Chen, Wang, Shi, Wu, Zhao and Fu2019) adopted a multi-view learning method which is capable of extracting highly coupled vessel appearance and shape features. These methods have verified their detection performance through massive experiments, but they generally require extensive calculations. Therefore, it is still hard to detect dynamic vessels in real time.
The development of CNN has enabled many novel vessel detection methods to be designed. Wu et al. (Reference Wu, Zhou, Wang and Ma2018) proposed a novel inshore ship detection method, which is able to estimate the location of possible ship heads and the rough ship directions through global search. Kim et al. (Reference Kim, Hong, Choi and Kim2018) introduced a deep learning-enabled novel probabilistic ship detection and classification system. Shao et al. (Reference Shao, Wang, Wang, Du and Wu2020) designed a saliency-aware CNN framework for ship detection, which mainly extracts comprehensive discriminative ship features. In addition, hybrid kernelised correlation filtering and anomaly cleaning have been combined to track moving vessels from maritime visual sensing data. Meanwhile, Chen et al. (Reference Chen, Yang, Wang, Wu, Tang, Zhao and Wang2020) proposed a coarse-to-fine cascaded CNN (CFCCNN) to distinguish vessels with similar visual appearance. These methods can satisfy real-time processing requirements on high-performance equipment and further improve vessel detection accuracy. However, in low-light imaging environments, the accuracy of vessel detection will be significantly reduced. The primary reason for worse performance is that vital information is hidden in the dark and fails to be extracted by the vessel detection network. Therefore, to improve vessel detection results in low-visibility conditions, it is necessary to improve the visual quality delivered by the imaging cameras installed in a maritime video surveillance system.
3. LVENet
This section is dedicated to developing the proposed LVENet, whose flowchart is summarised in Figure 2. This study designs a lightweight CNN for learning the features of maritime low-visibility scenes. In particular, depthwise separable convolution is used instead of traditional convolution to reduce model parameters and improve calculation speed; to the best of the authors' knowledge, this recent advance has not previously been adopted for low-light image enhancement. Furthermore, a hybrid loss function is constructed to supervise the network training and enhance the network generalisation.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig2.png?pub-status=live)
Figure 2. The flowchart of the proposed LVENet. DS-Conv and DS-DConv represent depthwise separable convolution and depthwise separable deconvolution, respectively
3.1 Network architecture
Inspired by the Retinex theory, the proposed LVENet also assumes that a single low-visibility image I can be decomposed into reflection $\tilde{I}$ and illumination L. Once a satisfactory illumination component is obtained, the enhanced image (i.e., reflection component) can be easily obtained by
(2)\begin{equation}\tilde{I}(x )= \frac{{I(x )}}{{L(x )}}.\end{equation}
Since the illumination component L only contains the single-channel brightness information, it is possible to obtain a stable enhancement effect with fewer calculations. Therefore, LVENet is devoted to learning the mapping between low-visibility image I and illumination component L.
The architecture of LVENet is visually shown in Figure 2. The input of the network is the three-channel low-visibility image $\; I$, and the output is the estimated illumination map $\hat{L}$
with a single channel. In particular, LVENet first adopts convolutional layers and then deconvolutional layers. The first three convolutional layers exploit 8, 16, and 32 filters, respectively, to generate feature maps. Subsequently, three residual blocks (He et al., Reference He, Zhang, Ren and Sun2016) are stacked to learn the residual mapping, making it easier for the network to detect subtle differences. Mathematically, the k-th feature map of the m-th residual block is defined as follows
(3)\begin{equation}g_k^m = {\mathcal{A}}({{\mathcal{R}}_k^m + g_k^{m - 1}} ),\end{equation}
where ${\mathcal{A}}(\cdot)$ denotes the rectified linear units (ReLU) activation function, ${\mathcal{R}}_k^m$
is the k-th output feature map generated by two convolutions of the m-th residual block, and $g_k^{m - 1}$
represents the k-th input map obtained by the (m−1)-th residual block. Finally, three deconvolutional layers are employed to guarantee that the output illumination component $\hat{L}$
has the same spatial resolution as the input I. Note that the ReLU activation function is applied after each convolutional layer, and all convolutional/deconvolutional layers are composed of depthwise separable convolution/deconvolution to simplify calculations and reduce model parameters. The configuration details of the LVENet are shown in Table 1. Furthermore, the robust Adam technique (Kingma and Ba, Reference Kingma and Ba2014) is exploited for optimisation. The l-th updated model parameters ${\theta _l}$
can be updated by
(4)\begin{equation}{\theta _l} = {\theta _{l - 1}} - \partial \frac{{{{\hat{u}}_l}}}{{\sqrt {{{\hat{v}}_l}} + e}},\end{equation}
where $\partial$ and e represent the learning rate and minimal constant for avoiding outliers, ${\hat{u}_l}$
and ${\hat{v}_l}$
, respectively, denote the exponential decay averages of the (l−1)-th gradient ${u_{l - 1}}$
and the (l−1)-th square gradient ${v_{l - 1}}$
, and ${\theta _{l - 1}}$
is network parameters through the (l−1)-th updating. After obtaining the illumination component $\hat{L}$
, the enhanced image is finally generated, i.e.,
(5)\begin{equation}\hat{I}(x )= \frac{{I(x )}}{{\max ({\hat{L}(x ),\gamma } )}},\end{equation}
with $\gamma$ being a particular parameter to avoid over-enhancement effect and suppress outliers. In the experiments, $\gamma = 0 {\cdot}15$
was empirically selected to obtain satisfactory imaging performance.
Table 1. Configurations of proposed LVENet
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab1.png?pub-status=live)
Note: C: convolution; DC: transposed convolution.
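To make the architecture concrete, the following PyTorch sketch summarises the structure described above in schematic form: three depthwise separable convolutional layers with 8, 16, and 32 filters, three residual blocks, three depthwise separable deconvolutional layers producing a single-channel illumination map, and the division of Equation (5). The kernel sizes, strides, sigmoid output, and the exact form of the final division are assumptions for illustration; they do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 conv + 1x1 pointwise conv + ReLU."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class DSDeconv(nn.Module):
    """Depthwise separable deconvolution: depthwise transposed conv + 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, kernel_size=4, stride=2, padding=1, act=True):
        super().__init__()
        self.depthwise = nn.ConvTranspose2d(c_in, c_in, kernel_size, stride=stride,
                                            padding=padding, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)
        self.act = nn.ReLU(inplace=True) if act else nn.Identity()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class ResidualBlock(nn.Module):
    """Two DS convolutions with a skip connection, roughly Equation (3)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(DSConv(c, c), DSConv(c, c))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class LVENetSketch(nn.Module):
    """Schematic LVENet: three DS-Conv layers (8/16/32 filters), three residual blocks,
    and three DS-DConv layers producing a single-channel illumination map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(DSConv(3, 8),
                                     DSConv(8, 16, stride=2),
                                     DSConv(16, 32, stride=2))
        self.blocks = nn.Sequential(*[ResidualBlock(32) for _ in range(3)])
        self.decoder = nn.Sequential(DSDeconv(32, 16),
                                     DSDeconv(16, 8),
                                     DSDeconv(8, 1, kernel_size=3, stride=1, act=False))

    def forward(self, x):
        # Sigmoid keeps the estimated illumination in (0, 1); an assumption of this sketch.
        return torch.sigmoid(self.decoder(self.blocks(self.encoder(x))))

def enhance(low_visibility, net, gamma=0.15):
    """Assumed form of Equation (5): divide by the illumination, bounded below by gamma."""
    illumination = net(low_visibility)
    return torch.clamp(low_visibility / torch.clamp(illumination, min=gamma), 0.0, 1.0)
```

For a 256 × 256 input patch, the estimated illumination has the same spatial resolution as the input, and `enhance` then brightens the image following Equation (5).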
3.2 Depthwise separable convolution and deconvolution
The powerful depthwise separable convolution was originally proposed by Howard et al. (Reference Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto and Adam2017). Unlike traditional convolution, depthwise separable convolution is actually composed of two convolutions. Figure 3 and Table 2 describe a classic case to distinguish the traditional convolution and depthwise separable convolution. Specifically, this case takes the three-channel feature map as input and the four-channel feature map as output. Four filters of the traditional convolutional layer simultaneously process each channel input and obtain four maps. It is worth mentioning that each filter contains three kernels of size 3 × 3. Therefore, the parameter quantity of the traditional convolution is 108.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig3.png?pub-status=live)
Figure 3. Usage case of traditional convolution and depthwise separable convolution
Table 2. Comparison of the parameter quantities between traditional convolution (Conv) and depthwise separable convolution (DS-Conv) in Figure 3
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab2.png?pub-status=live)
In contrast, the depthwise separable convolution splits this process into two steps: depthwise and pointwise convolutions. Depthwise convolution is adopted to obtain the spatial characteristic information (i.e., depthwise feature) of each channel. Inspired by grouped convolution, all filters in depthwise convolution only process the feature map of the corresponding channel. It is obvious that the number of depthwise feature maps obtained by depthwise convolution is equal to the number of input channels. Like traditional convolution, pointwise convolution directly performs convolution with a kernel size of 1 × 1 on all depthwise feature maps. In total, this depthwise separable convolution contains only 39 parameters. By comparison, depthwise separable convolution reduces the model parameters by nearly 64% in this case.
To restore the image resolution faster, a depthwise separable deconvolution module is designed, including the depthwise deconvolution and pointwise convolution. Unlike depthwise convolution, depthwise deconvolution replaces the traditional convolution with transposed convolution (Radford et al., Reference Radford, Metz and Chintala2015) to construct a feature map with a larger pixel size. A 1 × 1 convolution is then applied after the depthwise deconvolution for pointwise feature extraction.
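The parameter counts discussed above (108 versus 39 for the case in Figure 3 and Table 2) can be checked with the short PyTorch sketch below; biases are omitted so that only kernel weights are counted, matching the counting used in the text.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Traditional convolution: 3 input channels -> 4 output channels, 3x3 kernels.
conv = nn.Conv2d(3, 4, kernel_size=3, padding=1, bias=False)

# Depthwise separable convolution for the same mapping:
#   depthwise: one 3x3 kernel per input channel (groups = 3),
#   pointwise: 1x1 convolution mixing the 3 depthwise maps into 4 outputs.
ds_conv = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False),  # 3 * 3 * 3 = 27
    nn.Conv2d(3, 4, kernel_size=1, bias=False),                       # 4 * 3 * 1 * 1 = 12
)

print(n_params(conv))     # 108 = 4 filters * 3 channels * 3 * 3
print(n_params(ds_conv))  # 39 = 27 + 12, roughly a 64% reduction
```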
3.3 Hybrid loss function
To promote the network performance, a hybrid loss function is introduced to constrain the estimated illumination component $\hat{L}$ and enhanced image $\hat{I}$
. In this work, two loss functions (i.e., the gradient loss function ${{\mathcal{L}}_G}$
and the illumination-based mean square error loss function ${{\mathcal{L}}_{\textrm{IMSE}}}$
) are proposed to make the illumination L and estimated illumination $\hat{L}$
as similar as possible. The gradient loss function ${{\mathcal{L}}_G}$
is thus defined as follows
(6)\begin{equation}{{\mathcal{L}}_G} = \frac{1}{{|{\Omega } |}}\mathop \sum \nolimits_{x \in {\Omega }} ({|{{\nabla_h}\hat{L}(x )- {\nabla_h}L(x )} |+ |{{\nabla_v}\hat{L}(x )- {\nabla_v}L(x )} |} ),\end{equation}
where ${\Omega }$ is the entire image domain, ${\nabla _h}$
and ${\nabla _v}$
represent the operators of the horizontal and vertical gradients, respectively. In this work, ${{\mathcal{L}}_G}$
can make the estimated illumination $\hat{L}$
obtain an edge structure that is similar to the ground truth L.
In the low-visibility enhancement task, the dark regions of the estimated version should receive more attention to ensure the effectiveness of enhancement. Thus, a simple method was proposed to divide the whole estimated illumination component $\hat{L}$ into two parts: the high illumination component ${\hat{L}_h}$
and the low illumination component ${\hat{L}_l}$
. Since the single-channel illumination only contains the luminance information of the scene, a specific value $\tau$
was thus chosen to guarantee robust estimation. When the value of a certain pixel in $\hat{L}$
(i.e., illumination of the pixel) is higher than $\tau$
, it is regarded as ${\hat{L}_h}$
; while it is regarded as ${\hat{L}_l}$
otherwise. The final ${{\mathcal{L}}_{\textrm{IMSE}}}$
can be thus written as follows
(7)\begin{equation}{{\mathcal{L}}_{\textrm{IMSE}}} = {\lambda _l}\,\textrm{MSE}({{{\hat{L}}_l},{L_l}} )+ {\lambda _h}\,\textrm{MSE}({{{\hat{L}}_h},{L_h}} ),\end{equation}
where ${L_l}$ and ${L_h}$
, respectively, denote the ground truth corresponding to ${\hat{L}_l}$
and ${\hat{L}_h}$
, ${\lambda _l}$
and ${\lambda _h}$
represent the trade-off parameters. According to the comparison experiments, the reliable coefficients, i.e., $\tau = 0 {\cdot}4$
, ${\lambda _l} = 0 {\cdot}8$
and ${\lambda _h} = 0 {\cdot}2$
, were chosen to generate the satisfactory low-visibility enhancement results. To preserve the essential structure, illumination, and contrast in the final enhanced image $\hat{I}$
, the structural similarity loss function ${{\mathcal{L}}_{\textrm{SSIM}}}$
is exploited, i.e.,
(8)\begin{equation}{{\mathcal{L}}_{\textrm{SSIM}}} = 1 - \textrm{SSIM}({\hat{I},\tilde{I}} ),\end{equation}
where $\tilde{I}$ is the original clear image, $\textrm{SSIM}(\cdot)$
denotes the calculation operation of structural similarity (SSIM). The terms of the SSIM metric will be explained in Section 4. To sum up, the hybrid loss function ${\mathcal{L}}$
of the network can be defined as follows
(9)\begin{equation}{\mathcal{L}} = {\lambda _1}{{\mathcal{L}}_G} + {\lambda _2}{{\mathcal{L}}_{\textrm{IMSE}}} + {\lambda _3}{{\mathcal{L}}_{\textrm{SSIM}}},\end{equation}
where ${\lambda _1}$, ${\lambda _2}$
, and ${\lambda _3}$
denote the penalty coefficients. These were set as: ${\lambda _1} = 0 {\cdot}5$
, ${\lambda _2} = 1 {\cdot}3$
, and ${\lambda _3} = 2 {\cdot}0$
.
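For illustration, the hybrid loss of Equations (6)–(9) can be sketched as follows. The exact functional forms are reconstructions under stated assumptions (an L1 gradient difference for ${{\mathcal{L}}_G}$, per-region MSE terms split by applying the threshold $\tau$ to $\hat{L}$, and $1-\textrm{SSIM}$), using the coefficients reported above; `ssim_fn` is assumed to be supplied externally, for example by the pytorch-msssim package.

```python
import torch

def gradient(img):
    """Horizontal and vertical forward differences (the Eq. (6) operators)."""
    grad_h = img[..., :, 1:] - img[..., :, :-1]
    grad_v = img[..., 1:, :] - img[..., :-1, :]
    return grad_h, grad_v

def hybrid_loss(L_hat, L_gt, I_hat, I_gt, ssim_fn,
                tau=0.4, lam_l=0.8, lam_h=0.2,
                lam1=0.5, lam2=1.3, lam3=2.0):
    """Sketch of Eq. (9): lam1*L_G + lam2*L_IMSE + lam3*L_SSIM (assumed forms)."""
    # Gradient loss (assumed L1 form of Eq. (6)).
    gh_hat, gv_hat = gradient(L_hat)
    gh_gt, gv_gt = gradient(L_gt)
    loss_g = (gh_hat - gh_gt).abs().mean() + (gv_hat - gv_gt).abs().mean()

    # Illumination MSE loss (Eq. (7)): estimated-illumination pixels below tau
    # form the low-illumination part and receive the larger weight lam_l.
    low = (L_hat <= tau).float()
    high = 1.0 - low
    mse = (L_hat - L_gt) ** 2
    loss_imse = lam_l * (mse * low).sum() / low.sum().clamp(min=1.0) \
              + lam_h * (mse * high).sum() / high.sum().clamp(min=1.0)

    # SSIM loss (Eq. (8)); ssim_fn is assumed to be provided by an external library.
    loss_ssim = 1.0 - ssim_fn(I_hat, I_gt)

    return lam1 * loss_g + lam2 * loss_imse + lam3 * loss_ssim
```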
4. Experimental results and analysis
This section introduces the details of implementation of network training. All the qualitative and quantitative experiments conducted are also discussed. Note that all experiments and training were performed in Python 3⋅7 and Matlab 2019a environment running on a PC with Intel (R) Core (TM) i5-10600KF CPU @ 4⋅10 GHz and a Nvidia GeForce RTX 2080 Ti GPU.
4.1 Evaluation metric
To fully evaluate the enhancement performance, five full-reference evaluation metrics and two no-reference metrics were employed in the experiments. In particular, the five full-reference metrics (PSNR, SSIM, FSIM, FSIMc, and VSI) were adopted to measure the similarity between the enhanced and latent sharp images. The two no-reference metrics, NIQE and BTMQI, were exploited to blindly evaluate the quality of the enhanced images. Finally, the mean average precision (mAP) was introduced to evaluate the accuracy of vessel detection. The definitions of these evaluation metrics are given as follows (a small worked PSNR example is provided after the list):
• PSNR: Peak signal-to-noise ratio (Wang and Bovik, Reference Wang and Bovik2009) is a widely-used evaluation metric to measure image quality. The PSNR value between the restored image $\hat{Y}$
and the target version Y is given by
(10)\begin{equation}\textrm{PSNR}({\hat{Y},Y} )= 10\;{\log _{10}}\frac{{{M^2}}}{{\textrm{MSE}({\hat{Y},Y} )}},\end{equation}
where M is the maximum pixel value, and $\textrm{MSE}({\hat{Y},Y} )$ represents the operation of calculating the mean square error (MSE) between $\hat{Y}$
and Y.
• SSIM: Structural similarity (Wang et al., Reference Wang, Bovik, Sheikh and Simoncelli2004) is proposed based on the assumption that the human visual system can judge image structure similarity objectively. SSIM consists of three main components: luminance comparison $l({\hat{Y},Y} )$
, contrast comparison $c({\hat{Y},Y} )$
, and structure comparison $s({\hat{Y},Y} )$
. The mathematical expression of SSIM can be written as
(11)\begin{equation}\textrm{SSIM}({\hat{Y},Y} )= \frac{{({2{\mu_{\hat{Y}}}{\mu_Y} + {c_1}} )({2{\sigma_{\hat{Y}Y}} + {c_2}} )}}{{({\mu_{\hat{Y}}^2 + \mu_Y^2 + {c_1}} )({\sigma_{\hat{Y}}^2 + \sigma_Y^2 + {c_2}} )}},\end{equation}
where ${\mu _{\hat{Y}}}$ and ${\mu _Y}$
, respectively, represent the mean values of $\hat{Y}$
and Y, ${\sigma _{\hat{Y}}}$
and ${\sigma _Y}$
denote the corresponding standard deviations, ${\sigma _{\hat{Y}Y}}$
is the covariance value. ${c_1}$
and ${c_2}$
are special constants to prevent outliers.
• FSIM and FSIMc: Feature similarity (Zhang et al., Reference Zhang, Zhang, Mou and Zhang2011) considers that pixels in an image have different importance. For instance, pixels at the edge of an object are more important than pixels in background regions. Therefore, this metric focuses on distinguishing the important pixels and giving them appropriate weights. FSIM is similar to SSIM, coupled with the phase congruency term ${S_{pc}}$
and gradient magnitude term ${S_G}$
. The definition of FSIM is described as follows
(12)\begin{equation}\textrm{FSIM}({\hat{Y},Y} )= \frac{{\mathop \sum \nolimits_{x \in {\Omega }} {S_{pc}}(x ){S_G}(x )\textrm{max}({\textrm{P}{\textrm{C}_{\hat{Y}}}(x ),\textrm{P}{\textrm{C}_Y}(x )} )}}{{\mathop \sum \nolimits_{x \in {\Omega }} \textrm{max}({\textrm{P}{\textrm{C}_{\hat{Y}}}(x ),\textrm{P}{\textrm{C}_Y}(x )} )}},\end{equation}
with $\textrm{P}{\textrm{C}_{\hat{Y}}}$ and $\textrm{P}{\textrm{C}_Y}$
being the phase congruency of $\hat{Y}$
and Y. On the basis of FSIM, FSIMc further considers chromaticity information ${S_C}(x )$
and uses $\lambda$
to further adjust the importance of chromatic components, i.e.,
(13)\begin{equation}\textrm{FSIMc}({\hat{Y},Y} )= \frac{{\mathop \sum \nolimits_{x \in {\Omega }} {S_{pc}}(x ){S_G}(x )S_C^\lambda (x )\textrm{max}({\textrm{P}{\textrm{C}_{\hat{Y}}}(x ),\textrm{P}{\textrm{C}_Y}(x )} )}}{{\mathop \sum \nolimits_{x \in {\Omega }} \textrm{max}({\textrm{P}{\textrm{C}_{\hat{Y}}}(x ),\textrm{P}{\textrm{C}_Y}(x )} )}}.\end{equation}
• VSI: Visual saliency-induced index (Zhang et al., Reference Zhang, Shen and Li2014) considers that distortion causes changes in visual saliency, which strongly correlate with the degree of distortion. Therefore, the GBAS model-based visual saliency map (VS map) is used to evaluate the distortion. The mathematical definition of VSI between $\hat{Y}$
and Y is given by
(14)\begin{equation}\textrm{VSI}({\hat{Y},Y} )= \frac{{\mathop \sum \nolimits_{x \in {\Omega }} S(x )\textrm{max}({V{S_{\hat{Y}}}(x ), V{S_Y}(x )} )}}{{\mathop \sum \nolimits_{x \in {\Omega }} \textrm{max}({V{S_{\hat{Y}}}(x ), V{S_Y}(x )} )}},\end{equation}
where S denotes the local similarity of $\hat{Y}$ and Y, $V{S_{\hat{Y}}}$
and $V{S_Y}$
are the VS maps corresponding to $\hat{Y}$
and Y.
• NIQE: Natural image quality evaluator (Mittal et al., Reference Mittal, Soundararajan and Bovik2013) is proposed according to the statistical regularities observed in natural images. NIQE is a quality-aware collection of statistical features constructed on the basis of a simple and successful spatial-domain natural scene statistics model.
• BTMQI: Blind tone-mapped quality index (Gu et al., Reference Gu, Wang, Zhai, Ma, Yang, Lin, Zhang and Gao2016) is an efficient and effective no-reference objective quality metric. In particular, BTMQI can automatically evaluate standard low dynamic range images created by different tone mapping operators without accessing the original high dynamic range image.
• mAP: Mean average precision (Everingham et al., Reference Everingham, Gool, Williams, Winn and Zisserman2010) is an evaluation index widely used in target detection. mAP is obtained by calculating the mean value of average precision for all classes. In this work, average precision can be obtained by the area under the precision/recall curve of detections.
Note that more details of the implementation of PSNR, SSIM, FSIM/FSIMc, VSI, NIQE, BTMQI, and mAP can be found in the literature (Wang et al., Reference Wang, Bovik, Sheikh and Simoncelli2004; Wang and Bovik, Reference Wang and Bovik2009; Everingham et al., Reference Everingham, Gool, Williams, Winn and Zisserman2010; Zhang et al., Reference Zhang, Zhang, Mou and Zhang2011, Reference Zhang, Shen and Li2014; Mittal et al., Reference Mittal, Soundararajan and Bovik2013; Gu et al., Reference Gu, Wang, Zhai, Ma, Yang, Lin, Zhang and Gao2016). To obtain better enhancement performance, the values of full-reference metrics should become higher, whereas the values of no-reference metrics should be lower. In addition, higher mAP indicates more accurate detection of moving vessels in maritime applications.
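As a small worked example of Equation (10), the sketch below computes PSNR for 8-bit images (M = 255) with NumPy; the remaining metrics are more involved and are best taken from their authors' reference implementations.

```python
import numpy as np

def psnr(restored, target, max_val=255.0):
    """Peak signal-to-noise ratio as in Equation (10)."""
    restored = restored.astype(np.float64)
    target = target.astype(np.float64)
    mse = np.mean((restored - target) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a restored patch that is uniformly 5 grey levels away from the target.
target = np.full((64, 64), 128, dtype=np.uint8)
restored = target + 5
print(round(psnr(restored, target), 2))  # 10*log10(255^2 / 25) ~= 34.15 dB
```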
4.2 Synthetically-degraded image generation
Though many datasets that pair low-light and clear images already exist, a network model trained by these datasets cannot perform maritime surveillance effectively due to the distinctive nature of maritime imagery. Meanwhile, it is challenging to collect paired low-light/clear maritime images. Therefore, it is crucial to generate more realistic low-visibility images based on clear maritime surveillance images. The traditional low-visibility image generation method simply multiplies all image pixels by a definite coefficient, which causes all regions of the image to be darkened to the same degree. Unlike synthetic images, natural low-light images usually have both bright and dark regions. Therefore, an assumption was proposed that applying a stronger darkening factor (i.e., lower value) to the darker regions can make the composite image more realistic. Inspired by this prior, a simple but effective illumination estimation-based method was designed to obtain the darkening weight of each pixel. Specifically, the maximum value of each pixel of the RGB image $\tilde{I}$ in the three channels (i.e., R, G, B) was found, which can be written as
(15)\begin{equation}\bar{L}(x )= \mathop {\max }\limits_{C \in \{{R,G,B} \}} {\tilde{I}^C}(x ),\end{equation}
with ${\tilde{I}^C}$ and $\bar{L}$
being the single-channel image of $\tilde{I}$
and coarse illumination component. Subsequently, a guided filter was adopted to smooth the details and preserve the significant edges as much as possible, which can be given by
(16)\begin{equation}\hat{L}(x )= ({{G_\ast } \otimes \bar{L}} )(x ),\end{equation}
where $\hat{L}$ and ${\tilde{I}^{\textrm{Gray}}}$
represent the darkening weight (i.e., refined illumination component) and guided image (i.e., greyscale image corresponding to $\tilde{I}$
), ${\otimes}$
and ${G_\ast }$
are the convolutional operator and guided filter associated with ${\tilde{I}^{\textrm{Gray}}}$
). In this paper, the local window radius and regularisation parameter are set as 7 and $10^{ - 3}$, respectively. A more detailed description of the guided filter can be found in He et al. (Reference He, Sun and Tang2013). Finally, the specific darkening coefficient $\varepsilon$
, weight $\hat{L}$
, and clear image $\tilde{I}$
were multiplied to obtain the final synthetically-degraded image, i.e.,
(17)\begin{equation}I(x )= \varepsilon \ast \hat{L}(x )\ast \tilde{I}(x ).\end{equation}
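A minimal sketch of this synthetic degradation pipeline, under the assumed forms of Equations (15)–(17), is given below. The guided filter is built from box filters following He et al. (Reference He, Sun and Tang2013), with the window radius of 7 and regularisation parameter of $10^{-3}$ used in this paper; the input path and the clipping of the refined illumination are illustrative choices.

```python
import cv2
import numpy as np

def box(img, r):
    return cv2.boxFilter(img, -1, (2 * r + 1, 2 * r + 1))

def guided_filter(guide, src, r=7, eps=1e-3):
    """Guided filter (He et al., 2013) implemented with box filters."""
    mean_g = box(guide, r)
    mean_s = box(src, r)
    corr_gs = box(guide * src, r)
    corr_gg = box(guide * guide, r)
    var_g = corr_gg - mean_g * mean_g
    cov_gs = corr_gs - mean_g * mean_s
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    return box(a, r) * guide + box(b, r)

def synthesize_low_visibility(clear_bgr, epsilon):
    """Eqs. (15)-(17), assumed forms: darken the clear image by epsilon * refined illumination."""
    clear = clear_bgr.astype(np.float32) / 255.0
    coarse_L = clear.max(axis=2)                                  # Eq. (15): channel-wise maximum
    gray = cv2.cvtColor(clear_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    refined_L = np.clip(guided_filter(gray, coarse_L), 0.0, 1.0)  # Eq. (16): guided smoothing
    degraded = epsilon * refined_L[..., None] * clear             # Eq. (17): pixel-wise darkening
    return (degraded * 255.0).astype(np.uint8)

# Example usage with epsilon drawn from (0.4, 0.8) as in Section 4.3.
# img = cv2.imread('clear_maritime.jpg')  # path is illustrative
# low_vis = synthesize_low_visibility(img, epsilon=np.random.uniform(0.4, 0.8))
```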
4.3 Implementation details
To test the implementation of the proposed model for maritime surveillance, 1,000 images were collected as a training dataset, containing 800 SeaShips images (Shao et al., Reference Shao, Wu, Wang, Du and Li2018), 100 outdoor images chosen from the MIT-Adobe FiveK dataset (Bychkovsky et al., Reference Bychkovsky, Paris, Chan and Durand2011), and 100 maritime images captured by the authors’ SLR camera. Then 8,000 patches with a size of 256 × 256 were generated using cropping, rotating, and scaling. In the numerical experiments, the number of training epochs was set as 60. During training, the learning rate was set to $10^{ - 3}$ in the first 30 epochs and $10^{ - 4}$ in the last 30 epochs. The low-visibility training inputs were obtained by Equation (17) in Section 4.2, with the darkening coefficient $\varepsilon$ ranging between (0⋅4, 0⋅8). It took about 3 h to train the low-visibility enhancement network LVENet with PyTorch 1.5.0. The Python source code is available at https://github.com/gy65896/LVENet.
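The released code at the GitHub link above is authoritative; the following is only a schematic of the training schedule just described (60 epochs of Adam, with the learning rate dropped from $10^{-3}$ to $10^{-4}$ after 30 epochs), reusing the hypothetical `LVENetSketch`, `hybrid_loss`, and `ssim_fn` helpers from the earlier sketches. The batch size and data loader are assumptions.

```python
import torch
from torch.utils.data import DataLoader

# `LVENetSketch`, `hybrid_loss`, `ssim_fn`, and `train_set` (paired low-visibility/clear
# 256x256 patches with their ground-truth illumination maps) are assumed to be defined elsewhere.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = LVENetSketch().to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)
loader = DataLoader(train_set, batch_size=16, shuffle=True)  # batch size is an assumption

for epoch in range(60):
    for low, clear, illum_gt in loader:
        low, clear, illum_gt = low.to(device), clear.to(device), illum_gt.to(device)
        illum_hat = net(low)
        # Assumed Eq. (5) enhancement, as in the earlier architecture sketch.
        enhanced = torch.clamp(low / torch.clamp(illum_hat, min=0.15), 0.0, 1.0)
        loss = hybrid_loss(illum_hat, illum_gt, enhanced, clear, ssim_fn)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # drops the learning rate from 1e-3 to 1e-4 after epoch 30
```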
4.4 Comparisons with other competing methods
In this work, the model-based methods of BCP (Fu et al., Reference Fu, Zeng, Huang, Ding and Zhang2013), JIEP (Cai et al., Reference Cai, Xu, Guo, Jia, Hu and Tao2017), and SRIE (Fu et al., Reference Fu, Zeng, Huang, Zhang and Ding2016) and the deep learning-based methods of RetinexNet (Wei et al., Reference Wei, Wang, Yang and Liu2018), MBLLEN (Lv et al., Reference Lv, Lu, Wu and Lim2018), LightenNet (Li et al., Reference Li, Guo, Poriki and Pang2018), and KinD (Zhang et al., Reference Zhang, Zhang and Guo2019) are involved as the competitors on image enhancement experiments. To ensure a fair comparison, the optimal parameters of these state-of-the-art methods are directly adopted according to the authors’ codes. Furthermore, YOLOv4 (Bochkovskiy et al., Reference Bochkovskiy, Wang and Liao2020), a well-designed target detection network, is used to analyse the improvement of detection effect by low-visibility enhancement.
• BCP: Bright channel prior (Fu et al., Reference Fu, Zeng, Huang, Ding and Zhang2013). This method proposes a BCP-regularised variational framework for low-visibility enhancement. An alternating direction optimisation method is adopted to effectively handle the resulting minimisation problem. In particular, BCP can effectively achieve low-visibility enhancement through an alternate direction optimisation-based method.
• JIEP: Joint intrinsic-extrinsic prior (Cai et al., Reference Cai, Xu, Guo, Jia, Hu and Tao2017). This method has the capacity of jointly estimating both illumination and reflection components from an observed image. It is able to capture the luminance by illumination prior, estimate the reflectance with rich details by texture prior, and preserve the structural features by shape prior.
• SRIE: Simultaneous reflectance and illumination estimation (Fu et al., Reference Fu, Zeng, Huang, Zhang and Ding2016). SRIE is proposed for better prior representation based on the logarithmic transformation. This method can effectively retain more details and suppress unwanted noise.
• RetinexNet: Deep Retinex decomposition-based network (Wei et al., Reference Wei, Wang, Yang and Liu2018). Based on the assumption that original images can be decomposed into illumination and reflection components, a deep Retinex-based network is presented to enhance imaging quality. RetinexNet uses a decomposition network and an illumination adjustment network to enhance the reflectance and illumination components, respectively.
• MBLLEN: Multi-branch low-light enhancement network (Lv et al., Reference Lv, Lu, Wu and Lim2018). MBLLEN is devoted to simultaneously handling various factors, including artefact, contrast, brightness, and noise. It employs different modules to extract abundant features and perform enhancement via multiple subnets. Finally, the enhanced image can be generated through multi-branch fusion.
• LightenNet: LightenNet for weakly illuminated enhancement (Li et al., Reference Li, Guo, Poriki and Pang2018). LightenNet is a trainable CNN for enhancement of weakly illuminated images. In particular, LightenNet takes the weakly illuminated image as an input and outputs its related illumination. The final enhanced image can thus be obtained according to Retinex theory.
• KinD: Kindling the darkness (Zhang et al., Reference Zhang, Zhang and Guo2019). KinD first exploits a simple and effective module to decompose the original image into illumination and reflection components. Two sub-networks are then designed to enhance the illumination and reflection components separately. These two enhanced components are finally combined to generate the latent sharp image.
• YOLOv4: You only look once v4 (Bochkovskiy et al., Reference Bochkovskiy, Wang and Liao2020). YOLOv4 is composed of four parts: CSPDarknet53 backbone, PANet path-aggregation neck, SPP additional module, and YOLOv3 anchor-based head. In this paper, this model is trained for 50 epochs and uses 7,000 SeaShips images (Shao et al., Reference Shao, Wu, Wang, Du and Li2018) as the dataset.
4.5 Full-reference image quality assessment
This subsection is devoted to verifying the superior performance of the proposed method. Specifically, 27 clear maritime images were randomly selected, as shown in Figure 4, for the experiments, and 81 low-visibility images were synthesised using Equation (17) with $\varepsilon = \{{0 {\cdot}5,\; 0 {\cdot}6,\; 0 {\cdot}7} \}$. Furthermore, three model-based methods (i.e., BCP, JIEP, and SRIE) and four learning-based methods (i.e., RetinexNet, MBLLEN, LightenNet, and KinD) were also introduced to compete with the LVENet.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig4.png?pub-status=live)
Figure 4. Twenty-seven selected sharp maritime images for synthetic experiments
To make better visual comparisons, three typical synthetic low-visibility images generated from clear images (Figure 4) and their corresponding enhanced versions obtained by various methods are shown in Figures 5–7. It can be clearly found that BCP usually causes distortion and artefact problems. Especially in Figure 6, the enhanced version obtained by BCP has serious colour abnormalities. Although JIEP, SRIE, and MBLLEN can guarantee natural visual effects, these enhanced images have the problem of insufficient enhancement. Meanwhile, it can be clearly observed from the magnified regions in Figure 7 that MBLLEN makes the enhanced image excessively smooth. Unlike JIEP, SRIE, and MBLLEN, LightenNet has the problem of over-enhancement and a black halo exists on the edges of the object of interest. RetinexNet and KinD cause different degrees of colour distortion and unnaturalness in the image. The reason behind these phenomena may be that the competing methods fail to extract the structural features in maritime images. In contrast, the proposed network is capable of learning more meaningful features, resulting in effectively restoring visual colour and details in low-light regions.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig5.png?pub-status=live)
Figure 5. Comparisons of synthetic experiments on one image from Figure 4. From top-left to bottom-right: (a) synthetic low-light image with $\varepsilon = 0 {\cdot}5$, enhanced versions obtained by (b) BCP, (c) JIEP, (d) SRIE, (e) RetinexNet, (f) MBLLEN, (g) LightenNet, (h) KinD, (i) LVENet, and (j) ground truth (GT), respectively
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig6.png?pub-status=live)
Figure 6. Comparisons of synthetic experiments on one image from Figure 4. From top-left to bottom-right: (a) synthetic low-light image with $\varepsilon = 0 {\cdot}6$, enhanced versions obtained by (b) BCP, (c) JIEP, (d) SRIE, (e) RetinexNet, (f) MBLLEN, (g) LightenNet, (h) KinD, (i) LVENet, and (j) ground truth (GT), respectively
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig7.png?pub-status=live)
Figure 7. Comparisons of synthetic experiments on one image from Figure 4. From top-left to bottom-right: (a) synthetic low-light image with $\varepsilon = 0 {\cdot}7$, enhanced versions obtained by (b) BCP, (c) JIEP, (d) SRIE, (e) RetinexNet, (f) MBLLEN, (g) LightenNet, (h) KinD, (i) LVENet, and (j) ground truth (GT), respectively
To quantitatively evaluate the enhancement effect of restored images, five full-reference evaluation metrics (PSNR, SSIM, FSIM, FSIMc, and VSI) are introduced. The calculation results of each metric are shown in Table 3. Meanwhile, the data in Table 3 are visualised as a Kiviat diagram shown in Figure 8 to compare the enhanced performance more intuitively. It can be found that the proposed method is the optimal approach in the calculation results of all five full-reference metrics. The superior performance benefits from imposing more significant penalties on dark pixels and adopting maritime images as datasets during training.
Table 3. Comparison of PSNR, SSIM, FSIM, FSIMc, and VSI (mean ± std) between LVENet and state-of-the-art methods on 81 synthetic images. The best, second best, and third best results are highlighted in red, blue, and green, respectively
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab3.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig8.png?pub-status=live)
Figure 8. Kiviat diagram that visualises the calculation results from Table 3. All calculated values of each metric are normalised, i.e., the best value is 1 and the worst is 0
4.6 No-reference image quality assessment
This subsection is dedicated to verifying the effectiveness of the proposed method for realistic low-light maritime images. In this work, the LVENet is compared with seven different image enhancement methods: BCP, JIEP, SRIE, RetinexNet, MBLLEN, LightenNet, and KinD. Specifically, the LVENet and other competitive methods are exploited to enhance three real low-visibility maritime examples. The enhancement results and their associated magnified versions are shown in Figure 9. Meanwhile, two evaluators (i.e., NIQE and BTMQI) are adopted to compare all the imaging methods.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig9.png?pub-status=live)
Figure 9. Comparisons of realistic experiments on images 1–3 (from top to bottom: image 1, image 2, and image 3). From left to right: (a) low-light image, and enhanced versions obtained by (b) BCP, (c) JIEP, (d) SRIE, (e) RetinexNet, (f) MBLLEN, (g) LightenNet, (h) KinD, and (i) LVENet, respectively
Through visual comparison, it can be observed that BCP tends to produce an unnatural white halo on the target of interest and is accompanied by a certain degree of colour distortion. The restored images of JIEP and SRIE seem very natural, but their insufficient enhancement may make it hard to find essential information. Although RetinexNet has successfully enhanced the illumination, it is evident that the colour of the enhanced versions becomes extremely unnatural, which will directly lead to the failure of vessel detection. MBLLEN can obtain satisfactory visual effects; however, its improved images suffer from both insufficient enhancement and a risk of over-smoothing. The combined effect of these two issues may directly erase valuable information hidden in the dark. The enhanced versions of LightenNet have the problem of overexposure and boundary artefacts in certain regions. Although KinD can achieve almost the same visual effect as the proposed method through expensive calculations, both no-reference metrics show that the proposed method performs better. As shown in Table 4, the LVENet has superior performance on both NIQE and BTMQI metrics in most cases. Besides, Figure 10 shows more low-visibility maritime cases enhanced by this method, illustrating that the proposed LVENet can effectively recover important information buried in the dark.
Table 4. NIQE and BTMQI comparisons of LVENet with other methods on all test images in Figure 9. The best, second best, and third best results are highlighted in red, blue, and green, respectively
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab4.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig10.png?pub-status=live)
Figure 10. More low-visibility enhancement cases generated by the proposed LVENet
4.7 Vessel detection after low-visibility enhancement
In the maritime video surveillance system, the low-visibility environment seriously reduces the accuracy of manual or autonomous vessel detection. The LVENet is designed to solve the problems of detection failure and inaccurate identification. To verify that the LVENet can improve detection performance, YOLOv4 (Bochkovskiy et al., Reference Bochkovskiy, Wang and Liao2020) is employed to detect vessels in both original low-visibility images and their enhanced versions obtained by LVENet.
Figure 11 shows the detection effects on synthetic low-visibility images, enhanced images, and normal-visibility images (i.e., the original images). By comparison, detection failures and errors often occur in low-visibility environments, whereas the enhanced images can achieve almost the same detection effect as the normal-visibility images. Meanwhile, Figure 12 shows the visual results of vessel detection on the real low-visibility images and the enhanced images. It can be clearly seen that preprocessing images with the LVENet can significantly improve the detection performance in natural low-visibility maritime environments. In low-light conditions, YOLOv4 tends to distinguish vessel types incorrectly and even fails to detect some vessels. However, using the proposed method as a preprocessing module enables the vessel detection model to extract features better and to improve accuracy. Furthermore, Table 5 quantitatively reports the detection performance on all images using the mAP indicator, further revealing the improvement brought by the LVENet to vessel detection in low-visibility environments.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig11.png?pub-status=live)
Figure 11. Vessel detection experiments on synthetic low-visibility maritime images. From top to bottom: the vessel detection results on (a) low-light images, (b) enhanced images obtained by LVENet, and (c) ground truth (GT), respectively
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig12.png?pub-status=live)
Figure 12. Vessel detection experiments on real low-visibility maritime images. From top to bottom: the vessel detection results on (a) low-light images and (b) enhanced images obtained by LVENet, respectively
Table 5. Comparison of mAP results (unit: %)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab5.png?pub-status=live)
Note: Dataset (Syn), Dataset (Real), and Dataset (All) contain 18 images in Figure 11, 12 images in Figure 12, and 30 images in Figures 11 and 12, respectively.
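To make the role of LVENet as a preprocessing module explicit, the sketch below enhances a single video frame before handing it to a detector; `detect_vessels` is a hypothetical placeholder for any trained detector such as YOLOv4, and `net` is the illustrative `LVENetSketch` from Section 3.

```python
import numpy as np
import torch

def enhance_frame(frame_bgr, net, gamma=0.15, device='cuda'):
    """Enhance one BGR video frame with the (hypothetical) LVENetSketch before detection.
    Frame height and width are assumed divisible by 4 for this sketch."""
    x = torch.from_numpy(frame_bgr[:, :, ::-1].copy()).float().div(255.0)
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)          # HWC -> NCHW
    with torch.no_grad():
        illum = net(x)
        enhanced = torch.clamp(x / torch.clamp(illum, min=gamma), 0.0, 1.0)
    out = (enhanced.squeeze(0).permute(1, 2, 0).cpu().numpy() * 255.0).astype(np.uint8)
    return out[:, :, ::-1]                                   # RGB -> BGR

# detect_vessels(...) is a placeholder for any trained detector (e.g., YOLOv4):
# boxes = detect_vessels(enhance_frame(frame, net))
```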
4.8 Running time analysis
The lightweight structure of LVENet leads to fast low-visibility enhancement. In this section, LVENet is thoroughly compared with the seven competing image enhancement methods. Note that the competing methods are run on Matlab, Matlab (Caffe), and Python (Tensorflow) platforms, according to the original codes provided by their authors. To analyse the network complexity of the deep learning-based approaches, the model parameters of all methods are counted. Meanwhile, three low-visibility datasets with sizes of 480 × 640, 720 × 1080, and 1080 × 1920 are selected for all methods, each containing 20 images. The model parameters and average calculation times at the three image scales are shown in Table 6. Since the Matlab implementations are inherently slower, the fairest comparison is among the GPU-accelerated RetinexNet, MBLLEN, KinD, and LVENet. The results illustrate that the proposed method takes first place by a large margin in terms of calculation speed: LVENet takes just 1 s to process 222 images of size 1920 × 1080 under the acceleration of a 2080 Ti GPU.
Table 6. Comparison of running times (unit: second) and model parameters of several image enhancement methods
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_tab6.png?pub-status=live)
Note: RetinexNet, MBLLEN, KinD, and the proposed LVENet method are accelerated by 2080 Ti GPU.
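Reported GPU running times depend on how they are measured. The sketch below shows a typical timing protocol with warm-up iterations and CUDA synchronisation, which is the usual way comparisons such as Table 6 are obtained; the random dummy inputs and the use of `LVENetSketch` are assumptions for illustration.

```python
import time
import torch

def average_inference_time(net, height, width, n_images=20, device='cuda'):
    """Average per-image forward time on random inputs of a given resolution."""
    net = net.to(device).eval()
    dummy = torch.rand(1, 3, height, width, device=device)
    with torch.no_grad():
        for _ in range(5):                # warm-up iterations
            net(dummy)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_images):
            net(dummy)
        torch.cuda.synchronize()          # wait for all queued GPU work to finish
    return (time.time() - start) / n_images

# e.g. average_inference_time(LVENetSketch(), 1080, 1920)
```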
5. Conclusion and future work
This paper proposes a low-visibility enhancement network (termed LVENet) that directly reconstructs enhanced images via a lightweight CNN. In particular, inspired by the Retinex theory, LVENet estimates the illumination component using a depthwise separable convolutional network, and the enhanced image is then generated by dividing the input low-visibility image by the estimated illumination. In addition, a novel synthetically-degraded image generation strategy and a hybrid loss function were designed to improve network performance and to provide paired maritime images as training data for better maritime surveillance. The benefit of the LVENet is that it requires relatively few calculations to obtain a satisfactory visual enhancement effect. Its superior performance has been verified by both full-reference and no-reference assessments. Furthermore, experiments on detecting vessels after low-visibility enhancement and on running time were also conducted to prove the practicality and real-time capability of the proposed LVENet.
Although the effectiveness of the method has been proven in many experiments, it is sometimes difficult to obtain perfect visual effects under certain circumstances. Figure 13 visually illustrates two failed cases related to low-visibility enhancement. The proposed method enhanced the brightness, but the enhanced images are still disturbed by unwanted noise carried in the input low-visibility images. However, it is still worthy of consideration since it enables better imaging results than other competitive methods. In future work, the authors will focus on developing a noise-insensitive learning method to further enhance image quality and vessel detection in maritime videos. The authors believe that there is significant potential to exploit the proposed method to guarantee the safety of vessel navigation under maritime video surveillance.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220316021422219-0112:S0373463321000783:S0373463321000783_fig13.png?pub-status=live)
Figure 13. Failure enhancement cases. From left to right: (a) low-light image, and (b) enhanced versions obtained by LVENet, respectively
Acknowledgements
This work was supported by the NSFC (Grant No. 51609195) and the National Key R&D Program of China (Grant No. 2018YFC0309602).