1. Introduction
Inland waterways are important components of waterway transportation. A vessel traffic service (VTS) can effectively supervise the fixed waters of an inland waterway in real time, playing an important role in improving navigation efficiency and ensuring the navigation safety of vessels (Can, Reference Can2017); as such, it is a key research field in the shipping industry. As one important means of VTS supervision, maritime video surveillance primarily relies on humans to analyse and determine ship movement information in a video feed; however, the ship information obtained from such videos is extremely limited. A ship's automatic identification system (AIS) can collect a large amount of spatial position data generated during ship navigation to support intelligent ship management. The integration of AIS technology and ship VTS will provide a safe navigation system for maritime transportation (Kartika et al., Reference Kartika, Siswandari and Puspitorini2018).
Augmented reality (AR) technology has been developing rapidly in recent years. This technology can be used to register virtual information (e.g., text, symbols and pictures) with objects in a real scene and, at the same time, achieve enhanced information tracking of real objects. As more and more intelligent algorithms are applied in maritime affairs, marine ship navigation and waterway supervision are gradually moving toward the field of AR. Researchers are committed to adding more navigation information to ship navigation videos to aid pilots. Frydenberg et al. (Reference Frydenberg, Nordby and Eikenes2018) developed a shipborne AR deployment scheme from the ship driving environment, human movements and illumination factors, and provided additional environment and navigation information for ship pilots by superimposing graphics and audio. Hugues et al. (Reference Hugues, Cieutat and Guitton2014) proposed an image analysis algorithm capable of detecting the horizon in maritime scenes, creating an indirect visual function in an AR system for ships. Oh et al. (Reference Oh, Park and Kwon2016) analysed current shipboard equipment, such as radar and electronic charts, to design an AR user interface that integrates the visual view in front of a ship with virtual images and navigation information and presents it to the pilot via an independent computer screen.
With the continuous development of communication technology, the utilisation rate of AIS will continue to improve in the future (Liu et al. Reference Liu, Nie, Garg, Xiong and Hossain2020). Based on the massive parallel computing power of graphics processing units (GPUs), Huang et al. (Reference Huang, Li, Zhang and Liu2020) compacted massive AIS tracks to realise visualisation. To obtain additional ship information, some researchers have tried to incorporate AIS data into video-based systems to provide visual support. Lee et al. (Reference Lee, Lee and Nam2016) calculated the ship attitude by detecting an image ahead of the ship and then used image analyses and AIS data to detect the ship position and generate an enhanced display of that information. Based on AIS data, An et al. (Reference An, Qiao, Yang, Hong and Bai2019) designed a visual analysis platform for global shipping routes that can display multiple aspects of the AIS data and generate a display of the entire sea route. Lukas et al. (Reference Lukas, Vahl and Mesing2014) developed a multi-signal fusion monitoring method that combines electronic charts, radar and AIS information with the video feed, so that users can touch a screen to obtain all the ship information from the video.
Although researchers are increasingly incorporating AR technology in maritime applications, there are few AR studies concerning inland waterway monitoring, which often has higher application value in important waterway sections. This paper focuses on the special scenes of inland rivers and proposes innovations in improving the accuracy of ship detection, night image enhancement and maritime information fusion. In this paper, based on AIS data and combined with computer vision analysis methods, virtual information is registered with real images and an identification and tracking system for ship information suitable for inland waterway supervision is constructed, providing a convenient method for waterway supervisors to obtain ship information.
2. Related computer vision research
2.1 Object detection
Object detection is an important research direction in computer vision and is used to locate targets within a single-frame image. Research on ship detection began in 2002 with the 'Spartan Scout' project, which aimed to develop a self-controlled pilotless ship equipped with a detection system that could detect and track ship objects at sea (Rao et al., Reference Rao, Wang, Hu and Mullane2014). Later, Guang et al. (Reference Guang, Qichao and Feng2011) proposed a region segmentation method to extract the features of different segmented regions; a machine-learning algorithm was then used to detect the ships in each region of the image. Kim et al. (Reference Kim, Park and Yu2010) used background estimation to enable ship detection and merged the results with AIS to match relevant ship information. Liu (Reference Liu2010) detected ships in an inland waterway by separating the sky–water line and incorporating the average ship features in the region of interest.
In recent years, deep learning has been developing rapidly and is widely used in the field of object detection. Object detection algorithms are divided into two categories: object detection models based on classification, including region-based convolutional neural network (R-CNN) (Girshick et al., Reference Girshick, Donahue, Darrell and Malik2014), Fast R-CNN (Girshick, Reference Girshick2015) and Faster R-CNN (Ren et al., Reference Ren, He, Girshick and Sun2017), and detection models based on regression, including the YOLO (Redmon et al., Reference Redmon, Divvala, Girshick and Farhadi2016) and single shot multibox detector (SSD) (Liu et al., Reference Liu, Anguelov, Erhan, Szegedy, Reed, Fu and Berg2016) series.
Object detection methods based on deep learning have been widely used in various fields, including ship detection. Betti et al. (Reference Betti, Michelozzi, Bracci and Masini2020) constructed a YOLOv3 detector based on the Keras application programming interface to detect four types of ships: cargo ships, warships, oil tankers and tugboats. He et al. (Reference He, Yi, Mu and Zhang2019) designed a method based on the combination of a Gabor filter and Faster R-CNN to effectively improve the accuracy of ship object detection in satellite images. Wang et al. (Reference Wang, Wang, Hong, Cheng and Fu2017) proposed a ship detection method based on the SSD algorithm and transfer learning according to different image input sizes; two models, SSD-300 and SSD-500, were designed to test the detection performance. Guo et al. (Reference Guo, Yang, Wang, Song and Gao2020) proposed a rotational Libra CNN method and introduced the concept of a balanced feature pyramid to improve the detection effect for different sizes of ships. At present, deep-learning algorithms are widely used in ship object detection. However, in real circumstances, the types of ships are diverse, their sizes are different and the water traffic environment is complex; in addition, it is necessary to consider recognition under night visual conditions, which poses a huge challenge to water object detection.
2.2 Object tracking
Object tracking is a key technology used to solve the real-time locking of moving objects in video feeds. Classic object-tracking algorithms include mean shift, particle filter and correlation filter. These methods mainly model the current object region and then find similar regions in the next image. Well-known correlation filtering algorithms include circulant structure tracking with kernels (CSK) (Henriques et al., Reference Henriques, Rui, Martins and Batista2012) and kernelised correlation filter (KCF) (Henriques et al., Reference Henriques, Caseiro, Martins and Batista2015), both of which determine the moving direction of an object by finding the relationship between two adjacent frames.
Compared to general motion environments, water traffic environments are more complex, including the partial or complete occlusion of adjacent ships, attitude and illumination changes, sudden scale changes and motion blur (Chen et al., Reference Chen, Yuan, Wu, Zhang and Zheng2014; Li and Zhu, Reference Li and Zhu2014; Li et al., Reference Li, Tian, Zuo, Zhang and Yang2018). Chen et al. (Reference Chen, Li, Tian and Chao2017) used the difference in the grey peak to detect the position of a ship and tracked ships based on the mean-shift algorithm. Dong et al. (Reference Dong, Zheng, Li, Tian and Liu2019) designed an object-tracking algorithm based on an improved KCF, in which a Kalman filter module was added to predict the position of the ship tracking object in the next frame, and the concept of object-tracking critical probability was used to evaluate whether the object tracking was abnormal. However, in maritime video feeds, because of occlusion, ships are often not detected. Cheng et al. (Reference Cheng, Zhilin and University2019) improved the KCF algorithm by adding a scale-transform frame and tracking-effect detection method to solve the object occlusion problem or tracking failures caused by object detection. Chen et al. (Reference Chen, Xu, Yang, Wu, Tang and Zhao2020) proposed an enhanced ship-tracking framework based on KCF and a curve-fitting algorithm and used a data anomaly detection and correction program to correct the positions of occluded ships.
With the development of deep-learning algorithms, such algorithms have been applied to object tracking. This method trains a CNN model to predict the object region in the next frame (Dorai et al., Reference Dorai, Chausse, Gazzah and Amara2017; Pang et al., Reference Pang, Coz, Yu, Luaces and Diez2017). Recently, such algorithms have appeared in maritime applications, where researchers (Shan et al., Reference Shan, Zhou, Liu, Zhang and Huang2020) have used a modified Siamese network combined with multi-region proposal networks to build a tracking pipeline to track maritime ships. ResNet-50 with a feature pyramid network structure was used as the CNN of the Siamese detection subnetwork.
However, research concerning multi-object ships is relatively sparse. Vivone et al. (Reference Vivone, Braca and Horstmann2015) used a prior knowledge-based multi-object tracking method, taking prior information given by the ship channel and its related motion model as the basic components of a variable structure interactive multi-model process, and used joint probabilistic data association rules to deal with false and missed detections. Xiao and Gang (Reference Xiao and Gang2011) employed the continuously adaptive mean shift (CamShift) multi-ship tracking algorithm and multi-feature adaptive fusion based on colour, shape and texture to improve the robustness of the model.
3. Methodology
This paper uses a multi-object ship tracking mechanism based on the combination of a detector and a tracker and, using a data fusion algorithm, achieves an AR presentation of AIS information. The steps are as follows. (1) We enhance the night images to increase the contrast between the ship objects and the water surface background. (2) We use the SSD model to build a detector and introduce a self-attention mechanism to improve the model and enhance the accuracy of object detection. (3) We use the DeepSORT algorithm to build a tracker that, according to the current tracking results, predicts the next position of the object and obtains the best match. (4) We establish a mapping between the ship positioning points and the world coordinate system. (5) We track and predict the AIS signal and match it to a ship object via a distance calculation. The structure of the algorithm discussed in this paper is shown in Figure 1.
3.1 Increased contrast at night
Computer vision has begun to be applied to maritime video surveillance and has many civilian and military applications (Grimaldi et al., Reference Grimaldi, Bechar, Lelore, Guis and Bouchara2015). In such applications, it is often necessary to improve image quality before further analysis. Yang et al. (Reference Yang, Nie and Liu2019) estimated the initial brightness through Max-RGB and then refined it with the WLS filter; based on the well-constructed brightness, an enhanced low-light image was obtained. In inland waterways, the main difficulty of night-time detection is the low contrast between the ship targets and the background.
In this paper, we start from the RGB colour space of the visible-light image. The differences in the red (R), green (G) and blue (B) dimensions give the image its colour representation; however, RGB makes it difficult to express deeper image characteristics. By constructing the hue (H), saturation (S) and brightness (V) pixel matrices, the image can be easily quantified according to its colour, saturation and brightness. For image P, as shown in Figure 2(a), the corresponding S and V grey matrices are calculated as
where ${R_{ij}}$, ${G_{ij}}$, and ${B_{ij}}$ represent the three colour weights for the pixel coordinates $({i,j} )$ after normalisation.
A generic ship tends to show high saturation and low brightness in night images; accordingly, the pixels in the ship region have higher S and lower V. The surface region in the background, conversely, exhibits lower S and higher V, i.e., ${S_{\textrm{boat}}} > {S_{\textrm{river}}}$ and ${V_{\textrm{boat}}} < {V_{\textrm{river}}}$. At the same time, because of the illumination on both sides of the water surface, the water pixels have high gradient characteristics; therefore, the Sobel operator can be used to extract the gradient image L of P, as shown in Figure 2(b), which is then transformed into the grey pixel matrix $L^{\prime}$, as shown in Figure 2(c). Then, the contrast of the water region pixels in S and V can be enhanced:
where ${P_{\textrm{max}}} = 255$. The processed greyscale images of S and V are shown in Figure 2(d) and (e), respectively. It can be seen from the figure that the ship region is more prominent in the saturation image S than in the brightness image V. The ship region can be enhanced by subtracting the two images:
where ${T_{ij}}$ are the elements of the grey image after enhancement processing and the parameter $\varepsilon = 40$. The grey image T is shown in Figure 2(f). Now that the ship region is clearly expressed, the contrast enhancement calculation for the object region can be carried out using
where ${P_{ij}}$ is the pixel value after contrast enhancement and $\beta$ is a controllable parameter, set here to $\beta = 1.15$. Each pixel in T is compared with 0, and the ship region is segmented from T. Then, the RGB components of the corresponding positions in the original image P are exponentially adjusted. The processed image P is shown in Figure 2(g); the ship object shows higher contrast against the background region.
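The exact enhancement formulas appear in the paper's numbered equations; as a rough illustration, the Python sketch below (OpenCV and NumPy) follows the same sequence of steps: extract S and V, suppress illuminated water using the Sobel gradient, subtract the two maps with the offset $\varepsilon = 40$ and exponentially stretch the segmented ship pixels with $\beta = 1.15$. The precise gradient-weighting and stretching forms are assumptions, not the authors' exact equations.

```python
import cv2
import numpy as np

def enhance_night_image(bgr, epsilon=40.0, beta=1.15):
    """Illustrative sketch of the Section 3.1 night-contrast enhancement."""
    img = bgr.astype(np.float32) / 255.0
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    s = hsv[..., 1] * 255.0          # saturation map, scaled to [0, 255]
    v = hsv[..., 2] * 255.0          # brightness map, scaled to [0, 255]

    # Gradient magnitude (Sobel) as the grey image L'
    grey = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1, ksize=3)
    l_prime = cv2.normalize(np.abs(gx) + np.abs(gy), None, 0, 255, cv2.NORM_MINMAX)

    # Suppress high-gradient illuminated water in S, amplify it in V (assumed form)
    s_adj = np.clip(s - l_prime, 0, 255)
    v_adj = np.clip(v + l_prime, 0, 255)

    # Ship regions have high saturation and low brightness: positive pixels of T
    t = s_adj - v_adj - epsilon
    mask = t > 0

    # Exponential contrast stretch of the RGB components inside the ship mask
    out = bgr.astype(np.float32)
    out[mask] = np.clip(255.0 * (out[mask] / 255.0) ** (1.0 / beta), 0, 255)
    return out.astype(np.uint8)
```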
3.2 Self-attention mechanism
The self-attention mechanism is a recently proposed machine-learning technique that enables a model to fully consider the relationships between sample features during training and strengthens feature learning in key regions. This mechanism has been widely used in natural language processing and image learning (Cao et al., Reference Cao, Chen, Guo and Shi2020). Because of the small size of the convolution kernel, CNNs only capture regional features of the same size and cannot effectively reflect the relationships between pixels in different regions. In this paper, we improve the network structure of SSD based on the self-attention mechanism, so that it can analyse the correlations of all the pixels in the different scale feature maps, making the correlation of the extracted features stronger.
First, we input the feature map x with a structure $c \times w \times h$, convolve it twice with $1 \times 1$ kernels, compressing the channel number according to $c^{\prime} = c/8$, and obtain matrices F and G with structures $c^{\prime} \times w \times h$, which are then reshaped into $c^{\prime} \times n$, where $n = w \times h$. The third convolution is performed with a $1 \times 1$ kernel without channel compression, yielding a matrix H with a $c \times w \times h$ structure, which is then reshaped into $c \times n$. The matrices F and G are used to calculate the attention matrix
where ${e_{i,j}}$ are the elements of the attention matrix E, which has dimensions $n \times n$; ${s_{ij}}$ is the relative weight of position i with respect to position j; $n$ is the number of elements in the feature map; and $f({{x_i}} )= {W_f}{x_i}$, $g({{x_j}} )= {W_g}{x_j}$, where ${W_f}$ and ${W_g}$ represent $1 \times 1$ convolutions.
We continue to multiply the matrices H and E to obtain a matrix with dimensions $c \times n$ as the self-attention map:
where ${q_{i,j}}$ are the elements of the matrix Q and $\gamma$ is a learnable parameter that represents the dependence of the network on the relationships between long-distance regions. $h({{x_j}} )= {W_h}{x_j}$, where ${W_h}$ is a $1 \times 1$ convolution.
The matrix Q is expanded into a feature map with a $c \times w \times h$ structure as the output of the self-attention module, as shown in Figure 3. Because the feature-map structure remains unchanged throughout, the self-attention module can be added directly to the convolutional network without changing its basic structure.
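The description above corresponds to the widely used SAGAN-style self-attention block; a minimal PyTorch sketch is given below, assuming the $c/8$ channel compression and a learnable scale $\gamma$ initialised to zero. It illustrates the mechanism, not the authors' exact layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over all spatial positions of a c x w x h feature map."""

    def __init__(self, in_channels):
        super().__init__()
        c_prime = max(in_channels // 8, 1)          # channel compression c' = c / 8
        self.f = nn.Conv2d(in_channels, c_prime, kernel_size=1)      # W_f (query)
        self.g = nn.Conv2d(in_channels, c_prime, kernel_size=1)      # W_g (key)
        self.h = nn.Conv2d(in_channels, in_channels, kernel_size=1)  # W_h (value, no compression)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable weight on the attention branch

    def forward(self, x):
        b, c, w, h = x.size()
        n = w * h
        q = self.f(x).view(b, -1, n)                # b x c' x n
        k = self.g(x).view(b, -1, n)                # b x c' x n
        v = self.h(x).view(b, c, n)                 # b x c  x n

        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # b x n x n attention matrix E
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, w, h)   # self-attention map Q
        return self.gamma * out + x                 # residual: input structure unchanged
```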
3.3 The detector
The backbone network of SSD is a modified VGG16: the fully connected layers of the original VGG16 are removed, the depth of the convolution layers is increased, and four additional convolution layers are appended to extract further features. During the forward pass, the network generates a series of bounding-box sets and category prediction scores, and the final detection results are determined via non-maximum suppression (NMS). As the network deepens, the size of the feature maps decreases while object information is collected at each scale; therefore, the default boxes generated by SSD are multiscale.
As shown in Figure 4, based on the SSD network, the self-attention module is added after the 4th, 7th, 8th, 9th, 10th and 11th layers. The output characteristic diagram of the original layer is taken as the input of the self-attention module, and the corresponding feature map is the output. With the deepening of the network, the scale of the generated feature map decreases, and different scale feature maps produce different default boxes. NMS is used again to fuse the self-attention feature maps of the different receptive fields, and the final detection result is calculated.
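One way the self-attention module might be attached to the six SSD source layers is sketched below, reusing the SelfAttention2d class from the previous sketch. The channel counts listed are the common SSD-300 defaults for the conv4_3, conv7 and conv8_2–conv11_2 feature maps and are assumptions; the paper does not specify them.

```python
import torch.nn as nn

class AttentionSSDNeck(nn.Module):
    """Wraps each multi-scale SSD feature map with a SelfAttention2d block.

    channels_per_scale lists the channel count of each SSD source layer;
    the values below are SSD-300 defaults (assumed, not from the paper).
    """

    def __init__(self, channels_per_scale=(512, 1024, 512, 256, 256, 256)):
        super().__init__()
        self.attn = nn.ModuleList([SelfAttention2d(c) for c in channels_per_scale])

    def forward(self, feature_maps):
        # feature_maps: list of tensors from the six SSD source layers,
        # returned in the same order with self-attention applied to each.
        return [attn(f) for attn, f in zip(self.attn, feature_maps)]
```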
3.4 The tracker
This component of the algorithm must continuously track all the ship objects detected by the detector. Multiple-object tracking (MOT) is used to track and extract the trajectories of multiple objects of interest in a video sequence. As opposed to single-object tracking, MOT must identify different objects in each frame. For new objects, it must generate new trajectories, and for objects that have left the field of view of the camera, it must terminate the tracking of their trajectories.
The traditional multi-object tracking algorithm is based on correlation filtering, with a representative example being the KCF-based multi-object tracker (Henriques et al., Reference Henriques, Caseiro, Martins and Batista2015), which uses multithreading to track multiple single objects. The SORT algorithm (Bewley et al., Reference Bewley, Ge, Ott, Ramos and Upcroft2016), proposed in 2016, treats multi-object tracking as a data association problem. SORT relies on accurate object detection: using the location information in the detection boxes, it applies Kalman filtering and the Hungarian algorithm to match the tracked objects between consecutive frames. However, because SORT ignores the surface characteristics of the detected objects, it loses objects easily.
DeepSORT (Wojke et al., Reference Wojke, Bewley and Paulus2017) improves on SORT by combining object motion and appearance information, using the Mahalanobis and cosine distances to measure the similarity between motion states and deep features within the detection boxes, and using the Hungarian algorithm to cascade-match the predicted tracks with the detections in the current frame, giving priority to the tracks that have been matched most recently. A minimum IOU threshold is then used to filter low-confidence matches and reduce mismatching, and a maximum-age threshold removes tracks that have remained unmatched for too many frames. The algorithm structure is shown in Figure 5.
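A heavily simplified sketch of the association step is given below: it combines a cosine appearance cost with an IoU gate and solves the assignment with the Hungarian algorithm (via scipy). It omits the Kalman-filter Mahalanobis gate and the cascade over track ages that the full DeepSORT algorithm uses, and the gate thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections_to_tracks(track_feats, det_feats, track_boxes, det_boxes,
                               max_cosine=0.4, min_iou=0.3):
    """Simplified DeepSORT-style association (illustrative only).

    track_feats / det_feats : lists of re-ID feature vectors (np.ndarray)
    track_boxes / det_boxes : lists of [x1, y1, x2, y2] arrays
    Returns a list of (track_index, detection_index) matches.
    """
    def iou(a, b):
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(x2 - x1, 0) * max(y2 - y1, 0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    cost = np.zeros((len(track_feats), len(det_feats)))
    for i, tf in enumerate(track_feats):
        for j, df in enumerate(det_feats):
            cos_d = 1.0 - np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            gate_ok = iou(track_boxes[i], det_boxes[j]) >= min_iou and cos_d <= max_cosine
            cost[i, j] = cos_d if gate_ok else 1e6      # forbid gated-out pairs

    rows, cols = linear_sum_assignment(cost)            # Hungarian assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]
```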
3.5 Multimodal data fusion
(1) Perspective transformation
Perspective transformation is the process of projecting an image onto a new view plane by establishing a projection matrix. The projection process is the product of the coordinate vector and the transformation matrix, as
where u and v = coordinates of the original plane; the parameter $w = 1$; the perspective matrix contains the zoom, rotation and translation operations of the plane; and ${a_{33}} = 1$. The transformed coordinates are represented as ($u\mathrm{^{\prime}},v\mathrm{^{\prime}}$), where $u\mathrm{^{\prime}} = x\mathrm{^{\prime}}/w\mathrm{^{\prime}}$ and $v\mathrm{^{\prime}} = y\mathrm{^{\prime}}/w\mathrm{^{\prime}}$. Entering this into Equation (3.7), we obtain Equation (3.8). According to this formula, the parameters in the perspective matrix can be calculated by simply identifying four sets of key points:
In this paper, the principle of perspective transformation is used to project the image coordinate system onto the longitude and latitude coordinate system. The process of transforming pixel coordinates to longitude and latitude coordinates is called the forward perspective, and the opposite is called the reverse perspective. Applying the reverse perspective to the river water plane produces the effect shown in Figure 6, which demonstrates the reverse projection of a square virtual grid with a side length of 0⋅00045 in the longitude and latitude coordinate system.
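With four surveyed pixel/geographic point pairs, the forward and reverse perspectives can be computed directly with OpenCV, as in the sketch below. The reference coordinates shown are placeholders for illustration, not the calibration used in this study.

```python
import cv2
import numpy as np

# Four manually surveyed reference points on the water plane:
# pixel coordinates in the frame and their (lon, lat) positions.
# The values below are placeholders, not the paper's calibration.
pixel_pts = np.float32([[412, 980], [1890, 1012], [560, 620], [1700, 640]])
geo_pts   = np.float32([[121.4902, 31.2401], [121.4911, 31.2403],
                        [121.4899, 31.2415], [121.4909, 31.2417]])

# Forward perspective: pixel -> (lon, lat); its inverse gives the reverse perspective.
M_forward = cv2.getPerspectiveTransform(pixel_pts, geo_pts)
M_reverse = np.linalg.inv(M_forward)

def transform(points, M):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, M).reshape(-1, 2)

ship_pixel = [(1024.0, 800.0)]
print(transform(ship_pixel, M_forward))   # estimated lon/lat on the river plane
```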
(2) Ship position estimation
The previous steps transform any pixel coordinate in an image into longitude and latitude coordinates on the river surface. However, the longitude and latitude data in AIS are determined by the shipborne GPS, which is often close to the bow or stern of the ship; therefore, there may be a gap of tens of metres between the two sets of coordinates. This paper takes the ship centre as the estimated positioning point and marks the pixel coordinates of the ship centre position corresponding to the river surface in the video.
As shown in Figure 7, the point O on the long side of the detection box is the selected water surface projection point. A ray is traced from O along the angle $\mathrm{\theta }$ towards the right boundary and intersects the bounding box at the point F, where $\mathrm{\theta }$ is obtained from the DeepSORT tracker, such that the line segment $OF$ approximately represents the waterline of the ship. The position of B in world coordinates is
where P represents the pixel coordinates of each point and $\textrm{Transform}()$ represents the forward perspective. From point B, a ray perpendicular to the travel direction is traced into the detection box to the ship positioning point S. The length of this line segment in world coordinates, $l_{BS}^{\prime}$, is determined by the width/length ratio of the ship, such that
where $\delta$ represents the width/length ratio, which is approximately $1:6$ for bulk carriers, tankers and river ships and approximately $1:3$ for yachts, cruise ships and tugboats; $\textrm{Distance}()$ is used to calculate the Euclidean distance between two points. On this basis, the position of S in the world coordinate system, $P_S^\mathrm{^{\prime}}$, can be obtained as
where $\beta$ = azimuth of point S with respect to point B in world coordinates. Then, we can obtain the pixel coordinates of the point $S$:
where $\textrm{Transform}^{\prime}()$ = reverse perspective.
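Because the intermediate equations are not reproduced here, the following sketch only illustrates the idea: project the waterline endpoints O and F into world coordinates, take a point B on the waterline (its midpoint in this sketch), offset perpendicular to the travel direction by a half-width estimated from the ratio $\delta$, and project the result back to pixel coordinates. The midpoint and half-width choices, and the `transform`/`transform_inv` callables (e.g. wrappers around the homography above), are assumptions.

```python
import numpy as np

def estimate_ship_position(p_o, p_f, transform, transform_inv, delta=1/6.0):
    """Illustrative sketch of the Section 3.5 ship positioning point S.

    p_o, p_f      : pixel coordinates of the waterline end points O and F
    transform     : callable, pixel -> world (lon/lat) forward perspective
    transform_inv : callable, world -> pixel reverse perspective
    delta         : width/length ratio (about 1/6 for cargo ships, 1/3 for yachts)
    """
    o_w = np.asarray(transform(p_o), dtype=float)
    f_w = np.asarray(transform(p_f), dtype=float)
    b_w = (o_w + f_w) / 2.0                                  # point B on the waterline (assumed midpoint)

    waterline = f_w - o_w
    length = np.linalg.norm(waterline)
    # Unit vector perpendicular to the travel direction, towards the hull centre
    normal = np.array([-waterline[1], waterline[0]]) / (length + 1e-12)

    s_w = b_w + normal * (delta * length / 2.0)              # offset by half the estimated ship width
    return s_w, transform_inv(s_w)                           # S in world and pixel coordinates
```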
(3) AIS signal prediction and data fusion
Next, we demarcate the object region within the longitude and latitude coordinate system. We receive the dynamic AIS information in the object region and extract the Maritime Mobile Service Identity (MMSI) number, longitude and latitude, direction and speed information. We track the latest signal points of the ship according to the MMSI number, where ${D_{mmsi,t}} = \{{\textrm{lon},\textrm{lat},\textrm{sog},\textrm{cog}} \}$. The AIS signal is updated only every few seconds and is out of sync with the video signal; therefore, it is necessary to track the ship and continuously predict its position at every moment ${A_{mmsi,t^{\prime}}} = \{{\textrm{lon}^{\prime},\textrm{lat}^{\prime}} \}$ according to ${D_{mmsi,t}}$, such that
The position of an AIS object at each moment is expressed as ${P_A}$, and the position of the video ship object after the forward perspective transformation at the same moment is expressed as $P_S^\mathrm{^{\prime}}$; the two points constitute the vector ${R_1} = \overrightarrow {P_S^\mathrm{^{\prime}}{P_A}}$. Their forward directions are represented by the vector ${R_2} = P_S^\mathrm{^{\prime}} + \overrightarrow {P_O^\mathrm{^{\prime}}P_F^\mathrm{^{\prime}}}$, and we calculate the distance between the two as
where $\textrm{Normalisation}()$ represents a normalised function, the first half of the formula represents the Euclidean distance between two points, and the second half represents the cosine similarity between the vectors. The range of values of distance is $[{0,2} ]$.
Using the near-matching mechanism, information pairs with smaller distances can be matched first.
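An illustrative sketch of the AIS dead reckoning and the combined Euclidean/cosine matching distance is given below. The local metric projection, the 500 m normalisation constant, the equal weighting of the two terms and the greedy nearest-first matcher are assumptions; the paper only specifies that a normalised Euclidean term and a cosine-similarity term are combined into a distance in [0, 2] and that smaller distances are matched first.

```python
import math
import numpy as np

EARTH_R = 6371000.0   # metres
KNOT = 0.514444       # metres per second

def predict_ais_position(lon, lat, sog_knots, cog_deg, dt_seconds):
    """Dead-reckon an AIS report forward by dt seconds (flat-earth approximation)."""
    dist = sog_knots * KNOT * dt_seconds
    brg = math.radians(cog_deg)                          # course over ground, clockwise from north
    dlat = dist * math.cos(brg) / EARTH_R
    dlon = dist * math.sin(brg) / (EARTH_R * math.cos(math.radians(lat)))
    return lon + math.degrees(dlon), lat + math.degrees(dlat)

def to_local_metres(lonlat, origin):
    """Equirectangular projection of (lon, lat) around an origin; valid for small areas."""
    x = math.radians(lonlat[0] - origin[0]) * EARTH_R * math.cos(math.radians(origin[1]))
    y = math.radians(lonlat[1] - origin[1]) * EARTH_R
    return np.array([x, y])

def fusion_distance(p_video, p_ais, dir_video, dir_ais, origin, scale=500.0):
    """Combined matching distance in [0, 2]: normalised Euclidean term + cosine term."""
    eucl = np.linalg.norm(to_local_metres(p_video, origin) - to_local_metres(p_ais, origin))
    eucl_term = min(eucl / scale, 1.0)                   # assumed normalisation to [0, 1]
    cos_sim = np.dot(dir_video, dir_ais) / (
        np.linalg.norm(dir_video) * np.linalg.norm(dir_ais) + 1e-9)
    return eucl_term + (1.0 - cos_sim) / 2.0             # second term also in [0, 1]

def greedy_match(dist_matrix, threshold=1.0):
    """Nearest-first matching: repeatedly pair the smallest remaining distance."""
    d = np.array(dist_matrix, dtype=float)
    matches = []
    while d.size and d.min() < threshold:
        i, j = np.unravel_index(np.argmin(d), d.shape)
        matches.append((i, j))
        d[i, :] = np.inf
        d[:, j] = np.inf
    return matches
```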
4. Results
4.1 The experiment platform
The experimental scene selected in this paper is the Bund section of the Huangpu River, Yangtze River channel, which is in the central region of Shanghai, shown in Figure 8. During the day, this area accommodates many bulk carriers and river ships, which differ in distance and size. At night, the traffic volume increases and there are more cruise ships and yachts on the water surface. Under the illumination of the lights on both sides of the channel, the water surface is colourful and provides a variety of visual conditions for passing ships.
We used a long-focal-length network camera to receive a video signal with a resolution of $2550 \times 1440$. The AIS signal receiving base station was set at the same location, and the video and AIS signals were transmitted to the remote experimental platform via the network at the same time.
Using the computer vision method for ship detection and tracking, this experiment employed an original, independently established dataset collected with a monitoring camera located on the river bank. Images of the channel were collected from different shooting angles, under different weather conditions and at different times of day, resulting in a total of 2,700 images, covering daytime, night-time, rainy weather and foggy weather, in which 6,330 ships were labelled. The collected image resolution was adjusted to $300 \times 300$, and the ship dataset was divided into six categories: river ships, bulk carriers, cruise ships, tankers, tugboats and yachts. We manually separated the night images, processed them using the method in Section 3.1 and fed the processed images into the improved SSD–DeepSORT framework for training. We used Python, and the training environment consisted of an Ubuntu operating system, the PyTorch 1.0 and TensorFlow 1.14 frameworks, two GTX 1050 Ti GPUs and 10 GB of memory, using the NVIDIA CUDA 9.0 acceleration toolkit.
4.2 Experiment and analysis
As can be seen from Table 1, the average detection accuracy of the improved SSD model over the six ship categories is 87⋅53, which is 4⋅7 higher than that of the original SSD model. The accuracy for the river ships, which are the most frequent type of ship, increased by 5⋅11; the accuracy for the cruise ships improved the most, by 7⋅85; and the accuracy increase for the yachts was smaller, only 1⋅05. These results show that the detector proposed in this paper has a higher detection performance than the original method.
We used continuous frames to test the tracking effect and to draw the ship trajectories. Figure 9 shows the object-tracking test results. As the objects detected in Figure 9(a) show, even in rainy weather, some small ship targets can still be tracked effectively. Figure 9(b) shows that the detection and tracking algorithm still performs well in foggy weather. In Figure 9(c), the ships and the water surface region are dim and difficult to distinguish; however, the trajectories of the objects indicate that the algorithm can solve the problem of ship tracking at night. Figure 9(d) indicates that the algorithm also achieves good detection and tracking for crowded ships.
After capturing the objects in the video, the angle (ranging from 0° to 35°) between the horizontal line and the line connecting an object's central pixels one second apart is calculated. Then, according to the method described in Section 3.5, the real scene was modelled and coordinate matches between the video and AIS objects were made. The result is shown in Figure 10.
To verify the effect of the algorithm on the information enhancement at different times, we selected a 1 h video, sampled one frame every 3⋅5 s for testing and compiled statistics of the relevant indicators in each frame. These indicators included the total number of ships, the number of tracked ships, the number of successful matches with AIS information and the number of correct matches with AIS information. Then, all the indicators were summed over 12 time periods. Because the detection difficulty for ships in day and night scenes is different, and the types of ships on the waterway during these periods are different, 1 h each of day and night scenes was selected; the statistical results for the day and night scenes are shown in Figure 11(a) and (b), respectively.
5. Discussion
As can be seen in Figure 11, the gap between the total number of ships and the number of tracked ships is large, indicating that, in some cases, some ships fail to form a tracking trajectory because they are distant and small, resulting in object tracking interruption or failure to track. In most time periods, the gap between the number of tracked ships and the number of successful matches is significant, indicating that, even if the ship has a greater probability of successfully matching the AIS signal under the premise of successful tracking, in some cases, because of the gap between the AIS forecast position and the actual position of the ship, the matching conditions cannot be met. The gap between the number of successful matches and the number of correct matches is generally small, indicating that the AIS information for the successful matches has a high accuracy.
An analysis of Table 2 reveals that, given a period of 1 h, the total number of ships at night is 2,319, reflecting an increase of 525 compared with the same period during the day. This suggests that the traffic flow on the waterway is significantly greater at night than during the day. According to the summation calculation results for each indicator in Table 2, the ratios of the number of tracked ships to the total number of ships during the day and at night are 87⋅0% and 83⋅4%, respectively, which indicates that it is easier to lose or not track a ship at night. This is because there are more smaller yachts at night and the self-lighting of these ships makes it difficult to separate them from the reflected light on the water surface, leading to object leakage. The ratios of the number of successful matches to the number of tracked ships during the day and night were 93⋅0% and 89⋅5%, respectively; the difference between the two ratios is obvious. This is because the error in the ship signal position prediction at night makes it more difficult to meet the matching conditions. The ratios of the number of correct matches to the number of successful matches during the day and night were 96⋅6% and 94⋅4%, respectively, indicating that the correct rate of matched information is lower at night than during the day because the ships are denser at night, resulting in interference between the ships, which are closer to each other. The overall accuracy of the algorithm is represented by the proportion of the number of correct matches to the total number of ships; the accuracies during the day and night were calculated to be 78⋅2% and 70⋅4%, respectively. The algorithm can visualise the AIS information for most ships in the surveillance video and that the display effect is obviously better during the day than during the night.
As shown in Figure 10, the information displayed in the upper part of the ship detection box includes the MMSI number, type, speed and course, where the type is derived from the detector and the rest of the information is obtained from the AIS dynamic information and is constantly updated. However, because missed detections and occlusions cause a tracked target to disappear temporarily, the cascade matching algorithm of DeepSORT retains the lost track information for a period of time; the optimal max_age must therefore be set by considering both the video frame rate and the ship speed. These basic pieces of ship information help an observer quickly understand the driving status of a ship and can be used to search for the ship data and voyage information according to the ship's MMSI number. Our method can enhance the display of basic ship information in a channel video. It can also be adapted to similar scenes, and further developments can be applied to various channel video surveillance platforms to achieve the intelligent supervision of waterways.
6. Conclusions
VTSs are key to ensuring the efficient navigation of ships. This paper makes full use of channel surveillance video and AIS data to design a ship information tracking scheme. We register virtual information with actual ships and superimpose a visual expression of AIS information on video images. The algorithm was tested on day and night scenes, and the results show that it is effective in real scenes. This study proposes a constructive plan for the intelligent upgrade of river shipping supervision.
This paper proposes a solution to the problem that the contrast between a ship object and the background is weak in night images. The approach is helpful for detecting ordinary bulk ships, river vessels, tankers and tugboats because of their characteristic appearance in HSV images, but not for ship objects with higher brightness; therefore, this method has only local applicability in night scenes. In this paper, a popular object detection algorithm was selected and improved upon, and the advantages of the improved algorithm were verified via comparative tests. Tracking tests were performed on both day and night scenes. The application of computer vision principles and algorithms illustrates that artificial intelligence has broad prospects for maritime applications. Based on the completion of image object positioning, this paper designed a signal-fusion calculation method based on the perspective transformation, simulating the positioning point of a ship and predicting the AIS signal position, which can be applied to a variety of overhead monitoring scenes. Coordinate matching between the virtual information and a ship object is achieved by combining the Euclidean distance and the cosine similarity between the multimodal data.
The establishment of the ship information tracking model in this paper needs to be adjusted according to different video images. SSD-based object detection can adapt to changes in the camera's focal length within a certain range, but the waterway video may be collected from different camera perspectives; thus, it is necessary to consider the multi-angle features of ships when building the ship dataset. Under different focal lengths and viewing angles, different reference-point coordinates are also needed to establish the perspective transformation model, to ensure the basic conditions of correspondence between objects.
Financial statement
This research was funded by the National Natural Science Foundation of China (grant number 71804059) and the Social Development Major Project of the Shanghai Municipal Science and Technology Commission (grant number 18DZ1206300).