NOMENCLATURE
- UAV: Unmanned Aerial Vehicle
- AI: Artificial Intelligence
- DRL: Deep Reinforcement Learning
- RL: Reinforcement Learning
- TL: Transfer Learning
- CNN: Convolutional Neural Network
- BiFPN: Bi-directional Feature Pyramid Network
- API: Application Programming Interface
- YOLO: You Only Look Once
- LIDAR: Light Detection and Ranging
- RPN: Region Proposal Network
- ROI: Region of Interest
- SVM: Support Vector Machines
- HOG: Histogram of Oriented Gradient
- FLD: Fisher Linear Discriminant
- RGN: Relational Graph Network
- GPS: Global Positioning System
- GPU: Graphical Processing Unit
- TPU: Tensor Processing Unit
- $\varphi$: Roll Angle in radians
- $\theta$: Pitch Angle in radians
- $\varPsi$: Yaw Angle in radians
- mAP: Mean Average Precision
- IoU: Intersection Over Union
- TP: True Positives
- FP: False Positives
- TN: True Negatives
- FN: False Negatives
- ms: milliseconds
- BFLOPS: Billions of Floating-Point Operations required per Second
1.0 INTRODUCTION
Drones are generally used for surveillance, reconnaissance, shipping and delivery. Also, the number of drones which are commercially available is increasing, along with the risk of their misuse. Counter-drone systems are an emerging need to detect and eliminate malicious drones or any kind of UAV that threaten public security or individual privacy. Technologies for the detection, localisation and identification of small UAVs include infrared sensors, laser devices, optical surveillance aids and devices, acoustic devices, Light Detection and Ranging (LiDAR) sensors, equipment operating with image recognition technology, devices capable of detecting and localising UAV remote control signals and human air observers(Reference Kratky and Farlik1). After a target drone is detected, elimination methods such as laser guns, water cannons, birds trained to catch drones and jamming can be applied.
However, drones themselves can also be used to counter malicious drones. In the literature, researchers have studied different instruments to detect UAVs. Choi et al.(Reference Choi, Oh, Kim, Chong and Li2) proposed a radar system to detect drones such as quadcopters from long distances. The drone detection system was also tested experimentally in outdoor environments to verify its ability for long-range drone detection. In other research by Bernardini et al.(Reference Bernardini, Mangiatordi, Pallotti and Capodiferro3), an acoustic drone detection method was presented. A machine-learning-based warning system was developed to detect drones by using their audio fingerprint. The effectiveness of this sensing approach was supported by preliminary experimental results. Haag et al.(Reference de Haag, Bartone and Braasch4) presented LiDAR and radar sensors to detect small unmanned aerial system platforms. The position and average velocity of the target could also be determined very accurately by applying motion compensation and target tracking techniques based on the high update rate and ranging accuracy of LiDARs. A full counter-drone system using several types of sensors and several levels of prediction and fusion was also presented by Samaras et al.(Reference Samaras, Diamantidou, Ataloglou, Sakellariou, Vafeiadis, Magoulianitis, Lalas, Dimou, Zarpalas, Votis, Daras and Tzovaras5). A Deep Reinforcement Learning (DRL) solution was proposed by Çetin et al.(Reference Çetin, Barrado and Pastor6) to counter a drone by using another drone. The countering drone can autonomously avoid all kinds of obstacles (trees, cars, houses, etc.) inside a suburban neighbourhood environment, while trying to catch a malicious drone that is moving randomly. The current research could be considered as part of the object detection system in that DRL solution.
In this paper, a state-of-the-art object detection algorithm is trained and tested for the detection of drones in real time. Moreover, the drone detection models are trained using different kinds of image sets, one of which was created by capturing images from the AirSim simulator and auto-labelled. The results are analysed and compared against those of existing drone detection models described in literature. The proposed models are adapted to detect new types of drones by improving the layer parameters of a pre-trained Convolutional Neural Network (CNN). The remainder of this manuscript is organised as follows. In Section 2, related work is explained. Section 3 explains the training setup, drone image dataset and auto-labelling process, the real-time object detection method and the models used to detect drones and transfer learning, as well as the mapping from world to image coordinates. In Section 4, the training and test results are presented. Also, the performance of the proposed models is explained and analysed in detail, followed by a discussion and the conclusions of this work.
2.0 RELATED WORK
In this section, firstly studies related to state-of-the-art object detection methods are presented, then drone detection models available in literature are explained.
2.1 Object detection models
Object detection is one of the core techniques used in computer vision and image processing. There are several fast and powerful real-time object detection systems, such as Fast r-CNN(Reference Girshick7), Faster r-CNN(Reference Ren, He, Girshick and Sun8), Xception(Reference Chollet9), Yolo(Reference Redmon, Divvala, Girshick and Farhadi10), or EfficientNet(Reference Tan and Le11)(Reference Tan, Pang and Le12). These methods have been updated and represent considerable improvements with respect to former CNN models. For example, Fast r-CNN(Reference Girshick7) is an evolution of the VGG16 network, a region-based CNN, using the Caffe framework with multitasking, in which training is carried out in a single stage, thus avoiding any disk storage. Faster r-CNN(Reference Ren, He, Girshick and Sun8) builds on Fast r-CNN by introducing a Region Proposal Network (RPN) with the aim of proposing regions of interest at the same time that feature maps are being generated.
Xception(Reference Chollet9) is also an evolution of the VGG16 network that uses inception layers(Reference Szegedy, Liu, Jia, Sermanet, Reed and Anguelov13). These are neural network layers that independently look for correlations across channels and at spatial pixels. Xception is based on a linear pipeline of depth-wise separable convolution layers, efficiently implemented in TensorFlow and forms the base of the Facebook object detector software.
Yolo(Reference Redmon, Divvala, Girshick and Farhadi10) is a family of algorithms used in many research studies on object detection. Proposed by Redmon et al., Yolo, named after the slogan “you only look once”, frames object detection as a regression problem to spatially separate bounding boxes and associated class probabilities. Yolo processes images by first resizing the input image; a single convolutional neural network is then run on the image, and finally the resulting detections are thresholded based on the model’s confidence. Newer versions of Yolo propose the use of larger neural networks than the previous version and/or show faster execution. Yolo predicts bounding boxes using dimensional clusters as anchor boxes(Reference Redmon and Farhadi14). Yolo offers the advantage of allowing multi-label predictions; that is, an object can be detected as two (or more) different labels at the same time. In this way, a friendly drone and a malicious drone can both be detected and labelled as drones, while at the same time, if their visual appearance is known, they could also be labelled as a threat or non-threat.
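A minimal sketch of this post-processing step is shown below, assuming the network has already produced box coordinates, objectness scores and independent per-class probabilities; the function and variable names are illustrative, not part of any Yolo implementation.

```python
import numpy as np

def filter_detections(boxes, objectness, class_probs, conf_thresh=0.25):
    """Yolo-style post-processing (illustrative): keep boxes whose
    objectness * class probability exceeds a confidence threshold.
    Class scores come from independent sigmoids, so one box may keep
    several labels at once (multi-label prediction)."""
    detections = []
    for box, obj, probs in zip(boxes, objectness, class_probs):
        scores = obj * probs                         # per-class confidence
        labels = np.where(scores > conf_thresh)[0]   # all labels above threshold
        for label in labels:
            detections.append((box, int(label), float(scores[label])))
    return detections

# toy usage: two boxes, three hypothetical classes (drone, threat, bird)
boxes = np.array([[0.4, 0.5, 0.1, 0.1], [0.7, 0.2, 0.2, 0.2]])
objectness = np.array([0.9, 0.3])
class_probs = np.array([[0.8, 0.6, 0.1], [0.2, 0.1, 0.4]])
print(filter_detections(boxes, objectness, class_probs))
```

In this toy example the first box keeps two labels (multi-label prediction), while the second is discarded because none of its confidences pass the threshold.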
EfficientNet(Reference Tan and Le11) is a recent state-of-the-art object detection model that, together with its evolution, EfficientDet(Reference Tan, Pang and Le12), has become very popular in a short time thanks to its accuracy and efficiency. One of the key improvements is its novel Bi-directional Feature Pyramid Network (BiFPN), which allows information to flow in different directions: top down and bottom up. Secondly, it uses a fast, normalised fusion technique, which adds an additional weight for each input feature, thus identifying the importance of each input feature. Finally, it introduces a scaling method, which jointly resizes the resolution/depth/width of the model to better fit with different resource constraints. EfficientNet-B0 is a version of EfficientNet adapted for small-size objects.
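The fast normalised fusion idea can be illustrated with a short NumPy sketch, assuming the learnable scalar weights are given; this is a conceptual illustration, not the EfficientDet implementation.

```python
import numpy as np

def fast_normalised_fusion(features, weights, eps=1e-4):
    """Fast normalised feature fusion (sketch): each input feature map
    gets a scalar weight, the weights are kept non-negative with ReLU
    and normalised by their sum, so the fused output is a weighted
    average of the inputs."""
    w = np.maximum(weights, 0.0)          # ReLU keeps the weights >= 0
    w = w / (w.sum() + eps)               # normalise so the weights sum to ~1
    return sum(wi * fi for wi, fi in zip(w, features))

# toy usage: fuse two 4x4 feature maps coming from different scales
p_top_down = np.random.rand(4, 4)
p_bottom_up = np.random.rand(4, 4)
fused = fast_normalised_fusion([p_top_down, p_bottom_up], np.array([0.7, 1.3]))
print(fused.shape)
```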
2.2 Object detection models for on-board UAV processing
Object detection methods have also been implemented to detect objects from UAVs with different target applications such as surveillance and disaster management.
For example, real-time object detection has been performed to detect humans by using videos captured from a UAV(Reference Bhattarai, Nakamura and Mozumder15), addressing constraints such as computation time, viewing scale and altitude. The results are also visualised in a Geographic Information System platform by geo-localising objects to world coordinates.
In other research, a technique was proposed(Reference Rudol and Doherty16) to detect humans at a high frame rate by using an autonomous UAV. A map of points of interest is built by geo-locating the positions of the detected humans. The video sequences, which are streamed from two video sources, thermal and colour, are collected on board the UAV and fused to increase the successful detection rate.
Furthermore, Yang et al.(Reference Yang, Yurtsever, Renganathan, Redmill and Özgüner17) recently proposed a real-time detection and warning system based on artificial intelligence (AI) to monitor social distancing between people during the coronavirus disease 2019 (COVID-19) pandemic(18). A fixed monocular camera is used to detect individuals in a region of interest (ROI), and the distances between them are measured in real time without data recording. The proposed method was tested across real-world datasets to measure its generality and performance.
2.3 Drone detection models
In 2017, the European SafeShore project, in parallel with the IEEE AVSS conference, launched the Drone-vs-Bird Challenge to improve methods for detecting UAVs close to coastal borders, where they can easily be confused with birds. The challenge aims to correctly label drones and birds appearing in a video stream. From the papers presented in the first and second editions, in 2017 and 2019, respectively, it can be concluded that new advances are being driven by using CNNs with more and more layers, larger training datasets and improved implementations. In addition, the exploitation of temporal information is a key issue in differentiating drones from birds. The best paper of 2019(Reference Craye and Ardjoune19) proposed a 110-layer CNN based on a semantic segmentation U-Net network originally designed for medical image processing, adding some dilation layers to improve detection of small objects. Subsequent spatial–temporal filtering allowed an F1-score of $0.73$ to be obtained. The set of training and test images used in those works, although challenging, were taken from the ground with the sky as background.
Drone-Net(20) is a deep learning model, available as open source, trained with 2,664 real drone images. It contains 24 convolutional layers in total, two of which are detection layers called Yolo layers. The details of the Drone-Net CNN are shown in Fig. B.1 in the Appendix. This model is considered as the state-of-the-art model for drone detection, and its accuracy results will be compared with those achieved by the new models presented herein.
Xiaoping et al.(Reference Xiaoping, Songze, Boxing, Yanhong and Feng21) proposed a dynamic drone detection method based on two consecutive inter-frame differences. They combine a Support Vector Machines (SVM) classifier with the traditional Histogram of Oriented Gradient (HOG) detection algorithm, and add an intermediate step based on a Fisher Linear Discriminant (FLD) to reduce the dimensionality of the HOG features. Using a dataset of 500 images, their results show an accuracy above $90\%$, similar to other drone detection algorithms, but with improvements in terms of the detection time that allow the processing of up to 10 images per second. However, since this method is based on the difference between consecutive images, it is not suitable for a moving camera.
Hu et al.(Reference Hu, Duan, Mao, Zhou and Zhou22) proposed an object detection method, called DiagonalNet, by using an improved hourglass CNN as its backbone network and generating confidence diagonal lines as the detection result. A large dataset (10,974 sample images) was created by processing videos and photos with different backgrounds and lighting, augmented by rotating and flipping each image with random angles. The images contain six types of UAVs, including multirotor and helicopter devices, and are labelled manually. The proposed algorithm detects UAVs quickly (at 31 images per second) and accurately (with mean average precision above 90%). However, the experiments were all carried out indoors and in close proximity to the target drone.
With the objective of improving swarm cooperative flight and low-altitude security, Jin et al.(Reference Jin, Jiang, Qi, Lin and Song23) proposed a Six-Dimensional (6D) pose estimation algorithm for quadrotors. The proposed algorithm, based on the Xception network and pre-trained with ImageNet, includes a novel Relational Graph Network (RGN) to improve the performance of the drone pose detection. The pose is obtained by recognising eight key points (nose, rotors, etc.) of a quadrotor. After training with 340 images, simulation experiments showed a mean average precision of $0.94$ in position and $0.74$ in velocity, representing an enhancement of $9.7\%$ compared with the baseline network. Given the real-time capabilities of the algorithm (30 frames per second), it was also tested with real flights. However, the mean average precision dropped to $0.65$–$0.75$. Moreover, the accuracy of the 6D pose results decreased for small-sized drones.
Carrio et al.(Reference Carrio, Tordesillas, Vemprala, Saripalli, Campoy and How24) used AirSim to train a CNN detection network with 16 layers by automatically labelling depth maps. Depth maps were obtained by stereo-matching of the Red–Green–Blue (RGB) image pairs of the virtual ZED stereo camera on the AirSim drone. The ground-truth labels of the depth maps were generated automatically by colour segmentation of the visual image. After training with 470 images, the detection system was integrated on board a small drone and tested while the drone navigated in the environment. The results showed that the system can simultaneously detect drones of different sizes and shapes with mean average precision of 0.65–0.75 and localise them with a maximum error of 10% of the ground-truth distance when flying in linear motion encounters. The solution is limited to a maximum distance of 8m with relative speeds up to 2.3m/s.
Wyder et al.(Reference Wyder, Chen, Lasrado, Pelles, Kwiatkowski, Comas, Kennedy, Mangla, Huang, Hu, Xiong, Aharoni, Chuang and Lipson25) presented a UAV platform to detect and counter a small UAV in an indoor environment where Global Positioning System (GPS) is not available. An image dataset is used to train a Tiny Yolo object detection algorithm. This algorithm, combined with a simple visual-servoing approach, is also validated in a physical platform. It successfully tracked and followed a target drone, even with an object detection accuracy limited to 77%.
3.0 CONTRIBUTIONS
The main contribution of the current work is that the drone detection models are constructed by transfer learning from, and training of, a state-of-the-art object detection algorithm. The drone detection models are trained using different kinds of images of drones to obtain a more robust drone detector. The source code is available online(Reference Çetin26), and the model configuration is presented in detail. Images captured from the AirSim simulator are automatically labelled with a bounding box around the drone. One of the advantages is that the time to label each image captured in the simulator is reduced compared with labelling manually. Moreover, the images can be used directly for training without needing third-party labelling applications. The auto-labelling procedure introduced here can therefore save time for many researchers working in the object detection field. Finally, the models are also tested, and the overall test results show that the proposed models achieve acceptable accuracies, above 85%, in detecting drones.
4.0 TOOLS AND METHODS
In this section, the tools and methods that are used for developing, training and testing the neural network models are discussed.
4.1 Tools
The following training tools are used for model creation:
- AirSim simulator
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig1.png?pub-status=live)
Figure 1. AirSim training images.
AirSim(Reference Shah, Dey, Lovett and Kapoor27) is a platform for AI research to experiment with deep learning, computer vision and reinforcement learning algorithms for autonomous vehicles such as cars or drones. AirSim is built as an Unreal Engine(28) plugin. Unreal Engine provides ultra-realistic rendering and strong graphical features for AirSim. Many environments are available for use in AirSim. In this work, the suburban neighbourhood is selected to capture images.
- Darknet framework
Darknet(Reference Redmon29,30) is a framework for training and testing of neural networks, written in C. The C language provides an efficient solution for general object detection in real time. We use Yolo-V3(Reference Redmon and Farhadi31) with Darknet-53, the original 53-layer CNN shown in Fig. A.1, which achieves the highest measured floating point operations per second. In addition to the implementation of the algorithms, the Darknet framework also provides several pre-trained CNN models.
- Local desktop computer and Google Cloud Platform
The proposed models are trained on a desktop with an NVIDIA GeForce GTX 1060 graphical processing unit (GPU) with 6GB of video memory and an Intel i7 processor with 16GB of memory. In addition, the Google Cloud Platform provides a colaboratory service(Reference Bisong32) (“Google Colab”), which allows Python code to be written and executed in an internet browser. Google Colab executes code on Google’s cloud servers, which provide powerful hardware including GPUs and tensor processing units (TPUs). The neural networks were also tested on a Google cloud server with a Tesla T4-16GB GPU.
4.2 Drone image dataset and auto-labelling
Training a CNN requires a large dataset of labelled images. In this work, new drone images for the training dataset were captured by using the AirSim simulator. AirSim provides a public Application Programming Interface (API) for receiving parameters related to the drone and environment. We created a dataset of 2,000 images for training, $1,280\times960$ pixels in size, captured in AirSim by using a drone flying randomly in the environment. Then, the captured images are auto-labelled by mapping the drone's position in world coordinates to the image coordinates. Some of the image samples used in training can be seen in Fig. 1.
The procedure for projecting from 3D world coordinates to the image plane includes a few steps, which are explained in detail below:
- The pinhole camera model
The pinhole camera model defines the geometric relationship between a 3D point in the scene and its corresponding 2D projection onto the image plane. This geometric mapping from 3D to 2D is a perspective projection. In the AirSim simulator, a pinhole camera model is available and mounted onto the drones for capturing images. The pinhole camera models in AirSim do not include the geometric distortion caused by lenses. Figure 2 shows a schematic view of the pinhole camera projection. The following paragraphs explain the different coordinate systems in more detail:
- Forward projection
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig2.png?pub-status=live)
Figure 2. Pinhole camera projection visualisation.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig3.png?pub-status=live)
Figure 3. Projection summary.
The order of the forward projection is shown in Fig. 3. Firstly, the world coordinates of the drone that appears in the image are converted to camera coordinates by using quaternions and a rotation matrix. The rotations are described as a yaw–pitch–roll sequence, and the rotation matrix can be obtained from the Euler angles, which are available in the simulator, as shown in Equation (1).
Quaternions are applied to the coordinate rotations and to relate them to the Euler angles(Reference Stevens, Lewis and Johnson33). The quaternion for a yaw–pitch–roll sequence is presented in Equation (2)
$$R(\varphi,\theta,\varPsi)=\begin{bmatrix}\cos\theta\cos\varPsi & \cos\theta\sin\varPsi & -\sin\theta\\ \sin\varphi\sin\theta\cos\varPsi-\cos\varphi\sin\varPsi & \sin\varphi\sin\theta\sin\varPsi+\cos\varphi\cos\varPsi & \sin\varphi\cos\theta\\ \cos\varphi\sin\theta\cos\varPsi+\sin\varphi\sin\varPsi & \cos\varphi\sin\theta\sin\varPsi-\sin\varphi\cos\varPsi & \cos\varphi\cos\theta\end{bmatrix}\qquad(1)$$
$$q=\begin{bmatrix}\cos(\varPsi/2)\cos(\theta/2)\cos(\varphi/2)+\sin(\varPsi/2)\sin(\theta/2)\sin(\varphi/2)\\ \cos(\varPsi/2)\cos(\theta/2)\sin(\varphi/2)-\sin(\varPsi/2)\sin(\theta/2)\cos(\varphi/2)\\ \cos(\varPsi/2)\sin(\theta/2)\cos(\varphi/2)+\sin(\varPsi/2)\cos(\theta/2)\sin(\varphi/2)\\ \sin(\varPsi/2)\cos(\theta/2)\cos(\varphi/2)-\cos(\varPsi/2)\sin(\theta/2)\sin(\varphi/2)\end{bmatrix}\qquad(2)$$
The drone captures an image of the other drone, which is visible in the camera at certain angles when the field of view of the camera is less than $60^\circ$. Secondly, the image coordinates are obtained by using the perspective projection of the camera, which uses the camera matrix received from the simulator. Finally, the pixel values are calculated by moving the origin to the upper-left corner of the screen. The mapping geometry is presented in Fig. 2.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig4.png?pub-status=live)
Figure 4. Darknet-53 CNN used as backbone for model 2.
4.3 Transfer learning
The main purpose of Transfer Learning (TL) is to improve the learning performance by using the experience from successfully pre-trained models(Reference Taylor and Stone34). TL can be used for different goals and in different situations. For instance, Drone-Net is built by using transfer learning to refine a Darknet model for detecting drones.
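As an illustration of this workflow with the Darknet framework, the sketch below drives the darknet binary (assumed to be built locally) from Python: the partial command extracts the first convolutional layers of a pre-trained model, and detector train then fine-tunes on the drone dataset. All file names are illustrative, and the -map flag, which reports mAP on the validation set during training, is available in the widely used AlexeyAB fork of Darknet.

```python
import subprocess

# Extract the first 74 convolutional layers of the pre-trained Darknet-53
# backbone (file names are illustrative; the darknet binary and the
# pre-trained weights must already be available locally).
subprocess.run(["./darknet", "partial", "cfg/yolov3.cfg",
                "yolov3.weights", "darknet53.conv.74", "74"], check=True)

# Fine-tune on the drone dataset: drone.data lists the train/validation
# image paths and the single "drone" class, and yolov3-drone.cfg is a
# Yolo-v3 configuration modified for one class.
subprocess.run(["./darknet", "detector", "train",
                "data/drone.data", "cfg/yolov3-drone.cfg",
                "darknet53.conv.74", "-map"], check=True)
```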
4.4 Proposed models
In this section, the new models proposed to improve the current results of Drone-Net are presented.
- Model 1 uses the same CNN architecture as Drone-Net, shown in Fig. B.1 in the Appendix. From the pre-trained Yolo network, up to 16 convolutional layers are transferred. After transferring these convolutional layers, the network is trained with 2,000 new images obtained from the AirSim simulator and another 1,000 images taken from the Drone-Net training set. The main purpose is to use the pre-trained weights from Drone-Net with the hope that the new model can detect real drones as well as drones from AirSim images.
- Model 2 is built by using the pre-trained weights from Darknet-53; the neural network model is based on the Yolo-v3(Reference Redmon and Farhadi31) architecture. In this model, the Darknet-53 model is implemented and the default Yolo-v3 network is modified to detect only the drone class (the corresponding change in the number of output filters before each Yolo detection layer is sketched below). The Yolo-v3 network shown in Fig. 4 contains 107 layers: 75 convolutional layers, 23 shortcut layers, 4 routes, 2 upsamples and 3 Yolo detection layers. The predictions in Fig. 4 show that Yolo-v3 detects objects in three different layers. The model summary can be seen in Fig. C.1 of the Appendix. In this model, up to 74 convolutional layers from the Darknet-53 model are transferred, then the training is extended with 3,000 images: 2,000 images from the AirSim simulator as before, plus another 1,000 images taken from the Drone-Net training set.
- Model 3 is constructed and optimised by using the EfficientNet-B0 object detection algorithm to detect a drone. EfficientNet-B0 contains 145 layers and 2 detection layers. Up to 132 convolutional layers from the EfficientDet-D0 model are transferred. The EfficientNet-B0 model summary used in training can be seen in Fig. D.1 in the Appendix. The training set includes the same 3,000 images: 2,000 images from the AirSim simulator and 1,000 images taken from the Drone-Net training set.

The models proposed here have backbone networks such as Drone-Net, Darknet-53 and EfficientNet-B0. The details of these backbones are presented in Table 1. In addition, in Table 2, the model details are explained. For instance, model 1 uses up to 16 Drone-Net backbone network convolutional layers, and model 2 has a backbone network from Darknet-53 with up to 74 convolutional layers. Model 3 has more layers in total than the other models thanks to the EfficientNet-B0 model, which has 145 layers in total, and it uses up to 132 convolutional layers from EfficientNet-B0.
Table 1 Backbone models used in training
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab1.png?pub-status=live)
Table 2 Model details for training (all 6,000 iterations)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab2.png?pub-status=live)
5.0 RESULTS
In this section, the models described in Section 4 are trained and tested. First, the training metrics are given, then the test results are presented.
5.1 Training results
Training was accomplished in 6,000 steps for all models. During training, the mean Average Precision (mAP) value is calculated and the best weights (those that give the highest mAP value) are saved. The mAP is the mean over all classes of the average precision (AP), where the AP is the area under the precision–recall curve(30). The training results for all the models are summarised in Table 3, showing the mAP metric and the training time for each model. The mAP is calculated for an Intersection Over Union (IoU) threshold of $0.5$ and presented as mAP@0.5.
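The IoU criterion can be illustrated with a short sketch; the boxes and the 0.5 threshold below are a toy example only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as a true positive for mAP@0.5 when it overlaps
# a ground-truth drone box with IoU >= 0.5
prediction = (100, 120, 220, 240)
ground_truth = (110, 130, 230, 250)
print(iou(prediction, ground_truth) >= 0.5)   # True for this toy example
```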
Table 3 Training results of models after 6,000 steps
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab3.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig5.png?pub-status=live)
Figure 5. Loss and mAP (%) for training model 1.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig6.png?pub-status=live)
Figure 6. Loss and mAP (%) for training model 2.
These results show that model 3 achieves the highest mAP value of 89.77%, while model 1 reached the lowest mAP value of 83.63%. Model 2 has an intermediate mAP@0.5 value of 84.93% and also required a training time intermediate between those of model 1 and model 3. The training time is affected by the batch size and the number of subdivisions, which are set in the training configuration of each model. Model 1 had the shortest training time thanks to the size of its neural network, which has a total of 24 convolutional layers.
Training plots for each model are also presented in Figs 5, 6 and 7, where the red line represents the calculated mAP values and the blue line shows the loss value during training. Note that the loss value drops dramatically after 600 iterations for all the models. In Fig. 5, the model 1 mAP value starts at 71% and fluctuates around 80%. After 3,420 iterations, the best mAP value of 83.63% is recorded. In the meantime, the average loss value remains stable at the minimum of around 0.5, which is the highest among all the models.
Moreover, Fig. 6 shows the training progress for model 2. In this training, the mAP value starts at 43% and jumps to the 76% level. After 3,600 iterations, the best mAP@0.5 value of 84.93% is obtained. Although the mAP value is also calculated at the end of the training, we saved the weights corresponding to the highest mAP value to avoid over-fitting.
Finally, the training progress for model 3 is presented in Fig. 7. As when training the other models, the mAP values were calculated during training and the best weights saved. In this training, the best weights were obtained at around 4,000 iterations. The model 3 mAP value reaches 89.77%, the highest among the three models. The mAP value of model 3 starts at a higher value of 68%, and the loss settles down to $0.25$ within 6,000 iterations.
5.2 Test results
Once the models were trained, we fed them with new, unseen images for the test evaluation. The state-of-the-art object detection models, shown as the backbone models in Table 1, and the three proposed models were tested with four different groups of images. The details regarding the number of test images from each group are presented in Table 4. Four sources of images were used, with a balanced distribution of 20 images from each. Some of the images are shown in Fig. 8.
Table 4 Example test images
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab4.png?pub-status=live)
Note in Fig. 8 that the first two images (a, b) are from the AirSim simulator, images (c) and (d) are obtained from the Drone-Net test set, while images (e) and (f) are random but challenging drone images, with noisy backgrounds, found on the Internet. Finally, the models were also tested with images, such as (g) and (h), which do not include any drones, to capture potential errors in prediction. Note that, unlike the other sets, the AirSim images are drone images taken from another flying drone, not from the ground.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig7.png?pub-status=live)
Figure 7. Loss and mAP (%) for training model 3.
The following evaluation metrics were used to measure the test performance of the neural networks and compare the results obtained by each model:
- Accuracy: (TP + TN)/total predictions
- Precision: TP/(TP + FP)
- Recall: TP/(TP + FN)
- F1-score: 2 $\times$ (precision $\times$ recall)/(precision + recall)

based on the number of True Positives (TP), False Positives (FP) (or detection errors), False Negatives (FN) (or omissions) and True Negatives (TN).
While the accuracy measures the goodness of the models, the precision and recall measure the errors in terms of the false detection rate and omissions, respectively. The F1-score, which is calculated as the harmonic mean of the precision and recall, is a measure of the robustness of a model.
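These metrics can be computed directly from the confusion counts, as in the minimal sketch below; the counts used in the usage example are illustrative only and do not correspond to any table in this paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from the confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else float("nan"))
    return accuracy, precision, recall, f1

# illustrative counts for an 80-image test set (not the paper's results)
print(detection_metrics(tp=55, fp=3, tn=15, fn=7))
```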
The overall test results including the whole test set (80 images) are presented in Table 5. The test results for each set of test images are also analysed in detail, and the results presented in Tables 7, 8, 9 and 10.
Table 5 Overall test results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab5.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig8.png?pub-status=live)
Figure 8. Test images from four different sets.
The results presented in Table 5 show that Darknet-53 and EfficientNet-B0 were less accurate and, as expected, their F1-scores were very low compared with Drone-Net, which is a state-of-the-art drone detection model. The Darknet-53 and EfficientNet-B0 models were already trained by using the COCO(Reference Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick35) image set to detect up to 80 classes. These classes are different kinds of objects such as humans, cars, trees, birds, dogs, bags, trains, etc., while all kinds of air vehicles are labelled under a single aeroplane class. A detailed look at the number of TP predictions in Table 5 shows that Darknet-53 and EfficientNet-B0 miss most of the drone detections. For this reason, both models also have a high number of FN predictions and thus a low F1-score compared with Drone-Net. As a direct conclusion for counter-drone systems, it is not feasible to use such generic models directly to detect only drones. Consequently, the new models proposed here provide an improved way to detect drones with acceptable accuracy. For example, model 1, model 2 and model 3 achieved high accuracies, reaching 85%, 91% and 86%, respectively. Correspondingly, the models have high F1-scores of 90%, 94% and 91%, respectively, with model 2 achieving the highest scores compared with the other models. This is very important because counter-drone systems are expected to detect drones precisely. In other words, the number of false detections is expected to be zero or as low as possible, thus avoiding failures in the detection of unexpected intruder drones. As seen in Table 5, model 2 achieved the fewest false detections of all the models.
Table 6 Comparison of real-time performance of the models
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab6.png?pub-status=live)
The real-time performance of the proposed models is compared in Table 6. Evaluation metrics such as the inference time in milliseconds (ms) and Billions of Floating-Point Operations required per Second (BFLOPS) are used to compare each model. The models are tested on a 16-GB Tesla T4 GPU, which is commonly used among researchers. Model 1 achieved the fastest inference time of 4.838ms. Model 3 performed slightly faster than model 2; although model 3 has the highest number of layers, it requires the lowest BFLOPS value of 3.670.
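As an illustration of how such per-image inference times can be measured for a Darknet-format model, the sketch below uses OpenCV's DNN module; the file names are placeholders, and this is not the exact benchmarking procedure used here (timing on a Tesla T4 GPU would additionally require a CUDA-enabled OpenCV build or the native Darknet binary).

```python
import time
import cv2

# File names are illustrative; any Darknet-format cfg/weights pair works.
net = cv2.dnn.readNetFromDarknet("yolov3-drone.cfg", "yolov3-drone.weights")
image = cv2.imread("test_drone.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)

net.setInput(blob)
net.forward(net.getUnconnectedOutLayersNames())   # warm-up run

start = time.perf_counter()
for _ in range(100):
    net.setInput(blob)
    net.forward(net.getUnconnectedOutLayersNames())
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"average inference time: {elapsed_ms:.3f} ms")
```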
Table 7 AirSim image test results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab7.png?pub-status=live)
To better understand how the models perform predictions, we carried out a deeper analysis of the results for each test dataset. The results of the AirSim images are presented separately in Table 7, revealing that the performance of the models was satisfactory for detecting drones in AirSim images. Model 1 and model 3 showed accuracies of 70%, while model 2 achieved a higher rate of correct detection of 80%. Model 1 showed only one FP detection, while model 2 and model 3 showed none. The F1-score was also calculated for all the models. Model 2 showed the highest F1-score of 0.89, compared with 0.82 and 0.83 for model 1 and model 3, respectively.
Table 8 Web image test results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab8.png?pub-status=live)
Table 8 presents partial results for brand-new images taken from the Internet. Note that model 1, model 2 and model 3 showed very good detection rates of 90%, 90% and 95%, respectively. Additionally, model 1, model 2 and model 3 achieved high and promising F1-scores of 0.95, 0.95 and 0.98, respectively.
When looking at the partial results for the models tested with Drone-Net images, model 1, model 2 and model 3 all detected the drones with an accuracy of 100%, as seen in Table 9. The models also achieved F1-scores of 1.
Table 9 Drone-Net image test results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab9.png?pub-status=live)
Finally, the partial test results for the images which do not include drones are presented in Table 10. The models are expected not to detect any object, even in images containing shapes similar to drones. All the models showed high accuracy (above 80%), although model 1 and model 3 produced a few FP detections. The F1-score is undefined here given that the number of TP is zero.
Table 10 No-drone image test results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_tab10.png?pub-status=live)
Figure 9 shows the same AirSim test image to compare the detection results of the four models able to predict only the class drone. The Drone-Net model prediction shown in Fig. 9(a) fails to detect drones in the AirSim test images. However, model 1, which has the same configuration as the Drone-Net model but is trained with AirSim images, successfully detected a drone (Fig. 9(b)). The model 2 and model 3 detection tests shown in Fig. 9(c) and (d) reveal that both correctly detected drones.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig9.png?pub-status=live)
Figure 9. AirSim image test detection results.
Figure 10 shows the test results of the models for a challenging image from the web image set. All the new models successfully detected the drone in the image. However, the state-of-the-art drone detection model Drone-Net failed to detect the drone against such a noisy background. In addition, the bounding boxes can have different sizes. For example, in Fig. 10(c), the bounding box is larger than needed to cover the predicted drone. However, this is not the general case across the test images. Such variations in bounding-box size may result from background noise and the scale of the drone in the image.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig10.png?pub-status=live)
Figure 10. Web source image test detection results.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig11.png?pub-status=live)
Figure 11. Inaccurate bounding boxes in auto-labelling.
6.0 DISCUSSION
In this section, the results are analysed and discussed further.
6.1 Inaccurate bounding boxes of the auto-labelling process
In auto-labelling, it was observed that there were inaccurate bounding boxes, and they had to be removed from the training set. Almost 10% of the auto-labelled images were found to be inaccurate. In Fig. 11, some of the inaccurately labelled images are presented. For instance, in Fig. 11(a), there is a bounding box at the top left of the image with no drone shown within. At the moment of image capture, the drone was very close to the other drone, and the mapping from world to image coordinates was not correct. We believe that this issue was caused by the processing time between capturing the image and obtaining its world position from the simulator. The processing time does not cause problems when the drones are far from each other, but when they are too close, their relative speed is high and the retrieved drone location is already obsolete. However, if the drone stays stationary, this problem disappears and none of the bounding boxes are inaccurate. Other erroneous and disregarded bounding boxes, not centred correctly in the images, are shown in Fig. 11(b), (c) and (d), all of which correspond to a drone at the border or outside of the captured image.
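The sketch below illustrates the kind of sanity checks that can be used to discard such labels automatically; the image size matches the dataset described above, but the box-size threshold is an assumption for illustration and not the exact criterion applied in this work.

```python
def is_valid_auto_label(cx, cy, w, h, img_w=1280, img_h=960,
                        max_box_fraction=0.5):
    """Reject auto-generated labels whose bounding box (centre cx, cy and
    size w, h in pixels) falls partly outside the image, or is implausibly
    large (target too close when the pose was sampled)."""
    x1, y1 = cx - w / 2, cy - h / 2
    x2, y2 = cx + w / 2, cy + h / 2
    inside = 0 <= x1 and 0 <= y1 and x2 <= img_w and y2 <= img_h
    not_too_large = w <= max_box_fraction * img_w and h <= max_box_fraction * img_h
    return inside and not_too_large

print(is_valid_auto_label(640, 480, 80, 60))   # well-centred box -> True
print(is_valid_auto_label(20, 480, 80, 60))    # spills over the left edge -> False
```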
6.2 FP detections
In this section, FP detections by the models are discussed and analysed. These are error cases where another object is mistakenly labelled as a drone. The FP images can be seen in Fig. E.1 in the Appendix.
At the start of this research, we tested Drone-Net to detect drones in some images and found that most of its detections were FP, especially when dealing with AirSim images. Figure E.1(a), (b), (c), (d) and (e) shows AirSim test images detected as FP. These figures show that Drone-Net is not accurate enough to detect a drone in these images. The detection boxes were attached to objects such as trees or wires, or covered the whole image. In Fig. E.1, Drone-Net detects a drone, but it also detects the commercial aircraft as a drone. Drone-Net also detects two of the images from the no-drone test image set as FP. Figure E.1(g) and (h) shows that objects such as humans and road signs are also detected as FP by Drone-Net.
Model 1 also detected FP images. For example, in Fig. E.1(i), a large part of the image was detected as a drone, similar to the Drone-Net model. However, most of the FP detections occurred for no-drone test images. Figure E.1(j), (k), (l), and (m) shows that drone-like shapes could be detected as a drone.
In addition, model 2 detected two FP images, one of them from the no-drone test image set. A noisy image from the no-drone image set was tested, and model 2 detected smoke as a drone. This FP, seen in Fig. E.1(o), may be caused by the shape of the smoke, which resembles a drone in the image.
As observed above for the other models, model 3 also detected objects which are not drones. A drone is detected in a few of the test images from the no-drone test set seen in Fig. E.1(q), (r), (s), and (t). However, there are no drones in these images.
Some of the common FP detections among the models are shown in Fig. E.1(n), (p), and (u). In these figures, taken from one of the web test images, a drone is reported, but the detected object is a commercial aircraft rather than the actual drone at the bottom left of the aircraft, which makes it an FP.
7.0 CONCLUSIONS
Drone detection is a critical element of counter-drone systems, which also include other subsystems such as drone type classifiers (malicious or friendly) and neutralisation subsystems, such as jamming, laser guns or shotguns. We believe that AI will protect the skies from incoming malicious drone threats. This paper shows that drones can be detected with high accuracy by using powerful available real-time object detection algorithms.
The proposed models are trained by using different kinds of images of drones and compared with three state-of-the-art CNN models for object/drone detection that are available in the public domain and also used as baseline models for transfer learning to the new models. The best models are found to be model 2 and model 3. The results show that drones in AirSim images can be detected with high precision by using pre-trained convolutional layers and training with AirSim and Drone-Net images. Additionally, it is observed that model 1 produces promising results compared with the state-of-the-art Drone-Net drone detection model.
Both state-of-the-art object detection algorithms, i.e., Darknet-53 (model 2) and EfficientNet-B0 (model 3), showed similar results. However, in real-world applications such as counter-drone systems, the object detection method must operate with limited resources. EfficientNet-B0 provides state-of-the-art accuracy with a neural network that is nine times smaller and consumes significantly less computation power compared with other state-of-the-art object detectors. In future work, the EfficientNet-B0 drone detection model tested here could be integrated as part of our counter-drone system in which, using DRL methods, a guardian drone could detect and counter malicious drones while respecting and avoiding obstacles and legal drones. Detection accuracy and fast response times are critical for tracking and catching drones. Finally, in the future, the inaccurate bounding boxes of the auto-labelling process must be investigated to obtain more robust drone detection models. One possible solution is to update the simulator when possible, since the updated simulator API might fix the issue relating to the processing time between the image capture and the receipt of the drone flight data from the simulator. A second solution would be to optimise the Yolo object detection function so that the bounding boxes are centred correctly.
ACKNOWLEDGEMENTS
This work was funded by the Ministry of Science, Innovation and Universities of Spain under grant no. TRA2019 and the SESAR Joint Undertaking (JU) project CORUS-XUAM, under grant agreement No 101017682. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and the SESAR JU members other than the Union.
APPENDIX
B.0 Drone-Net & Model 1 NN summary
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig13.png?pub-status=live)
Figure B.1. Drone-net and model 1 neural network model summary.
C.0 Model 2 NN Summary
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig14.png?pub-status=live)
Figure C.1. Model 2 neural network summary.
D.0 Model 3 NN summary
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig15.png?pub-status=live)
Figure D.1. Model 3 neural network summary.
E.0 FP detections
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20211008051703200-0404:S0001924021000439:S0001924021000439_fig16.png?pub-status=live)
Figure E.1. FP images.