
Indoor Scene recognition for Micro Aerial Vehicles Navigation using Enhanced SIFT-ScSPM Descriptors

Published online by Cambridge University Press:  05 July 2019

B. Anbarasu* and G. Anitha

(Madras Institute of Technology Campus, Anna University, Chennai, India)

Abstract

In this paper, a new scene recognition visual descriptor called the Enhanced Scale Invariant Feature Transform-based Sparse coding Spatial Pyramid Matching (Enhanced SIFT-ScSPM) descriptor is proposed by combining a Bag of Words (BOW)-based visual descriptor (SIFT-ScSPM) with Gist-based descriptors (Enhanced Gist and Enhanced multichannel Gist (Enhanced mGist)). Indoor scene classification is carried out by multi-class linear and non-linear Support Vector Machine (SVM) classifiers. The feature extraction methodology and a critical review of several visual descriptors used for indoor scene recognition are discussed from an experimental perspective. An empirical study is conducted on the Massachusetts Institute of Technology (MIT) 67 indoor scene classification data set to assess the classification accuracy of state-of-the-art visual descriptors and of the proposed Enhanced mGist, Speeded Up Robust Features-Spatial Pyramid Matching (SURF-SPM) and Enhanced SIFT-ScSPM visual descriptors. Experimental results show that the proposed Enhanced SIFT-ScSPM visual descriptor achieves a higher classification rate, precision, recall and area under the Receiver Operating Characteristic (ROC) curve than the state-of-the-art descriptors and the proposed Enhanced mGist and SURF-SPM visual descriptors.

Research Article

Copyright © The Royal Institute of Navigation 2019

1. INTRODUCTION

Navigation of Micro Aerial Vehicles (MAV) in Global Navigation Satellite System (GNSS)-denied indoor environments is still a challenging task. Indoor scene categorisation will aid the MAV to adopt a suitable navigation strategy to fly in a constrained indoor environment. Many scene classification methods have been proposed for indoor scene recognition (Quattoni and Torralba, 2009; Xie et al., 2014; Cakir et al., 2011; Kawewong et al., 2013). The visual perception of a scene at a glance, in spite of its visual complexity, is known as Gist recognition of the scene. In recent years, deep learning architectures or Convolutional Neural Networks (CNN) (Lecun et al., 1998; Zhou et al., 2014; Krizhevsky et al., 2012; Khan et al., 2016) have performed well on large-scale datasets. The main drawback is that training a CNN or other deep learning architecture for image recognition tasks requires a large amount of computational power. The contextual visual information in the region of interest can be combined with local information for effective scene categorisation (Qin and Yung, 2010).

In published work, global features (Meng et al., 2012) have been widely used to represent the scene. A compact spatial pyramid-based image representation has been proposed for object and scene recognition (Elfiky et al., 2012). The Discriminative Ternary Census Transform Histogram (DTCTH) has been proposed to capture the discriminative structural properties in an image (Rahman et al., 2017). A sparse coding model (Yang et al., 2009) has been proposed for image classification, using maximum pooling in a linear Spatial Pyramid Matching (SPM) framework with Scale Invariant Feature Transform (SIFT) features. The SPM method proposed by Lazebnik et al. (2006) has shown promising performance for recognising natural scene categories. Anbarasu and Anitha (2017) proposed a method to estimate the heading and lateral deviation for navigation of a MAV in a corridor environment using a vanishing point detected in the image. Wei et al. (2016) experimentally evaluated several visual descriptors for scene categorisation.

This paper is a comparative study of several state-of-the-art visual descriptors and the proposed Enhanced SIFT-based Sparse coding Spatial Pyramid Matching (Enhanced SIFT-ScSPM), Enhanced multichannel Gist (Enhanced mGist) and Speeded Up Robust Features-Spatial Pyramid Matching (SURF-SPM) visual descriptors, from both methodological and experimental perspectives, for the 67 indoor scene categorisation task performed on the challenging Massachusetts Institute of Technology (MIT)-67 Indoor data set. Many existing state-of-the-art visual descriptors have been implemented in this paper to address the challenges in indoor scene categorisation, and the proposed Enhanced SIFT-ScSPM, Enhanced mGist and SURF-SPM visual descriptors have been compared with visual descriptors such as SIFT with Locality-constrained Linear Coding (SIFT-LLC), multichannel Gist (mGist), SIFT with Spatial Pyramid Matching (SIFT-SPM), Enhanced-Gist, Enhanced-mGist, Gist, Histogram of Oriented Gradients with Spatial Pyramid Matching (HOG-SPM), Census Transform Histogram (CENTRIST), multichannel Census Transform Histogram (mCENTRIST), CENTRIST (spatial pyramid representation), Uniform Local Binary Pattern (ULBP) and Histogram of Directional Morphological Gradient (HODMG).

The major contribution is the integration of dense SIFT-ScSPM descriptors (Bag of Words) and Gist-based descriptors (Enhanced Gist and Enhanced mGist) to recognise 67 complex indoor scene image categories for navigation of MAVs in GNSS-denied indoor environments. Conventional SIFT-ScSPM descriptors capture the spatial information on different spatial image regions at different scales but do not encode the holistic spatial envelope of the indoor scenes. To overcome this shortcoming, a new indoor scene recognition visual descriptor called Enhanced SIFT-ScSPM descriptor is proposed for the indoor scene categorisation task.

2. PROPOSED METHOD

The two major stages in the scene recognition model are offline training and testing phases. The block diagram of the proposed framework for indoor scene recognition is shown in Figure 1.

Figure 1. The block diagram of the proposed method.

Visual descriptors such as SIFT-ScSPM, Enhanced Gist and Enhanced mGist are extracted and combined into a new visual descriptor called Enhanced SIFT-ScSPM for all training images in the offline training phase. The extracted Enhanced SIFT-ScSPM descriptors and image labels are learned by a Support Vector Machine (SVM) classifier at the end of the offline training phase. In the testing phase, the SIFT-ScSPM, Enhanced-Gist and Enhanced mGist descriptors are extracted for each test image frame and fed to the SVM classifier for real-time indoor scene recognition. Finally, the classifier assigns each test image to one of the 67 indoor scene categories based on the trained scene recognition model.

3. VISUAL DESCRIPTORS

In this paper, visual descriptors such as Enhanced SIFT-ScSPM, SIFT-ScSPM, SIFT-LLC, mGist, SIFT-SPM, Enhanced-Gist, Enhanced-mGist, Gist, HOG-SPM, CENTRIST, CENTRIST (spatial pyramid representation), Uniform LBP, DMGH (Directional Morphological Gradient Histogram), mCENTRIST and SURF-SPM have been employed for indoor scene categorisation. In this section, feature extraction techniques for scene categorisation are discussed.

3.1. Scale-Invariant Feature Transform (SIFT)

Lowe (2004) proposed the SIFT algorithm to detect keypoints that are invariant to image scale, rotation, translation and changing illumination. Difference-of-Gaussian filters are applied to identify candidate keypoints in scale-space. Keypoints with low contrast (below a threshold value) are removed to obtain the final stable keypoints in the image. A consistent orientation is assigned to each keypoint based on the local image gradients. SIFT keypoints detected in a corridor image from one of the 67 indoor scene classes are shown in Figure 2.

Figure 2. Visual illustration of SIFT keypoints detected in a corridor image.

Finally, the SIFT descriptor is computed from the image gradient magnitudes and orientations in a 16×16 neighbourhood around each keypoint. The gradient magnitudes and orientations are weighted by a Gaussian window and accumulated into eight-bin orientation histograms over the 4×4 sub-regions (16 regions) of the $16\times 16$ neighbourhood. Concatenating the eight-bin histograms of the 16 sub-regions yields a 128-dimensional SIFT descriptor for each keypoint.
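As an illustration, the short Python sketch below detects SIFT keypoints and their 128-dimensional descriptors with OpenCV; the image file name is an assumption and the detector parameters are left at their defaults, so this is not the exact configuration used in the paper.

```python
# Hypothetical sketch: SIFT keypoints and 128-D descriptors with OpenCV,
# assuming a corridor image stored as 'corridor.jpg'.
import cv2

img = cv2.imread('corridor.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                      # DoG keypoint detector + descriptor
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each row of 'descriptors' is one 128-dimensional vector:
# 4x4 spatial sub-regions x 8 orientation bins per keypoint.
print(descriptors.shape)                      # (num_keypoints, 128)

vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite('corridor_sift.png', vis)
```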

3.2. Speeded-Up Robust Features (SURF)

Bay et al. (2008) proposed a Hessian matrix-based, scale- and rotation-invariant visual descriptor called Speeded-Up Robust Features (SURF). Interest points are detected using the determinant of the Hessian matrix and localised by non-maximum suppression in multi-scale space. The dominant orientation of each keypoint is estimated by calculating the sum of all Haar wavelet responses within a sliding orientation window; the longest vector over all windows defines the orientation of the keypoint. For each 4×4 sub-region around the keypoint, the Haar wavelet responses are computed in the horizontal direction $(d_{x})$ and the vertical direction $(d_{y})$. Finally, the Four-Dimensional (4D) descriptor vector for each sub-region is denoted as $v=\big(\sum d_{x},\sum d_{y},\sum |d_{x}|,\sum |d_{y}|\big)$. SURF keypoints detected in a classroom image from the 67-class indoor scene set are shown in Figure 3.

Figure 3. Visual illustration of SURF keypoints detected in a class room image.
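A comparable sketch for SURF is given below; SURF lives in the non-free opencv-contrib module, so this assumes OpenCV was built with xfeatures2d enabled, and the Hessian threshold and file name are illustrative assumptions.

```python
# Hypothetical sketch of SURF keypoint extraction (requires opencv-contrib
# built with the non-free xfeatures2d module).
import cv2

img = cv2.imread('classroom.jpg', cv2.IMREAD_GRAYSCALE)      # assumed file name

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)      # Hessian-based detector
keypoints, descriptors = surf.detectAndCompute(img, None)

# 64-D descriptors built from Haar-wavelet sums (sum dx, sum dy, sum|dx|, sum|dy|)
# over 4x4 sub-regions around each interest point.
print(len(keypoints), descriptors.shape)
```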

3.3. Histogram of Oriented Gradients (HOG)

HOG was developed by Dalal and Triggs (2005) for human detection. The extraction of HOG features involves several consecutive steps. First, image gradients are computed along the horizontal and vertical directions. Second, the image is divided into connected regions (16×16 pixels) called cells. Third, a histogram of gradient orientations is computed for the pixels within each cell; each pixel casts a vote, weighted by its gradient magnitude, into the corresponding angular bin. Finally, the cell histograms (nine bins) are normalised using the L2-norm and grouped into blocks to obtain the HOG descriptor. The HOG descriptor is used for object detection, object recognition, pedestrian detection in static images and natural scene categorisation tasks. The HOG descriptor extracted from a greyscale image (library) for a 16×16-pixel cell size is shown in Figure 4.

Figure 4. Visual illustration of HOG features detected in a library image.
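A minimal scikit-image sketch of HOG extraction is shown below, mirroring the 16×16-pixel cells and nine orientation bins described above; the file name, block size and block normalisation are assumptions rather than the paper's exact settings.

```python
# Hedged HOG sketch with scikit-image on an assumed 'library.jpg'.
from skimage import io
from skimage.feature import hog

img = io.imread('library.jpg', as_gray=True)          # assumed file name

features, hog_image = hog(img,
                          orientations=9,             # nine-bin histograms
                          pixels_per_cell=(16, 16),   # 16x16-pixel cells
                          cells_per_block=(2, 2),     # assumed block grouping
                          block_norm='L2',            # L2 block normalisation
                          visualize=True)
print(features.shape)
```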

3.4. Gist

The Gist scene descriptor was proposed by Oliva and Torralba (2006) to represent the dominant spatial envelope of a scene. In the original Gist model, 32 Gabor filters at four scales and eight orientations are used. In the image pre-processing stage, a Red-Green-Blue (RGB) indoor image is converted into a greyscale image with a resolution of $256\times 256$ pixels. Next, each indoor greyscale image is filtered by a bank of 32 Gabor filters at four scales (spatial frequencies) and eight orientations to produce 32 feature maps. The 32 feature maps are divided into 16 regions (4×4 grids) and the filter outputs are averaged within each region to produce a 512 (16×32) dimensional Gist descriptor. Gist features extracted from an input image (airport inside) are shown in Figure 5.

Figure 5. Gist feature extraction output (a) Input image (b) Gist features.
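The following simplified sketch illustrates the Gist computation described above: a 4-scale × 8-orientation Gabor bank is applied to a 256×256 greyscale image and each response map is averaged over a 4×4 grid, giving 512 values. The file name and the specific Gabor frequencies are assumptions.

```python
# Simplified Gist sketch: Gabor filter bank + 4x4 grid-average pooling.
import numpy as np
from scipy.signal import fftconvolve
from skimage import io, transform
from skimage.filters import gabor_kernel

img = transform.resize(io.imread('airport_inside.jpg', as_gray=True), (256, 256))

features = []
for frequency in (0.05, 0.1, 0.2, 0.4):                # four scales (assumed values)
    for theta in np.arange(8) * np.pi / 8:             # eight orientations
        kernel = np.real(gabor_kernel(frequency, theta=theta))
        response = np.abs(fftconvolve(img, kernel, mode='same'))
        # average the response map over a 4x4 grid of blocks
        blocks = [b for row in np.array_split(response, 4, axis=0)
                  for b in np.array_split(row, 4, axis=1)]
        features.extend(block.mean() for block in blocks)

gist = np.asarray(features)                             # 4*8*16 = 512 values
print(gist.shape)
```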

3.5. Histogram of Directional Morphological Gradient (HODMG)

Directional morphological gradients (Haralick et al., 1987; Soille, 2003) can be computed for the horizontal and vertical directions by using line segments as structuring elements that are symmetric with respect to the neighbourhood centre. The directional morphological gradient for a given direction α can be computed as follows:

(1)$$g_{L_{\alpha}}(\,f)=\delta_{L_{\alpha}}(\,f)-\varepsilon_{L_{\alpha}}(\,f)$$

where $\delta_{L_{\alpha}}(\,f)$ is the dilated image, $\varepsilon_{L_{\alpha}}(\,f)$ is the eroded image and $L_{\alpha}$ is the line structuring element in direction α. A flat linear structuring element that is symmetric with respect to the neighbourhood centre is used as the morphological structuring element. The directional gradient is extracted from the greyscale image using the same structuring element (line segment L) for the two principal directions, that is, the horizontal and vertical directions. Directional morphological gradient feature extraction for a staircase image is shown in Figure 6. The histograms of the horizontal and vertical directional gradients are then concatenated to produce the final feature descriptor, the 512-dimensional Histogram of Directional Morphological Gradient (HODMG), also referred to as DMGH.

Figure 6. Directional Morphological gradient feature extraction output. (a) Input Staircase image. (b) Horizontal Directional Gradient. (c) Vertical Directional Gradient.
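A hedged sketch of the directional morphological gradient of Equation (1) is given below: the image is dilated and eroded with horizontal and vertical line structuring elements, the results are subtracted, and each gradient image is histogrammed (256 bins each, giving a 512-dimensional HODMG). The file name and the structuring-element length are assumptions.

```python
# Directional morphological gradients via Equation (1): dilation minus erosion
# with line structuring elements, followed by 256-bin histograms.
import numpy as np
import cv2

img = cv2.imread('staircase.jpg', cv2.IMREAD_GRAYSCALE)        # assumed file name

def directional_gradient(image, horizontal=True, length=9):    # length is an assumption
    if horizontal:
        se = cv2.getStructuringElement(cv2.MORPH_RECT, (length, 1))
    else:
        se = cv2.getStructuringElement(cv2.MORPH_RECT, (1, length))
    return cv2.dilate(image, se) - cv2.erode(image, se)         # g = dilation - erosion

g_h = directional_gradient(img, horizontal=True)
g_v = directional_gradient(img, horizontal=False)

hist_h = np.histogram(g_h, bins=256, range=(0, 256))[0]
hist_v = np.histogram(g_v, bins=256, range=(0, 256))[0]
hodmg = np.concatenate([hist_h, hist_v])                        # 512-dimensional HODMG
```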

3.6. CENTRIST

CENTRIST is a visual descriptor in which each centre pixel intensity value of a greyscale image is replaced by its Census Transform (CT) value (Wu and Rehg, 2011). In this work, greyscale indoor images with a resolution of $256\times 256$ pixels are converted into Census Transformed images by comparing the intensity value of the centre pixel with the intensity values of the pixels in its 3×3 neighbourhood. For example, as shown in Figure 7, a centre pixel intensity value of 180 is compared with its neighbouring pixels (177, 175, 174, 176, 178, 181, 186 and 183); if a neighbouring pixel value is less than the centre pixel value, bit ‘1’ is assigned to that neighbouring pixel, otherwise it is assigned bit ‘0’. Finally, the eight bits generated from the intensity comparisons (11111000) are collected from left to right and from top to bottom and converted into a base-10 number (248). This base-10 number is called the Census Transform (CT) value.

Figure 7. Census Transformed value.

A histogram of Census Transform (CT) values is used as the scene categorisation visual descriptor. CENTRIST (without Principal Component Analysis (PCA)) is a 256-dimensional feature descriptor. The Census Transformed image for an indoor scene (office) is shown in Figure 8. The CENTRIST feature vector encodes the global structural information of the indoor scene. A spatial representation of the CENTRIST descriptor is extracted by dividing the image into 31 blocks (25+5+1) for the level 2, 1 and 0 splits of a spatial pyramid. The CENTRIST descriptors extracted in the 31 blocks are concatenated to produce a 7,936 (31×256) dimensional descriptor.

Figure 8. Census Transformed output. (a) Office image. (b) Census Transformed office image.
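The Census Transform and the basic CENTRIST histogram can be sketched in a few lines of NumPy, as below; this is a minimal illustration of the comparison scheme in Figure 7 rather than the authors' implementation.

```python
# Minimal Census Transform sketch: each pixel becomes an 8-bit code built from
# comparisons with its 3x3 neighbours (edge-padded so border pixels are included).
import numpy as np

def census_transform(gray):
    """gray: 2-D uint8 array; returns the CT image."""
    padded = np.pad(gray.astype(np.int16), 1, mode='edge')
    centre = padded[1:-1, 1:-1]
    ct = np.zeros(gray.shape, dtype=np.uint8)
    # neighbour offsets, read left-to-right and top-to-bottom as in Figure 7
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = padded[1 + dy:1 + dy + gray.shape[0],
                           1 + dx:1 + dx + gray.shape[1]]
        # bit '1' where the neighbour is darker than the centre pixel
        ct |= (neighbour < centre).astype(np.uint8) << (7 - bit)
    return ct

def centrist(gray):
    # 256-bin histogram of CT values: the basic (non-PCA) CENTRIST descriptor
    return np.histogram(census_transform(gray), bins=256, range=(0, 256))[0]
```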

3.7. Local Binary Pattern

The Local Binary Pattern (LBP) is a powerful texture feature descriptor used by several researchers for texture classification, face recognition and scene recognition (Ojala et al., 2002). In a 3×3 neighbourhood, if a neighbouring pixel value is greater than or equal to the centre pixel value, that neighbouring pixel is assigned the binary value ‘1’; otherwise, it is assigned the binary value ‘0’. Finally, the binary digits collected from the neighbouring pixels are converted into their decimal equivalent, as shown in Figure 9. For a circular neighbourhood, the LBP$_{{\rm P},{\rm R}}$ operator can be computed as follows:

(2)$$\textit{LBP}_{P,R} (x_c,y_c)=\sum_{n=0}^{P-1} s(g_n-g_c)2^n$$

where $(x_{c},y_{c})$ denotes the centre pixel location, $g_{n}$ is the intensity of the n-th neighbouring pixel, $g_{c}$ is the centre pixel intensity, R denotes the radius and s(x) is the unit step function, which returns s(x)=1 if x≥0 and s(x)=0 otherwise. The rotation invariant uniform LBP value for a circular neighbourhood can be computed as follows:

(3)$$\textit{LBP}_{P,R}^{riu2} (x_c,y_c)=\begin{cases} \displaystyle\sum_{n=0}^{P-1} s(g_n-g_c),& U(\textit{LBP}_{P,R})\le 2 \\ P+1,& \text{otherwise} \\ \end{cases}$$

where the subscripts P and R denote the number of sampling points in the circular neighbourhood and the radius of the circle, respectively. The superscript riu2 denotes rotation-invariant uniform patterns.

Figure 9. Illustration of basic LBP operator.
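For the uniform LBP, scikit-image provides a ready-made operator; the sketch below computes the 59-bin uniform LBP histogram for P = 8, R = 1 used later in this paper. The file name is an assumption, and note that the 'nri_uniform' method gives the 59 non-rotation-invariant uniform patterns, whereas 'uniform' would give the 10-bin rotation-invariant riu2 variant of Equation (3).

```python
# Hedged sketch of the 59-bin uniform LBP histogram (P = 8, R = 1).
import numpy as np
from skimage import io
from skimage.feature import local_binary_pattern

img = io.imread('bedroom.jpg', as_gray=True)                  # assumed file name

lbp = local_binary_pattern(img, P=8, R=1, method='nri_uniform')
hist, _ = np.histogram(lbp, bins=59, range=(0, 59))           # 59-bin ULBP descriptor
```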

3.8. mGist

The hyper-opponent colour space (O1O2O3S) enhances the performance of visual descriptors such as mCENTRIST, mGist, msSIFT and mSIFT (Xiao et al., 2014) on the 21-class land-use classification dataset, and mCENTRIST possesses the strongest discriminative power on the challenging 67-class indoor scene recognition dataset; multi-channel feature extraction is therefore highly successful for image classification tasks. In the hyper-opponent colour space, O1, O2 and O3 are the opponent-transformed RGB channels and S is the Sobel image of the R channel. The hyper-opponent transformed image (O1, O2, O3 and S), with a resolution of $256\times 256$ pixels, is filtered by a bank of Gabor filters at four spatial frequencies (scales) and 16 orientations (64 filters) to produce 64 feature maps per channel. The 64 feature maps are divided into 16 regions (4×4 grids) and the filter outputs are averaged within each region to produce 1,024 Gist features for each channel (O1, O2, O3 and S). The mGist descriptor is therefore a 4,096 (1,024×4) dimensional descriptor. An RGB image of a child's room and the corresponding hyper-opponent transformed images are shown in Figure 10.

Figure 10. Visual illustration of mGist descriptor (a) RGB. (b) O1. (c) O2. (d) O3. (e) Sobel image of R channel. (f), (g), (h) (i) and (j). mGist descriptor output for the RGB image and the Hyper opponent transformed images.
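A hedged sketch of the hyper-opponent channels is shown below, assuming the standard opponent colour transform for O1, O2 and O3 and the Sobel gradient magnitude of the R channel for S; the exact normalisation used in the cited work may differ.

```python
# Assumed hyper-opponent colour transform: opponent channels plus Sobel of R.
import numpy as np
import cv2

bgr = cv2.imread('childs_room.jpg')                            # assumed file name
b, g, r = [c.astype(np.float64) for c in cv2.split(bgr)]

o1 = (r - g) / np.sqrt(2.0)                                    # red-green opponent
o2 = (r + g - 2.0 * b) / np.sqrt(6.0)                          # yellow-blue opponent
o3 = (r + g + b) / np.sqrt(3.0)                                # intensity
sx = cv2.Sobel(r, cv2.CV_64F, 1, 0)
sy = cv2.Sobel(r, cv2.CV_64F, 0, 1)
s = np.sqrt(sx ** 2 + sy ** 2)                                 # Sobel image of R channel

# Gist features would then be extracted per channel (O1, O2, O3, S) and concatenated.
channels = [o1, o2, o3, s]
```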

3.9. Enhanced-Gist

The Enhanced-Gist descriptor encodes both the spatial envelope and the geometric structure of the indoor scene. A bank of Gabor filters at four scales and 16 orientations (64 filters) is applied to a greyscale image of size $256\times 256$ pixels to produce 64 feature maps. The 64 feature maps are divided into 16 regions (4×4 grids) and the filter outputs are averaged within each region to produce a 1,024 (16×64) dimensional Gist descriptor. Afterwards, a bank of Gabor filters at four scales and 64 orientations (256 filters) is applied to the greyscale image to produce 256 feature maps, which are likewise divided into 16 regions and averaged within each region to produce a 4,096 (16×256) dimensional Gist descriptor. The total dimensionality of the Gist descriptor is then 5,120 (1,024+4,096). Finally, the extracted Gist features (5,120-dimensional) are combined with the DMGH features (512-dimensional) to produce a 5,632-dimensional visual descriptor, called the Enhanced-Gist feature descriptor.

3.10. Enhanced-mGist

Extraction of the Enhanced-mGist descriptor is done in three steps. In step 1, the hyper-opponent transformed image (O1, O2, O3 and S), with a resolution of $256\times 256$ pixels, is divided into 4×4 grids (16 regions) and convolved with a bank of 16 Gabor filters at four scales and 16 orientations; the Gabor filter responses are averaged within the 16 regions to produce 1,024 Gist features for each channel (O1, O2, O3 and S), giving a 4,096 (1,024×4) dimensional mGist descriptor. In steps 2 and 3, the same procedure is repeated with banks of 256 and 512 Gabor filters, respectively, each again producing a 4,096 (1,024×4) dimensional mGist descriptor. Finally, the three extracted mGist descriptors are concatenated to produce the 12,288-dimensional Enhanced-mGist descriptor.

3.11. Bag-of-words algorithms

Three different Bag-of-Words (BoW) algorithms, namely SPM-based SIFT (SIFT-SPM), Sparse coding Spatial Pyramid Matching-based SIFT (SIFT-ScSPM) and Locality-constrained Linear Coding (LLC)-based SIFT (SIFT-LLC), are employed for indoor scene classification. A BoW algorithm has a feature extraction stage (features are extracted on a regular grid) followed by quantisation of the extracted features into discrete visual words; for each indoor image, visual words are assigned based on a codebook (trained dictionary). Lazebnik et al. (2006) proposed the Spatial Pyramid Matching (SPM) algorithm to recognise natural scenes. In SPM, features such as SIFT, SURF and HOG are extracted in each grid cell by dividing the input image into regular grids (grid spacing of eight pixels). First, k-means clustering is applied to form a visual vocabulary, and the local features are then represented by Vector Quantisation (VQ) against this trained dictionary. The spatial histograms (average pooling) of the coded features are used as feature vectors, and a one-against-all (OAA) SVM is used to recognise the indoor scene categories. Sparse coding based Spatial Pyramid Matching (ScSPM) was proposed by Yang et al. (2009). In ScSPM, sparse coding is used to quantise the local features instead of the vector quantisation used in SPM, and the maximum operator is applied for spatial pooling because of its robustness to local spatial translations, whereas SPM pools histograms by averaging. To reduce the computational cost, ScSPM uses a linear SVM classifier. Wang et al. (2010) introduced LLC for image classification; LLC further reduces the computational cost by replacing the sparsity constraint of ScSPM with a locality constraint. A sketch of the SPM pipeline is given below.
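The sketch assumes dense local descriptors with their grid coordinates, a k-means codebook, hard vector quantisation and average-pooled histograms over 1×1, 2×2 and 4×4 spatial grids (21 cells, matching the 8,400-dimensional descriptors reported later for a 400-word codebook); the function names and the omission of per-level pyramid weights are simplifications.

```python
# Compact SPM sketch: codebook learning, hard assignment, spatial-pyramid pooling.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=400):
    # train_descriptors: (N, D) array of local features sampled from training images
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_descriptors)

def spm_histogram(descriptors, positions, image_size, codebook, levels=(1, 2, 4)):
    # positions: (N, 2) integer (row, col) grid coordinates of each descriptor
    k = codebook.n_clusters
    words = codebook.predict(descriptors)
    feats = []
    for cells in levels:                        # 1x1, 2x2 and 4x4 spatial grids
        cell_r = np.minimum(positions[:, 0] * cells // image_size[0], cells - 1)
        cell_c = np.minimum(positions[:, 1] * cells // image_size[1], cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cell_r == i) & (cell_c == j)
                hist = np.bincount(words[in_cell], minlength=k).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
        # (a weighting per pyramid level could be applied here, omitted for brevity)
    return np.concatenate(feats)                # (1 + 4 + 16) * k bins, e.g. 8,400
```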

3.12. mCENTRIST descriptor

Multi-channel feature extraction in the hyper-opponent colour space improves the recognition accuracy. The mCENTRIST descriptor (Xiao et al., 2014) is extracted from the hyper-opponent transformed indoor images (O1, O2, O3 and S) with a resolution of $256\times 256$ pixels. A level 2, 1 and 0 spatial pyramid is used to split each hyper-opponent transformed indoor image into 25, 5 and 1 blocks, respectively. The CENTRIST descriptors extracted in the 31 (25+5+1) blocks are concatenated to produce a 7,936 (31×256) dimensional CENTRIST descriptor for each channel (O1, O2, O3 and S). The resulting mCENTRIST descriptor is a 31,744 (7,936×4) dimensional descriptor.

3.13. Enhanced SIFT-ScSPM descriptor (proposed visual descriptor)

The Enhanced SIFT-ScSPM descriptor is a new integrated visual descriptor proposed for the indoor scene recognition task, obtained by combining the SIFT-ScSPM, Enhanced-Gist and Enhanced mGist descriptors. First, dense SIFT features are extracted from greyscale images with a resolution of $256\times 256$ pixels on a dense grid with a step of two pixels, using overlapping 16×16 patches. The SIFT-ScSPM descriptor is extracted by applying sparse coding to the dense SIFT descriptors; the SIFT-ScSPM descriptor is 21,504-dimensional. To extract the Gist descriptor, a bank of Gabor filters at four scales and 256 orientations (1,024 filters) is applied to the greyscale image of size $256\times 256$ pixels to produce 1,024 feature maps. The 1,024 feature maps are divided into 16 regions (4×4 grids) and the filter outputs are averaged within each region to produce a 16,384 ($16\times 1{,}024$) dimensional Gist descriptor. Next, the extracted Gist features (16,384-dimensional) are combined with the DMGH features (512-dimensional) to produce a 16,896-dimensional Enhanced-Gist visual descriptor. The Enhanced-mGist descriptor is extracted as explained in Section 3.10 and is 12,288-dimensional. The extracted SIFT-ScSPM, Enhanced-Gist and Enhanced mGist descriptors are concatenated to produce the Enhanced SIFT-ScSPM descriptor.
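A minimal sketch of this fusion step is shown below: the Enhanced SIFT-ScSPM vector is formed by concatenating the three component descriptors. The per-descriptor L2 normalisation is an assumption added for illustration, not a step stated in the paper.

```python
# Fusion sketch: concatenate the three component descriptors into one vector.
import numpy as np

def enhanced_sift_scspm(sift_scspm, enhanced_gist, enhanced_mgist):
    # sift_scspm: 21,504-D, enhanced_gist: 16,896-D, enhanced_mgist: 12,288-D
    parts = [sift_scspm, enhanced_gist, enhanced_mgist]
    parts = [p / (np.linalg.norm(p) + 1e-12) for p in parts]   # assumed L2 normalisation
    return np.concatenate(parts)                               # fused feature vector
```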

4. IMPLEMENTATION OF INDOOR SCENE VISUAL DESCRIPTORS

In this section, a detailed description of the implementation of the different visual descriptors for indoor scene recognition is presented. This work implemented the BoW-based visual descriptors (SIFT-ScSPM, SIFT-LLC, SIFT-SPM, HOG-SPM and SURF-SPM), a histogram-based descriptor (DMGH), four Gist-based descriptors (Gist, mGist, Enhanced Gist and Enhanced mGist), four LBP-based descriptors (Uniform LBP, mCENTRIST, CENTRIST-spatial pyramid and CENTRIST), and a combination of BoW-based and Gist-based descriptors (Enhanced SIFT-ScSPM). The Gist descriptor is a biologically inspired visual descriptor; three of its variants, namely Gist, mGist and Enhanced-Gist, are implemented as follows. To extract the Gist descriptor, the input greyscale image with a resolution of $256\times 256$ pixels is divided into 4×4 grids (16 regions) and convolved with 32 Gabor filters at four scales and eight orientations to produce 32 feature maps; the Gabor filter responses are averaged within the 16 regions to produce a 512 (16×32) dimensional Gist descriptor. To extract the mGist descriptor, the hyper-opponent transformed images (O1, O2, O3 and S) with a resolution of $256\times 256$ pixels are divided into 4×4 grids (16 regions) and convolved with a bank of Gabor filters at four scales and 16 orientations (64 filters) to produce 64 feature maps; the Gabor filter responses are averaged within the 16 regions to produce 1,024 Gist features for each channel (O1, O2, O3 and S). The mGist descriptor is a 4,096 (1,024×4) dimensional descriptor.

To extract the HODMG descriptor, horizontal and vertical directional gradients are computed for the indoor greyscale images with a resolution of $256\times 256$ pixels based on Equation (1), and their combination produces a 512-dimensional HODMG descriptor.

To extract the Enhanced-Gist descriptor, banks of Gabor filters at four scales with 16 and 64 orientations (64 and 256 filters) are applied to the greyscale image of size $256\times 256$ pixels to produce 64 and 256 feature maps, respectively. The feature maps are divided into 16 regions (4×4 grids) and the filter outputs are averaged within each region to produce a 5,120-dimensional Gist descriptor ($1{,}024\ (16\times 64)+4{,}096\ (16\times 256)$). Finally, the extracted Gist features (5,120-dimensional) are combined with the DMGH features (512-dimensional) to produce a 5,632-dimensional Enhanced-Gist visual descriptor.

Three different feature vector encoding techniques, ScSPM, LLC and SPM, are employed to encode the SIFT-based descriptors. To evaluate the encoding capability of these techniques, the SIFT-ScSPM, SIFT-LLC and SIFT-SPM descriptors were compared, as shown in Table 1. To implement the three SIFT-based descriptors (SIFT-ScSPM, SIFT-LLC and SIFT-SPM), dense SIFT features were extracted on a dense grid with a step size of eight pixels from overlapping 16×16 patches. In SIFT-ScSPM, a codebook size of 1,024 is used. To extract the SIFT-ScSPM descriptor, sparse coding is applied to the extracted dense SIFT descriptors and a multi-scale spatial maximum pooling is then applied to obtain a linear SPM kernel based on sparse coding; a sparsity constraint parameter of 0·15 was used. K-means clustering was applied to the local features to obtain a feature dictionary of 1,024 visual words. Using the ScSPM algorithm, global features are obtained from the dense SIFT features, as sketched below.
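The sketch below illustrates the ScSPM encoding step under stated assumptions: dense SIFT descriptors are sparse-coded against a 1,024-atom dictionary and the absolute codes are max-pooled over the spatial pyramid cells. MiniBatchDictionaryLearning is used here as a stand-in for the dictionary training described in the paper; the function names and grid handling are illustrative.

```python
# ScSPM encoding sketch: sparse coding of dense SIFT + spatial max pooling.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

def train_dictionary(train_sift, n_atoms=1024, alpha=0.15):
    learner = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=alpha,
                                          random_state=0)
    return learner.fit(train_sift).components_          # (1024, 128) dictionary

def scspm_encode(sift, positions, image_size, dictionary, levels=(1, 2, 4), alpha=0.15):
    # positions: (N, 2) integer (row, col) coordinates of the dense SIFT patches
    codes = sparse_encode(sift, dictionary, algorithm='lasso_lars', alpha=alpha)
    pooled = []
    for cells in levels:
        cell_r = np.minimum(positions[:, 0] * cells // image_size[0], cells - 1)
        cell_c = np.minimum(positions[:, 1] * cells // image_size[1], cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cell_r == i) & (cell_c == j)
                cell_codes = np.abs(codes[in_cell])
                pooled.append(cell_codes.max(axis=0) if in_cell.any()
                              else np.zeros(dictionary.shape[0]))
    return np.concatenate(pooled)                        # 21 cells x 1,024 = 21,504-D
```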

Table 1. Scene categorisation performance on the MIT 67-indoor dataset.

To extract the SIFT-SPM descriptor, the SPM algorithm was applied to the dense SIFT descriptors to extract global features from the indoor image. In SIFT-SPM, a codebook size of 400 and three pyramid layers (8,400 dimensions) are used. K-means clustering is applied to a random subset of patches to form a visual vocabulary of 400 visual words from the training set. The trained visual words are used to quantise the extracted local features. Finally, the histograms of quantised SIFT features are concatenated to form the SIFT-SPM descriptor with 8,400 bins.

To extract the SIFT-LLC descriptor, the LLC algorithm was applied on dense SIFT descriptors to extract global features from the indoor image. LLC codes are obtained by using a codebook with 1,024 entries trained by the k-Means clustering technique. A k-nearest neighbour value of five was used in the feature extraction stage of the LLC method. For each spatial sub-region in the SPM layer, the LLC codes of the dense SIFT descriptors are pooled (maximum pooling) together to form the SIFT-LLC descriptor.

The BoW-based descriptors implemented for indoor scene recognition in the same manner as SIFT-SPM are HOG-SPM and SURF-SPM. To extract the HOG-SPM descriptor, local features were extracted on a dense grid with a step size of eight pixels from overlapping 16×16 patches. First, the image gradients are computed along the horizontal and vertical directions of the greyscale indoor image with a resolution of $256\times 256$ pixels; histograms over multiple orientations are then computed to produce the HOG features. The SPM algorithm was applied to the HOG features to extract the global HOG-SPM descriptor from the indoor image. In HOG-SPM, a codebook size of 400 and three pyramid layers (8,400 dimensions) are used. To extract the SURF-SPM descriptor, SURF interest points (100 interest points) are first detected in the greyscale indoor image with a resolution of $256\times 256$ pixels. The extracted local features are converted into a global feature (SURF-SPM) using the SPM algorithm. In SURF-SPM, a codebook size of 400 and three pyramid layers (8,400 dimensions) are used.

To extract the CENTRIST descriptor (without PCA), a Census transformed image is obtained from the greyscale indoor image with a resolution of $256\times 256$ pixels by comparing the pixel intensity values in a 3×3 neighbourhood with the centre pixel; a neighbouring pixel is assigned bit ‘0’ if the centre pixel value is less than the neighbouring pixel value, otherwise the neighbouring pixel is assigned bit ‘1’. The extracted CENTRIST descriptor is a 256-dimensional descriptor. The uniform LBP descriptor is extracted from a greyscale image with a resolution of $256\times 256$ pixels using 59 uniform patterns; a histogram of uniform LBP values is computed over the circular neighbourhood (LBP$_{8,1}$) with 59 bins. To extract the spatial pyramid CENTRIST descriptor, the indoor image is divided into 31 blocks, and the final spatial representation is obtained by concatenating the CENTRIST descriptors extracted in the 31 blocks to produce a 7,936-dimensional descriptor.

The SIFT-ScSPM, Enhanced-Gist and Enhanced mGist descriptors are extracted and integrated to produce the Enhanced SIFT-ScSPM descriptor. To extract the spatial pyramid mCENTRIST descriptor, the hyper-opponent transformed indoor image in the O1, O2, O3 and S channels is divided into 31 blocks, and the final spatial representation is obtained by concatenating the CENTRIST descriptors extracted in the 31 blocks to produce a 7,936-dimensional descriptor for each channel. The CENTRIST descriptors obtained in these four channels are concatenated to produce a 31,744 (7,936×4) dimensional mCENTRIST descriptor.

5. INDOOR SCENE CLASSIFICATION PROCEDURE

In this work, Support Vector Machines (SVM) with five different kernels, namely linear, Radial Basis Function (RBF), polynomial, chi-squared and Histogram Intersection Kernel (HIK) kernels, were used to classify the extracted visual descriptors for the 67-class indoor scenes.

The linear kernel for SVM classification can be computed as follows:

(4)$$K(x_i,x_j)=x_i\cdot x_j$$

where $x_{i}$ and $x_{j}$ are two feature vectors.

The RBF kernel for SVM classification can be computed as follows:

(5)$$K(x_i,x_j)=\exp \{-\gamma \Vert x_i -x_j\Vert^2\}$$

where γ is a positive scalar and $x_{i}$ and $x_{j}$ denote two feature vectors.

The polynomial kernel can be computed as follows:

(6)$$k(x_i,x_j)=(x_i\cdot x_j+1)^d$$

where d denotes the degree of the polynomial.

The Chi-square kernel can be computed as follows:

(7)$$k(x,y)=1-\sum_{i=1}^N \frac{2(x_i-y_i)^2}{(x_i+y_i)}$$

where $x_{i}$ and $y_{i}$ denote the i-th components of the two N-dimensional feature vectors x and y.

The Histogram Intersection Kernel (HIK) can be computed as follows:

(8)$$k(x_i,x_j)=\sum_{n=0}^N \min [x_i ({\rm n}),x_j ({\rm n})]$$

where $x_{i}(n)$ and $x_{j}(n)$ denote the n-th bins of the N-bin histogram feature vectors $x_{i}$ and $x_{j}$, respectively.
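A hedged scikit-learn sketch of training SVMs with these kernels is given below; the linear, RBF and polynomial kernels are built into SVC, whereas the chi-square and histogram-intersection kernels of Equations (7) and (8) are supplied as precomputed Gram matrices. The regularisation parameter C and the kernel implementations are illustrative.

```python
# Custom kernels for SVM classification via precomputed Gram matrices.
import numpy as np
from sklearn.svm import SVC

def hik_kernel(X, Y):
    # K(x, y) = sum_n min(x_n, y_n), Equation (8)
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def chi2_kernel(X, Y, eps=1e-10):
    # K(x, y) = 1 - sum_n 2 (x_n - y_n)^2 / (x_n + y_n), Equation (7)
    return np.array([[1.0 - np.sum(2.0 * (x - y) ** 2 / (x + y + eps)) for y in Y]
                     for x in X])

def train_precomputed_svm(train_feats, train_labels, kernel_fn, C=1.0):
    gram = kernel_fn(train_feats, train_feats)
    clf = SVC(kernel='precomputed', C=C).fit(gram, train_labels)
    # to predict: clf.predict(kernel_fn(test_feats, train_feats))
    return clf

# Linear, RBF and polynomial kernels are available directly:
# SVC(kernel='linear'), SVC(kernel='rbf', gamma=...), SVC(kernel='poly', degree=d)
```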

The performance of the visual descriptors with the SVM classifiers was evaluated using ten-fold cross-validation, in which the scene recognition model was trained and tested ten times by dividing the data into ten subsets. In each fold, nine subsets were used for training and the remaining subset was used for testing the extracted features. The parameters of the SVM classifiers were tuned using the validation set in each fold. The average values and standard deviations of the Classification Rate (CR), Precision (P), Recall (R), F-measure (F) and Area Under the Curve (AUC) were calculated over the ten folds.
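A minimal sketch of this ten-fold protocol with a linear SVM is shown below; the variable names 'features' and 'labels' are assumptions, and no separate validation split for parameter tuning is shown.

```python
# Ten-fold cross-validation sketch with a linear SVM.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def ten_fold_accuracy(features, labels):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel='linear'), features, labels, cv=cv)
    return scores.mean(), scores.std()          # mean and standard deviation of CR
```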

5.1. Data set for indoor scene recognition

The MIT-67 class dataset (Quattoni and Torralba, 2009) was used for indoor scene classification. This large scene classification dataset contains 15,620 images of 67 categories of complex indoor scenes. In this work, 100 images per class were considered, so a total of 6,700 images were used for training and testing the classifier with ten-fold cross-validation.

5.2. Performance measures used for indoor scene recognition

In this paper, five performance measures, namely Classification Rate, Precision, Recall, F-measure and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), were used for the indoor scene recognition task. The Classification Rate is the percentage of correctly classified test images. Four counts, True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN), can be calculated from the confusion matrix obtained in each fold of the ten-fold cross-validation.

The Precision rate is computed as follows:

(9)$$P=\frac{TP}{TP+FP}$$

where TP and FP denote the true positives and false positives, respectively.

The Recall rate is computed as follows:

(10)$$R=\frac{TP}{TP+FN}$$

where TP and FN denote the true positives and false negatives, respectively.

The F-measure is computed as follows:

(11)$$F=\frac{2PR}{P+R}$$

where P and R denote the precision rate and recall rate respectively.

To evaluate the indoor scene classification algorithm, a further performance measure, the AUC (Xiao et al., 2016), was used.
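The per-fold measures can be computed from the predictions with scikit-learn, as in the hedged sketch below; macro-averaging over the 67 classes and the use of per-class decision scores for the multi-class AUC are assumptions about how the averages were formed.

```python
# Per-fold performance measures from predictions and per-class scores.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, accuracy_score)

def fold_measures(y_true, y_pred, y_score):
    # y_score: (n_samples, n_classes) probability or decision scores (assumed available)
    return {
        'CR': accuracy_score(y_true, y_pred),                        # classification rate
        'P': precision_score(y_true, y_pred, average='macro'),       # Equation (9)
        'R': recall_score(y_true, y_pred, average='macro'),          # Equation (10)
        'F': f1_score(y_true, y_pred, average='macro'),              # Equation (11)
        'AUC': roc_auc_score(y_true, y_score, multi_class='ovr'),    # area under ROC
    }
```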

6. EXPERIMENTAL EVALUATION AND RESULTS

This section presents an experimental evaluation of the performance of the proposed visual descriptors and of state-of-the-art visual descriptors.

6.1. Classification results on the 67-indoor-scene data set

The indoor scene recognition results of the fourteen selected visual descriptors on the MIT-67 indoor scene classification dataset are reported in Table 1. The indoor scene recognition experiments were carried out on a DELL laptop computer with an Intel i7-7500U CPU operating at 2·70 GHz and 16 GB of RAM. All of the feature extraction and scene classification experiments were implemented in a MATLAB (R2017a) environment. As shown in Table 1, the proposed Enhanced SIFT-ScSPM descriptor has the highest Classification Rate (50·15%) of all the visual descriptors compared. The Enhanced SIFT-ScSPM visual descriptor also has higher precision, recall, F-measure and AUC values than the other scene recognition visual descriptors. DMGH had the lowest Classification Rate (CR), 14·81%, of the scene recognition algorithms compared.

The classification rates of SURF-SPM and HOG-SPM were 28·70% and 28·78% using the Chi-square SVM, and 28·37% and 28·51% using the HIK-SVM, respectively. The Classification Rates (CRs) of CENTRIST using the Polynomial SVM and Chi-square SVM were 22·61% and 25·27%, respectively, as reported in Table 1. Gist and Enhanced-Gist using the RBF-SVM had CRs of 29·70% and 32·08%, respectively. SIFT-SPM had CRs of 38·03% and 36·78% using the Chi-square SVM and HIK-SVM, respectively. The Enhanced-mGist and mGist visual descriptors with the Linear-SVM had CRs of 39·25% and 37·58%, respectively. The mCENTRIST and CENTRIST-spatial pyramid visual descriptors obtained CRs of 42·85% and 39·10% using the HIK-SVM and Chi-square SVM, respectively. The SIFT-ScSPM and SIFT-LLC visual descriptors achieved CRs of 45·12% and 42·75%, respectively, using the SVM classifier with the linear kernel. It should be noted from Table 1 that the highest classification rate of 50·15% was obtained for the proposed Enhanced SIFT-ScSPM visual descriptor using the SVM classifier with a linear kernel.

In this work, structural recognition methods such as the Uniform LBP, CENTRIST, CENTRIST-spatial pyramid, mCENTRIST and DMGH descriptors are used. The computational costs of the Enhanced SIFT-ScSPM visual descriptor, several state-of-the-art visual descriptors and the structural recognition methods are reported in Table 2.

Table 2. Computational cost of visual descriptors and SVM classification.

From the experimental results, it is inferred that higher CRs are obtained on the MIT-67 indoor scene classification dataset when the Bag of Words (BOW)-based visual descriptor (SIFT-ScSPM) is combined with the Gist-based descriptors (Enhanced Gist and Enhanced mGist) extracted from the indoor image. For real-time indoor scene recognition, a scene recognition model that was trained offline with Enhanced SIFT-ScSPM descriptors and an SVM, at a computational cost of 62,525·32 seconds, was used to classify the indoor test images. Using the offline-trained scene recognition model, the computational cost of real-time indoor scene classification with the Enhanced SIFT-ScSPM descriptor and the SVM was 9·32 seconds. A C++ or Python OpenCV implementation of the indoor scene recognition algorithm could reduce this computational cost further. Therefore, real-time implementation of the indoor scene recognition algorithm using Enhanced SIFT-ScSPM descriptors onboard the MAV is viable. From the computational cost, it is inferred that the proposed Enhanced SIFT-ScSPM visual descriptor is suitable for real-time indoor scene recognition for indoor MAVs flying at a low flight speed of 0·3 to 0·5 m/s.

7. CONCLUSIONS

This paper presented a comparison of several state-of-the-art visual descriptors and the proposed Enhanced SIFT-ScSPM, Enhanced mGist and SURF-SPM visual descriptors, from both methodological and experimental perspectives, for the 67 indoor scene categorisation task. The Enhanced SIFT-ScSPM visual descriptor was proposed by combining the SIFT with Sparse coding based Spatial Pyramid Matching visual descriptor (SIFT-ScSPM) and the Gist-based descriptors (Enhanced Gist and Enhanced mGist) to categorise 67 highly variable, complex indoor scenes for indoor navigation of Micro Aerial Vehicles. Indoor scene classification experiments were carried out using the MIT-67 indoor scene classification dataset. For the Enhanced SIFT-ScSPM descriptors, a linear kernel function was used to partition the data into 67 linearly separable indoor scene image categories in the high-dimensional feature space, and the linear SVM classifier is less prone to overfitting for high-dimensional feature vectors. The reliability and performance of the SVM classifier were tested using ten-fold cross-validation. The proposed Enhanced SIFT-ScSPM descriptor shows better indoor scene classification results (higher classification rate, precision, recall, F-measure and area under the curve) than the state-of-the-art and the proposed Enhanced-mGist and SURF-SPM visual descriptors. The results indicate that the Enhanced SIFT-ScSPM descriptor is suitable for real-time indoor scene recognition and navigation of a MAV in GNSS-denied indoor environments. In the future, different local and global visual descriptors will also be investigated to improve the scene classification results for indoor scenes.

REFERENCES

Anbarasu, B. and Anitha, G. (2017). Vision-based heading and lateral deviation estimation for indoor navigation of a quadrotor. IETE Journal of Research, 63(4), 597–603.
Bay, H., Ess, A., Tuytelaars, T. and Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3), 346–359.
Cakir, F., Güdükbay, U. and Ulusoy, Ö. (2011). Nearest-neighbor based metric functions for indoor scene recognition. Computer Vision and Image Understanding, 115(11), 1483–1492.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, 886–893.
Elfiky, N., Khan, F.S., Van de Weijer, J. and Gonzalez, J. (2012). Discriminative compact pyramids for object and scene recognition. Pattern Recognition, 45(4), 1627–1636.
Haralick, R.M., Sternberg, S.R. and Zhuang, X. (1987). Image analysis using mathematical morphology. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4), 532–550.
Kawewong, A., Pimpup, R. and Hasegawa, O. (2013). Incremental learning framework for indoor scene recognition. Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 496–502.
Khan, S.H., Hayat, M., Bennamoun, M., Togneri, R. and Sohel, A. (2016). A discriminative representation of convolutional features for indoor scene recognition. IEEE Transactions on Image Processing, 25(7), 3372–3383.
Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, 1097–1105.
Lazebnik, S., Schmid, C. and Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, 2169–2178.
Lecun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Meng, X., Wang, Z. and Wu, L. (2012). Building global image features for scene recognition. Pattern Recognition, 45(1), 373–380.
Ojala, T., Pietikäinen, M. and Mäenpää, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Oliva, A. and Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 155, 23–36.
Qin, J. and Yung, N.H.C. (2010). Scene categorization via contextual visual words. Pattern Recognition, 43(5), 1874–1888.
Quattoni, A. and Torralba, A. (2009). Recognizing indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 413–420.
Rahman, M.M., Rahman, S., Rahman, R., Hossain, B.M.M. and Shoyaib, M. (2017). DTCTH: a discriminative local pattern descriptor for image classification. EURASIP Journal on Image and Video Processing, 2017(1), 124.
Soille, P. (2003). Morphological Image Analysis: Principles and Applications. Springer-Verlag.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. and Gong, Y. (2010). Locality-constrained linear coding for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 3360–3367.
Wei, X., Phung, S.L. and Bouzerdoum, A. (2016). Visual descriptors for scene categorization: experimental evaluation. Artificial Intelligence Review, 45(3), 333–368.
Wu, J. and Rehg, J.M. (2011). CENTRIST: a visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1489–1501.
Xiao, Y., Wu, J. and Yuan, J. (2014). mCENTRIST: a multi-channel feature generation mechanism for scene categorization. IEEE Transactions on Image Processing, 23(2), 823–836.
Xiao, J., Ehinger, K.A., Hays, J., Torralba, A. and Oliva, A. (2016). SUN database: exploring a large collection of scene categories. International Journal of Computer Vision, 119(1), 3–22.
Xie, L., Wang, J., Guo, B., Zhang, B. and Tian, Q. (2014). Orientational pyramid matching for recognizing indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 3734–3741.
Yang, J., Yu, K., Gong, Y. and Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 1794–1801.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A. and Oliva, A. (2014). Learning deep features for scene recognition using Places database. Proceedings of Advances in Neural Information Processing Systems, 487–495.