1. Introduction
Teleoperation has become a popular and emerging area of research and development. It arises from the combination of various research fields such as computer vision, communication and control, artificial intelligence, mechatronics, machine learning, and deep learning [Reference Alquisiris-Quecha, Maldonado-Reyes, Morales and Sucar1]. Teleoperation bridges the gap between humans and robots. Controlling robotic operations from a distance is termed teleoperation [Reference Mi, Sun, Wang, Deng, Li, Zhang and Xie2]. The distance varies from billions of kilometres, as in space applications, to centimetres, as in micro-applications or microsurgery. It extends human capability to places where physical presence is impossible, dangerous, or highly threatening to human life, for example disaster scenarios, radioactive environments, chemical industries, deep oceans, and minefields [Reference Yan, Huang and Lindgren3]. Replacing the individual with a remotely controlled robot is the most desirable solution in these scenarios [Reference Alquisiris-Quecha, Maldonado-Reyes, Morales and Sucar1]. Recent studies have reported that humanoid robots have made substantial progress in different applications of teleoperation [Reference Ishiguro, Makabe, Nagamatsu, Kojio, Kojima, Sugai, Kakiuchi, Okada and Inaba4, Reference Ramos and Kim5, Reference Hirschmanner, Tsiourti, Patten and Vincze6, Reference Vongchumyen, Bamrung, Kamintra and Watcharapupong7, Reference Chen, Wang, Hu and Shen8, Reference Yin, Lan, Wen, Zhang and Wu9]. Generally, humanoid robots are designed to perform specific tasks in teleoperation. However, it is quite difficult to control humanoid robots with conventional controllers because of their high degrees of freedom [Reference Sripada, Asokan, Warrier, Kapoor, Gaur, Patel and Sridhar10]. Therefore, a suitable interface is needed for controlling the robots, and it has a significant influence on effective human-robot interaction (HRI) in teleoperation. The most common interfaces used in HRI are keyboards, mice, dials, joysticks, gloves, inertial sensors, and exoskeletal systems [Reference Losey, McDonald, Battaglia and O’Malley11]. These are mechanical contacting devices that require unnatural arm or hand motions to accomplish tasks in teleoperation. As a consequence, new interfaces for HRI have been developed to provide more comfortable ways of interacting with remote robots. Many researchers have integrated sensor-based interfaces such as the Microsoft Kinect [Reference Chen, Wang, Hu and Shen8, Reference Yin, Lan, Wen, Zhang and Wu9, Reference Abdallah and Bouteraa12–Reference Balmik, Jha and Nandy14]. The Kinect sensor provides a contactless, user-friendly, more human-centred interface for teleoperation. Moreover, it encourages non-experts to interact with robots, is easy to use, and has the advantage of low cost [Reference Syakir, Ningrum and Sulistijono15]. The 3D Kinect camera simplifies the human-robot interaction process by capturing different human motions to control the remote robot, and it allows more natural HRI systems to be developed by identifying different motions of the human operator.
The similarities between humanoid robots and humans have increased the cooperation between humans and robots [Reference Yavsan and UÇar16]. Therefore, teleoperation research focuses on human behaviour imitation and recognition to establish natural communication with robots. Many researchers have recently developed imitation-based teleoperated systems that capture human motions and map them to robots [Reference Sripada, Asokan, Warrier, Kapoor, Gaur, Patel and Sridhar10, Reference Arduengo, Arduengo and ColomÉ17, Reference Jha and Chiddarwar18, Reference Khadir19]. The motivation behind these systems is to use human motions for interacting with robots, just as humans use motions to point at objects or while speaking. This has the potential to control robots in complicated situations where programming a robot explicitly is too difficult [Reference Sripada, Asokan, Warrier, Kapoor, Gaur, Patel and Sridhar10]. In the last few decades, human motion/gesture recognition has gained considerable attention from the robotics community. The increasing interest is due to the affordability of depth sensors such as the Kinect, which provides RGB data along with depth information and is therefore called an RGB-D sensor [Reference Rolley-Parnell, Kanoulas, Laurenzi, Delhaisse, Rozo, Caldwell and Tsagarakis20]. Several researchers have presented human motion recognition using Kinect sensors [Reference Maraj, Maraj and Hajzeraj21, Reference Li22, Reference Ababneh, Sha’ban, AlShalabe, Khader, Mahameed and AlQudimat23, Reference Dong, Xia, Yan and Zhao24, Reference Shieh, Wang, Wu and Liang25]. These studies used either depth images [Reference Cho and Jeong26, Reference Zhu, Zhang, Shen and Song27, Reference Ahmad, Illanko, Khan and Androutsos28] or skeleton data [Reference Yan, Huang and Lindgren3, Reference Ajili, Mallem and Didier29, Reference Liu, Shahroudy, Wang, Duan and Kot30] to classify arbitrary motions. Such recognition enables reliable teleoperated systems in which a robot predicts the human behaviour directed towards it and performs the associated tasks.
The contribution of this work consists of three aspects:
-
1. A Kinect-based teleoperation framework for the NAO robot is presented in which the robot imitates full-body human motions.
-
2. A novel adaptive balancing technique for NAO (ABTN) is proposed to balance the whole body.
-
3. A deep 1D-CNN is developed that enables the robot to recognise human motions with high accuracy.
The remainder of the paper is organised as follows: a comprehensive literature review on teleoperation and convolutional neural network-based human motion recognition is presented in Section 2. A detailed description of the proposed framework is given in Section 3. The experimental results and a comparison with state-of-the-art methods are examined in Section 4. Finally, Section 5 concludes the paper and discusses future work.
2. Related work
Previous studies related to gesture recognition and the imitation of human motions for controlling robots are presented in this section. Vongchumyen et al. [Reference Vongchumyen, Bamrung, Kamintra and Watcharapupong7] presented a teleoperation system for a humanoid robot that uses the depth sensor of the Kinect to capture human motions. The joint positions are treated as vectors to compute the joint angles, and the servo motors of the robot are controlled through a RESTful API. The developed system can control only the upper body of the robot. Zhang et al. [Reference Zhang, Cheng, Gan, Zhu, Shen and Song31] proposed an algorithm for the NAO robot to mimic whole-body human motions. It uses eight kinematic chains to map human motions to the robot for imitating the actions, and stable motions are achieved by constraining the CoM. Chen et al. [Reference Chen, Wang, Hu and Shen8] performed the imitation of whole-body human motions on NAO. They introduced new methods for computing joint angles to control the robot. After analysing the differences in the joint angles, a gain factor is defined for mapping the joint angles during the control process; the changing rate of the joint angles is also slowed down to stabilise the robot motions. Alquisiris et al. [Reference Alquisiris-Quecha, Maldonado-Reyes, Morales and Sucar1] developed a system to control the NAO robot through gestural interpretation. The Kinect sensor is used to capture the user’s movements, the joint angles are obtained using linear regression methods, and whole-body imitation of human motions is achieved. However, in this system, if the motions are not smooth enough, the robot falls abruptly. Balance control of the lower body is necessary for a robot performing various tasks in teleoperation, and maintaining whole-body balance is quite critical. Nevertheless, previous studies introduced few methods for balance control: either the changing rate of the joint angles is slowed down or only smooth motions are used. Here, we propose a novel ABTN method to solve this problem. This method determines the dynamic centre of mass (CoM) and computes the ankle joint angles from the knee and hip joints to prevent the robot from falling while performing any whole-body motion.
Several gesture recognition approaches, such as the hidden Markov model (HMM) [Reference Ajili, Mallem and Didier29], support vector machine (SVM) [Reference Zhi, de Oliveira, da Fonseca and Petriu32], extreme learning machines (ELM) [Reference Yavsan and UÇar16], artificial neural network (ANN) [Reference Cicirelli, Attolico, Guaragnella and D’Orazio33], convolutional neural network (CNN) [Reference Hu and Zheng34], and dynamic time warping [Reference Zhi, de Oliveira, da Fonseca and Petriu32], have been applied to recognise postures in teleoperation. In the last couple of years, the literature has shown that the one-dimensional CNN is a highly effective approach for human motion recognition [Reference Jang, Kim, Park and Jang35, Reference Li, Zhong, Xie and Pu36, Reference Hu and Xu37, Reference Liu, Shahroudy, Wang, Duan and Kot30]. Hu and Xu [Reference Hu and Xu37] proposed a CNN for skeleton-based human action recognition. It uses two convolutional layers for learning spatio-temporal features, and data augmentation and segment pooling are implemented for recognising long sequences. Li et al. [Reference Li, Zhong, Xie and Pu36] employed a CNN framework for human action detection and classification. The skeleton coordinates are fed directly to a 7-layer CNN model, which achieves 89.3% accuracy on the NTU RGB+D dataset. Kiranyaz et al. [Reference Kiranyaz, Avci, Abdeljaber, Ince, Gabbouj and Inman38] reported that a 1D-CNN has low computational complexity and is easier to train than 2D or 3D CNNs; its simple and compact architecture makes real-time, low-cost hardware implementation feasible. Despite these advantages, there are very few studies on 1D-CNNs focusing on skeletal motion recognition, which leaves this field largely unexplored. Therefore, a one-dimensional convolutional neural network (1D-CNN) is proposed to identify human motions using skeletal data, and the model performance is compared with state-of-the-art methods, achieving remarkable results.
3. Proposed framework
In the proposed framework, a motion capture device, the Kinect v2, is used to capture human motions. The 3D skeleton positions of the motions are obtained from this device. The captured motion information has some breakpoints in the time series; therefore, the human skeleton positions are interpolated using cubic spline interpolation, which samples the data uniformly and provides smoother joint motion. The interpolated human joint positions are used to extract the joint angles as features for this experimental study. An analytical method presented by Zhang et al. [Reference Zhang, Niu, Yan and Lin39] is used for mapping the human motions to the humanoid robot NAO. The extracted joint angles are filtered using a 5th-order Savitzky-Golay filter with a window size of 13, which smooths the joint motions by removing noise present in the data. The filtered joint angles are given to the CNN module, which identifies the human motions to be performed by the robot. The framework employs a whole-body control mechanism so that the humanoid robot does not fall while imitating the recognised motion. An overview of the proposed framework is presented in Fig. 1. The proposed 1D-CNN architecture and the whole-body control mechanism for the humanoid robot are described in detail in the subsections below.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig1.png?pub-status=live)
Figure 1. Overview of the proposed framework.
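As an illustration of the preprocessing step described above, the following Python sketch resamples the Kinect skeleton stream with cubic splines and smooths the extracted joint angles with a 5th-order Savitzky-Golay filter of window size 13, as stated in the text. The array shapes, the 30 fps resampling rate, and the function names are illustrative assumptions, not code from the authors' implementation.

```python
# Sketch of the preprocessing pipeline: cubic-spline resampling of the Kinect
# skeleton stream followed by Savitzky-Golay smoothing of the joint angles.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

def resample_skeleton(timestamps, joint_positions, fps=30.0):
    """Resample 3D joint positions onto a uniform time grid using cubic splines.

    timestamps      : (T,) array of capture times in seconds (possibly non-uniform)
    joint_positions : (T, J, 3) array of 3D joint positions from the Kinect
    """
    uniform_t = np.arange(timestamps[0], timestamps[-1], 1.0 / fps)
    flat = joint_positions.reshape(len(timestamps), -1)          # (T, J*3)
    spline = CubicSpline(timestamps, flat, axis=0)
    resampled = spline(uniform_t).reshape(len(uniform_t), *joint_positions.shape[1:])
    return uniform_t, resampled

def smooth_joint_angles(joint_angles, window=13, polyorder=5):
    """Savitzky-Golay smoothing of the extracted joint angles, one column per joint."""
    return savgol_filter(joint_angles, window_length=window, polyorder=polyorder, axis=0)
```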
3.1 Centre of Mass (CoM)
The estimation of the CoM is vitally important for whole-body balance, since a slight change in the CoM may lead to catastrophic effects. For calculating the CoM of the humanoid robot NAO, we use forward kinematics to traverse the kinematic chains and transform the local CoM of each link into the torso frame. The weighted average of all the link CoMs then gives the CoM of the whole body.
Affine Transformations: An affine transformation is a linear mapping method that transforms vectors from one coordinate space to another while preserving dimensional ratios. An affine transformation matrix is an $(n+1)\times(n+1)$ matrix, where n is the number of dimensions [40]. Here, we use translation and rotation affine transformations.
Translation Matrix: The translation matrix moves vectors by a fixed distance in a specific direction. Here dx, dy, and dz define the distance of translation along the x, y, and z-axis, respectively.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU1.png?pub-status=live)
Rotation Matrix: The rotation matrix rotates vectors by a fixed angle about a specific direction. Here, $R_x$, $R_y$, and $R_z$ are rotation matrices that rotate vectors about the x, y, and z-axis, respectively, by an angle $\theta$.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU2.png?pub-status=live)
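For illustration, the homogeneous translation and rotation matrices above can be written compactly in Python; this is a generic sketch of the standard $4\times4$ forms used throughout this section, not code taken from the authors' implementation ($R_y$ is analogous to the two rotations shown).

```python
import numpy as np

def translation(dx, dy, dz):
    """Homogeneous translation by (dx, dy, dz)."""
    T = np.eye(4)
    T[:3, 3] = [dx, dy, dz]
    return T

def rotation_x(theta):
    """Homogeneous rotation about the x-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0,  0, 1]])

def rotation_z(theta):
    """Homogeneous rotation about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]])
```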
Denavit and Hartenberg Parameters: The Denavit and Hartenberg (DH) transformation is used to define the motion of actuators connected by the rigid links of a kinematic chain. First, the z-axis is defined along the axis of rotation/translation of the parent joint. The DH parameters are then derived from the common normal between consecutive z-axes.
d – offset along the previous joint’s z-axis
$\theta$ – angle about the previous z-axis to align its x-axis with the new origin
a – distance along the rotated x-axis
$\alpha$ – angle about the new x-axis to bring the z-axis into its desired orientation
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU3.png?pub-status=live)
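As a sketch, the DH transform for one joint can be assembled from the four parameters above using the classic convention (rotation and offset about the previous z-axis, followed by a displacement and twist about the new x-axis); the exact form should be read from the equation above, and this function is illustrative.

```python
import numpy as np

def dh_transform(a, alpha, d, theta):
    """Classic Denavit-Hartenberg transform T(a, alpha, d, theta).

    d     : offset along the previous z-axis
    theta : rotation about the previous z-axis
    a     : length along the rotated x-axis
    alpha : twist about the new x-axis
    """
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0,        sa,       ca,      d],
                     [0,         0,        0,      1]])
```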
Whole-body CoM: The whole-body CoM is the weighted average of the individual link CoMs. The CoM contribution of an individual kinematic chain C(I, N) is calculated as the product of the affine matrices of its joints and the translation matrices of the link CoMs, weighted by the masses of the individual links.
TotalCoM = 1/M * [TorsoCoM + HeadCoM + LArmCoM + RArmCoM + LLegCoM + RLegCoM]
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn1.png?pub-status=live)
where M is the total mass of the robot = 5.305350006 kg, $m_i$ is the mass of the $i$th link, $J_i$ is the affine transform matrix of the $i$th joint with respect to the torso frame (dimensions are in mm, thus divided by 1000), A(x, y, z) is the translation matrix for the CoM position with respect to the preceding joint, and I is the initial row of Tables I and II.
Table I. Translation and DH translation parameters for joints of NAO robot.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab1.png?pub-status=live)
Table II. Masses and CoM parameters of NAO robot [42].
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab2.png?pub-status=live)
N is the final row of Tables I and II.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn2.png?pub-status=live)
where A(dx, dy, dz) is the translation matrix of the joint offsets for the specific kinematic chain and $T(a, \alpha, d, \theta)$ is the DH transformation matrix for the joints.
The steps to calculate the whole-body CoM can be summarised as given below:
-
1. To calculate the CoM of an individual kinematic chain, we consider the corresponding initial to final rows of Table I (translation and DH parameters for the different joints) and Table II (masses and CoM parameters).
-
2. We then take the product of the affine matrix from the forward kinematics in equation (2) and the translation matrix of the CoM of the individual links from Table II, as in equation (1).
-
3. Next, the forward kinematics in equation (2) of the respective kinematic chain is the product of the translation matrix of the joint offsets [41] and the DH matrix of the joint angles taken from Table I.
For example, to calculate the torso CoM we use C(1, 1) in equation (1) by considering row 1 from both Tables I and II simultaneously, where we multiply the mass and CoM of the torso from Table II and determine the kinematic chain of the torso using the DH parameters from Table I. Similarly, we calculate HeadCoM = C(2, 3) by considering rows 2 to 3 from both Tables I and II, and likewise RightArmCoM = C(4, 8), LeftArmCoM = C(9, 13), RightLegCoM = C(14, 19), and LeftLegCoM = C(20, 25).
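The chain-wise accumulation in the steps above can be sketched as follows, reusing the `translation` and `dh_transform` helpers from the earlier sketches. The row data structure (joint offsets, DH rows, link masses, and local CoM offsets) is a placeholder standing in for Tables I and II, and the function names and units handling are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def chain_com(rows, base_T=np.eye(4)):
    """Accumulate mass-weighted CoM contributions C(I, N) for one kinematic chain.

    rows: list of dicts, one per row of Tables I/II, each holding
          'offset' = (dx, dy, dz), 'dh' = (a, alpha, d, theta),
          'mass' in kg and 'com' = local CoM offset (assumed in mm, like Table II).
    """
    T = base_T.copy()
    weighted = np.zeros(3)
    for row in rows:
        T = T @ translation(*row['offset']) @ dh_transform(*row['dh'])   # forward kinematics, eq. (2)
        com_world = T @ translation(*row['com'])                          # J_i * A(x, y, z), eq. (1)
        weighted += row['mass'] * com_world[:3, 3] / 1000.0               # mm -> m, as noted in the text
    return weighted

def whole_body_com(chains, total_mass=5.305350006):
    """Whole-body CoM: weighted average over all chains (torso, head, arms, legs)."""
    weighted = np.zeros(3)
    for rows in chains:
        weighted += chain_com(rows)
    return weighted / total_mass
```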
3.2 Whole-body control
This module enables whole-body control using the proposed adaptive balancing technique for NAO (ABTN). The technique involves a novel mechanism for pitch and roll control inspired by the work presented in ref. [Reference Chen, Wang, Hu and Shen8]. Here, the ankle joint angles are computed from the knee pitch and hip pitch to make the robot stand upright. The aim of this approach is to keep the dynamic CoM calculated above inside the base area of the foot while performing whole-body motion. It is also robust for complex motions such as a one-leg stand, where the total base area reduces to only one foot: the method detects when the robot is about to perform a one-leg stand and controls the ankle roll to balance the robot. The approach is tested using the Webots simulation software, the ROS environment, and the NAOqi package. The computation of the joint angles for pitch and roll control to maintain the balance of the NAO lower body is derived below.
3.2.1 Pitch control
The pitch control is essential for single leg stand, crouching, and squatting postures. The proposed algorithm uses adaptive pitch control for ankle, knee, and hip joints to keep the CoM of the robot inside the foot base area.
Ankle Pitch: For the calculation of the ankle pitch, we use the filtered NAO joint angles and the distances between the joints obtained from TF (transforms) or the NAO documentation. Figure 2 shows a side view of the right leg joints of the NAO robot in a squat position, where A, K, H, T are the ankle, knee, hip, and torso joints, and $\alpha_p, k_p, h_p, t_p$ are the respective joint angles. The vectors $\overrightarrow{AK}, \overrightarrow{KH}, \overrightarrow{HT}, \overrightarrow{TCoM}$ represent the joint positions with base_footprint, the point on the ground between the two legs, as the origin. From the geometry of the robot in Fig. 2, the vector $\overrightarrow{ACoM}$ connecting the ankle joint and the CoM is written as
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn3.png?pub-status=live)
The CoM should be inside the foot base area, for which the real part of $\overrightarrow{ACoM}$ should be in the range $[-3, 9]$ cm. To ease the calculation,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn4.png?pub-status=live)
Let $L_1$ be the distance between the ankle (A) and knee (K), $L_2$ the distance between the knee (K) and hip (H), $L_3$ the distance between the hip (H) and torso (T), and $L_4$ the distance between the torso (T) and the CoM. Then, from Fig. 2,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn5.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn6.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn7.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig2.png?pub-status=live)
Figure 2. A side view representation of right leg joints of NAO robot in a squat position for pitch control.
After dividing equation (7) by $\sin(\alpha_p)$, we get
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn8.png?pub-status=live)
From Table III, it can be seen that the ankle pitch lies in the range ($-67.97^{\circ}$, $53.4^{\circ}$), so we constrain the calculated ankle pitch within these joint limits. After assigning the joint limit values, if Real($Z_0$) = Real($\overrightarrow{ACoM}$) is within [$-3$ cm, 9 cm], that is, the CoM is inside the foot base, then no further changes are required and the angles are published to ROS.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU4.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU5.png?pub-status=live)
Table III. NAO leg joints limits for pitch control.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab3.png?pub-status=live)
However, if Real($Z_0$) exceeds its limits, then we adjust the knee pitch to balance the robot.
Knee Pitch: There are various motions that a human can perform with ease but that the robot can fail to perform with just the mapping of joint angles and control of the ankle pitch. To perform such actions, we need to adjust the incoming knee pitch slightly to keep the robot balanced. From Fig. 2 and equation (7), the knee pitch is derived as given below:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn9.png?pub-status=live)
Let,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn10.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn11.png?pub-status=live)
Hip pitch: The above formula is used to find the knee pitch when the ankle pitch alone cannot satisfy the balancing criterion. However, there may be situations in which the ankle pitch and knee pitch together cannot balance the robot. In that case, we need to adjust the hip pitch slightly. Then, from Fig. 2,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn12.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn13.png?pub-status=live)
The overall procedure of pitch control to balance the whole body is presented in Algorithm 1. It aims to compute $\alpha_p, k_p, h_p$ such that $Real(Z_0)$ always lies in the range $[-3, 9]$ cm. The $constrain(j\_angle, Min, Max)$ function constrains the given joint angle $j\_angle$ within the NAO joint limits, where Min is the lower limit and Max is the upper limit of a particular joint. For both legs, the hip pitch ($h_p$) and knee pitch ($k_p$) from the motion file are used to calculate the ankle pitch ($\alpha_p$) through the function $calcAnklePitch(h_p, k_p, t_p)$. The computed ankle pitch is checked against its joint limits through the $constrain(j\_angle, a_{Min}, a_{Max})$ function: if the ankle pitch is less than $a_{Min}$ (the lower limit of ankle pitch given in Table III), it is assigned $\alpha_p = a_{Min} = -67.97^{\circ}$ (for the right leg, as an example), and if it is greater than $a_{Max}$ (the upper limit of ankle pitch), it is assigned $\alpha_p = a_{Max} = 53.4^{\circ}$. Then, if $Real(Z_0) = RealZ(\alpha_p, h_p, k_p, t_p)$ is in the range $[-3, 9]$ cm, where $RealZ_{Min} = -3$ and $RealZ_{Max} = 9$, the angles are published to ROS; otherwise, the knee pitch is calculated from the $calcKneePitch(\alpha_p, h_p, t_p)$ function. The calculated knee pitch is constrained within its joint limits given in Table III by the function $constrain(k_p, k_{Min}, k_{Max})$, and if $Real(Z_0)$ is then in the range $[-3, 9]$ cm, the angles are published to ROS. Otherwise, the hip pitch is calculated from the function $calcHipPitch(\alpha_p, k_p, t_p)$; it is likewise constrained within its joint limits, and $Real(Z_0)$ is again checked against the $[-3, 9]$ cm range. If it is in the range, the joint angles are published; otherwise, the algorithm exits.
Algorithm 1. Proposed Pitch control algorithm
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab4.png?pub-status=live)
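The cascade in Algorithm 1 can be restated as the following compact Python sketch. Here `calc_ankle_pitch`, `calc_knee_pitch`, `calc_hip_pitch`, and `real_z` stand for equations (8), (11), (13), and (4) respectively (not reproduced here), and the joint limits are meant to be read from Table III; this is an illustrative restatement of the control flow, not the authors' implementation.

```python
def constrain(angle, low, high):
    """Clamp a joint angle to its NAO joint limits."""
    return max(low, min(high, angle))

def pitch_control(h_p, k_p, t_p, limits, z_min=-3.0, z_max=9.0):
    """Cascade ankle -> knee -> hip pitch until Real(Z0) falls inside [z_min, z_max] cm.

    limits: dict of (min, max) tuples for 'ankle', 'knee', 'hip' from Table III.
    Returns the balanced (ankle, knee, hip) pitches, or None if no solution is found.
    """
    a_p = constrain(calc_ankle_pitch(h_p, k_p, t_p), *limits['ankle'])   # eq. (8)
    if z_min <= real_z(a_p, h_p, k_p, t_p) <= z_max:                      # eq. (4)
        return a_p, k_p, h_p                                              # publish to ROS

    k_p = constrain(calc_knee_pitch(a_p, h_p, t_p), *limits['knee'])      # eq. (11)
    if z_min <= real_z(a_p, h_p, k_p, t_p) <= z_max:
        return a_p, k_p, h_p

    h_p = constrain(calc_hip_pitch(a_p, k_p, t_p), *limits['hip'])        # eq. (13)
    if z_min <= real_z(a_p, h_p, k_p, t_p) <= z_max:
        return a_p, k_p, h_p
    return None                                                           # exit: CoM cannot be recovered
```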
3.2.2 Roll control
Similar to pitch control, the robot also needs roll control to balance itself in a one-leg stand posture. The algorithm predicts the position of CoM and also ensures that it remains inside the single foot base area. It controls the ankle roll and hip roll to maintain the whole-body balance of the NAO robot.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig3.png?pub-status=live)
Figure 3. A front view representation of left leg joints of NAO robot in one-leg stand position.
Ankle roll: Motions involving a single-leg stand are more difficult and challenging to balance than motions with a double-leg stand. The ankle roll is a function of $L_{HA}$, the length from hip to ankle, and it depends on the ankle pitch and knee pitch, because the distance between the hip and the ankle changes when the knee bends. Fig. 3 shows a front view of the robot balancing on its left foot. All the lengths between the joints are taken from TF (transforms). Thus, from Fig. 3, the ankle roll ($\alpha_r$) is written as
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn14.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn15.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn16.png?pub-status=live)
A side view representation of the right leg joints of the NAO robot in a one-leg stand position is given in Fig. 4. The imaginary part of the $Z_1$ vector is considered to get the vertical projection of the leg. The ankle pitch ($\alpha_p$) and knee pitch ($k_p$) are calculated from the pitch control algorithm above. Then, from Fig. 4,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqnU6.png?pub-status=live)
Substituting the value of $L_{HA}$ into equation (16),
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn17.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig4.png?pub-status=live)
Figure 4. A side view representation of left leg joints of NAO robot in one-leg stand position.
If Real($\overrightarrow{ACoM}$) $\neq$ 0, then the hip roll is calculated using the formula given in equation (18),
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_eqn18.png?pub-status=live)
The plot of the joint angles of a motion in which the left leg is on the ground and the right leg moves sideways is shown in Fig. 5. From Table IV, we can see that the right and left hip roll ($h_r$) have dissimilar ranges, so one of them is inverted by multiplying by $-1$ to map them into the same range and ease the calculations. Thereafter, the difference between the right and left hip roll is taken to predict which leg should balance the robot and which leg should perform the action. While performing the motion, if the difference exceeds a threshold angle of $2^{\circ}$ (determined experimentally), the left hip roll is overridden to balance the robot. In Fig. 6, the red dotted line represents the threshold region. If the difference between LHipRoll and RHipRoll (red line) is greater than the positive threshold value, the right leg balances and the left leg performs the given motion; when the difference is lower than the negative threshold value, the left leg balances and the right leg performs the motion. The LHipRoll line in Fig. 6 shows that the joint angle values are overridden by the proposed algorithm to balance the robot while performing a motion in which the left leg is on the ground and the right leg moves sideways.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig5.png?pub-status=live)
Figure 5. A plot of raw joint angles when the NAO’s left leg is on the ground and right leg is moving to the side.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig6.png?pub-status=live)
Figure 6. A plot of processed joint angles after implementing the proposed roll control algorithm to balance the whole body of NAO when the left leg is on the ground and right leg is moving to the side.
Table IV. NAO leg joints limits for roll control.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab5.png?pub-status=live)
Algorithm 2 explains the proposed roll control mechanism for balancing the whole body of the NAO robot. The $constrain(j\_angle, Min, Max)$ function constrains the given joint angle $j\_angle$ within the NAO joint limits, where Min is the lower limit and Max is the upper limit of a particular joint. Here, LHR and RHR are the left and right hip roll taken from the motion file and used to calculate balanceLeg, that is, the leg on which the robot should balance. If balanceLeg is lower than the negative threshValue, the threshold determined experimentally as $2^{\circ}$, then the left leg balances and the right leg performs the given motion; when balanceLeg is greater than the positive threshValue, the right leg balances and the left leg performs the given motion. The ankle roll ($\alpha_r$) is calculated by taking the ankle pitch ($\alpha_p$), knee pitch ($k_p$), and hip roll ($h_r$) from the motion file. The function $constrain(\alpha_r, a_{Min}, a_{Max})$ constrains the ankle roll within its lower limit $a_{Min}$ and upper limit $a_{Max}$ given in Table IV, according to whether the left or right leg is the balanceLeg. Then, if $Real(Z_0) = RealZ(\alpha_r, h_r, t_r)$ is in the range $[-3, 9]$ cm, where $RealZ_{Min} = -3$ and $RealZ_{Max} = 9$, the angles are published to ROS; otherwise, the hip roll is calculated. The $constrain(h_r, h_{Min}, h_{Max})$ function assigns the hip roll between the lower joint limit $h_{Min}$ and upper joint limit $h_{Max}$ of the hip given in Table IV. Finally, if $Real(Z_0)$ is in the range $[-3, 9]$ cm, the hip roll and ankle roll are published to ROS for whole-body balance; otherwise, the algorithm exits.
Algorithm 2. Proposed Roll control algorithm
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab6.png?pub-status=live)
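The leg-selection and roll cascade of Algorithm 2 can be sketched in the same style, reusing `constrain` from the pitch sketch. Here `calc_ankle_roll`, `calc_hip_roll`, and `real_z` stand for equations (17), (18), and the corresponding CoM projection; the sign convention of the hip roll difference and the argument lists are illustrative assumptions, while the 2° threshold and the Table IV limits come from the text.

```python
def roll_control(lhr, rhr, a_p, k_p, t_r, limits, thresh=2.0, z_min=-3.0, z_max=9.0):
    """Pick the balancing leg and adjust ankle/hip roll so Real(Z0) stays in [z_min, z_max] cm.

    lhr, rhr : left and right hip roll from the motion file (degrees)
    limits   : dict of (min, max) tuples for 'ankle_roll' and 'hip_roll' from Table IV
    """
    lhr = -lhr                            # LHipRoll inverted so both rolls share the same range
    diff = lhr - rhr                      # sign convention is illustrative
    if diff > thresh:
        balance_leg, h_r = 'right', rhr   # right leg balances, left leg performs the motion
    elif diff < -thresh:
        balance_leg, h_r = 'left', lhr    # left leg balances, right leg performs the motion
    else:
        return None                       # double support: no roll override needed

    a_r = constrain(calc_ankle_roll(a_p, k_p, h_r), *limits['ankle_roll'])  # eq. (17)
    if z_min <= real_z(a_r, h_r, t_r) <= z_max:
        return balance_leg, a_r, h_r      # publish to ROS

    h_r = constrain(calc_hip_roll(a_r), *limits['hip_roll'])                # eq. (18); arguments illustrative
    if z_min <= real_z(a_r, h_r, t_r) <= z_max:
        return balance_leg, a_r, h_r
    return None                           # exit
```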
3.3 Human motion recognition using CNN
Recognition of human motions by the robot has become essential for reliable teleoperation; the prerequisite is the ability to predict the intentions of the human operator in order to perform a meaningful task. One of the major challenges of human motion recognition is human motion variability: the same motion is performed differently by the same person over time, and different people perform the same motion in different ways. The robot should nevertheless efficiently recognise the human motions and perform the tasks associated with them. To handle this issue, a novel deep 1D-CNN is presented in this work. The architecture of the proposed 1D-CNN consists of a 7-layer deep network structure, represented in Fig. 7. It comprises four convolutional layers, one max-pooling layer, one global average pooling layer, and one fully connected layer. The proposed 1D-CNN is designed for the automated classification of whole-body human motions.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig7.png?pub-status=live)
Figure 7. Block representation of the proposed deep 1D-CNN architecture.
Zero Padding: The foremost challenge when dealing with a 1D CNN is that the lengths of the input samples must be the same. However, in human motion recognition, the motions captured by the Kinect may have different durations; hence, the lengths of the input motions differ. It is therefore necessary to adopt a preprocessing technique that converts input samples of different lengths into a fixed length, and zero padding is introduced to satisfy the condition imposed by the input layer of the CNN. Zeros can be padded at the start or end of the input vector, or at both the start and the end, keeping the data values in the middle of the vector. Here, the input samples are the mapped joint angles, which are fed to the proposed 1D-CNN. The input sample with the maximum length (max_len) is determined, and the lengths of the remaining input samples are increased to max_len by padding zeros so that each sample has an equal length.
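A minimal sketch of the zero-padding step is given below, assuming each motion sample is a (frames × joint-angles) array; padding is applied at the end of the sequence here, although, as noted above, it could equally be applied at the start or split around the data.

```python
import numpy as np

def zero_pad(samples):
    """Pad variable-length motion samples with zeros so they all reach max_len frames.

    samples: list of (T_i, n_angles) arrays of mapped joint angles.
    Returns a single (N, max_len, n_angles) array suitable for the 1D-CNN input layer.
    """
    max_len = max(s.shape[0] for s in samples)
    n_angles = samples[0].shape[1]
    padded = np.zeros((len(samples), max_len, n_angles), dtype=np.float32)
    for i, s in enumerate(samples):
        padded[i, :s.shape[0], :] = s      # trailing frames remain zero
    return padded
```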
Proposed deep 1D-CNN Architecture: The first layer is a convolution layer that extracts features from the fixed-length input signals of shape 120 $\times$ 14. The convolution operation is performed using 100 filters with a kernel size of 3 on each input vector with stride = 1. The convolved joint angle signals of n frames are used to generate feature maps containing the most prominent features. In the second layer, the convolution operation is repeated with the same 100 $\times$ 3 configuration on the previously obtained feature maps with stride = 1. In the third layer, a max-pooling layer generates new feature maps by taking the maximum values in the specified regions of the feature maps obtained from the previous convolution layer; it reduces the size of the feature maps according to the pooling size, and here the feature maps are reduced to one third using 3-unit regions. In the fourth and fifth layers, the convolution operation is repeated with 256 $\times$ 3 configurations to learn higher-level features. The Rectified Linear Unit (ReLU) is used as the activation function in all the convolution layers to impart non-linearity to the network. The output of the fifth layer is fed to the global average pooling layer, which generates new feature maps by taking the average values in the specified regions of the feature maps obtained from the previous layers; it also helps to avoid overfitting by reducing the number of parameters in the model. The features obtained from the global average pooling layer are fed to the fully connected layer, which connects each neuron of the layer to each neuron of the next layer. A dropout rate of 0.8 is applied in this layer; it drops 20% of the hidden-layer neurons, thus reducing the dependency between neurons. This is equivalent to training different networks with the reduced neurons and helps to learn more robust and diverse features; it is applied to avoid overfitting and to improve generalisation during the learning phase. The output of the fully connected layer is fed to the softmax function, which performs the classification of human motions by mapping the neuron outputs into the [0, 1] interval as a probability for each motion; the motion with the highest probability is the predicted human motion. The detailed parameters of each layer of the deep 7-layer 1D-CNN are given in Table V. The conventional backpropagation algorithm with a batch size of 121 is used to train the CNN model for 1000 epochs. Adaptive moment estimation (Adam) is the optimisation algorithm adopted to update the parameters of the proposed 1D-CNN with a learning rate of 0.0001; it enhances the efficiency of the training process and converges at a faster rate. The categorical cross-entropy loss function is used, given that the learning task is a multi-class classification problem. All the chosen parameters of the proposed 1D-CNN are determined by a brute-force search after observing the performance on the validation sets.
Table V. The detailed parameters of the proposed deep 1D-CNN model.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab7.png?pub-status=live)
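A sketch of the described 7-layer network in Keras is shown below. The layer widths, kernel size, pooling size, dropout rate, optimiser, learning rate, loss, and training schedule follow the description above; any parameter not stated in the text should be taken from Table V, so this is an approximation of the architecture rather than the authors' exact model.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1dcnn(n_frames=120, n_angles=14, n_classes=10):
    """Approximate reconstruction of the proposed 7-layer 1D-CNN."""
    inputs = keras.Input(shape=(n_frames, n_angles))
    x = layers.Conv1D(100, 3, strides=1, activation='relu')(inputs)   # layer 1
    x = layers.Conv1D(100, 3, strides=1, activation='relu')(x)        # layer 2
    x = layers.MaxPooling1D(pool_size=3)(x)                           # layer 3
    x = layers.Conv1D(256, 3, strides=1, activation='relu')(x)        # layer 4
    x = layers.Conv1D(256, 3, strides=1, activation='relu')(x)        # layer 5
    x = layers.GlobalAveragePooling1D()(x)                            # layer 6
    x = layers.Dropout(0.8)(x)                                        # dropout rate as stated in the text
    outputs = layers.Dense(n_classes, activation='softmax')(x)        # layer 7 (fully connected)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training as described in the text: 1000 epochs with a batch size of 121, e.g.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=1000, batch_size=121)
```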
4. Experimental results
In this section, the experimental results of the proposed framework are presented. The developed deep 1D-CNN model automatically classifies ten different whole-body motions. The dataset is split into three parts: training, validation, and test sets. In this experimental study, 60% of the data is used for the training set, 20% as the validation set, and the remaining 20% as the test set; the distributions are selected randomly. The training set is used to train the CNN model, and the validation set is used to tune the CNN parameters. The trained model is then applied to the test set, whose data are never seen by the model in the learning phase. The experiment is a supervised learning process. The training and validation performance curves of the proposed 1D-CNN model over 1000 epochs are shown in Fig. 8(a) and (b). It can be seen from the graphs that there is no overfitting problem in the proposed model. The training accuracy reaches 98.30%, and the validation accuracy reaches 95%. The training loss in the learning phase is 0.0170, and the validation loss of the model is 0.760. The performance of the trained model on the test data is analysed with standard evaluation criteria: precision, sensitivity, specificity, and F1 score. The overall accuracy of human motion recognition is also calculated for the developed 1D CNN, an NN, and an HMM. A graphical comparison of the evaluation metrics of the proposed CNN model and state-of-the-art methods is presented in Fig. 9. Ajili et al. [Reference Ajili, Mallem and Didier29] used an HMM to develop a gesture recognition system for NAO robot teleoperation. The joint angle features were extracted by applying Laban movement analysis and fed into the HMM. A grid search from 5 to 100 was performed to find the optimum numbers of symbols and states, and the best recognition results were obtained with states = 5 and observation symbols = 12. All the predicted gestures are converted into commands and sent to the NAO robot. This HMM-based gesture recognition system, implemented on our dataset, attains an accuracy of 89.7%. Jawed et al. [Reference Jawed, Mazhar, Altaf, Rehman, Shams and Asghar43] developed an NN-based system to classify whole-body postures of patients for rehabilitation. The human skeleton was tracked by placing the Kinect at a distance of 0.6 m and a height of 1.4 m, and the joint angles extracted from the skeleton joint points were fed to the classifier. The NN-based recognition system, applied to our dataset, achieves an accuracy of 92.3%. The proposed 7-layer CNN model provides 95% recognition performance, which is higher than the overall accuracy attained by the NN and the HMM.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig8.png?pub-status=live)
Figure 8. Training and validation performances of the proposed deep 1D-CNN model with number of epochs (a) Model accuracy (b) Model loss.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig9.png?pub-status=live)
Figure 9. Detailed classification performance of the proposed model and other state-of-the-art methods.
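For reference, the 60/20/20 split and the reported evaluation criteria could be computed as in the following sketch, assuming the padded samples `x`, one-hot labels `y`, and the trained `model` are available; scikit-learn is used here purely for illustration and is not part of the authors' pipeline.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 60% train, 20% validation, 20% test, selected randomly as described above.
x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.4, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=0)

y_pred = model.predict(x_test).argmax(axis=1)
y_true = y_test.argmax(axis=1)
print('accuracy   :', accuracy_score(y_true, y_pred))
print('precision  :', precision_score(y_true, y_pred, average='macro'))
print('sensitivity:', recall_score(y_true, y_pred, average='macro'))   # recall = sensitivity
print('F1 score   :', f1_score(y_true, y_pred, average='macro'))
# Specificity is obtained per class from the confusion matrix as TN / (TN + FP).
```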
The judicious selection of model parameters plays a key role in achieving high recognition performance; inappropriate parameters may cause overfitting or underfitting. To analyse the performance of the 1D-CNN, different sizes of convolutional kernels are considered in our experimental study. The effect of changing the kernel size on the proposed model performance is presented in Fig. 10(a). The model performance is tested for kernel sizes ranging from 3 to 10, and the highest performance is achieved at kernel size = 3. The graph illustrates that the selection of the right kernel size is very important, as it affects the performance of the model drastically. The effect of changing the pool size on the model performance is shown in Fig. 10(b); it does not contribute much to increasing the model performance, and the graph shows that continually increasing the pool size reduces the recognition performance in our case. Based on multiple runs, a pooling size of 3 is found to be best for recognition. The performance of the model obtained with different learning rates is presented in Fig. 10(c). The learning rate affects the recognition performance of the model more than the pooling size; in our case, small changes in the learning rate sometimes cause overfitting of the model. The best result is obtained with a learning rate of 0.0001. The selection of the optimiser has a significant effect on the recognition performance of the classification system, as shown in Table VI. Stochastic Gradient Descent (SGD) with momentum and the RMSProp optimiser have almost the same recognition performance on our dataset. The Nadam optimiser provided good recognition performance, but not as good as AdaMax. The highest recognition is achieved by the Adam optimiser. The proposed 1D-CNN with the selected model parameters is less complex and simpler to use; moreover, it performs better recognition than the state-of-the-art methods. The motion predicted by the proposed model is published to a ROS node. The NAO robot subscribes to the defined node and imitates the same motion. For example, a human performs a motion in which both hands are raised to the side and the left leg is also moved to the side, as shown in Fig. 11; the NAO robot imitates the same human motion efficiently with balance control.
Table VI. Performances of the proposed deep 1D-CNN classification system with different optimisers.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_tab8.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig10.png?pub-status=live)
Figure 10. Training and testing performances of the proposed deep 1D-CNN model with (a) increasing kernel size (b) increasing pooling size and (c) increasing learning rate.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220729180536646-0636:S0263574722000169:S0263574722000169_fig11.png?pub-status=live)
Figure 11. Imitation of human motion by NAO robot.
If the motion information provided by the Kinect is applied directly to the robot, the robot may not perform the task accurately due to noise, experimental error, robot joint constraints, tracking error, etc. Also, if a human performs a slightly different motion, the robot may be confused or recognise the wrong motion, thus performing the wrong task; furthermore, this would cause a problem for balance control. The proposed teleoperation framework provides a good solution in this scenario. To tackle this issue, the proposed 1D-CNN is trained with motions of different persons, so that the robot learns the different ways in which the same gesture is performed. A motion library is created with the motion names and their corresponding best motion information. Using the proposed recognition approach, the robot recognises the motions accurately and picks the best motion information to imitate. Once it predicts the motion, the robot automatically prepares itself to balance its body and imitates the motion precisely.
5. Conclusions and future work
In this study, a teleoperation framework based on the Kinect and a humanoid robot is proposed, with the capability of motion recognition and balance control. The proposed adaptive balancing technique for NAO enables the robot to complete full-body postures efficiently and works effectively in both single and double supporting phases. It estimates the dynamic CoM by solving the forward kinematics of the humanoid robot and applying a mass-weighted average of the CoMs of the individual links with respect to the previous joint frames, with the aim of keeping the dynamic CoM inside the stable region, that is, the foot base of the robot. We developed a robust 1D-CNN-based model that enables the robot to recognise full-body human motions; it uses the joint angles mapped from 3D skeleton information and performs with high accuracy. After recognising the motions performed in front of the Kinect, the NAO robot automatically applies the proposed balancing technique and imitates the same motions. The proposed framework presents a novel paradigm for future research studies. It opens up new opportunities to build frameworks that integrate sensor-based techniques producing signals such as EEG and EMG, and it encourages gesture/motion-based natural interaction between humans and robots. Our future work aims to extend the balance strategy to stabilise the robot dynamically under the effect of external forces and moments, using inertial measurement units and model-based closed-loop feedback control algorithms. Further, we will design advanced features to help robots understand human intentions.
Conflicts of Interest
The authors declare that they have no conflict of interest.
Financial Support
None.
Ethical Considerations
None.