Modeling and Control of the Human Arm With a Fuzzy-Genetic Muscle Model Based on Reinforcement Learning: The Muscle Activation Method

Background: The central nervous system (CNS) is thought to optimize arm movements by reducing some kind of cost function. Simulating parts of the nervous system is one way of obtaining accurate information about the neurology and treatment of neuromuscular diseases. The primary purpose of this paper is to model and control the human arm in a reaching movement based on reinforcement learning (RL) theory. Methods: First, Zajac’s muscle model was improved by a fuzzy system. Second, the proposed muscle model was applied to the 6 muscles that actuate a two-link arm moving in the horizontal plane. Third, the model parameters were approximated with a genetic algorithm (GA). Experimental data were recorded from healthy subjects to assess the approach. Finally, the RL algorithm was used to guide the arm in reaching tasks. Results: The results show that: (1) The proposed system is temporally similar to a real arm movement. (2) The RL algorithm can generate motor commands similar to those obtained from electromyographies (EMGs). (3) The activation function obtained from the system is similar to the activation function extracted from the real data, which may support the possibility of RL in the CNS (basal ganglia). In addition, in order to have a graphical and effective representation of the arm model, the virtual reality environment of MATLAB was used. Conclusion: Since the RL method is representative of the brain’s control function, it has features such as a better settling time, no peak overshoot, and robustness.


Introduction
When a person performs a reaching movement, he/she chooses a unique trajectory among an infinite number of possible ones. It has been suggested that the central nervous system (CNS) optimizes arm movements to reduce some kind of cost function. [1][2][3] Cost functions usually contain movement-related variables that should be optimized during the movement. 4 Some computational control models have shown that the CNS generates a set of motor commands to optimize cost functions. [5][6][7][8][9][10] These models predict arm trajectories that match the experimental data, but they do not describe the process of learning.
The present paper discusses an optimal learning control method based on reinforcement learning (RL). RL acquires an optimal control strategy through a trial-and-error process with no explicit "teacher." 11 RL algorithms are considered a model for dopamine-based learning in the brain, 12 where the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error.
The innovation of this study is to use the FGIHM model as an adaptive muscle model 13 and to control it with an RL method that is a model of brain function. Modeling the CNS and the musculoskeletal system is a step toward understanding the complexity of the system. This knowledge may soon help clinicians treat neurodegenerative diseases.
In one study, 14 a classical RL algorithm was applied to implement human arm movement. A major drawback of classical RL in a biological system (the human arm) is that it requires many trials during the learning procedure to achieve even a simple point-to-point motion.
In other words, the computational cost is high, and the simulation process takes a long time. To overcome this drawback, another approach was proposed. 15 The Q-learning technique is one of the RL methods that utilize the action-value function. The action-value function gives the expected utility of taking a given action in a given state and following a fixed policy after that. 16 Recently, a temporal-difference method related to Q-learning, multiagent state-action-reward-state-action (MA-SARSA), 16 was introduced to speed up the convergence of the system states and avoid local-optima traps. This approach takes advantage of continuous reward functions to improve the learning speed, which a classical RL with discrete rewards does not. 16 Another advantage of MA-SARSA is that it overcomes the combinatorial explosion by using multiple agents that learn separately. The combinatorial explosion happens when multiple actuators are needed to optimally control a complex agent in a dynamical environment.
In other research, a method incorporating an artificial bee colony algorithm, called the BCR algorithm, was used to control the two-link arm with 6 muscles. 17 Chaotic phenomena may occur in two-link arm control based on the RL algorithm when the muscle parameters are changed. 18 Another method to determine muscle activity is muscle synergy. 19 Some research aimed at understanding the function of the neuromuscular system has been applied to anthropomorphic systems. 20 In this paper, a reaching movement by a two-link arm with 6 fuzzy-genetic muscles is investigated. The authors previously showed that Zajac's muscle model is unable to account for planar movement details during the experiment because it has few tuning parameters. 13 In the proposed muscle model, 3 fuzzy parameters are added to Zajac's model to overcome this drawback. In order to validate the approach, an experiment was conducted on three right-handed male subjects (mean age 27), and the model was tuned with a genetic algorithm (GA). Then the MA-SARSA algorithm was used to find each muscle activation function. The results show that: (1) MA-SARSA can control this model, despite its redundancy, in a reaching movement with settling time, rise time, and peak overshoot similar to those of a human arm reaching movement. (2) RL can yield a muscle activation level similar to the one extracted from real electromyography (EMG).

The Kinematics of Muscle-Joint Space
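The two-link arm model with 6 muscles is shown in Figure 1, where MLi denotes the length of each muscle and (θ1, θ2) are the joint angles; a1∼8 and b1∼4 represent the positions at which the muscles connect to the bones. The muscle lengths are functions of the joint angles:
$$l = l(\theta_1, \theta_2) \tag{1}$$
The time derivative of equation (1) can be expressed as follows:
$$\dot{l} = W\dot{\theta} \tag{2}$$
where $\dot{l} \in \mathbb{R}^{6 \times 1}$ is the muscle contractile velocity vector, $\dot{\theta} = (\dot{\theta}_1, \dot{\theta}_2)^{T}$ is the joint angular velocity vector, and $W \in \mathbb{R}^{6 \times 2}$ is the Jacobian matrix from the joint space to the muscle space, whose entries depend on the joint angles, the attachment positions a and b, and the muscle lengths MLi. The relation between muscle forces and joint torques can be expressed as follows:
$$\tau = W^{T} F \tag{4}$$
where $F \in \mathbb{R}^{6 \times 1}$ is the vector of muscle forces and $\tau \in \mathbb{R}^{2 \times 1}$ is the vector of joint torques.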

The Modeling of the Two-Link Arm in the Horizontal Plane
The following relation holds for the two-link planar arm when Lagrange's equation is used:
$$M(\theta)\ddot{\theta} + V(\theta, \dot{\theta}) + G(\theta) = \tau \tag{5}$$
where M is the inertia matrix of the arm, V is the nonlinear term that includes the Coriolis and centrifugal forces, τ is the input torque for the arm joints, and G represents the moment due to the gravity force. 22 Substituting (4) into (5) yields:
$$M(\theta)\ddot{\theta} + V(\theta, \dot{\theta}) + G(\theta) = W^{T} F \tag{6}$$
where F is the force of the 6 muscles, and W is the Jacobian matrix from the joint space to the muscle space.
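The mapping between the muscle space and the joint space can be illustrated with a minimal sketch. It assumes the convention that W is a 6 × 2 Jacobian, so that muscle velocities are W times the joint rates and joint torques are the transpose of W times the muscle forces. The numerical entries of W below are hypothetical constant moment arms chosen only for illustration; in the arm model, W varies with the joint angles and the attachment geometry.

```python
def mat_vec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def transpose(mat):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*mat)]


# Hypothetical 6 x 2 Jacobian (metres): row i holds d(l_i)/d(theta_1), d(l_i)/d(theta_2).
# Two shoulder muscles, two elbow muscles, two biarticular muscles.
W_EXAMPLE = [
    [0.04, 0.0],
    [-0.04, 0.0],
    [0.0, 0.025],
    [0.0, -0.025],
    [0.03, 0.02],
    [-0.03, -0.02],
]


def muscle_velocities(W, joint_rates):
    """Muscle velocity vector (6 values) from joint rates (2 values)."""
    return mat_vec(W, joint_rates)


def joint_torques(W, forces):
    """Joint torque vector (2 values) from muscle forces (6 values)."""
    return mat_vec(transpose(W), forces)
```

Under this convention the mechanical power computed in the muscle space equals the power computed in the joint space, which is a quick consistency check on any candidate W.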

Experimental Setup and EMG Signal Processing to Obtain Muscle Activation
In order to set the parameters of the muscle model for each person, EMGs and the shoulder and elbow joint angles were recorded simultaneously by the BIOPAC system, Model MP150. 23 Each subject was asked to sit behind a special table. The height of this table is adjustable so that the shoulder and the person's body form a 90° angle (Figure 2).
Three healthy subjects were tested in this experiment (male subjects aged from 21 to 27). All subjects were right-handed and had no neuromuscular problems. Subjects were asked to perform the task without any concentration to minimize the co-activation of antagonist muscles. Ag-AgCl surface electrodes were used to obtain EMGs. For bipolar recording, electrodes of 8 mm diameter (8 mm Ag-AgCl, BIOPAC-EL208S) were attached to the subject's skin. 23 The electrodes were placed at the locations described in the literature. 24,25 EMGs were recorded from the short biceps head (BSH), long biceps head (BLH), triceps brachii medial head (TRIA), long triceps head (TRIO), pectoralis major (PMJ), and deltoid (DEL). In addition, the angles of the elbow and shoulder joints were recorded by an electrogoniometer (Figure 3).
The EMG signals were amplified with a gain of 5000 20 and sampled at 1 kHz. Each recording lasted 15 seconds, with 20 seconds of rest between recordings. Subjects were asked to move their hands between 2 marked points. The distance between the start point and the end point was 50 cm.
In order to obtain the normalized muscle activation level (NAL) from the raw EMG, the following steps were carried out 27 : (1) High-pass filtering. (2) Full-wave rectification (absolute value). (3) Low-pass filtering. (4) Normalization to the maximum EMG of each channel. Figure 4 shows the block diagram of the NAL method. The raw EMG signal with arbitrary amplitude is mapped into a normalized signal in the range of zero to one.
The above procedure was applied to a biceps signal, as shown in Figure 5.
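The four NAL steps can be sketched as follows. This is only an illustration: the filter types and cutoff frequencies are not stated here, so the single-pole filters and the 20 Hz and 5 Hz cutoffs below are assumptions, and the function names are hypothetical. Only the 1 kHz sampling rate is taken from the recording setup.

```python
import math


def first_order_lowpass(signal, cutoff_hz, fs_hz):
    """Single-pole IIR low-pass filter (assumed stand-in for the unspecified filter)."""
    dt = 1.0 / fs_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out = []
    y = signal[0]  # start at the first sample to avoid a start-up step
    for x in signal:
        y = y + alpha * (x - y)
        out.append(y)
    return out


def first_order_highpass(signal, cutoff_hz, fs_hz):
    """High-pass obtained by subtracting the low-pass output from the input."""
    low = first_order_lowpass(signal, cutoff_hz, fs_hz)
    return [x - l for x, l in zip(signal, low)]


def normalized_activation(raw_emg, fs_hz=1000.0, hp_cutoff_hz=20.0, lp_cutoff_hz=5.0):
    """Map raw EMG to a normalized activation level in [0, 1]."""
    filtered = first_order_highpass(raw_emg, hp_cutoff_hz, fs_hz)    # step 1
    rectified = [abs(x) for x in filtered]                           # step 2
    envelope = first_order_lowpass(rectified, lp_cutoff_hz, fs_hz)   # step 3
    peak = max(envelope)
    if peak == 0.0:
        return [0.0 for _ in envelope]
    return [e / peak for e in envelope]                              # step 4
```

Normalizing by the maximum of each channel makes the output independent of the amplifier gain, so activation levels from different electrodes can be compared on the same zero-to-one scale.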

Enhancement of Zajac's Muscle Model With Fuzzy System
A Zajac-type muscle model is used to generate muscle force. 28 According to this model, the muscle force (F muscle ) is the sum of an active component (F α ) and a passive component (F p ). 26 Here, α is the muscle activation level in the range of 0-1 (0 for passive muscle, 1 for fully activated muscle), F max is the maximum muscle force, and l o is the optimal fascicle length. The shape of the exponential is determined by the variable K sh , which was set to K sh = 3. Finally, the output F muscle depends on the contraction velocity, F α , and F p . 26,28 In order to enhance the Zajac model, 3 fuzzy scaling coefficients were added to F p , F α , and F muscle . The F p and F α coefficients depend on the muscle length, and the F muscle coefficient depends on the rate of change of the muscle length. Applying these coefficients increases the number of tuning parameters of the model (setting the fuzzy scaling coefficients to one reduces the relation to Zajac's muscle model).
In order to calculate the coefficients, a fuzzy system was designed with the Mamdani inference system (Figure 6). 28 Fuzzification and defuzzification are performed with Gaussian membership functions of the form:
$$\mu(x) = \exp\left(-\frac{(x - c)^{2}}{2\sigma^{2}}\right)$$
This function has 2 parameters (c and σ), which are determined by the GA.
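A Gaussian membership function with center c and width σ can be evaluated as in the sketch below. It assumes the standard form exp(-(x - c)² / (2σ²)); the function name and the example values are illustrative and are not the tuned parameters of the model.

```python
import math


def gaussmf(x, c, sigma):
    """Degree of membership of x in a Gaussian fuzzy set with center c and width sigma."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def memberships(x, fuzzy_sets):
    """Evaluate x against several (c, sigma) pairs, e.g. 'short', 'optimal', 'long'."""
    return [gaussmf(x, c, sigma) for c, sigma in fuzzy_sets]
```

For instance, `memberships(1.0, [(0.8, 0.1), (1.0, 0.1), (1.2, 0.1)])` gives full membership in the middle set and equal, smaller memberships in the two neighboring sets. Because each set is described by only two numbers, the whole fuzzy system reduces to a short parameter vector that an optimizer such as the GA can search.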

Tuning the Fuzzy Muscle Parameters by Genetic Algorithm
A GA was used to tune the fuzzy scaling coefficients of the model (see the section "Enhancement of Zajac's Muscle Model With Fuzzy System"). These coefficients change with the input and output membership functions of the model. The membership functions depend on the muscle length and the muscle contraction velocity. The GA adjusts the muscle coefficients, which are the muscle parameters for mapping EMG to the elbow and shoulder joint angles.
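The idea of tuning membership-function parameters with a GA can be sketched on a toy problem: recovering the (c, σ) pair of one Gaussian membership function from sampled values by minimizing the mean square error. This is a generic elitist GA written for illustration, not the authors' implementation. The population size, mutation scale, search bounds, and function names are all assumptions, and the real fitness would compare simulated and recorded joint angles rather than a single curve.

```python
import math
import random


def gaussmf(x, c, sigma):
    """Gaussian membership function with center c and width sigma."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def mse(params, xs, ys):
    """Fitness: mean square error between the candidate curve and the data."""
    c, sigma = params
    return sum((gaussmf(x, c, sigma) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_ga(xs, ys, pop_size=40, generations=150, seed=0):
    """Elitist GA over (c, sigma); returns the best pair and the best-fitness history."""
    rng = random.Random(seed)
    lower = (0.0, 0.05)  # assumed lower bounds for (c, sigma)
    upper = (2.0, 1.0)   # assumed upper bounds for (c, sigma)
    pop = [tuple(rng.uniform(lower[i], upper[i]) for i in range(2))
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda p: mse(p, xs, ys))
        history.append(mse(pop[0], xs, ys))
        elite = pop[: pop_size // 4]  # the best quarter survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            w = rng.random()  # blend crossover weight
            child = []
            for i in range(2):
                gene = w * a[i] + (1.0 - w) * b[i] + rng.gauss(0.0, 0.02)  # mutation
                child.append(min(max(gene, lower[i]), upper[i]))
            children.append(tuple(child))
        pop = elite + children
    best = min(pop, key=lambda p: mse(p, xs, ys))
    return best, history
```

Keeping the elite unchanged guarantees that the best fitness never gets worse from one generation to the next, which makes the convergence curve easy to interpret.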

Control of the Modeled Arm Based on Reinforcement Learning Method
In order to control the model, MA-SARSA is utilized. 31 The block diagram of this algorithm is shown in Figure 7. In this method, the agents are updated based on the rewards and punishments received from the system conditions: these are applied to all of the agents, and each agent is adjusted separately. For this algorithm, the reward function is defined in terms of β, a gain that controls the magnitude of the reward; d, the distance from the goal position; and n, the shape factor parameter. MA-SARSA uses a continuous penalty function for every action that does not support the goal; this penalty includes a bias term and depends on v, the velocity of the arm tip. 32
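The update scheme described above, in which one shared reward is applied to all agents while each agent is adjusted separately, can be sketched with tabular SARSA. This is a minimal illustration using the standard SARSA rule Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]. The discrete state and action spaces, the learning rate, the discount factor, the ε-greedy exploration, and all names are assumptions; the reward is passed in as a number, so no particular reward or penalty formula is implied.

```python
import random


class SarsaAgent:
    """One tabular SARSA learner, e.g. one agent per muscle."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q[state]))
        row = self.q[state]
        return row.index(max(row))

    def update(self, s, a, r, s_next, a_next):
        """On-policy temporal-difference update; returns the TD error."""
        td_error = r + self.gamma * self.q[s_next][a_next] - self.q[s][a]
        self.q[s][a] += self.alpha * td_error
        return td_error


def multiagent_sarsa_step(agents, s, actions, reward, s_next):
    """Apply one shared reward to every agent and update each agent's own table."""
    next_actions = [agent.act(s_next) for agent in agents]
    for agent, a, a_next in zip(agents, actions, next_actions):
        agent.update(s, a, reward, s_next, a_next)
    return next_actions
```

Giving each muscle its own small table, instead of one table over all joint action combinations, is what keeps the number of learned values from growing exponentially with the number of actuators.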

Simulation Procedure
The two-link arm model with 6 fuzzy-genetic muscles (TLA6FGM) was simulated with the Simulink and SimMechanics toolboxes of MATLAB: MATLAB/Simulink/SimMechanics 2008 (ver 7.6.0.324). The differential equations were solved with a Runge-Kutta method with a constant step of 0.01 second. The GA proceeded for 800 iterations; at the 100th and 200th iterations, random noise was added to its output, and the fitness function was based on the mean square error. The physical parameters of the arm and the muscle connection points were derived from the literature. 21 Three subjects participated, and they were asked to perform extension and flexion in the horizontal plane. Before implementing the RL method, the TLA6FGM was first tuned for each subject. Finally, the tuned TLA6FGM was controlled with the RL algorithm. The RL algorithm, based on MA-SARSA, was run for 1500 episodes of 1500 steps each.
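A fixed-step Runge-Kutta integration with a 0.01 second step can be sketched as follows. The order of the method is not stated, so the classical fourth-order scheme used here is an assumption, and the function names are hypothetical. The right-hand side `f` stands for any state derivative, such as the arm dynamics written in first-order form.

```python
def rk4_step(f, t, y, h):
    """Advance the state y (a list of floats) by one classical RK4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, [yi + h / 2.0 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2.0, [yi + h / 2.0 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f, y0, t_end, h=0.01):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end with a constant step h."""
    n_steps = round(t_end / h)
    t = 0.0
    y = list(y0)
    for _ in range(n_steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y
```

As a usage example, `integrate(lambda t, y: [y[1], -y[0]], [1.0, 0.0], 1.0)` integrates a unit-frequency oscillator for one second and returns values close to cos(1) and -sin(1).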

Results
In Figure 8, the proposed arm model with 6 fuzzy muscles is controlled by two controllers: state feedback (PID) and RL. Figure 8a shows the superiority of the RL controller over the state feedback controller in Figure 8b.
The angles and angular velocities of the model controlled by RL and by state feedback are shown in Figure 9. The dashed (--) lines come from the experiment, the red lines are the RL controller responses, and the blue lines are the state feedback responses. Figure 10 compares the reaching distance of the tip for the 2 controllers; the error is estimated as the Euclidean distance from the target.
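The tip error used in this comparison can be sketched from the joint angles with the standard forward kinematics of a planar two-link arm. The link lengths below are hypothetical, and the elbow angle is assumed to be measured relative to the upper arm; neither is taken from the model parameters.

```python
import math


def tip_position(theta1, theta2, l1=0.3, l2=0.3):
    """Tip (x, y) of a planar two-link arm; angles in radians, lengths in metres."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def tip_error(theta1, theta2, target, l1=0.3, l2=0.3):
    """Euclidean distance between the arm tip and the target point."""
    x, y = tip_position(theta1, theta2, l1, l2)
    return math.hypot(x - target[0], y - target[1])
```

Evaluating `tip_error` at every simulation step gives an error-versus-time curve of the kind compared between the two controllers.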
In Figures 11 and 12, the activation levels produced by the RL controller and the state feedback controller are compared for each of the 6 muscles. Figure 13 shows the trace of the two-link arm with 6 muscles controlled by RL. Figures 13a and 13b show the arm motions after 500 and 1500 trials, and Figure 13c shows the tangential velocity profile of the hand from 0 to 0.52 seconds. Figure 14 shows the robustness of MA-SARSA and PID-GA under a disturbance field applied to the tip of the arm.
In order to gain better insight into the arm movement, the virtual reality environment of MATLAB was used to demonstrate the performance of the model. Figure 15 shows the postures of the arm from different viewpoints.

Conclusion
In the present paper, a fuzzy-genetic muscle model based on Zajac's model (FGMZM) was proposed. The essential advantages of this model are: (1) FGMZM is more robust to input noise than Zajac's model. (2) It can be customized through the number of adjustable parameters. FGMZM was used to model the two-link arm with 6 muscles. Finally, the arm model was controlled with the MA-SARSA algorithm, and the results were compared with the PID-GA state feedback control algorithm.
In a comparison between the MA-SARSA method and the PID-GA state feedback controller (Figure 8), MA-SARSA has a better settling time (less than a second, like a human reaching movement) than the PID-GA controller (more than 5 seconds).
Moreover, MA-SARSA does not have any peak overshoot (Figure 9). This comes from the speed regulation process in MA-SARSA, which follows a bell-shaped curve (Figure 13c): the speed is low at the beginning and the end of the movement and high in the middle. With PID-GA state feedback, peak overshoot was always seen because this controller relies on error correction for control.
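The two metrics used in this comparison can be computed from a sampled response as in the sketch below. The 2% settling band is a common convention adopted here as an assumption, since the criterion is not stated, and the function names are hypothetical.

```python
def peak_overshoot(response, target):
    """Overshoot as a fraction of the step size; 0.0 if the target is never exceeded."""
    start = response[0]
    step = target - start
    if step == 0:
        raise ValueError("target must differ from the initial value")
    peak = max((y - start) / step for y in response)
    return max(0.0, peak - 1.0)


def settling_time(times, response, target, band=0.02):
    """Earliest time after which the response stays within the band around the target.

    Returns None when the final sample is still outside the band.
    """
    tol = band * abs(target - response[0])
    settled_at = None
    for t, y in reversed(list(zip(times, response))):
        if abs(y - target) > tol:
            break
        settled_at = t
    return settled_at
```

A response that rises monotonically to the target, as a bell-shaped velocity profile produces, gives an overshoot of exactly zero with this definition.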
The effects of a disturbance applied to the tip of the arm and the resulting drift from the no-load trajectory are shown in Figure 14. MA-SARSA is more robust to the noise than the PID-GA state feedback controller because MA-SARSA changes the controller structure. The structural changes come from the trial-and-error learning process, which allows the controller to explore all the possible conditions.
The effect of the learning process on the reaching movement is illustrated in Figures 13a and 13b with the sets of weight parameters after the 500th and 1500th trials. In the early phase of learning (500th trial), the tip reached the target but passed beyond it because the speed of the tip at the reaching point was not acceptable. As the number of trials increases to 1500, MA-SARSA reaches the goal with a suitable cost function (Figure 13b). The velocity profile is almost bell-shaped; as the number of trials increases, the velocity profile typically becomes a smoother bell shape (Figure 13c), and the hand trajectory tends to follow an approximately straight line.
The MA-SARSA controller yields different parameter sets, all of which achieve the reaching movement goals. Among them, there is a specific set that most closely matches the biological controller. Hence, MA-SARSA is capable of modeling a part of the brain, the basal ganglia, 32 which plays a vital role in skilled movement.
Tahara et al suggested a task-space feedback for controlling a two-link arm driven by 6 muscles. 21 In comparison, when the arm reaches the target under task-space feedback control, it has to swing around the target, and overshoot sometimes occurs. This phenomenon is not seen in human arm movements. In this sense, the RL control method is more consistent with real human arm behavior than task-space feedback control.
For further investigation, it is suggested to use the biological controller in a hierarchical structure to study the control procedure of the human brain more precisely. In addition, an experimental setup should be designed to implement and validate such a structure.
The virtual reality environment of MATLAB has few block sets that are suitable for demonstrating human movement. Therefore, the authors defined some new block sets for animating human movements. In the future, this toolbox will be completed and presented to other research colleagues for investigating human body movements.