Reinforcement learning for linked multicomponent robotic systems

LOPEZ GUEDE, JOSE MANUEL

Supervised by:

Manuel Graña Romay (Supervisor)

Defending university: Universidad del País Vasco - Euskal Herriko Unibertsitatea

Defense date: 12 April 2012

Examination committee:

Darío Maravall Gómez-Allende (Chair)
Javier de Lope Asiaín (Secretary)
Félix de la Paz López (Member)
Ramón Ferreiro García (Member)
José García Rodríguez (Member)

Department:

Computer Science and Artificial Intelligence (Ciencia de la Computación e Inteligencia Artificial)

Type: Thesis

Teseo: 115791 DIALNET

Abstract

The thesis focuses on a specific kind of system, the Linked Multi-component Robotic System (L-MCRS), consisting of a collection of autonomous robots attached to a one-dimensional object, which is a passive, flexible and/or elastic element constraining the dynamics of the autonomous robots in a non-linear fashion. Therefore, modeling and prediction of the system dynamics need to take into account the linking element as well as the mobile autonomous robots. In fact, the kind of practical task best suited for this kind of system is related to the manipulation and transport of the one-dimensional object. The paradigmatic example is the transportation or deployment of a hose for fluid disposal.

The present dissertation follows a line of research of the group that has laid some background supporting the present work. First, some proof-of-concept physical systems have been built and tested where the expected effect of the linking element is demonstrated. The hose sometimes hinders the motion of the robots, sometimes introduces drifts, and sometimes drags lagging robots. Some of these systems are discussed in Chapter 2 of this dissertation. Second, a theoretical framework for the accurate modeling and simulation of these kinds of systems was provided. The Geometrically Exact Dynamic Splines (GEDS) allow modeling the hose and the forces acting inside it as a response to the external forces exerted by the robots and the environment. In this dissertation, the GEDS model has been adapted to be embedded in the computational experimentation required by the Reinforcement Learning (RL) approach.

Although the physical model demonstrations provide some evidence of the linking element effect, simulation provides a repeatable and fully controlled experimental setting to provide additional evidence supporting the intuition that the L-MCRS belongs to a category of systems different from the disconnected collection of robots (D-MCRS).
The reasoning is that a control scheme derived with a D-MCRS in mind would not be able to deal with an L-MCRS identical in every respect except for the existence of the linking element; that is, the distinction between system categories lies in their controllability. The experimental setup in Chapter 3 was the formulation of a minimalistic L-MCRS model where the linking element is a compressible spring that exerts some force only when the segment between two robots extends beyond a limit size. A distributed control system was defined for a path following task with a mixed formation, where each robot unit control was designed to follow a reference position by a Proportional-Integral controller. Keeping the formation was the role of a distributed control process, where the rear robot position corresponds to the coordination variable. The robots performed a consensus-based asynchronous distributed estimation of the coordination variable, allowing for the successful completion of the task when no linking element was present. The introduction of the linking element produced easily observable interactions between units, rendering the controller system ineffective to solve the path following task. The experiment demonstrates that the L-MCRS are a specific category of systems from the point of view of controllability.

Reinforcement Learning (RL) allows autonomous learning of control systems. The main aim of the Thesis is to show that RL can provide a solution to the autonomous control design problem for L-MCRS. First, we have identified a suitable problem as a prototypical instance of the L-MCRS control problem. This problem is the deployment of a hose to make the tip reach a desired position. The state variables of the Markov Decision Problem (MDP) and the reward system are the sensitive elements of the definition of the Q-learning system. The definition of the state variables includes the decision about the discretization resolution of the configuration space where the hose is moving.
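As an illustration of what choosing a discretization resolution involves, the following minimal Python sketch maps a continuous planar position to a grid cell and counts the resulting states. The square workspace, the cell sizes, and the function names `discretize` and `num_states` are assumptions made for this example; the abstract does not specify the thesis's actual state encoding.

```python
def cells_per_side(size, cell):
    """Number of grid cells along one side of a square workspace."""
    return max(1, round(size / cell))


def discretize(x, y, size, cell):
    """Map a continuous position (x, y) to a grid cell (col, row).

    Assumes a hypothetical square workspace [0, size] x [0, size].
    """
    n = cells_per_side(size, cell)
    col = min(int(x / cell), n - 1)  # clamp the upper border into the last cell
    row = min(int(y / cell), n - 1)
    return col, row


def num_states(size, cell, n_points=1):
    """Size of the state space when n_points positions are tracked jointly."""
    n = cells_per_side(size, cell)
    return (n * n) ** n_points


# Halving the cell size quadruples the cells per tracked point, and tracking
# a second point (for example a second hose point) squares the total.
coarse = num_states(2.0, 0.5)            # 4 x 4 grid
fine = num_states(2.0, 0.25)             # 8 x 8 grid
two_points = num_states(2.0, 0.5, n_points=2)
```

In this toy setting the state count grows quickly with both the resolution and the number of tracked points, which is the kind of growth that makes exploration more expensive at fine resolutions.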
We have found that the discretization resolution can have a strong effect on the computational cost of the process and on its success rate. Low resolutions imply smaller state spaces and higher success because rough approximations to the solution are better tolerated. High resolutions imply greater computational complexity and lower success, because the exploration time grows exponentially with system size. Nevertheless, we reach very high success rates in some instances of the learning experiments.

The state variables are determined by the abilities of the agent. The basic ability is to sense its current position, which is allowed in all systems. Next is the ability to perceive the hose and determine if it has become an obstacle. This ability is the minimal perception required and it may correspond to very simple sensors in real-life experiments. The system is able to provide good results with this minimal perception ability, under specific reward systems. The ability to determine if the hose is inside some specific region of the configuration space is a further sophistication of the perception ability of the system. An additional degree of perception is the ability to sense the position of two specific points of the hose, allowing the agent to have an implicit model of the hose to reason about. Finally, the ability to predict the danger of an undesired termination state one step ahead is the last perception stage reached in our modeling, providing the best results, as expected.

In all cases, the reward policy has a bigger impact on the learning performance. Basically, the reward policies give positive reward for reaching the goal position, negative reward for reaching a failed state, and diverse ways to value the inconclusive states. When there is only positive reward for reaching the goal state the results are good, meaning that negative reinforcement is not as influential as expected from an intuitive point of view.
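The reward structure just described can be sketched as a small Python function. The numeric values (+1, -1, 0), the goal tolerance `tol`, and the scheme names are assumptions chosen for illustration and are not taken from the thesis.

```python
import math


def reward(tip, goal, failed, scheme="goal_only", tol=0.1):
    """Scalar reward for the state reached after one action.

    tip, goal: (x, y) positions of the hose tip and of the target.
    failed:    True if the episode ended in an undesired terminal state.
    scheme:    "goal_only" rewards only reaching the goal;
               "goal_and_failure" also penalizes failed states.
    """
    if math.dist(tip, goal) <= tol:
        return 1.0    # goal reached (value is an assumption)
    if failed:
        return -1.0 if scheme == "goal_and_failure" else 0.0
    return 0.0        # inconclusive state, valued at zero in this sketch


# Same failed state under the two schemes.
r_goal_only = reward((0.0, 0.0), (1.0, 1.0), failed=True)
r_penalized = reward((0.0, 0.0), (1.0, 1.0), failed=True, scheme="goal_and_failure")
```

Switching `scheme` is then enough to compare learning with and without negative reinforcement on otherwise identical experiments.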
Simplistic ways to give value to inconclusive states, such as zero value or a value proportional to the distance of the tip to the desired position, give good learning performance.

We have tested a single-robot and a two-robot configuration with similar results, the two-robot system improving in some cases on the single-robot configuration. For the two-robot configuration we have tested single-robot reward policies applied to the robot at the tip of the hose, the other robot remaining rewardless, with acceptable learning results, suggesting that teaching the "guiding robot" may be enough for the task. In addition, we have tested two-robot specific reward systems that improve on the single-robot reward systems.

Finally, learning time is highly dependent on the simulation time employed to reproduce the experiences on the real system. We have tested the improvement introduced by storing the visited state transitions and their corresponding observed rewards in a variation of Q-learning called TRQ-learning. We find improved results with TRQ-learning due to reduced need for exploration and faster computation.

As lines of future work, we find highly interesting the research on methodological improvements in the definition of the RL algorithms allowing faster and more successful learning processes. The hierarchical decomposition of the system into different layers of abstraction can allow the progressive refinement of learning results until reaching the final stage of the most realistic modeling and simulation.
Learning in the simple models can be fast and the refinement learning can be much faster than the brute force approach on the whole model. Such approaches would need innovative ways to define the equivalence between models and how the transition between levels of abstraction could be made.

We are also interested in bringing into real-life systems the results of the learning on the simulated model and exploring more realistic systems closer to the industrial applications, such as the hose deployment from a compact interleaved state. The physical system design problems are challenging and have only been scratched by research groups interested in similar problems, such as the GII from the Universidad de A Coruña. Innovative hose graspers and minimal mobile robot configurations, which may even be folded with the hose in the resting state, or power transmission systems are extremely appealing lines of research.
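To make concrete the idea, mentioned in the abstract, of storing visited state transitions and their observed rewards, here is a minimal Python sketch of tabular Q-learning with a transition cache. It is not the thesis's TRQ-learning algorithm, whose details the abstract does not give. The one-dimensional corridor, the `simulate` stand-in for the hose simulator, and all parameter values are assumptions, and the cache is only valid because this toy simulator is deterministic.

```python
import random

N_STATES, GOAL = 6, 5      # hypothetical 1-D corridor, goal at the right end
ACTIONS = (-1, +1)
sim_calls = 0              # counts how often the "expensive" simulator runs


def simulate(state, action):
    """Stand-in for a costly simulation step; deterministic by assumption."""
    global sim_calls
    sim_calls += 1
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    cache = {}             # (state, action) -> (next_state, reward, done)
    for _ in range(episodes):
        s = 0
        for _ in range(100):                       # step cap per episode
            if rng.random() < eps:
                a = rng.choice(ACTIONS)            # explore
            else:                                  # exploit, random tie-break
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            if (s, a) not in cache:                # simulate only unseen pairs
                cache[(s, a)] = simulate(s, a)
            s2, r, done = cache[(s, a)]
            target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])  # Q-learning update
            s = s2
            if done:
                break
    return q, cache


q, cache = train()
greedy = [max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(GOAL)]
```

After training, `sim_calls` equals the number of cached transitions rather than the total number of learning steps, so in this sketch the saving comes from replacing repeated simulator calls with table lookups.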