WikiJournal Preprints/Story based tracking control with hierarchical planning

= Cognitive planning =

Introduction
The term “tracking control” is usually used in the domain of optimal control. A nonlinear system is sampled randomly with the aim of bringing it into a goal state. Typical applications are inverted pendulum balancing and biped walking. The basic idea is that a higher instance (usually the human) sets the subgoal and the tracking controller follows it.

Apart from balancing tasks, a more complicated system is a symbolic game engine, e.g. a text adventure or a real-time strategy game. At first there is no given trajectory that can be followed. Instead, the trajectory is the plan that the player executes. For example, a real-time strategy game can be played with the plan:


 * 1) move player1 to place A
 * 2) move player2 to place C
 * 3) use player1 and player2 together to attack enemy A

Such a plan is the trajectory for solving the game. The question now is what a tracking controller that follows the plan looks like. First of all, a tracking controller does something different from copying the given plan: it tries to reach the same goal. In computer animation, so-called keyframes are used to describe subgoals. Can this be adapted to real-time strategy games?

In project management, so-called “plan tracking” is used to get feedback about a project before it is over. The idea is to have a baseline against which the current situation is validated. That means a project can be on plan or off plan. Let us take a practical example. In the left window a human is playing a real-time strategy game. In the right window an AI is playing the same game. How can we determine whether the AI is performing the same actions?

Case-based reasoning with a text adventure
A text adventure is a game engine that is controlled with language input. A walkthrough of a text adventure can, for example, look like this:


 * 1) goto door
 * 2) open door
 * 3) goto room A
 * 4) take object
 * 5) goto room B
 * 6) place object

This is a kind of symbolic trajectory, or a plan. It is executed against the text adventure. Another walkthrough is the following:


 * 1) goto door
 * 2) open door
 * 3) goto room A
 * 4) goto room B
 * 5) close door

This plan differs from the previous one and results in a different ending. The plans are stored as cases. Case-based reasoning is used to get additional information about the plans, which means searching the database for a certain plan fragment.
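The search for a plan fragment can be sketched in a few lines of Python. This is only a minimal illustration and not the implementation of a particular case-based reasoning system. The case base `CASES` and the function names `contains_fragment` and `find_cases` are hypothetical; the two stored plans are the walkthroughs from above.

```python
# Hypothetical case base: each case is one stored walkthrough (a list of commands).
CASES = {
    "walkthrough_1": ["goto door", "open door", "goto room A",
                      "take object", "goto room B", "place object"],
    "walkthrough_2": ["goto door", "open door", "goto room A",
                      "goto room B", "close door"],
}


def contains_fragment(plan, fragment):
    """Return True if fragment occurs as a contiguous run of commands in plan."""
    n = len(fragment)
    return any(plan[i:i + n] == fragment for i in range(len(plan) - n + 1))


def find_cases(cases, fragment):
    """Return the sorted names of all stored plans that contain the fragment."""
    return sorted(name for name, plan in cases.items()
                  if contains_fragment(plan, fragment))


# Which stored plans pick up the object right after entering room A?
print(find_cases(CASES, ["goto room A", "take object"]))
```

A real case base would probably also store the outcome of each plan, so that a matching fragment can be used to predict the ending.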

State-space trajectory
In optimal control, the trajectory is a spline over time. The x-axis shows the time in seconds and the y-axis the angle of the pendulum. If a human operator performs a task, a specific pattern becomes visible. In a text adventure there is no angle or other numerical parameter, but the game engine is also in a state, and these states are enumerable. For example, the player can be in room A or B, and he can have item A or no item. The overall state space then consists of 4 possible states, see the figure. As in the optimal control problem, the x-axis shows the time, and the y-axis shows the hash value of the state, which can be 0, 1, 2 or 3. A plan through the game therefore results in a trajectory too.
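The mapping from game states to hash values can be sketched as follows. The concrete numbering (room first, item second) is an assumption made for this illustration, and the names `state_hash` and `trajectory` are hypothetical.

```python
ROOMS = ["A", "B"]        # the player is in room A or room B
ITEMS = [False, True]     # the player carries item A (True) or no item (False)


def state_hash(room, has_item):
    """Map a (room, has_item) pair to an integer in 0..3 (assumed numbering)."""
    return ROOMS.index(room) * len(ITEMS) + ITEMS.index(has_item)


def trajectory(states):
    """Turn a list of game states into (time step, hash value) pairs."""
    return [(t, state_hash(room, has_item))
            for t, (room, has_item) in enumerate(states)]


# Hypothetical playthrough: start in A, take the item, walk to B, drop the item.
playthrough = [("A", False), ("A", True), ("B", True), ("B", False)]
print(trajectory(playthrough))  # [(0, 0), (1, 1), (2, 3), (3, 2)]
```

Plotting the second component over the first gives the same kind of curve as the pendulum angle over time, only with discrete values.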

Complex motion patterns with RRT planners
Every RRT-based kinodynamic planner, however good, has the same problem: the number of nodes is very small. It cannot be increased, and this prevents the robot from reaching complex goals. But RRT itself is not the problem; the problem is how to use it as part of a more complex system. Let us first investigate the domain in which an RRT planner is really good: tracking control, that is, a near-term goal which can be reached in under 2 seconds. The horizon of the robot ends at the nodes that are farthest away from the current state. Everything behind the horizon cannot be observed. The human operator can only request a movement that lies inside the horizon of the robot. For example, if the robot should move around an object, the subgoal must always be inside the reachable area.

Reaching a high-level goal means performing actions over a long time period: not 2 seconds, but 20 seconds. Long-term goals can be reached by taking the result of the short-term planner as input and constructing a new game on top of it. For example, the short-term tracking controller can bring the robot to any relative position around an object. This behavior is the starting point and can be extended to a push controller. Pushing means first moving the robot to a position near the object and then performing a push action.

The precondition for the long-term planner is that some motion primitives work, for example a “moveto” primitive and a “push” primitive. The open question is how to combine these low-level primitives into a useful task. In the literature the problem is described as multimodal planning. That means the motion primitives are given by name and are put into a sequence that fulfills a task, and every motion primitive is itself solved by a planner. An example makes this clearer.

A motion primitive on the lowest level is “moveto(x,y)”. Saying “moveto(10,10)” is equal to setting a subgoal. The goal is reached by a tracking controller that works with the RRT algorithm. As described in the paragraphs above, the “moveto” primitive has a time horizon of 2 seconds. The robot tries random actions, and one of them reaches the goal (10,10). A complex task is executed by chaining many low-level primitives together, for example moveto(x,y), push(a), grasp(a). Every action takes no more than 2 seconds, and each is calculated with the RRT planner. Combined, they make it possible to follow a long-term goal.

Let us go into the details, because the combination of low-level and high-level planning is important. The low-level “moveto(10,10)” primitive first creates the RRT tree. In this tree 100 nodes are calculated, and the node nearest to the goal (10,10) is selected. The solver brings the robot into that state, and the subaction is finished. Now a different motion primitive can be executed, for example “push(20)”. The parameter describes the direction in which the ball has to be shot. Again the RRT tree is calculated. This time the subgoal is not the position of the robot but the direction of the ball. Again 100 nodes are calculated, and the one with the greatest similarity to the subgoal is selected.
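The “moveto” step can be illustrated with a strongly simplified RRT in Python. This sketch assumes a point robot in an empty 2D plane without a physics engine or obstacles, which is much simpler than a kinodynamic planner. The function name `rrt_moveto` and all parameters (step size, bounds, seed) are assumptions made for this illustration.

```python
import math
import random


def rrt_moveto(start, goal, n_nodes=100, step=1.0, bounds=(0.0, 20.0), seed=0):
    """Grow a small RRT from start and return the path to the node nearest to goal."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(n_nodes):
        sample = (rng.uniform(*bounds), rng.uniform(*bounds))
        # Find the existing node that is closest to the random sample.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        # Move at most one step from the nearest node towards the sample.
        scale = min(step, d) / d
        new = (near[0] + scale * (sample[0] - near[0]),
               near[1] + scale * (sample[1] - near[1]))
        parent[len(nodes)] = i_near
        nodes.append(new)
    # Select the node that is closest to the subgoal and walk back to the root.
    i_best = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], goal))
    path = []
    i = i_best
    while i is not None:
        path.append(nodes[i])
        i = parent[i]
    return path[::-1]


path = rrt_moveto((0.0, 0.0), (10.0, 10.0))
print(len(path), path[-1])
```

With only 100 nodes and a small step size the tree usually ends well before a distant goal, which is exactly the limited horizon discussed above. A “push” primitive would use the same loop but compare the simulated ball direction instead of the robot position.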

What we can say is that every motion primitive is different. Each has a different goal and different adjustable options. A motion primitive is similar to a subprogram and has to be designed manually. The good news is that motion primitives are robust: they can compensate for noise because they use an RRT planner. The simplest development kit contains only 2 motion primitives. They have different names and do different things, but this already enlarges the action space of the robot enormously. For example, the robot can execute a plan like primitive1, primitive1, primitive1, or a plan like primitive1, primitive2, primitive1.

A second piece of good news is that the PDDL community has a well-defined way of combining motion primitives into higher-order plans. A typical PDDL statement works with a symbolic definition of what the primitive will achieve. For example, “moveto(x,y)” results in “robot at (x,y)”. Even though the moveto command has not been executed yet, the planner can take the result and calculate the next step. That means high-level plans are created without testing whether the low-level planner actually works.

Let us describe an example. A robot should kick a ball into the goal. According to the PDDL definition, the robot must first approach the ball and then kick it. The solver calculates some options for how this can be done, but it does not run a low-level simulation. It generates only the symbolic plan “moveto(ball), kick(to goal)”, and it is up to the low-level planner to realize these subgoals.
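The symbolic level of the kicking example can be sketched in Python. This is not PDDL and not a real PDDL solver, only a breadth-first search over precondition/effect rules written in the same spirit. The fact strings, the `ACTIONS` table and the function `plan` are assumptions made for this illustration.

```python
from collections import deque

# Hypothetical STRIPS-like actions: name -> (preconditions, add list, delete list).
ACTIONS = {
    "moveto(ball)": (set(), {"robot at ball"}, set()),
    "kick(to goal)": ({"robot at ball"}, {"ball in goal"}, set()),
}


def plan(initial, goal, actions):
    """Breadth-first search over symbolic states; returns a list of action names."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:                 # all goal facts hold
            return steps
        for name, (pre, add, delete) in actions.items():
            if pre <= state:              # precondition satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None                           # no plan found


print(plan(set(), {"ball in goal"}, ACTIONS))
# ['moveto(ball)', 'kick(to goal)']
```

No motion is simulated here: the planner only manipulates facts, and each action name in the result would be handed to the low-level planner.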

What I am trying to explain is that the small horizon of an RRT planner and the limited number of nodes in the graph are not a problem. It is enough if the planner is able to reach a goal inside the horizon of 2 seconds, because long-term goals are provided by the symbolic PDDL planner.

I am not the first to describe this idea. In 2010 a paper was published which describes a PDDL planner for high-level goals in playing Towers of Hanoi, and a low-level planner which executes the subgoals on a robot. As in the example with the kicking robot, the low-level motion planner has a limited horizon and is not able to plan over a longer time period.

As a high-level description language the authors use C+ (not C++), which is an action description language. It works similarly to the PDDL syntax, which means the effect of an action is given on a symbolic level. The idea can be generalized. The high-level task is comparable to a text adventure that is modeled with object-oriented programming. Performing an action there is very simple. For example, an action named “pickup” results in the object being picked up. The concrete realization is not defined; only the corresponding attribute of the symbolic game engine is set to true. The low-level tasks are realized with a motion planner which uses the well-known RRT algorithm. Two motion primitives are given: “move” and “rotate”. The overall system works in a surprisingly simple way. First the high-level planner calculates a plan, for example:

move, rotate, move, move, rotate

and then the low-level motion planner executes the primitives one after another.

So-called “task and motion planning” is a powerful technique which combines a low-level RRT planner with a high-level PDDL planner. It results in a robot system which can solve complex tasks without using too much CPU time.

Creating a second physics engine on top of Box2D
Box2D is one of the fastest physics engines available. It is possible to create 1000 instances of Box2D and calculate the next step for each of them in under a second. This concept is called kinodynamic planning, because the Box2D engine is sampled randomly. But perhaps there is an alternative with higher performance? The idea is to create a second physics engine on top of Box2D with the aim of predicting future states. For example, if the robot is near the ball and pushes it, a certain reaction will follow. The reaction can be calculated with Box2D with full accuracy, or it can be handcrafted in a symbolic simulation. The idea is not to create a behavior tree which tells the robot what to do; the idea is to predict what will happen after a certain action is executed.

The advantage is that a handcrafted physics engine is faster. It is possible to calculate 1 million instances of it in under a second.
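What such a handcrafted prediction could look like is sketched below in Python. The example assumes a one-dimensional world in which robot and ball occupy integer cells; the action names and the functions `predict` and `rollout` are hypothetical and do not use Box2D at all.

```python
def predict(state, action):
    """Hand-crafted forward model: return the predicted next state."""
    robot, ball = state["robot"], state["ball"]
    if action == "push_right" and ball == robot + 1:
        # The robot is directly left of the ball: both move one cell.
        return {"robot": robot + 1, "ball": ball + 1}
    if action == "move_right" and ball != robot + 1:
        return {"robot": robot + 1, "ball": ball}
    if action == "move_left" and ball != robot - 1:
        return {"robot": robot - 1, "ball": ball}
    return dict(state)  # the action is blocked or has no effect


def rollout(state, actions):
    """Predict the state after a whole sequence of actions."""
    for action in actions:
        state = predict(state, action)
    return state


print(rollout({"robot": 0, "ball": 2}, ["move_right", "push_right", "push_right"]))
# {'robot': 3, 'ball': 4}
```

The rules only say what will happen after an action, not which action the robot should choose. Choosing is still left to a planner that samples this model.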

Blackboard architecture
In the context of case-based reasoning, so-called “blackboard architectures” are often cited. The idea is surprisingly simple. The program stores textual information in a class, and other parts of the program have access to it. An implementation in C++ is a std::vector which can hold the lines of a plain-text file. But why is this important?

Usually programming is object-oriented, which means the programmer creates classes that contain methods and attributes. Then he starts the program and it does something. This approach is not flexible enough for agent programming, because an agent has to decide at runtime between different actions and between different attributes. For example, the robot enters a new level and sees that there are two enemies in the room, so it creates two new attributes on its blackboard. It also sees that there is a box which it can push, so it creates a new method called push. In classical C++ programming this behavior is hard to implement, because all classes and methods must be available before compilation. A blackboard system is a kind of self-modifying software which creates the runnable code at runtime.

To be more specific, the content is not executable code but mostly plain text. On top of the blackboard a parser interprets the text and executes methods in the normal C++ program. A second important feature is that a blackboard can also be shown to the user. It is a kind of shared communication medium, so the user is able to see what the agent is doing right now and what its goals are.

Again, from a technical perspective a blackboard is very simple. It is simply a large array which contains plain text, and programming it takes 2 lines of code. The advantages appear when other features are implemented around the blackboard; in the literature such a system is called an agent-based blackboard system. It is built on top of object-oriented software and realizes artificial intelligence.
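A blackboard with a small parser on top can be sketched as follows. The sketch is written in Python rather than C++ to keep it short. The class `Agent`, its two methods and the line format “command argument” are assumptions made for this illustration.

```python
class Agent:
    def __init__(self):
        self.blackboard = []   # the blackboard itself: a plain list of text lines
        self.log = []          # records what the agent actually did
        # Methods that a blackboard line is allowed to trigger.
        self.commands = {"moveto": self.moveto, "push": self.push}

    def moveto(self, target):
        self.log.append(f"moved to {target}")

    def push(self, target):
        self.log.append(f"pushed {target}")

    def run(self):
        """Interpret each blackboard line as '<command> <argument>'."""
        for line in self.blackboard:
            command, _, argument = line.partition(" ")
            if command in self.commands and argument:
                self.commands[command](argument)
            # Any other line is treated as a note and is not executed.


agent = Agent()
agent.blackboard += ["enemy1 visible", "enemy2 visible", "moveto box", "push box"]
agent.run()
print(agent.log)  # ['moved to box', 'pushed box']
```

The notes about the two enemies stay on the blackboard as plain text, so they can be shown to the user or used by other parts of the program.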

Cognitive architecture vs. robotics challenges
The history of Artificial Intelligence is dominated by Artificial General Intelligence, psychology and cognitive architectures. A common concept is to define a blackboard and a long-term memory, on which a symbolic reasoner then acts intelligently. This is pseudo-science because it can never fail. An AGI-like cognitive architecture is always true and cannot be developed into something new. The idea is to prove that the concept works, and the proof is done mathematically.

The better alternative is to think like an engineer who wants to solve problems. The technology is measured by how useful it is in real life. First the problem is defined, for example guiding a robot through a maze, and then the algorithm is programmed around this challenge. The different technologies are grouped into classes by the problems they solve. That means group 1 contains algorithms which can solve maze-like problems, group 2 contains algorithms and software packages which can solve RoboCup-like games, and group 3 contains an algorithm which is able to play Angry Birds.

The big picture
The main problem in robotics is programming an Artificial Intelligence. A human operator has all the knowledge, but his decisions are not stored in software. What a good AI can do is emulate the human operator. This is called “tracking control”. The idea is that the software replays the motion of the teleoperator. Implementing tracking control requires the following subtechniques:


 * Parameterized motor skills (a method in source code which can be controlled with different parameters)
 * RDF triples (storing textual commands)
 * keyframes

Let us describe an example. The human operator uses the teleoperation mode of a robot gripper to grasp a ball. While he controls the system with a joystick, the movements are translated into keyframes (one keyframe every second), RDF triples (an image-to-action parser) and parameterized skills (the translation of RDF triples into low-level commands). The human operator handles a concrete task and at the same time a machine-readable recording is created. After the demonstration the task can be executed autonomously in replay mode. The planner uses the stored keyframes, RDF triples and low-level motor skills to bring the system into the desired state.

The concept can be called “story-based tracking control”, because the demonstration of the human operator is translated into a machine-readable story. It contains all the steps of a workflow, and it is possible to tell the story again with small modifications.

= Semantic event chain =

Keyframe based task model
In the PDDL literature, a so-called “action model” is the high-level description of a task. An action model can consist of motion primitives like grasp and move. The main problem is how to describe an action model in a machine-readable form. One way of doing so is annotated keyframes. This concept is usually used for describing tasks to humans. For example, in a tutorial on how to assemble a shelf, keyframes visually describe the steps until the shelf is built.

The concept itself is not a ready-to-run algorithm; it is more an idea of how the software can work. The task consists of keyframes, and under each keyframe a short description in natural language is given which describes the possible actions. The next question is how to transfer this human-readable tutorial into low-level motor commands. That is indeed a problem which is unsolved right now. But there is one advantage: with this definition a domain can be separated into two layers, a high-level keyframe-based task model and a low-level execution of the action model.

In the literature such a concept is used sometimes, but not very often. The problem seems to be that it is unclear whether it makes sense or not. Keyframes have one major advantage: they can describe any domain. No matter whether the task is “pick&place”, “dexterous grasping”, “kitchen robotics” or “autonomous driving”, in all cases it is possible to describe the situation with keyframes, at least for human operators in a manual. The more difficult challenge is to convert these descriptions into a machine-readable planner specification. For example, even if a robot has the keyframes for the pick&place task, it is not able to reproduce the task autonomously.

I would not call annotated keyframes an algorithm for robot control; they are more a software engineering technique like UML or case-based reasoning, which gives a general idea of how to develop software. It is a specification which has to be implemented in runnable C++ code.

Time pattern
Keyframes are snapshots over time. For example, one keyframe can be taken every 2 seconds. The idea is to structure a task into chunks and to plan an action from one keyframe to the next. An array of keyframes is called a spatio-temporal data structure. This sounds more complicated than it is; spatio-temporal only means “everything”. Any domain is spatio-temporal.

Semantic event chains
Keyframes have a timecode and an annotation. A machine-readable annotation is called a semantic event chain. The syntax is based on RDF triples, for example “left hand grasps ball”. The enriched keyframes now have the structure:


 * 1) timecode, e.g. 10sec, 20sec
 * 2) keyframe with absolute positions of the objects
 * 3) annotation in RDF format, e.g. “left hand grasps ball”
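The three fields of an enriched keyframe can be written down as a small data structure. The following Python sketch is only one possible layout. The class `Keyframe`, the coordinate values and the predicate “moves to” are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Keyframe:
    timecode: float     # seconds since the start of the demonstration
    positions: dict     # object name -> absolute (x, y) position
    annotation: tuple   # RDF-like triple (subject, predicate, object)


demo = [
    Keyframe(10.0, {"left hand": (2.0, 1.0), "ball": (5.0, 1.0)},
             ("left hand", "moves to", "ball")),
    Keyframe(20.0, {"left hand": (5.0, 1.0), "ball": (5.0, 1.0)},
             ("left hand", "grasps", "ball")),
]


def event_chain(keyframes):
    """Return the annotations ordered by timecode, i.e. the semantic event chain."""
    return [kf.annotation for kf in sorted(keyframes, key=lambda kf: kf.timecode)]


print(event_chain(demo))
```

Keeping the absolute positions next to the triple means that the same record can serve both the symbolic layer (the annotation) and the geometric layer (the positions).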

Task model as semantic event chains
The number of papers on Google Scholar about the topic “semantic event chains” (SEC) is around 50; most of them were published between 2012 and 2015. I want to summarize the broad idea.

Modern robotic software contains a “task and motion planning” module. That is a two-layer architecture which has a symbolic level (task planner) and a geometric level (motion planner). A SEC is located in the task planner. The general idea is to invent a pick&place game which has a small state space. The game contains two hands (left hand, right hand), some objects (ball, bottle, hammer) and actions (move, grasp, pick, place). The SEC game has the same structure as Maniac Mansion: at the top the user sees a graphical representation, while at the bottom he can enter textual commands. One of these commands can be “left hand grasp ball”. The syntax of the user commands is similar to the RDF syntax, which means subject, verb, object. The idea is to make the textual commands easy to parse.

The inner workings of the game engine in the SEC game are not very complicated. It is not a realistic Box2D physics engine; the game has more in common with simple cooking games realized in Flash and HTML5. That means a command like “grasp” brings the hand over the object and plays back an animation. The general idea is that the human player can trigger actions with textual commands. It is an interactive environment like in bad computer games from the 1990s.

Around the SEC game some helper functions are realized, such as a trajectory tracker and the recording of motions in cases. The overall idea is to separate the robot control system into a symbolic high-level game and a low-level geometric planner. For example, if the SEC high-level game outputs a sequence like “left hand move to ball, left hand grasp ball, right hand move to ball, right hand grasp ball”, this sequence can be used as input for the low-level planner, which has to execute the movements on the physical robot. So the SEC is simply an answer to the task and motion planning problem. The grounding works in such a way that the human operator interacts with an abstract high-level game, and the low-level planner has to realize these commands.
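Parsing such subject-verb-object commands is easy because the vocabulary is small. The following Python sketch assumes the hands and objects named above and uses the verb form “move to” from the example sequence. The function names `parse_command` and `parse_sequence` are hypothetical.

```python
HANDS = ("left hand", "right hand")
VERBS = ("move to", "grasp", "pick", "place")
OBJECTS = ("ball", "bottle", "hammer")


def parse_command(text):
    """Split a command such as 'left hand grasp ball' into (subject, verb, object)."""
    text = " ".join(text.lower().split())     # normalise case and whitespace
    for hand in HANDS:
        if text.startswith(hand + " "):
            rest = text[len(hand) + 1:]
            for verb in VERBS:
                if rest.startswith(verb + " "):
                    obj = rest[len(verb) + 1:]
                    if obj in OBJECTS:
                        return (hand, verb, obj)
    raise ValueError(f"cannot parse command: {text!r}")


def parse_sequence(text):
    """Parse a comma-separated sequence of commands into a list of triples."""
    return [parse_command(part) for part in text.split(",")]


sequence = ("left hand move to ball, left hand grasp ball, "
            "right hand move to ball, right hand grasp ball")
for triple in parse_sequence(sequence):
    print(triple)
```

Each resulting triple could then be passed to the low-level planner, which has to turn it into a concrete movement.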

= Task Planning =

Hierarchical Planning
Under the name “DARRT” a study was published which explains a multi-modal RRT-like planner. It addresses the same problem as “task and motion planning”, which means that there are high-level actions like grasp and push, and inside these actions subactions are executed. The vanilla answer to the problem is to solve each level of the hierarchy separately (page 5 of the paper): first plan a sequence of modes (classical PDDL solver), then plan the motion sequence for each mode.

A further publication also discusses multi-modal planning. On page 26 it shows a so-called “mode graph”. The figure looks similar to the 3D chessboard in Star Trek TNG. There are 4 levels which represent the modes. On each level a normal RRT graph is generated, and additionally the robot can switch between the modes.

The concept can be explained in detail with an example. A walking robot can use different gait patterns, for example a climbing mode or a running mode. To walk through a given terrain only a certain mode makes sense. For example, if the robot is on a hill and tries to run, it will fail.

Whether multi-modal planning is really something new is unclear. Perhaps it is only a constraint that guides the planning process. Let us go into the details of how a multi-modal RRT planner works. We start at a point. The robot has different choices of what it can do. According to the RRT algorithm, a uniformly sampled node is added and the graph grows. In the simplest case the new node remains in the same mode. That means the robot is in climbing mode and explores what will happen if it climbs again. In the second case, it explores what will happen if it changes the mode to running. There are two possible outcomes. The mode switch can succeed, which results in the RRT graph growing onto a second layer. Or the exploration fails, which means the new running mode is blocked. A video on YouTube shows the multi-modal planner MOPL in action. The robot is able to push a bottle of water and can also grasp it. The result is human-like behavior.

Sokoban Example
In a simple example, two modes are given: run, which moves the player by 4 steps at once, and push, which moves the player by 1 step. A possible plan looks like:

run, run, run, (player enters room A), push, push, (player has moved a block), run, run, (player enters room B).

At the start, the player has two choices: run or push. So the question is which of them leads to the goal. Multimodal planning is not the answer; it is only the problem description. It is up to the programmer to define rules that state in which situation which mode is appropriate. If he does not define such rules, the state space is huge. A possible rule could be that “push” is only allowed if the player is near a block.

Let us take a step back. A multi-modal planning problem is equal to a task and motion planning problem. There are several layers which matter. On the high-level layer the decision between push and run has to be taken; on the lower level the details of an action matter, for example in which direction the player should run. To make it clearer: the player can push in 4 directions (left, right, down, up) and can also run in 4 directions. If no additional constraints are given, the player can choose between 4+4=8 actions in each timestep. After 10 timesteps the number of possible plans is 8^{10}.
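The size of the plan space, and the effect of a rule such as “push is only allowed near a block”, can be checked with a few lines of Python. The function names are hypothetical, and the boolean flag `near_block` stands in for whatever test a real game engine would provide.

```python
MODES = ("run", "push")
DIRECTIONS = ("left", "right", "down", "up")


def unconstrained_actions():
    """All (mode, direction) pairs: 2 modes x 4 directions = 8 actions."""
    return [(mode, direction) for mode in MODES for direction in DIRECTIONS]


def allowed_actions(near_block):
    """Apply the rule that 'push' is only allowed if the player is near a block."""
    return [(mode, direction) for mode, direction in unconstrained_actions()
            if mode == "run" or near_block]


def count_plans(n_actions, timesteps):
    """Number of action sequences if n_actions are available in every timestep."""
    return n_actions ** timesteps


print(count_plans(len(unconstrained_actions()), 10))   # 1073741824
print(count_plans(len(allowed_actions(False)), 10))    # 1048576
```

So 8^{10} is about one billion plans. If the player is never near a block, the rule leaves only the 4 run actions, and the count drops to 4^{10}, roughly one million.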

Sub-Task reward
The main problem in reinforcement learning is the lack of sub-task rewards. If the agent only gets a score after finishing the level but must fulfill subtasks like “take a key” on the way, the number of possible action sequences is too high for the agent to ever succeed in the game. In the above-cited paper, natural language instructions were used to provide rewards before the agent reaches the end of a level.

Is this technique also successful in automated planning? Yes, hierarchical planning and sub-task rewards are the same idea. In both cases the idea is to execute tasks inside the task. That means the goal is not only to reach a certain state at a certain point in time, but to fulfill subtasks, each of which has its own subgoal.

Let us take an example from Atari gameplay. A subtask may be “climb down the ladder”. The interesting aspect of this task is that, according to the rules of “MONTEZUMA’S REVENGE”, the agent gets no score for doing so. It is something the agent can do, but nothing which is rewarded. So why should the agent fulfill the task? This was the bottleneck in the early reinforcement learning systems implemented by DeepMind. Because the game did not reward the action, the agent did not climb down the ladder, and as a result it failed to solve the level. That means the level contains actions which are not rewarded, but the agent must execute them, because otherwise it fails to solve the level.

From the perspective of agent programming, it is important to specify all the sub-tasks explicitly. One way to do this is natural language. This is very easy. First we need a list of subtasks, and we define what the purpose of each is. The original game rules are then extended by the subtasks. In the best case a score is also defined for each. The improved version of the game contains the subtasks as rules, and now the agent is able to solve the game successfully.
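Extending the game rules by scored subtasks can be sketched as a small reward function. The subtask names follow the examples in the text, while the score values and the names `SUBTASK_REWARDS`, `LEVEL_REWARD` and `shaped_reward` are assumptions made for this illustration.

```python
# Hypothetical subtasks for a Montezuma-like level, each with a hand-defined score.
SUBTASK_REWARDS = {
    "climb down the ladder": 1.0,
    "take a key": 5.0,
}
LEVEL_REWARD = 100.0


def shaped_reward(event, completed):
    """Return the reward for an event; each subtask is rewarded only once."""
    if event == "level finished":
        return LEVEL_REWARD
    if event in SUBTASK_REWARDS and event not in completed:
        completed.add(event)
        return SUBTASK_REWARDS[event]
    return 0.0


completed = set()
episode = ["climb down the ladder", "climb down the ladder",
           "take a key", "level finished"]
print([shaped_reward(event, completed) for event in episode])
# [1.0, 0.0, 5.0, 100.0]
```

Rewarding each subtask only once is a design choice in this sketch; without it the agent could collect score by climbing the ladder up and down forever.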

Not only Atari games contain subtasks which are not formally part of the game; every other game does too. The player acts, and after a time lag he gets a reward. The best example is perhaps the arcade game Pong. There is no explicit reward for moving the paddle in the direction of the ball. But if the player does not do so, or does it too late, he will lose the game. That means Pong has subgoals too, but they are not formally part of the game rules. There are two forms of subgoals: hidden subgoals and explicit subgoals.

Planning as core technology for AI
What is the general principle for building an Artificial Intelligence system? In the classical AGI literature the term cognitive architecture is often used, which describes the system itself. It consists of an episodic memory, a long-term memory and a visual buffer. This is not the best choice if the aim is to get working software. Strong AI and AGI are a kind of anti-pattern for how to never get a robot.

The better approach is a planning-centric system. The general idea is to divide the problem into a high-level textual layer, which is comparable to a text adventure, and a low-level graphical layer, which is equal to an inverse kinematics planner. On the high-level layer, so-called symbolic planners like PDDL and HTN planners can be used; for the lower level a randomized planner like RRT (rapidly-exploring random tree) is the best choice.

Such a planner may show no sign of intelligence, and indeed it is an engineering approach, not a philosophical one. The planner is built around a problem. It can be programmed in C++ or any other programming language. The challenge is to make the system fast. It is very easy to construct a planner which needs too much CPU time to answer a query. I think what AI engineering has to answer is what exactly a task and motion planner looks like that has to solve a certain problem, for example for a kitchen robot or an autonomous car.

What is wrong with AgentSpeak
AgentSpeak is a high-level planning framework for BDI agents. It is often used together with Jason, a Java-based interpreter for implementing real agents. The problem with AgentSpeak is that it was invented around a programming language. It is a kind of improved PDDL parser with a plan library. Is it possible to encode all problems in AgentSpeak? No, only symbolic problems can be described in the syntax. Or to be more specific: AgentSpeak is a framework for programming a text adventure. It can be used for programming a Zork clone in which the player is in a room and can open doors, all from the command line.

The better approach is not to use AgentSpeak but to define what the aim is. A text adventure can be programmed far more efficiently with a normal object-oriented language like C++. There is no need for a dedicated planning language. Instead the bottleneck is the grounding, which is the transition between the high-level layer and the low-level layer for execution in real games.

The problems of programming an interface between a game and the agent framework are described in the literature. For example, StarCraft has the BWAPI interface (BW = Brood War, API = application programming interface). BWAPI maps commands to the game and is used for retrieving information about the game.

Dialogue systems
Text adventures can be seen as dialogue systems. The human operator enters commands and the system reacts. The idea is that an AI supports the planning process. The interesting point is that the dialogue system itself does not contain Artificial Intelligence. That means it is something different from an AgentSpeak definition. Instead, the dialogue system is equal to a text adventure, which is a manually controllable game. The difference from a jump'n'run game is that all the states and actions are defined on a symbolic level.

Something else is interesting: the game engine of a text adventure can be programmed with a PDDL-like syntax. That means PDDL is not used for solving the game; it is the counterpart which reacts to the input of the player. For example, the player enters “Take the key” on the command line. First the parser recognizes the command, and then the game engine checks whether the player is near the key (that is the precondition). If this is the case, the key is put into the player's inventory.
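The “Take the key” example can be sketched as a tiny game engine in Python. The sketch interprets “near the key” as “in the same room as the key”, which is an assumption. The class `TextAdventure`, the room names and the reply strings are also hypothetical.

```python
class TextAdventure:
    def __init__(self):
        self.player_room = "hall"
        self.key_room = "hall"      # the key lies in the same room as the player
        self.inventory = set()

    def take_key(self):
        # Precondition: the key is not yet taken and the player is in its room.
        if "key" in self.inventory or self.player_room != self.key_room:
            return "You cannot take the key."
        # Effect: the key moves into the player's inventory.
        self.inventory.add("key")
        self.key_room = None
        return "You take the key."

    def execute(self, command):
        """Very small parser: normalise the text and look up the action."""
        words = command.lower().replace("the ", "").split()
        if words == ["take", "key"]:
            return self.take_key()
        return "I do not understand."


game = TextAdventure()
print(game.execute("Take the key"))   # You take the key.
print(game.execute("Take the key"))   # You cannot take the key.
```

The engine only reacts to input. It contains the precondition and the effect of the action, but nothing that decides which command to enter.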

I think this fact is important, because PDDL is usually described as a planning tool for solving such games, not as part of the game engine itself. That means it is possible to program a game engine which includes PDDL, but the resulting text adventure has no built-in Artificial Intelligence. So what is the difference between a game engine and a planner for the game engine? Let us take a step back.

First a domain has to be described as a game. A robot domain contains a text-adventure-like layer on top and a low-level inverse kinematics layer at the bottom. The text adventure game engine is programmed with a combination of PDDL syntax, object-oriented programming and Prolog-like languages. But even if these advanced technologies are used, the resulting game is controlled by a human operator. That means it has no Artificial Intelligence.

The planner for solving the game autonomously is something outside the game engine. That means that if the game was programmed carefully, the AI planner can be small. It is a simple brute-force algorithm which tries out some alternatives and relies on the pre-programmed actions and states.
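How small such a planner can be is shown in the following Python sketch. The two-room game `DoorGame` and its three actions are hypothetical and only serve as a stand-in for a carefully programmed game engine; the planner itself is the function `brute_force_plan`.

```python
import copy
import itertools


class DoorGame:
    """Hypothetical game: take the key, open the door, then walk to room B."""
    ACTIONS = ("take key", "open door", "goto room B")

    def __init__(self):
        self.room, self.has_key, self.door_open = "A", False, False

    def step(self, action):
        # Each action only has an effect if its precondition holds.
        if action == "take key" and self.room == "A":
            self.has_key = True
        elif action == "open door" and self.has_key:
            self.door_open = True
        elif action == "goto room B" and self.door_open:
            self.room = "B"

    def solved(self):
        return self.room == "B"


def brute_force_plan(game, max_length=4):
    """Try every action sequence, shortest first, on a copy of the game."""
    for length in range(1, max_length + 1):
        for candidate in itertools.product(game.ACTIONS, repeat=length):
            trial = copy.deepcopy(game)
            for action in candidate:
                trial.step(action)
            if trial.solved():
                return list(candidate)
    return None


print(brute_force_plan(DoorGame()))
# ['take key', 'open door', 'goto room B']
```

The planner knows nothing about keys or doors. All domain knowledge sits in the preconditions inside `step`, which is why the search code can stay this short.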

RRT pushplanning
One paper describes how to use AI planning in a push manipulation task. The core algorithm is not based on a cognitive architecture but on a good old RRT / A* path planner which is used for sampling a physics engine. As a bonus, the paper contains a hierarchical planner which divides the problem into chunks of work: contact planning and push planning. The concept was described earlier in the literature. The idea is to sample motion primitives in an RRT graph and obtain a plan to a goal.

It is hard to say whether this technique works, because it is relatively new. Path planning has been researched in detail in the literature of the last 30 years, but using the same concept for motion planning is the exception. The main problem is the concrete programming of such a planner; for example, the Box2D engine has to be used together with an RRT graph. A second problem is the grounding problem, which means combining the subproblems into an overall solution. If I have understood the idea correctly, the aim is to activate random motion primitives, generate a manipulation graph and then take a solution from that graph.

Let us go into the details of a hierarchical planner. Suppose the robot is at place A and the goal is to bring it to place B with an object in its hand. How can we do this? The idea is to use some kind of planning algorithm, but a normal planner will run into problems because of the huge problem space. So there is a need for a hierarchical planner: one level plans only the movement in space, the other plans how to take an object. This looks complicated, and it is.



The robot has to go to the object, take the object and then go to the goal. It is a hierarchical planning problem. On the lower level a geometric path planner is needed, and on the higher level a planner for solving a text adventure. Both types of game engine can be sampled: it is possible to enumerate all possible moves in the geometric space and also all possible actions in the text adventure. For example, in the text adventure layer the robot has the following possible actions: moveto(object), moveto(goal), moveto(place). An action like moveto(object) starts the geometric planner on the lower level.
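The two levels can be sketched in a few lines of Python. This is only an illustration under assumptions: the 5x5 grid, the blocked cells, the place coordinates and the function names `grid_path` and `execute` are hypothetical, and a breadth-first search stands in for the geometric planner.

```python
from collections import deque

def grid_path(start, goal, blocked, size=5):
    """Low-level geometric planner: breadth-first search on a small grid."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in blocked and nxt not in parent:
                parent[nxt] = cell
                frontier.append(nxt)
    return None  # no path exists

def execute(plan, robot, places, blocked):
    """High-level layer: every symbolic moveto starts the geometric planner."""
    trace, holding = [], False
    for action, arg in plan:
        if action == "moveto":
            path = grid_path(robot, places[arg], blocked)
            if path is None:
                raise ValueError("no path to " + arg)
            trace.extend(path[1:])
            robot = places[arg]
        elif action == "take" and robot == places[arg]:
            holding = True
    return robot, holding, trace

# Hypothetical scene: object at (2, 0), goal at (4, 4), a wall in between.
places = {"object": (2, 0), "goal": (4, 4)}
blocked = {(1, 1), (2, 1), (3, 1)}
plan = [("moveto", "object"), ("take", "object"), ("moveto", "goal")]
robot, holding, trace = execute(plan, (0, 0), places, blocked)
```

The symbolic plan has three steps, while `trace` holds the individual grid cells the lower level visited to realize them.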

Task and motion planning
What is the underlying principle of an intelligent robot? Is it perhaps Artificial General Intelligence and cognitive architectures? No, because they were not invented with engineering in mind; they are pure ideology. The better alternative is called “Task and motion planning” and is discussed in detail in the literature, for example by. The main idea is to solve a robot problem with a path-planner-like algorithm that is extended with a symbolic level. Does it sound complicated? Yes, it is: task and motion planning is notoriously difficult to grasp and to implement in real software. The paper cited above (published in 2018) is also not very accurate. But it is the right direction. In principle, task and motion planning will be able to control a robot.

Let us analyze the paper in detail. On page 6 the authors describe the high-level planner. They use the PDDL/STRIPS language, in which any action can have a precondition and an effect. Unfortunately, PDDL is not very advanced; it was designed with AI in mind. Let us describe what the real problem is. The problem is not how to solve a given problem; instead, the task is to program a text adventure, which is also called a dialogue-based system. For example, the user enters the following commands:

user: takealook
system: there is a key.
user: takekey
system: got it
user: walknorth

and so on. This system has no AI; it is only a normal text adventure. In the background, an object-oriented inventory system manages the simulated world. PDDL is sometimes used for emulating such games. In that role it is not a planning tool but a text adventure creation tool. And yes, PDDL is not the best choice: the easiest option for programming a dialogue system is object-oriented programming. Again, the AI aspect is not needed; the user has to enter the commands manually.

Such a dialogue system is important for a robot system because the game reduces the state space. In this reduced state space a planner can be executed that fuzzes the different options the user has. It is simply a brute-force planner running against the dialogue system.
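A minimal sketch of the two steps, assuming a tiny world of my own invention: the room names, the rule that the way north requires the key and the class name `Adventure` are hypothetical. The engine itself contains no planning; the brute-force loop sits outside and only calls `step`.

```python
import copy
from itertools import product

class Adventure:
    """Hand-written dialogue system: parses a command, updates the world."""
    COMMANDS = ("takealook", "takekey", "walknorth")

    def __init__(self):
        self.room = "cellar"      # hypothetical start room
        self.key_seen = False
        self.has_key = False

    def step(self, command):
        if command == "takealook":
            self.key_seen = True
            return "there is a key."
        if command == "takekey" and self.key_seen:
            self.has_key = True
            return "got it"
        if command == "walknorth" and self.has_key:
            self.room = "garden"  # hypothetical goal room
            return "you are in the garden"
        return "nothing happens"

def brute_force(engine, is_goal, max_len=4):
    """Try every command sequence up to max_len on a copy of the engine."""
    for length in range(1, max_len + 1):
        for plan in product(engine.COMMANDS, repeat=length):
            trial = copy.deepcopy(engine)
            for command in plan:
                trial.step(command)
            if is_goal(trial):
                return list(plan)
    return None

plan = brute_force(Adventure(), lambda game: game.room == "garden")
```

With three commands and a horizon of four steps, the planner has to test at most 3 + 9 + 27 + 81 = 120 sequences, which is why the reduced state space matters.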

This aspect is not covered by the paper itself. But that doesn't mean that task and motion planning is a bad idea. It is only a question of detail, of how to realize such a system. The general idea of combining a low-level layer with a high-level layer is right; it will result in a working robot control system.

But let us go back to the paper. On page 7 the usage of PDDL is explained in detail. The idea of the paper is to use PDDL for describing a dialogue system, as I mentioned before. That means the PDDL language is used as a text adventure creation kit to define what will happen if the player takes the key. Why do the authors choose PDDL and not an object-oriented language like Python or C++? Because they did not understand what the need is. They mistook a high-level planner for a high-level game engine.

This is a common error, because in AI history it is often unclear which part of a system is done autonomously and which manually. The first step in designing a robot is to create a manually controllable game which includes a high-level and a low-level layer. This can be seen as an adventure game which is controllable by textual input but also has a visual representation. Working examples are Monkey Island and Maniac Mansion. Such games are played by a human operator. For programming them no PDDL syntax is needed; a normal C++ program is enough.

Only the second step has to do with AI. The idea is to run a brute-force sampler against the game engine in order to bring Monkey Island into a certain game state. This is only possible because the game itself is available. It is important to distinguish between the two steps. Step 1 is to create the game itself on a low-level and a high-level layer. Step 2 is about an automated planner which solves the game.

The reason why the paper uses PDDL as a symbolic language is given on page 41. There the aim is to generate the PDDL description backwards from a given game. That means it is not defined by a programmer but learned at runtime.

Literature
The good news about "task and motion planning" is that the literature before the year 2000 is very small. The topic is mentioned only occasionally. This has the advantage that readers do not need to search through extensive literature from that time. After the year 2000 the situation changed: many papers were published on the topic, mostly in the context of computer animation and robotics, sometimes with reinforcement learning in mind, sometimes not. So it is fair to say that task and motion planning is a subject of behavior-based robotics, which was founded by Rodney Brooks around 1990 and is the opposite of the earlier Good Old-Fashioned AI.

Looking backwards, it is possible to describe early AI projects like the Shakey/STRIPS project as task and motion planning. But this perspective isn't given by the literature, because the papers from the 1970s described Shakey in other terms. That means there was a change of mindset in the literature, a new way of describing problems. In most papers from the early 2000s the task and motion planning problem wasn't solved. What the authors did was to describe the problem, to say “yes, it is a task and motion planning problem. It has nothing to do with an episodic memory or a cognitive architecture, instead it is a combination of a lowlevel and a highlevel layer”.

A bit surprising is the fact that "task and motion planning" (TAMP) is not the answer; it is the question. For example, a possible TAMP domain can be a blocksworld game in Box2D in which the robot should bring the game into a certain state. The term TAMP is used to describe the challenge: the game has two layers, and a sampling-based planner can be used. How exactly this should be programmed is open. That means nobody knows the answer. So it makes sense to describe robotics challenges like Micromouse as TAMP problems. It is possible to write a paper or build a real robot which solves a certain TAMP problem, for example with an RRT-based sampling planner plus a PDDL-like high-level planner. Another possible solution can be reinforcement learning or neural networks. It is up to the user to implement one of these solutions. This is perhaps the most important difference from so-called cognitive architectures like SOAR. SOAR was designed as the answer to a problem, that is, SOAR is some kind of artificial intelligence, while TAMP is a problem.

Symbolic planning with PDDL
Most papers explain the task and motion planning problem with a two-layer architecture. The high-level layer is driven by PDDL, while the low-level layer is handled by a sampling planner. But why is PDDL used for the high-level layer? Perhaps the idea is to use a language which can be solved by a planner, but in my opinion PDDL is a bad choice for implementing a high-level layer.

Let us first describe the goal of a high-level layer. The idea is to interact in an abstract state space, for example with textual commands like moveto and grasp. PDDL is used for implementing this behavior: according to the PDDL description, the command moveto has a certain effect. A high-level layer is an abstract form of a computer game. For example, the Sokoban game can be seen as an abstract pick&place game. Sokoban itself has no accurate Box2D engine; instead the movement actions are simplified. Is PDDL a general-purpose language for inventing abstract games? No, normal C++ source code is better suited for that purpose. It is possible to describe text adventures, Sokoban and other games in C++. And the feature of PDDL that a solver is available is not very important, because with a sampling strategy any game can be sampled.

I see the importance of dividing the robot control problem into two layers: high-level symbolic and low-level geometric. But I think both layers can be programmed in C++. PDDL is not necessary. Let us describe how to develop each layer. The dominant task of a layer is to work as a prediction engine, which is the same as a game engine. The player gives a command, for example “moveto”, and the engine brings the system into a new state. Such prediction steps can be done on the symbolic level and on the geometric level. On the geometric level a Box2D-like engine is the best choice in most cases. For the high-level layer, either a text adventure dialogue system or an abstract Sokoban engine is a good idea. So in reality we need two separate games, one for each layer.

Perhaps a small example helps. Suppose we have a pick&place task. The high-level game works on a simple map with commands like moveto, opengripper, closegripper and so forth. The position of the object is discrete, as in Sokoban. A solver brings the game into a goal state, and the steps towards that goal are called keyframes. The low-level layer takes the keyframes and calculates, for a Box2D engine, the forces that bring the system into each subgoal state.

But let us go back to the paper. In general the same idea is used, with one exception: the high-level layer is implemented in PDDL. That means PDDL is used for implementing a game engine which predicts future states. That is not totally wrong, but a normal C++ program can do this task more efficiently. The only thing a high-level layer must do is react to an input. If the player enters, for example, “moveto(3,2)”, the engine must update the position of the robot to that absolute position. In reality the movement is more complicated, because first the steering wheel has to be adjusted and so forth. But the high-level layer can abstract from these details. It is fine to set the absolute coordinates directly.
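How little such a high-level layer has to do can be shown with a short sketch. The class name `HighLevelEngine` and the reply strings are hypothetical; only the command “moveto(3,2)” is taken from the text above.

```python
import re

class HighLevelEngine:
    """Symbolic layer: reacts to a textual command by setting the new state."""

    def __init__(self):
        self.robot = (0, 0)

    def command(self, text):
        match = re.fullmatch(r"moveto\((\d+),\s*(\d+)\)", text.strip())
        if match is None:
            return "unknown command"
        # No steering, no physics: the absolute position is simply overwritten.
        self.robot = (int(match.group(1)), int(match.group(2)))
        return "robot is at %d,%d" % self.robot

engine = HighLevelEngine()
reply = engine.command("moveto(3,2)")
```

All details of the real movement are left to the lower layer; the engine only predicts the resulting symbolic state.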

The same wrong idea of using PDDL for implementing the high-level planner is given by. On page 4 the construction of the symbolic domain is explained with PDDL. A simple kind of game is described in complex source code. Sure, it works, but C++ would be easier to implement. The symbolic game contains some action primitives, and after executing them a variable in the game is set. This can be done much more easily with C++. I think the fact that most authors prefer PDDL for the high-level layer has historical reasons. They grew up with Shakey and STRIPS, and they believe that a symbolic game can only be implemented in an Artificial Intelligence language like LISP, Prolog or PDDL. They believe that predicate logic is absolutely necessary to solve the description with a sampling-based planner. But it is not. A text adventure programmed in C++ can also be solved by a brute-force sampler. The only precondition is that the state space is not too big. But let us go back to the paper. On page 6 a plan is shown which solves the task. In general it seems possible to combine a low-level with a high-level engine. That means the idea of implementing a “Task and motion planner” is correct.

Criticism of PDDL
PDDL is known as a planning language for symbolic domains. According to the literature, it is the number one choice for modeling task and motion planning domains. But have the authors understood what PDDL really is? The hypothesis is that PDDL is not a planning language but a domain-specific language for realizing dialogue systems. A dialogue system is a text adventure which contains symbolic states. It is not necessary to plan such games; it is also possible to use PDDL interactively.

Let us make a small example. We have programmed a pick&place domain in PDDL. The user has some motion primitives like pick, place and opengripper, and according to the PDDL description the system reacts to the input. What can the user do with this PDDL program? He can play a small game: first he opens the gripper, then he picks the ball and so on. What PDDL does is the same as what a text adventure game engine does. It parses the commands and changes the internal game state, for example setting a variable to true if the ball is in the gripper.

In that example, the core function of PDDL (the planning capability) was left out. That means the system works only with the user in the loop. And it works quite well. So what exactly is PDDL?

Now let us analyze whether C++ is only an imperative language or whether it can also be used as a planning language. Suppose we have programmed the same pick&place domain in C++ with objects and classes. That means we have three classes, for the ball, the robot and the gripper, and the user can send commands to them. The program reacts to the commands and changes variables. Now the idea is to run the program as a planner. We define a goal state, for example that the ball should be in the gripper, and a random sampler tries out different actions until the goal is reached. Will it work with C++? It works quite well, so in practice C++ is a planning language.

Let us investigate how exactly a C++ program can be utilized in a planning problem. First we need a game-engine class, in which we initialize the sub-objects of the game. That game-engine object can be copied in memory, and on the copy we can test different plans. For example, we can send a random action sequence to it and see what the result will be. Writing a random sampler for a C++ game-engine class is very easy. The brute-force approach can be realized in under 100 lines of code. In that sense C++ is the better PDDL. I would guess that any PDDL domain can also be programmed with C++ classes.
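The procedure is not tied to C++. The following sketch uses Python for brevity and rests on assumptions: the class names, the four action names, the two places "home" and "table" and the sampler settings (2000 tries, plans of six actions, fixed seed) are all hypothetical choices for illustration.

```python
import copy
import random

class Gripper:
    def __init__(self):
        self.is_open = False

class Ball:
    def __init__(self):
        self.place, self.held = "table", False

class Robot:
    def __init__(self):
        self.place = "home"

class GameEngine:
    """Owns the three objects and reacts to textual commands."""
    ACTIONS = ("opengripper", "closegripper", "moveto_table", "moveto_home")

    def __init__(self):
        self.gripper, self.ball, self.robot = Gripper(), Ball(), Robot()

    def step(self, action):
        if action == "opengripper":
            self.gripper.is_open = True
            self.ball.held = False            # an open gripper drops the ball
        elif action == "closegripper":
            if self.gripper.is_open and self.robot.place == self.ball.place:
                self.ball.held = True
            self.gripper.is_open = False
        else:                                 # moveto_table or moveto_home
            self.robot.place = action.split("_")[1]
            if self.ball.held:
                self.ball.place = self.robot.place

def sample_plan(engine, is_goal, tries=2000, length=6, seed=0):
    """Random sampler: run random action sequences on copies of the engine."""
    rng = random.Random(seed)
    for _ in range(tries):
        trial, plan = copy.deepcopy(engine), []
        for _ in range(length):
            action = rng.choice(engine.ACTIONS)
            trial.step(action)
            plan.append(action)
            if is_goal(trial):
                return plan
    return None

plan = sample_plan(GameEngine(), lambda game: game.ball.held)
```

The engine is never modified by the sampler; each trial runs on a deep copy, which is the "copy in memory" step described above.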

We have to distinguish between symbolic game engines (high level) and geometric game engines (low level). A symbolic game engine works like a dialogue system or a text adventure. The player has abstract textual commands and can run these commands against the game engine.

Modelling of Task planning
It is important to describe the aspect of task planning in detail. Sure, PDDL is one option for task modelling, but it is not the best. A task model is, by definition, a high-level symbolic game which consists of actions, and a game engine parses these actions. Task modelling is basically programming a game. It is not done with a realistic physics engine like Box2D but with a qualitative physics engine which fits a certain domain.

An easy example of a task model is a map in which the robot can move around. The game is not accurate; it has only actions like left, right and so on. The game engine first parses one action and then modifies the position of the robot. There is no collision checking and no physics simulation; there is only the map and the robot. A task model can be seen as a prototype for a game. It is quickly programmed and gives a general overview of the game. Most details are missing.

Task modelling is important because it results in subgoals which can be solved by low-level planners. A task model is some kind of keyframe generator which gives a general idea of what the task is. A task model alone is not sufficient to control a real robot. It must be extended by a geometric game engine below.

Let us refer to some examples of task models in the literature to get a general idea. The GOAP architecture (goal-oriented action planning) can be seen as a task model, and so can a dialogue system in a text adventure, a STRIPS model for planning the steps of a robot and, of course, an HTN planning domain. In all these cases a game engine is programmed which is able to parse input commands and bring the system into a next state. On top of the game engine a planner can be used to find out which steps are necessary to fulfill certain constraints.

In one respect the current literature is wrong. Most authors believe that a high-level task model can only be realized in a high-level goal-oriented language like PDDL. That is wrong; a textual game engine can be programmed very well in normal languages like Java and C++. Doing so is the better approach.

The reason has to do with how planning in a task model works. A task planner is simply a sampling solver which tries out random plans against the textual game engine. It becomes fast if the domain is formulated on a high level. If the game engine contains only 12 action primitives and 5 internal variables which can be true or false, it is not hard to test 10k different plans against the engine to figure out how to fulfill the constraints. It is not PDDL that makes a good task planner, but an abstract model of the domain.
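The numbers from this example can be checked with a few lines of arithmetic. The variable names are hypothetical; the figures 12 and 5 are those used above.

```python
ACTIONS = 12            # action primitives of the task model
BOOLEAN_VARIABLES = 5   # internal true/false variables

# Number of distinct symbolic states the engine can be in.
states = 2 ** BOOLEAN_VARIABLES

# Number of different plans with at most n actions, for n = 1..4.
plans_up_to = {n: sum(ACTIONS ** k for k in range(1, n + 1)) for n in range(1, 5)}
```

The engine has only 32 symbolic states. All plans with up to three actions amount to 1884 candidates, and all plans with up to four actions to 22620, so a budget of 10k tested plans covers every plan of length three and a good part of length four.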

Task planning vs. Task modelling
In the literature, PDDL is described as serving both purposes: modelling and planning. The planning aspect is less important, because planning can be done with a sampling-based brute-force algorithm. Its speed depends on the CPU speed, and reducing the problem space helps. In reality the more important problem is task modelling, because the model has to be written as source code, and writing source code is done by humans.

PDDL is the standard choice in task modelling. The idea is to describe a domain with an abstract language. But I doubt that PDDL is really the best idea. As I mentioned in an earlier section, C++ is well suited for programming a text adventure, so it is perhaps the more general task-modelling language.

An open question is what exactly a task model looks like for a certain domain, for example for a dual-arm dexterous manipulator or for autonomous driving. In general a task model is a prototype for a game; it is not realistic and contains no details. Nearly all task models contain motion primitives, which are textual commands the user can activate. They are comparable to ontologies, and I would guess that UML is a good starting point for designing a task model.

In the area of “Learning from demonstration” a task model is generated autonomously. The idea is to observe a human operator and create a task model from his actions. But I doubt that this automatic form of software engineering really works.

A task model is a game engine which is able to predict future states. For example, in a pick&place scenario the game engine answers the command “pickobject” with the text “object is in gripper”. It is a symbolic game engine. The good news is that low-level game engines are available; for example, Box2D can predict very accurately what will happen if the robot gripper moves towards an obstacle. What is missing right now is a high-level task model, that is, something which is derived from the Box2D engine on a symbolic level.

is trying to use machine learning for generating a task model. The paper starts with a pessimistic introduction:

"Acquiring a domain-specific task model is an essential and notoriously challenging aspect of building knowledge-based systems."

The alternative is a manually driven software engineering process which results in a task ontology. By definition the task ontology is created by hand, from scratch and by humans. That means it is done with pen and paper, like ordinary software programming.

Task ontology for cooking game
A task ontology is a theoretical construct. To explain it in detail we need an example. Usually a task ontology is used for modeling games; it is a high-level game engine which predicts the outcome of user actions. A famous example is the cooking domain: on YouTube there are many so-called “Kitchen games” available. They all work with a task ontology. The kitchen games are technically very simple. The player has some objects on the screen, like butter, milk and a mixer. The game is similar to a drawing game in which the objects can be moved. But apart from that visual representation the game contains some tasks like: go shopping, bake the cake, bake preparation and so on.

The player must execute certain steps until a level is completed. The details are described in the ontology. So we can say that the kitchen game is an example of concrete software and the task ontology is the technology in the background. A good starting point for learning more about the subject is to program such a kitchen game from scratch. If this succeeds, the concept of task ontologies becomes clear.

GDL, PDDL and C++
The classical planning languages were STRIPS and PDDL. The ICAPS conference is centered around these ideas. A recent development is the “General Game Playing” approach, which has developed its own planning language, called GDL. GDL is a dialect of KIF (Knowledge Interchange Format). But the more general question is: do we need STRIPS, PDDL and GDL? The hypothesis is that we do not need any of these languages.

Let us describe the purpose for which GDL was invented. It is a programming language for inventing games, for example Tetris, Pac-Man, Pong and Sokoban. Is it possible to program a game in GDL with less source code than in C++? The answer is no: most GDL programs are very complicated, and they are not well suited for programming real games. The main reason why GDL and PDDL are used by the AI planning community is the ability to run a solver against a specification. If a domain is already programmed in GDL or PDDL, it is very easy to answer a question like “What is the next move if the game should reach a certain state?”.

The problem is that this feature is not very important in reality and can be emulated with a C++ program too. Let us describe the situation for C++ in detail. Suppose we have programmed a game engine in C++, for example for Sokoban, and are now interested in planning something. Doing so is surprisingly easy: all the programmer has to do is instantiate the game-engine class in his main program. He gets a temporary game engine, and then he feeds random values into it. The amount of source code for doing so is less than 20 lines. If the programmer is really good, he can program not only a brute-force solver but a dedicated graph search algorithm, like a Rapidly-exploring Random Tree, around the game engine, which speeds up the procedure.
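As an illustration of searching an existing game engine, here is a minimal sketch in Python. It assumes a hypothetical 4x4 Sokoban board with a single box and no walls, and it uses a plain breadth-first graph search rather than an RRT; the function names `step` and `search` are my own.

```python
from collections import deque

MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def step(state, move, size=4):
    """Sokoban rule: walking into the box pushes it one cell further."""
    (px, py), (bx, by) = state
    dx, dy = MOVES[move]
    player = (px + dx, py + dy)
    box = (bx, by)
    if player == box:
        box = (bx + dx, by + dy)
    inside = all(0 <= c < size for c in player + box)
    return (player, box) if inside else state   # illegal moves change nothing

def search(start, box_goal):
    """Breadth-first graph search over the states the engine can reach."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state[1] == box_goal:
            return plan
        for move in MOVES:
            nxt = step(state, move)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [move]))
    return None

# Player at (0, 0), box at (1, 0), and the box should end up at (3, 0).
plan = search(((0, 0), (1, 0)), (3, 0))
```

The search only calls the engine's `step` function, so the same loop works for any game engine that exposes its state and a transition rule.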

What I want to explain is that it is very easy to search an existing symbolic game engine for a plan. This is supported out of the box for PDDL-like languages but can be implemented for C++ game engines too. The real problems lie elsewhere. The first problem is how to program a certain game in source code (creating an action model), and the second problem is what happens if the state space of the game is too large, so that the planner needs too many CPU resources.

Compared to PDDL and GDL, C++ is the superior programming language for implementing games. It is the fastest and most flexible language available. If the aim is to do some AI planning, the programmer can simply create a temporary object of the game-engine class and test random plans. It is a weakness of the ICAPS conference that it did not recognize the limitations of PDDL and GDL. It makes no sense to develop planning algorithms around PDDL.

= Example =

Experimenting with Task and motion planning
According to the literature, integrated task and motion planning is hard in cluttered environments. Let us examine this hypothesis with an example. In the figure a pick&place task is given: a robotic gripper should take object 1, which is located behind object 2. On the geometrical level, the gripper must first push object 2 away, and then it can pick up object 1. Doing the task manually is easy for a human operator, but how can this be done autonomously with a symbolic planner?

Every high-level or low-level planner is first of all a prediction engine. On the geometrical level, the engine is able to determine a future state. If the example is realized in the Box2D physics engine, the coordinates of each object are calculated. For example, the player moves the gripper upwards, and the engine detects a collision with object 2. A symbolic high-level engine has to work according to the same principle. What we need is some kind of dialogue system which calculates the reaction to an input by the user.



I don't know how this can be realized in a computer program, but it is possible to write down the interaction with a fictional prototype.

engine: object 1 and object 2 are a triangle, object 1 is on top, object 2 is in the middle.
engine: ... gripper is on bottom. Your input please!
user: move gripper up
engine: gripper collides with object 2. Your input please!
user: move gripper down
engine: you have reached the same initial situation, your input please!
user: pick object 2 and move it away
engine: done, object 2 is on middle right, a free path to object 1 is available
user: pick object 1
engine: done, gripper is holding object 1

That was, in short, a textual interaction with a cluttered environment. As I mentioned above, it is difficult to realize such a system in source code. One option is to use PDDL syntax for describing future states; another option is to see the game as a text adventure and use object-oriented syntax for the follow-up states. In both cases, the Box2D physics engine is programmed a second time from scratch, but on a symbolic level. This is called a qualitative physics engine.

Now I want to go into the details. The situation in the picture can be programmed with a Box2D engine. The advantage is that Box2D is very realistic. But a sampling-like planner takes too much time to bring the system into a goal state, because every frame has to be calculated as a single step, which takes a lot of CPU time. The idea in task and motion planning is to imagine a second, high-level layer which works on a symbolic level. This second layer is a game with different rules. The game also has a graphical output, but the physics works differently. It is equivalent to a Sokoban game: the blocks are in a matrix, and the user has only discrete actions. For example, he can move to the left of an object and push it away. After he has done so, the object gets its new position. The position is not calculated with mathematical precision but with a simple Sokoban-like algorithm. That means the user pushes an object, and then it is moved.
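A possible form of such a Sokoban-like second layer is sketched below. Everything in it is an assumption made for illustration: the grid coordinates, the command names "up", "push right" and "pick", the reply strings and the class name `QualitativeEngine`. Bounds checking is left out to keep the sketch short.

```python
class QualitativeEngine:
    """Sokoban-like stand-in for the physics engine: grid cells, no forces."""

    def __init__(self):
        # One column, bottom to top: gripper, object2, object1 (assumed layout).
        self.cells = {"gripper": (1, 0), "object2": (1, 1), "object1": (1, 2)}
        self.holding = None

    def object_at(self, cell):
        for name, pos in self.cells.items():
            if name != "gripper" and pos == cell:
                return name
        return None

    def step(self, command):
        x, y = self.cells["gripper"]
        above = (x, y + 1)
        if command == "up":
            if self.object_at(above):
                return "gripper collides with " + self.object_at(above)
            self.cells["gripper"] = above
            return "gripper moved up"
        if command == "push right":
            name = self.object_at(above)
            if name and self.object_at((x + 1, y + 1)) is None:
                self.cells[name] = (x + 1, y + 1)   # the object jumps one cell
                return name + " moved right, path is free"
            return "nothing to push"
        if command == "pick":
            name = self.object_at(above)
            if name:
                self.holding = name
                return "gripper is holding " + name
            return "nothing to pick"
        return "unknown command"

engine = QualitativeEngine()
replies = [engine.step(c) for c in ("up", "push right", "up", "pick")]
```

Each command costs one dictionary update instead of many physics frames, which is what makes sampling on this layer cheap.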

Grounding
The idea of using a geometric and a semantic layer at the same time is called grounding in the literature. It means that both layers are in sync. Perhaps an example helps: in the Box2D layer we move the gripper upwards. We do so by applying a force to the gripper; in single-step mode this results in a new state. If we want to move the gripper upwards in the qualitative physics engine, we send the command “Moveup” to the gripper. Our game engine will react.

But how exactly does the semantic engine react? This depends on the implementation. For example, if we have programmed the semantic layer like a Sokoban game, then the objects sit in one place, and after being pushed they move. The interesting effect is that our simplified Sokoban engine and the Box2D engine can get out of sync, which means the grounding has failed. I would assume that this behavior is quite normal, because it is hard to simulate a Box2D engine on a higher level.

The answer is that one of the layers has to adapt, and the Box2D layer is the right candidate to be flexible. First we implement the game on a higher level; for example, we define commands like up, down, grip and rotate. These commands then have to be realized on the lower level. That means that if the grounding isn't working, it is always the low-level Box2D engine that has failed. The lower level has the obligation to stay in sync with the higher level.

Implementing a Sokoban-like game on a higher level is easy. The command up moves the gripper 50 pixels to the north. This is realized simply by changing the absolute position. After repainting the game, the gripper has moved up. In the Box2D world the interaction works differently. It is not possible to change positions on an absolute basis; instead we can only apply forces and rotate joints. The grounding works with keyframes. We have a goal state from the high level in which every object has a certain position, and the low-level planner has to bring the Box2D engine into that goal state. Grounding means that the low-level layer reaches the state of the high-level layer.

Keyframe control
The idea of grounding the low level in the higher level can be explained in more detail. The high-level semantic layer of the game is controlled with sloppy precision. That means the user gives the command “gripper up”, and that puts the gripper 50 pixels upwards. It is some kind of simple Pac-Man-like game. The idea is to use this interface for producing keyframes. For example, if the player moves the gripper up, then grasps an object and moves the gripper down, he has generated 3 keyframes.

These keyframes are used as subgoals for the low-level geometric layer. The Box2D planner has to bring the simulation to each of the keyframes. That means it has to try out minor adjustments and low-level commands to emulate the game on a precise level. If the low-level planner fails, the two layers get out of sync.
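The search for low-level commands that reach a keyframe can be sketched with a toy model. This is not Box2D: a one-dimensional point mass stands in for the gripper, and the three force levels, the horizon of six steps and the function names are hypothetical. All force sequences are enumerated, which is only feasible because the horizon between two keyframes is short.

```python
from itertools import product

FORCES = (-10, 0, 10)  # hypothetical force levels the gripper accepts

def simulate(plan, position=0, velocity=0):
    """Toy point mass: a force changes velocity, velocity changes position."""
    for force in plan:
        velocity += force
        position += velocity
    return position, velocity

def ground_keyframe(target, horizon=6):
    """Enumerate force sequences; keep the one ending closest to target at rest."""
    def cost(plan):
        position, velocity = simulate(plan)
        return abs(position - target) + abs(velocity)
    return min(product(FORCES, repeat=horizon), key=cost)

# Keyframe from the high level: the gripper is 50 pixels further up.
plan = ground_keyframe(50)
```

The high level only states where the gripper should be; the enumeration finds forces that stop the point mass exactly there. If no sequence came close to the target, that would be the case in which the layers get out of sync.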

In any timestep the high-level layer is always right: if the player wants to move the gripper 50 pixels up, there is nothing which stops him. The open question is how to bring the Box2D game into that situation. Perhaps it is necessary to accept some deviations, or to give the feedback that the move is not possible.

Creating an action model with keyframes
Suppose we want to create a high-level action model for a pick&place task. What is the best practice? The idea is to describe the task first for a human and simply create a visual tutorial. In the figure an example is given. The tutorial consists of three keyframes plus annotated text describing the subactions. It is similar to a description in a manual for a washing machine. The user gets some easy steps and must follow the instructions.



It is possible to store this description in a human-readable format. The keyframes can be stored with the positions of the objects, while the text can be saved in the RDF triple format. In the end this results in a data structure which consists of step #1, step #2 and step #3.

For grounding the annotated keyframes a low-level planner is needed. That is a software tool which converts the next keyframe into commands for the robot. As input the planner gets the keyframe and the textual annotation, and it must find the low-level actions which can be sent to the gripper. A low-level action for a gripper can be an absolute position, which is equivalent to a command for a Box2D physics engine. Transforming the task plan into low-level actions is difficult, but it is possible. For example, the text annotation “under object” has to be converted into absolute coordinates first.

The good news is that the time horizon for the low-level planner is very short. Between two keyframes there is only a small timespan, so the number of possible plans is low. That means a sampling-based planner which is guided by the RDF triple is enough to find the correct low-level actions.

Parameterized Motor Skills
Extending keyframes with Parameterized Motor Skills is a bit more complicated. The details are described by, but I want to give only the summary. In the figure we had three keyframes that describe the motion. The idea is to extend the motion description with parameters. A parameter is a value for a feature. The original keyframe #1 was “move gripper under object”. Additional parameters would be “speed=5, position=right middle”. That means the same motion can be executed with different adjustments, for example with “speed=2” or with “position=left middle”. The parameters are provided on the textual level as part of the RDF triple, but they have consequences for the follow-up keyframes.

Let us compare a task model without and with parameters. Without parameters the task model has a small number of actions, for example move, push, left and right. We can also specify which object we want to push, but that is all. With additional parameters for speed and goal positions, the number of possible actions that can be sent to the task model is higher. Let us investigate how this affects the RDF triple. In the vanilla version the RDF triple consists of three items, e.g. “push object left”, and the number of possible actions is small. In the parameterized version the syntax can be more elaborate, for example “push object left middle with speed=5”. This is no longer a simple RDF triple but a complete sentence that has to be parsed in detail. One option for translating the input into executable code would be “push(object1,left-middle,5)”. That means the function in the program code takes several variables as input.
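How such a sentence might be parsed is sketched below. The grammar (verb, target, position words, then an optional “with key=value” part) is an assumption inferred from the single example sentence, and `parse_command` and `push` are hypothetical names. The function `push` only returns a description instead of driving a simulation.

```python
def parse_command(text):
    """Split a command like 'push object left middle with speed=5'.

    Returns (verb, target, position, params). The grammar is an
    assumption based on the one example sentence in the text.
    """
    head, _, tail = text.strip().partition(" with ")
    words = head.split()
    if len(words) < 3:
        raise ValueError(f"expected verb, target and position: {text!r}")
    verb, target = words[0], words[1]
    position = "-".join(words[2:])
    params = {}
    for item in tail.split(",") if tail else []:
        key, _, value = item.partition("=")
        params[key.strip()] = int(value)
    return verb, target, position, params


def push(target, position, speed=1):
    """Placeholder skill: a real version would command the simulation."""
    return f"push {target} to {position} at speed {speed}"


verb, target, position, params = parse_command(
    "push object left middle with speed=5")
if verb == "push":
    print(push(target, position, **params))
```

The vanilla triple “push object left” passes through the same parser and simply yields an empty parameter dictionary, so both versions of the task model can share one entry point.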

is a PhD dissertation that describes keyframe-based and trajectory-based learning from demonstration. A skill such as pouring is demonstrated by a human operator many times. Each demonstration is recorded, and the recordings can be accessed through a parameter. A possible replay command would be “pour object with trajectory #32”.
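Access to recorded demonstrations through a parameter can be illustrated with a small lookup table. This sketch is not taken from the dissertation: the class `DemonstrationLibrary`, its methods and the sample trajectory are invented, and the command syntax is assumed from the one example above.

```python
import re


class DemonstrationLibrary:
    """Stores recorded trajectories per skill, numbered from 1."""

    def __init__(self):
        self._recordings = {}

    def record(self, skill, trajectory):
        """Store one demonstration and return its 1-based number."""
        runs = self._recordings.setdefault(skill, [])
        runs.append(list(trajectory))
        return len(runs)

    def replay(self, command):
        """Look up the trajectory named by '<skill> object with trajectory #<n>'."""
        match = re.fullmatch(r"(\w+) object with trajectory #(\d+)", command)
        if match is None:
            raise ValueError(f"cannot parse command: {command!r}")
        skill, number = match.group(1), int(match.group(2))
        runs = self._recordings.get(skill, [])
        if not 1 <= number <= len(runs):
            raise KeyError(f"no trajectory #{number} for skill {skill!r}")
        return runs[number - 1]


library = DemonstrationLibrary()
# Invented sample demonstration: a short list of (x, y) gripper positions.
number = library.record("pour", [(0, 0), (5, 10), (10, 15)])
print(library.replay(f"pour object with trajectory #{number}"))
```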

= References =