
Monday, May 01, 2006

DUPLET OF SATYRS


By


Sanket Korgaonkar



1) Introduction:

Reinforcement learning typically involves an agent interacting with an environment. The agent takes actions and its state in the environment changes. Usually, the agent's state changes only as a result of its own actions.

This does not capture real-life behavior correctly. Often our state changes despite ourselves. In a competitive class where every student is trying to do well, a student's grade does not depend entirely on his own actions; he has to do better than the others in order to get an A. Thus whether an agent achieves its goal depends not just on its own behaving correctly, but also on the actions of other agents.

This is the central issue I wanted to address in this project. I wanted to study whether reinforcement learning can be used to train an agent to achieve success when success is measured relative to other agents, and to investigate whether an agent can be made aware of the existence and behavior of a competing agent.

I formulated a grid navigation problem to embed these issues. The problem consists of a 10 x 10 grid world containing a single location of cheese blocks (rewards); the agent has to find the optimal path to this location and bring as many cheese blocks back to its base as possible.

There are two agents, which are made to compete against each other for cheese blocks. The change in their policies is studied in the hope of understanding whether they have actually learnt of the presence of another agent.

The learning is implemented with TD-lambda.

A note on nomenclature:

A satyr is a creature from Greek mythology, half man and half animal. Satyrs were known to be dangerous creatures willing to take any risk in order to seek worldly pleasures. Analogously, the agents in this game of cheese seek only to gain rewards, and that is indeed the sole aim of their lives. Hence the name 'Satyr'.

2) Problem Definition:

· To bring back as many cheese blocks as possible to the agent's base by the most efficient path. If another agent is present, then try to bring back more cheese blocks than it does.

· The problem is posed in a 10 x 10 grid with one cheese store and as many agent bases as the number of agents.

· The State Representation consists of the x and y co-ordinates of the agents and a sensory input that tells the agent if the current state is the cheese store or not.

· Each agent has four actions viz. to move north, south, east or west. Depending upon the position of the agent in the grid, some of these moves are possible and some are not.

· Thus the number of states is 10 * 10 = 100, and the Q-table has 10 x 10 x 4 = 400 entries (a sketch of this layout follows below).
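As a rough illustration of this representation, the following sketch (in Python with NumPy; the array names, the store location and the north-is-positive-y convention are my assumptions, not taken from the original code) lays out the grid, the action set and the 10 x 10 x 4 Q-table:

import numpy as np

GRID = 10                          # 10 x 10 grid world
ACTIONS = ["N", "S", "E", "W"]     # the four moves available to an agent

# One Q-value per (x, y, action) triple: 10 * 10 * 4 = 400 entries.
Q = np.zeros((GRID, GRID, len(ACTIONS)))

def legal_actions(x, y):
    """Moves that keep the agent inside the grid (assuming y grows northwards)."""
    moves = []
    if y < GRID - 1: moves.append("N")
    if y > 0:        moves.append("S")
    if x < GRID - 1: moves.append("E")
    if x > 0:        moves.append("W")
    return moves

def at_cheese_store(x, y, store=(7, 7)):
    """Sensory input: is the current cell the cheese store? (store location is illustrative)"""
    return (x, y) == store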

3) Literature Review:

a. The Random Walk problem was the inspiration for the model used to set up the project; it is discussed by Sutton in the context of RL in [2]. The problem of a non-stationary environment is also discussed in [2].

b. The use of multiple agents is discussed by Remi Coulom in [1]. The idea of using multiple agents and making them compete against one another was inspired by [1].

c. Tesauro in [4] used a reinforcement learning agent trained against itself. That problem is easier, though, in the sense that it does not matter to the agent where the opponent's move comes from; it just has to react to it with its best possible judgment. In this project, the actions of the competing agent have a delayed effect: the agent only comes to know them at the end of an episode.

d. Different representations of the reward concept are presented in the survey in [5]. I studied them to investigate whether my 'Reward Vector Concept' was in any way similar to the methods discussed.

e. Sridhar Mahadevan and Nicholas Marchalleck in [6] discuss a new reward representation for continuous tasks. I studied their work to find out whether my concept was in any way similar to what they discuss.

f. Sian S. S. in [7] discusses how learning can be extended to multiple agents and the issues with multiple agents.

g. Ming Tan in [3] compares agents' performance on RL tasks when the tasks are attempted independently and when the agents communicate with each other. The paper concludes that better performance can be achieved if agents carefully share sensory information.

h. The motivation to modify the DNA of the agents, rendering them different from each other, came from the work of Mahadevan in [8], where he discusses the automatic programming of behavior-based robots.

4) Approach:

1) The project is completed in several incremental stages.

2) In the first stage, the navigation problem is solved with just one agent and the optimal policy is learnt.

3) The second stage consists of increasing the number of agents and solving the same problem.

4) In the third stage, the agents are made to compete against each other: when the cheese store is empty, the number of cheese blocks each agent has is counted, and the winner gets a positive reward whereas the loser gets a negative reward.

5) In the fourth and final stage, the agents have their own DNA, so that in essence their controls differ; the agents are made to compete and the change in their policies is observed.

5) STAGE 1 : Single agent

1) This problem was solved with optimistic initial values.

2) The agent received a negative reward for each step it took and a positive reward when it found the cheese.

3) The positive reward was propagated back to the starting location based on the eligibility traces.

4) The learning parameters (learning rate, immediate reward, end reward, gamma, lambda, etc.) were adjusted and several runs of the code were conducted to ensure convergence to an optimal policy (a sketch of the update follows below).
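The report does not spell out the exact update, so the following is only a minimal sketch of a tabular Sarsa(lambda)-style backup with accumulating eligibility traces, optimistic initial values, a step penalty and epsilon-greedy selection; all parameter values, and the choice of the Sarsa variant itself, are illustrative assumptions:

import numpy as np

alpha, gamma, lam, epsilon = 0.1, 0.9, 0.8, 0.1   # illustrative learning parameters
STEP_REWARD, CHEESE_REWARD = -1.0, 10.0           # step penalty and end reward (assumed values)

Q = np.full((10, 10, 4), 5.0)   # optimistic initial values to encourage exploration
E = np.zeros_like(Q)            # eligibility traces, one per state-action pair

def choose_action(state):
    """Epsilon-greedy selection over the four moves."""
    x, y = state
    if np.random.rand() < epsilon:
        return np.random.randint(4)
    return int(np.argmax(Q[x, y]))

def td_lambda_step(s, a, r, s2, a2, done):
    """One Sarsa(lambda) backup: the TD error is propagated to every
    state-action pair in proportion to its eligibility trace."""
    x, y = s
    x2, y2 = s2
    target = r if done else r + gamma * Q[x2, y2, a2]
    delta = target - Q[x, y, a]
    E[x, y, a] += 1.0            # accumulating trace for the visited pair
    Q[:] += alpha * delta * E    # back the error up along the eligible path
    E[:] *= gamma * lam          # decay all traces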

Issues:

1) In stage 1, the problem faced was convergence to non-optimal paths. The agent would start by exploring the world around it and eventually stumble onto the treasure cell. This first path would almost invariably not be the optimal one. Because of the eligibility traces, all the states that led to the treasure were backed up according to their frequency of visits, so these states acquired higher values. The next time the agent started from its base, it would prefer this path over any other.

Solutions:

1. A negative reward was given for each step, and the optimistic initial values were set in such a way as to encourage exploration. The agent would thus not get locked into one single path.

6) STAGE 2: Multiple Agents

1) The same learning mechanism was replicated to make another agent.

2) The two agents were given separate bases and were asked to bring back cheese blocks.

3) The reward was absolute and not relative.

4) The agents eventually converged on to a single path, often of the same length.

5) When testing the two agents, care was taken that the treasure square was equidistant from both agents' bases.

Problems:

1) Synchronizing the two agents was an issue.

2) A natural choice would have been to use threads, but that would mean relinquishing the control over their execution.

Solution:

1) Since there were only two agents, I decided to execute them alternately.

2) Which agent moved first in each episode was decided with probability 0.5 (see the sketch below).
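A minimal sketch of this alternation scheme, assuming each agent exposes a hypothetical take_step() method and that episode_over() reports the end-of-episode condition of the current stage:

import random

def run_episode(agent_a, agent_b, episode_over):
    """Execute the two agents alternately (instead of in separate threads),
    so that their interleaving stays fully under our control."""
    # Which agent goes first is decided with probability 0.5.
    first, second = (agent_a, agent_b) if random.random() < 0.5 else (agent_b, agent_a)
    while not episode_over():
        first.take_step()
        if episode_over():
            break
        second.take_step()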

7) STAGE 3: Competing Multiple Agents

1) In this stage, the existing code of stage 2 was modified to make the agents compete against one another.

2) The learning policy was kept the same but when the cheese store became empty, the episode ended.

3) The exploration policy was changed: the non-greedy move was now picked with probability weighted by its Q-table value.

4) At the end of an episode, the agent with more cheese blocks received a positive reward and the other agent received a large negative reward.

5) The positive reward for the winning agent was weighted by the winning margin, which meant that the agent was encouraged to learn to increase that margin.

6) The negative reward for the losing agent was weighted by the losing margin, which meant that the agent was encouraged to reduce that margin (see the sketch after this list).
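A minimal sketch of the two Stage 3 mechanics described above: picking the non-greedy move with probability weighted by its Q-table value, and handing out margin-weighted rewards at the end of the episode. The softmax normalization (needed because Q-values can be negative), the base reward magnitude and the handling of a tie are my assumptions:

import numpy as np

def weighted_exploration(q_row, epsilon=0.1):
    """Greedy most of the time; otherwise pick a non-greedy move with
    probability proportional to its (softmax-normalized) Q-value.
    q_row is a NumPy row of the Q-table, e.g. Q[x, y]."""
    greedy = int(np.argmax(q_row))
    if np.random.rand() >= epsilon:
        return greedy
    others = [a for a in range(len(q_row)) if a != greedy]
    prefs = np.exp(q_row[others] - np.max(q_row[others]))   # softmax weighting (assumed)
    return int(np.random.choice(others, p=prefs / prefs.sum()))

def terminal_rewards(cheese_a, cheese_b, base=10.0):
    """Winner gets a positive reward scaled by the winning margin; the loser
    gets a negative reward scaled by the losing margin (base value illustrative)."""
    margin = abs(cheese_a - cheese_b)
    if cheese_a > cheese_b:
        return base * margin, -base * margin
    if cheese_b > cheese_a:
        return -base * margin, base * margin
    return 0.0, 0.0   # tie: assumed to give neither agent a reward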

Problems:

1) The relative reward concept was tricky to implement. The agents had no sense of how many cheese blocks the other agent had, so any positive or negative reward was associated with the agent's own state at the point the episode ended.

2) I did not want to give the agents explicit information about each other, since I believed that would defeat the purpose of reinforcement learning, whereby we try to teach the agent everything through a reward rather than telling it what to do through the encoding of the states.

Solution:

1) I ended up implementing the reward based on who won at the end of an episode.

2) I weighted each agent's reward with the margin by which the agent won or lost.

Results:

1) The agents' performance improved as the episodes progressed.

2) Specifically, neither agent lost two episodes in a row.

3) The winning agent stuck with his policy, while the losing agent’s policy was observed to change.


Analysis:

1) The results were encouraging but esoteric. Once an agent decided upon the path, it was hard to conceive that it would change the path at all.

2) But the paths did change and not only did they change, they ensured that the agent didn’t lose twice in a row.

3) I attribute this to the fact that since the exploration was weighted, the agents when choosing the non-greedy move made better choices as the episodes progressed.

4) An important but subtle aspect to be observed here was the fact that the path length of each agent didn’t necessarily indicate the time efficiency of the path.

5) In other words, if the two agents had the same path length, it didn’t mean that they would accumulate the same number of cheese blocks.

6) It must be noted that although the agents eventually converged onto a final path, the states through which they found this path varied for each agent, and hence they varied in the number of cheese blocks brought back.

8) STAGE 4: Stealing Permitted

1) In this stage, the competitive relative reward was kept the same, but in addition each agent was given the liberty to take cheese blocks from whichever cell it found them in.

2) It was observed that the agents at first brought cheese blocks back from the cheese store, but once each of them had accumulated a certain number of blocks, they would stumble across each other's base.

3) The agents then resorted to stealing from each other and forgot the main cheese store altogether.

4) Since the episode was supposed to end when the cheese store became empty, such an eventuality never occurred in this stage of the project.

Problems:

1) I wanted the agents to understand the existence of a closer treasure cell (the other agent's base), but this was difficult to convey.

2) Once the agents started exploring and found the cheese blocks in the original treasure cell, it was natural for them to stick with that route and abandon exploration.

3) Thus the agents would still keep on bringing the blocks back from the original treasure cell without discovering the other’s base.

Solution:

1) I tweaked the learning parameters a bit.

2) I changed the immediate reward to a considerably large negative value.

3) I initialized all states at high positive values.

4) This meant that even though the agent had discovered the treasure cell through some path initially, the neighboring cell values were always greater and hence forced the agent to explore the world more.

5) In the course of exploration, the agents would come across each other's base and then stick with that route.

6) It must be noted here that the agents actually learnt the distance of the treasure cell from their base. The agents formed a policy of taking cheese blocks from the nearer target; it did not matter to them whether that was the original treasure cell or another agent's base.

9) STAGE 5: DNA changed

1) I wanted to investigate the effects of changing the controls of each agent.

2) I made one agent incapable of stealing; to compensate, that agent was given the ability to carry three cheese blocks at a time (see the configuration sketch below).
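One simple way to express such 'DNA' differences is as per-agent configuration; the field names and the assumption that a normal agent carries one block per trip are mine:

from dataclasses import dataclass

@dataclass
class AgentDNA:
    can_steal: bool       # may the agent take cheese from the other agent's base?
    carry_capacity: int   # cheese blocks the agent can carry per trip

# Stage 5 setup: one agent cannot steal but carries three blocks at a time;
# the other keeps the default abilities (stealing allowed, one block per trip).
dna_a = AgentDNA(can_steal=False, carry_capacity=3)
dna_b = AgentDNA(can_steal=True, carry_capacity=1)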

Observations:

1) It was observed that the changed agent would discover the target cell and stick to its route.

2) The other agent would resort to stealing from the first agent.

3) It is interesting to note here that the agents didn’t learn to beat each other as episodes progressed.

4) The agent which won kept on winning and the agent which lost kept on losing.

New Features:

1) An important feature which I wanted to implement was sensory information about the other agent.

2) Specifically, I wanted the agents to avoid coming across each other.

3) I implemented this feature by giving an agent a negative reward if the other agent was already present in its state (see the sketch below).
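A minimal sketch of how this avoidance feature folds into the single scalar reward; the penalty magnitudes and the function name are assumptions. Note how the same cell can be backed up with conflicting signals, which is exactly the problem analyzed below:

def step_reward(state, other_agent_state, found_cheese,
                step_penalty=-1.0, cheese_reward=10.0, collision_penalty=-5.0):
    """Single scalar reward: step cost, plus a cheese bonus, plus a penalty
    if the other agent already occupies this cell."""
    r = step_penalty
    if found_cheese:
        r += cheese_reward
    if state == other_agent_state:
        r += collision_penalty
    return r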

Results:

1) The results were disappointing but not surprising.

2) The agents were totally confused about navigation.

3) They could not figure out the location of the original cheese store and avoided stealing from each other entirely.

4) Even when they found the cheese store, they did not converge onto a single path; their paths kept changing and were rarely, if ever, optimal.

Analysis:

1) I attribute this failure to the fact that the agents had conflicting information about the same states.

2) In cases where the agents found the cheese blocks, the states got backed up with a positive reward.

3) In other cases, when an agent ran into the other, the state was backed up with a negative reward, and it could often be the same state.

4) As a result the Q-Table didn’t tell the agents the most optimal way to locate the cheese store while avoiding the other agent.

5) It was also inconceivable under these circumstances that the agents would learn to steal from each other's base when its owner was not present.

10) Issues with Reinforcement Learning in handling dynamic non-stationary problems:

  1. Reinforcement learning has trouble handling dynamic, non-stationary problems because training information is presented in the form of a single numerical reward.
  2. When the goals of the problem conflict, e.g. learning proximity to the treasure cell versus learning proximity to another agent, a single numerical reward is incapable of representing them both. These goals are thus rendered irresolvable for the agent.

Possible Solutions:

  1. A full-blown state representation can solve this problem.
  2. If the goals can be embedded in the form of different states, then the agent will learn each state's value independently and we can still convey the information in a single numerical reward.
  3. This approach is not always feasible, since the number of states tends to explode when learning of new goals or features is attempted (a rough count follows below).
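To make the explosion concrete, here is a rough count under illustrative assumptions (folding the opponent's position and a 0-3 'cheese carried' counter into the state):

own_position  = 10 * 10                   # the current 100 states
with_opponent = own_position * (10 * 10)  # plus the opponent's position -> 10,000 states
with_cargo    = with_opponent * 4         # plus cheese carried (0-3)    -> 40,000 states
print(own_position, with_opponent, with_cargo)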

Reward Vector Concept:

  1. To address this issue, I hereby propose the reward vector concept. In essence what I suggest is that instead of having one single numerical reward, it makes more sense to have a reward vector.
  2. Each element in the vector represents some feature of the problem space we want to learn.
  3. Each element of the vector can be learnt separately, and all the learning parameters of reinforcement learning can be tailored independently depending on the characteristics of the feature.
  4. Algorithm: Reward Vector Learning
    a. Instead of a single numerical value, declare a reward vector such that each element of the vector represents a feature of the problem space we wish the agent to learn.
    b. Thus R = { r_a, r_b, r_c, ... }, where a, b, c, ... are the features of the problem; each element is a floating-point numerical value.
    c. Update rule for learning the reward vector, applied once per feature i (r_i is the immediate reward for feature i):

· Q(s, a, i) = Q(s, a, i) + alpha * [ r_i + gamma_i * Q(s', a', i) - Q(s, a, i) ]

    d. Combined value for each state-action pair:

· Q(s, a) = sum_i ( w_i * Q(s, a, i) ), where w_i is the weight of the i-th feature, reflecting its importance; all weights sum to 1.

    e. Update all states.

  5. The above algorithm is a simplified version of TD-lambda; eligibility traces still need to be included.
  6. An important point to note is that not all features induce a temporal relationship among the states, so the lambda and gamma values can be set differently for each feature (refer to step 4.c).
  7. The importance of features might change as the episode progresses. For example, it makes more sense to encourage stealing in the later part of the episode, when the agents have accumulated substantial cheese blocks. The weight corresponding to the 'agent proximity' feature in step 4.d can therefore be set to go from a high value to a low one, e.g. decreasing linearly with the step count, so that the learning strategy takes more risks of stealing from another agent as the game progresses. Care must still be taken that all weights sum to 1 (a code sketch of steps 4.c-4.d and this weight schedule follows below).
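A minimal sketch of the Reward Vector Learning update of steps 4.c and 4.d (without eligibility traces), assuming just two features, 'goal proximity' and 'agent proximity', with per-feature discount factors and an example weight schedule; all names and values are illustrative:

import numpy as np

FEATURES = ["goal_proximity", "agent_proximity"]
n_f = len(FEATURES)

Q = np.zeros((10, 10, 4, n_f))     # one Q-table slice per feature
alpha = 0.1                        # learning rate
gamma = np.array([0.90, 0.50])     # per-feature discount factors (see point 6 above)

def update(s, a, r_vec, s2, a2):
    """Step 4.c: update each feature's Q-value with its own reward element
    and its own discount factor."""
    x, y = s
    x2, y2 = s2
    for i in range(n_f):
        td_error = r_vec[i] + gamma[i] * Q[x2, y2, a2, i] - Q[x, y, a, i]
        Q[x, y, a, i] += alpha * td_error

def combined_value(s, a, w):
    """Step 4.d: combine the per-feature values with weights that sum to 1."""
    x, y = s
    return float(np.dot(w, Q[x, y, a]))

def feature_weights(step, horizon):
    """Point 7: decrease the weight on the 'agent proximity' (avoidance) feature
    linearly over the episode, so the agent risks stealing more later on."""
    w_avoid = max(0.0, 0.5 * (1.0 - step / horizon))
    return np.array([1.0 - w_avoid, w_avoid])   # always sums to 1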

Advantages of the Reward Vector Concept:

  1. The reward vector concept allows a more intuitive and natural representation of the rewards. Specifically, two elements in the reward vector can have different values and convey different information. For example, a state might be close to the goal state, so the element corresponding to 'proximity to the goal state' has a high positive value, while the same state might have a high likelihood of another agent being present, so the element corresponding to 'proximity to an agent' can be negative. We need not fear confusing the agent.
  2. Individual features of the reward vector can be learnt independently and we can choose how much we want to emphasize each Q table we learn depending on the importance of the corresponding feature at that point of time in the episode.
  3. It is of utmost importance to note that the reward vector concept is not limited in its application to this project alone. It can be deployed in any application where we have conflicting goals to learn while being in a single state, or where we might wish a better separation of the features of our goal.

11) Future Work & Relationship to Model Learning:

  1. Model learning means being able to predict the next state in an episode from the current state.
  2. This project can be extended to include model learning – specifically, it will be very interesting to enable agents to predict their opponent’s move and act accordingly.
  3. The Reward Vector concept can be tied in with Model Learning and deliver a very effective reinforcement learning technique.
  4. A good area of investigation is what happens when function approximation techniques are used to solve the problem. The use of neural networks in particular bears real promise: combined with the reward vector, neural networks could be used to learn complicated real-world problems, rendering reinforcement learning a more useful technique.

12) Conclusions:

  1. A reinforcement learning agent can be trained to respond to an adapting environment. Agents respond to each other and their learning can be improved through mutual experience.
  2. However, state representation requirements can be intractable. It is especially difficult to sustain the Markov Property when dealing with such problems.
  3. The Reward Vector Concept can be used to reduce the state space and to train the agent on different features of the problem.
  4. The Reward Vector Concept can be combined with Model Learning to better solve dynamic non-stationary problems.
  5. State space can be reduced by making suitable assumptions in accordance with the goal.

13) REFERENCES

[1] Coulom, R. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis. http://remi.coulom.free.fr/Thesis

[2] Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (see the Random Walk problem). http://www.cs.ualberta.ca/~sutton/book/the-book.html

[3] Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. GTE Laboratories Incorporated.

[4] Tesauro, G. (1994). Self-training TD-Gammon.

[5] Kaelbling, L. P., Littman, M. L. & Moore, A. W. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996), 237-285.

[6] Mahadevan, S. & Marchalleck, N. Self-Improving Factory Simulation Using Continuous-Time Average-Reward Reinforcement Learning. USF, Tampa, Florida. http://www.cs.umass.edu/~mahadeva/papers/mlc97.ps.gz

[7] Sian, S. S. (1991). Extending Learning to Multiple Agents: Issues and a Model for Multi-Agent Machine Learning.

[8] Mahadevan, S. & Connell, J. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning. Proceedings of AAAI-91, pp. 768-773.