Bob Kentridge 1995

Comparative Psychology: Lecture 7.

Operant conditioning.

Last week we examined the development of experiments and theories of instrumental learning, culminating in Skinner's operant conditioning procedure and his theory that instrumental learning could be explained in terms of response-reinforcer associations whose validity was signalled by discriminative stimuli. This week I want to look at conditioning procedures in more detail and then, just as we did with classical conditioning, consider some factors determining the effectiveness of operant conditioning and what is learned during operant conditioning. Although we could discuss various experiments germane to these questions, it might be easier to get an intuitive grasp of the problems of instrumental learning if we start out by considering how to train an animal to respond in a Skinner box.

How to shape a rat.

In principle, and sometimes in practice, it is possible for a rat to learn to press a bar in a Skinner box by trial and error. If the box is programmed so that a single lever-press causes a pellet to be dispensed, followed by a period for the rat to eat the pellet during which the discriminative-stimulus light is out and the lever inoperative, then the rat may learn to press the lever if left to his own devices for long enough. This can, however, often take a very long time. The methods used in practice illustrate how much the rat has to learn to tackle this simple instrumental learning situation. The first step is to expose the rat, in his home cage and while he is hungry, to the food pellets he will later be rewarded with in the Skinner box. He has to learn that these pellets are food and hence are reinforcing when he is hungry. Now he can be introduced to the Skinner box.
[Figure: a Skinner box.]
Initially we may put a few pellets in the hopper where reinforcers are delivered, plus a few scattered nearby, to allow the rat to discover that the hopper is a likely source of food. Once the rat is happy eating from the hopper he can be left in the Skinner box and the pellet dispenser operated every now and then, so that the rat becomes accustomed to eating a pellet from the hopper each time the dispenser operates (the rat is probably learning to associate the sound of the dispenser operating with food - a piece of classical conditioning which is really incidental to the instrumental learning task at hand). Once the animal has learned that the food pellets are reinforcing and where they are to be found, we could return to the trial and error learning scheme I described earlier. It would, however, still probably take some time for the rat to learn that bar-pressing when the SD light was on produced food. The problem is that the rat is extremely unlikely to press the lever often by chance. In order to learn an operant contingency by trial and error the operant must be some behaviour which the animal performs often anyway.

Instead of allowing the rat to learn by trial and error we can use a 'shaping' or 'successive-approximations' procedure. Initially, instead of rewarding the rat for producing the exact behaviour we require - lever pressing - he is rewarded whenever he performs a behaviour which approximates to lever pressing. The closeness of the approximation to the desired behaviour required in order for the rat to get a pellet is gradually increased so that eventually he is only reinforced for pressing the lever. We might start by reinforcing the animal whenever he is in the front half of the Skinner box. Once he begins to spend more time in this area we reinforce him only if he is also on the side of the box where the lever is. After this we might only reinforce him if his head is pointing towards the lever, then later only when he approaches the lever, when he touches the lever with the front half of his body, when he touches the lever with his paw, and so on until the rat is pressing the lever in order to obtain the reinforcer.

The rat may still not have completely learned the operant contingency - specifically he may not yet have learned that the contingency between the operant response and reinforcement is signalled by the SD light. If we now leave him to work in the Skinner box on his own he will soon learn this and will only press the lever when the SD light is on. We will return later to the question of what the animal learns both about the quality of the discriminative stimulus and the way the SD comes to control operant behaviour. Once we have reached this stage we can begin to see the differences between Skinner's conception of the SD as a signal which discriminates the presence of an operant-reinforcer contingency and previous S-R theories of learning. In the Skinner box it is now possible to change the contingency between responses on the lever and the delivery of food pellets so that more than one response may be required in order to obtain reinforcement. A whole range of rules can govern the contingency between responses and reinforcement - these different types of rules are referred to as schedules of reinforcement.
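
Returning to the shaping procedure described above, the logic of successive approximations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each behaviour is scored by a hypothetical integer 'closeness' level (0 = nowhere near the lever, rising to `final_level` = an actual lever press), and the rule of tightening the criterion after `advance_after` reinforcements is an assumption of the sketch, not something specified in the lecture.

```python
def shape(behaviours, final_level, advance_after=3):
    """Apply a successive-approximations rule to a sequence of behaviours.

    behaviours    -- hypothetical closeness scores, one per behaviour emitted
    final_level   -- score that counts as the target response (a lever press)
    advance_after -- reinforcements at the current criterion before it is raised
                     (an assumed rule, chosen only for illustration)

    Returns (list of reinforced flags, criterion in force at the end).
    """
    criterion = 1          # start by accepting a very rough approximation
    hits = 0               # reinforcements earned at the current criterion
    reinforced = []
    for level in behaviours:
        ok = level >= criterion
        reinforced.append(ok)
        if ok:
            hits += 1
            # Tighten the requirement once the current one is met reliably.
            if hits >= advance_after and criterion < final_level:
                criterion += 1
                hits = 0
    return reinforced, criterion


flags, criterion = shape([1, 1, 0, 1, 1, 2, 2, 2, 3], final_level=3)
```

Note that the behaviour scored 1 in the fifth position is no longer reinforced, because by then the criterion has moved on to level 2.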

Schedules of reinforcement.

There are as many different types of schedules of reinforcement as people can concoct. They are, however, generally constructed from combinations of a few basic types - schedules in which the contingency depends on the number of responses and those where the contingency depends on their timing.

The most straightforward schedules are ones that depend on the number of responses made - these are called ratio schedules. The ratio of the schedule is the number of responses required per reinforcement. The schedule we have been using up to now, where one reinforcer is delivered for each response, is called a continuous reinforcement schedule - it has a ratio of 1. A schedule where two responses have to be made for each reinforcer has a ratio of 2, and so on. We can also distinguish between schedules where exactly the same number of responses has to be made for each reinforcer - fixed-ratio schedules - and those where the number of responses required can differ for each reinforcer around some average value - variable-ratio schedules. A schedule where exactly 20 responses are required for each reinforcer is called a fixed-ratio 20 or FR20 schedule. One where on average 30 responses are required is called a variable-ratio 30 or VR30 schedule.

The contingency between responses and reinforcement can also depend on time. We might, for example, reinforce the first response an animal makes after the SD light has been on for 20 seconds. Any responses it makes during that 20 seconds are irrelevant. This is called an interval schedule. Where the interval which must elapse between the onset of the SD and the first reinforced response is the same for all reinforcers the schedule is called a fixed-interval or FI schedule. Again, the intervals could also vary around some average - this is called a variable-interval or VI schedule. It is possible to combine these schedules in various ways and even to construct other basic types of schedule (e.g. ones where animals are reinforced for maintaining specified intervals between responses - differential reinforcement of low rate of response or DRL schedules).

The important thing about these different schedules, however, is the differences in response patterns and learning that they produce. These differences may tell us about part of what is learned in operant conditioning. Let us summarise the basic types of schedule and then consider the characteristic response patterns they produce. Fixed-ratio (FR) schedules reinforce after a fixed number of responses and variable-ratio (VR) schedules after a number of responses which varies around an average; fixed-interval (FI) schedules reinforce the first response after a fixed time has elapsed and variable-interval (VI) schedules the first response after a time which varies around an average.

The most characteristic response patterns are produced by FI and FR schedules. Responses in operant-conditioning experiments were traditionally recorded using a pen recorder in which a pen was drawn across paper at a constant rate. The pen was moved up a small amount each time an animal made a response, and a larger diagonal movement recorded the occurrence of each reinforcement. A constant response rate will produce a straight line with a constant slope proportional to that rate on such a recorder. A pause in responding will be shown as a horizontal line. Animals produce constant rate responding to FR schedules with a distinct pause in responding after each reinforcement.
[Figure: FR cumulative response record.]
The rate of response is inversely related to the ratio requirement - the more responses required per reinforcer the lower the response rate. The length of the post-reinforcement pause in responding also increases as the ratio increases. The pattern of animal responding on FI schedules is quite different. After each reinforcement animals respond on FI schedules with gradually accelerating response rates which produce a 'scalloped' record:
[Figure: FI cumulative response record.]
The main feature of variable schedules is that, in animals, ratio schedules produce higher response rates than interval schedules for the same reinforcement density. For example, one animal might be trained on a variable-ratio schedule and the times at which it received reinforcement noted. These times could then be used to form a 'yoked' variable-interval schedule for another animal - an interval schedule where the interval between SD onset and the onset of a response-reinforcement contingency is determined by the times at which the first animal received each reinforcement. Typically the second animal would produce much lower response rates on the yoked schedule even though the frequency of reinforcement received by the two animals was more or less the same.
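
The four basic schedules described in this section are simply rules for deciding whether a given response is reinforced, so they can be sketched as small Python classes. This is a minimal illustration, not a description of any real Skinner-box controller: the class names, the `respond(t)` method, the uniform distributions used for the variable schedules, and the assumption that the SD comes back on immediately after each reinforcement are all choices made for the sketch.

```python
import random


class FixedRatio:
    """FR schedule: every `ratio`-th response is reinforced."""

    def __init__(self, ratio):
        self.ratio = ratio
        self.count = 0

    def respond(self, t):
        # `t` (time of the response, in seconds) is ignored by ratio schedules.
        self.count += 1
        if self.count >= self.ratio:
            self.count = 0
            return True
        return False


class VariableRatio:
    """VR schedule: the response requirement varies around `mean_ratio`."""

    def __init__(self, mean_ratio, rng=None):
        self.mean_ratio = mean_ratio
        self.rng = rng if rng is not None else random.Random()
        self._reset()

    def _reset(self):
        # Assumption: requirement drawn uniformly from 1 .. 2*mean - 1,
        # which averages to mean_ratio.
        self.required = self.rng.randint(1, 2 * self.mean_ratio - 1)
        self.count = 0

    def respond(self, t):
        self.count += 1
        if self.count >= self.required:
            self._reset()
            return True
        return False


class FixedInterval:
    """FI schedule: reinforce the first response made at least
    `interval` seconds after SD onset; earlier responses are irrelevant."""

    def __init__(self, interval):
        self.interval = interval
        self.sd_onset = 0.0

    def respond(self, t):
        if t - self.sd_onset >= self.interval:
            # Assumption: the SD comes back on immediately after reinforcement.
            self.sd_onset = t
            return True
        return False


class VariableInterval(FixedInterval):
    """VI schedule: the interval varies around `mean_interval`."""

    def __init__(self, mean_interval, rng=None):
        self.mean_interval = mean_interval
        self.rng = rng if rng is not None else random.Random()
        super().__init__(self._draw())

    def _draw(self):
        # Assumption: interval drawn uniformly from 0 .. 2*mean seconds.
        return self.rng.uniform(0, 2 * self.mean_interval)

    def respond(self, t):
        reinforced = super().respond(t)
        if reinforced:
            self.interval = self._draw()
        return reinforced
```

For example, `FixedRatio(20)` corresponds to an FR20 schedule and `VariableRatio(30)` to a VR30 schedule. Calling `respond(t)` once per lever press, with `t` the time of the press, returns `True` whenever that press earns a pellet.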

Scallops and matching.

The rate at which an animal responds on a schedule is one measure of the strength of the association that it makes between response and reinforcement. For example, the interesting thing about FI schedules is that, in these terms, responses become more strongly associated with reinforcement as the time since SD onset increases. The animal does not learn the FI contingency precisely, but does learn an approximation which keeps the effort it expends to obtain reinforcement relatively low while maintaining a good chance of obtaining that reinforcement almost as soon as it becomes available.

We can study the way animals choose between reinforcers by presenting them with the opportunity to respond on two schedules simultaneously, using a Skinner box with two levers and two SD lights. One lever may, for example, operate on a VR20 schedule while the other operates on a VR10. If we conduct a series of experiments with different combinations of schedules we can discover the way in which animals allot their resources between reinforcers of different values. It turns out that animals do not distribute their responses ideally - that is, by always making all of their responses on the richer schedule - but again distribute their responses in a way which serves to minimise responding given only approximate information about the different contingencies. Animals allot responses between schedules in proportion to the numbers of reinforcers they obtain on each schedule. This is known as the matching law - it has been studied not only by psychologists but also, increasingly, by economists, since this behaviour of rats often corresponds to economic behaviour in humans.
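
The matching law as stated above says that the share of responses made on each schedule equals the share of reinforcers obtained on it; for two schedules, B1/(B1+B2) = R1/(R1+R2), where B counts responses and R counts reinforcers obtained. A minimal Python sketch of that proportion follows. The function name and the example counts are illustrative assumptions, not data from any experiment.

```python
def matching_allocation(reinforcers_obtained):
    """Predicted share of responses on each schedule under strict matching.

    reinforcers_obtained -- number of reinforcers the animal has obtained
                            on each concurrently available schedule
    """
    total = sum(reinforcers_obtained)
    if total == 0:
        raise ValueError("no reinforcers obtained, so matching predicts nothing")
    return [r / total for r in reinforcers_obtained]


# Hypothetical session: 30 reinforcers earned on one lever, 10 on the other.
shares = matching_allocation([30, 10])
```

With these made-up counts, strict matching predicts three quarters of the responses on the first lever and one quarter on the second, rather than all responses on the richer lever.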

Extinction, reinforcers and punishers.

We can also examine the strength of associations by looking at resistance to extinction - that is, how long an animal will keep responding in the presence of an SD even though responses now produce no reinforcement. One of the main features of all schedules of reinforcement which are partial - that is, which are not continuous, so that not every response is reinforced - is that they produce learning which is much more resistant to extinction than continuous reinforcement - the sort of discrete trial learning studied by Thorndike and Watson.

This is probably a good place to discuss contingencies where the results of responding are something other than desirable. In discussing classical conditioning we were often considering associations made to aversive events and we did not draw much of a distinction between these and appetitive events such as food presentation. In operant conditioning appetitive and aversive events produce different patterns of learning. We can, in fact, distinguish between four different consequences of responding in operant conditioning, according to whether the response produces or prevents an event and whether that event is appetitive or aversive. Although animals can learn all of these contingencies it is very clear that they have quite different consequences in extinction. When a contingency fails to apply to a behaviour actively produced by the animal it is clear that the contingency is no longer in operation. On the other hand, if behaving leads to aversive consequences then, in extinction, the animal is unlikely to produce the behaviour and hence to discover that the contingency no longer applies.

What is learned in operant conditioning?

We now know enough about operant conditioning to begin asking what exactly is learned in it. It is clear that the operant conditioning paradigm is more complex than classical conditioning and so the question of what is learned can be approached at a number of levels.

Response Differentiation.

As we saw in discussing shaping, the animal has to learn the nature of the operant - this is known as response differentiation. We can see by changing the response requirement, for example the force required to depress the lever, that animals learn response requirements very precisely. It is also clear, however, that they are not simply learning a set of muscle movements. Once a rat has learned to press one lever he does not have to relearn the whole process if the response is extinguished and he is presented with a new learning task - we do not need to go through the whole shaping procedure again even if the lever is now on the opposite side of the Skinner box.

Stimulus discrimination.

In addition to response differentiation the animal must also learn to discriminate the discriminative stimulus. This task can most clearly be seen when a number of stimuli can be presented to the animal, only one of which is the true SD. If the factor which distinguishes the SD is its colour then we soon see that the number of responses the animal (a pigeon in this example - rats are colour-blind) makes to colours which differ slightly from the SD colour is far lower than would be the case if the SD did not have to be distinguished like this.
[Figure: stimulus generalisation/discrimination curve.]

R-S or S-R associations?

Last week I mentioned that Skinner did not consider operant conditioning to be based on stimulus-response associations, but rather on response-reinforcer associations. Let us briefly look at some evidence. A rat is trained to make one type of response for one reinforcer - say chocolate-drops - and a different response for a second reinforcer - food pellets. If the value of the first reinforcer is now reduced, for example by presenting it in the animal's home cage in conjunction with a chemical which makes the rat nauseous, then, when the rat returns to the Skinner box, he will produce much less of the first type of behaviour. If the animal had learned a stimulus-response association - i.e. an association between being put in the Skinner box and producing the first behaviour - then we would not expect to see less of the first behaviour before the animal has even obtained a reinforcer in the Skinner box. If, on the other hand, the association is between response and reinforcement then devaluation of the reinforcer would be expected to have just the effect observed on behaviour.

We can also demonstrate the R-S nature of operant associations by presenting additional reinforcers not contingent on responding on one of a pair of schedules. Imagine the same initial situation, but now, instead of devaluing chocolate-drops by pairing them with poison, we begin to present some chocolate-drops to the animal in the Skinner box whether or not it has met the schedule contingency. The animal now again makes fewer of the first 'chocolate-drop' responses. Again, the stimuli (the SD, being in the box and so on) have not changed, yet a change in the contingency between response and reinforcement has affected behaviour.

Contingency or contiguity?

This leads us to another question. Instrumental learning normally clearly depends on a contingency between response and reinforcement, but must this always be the case? Normally, if a contingency is not present - if responding has no effect on whether reinforcement is obtained - then no learning occurs. There is, however, the possibility that a contingency is perceived where, in fact, there is none. To truly assess the contingency between response and reinforcement we need to know both the chances of obtaining a reinforcer if we respond and the chances of obtaining a reinforcer if we don't respond. If we never evaluate the latter probability because we are responding all the time then we may attribute a contingency to responding where there is none. The opposite can also occur. An extreme example of this is 'learned helplessness'. In the first part of a learned helplessness experiment an animal is subjected to unavoidable shocks - there may be a potential path to escape, for example a wall to jump over, but escape is impossible, for example because the wall is too high. Soon the animal learns that escape is impossible and ceases attempting it. If the animal is now moved to a different situation in which escape is possible it will, nevertheless, fail to learn. Because it never performs escape behaviour it cannot discover that the chances of being shocked when it makes an escape attempt are now different from those it experiences when not behaving. The lack of contingency perceived between behaviour and shock is illusory. In these circumstances, then, conditioning is really being controlled by the contiguity of response and reinforcer, not their contingency. It should, however, be emphasised that, in general, the effectiveness of instrumental learning depends on contingency.
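
The point about needing both probabilities can be made concrete with a short Python sketch. One common way to summarise a contingency is the difference between the chance of the outcome after a response and the chance of the outcome without one; using that difference here is an assumption of the sketch, as are the function name and the made-up trial records. The function returns `None` when one of the two probabilities was never sampled, which is the position of an animal that responds all the time or, as in learned helplessness, never responds at all.

```python
def contingency(trials):
    """Estimate P(outcome | response) - P(outcome | no response).

    trials -- iterable of (responded, outcome_occurred) pairs of booleans.

    Returns None when either probability cannot be estimated because
    that kind of trial was never experienced.
    """
    trials = list(trials)
    after_response = [outcome for responded, outcome in trials if responded]
    after_no_response = [outcome for responded, outcome in trials if not responded]
    if not after_response or not after_no_response:
        return None  # no evidence about one side of the comparison
    p_response = sum(after_response) / len(after_response)
    p_no_response = sum(after_no_response) / len(after_no_response)
    return p_response - p_no_response


# Made-up records: a real contingency, sampled on both sides.
sampled = ([(True, True)] * 8 + [(True, False)] * 2
           + [(False, True)] * 2 + [(False, False)] * 8)
# Made-up records: an animal that never attempts the response.
never_responds = [(False, True)] * 10

estimate = contingency(sampled)
unknown = contingency(never_responds)
```

In the second case no amount of further experience of the same kind will change the result, since the animal never generates the trials that would reveal the contingency.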

Learning schedules.

Finally, as we noted above, when one considers partial schedules of reinforcement the animal learns approximations to the contingencies in operation, based on the samples of those contingencies he is exposed to. He does not learn the contingency as we have idealised it. If he did, then on FI schedules no response would be produced until the interval was up, and on concurrent schedules he would maximise, not match.

Sources.

Schwartz is quite a good source for further details - I drew on parts of chapters 5 and 7, but there is a lot more there to read - this lecture only scratches the surface. It is interesting and alarming to read about how far Skinner thought principles of operant conditioning could be applied to everyday life - he wrote a 'utopian' novel about a future society based on scientific behavioural principles called 'Walden Two' - it doesn't seem like much of a utopia to me. He also comments on the status in which his science of behaviour leaves freedom of choice and morality in 'Beyond Freedom and Dignity'.