Bob Kentridge 1995
Comparative Psychology: Lecture 7.
Operant conditioning.
Last week we examined the development of experiments and theories of
instrumental learning, culminating in Skinner's operant conditioning
procedure and his theory that instrumental learning could be explained
in terms of response-reinforcer associations whose validity was
signalled by discriminative stimuli. This week I want to look at
conditioning procedures in more detail and then, just as we did with
classical conditioning, consider some factors determining the
effectiveness of operant conditioning and what is learned during
operant conditioning. Although we could discuss various experiments
germane to these questions, it might be easier to get an intuitive
grasp of the problems of instrumental learning if we start out by
considering how to train an animal to respond in a Skinner box.
How to shape a rat.
In principle, and sometimes in practice, it is possible for a rat to
learn to press a bar in a Skinner-box by trial and error. If the box
is programmed so that a single lever-press causes a pellet to be
dispensed, followed by a period for the rat to eat the pellet when
the discriminative-stimulus light is out and the lever inoperative,
then the rat may learn to press the lever if left to his own devices
for long enough. This can, however, often take a very long time.
The methods used in practice illustrate how much the rat has to
learn to tackle this simple instrumental learning situation.
The first step is to expose the hungry rat, in his home cage, to the
food pellets with which he will later be rewarded in the Skinner box.
He has to learn that these pellets are food and hence are
reinforcing when he is hungry. Now he can be introduced to the
Skinner-box.
Initially we may put a few pellets in the hopper where reinforcers
are delivered, plus a few scattered nearby, to allow the rat to
discover that the hopper is a likely source of food. Once the rat is
happy eating from the hopper he can be left in the Skinner box and
the pellet dispenser operated every now and then so the rat
becomes accustomed to eating a pellet from the hopper each time
the dispenser operates (the rat is probably learning to associate
the sound of the dispenser operating with food - a piece of
classical conditioning which is really incidental to the instrumental
learning task at hand).
Once the animal has learned that the food pellets are reinforcing and
where they are to be found, we could return to the trial and error
learning scheme I described earlier. It would, however, still
probably take some time for the rat to learn that bar-pressing
when the SD light was on produced food. The problem is that the
rat is extremely unlikely to press the lever often by chance. In
order to learn an operant contingency by trial and error the
operant must be some behaviour which the animal performs often
anyway. Instead of allowing the rat to learn by trial and error we
can use a 'shaping' or 'successive-approximations' procedure.
Initially, instead of rewarding the rat for producing the exact
behaviour we require - lever pressing - he is rewarded whenever
he performs a behaviour which approximates to lever pressing.
The closeness of the approximation to the desired behaviour
required in order for the rat to get a pellet is gradually increased
so that eventually he is only reinforced for pressing the lever. We
might start by reinforcing the animal whenever he is in the front
half of the Skinner-box. Once he begins to spend more time in
this area we reinforce him only if he is also on the side of the box
where the lever is. After this we might only reinforce him if his
head is pointing towards the lever and then later only reinforce
him when he approaches the lever, when he touches the lever
with the front half of his body, when he touches the lever
with his paw and so on until the rat is pressing the lever in order
to obtain the reinforcer.
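The successive-approximations procedure just described can be sketched in code. This is only an illustrative Python sketch, not a description of any real apparatus: the feature names, the Shaper class and the rule that the criterion is tightened after a fixed number of reinforced behaviours (successes_needed) are all assumptions of mine.

```python
# Hypothetical stages, each a set of features the behaviour must show.
# Later stages demand closer approximations to lever pressing.
STAGES = [
    {"front_half"},
    {"front_half", "lever_side"},
    {"front_half", "lever_side", "facing_lever"},
    {"front_half", "lever_side", "facing_lever", "touching_lever"},
    {"front_half", "lever_side", "facing_lever", "touching_lever",
     "pressing_lever"},
]


class Shaper:
    """Reinforce successive approximations to a target behaviour."""

    def __init__(self, stages, successes_needed=5):
        self.stages = stages
        # Assumption: move to the next stage after this many reinforcers.
        self.successes_needed = successes_needed
        self.stage = 0
        self.successes = 0

    def observe(self, behaviour):
        """Return True if the observed behaviour (a set of features)
        meets the current criterion and so earns a pellet."""
        if not self.stages[self.stage] <= behaviour:
            return False
        self.successes += 1
        last_stage = len(self.stages) - 1
        if self.successes >= self.successes_needed and self.stage < last_stage:
            self.stage += 1       # tighten the criterion
            self.successes = 0
        return True


shaper = Shaper(STAGES, successes_needed=2)
print(shaper.observe({"front_half"}))                # reinforced at stage 0
print(shaper.observe({"front_half"}))                # reinforced; criterion tightens
print(shaper.observe({"front_half"}))                # no longer good enough
print(shaper.observe({"front_half", "lever_side"}))  # meets the new criterion
```

In this sketch a rat that merely stands in the front half of the box is rewarded at first, but stops being rewarded once the criterion has moved on.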
The rat may still not have completely learned the operant
contingency - specifically he may not yet have learned that the
contingency between the operant response and reinforcement is
signalled by the SD light. If we now leave him to work in the
Skinner-box on his own he will soon learn this and will only press
the lever when the SD light is on. We will return later to the
question of what the animal learns both about the quality of the
discriminative stimulus and the way the SD comes to control
operant behaviour.
Once we have reached this stage we can begin to see the
differences between Skinner's conception of the SD as being a
signal which discriminates the presence of an operant-reinforcer
contingency and previous S-R theories of learning. In the
Skinner-box it is now possible to change the contingency between
responses on the lever and the delivery of food pellets so that
more than one response may be required in order to obtain
reinforcement. A whole range of rules can govern the contingency
between responses and reinforcement - these different types of
rules are referred to as schedules of reinforcement.
Schedules of reinforcement.
There are as many different types of schedules of reinforcement
as people can concoct. They are, however, generally constructed
from combinations of a few basic types - schedules in which the
contingency depends on the number of responses and those where
the contingency depends on their timing.
The most straightforward schedules are ones that depend on the
number of responses made - these are called ratio schedules. The
ratio of the schedule is the number of responses required per
reinforcement. The schedule we have been using up to now,
where one reinforcer is delivered for each response, is called a
continuous reinforcement schedule - it has a ratio of 1. A
schedule where two responses had to be made for each reinforcer
has a ratio of 2 and so on. We can also distinguish between
schedules where exactly the same number of responses have to be
made for each reinforcer - fixed-ratio schedules, and those where
the number of responses required can differ for each reinforcer
around some average value - a variable-ratio schedule. A
schedule where exactly 20 responses were required for each
reinforcer is called a fixed-ratio 20 or FR20 schedule. One where
on average 30 responses are required is called a variable-ratio 30
or VR30 schedule.
The contingency between responses and reinforcement can also
depend on time. We might, for example, reinforce the first
response an animal makes after the SD light has been on for 20
seconds. Any responses it makes during that 20 seconds are
irrelevant. This is called an interval schedule. Where the interval
which must elapse between the onset of the SD and the first
reinforced response is the same for all reinforcers the schedule is
called a fixed-interval or FI schedule. Again, the intervals could
also vary around some average - this is called a variable-interval
or VI schedule.
It is possible to combine these schedules in various ways and even
to construct other basic types of schedule (e.g. ones where animals
are reinforced for maintaining specified intervals between
responses - differential reinforcement of low rate of response or
DRL schedules). The important thing about these different
schedules, however, is the differences in response patterns and
learning that they produce. These differences may tell us about
part of what is learned in operant conditioning. Let us summarise
the different basic types of schedules and then consider the
characteristic response patterns they produce.
- Fixed-Ratio (FR) in which a response is reinforced once a
given number of responses have been made in the presence of the
discriminative stimulus. For example, on an FR 15 schedule every
15th response is reinforced.
- Fixed-Interval (FI) in which the first response made
after a given time interval is reinforced. For example, on an FI 20
sec. schedule the first response made after 20 seconds from the
onset of the discriminative stimulus is reinforced. The
discriminative stimulus would normally then be turned off during
the period the animal consumes its reinforcer.
- Variable-Ratio (VR) is similar to FR except that the
number of responses required varies between reinforcements. On
a VR 15 schedule 15 responses are required per reinforcer on
average, but one reinforcer may only require 3 responses while
the next is obtained after 22 responses.
- Variable-Interval (VI) is similar to FI except the
interval requirements vary between reinforcers around some
specified average value.
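The four basic schedules summarised above can be written as simple rules that decide whether a given response is reinforced. The following Python sketch is illustrative only. The class names are mine, times are in seconds, the SD is assumed to come back on immediately after each reinforcer (the eating period is ignored), and the variable requirements are assumed to be drawn from uniform distributions around the stated average; real experiments may use other distributions.

```python
import random


class FixedRatio:
    """FR n: every n-th response is reinforced."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def respond(self, t):
        """Register a response at time t; return True if it is reinforced."""
        self.count += 1
        if self.count >= self.n:
            self.count = 0
            return True
        return False


class VariableRatio(FixedRatio):
    """VR mean: the response requirement varies around an average."""

    def __init__(self, mean, rng=None):
        self.mean = mean
        self.rng = rng if rng is not None else random.Random()
        super().__init__(self._draw())

    def _draw(self):
        # Assumption: uniform over 1 .. 2*mean - 1, which averages `mean`.
        return self.rng.randint(1, 2 * self.mean - 1)

    def respond(self, t):
        reinforced = super().respond(t)
        if reinforced:
            self.n = self._draw()  # new requirement for the next reinforcer
        return reinforced


class FixedInterval:
    """FI: the first response at least `interval` s after SD onset is reinforced."""

    def __init__(self, interval):
        self.interval = interval
        self.onset = 0.0  # time at which the SD last came on

    def respond(self, t):
        if t - self.onset >= self.interval:
            self.onset = t  # simplification: SD comes straight back on
            return True
        return False


class VariableInterval(FixedInterval):
    """VI mean: the interval requirement varies around an average."""

    def __init__(self, mean, rng=None):
        self.mean = mean
        self.rng = rng if rng is not None else random.Random()
        super().__init__(self._draw())

    def _draw(self):
        # Assumption: uniform between 0 and 2*mean seconds.
        return self.rng.uniform(0, 2 * self.mean)

    def respond(self, t):
        reinforced = super().respond(t)
        if reinforced:
            self.interval = self._draw()
        return reinforced


fr15 = FixedRatio(15)
print([n for n in range(1, 31) if fr15.respond(n)])    # responses 15 and 30

fi20 = FixedInterval(20)
print([fi20.respond(t) for t in (5, 19, 20, 25, 41)])  # only 20 and 41 pay off
```

Note that on the interval schedules the responses made at 5, 19 and 25 seconds are simply wasted, whereas on the ratio schedules every response counts towards the next reinforcer.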
The most characteristic response patterns are produced by FI and
FR schedules. Responses in operant-conditioning experiments
were traditionally recorded using a pen recorder in which a pen
was drawn across paper at a constant rate. The pen was moved up
a small amount each time an animal made a response, and a larger
diagonal movement recorded the occurrence of reinforcements.
A constant response rate will produce a straight line with a
constant slope proportional to that rate on such a recorder. A
pause in responding will be shown as a horizontal line. Animals
produce constant rate responding to FR schedules with a distinct
pause in responding after each reinforcement.
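The pen-recorder trace described above is just a running count of responses plotted against time. A minimal Python sketch, assuming we have a list of response times in seconds (the function name and the sampling step are my own choices, and the reinforcement marks are left out):

```python
def cumulative_record(response_times, duration, step=1.0):
    """Return (time, cumulative number of responses) pairs sampled
    every `step` seconds from 0 to `duration`."""
    times = sorted(response_times)
    record = []
    count = 0
    i = 0
    for k in range(int(duration / step) + 1):
        t = k * step
        # Count every response made up to and including time t.
        while i < len(times) and times[i] <= t:
            count += 1
            i += 1
        record.append((t, count))
    return record


# Steady responding for 10 s followed by a 10 s pause.
for t, n in cumulative_record([2, 4, 6, 8, 10], duration=20, step=5.0):
    print(t, n)
```

The count climbs steadily while the animal is responding and then stays flat during the pause, which is the sloping line followed by a horizontal line seen on the paper record.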
The rate of response is inversely proportional to the ratio
requirement - the more responses required per reinforcer the
lower the response rate. The length of the post-reinforcement
pause in responding also increases as the ratio increases.
The pattern of animal responding on FI schedules is quite
different. After each reinforcement animals respond on FI
schedules with gradually accelerating response rates, which
produce a 'scalloped' record.
The main feature of variable schedules is that, in animals, ratio
schedules produce higher response rates than interval schedules
for the same reinforcement density. For example, one animal
might be trained on a variable ratio schedule and the times at
which it received reinforcement could be noted. These times could
then be used to form a 'yoked' variable interval schedule for
another animal - an interval schedule where the interval between
SD onset and the onset of a response-reinforcement contingency is
determined by the times at which the first animal received each
reinforcement. Typically the second animal would produce much
slower response rates on the yoked schedule even though the
frequency of reinforcement received by the two animals was
more or less the same.
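The yoking procedure can be sketched as follows. This is a simplified Python illustration under my own assumptions: times are measured from the start of the session rather than from each SD onset, each reinforcer becomes available to the second animal at the moment the first animal earned it, and an available reinforcer is held until the second animal next responds. The function name is hypothetical.

```python
def yoked_reinforcements(master_times, follower_responses):
    """Return the times of the follower's responses that are reinforced.

    master_times: times at which the first (ratio-schedule) animal
        received its reinforcers.
    follower_responses: times at which the second (yoked) animal responds.
    """
    available = sorted(master_times)
    reinforced = []
    k = 0  # index of the next reinforcer to become available
    for t in sorted(follower_responses):
        if k < len(available) and t >= available[k]:
            reinforced.append(t)
            k += 1
    return reinforced


master = [10, 25, 40]
print(yoked_reinforcements(master, [5, 12, 14, 30, 31, 50]))
print(yoked_reinforcements(master, [12, 30, 50]))
```

In this toy example the animal that responds six times and the animal that responds only three times collect the same three reinforcers, so in the sketch extra responses on the yoked schedule earn nothing further.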
Scallops and matching.
The rate at which an animal responds on a schedule is one
measure of the strength of association that it makes between
response and reinforcement. For example, the interesting thing
about FI schedules is that, in these terms, responses become more
strongly associated with reinforcement as the time since SD onset
increases. The animal does not learn the FI contingency precisely,
but does learn an approximation which keeps the effort it expends
to obtain reinforcement relatively low while maintaining a good
chance of obtaining that reinforcement almost as soon as it
becomes available. We can study the way animals choose
between reinforcers by presenting them with the opportunity to
respond on two schedules simultaneously if we have a Skinner-
box with two levers and two SD lights. One lever may, for
example, operate on a VR20 schedule while the other operates on
a VR10. If we conduct a series of experiments with different
combinations of schedules we can discover the way in which
animals allot their resources between reinforcers of different
values. It turns out that animals do not distribute their responses
ideally - always making all of their responses to the richer
schedule, but again distribute their responses in a way which
serves to minimise responding given only approximate
information about the different contingencies. Animals allot
responses between schedules in proportion to the numbers of
reinforcers they obtain on each schedule. This is known as the
matching law - it has been studied not only by psychologists, but
also more and more by economists since this behaviour of rats
often corresponds to economic behaviour in humans.
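The matching relation stated above can be written as a formula. The symbols are my own labels rather than anything fixed by this lecture: B_1 and B_2 are the numbers of responses made on the two schedules and R_1 and R_2 are the numbers of reinforcers obtained on them.

```latex
\frac{B_1}{B_1 + B_2} = \frac{R_1}{R_1 + R_2}
```

As a worked example with made-up numbers, an animal that obtains 30 reinforcers on the first lever and 10 on the second would be expected to make 30 / (30 + 10) = 0.75, that is three quarters, of its responses on the first lever.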
Extinction, reinforcers and punishers.
We can also examine the strength of associations by looking at
resistance to extinction - that is, how long an animal will keep
responding in the presence of an SD even though responses now
produce no reinforcement. One of the main features of all
schedules of reinforcement which are partial - that is, which aren't
continuous, so that not every response is reinforced - is that
they produce learning which is much more resistant to extinction
than continuous reinforcement - the sort of discrete trial learning
studied by Thorndike and Watson.
This is probably a good place to discuss contingencies where the
results of responding are something other than desirable. In
discussing classical conditioning we were often considering
associations made to aversive events and we did not draw much
of a distinction between these and appetitive events such as food
presentation. In operant conditioning appetitive and aversive
events produce different patterns of learning. We can, in fact,
distinguish between four different consequences of responding in
operant conditioning:
- Positive reinforcement: a positive or appetitive event is
contingent on responding - for example, responding might lead to
being fed or working to being paid.
- Negative reinforcement: cessation of negative or aversive
events is contingent on responding - for example, responding
might cause a series of shocks to stop or turning off the TV might
lead to an absence of Jeremy Beadle (or whoever else you dislike).
- Punishment: a negative or aversive event is contingent on
responding - for example, moving off a platform might lead to
shock or dangerous driving to a big fine.
- Negative punishment: the cessation of positive events is
contingent on responding - for example, moving into the centre of
a test box may cause pleasant brain stimulation to be halted or
being loud and obnoxious in the bar may cause you to get thrown
out of it.
Although animals can learn all of these contingencies it is very
clear that they have quite different consequences in extinction.
When a contingency fails to apply to a behaviour actively
produced by the animal it is clear that the contingency is no
longer in operation. On the other hand, if behaving leads to
aversive consequences then, in extinction the animal is unlikely to
produce the behaviour and hence to discover that the contingency
no longer applies.
What is learned in operant conditioning?
We now know enough about operant conditioning to begin asking
what exactly is learned in it. It is clear that the operant
conditioning paradigm is more complex than classical conditioning
and so the question of what is learned can be approached at a
number of levels.
Response Differentiation.
As we saw in discussing shaping, the animal has to learn the
nature of the operant - this is known as response differentiation.
We can see by changing the response requirement, for example
the force required to depress the lever, that animals learn
response requirements very precisely. It is also clear, however,
that they are not simply learning a set of muscle movements.
Once a rat has learned to press one lever he does not have to
relearn the whole process if he is extinguished and presented with
a new learning task - we do not need to go through the whole
shaping procedure again even if the lever is now on the opposite
side of the Skinner-box.
Stimulus discrimination.
In addition to response differentiation the animal must also learn to
discriminate the discriminative stimulus. This task can most clearly
be seen when a number of stimuli can be presented to the animal only
one of which is the true SD. If the factor which distinguishes the SD
is its colour then we soon see that the number of responses the animal
(a pigeon in this example - rats are colour-blind) makes to colours
which differ slightly from the SD colour is far fewer than would be
the case if the SD did not have to be distinguished like this.
R-S or S-R associations?
Last week I mentioned that Skinner did not consider operant
conditioning to be based on stimulus-response associations, but
rather on response-reinforcer associations. Let us briefly look at
some evidence. A rat is trained to make one type of response for
one reinforcer - say chocolate-drops - and a different response for a
second reinforcer - food pellets. If the value of the first reinforcer
is now reduced, for example by presenting it in the animal's home
cage in conjunction with a chemical which makes the rat nauseous,
then, when the rat returns to the Skinner box he will produce
much less of the first type of behaviour. If the animal had
learned a stimulus-response association - i.e. when put in the
Skinner-box there is an association with producing the first
behaviour, then we would not expect to see less of behaviour one
even before the animal has obtained a reinforcer. If, on the other
hand, the association is between response and reinforcement then
devaluation of the reinforcer would be expected to have just the
effect observed on behaviour.
We can also demonstrate the R-S nature of operant associations by
presenting additional reinforcers not contingent on responding on
one of a pair of schedules. Imagine the same initial situation;
however, now instead of devaluing chocolate-drops by pairing
them with poison we begin to present some chocolate drops to the
animal in the Skinner-box whether or not it has met the schedule
contingency. The animal now again makes fewer of the first
'chocolate-drop' responses. Again, the stimuli (the SD, being in
the box and so on) have not changed yet a change in the
contingency between response and reinforcement has affected
behaviour.
Contingency or contiguity?
This leads us to another question. Instrumental learning normally
clearly depends on a contingency between response and reinforcement,
but must this always be the case? Normally, if a contingency is not
present - if responding has no effect on whether reinforcement is
obtained, then no learning occurs. There is, however, the possibility
that a contingency is perceived where, in fact, there is none. To
truly assess the contingency between response and reinforcement we
need to know both the chances of obtaining a reinforcer if we
respond and the chances of obtaining a reinforcer if we don't
respond. If we never evaluate the latter probability
because we are responding all the time then we may attribute a
contingency to responding where there is none. The opposite can also
occur. An extreme example of this is 'learned helplessness'. In the
first part of a learned helplessness experiment an animal is subject
to unavoidable shocks - there may be a potential path to escape, for
example a wall to jump over, but escape is impossible, for example
because the wall is too high. Soon the animal learns that escape is
impossible and ceases attempting it. If the animal is now moved to a
different situation in which escape is possible it will, nevertheless,
fail to learn. Because it never performs escape behaviour it
cannot discover that the chances of being shocked when it makes an
escape attempt now are different from those it experiences when not
behaving. The lack of contingency perceived between behaviour and
shock is illusory. In these circumstances then conditioning is really
being controlled by the contiguity of response and reinforcer not
their contingency. It should, however, be emphasised that, in
general, the effectiveness of instrumental learning depends on
contingency.
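The point about needing both probabilities can be made concrete. The sketch below is my own illustration in Python: it estimates the contingency as the difference between the chance of reinforcement after responding and the chance of reinforcement after not responding (a measure sometimes written as delta-P), and it returns None when one of the two probabilities cannot be estimated at all. The function name and the trial data are made up.

```python
def contingency(trials):
    """Estimate P(reinforcer | response) - P(reinforcer | no response).

    trials: iterable of (responded, reinforced) pairs of booleans.
    Returns None if either probability cannot be estimated because the
    animal always, or never, responded.
    """
    trials = list(trials)
    after_response = [reinf for resp, reinf in trials if resp]
    after_no_response = [reinf for resp, reinf in trials if not resp]
    if not after_response or not after_no_response:
        return None
    p_response = sum(after_response) / len(after_response)
    p_no_response = sum(after_no_response) / len(after_no_response)
    return p_response - p_no_response


# An animal that samples both options can estimate the contingency.
mixed = [(True, True), (True, True), (True, False), (True, True),
         (False, False), (False, False), (False, True), (False, False)]
print(contingency(mixed))            # 0.75 - 0.25 = 0.5

# An animal that never attempts to escape has nothing to compare.
never_responds = [(False, True)] * 6
print(contingency(never_responds))   # None
```

The second case corresponds to the helpless animal: because one of the two probabilities is never sampled, no amount of experience can reveal that the contingency has changed.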
Learning schedules.
Finally, as we noted above, when one considers partial schedules
of reinforcement the animal learns approximations to the
contingencies in operation, based on the samples of those
contingencies he is exposed to. He does not learn the contingency
as we have idealised it. If he did then on FI schedules no
response would be produced until the interval was up and in
concurrent schedules he would maximise not match.
Sources.
Schwartz is quite a good source for further details - I drew on parts
of chapters 5 and 7, but there is a lot more there to read - this
lecture only scratches the surface. It is interesting and alarming to
read about how far Skinner thought principles of operant conditioning
could be applied to everyday life - he wrote a 'utopian' novel about a
future society based on scientific behavioural principles called
'Walden Two' - it doesn't seem like much of a utopia to me. He also
discusses where his science of behaviour leaves freedom of
choice and morality in 'Beyond Freedom and Dignity'.