Proceedings paper


Dynamics of Tonality Induction: A new method and a new model

Carol Lynne Krumhansl and Petri Toiviainen, Cornell University and University of Jyväskylä

Tonality induction refers to the process through which the listener develops a sense of the key of a piece of music. The concept of tonality is central to Western music, but eludes definition. From the point of view of musical structure, tonality is related to a cluster of features, including musical scale (usually major or minor), chords, the conventional use of sequences of chords in cadences, and the tendencies for certain tones and chords to suggest or be "resolved" to others. From the point of view of experimental research on music cognition, tonality has implications for establishing hierarchies of tones and chords and for inducing certain expectations in listeners about how sequences will continue. One method for studying the perception of tonality, the probe tone method, has been used extensively, and a new variant of it will be described here. In addition to experimental studies, considerable effort has gone into computational modeling, producing a variety of symbolic and neural network models. A new approach to computational modeling will be described, which lends itself to a dynamic geometric representation of tonality perception.

Probe tone methodology: The retrospective judgment

The experimental method introduced in the Krumhansl and Shepard (1979) study is sometimes referred to as the probe tone method. It is best illustrated with a concrete example. Suppose you hear the tones of the ascending C major scale: C D E F G A B. There is a strong expectation that the next tone will be the tonic, C, first, because it is the next logical tone in the series and, second, because it is the tonic of the key. In the experiment, the incomplete scale context was followed by the tone C (the probe tone), and listeners were asked to judge how well it completed the scale on a numerical scale (1 = very bad, 7 = very good). As expected, the C received the maximal rating. Other probe tones, however, also received fairly high ratings, and they were not necessarily those that are close to the tonic C in pitch. For example, the most musically trained listeners also gave high ratings to the dominant, G, and the mediant, E. In general, the tones of the scale received higher ratings than the non-scale tones, C# D# F# G# A#. This suggested that it was possible to get quantitative judgments of the degree to which different tones are perceived as stable, final tones in tonal contexts.

The subsequent Krumhansl and Kessler (1982) study used this method with a variety of musical contexts at the beginning of the trials. They were chosen because they are clear indicators of the key. They included the scale, the tonic triad chord, and chord cadences in both major and minor keys. These were followed by all possible probe tones in the chromatic scale, which the listeners were instructed to judge in terms of how well they fit with the preceding context in a musical sense. Different major keys were used, as were different minor keys, but the results were similar when transposed to a common tonic. Also, the results were similar independent of which particular context was used. Consequently, the data were averaged over these factors. We call the resulting values the K-K profiles, which can be expressed as vectors. The vector for major keys is: K-K major = <6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88>. The vector for minor keys is: K-K minor = <6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17>.

We can generate K-K profiles for 12 major keys and 12 minor keys from these. If we adopt the convention that the first entry in the vector corresponds to the tone C, the second to C#/Db, the third to D, and so on, then the vector for C major is: <6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88>, the vector for C# major is: <2.88, 6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29>, and so on. The vectors for the other keys are obtained by shifting the entries the appropriate number of places, so that the first value of the profile falls on the tonic of the key.
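The shifting rule can be sketched in a few lines of Python with NumPy. This is only an illustration: the function name `key_profiles` and the row ordering (12 major keys from C to B, then 12 minor keys from C to B) are assumptions made here, not part of the original study.

```python
import numpy as np

# K-K profiles as given in the text (first entry = tonic).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def key_profiles():
    """Return a (24, 12) array of key profiles indexed by pitch class C..B.

    Rows 0-11 are the major keys C, C#, ..., B; rows 12-23 are the minor
    keys in the same order (an ordering assumed for this sketch).
    """
    # np.roll(v, k) moves the tonic value from position 0 to position k.
    majors = [np.roll(KK_MAJOR, k) for k in range(12)]
    minors = [np.roll(KK_MINOR, k) for k in range(12)]
    return np.vstack(majors + minors)

profiles = key_profiles()
c_sharp_major = profiles[1]  # starts 2.88, 6.35, 2.23, ... as in the text
```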

Krumhansl and Kessler (1982) then used these data to study how the sense of key develops and changes over time. They used ten nine-chord sequences, some of which contained modulations between keys. Listeners did the probe tone task after the first chord, then after the first two chords, then after the first three chords, and continued until the full sequence was heard. This meant that 12 (probe tones) x 9 (chord positions) x 10 (sequences) = 1080 judgments were made by each listener. Each of the 90 sets of probe tone ratings (9 chord positions x 10 sequences) was compared with the ratings made for the unambiguous key-defining contexts. That is, each set of probe tone ratings was correlated with the K-K profiles for the 24 major and minor keys. For some of the sets of probe tone ratings (some probe positions in some of the chord sequences), a high correlation was found, indicating a strong sense of key. For other sets of probe tone ratings, no key was highly correlated, which was interpreted as an ambiguous sense of key.
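As a rough sketch of this comparison, the following Python code correlates one set of twelve probe tone ratings with all 24 K-K profiles. The function name `key_correlations`, the key ordering, and the example ratings are assumptions for illustration; the example input is simply the A minor profile itself, not experimental data.

```python
import numpy as np

KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
TONICS = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
          "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]
KEY_NAMES = [t + " major" for t in TONICS] + [t + " minor" for t in TONICS]

def key_correlations(ratings):
    """Correlate 12 probe tone ratings (C..B) with the 24 key profiles."""
    ratings = np.asarray(ratings, dtype=float)
    profiles = np.vstack([np.roll(KK_MAJOR, k) for k in range(12)] +
                         [np.roll(KK_MINOR, k) for k in range(12)])
    # Product-moment (Pearson) correlation with each key profile.
    return np.array([np.corrcoef(ratings, p)[0, 1] for p in profiles])

# Illustrative input only: ratings identical to the A minor profile.
example_ratings = np.roll(KK_MINOR, 9)
correlations = key_correlations(example_ratings)
best_key = KEY_NAMES[int(np.argmax(correlations))]
```

A set of ratings whose highest correlation is low for every key would correspond to the ambiguous cases described above.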

Probe tone methodology: The concurrent judgment

As should be clear from the above, the retrospective probe tone method requires an intensive empirical effort to trace how the sense of key develops and changes, even for short sequences. In addition, the sequence has to be interrupted before each judgment is made. For these reasons, the judgments may not faithfully mirror the experience of music in time, and we were motivated to develop an alternative form of the probe tone methodology. In this method, which we call the concurrent judgment, the probe tone is presented continuously while the music is played. The complete passage is sounded together with a probe tone. Then the passage is sounded again, this time with another probe tone. This process is continued until all probe tones have been sounded.

In our initial application of this method, the passage was J. S. Bach's Organ Duetto IV, BWV 805. Its duration is slightly longer than three minutes. The piece contains an interesting pattern of modulations including a repeated, highly chromatic passage. At the beginning of the session, the listener heard the entire passage from beginning to end without any probe tone so that they could become familiar with the piece. During each trial, the piece was repeated twelve times, each time with a different probe tone. The probe tone was sounded over six octaves spanning the range of the piece, similar to a 'Shepard tone'. The order of the probe tones was determined randomly and was different for each subject.

To reduce the effects of sensory dissonance, the probe tone was sounded only in the left ear, while the music was sounded only in the right ear. To help listeners continue to attend to the probe tone, it was pulsed at the beginning of each measure. Listeners were instructed to use a computer mouse to move a slider left and right to indicate the extent to which the probe tone fit with the music. The left end of the scale was labeled "Fits poorly"; the right end was labeled "Fits well". The computer program, written in MAX, recorded the position of the slider every 200 msec. Because the task requires concentration, only highly trained musicians were run in this initial application.

A geometric map of key distances from the tonal hierarchies

Krumhansl and Kessler (1982) used the K-K profiles to generate a geometric representation of musical keys. The basic assumption underlying this approach was that two keys are closely related to each other if they have similar tonal hierarchies. That is, keys were assumed to be closely related if tones that are stable in one key are also relatively stable in the other key. To measure the similarity of the profiles, a product-moment correlation was used. It was computed for all possible pairs of major and minor keys, giving a 24 x 24 matrix of similarity values showing how similar the tonal hierarchy of each key is to every other key. The correlations between the C major profile and the 24 major and minor keys, and the correlations between the C minor profile and all the 24 major and minor keys were presented in Krumhansl (1990, p. 38). To give some examples, C major correlated relatively strongly with A minor (.651), G major and F major (both .591), and with C minor (.511). C minor correlated relatively strongly with Eb major (.651), C major (.511), Ab major (.536), and F minor and G minor (both .339). The same transposition-shift principle can be used to find the correlations for all pairs of major and minor keys.
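The 24 x 24 similarity matrix can be sketched as follows in Python. The variable names and the key ordering (rows 0-11 major keys from C, rows 12-23 minor keys from C) are assumptions of this sketch.

```python
import numpy as np

KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

# One row per key: 12 major keys (C..B), then 12 minor keys (C..B).
profiles = np.vstack([np.roll(KK_MAJOR, k) for k in range(12)] +
                     [np.roll(KK_MINOR, k) for k in range(12)])

# np.corrcoef treats each row as a variable, so the result is the
# 24 x 24 matrix of product-moment correlations between key profiles.
key_similarity = np.corrcoef(profiles)

c_major_vs_g_major = key_similarity[0, 7]
c_major_vs_f_major = key_similarity[0, 5]
```

With the profile values given earlier, `c_major_vs_g_major` and `c_major_vs_f_major` both come out near .591, in line with the figures quoted above. The two are equal because shifting a profile up by seven places or down by seven places gives the same correlation.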

A technique called multidimensional scaling was then used to create a geometric representation of the key similarities. The algorithm locates 24 points (corresponding to the 24 major and minor keys) in a spatial representation to best represent their similarities. It searches for an arrangement such that points that are close correspond to keys with similar K-K profiles (as measured by the correlations). In particular, non-metric multidimensional scaling seeks a solution such that distances between points are (inversely) related by a monotonic function to the correlations. A quantity called 'stress' measures the amount of deviation from the best-fitting monotonic function. The algorithm can search for a solution in any specified number of dimensions. In this case, a good fit to the data was found in four dimensions.

The four-dimensional solution located the 24 keys on the surface of a torus (generated by one circle in dimensions 1 and 2, and another circle in dimensions 3 and 4). Because of this, any key can be specified by two values: its angle on the first circle and its angle on the second circle. The result can be depicted in two dimensions as a rectangle where it is understood that the left edge is identified with the right edge, and the bottom edge is identified with the top edge. The solution obtained was similar to that shown in Figure 1 (see below). As can be seen, the locations of the 24 keys are interpretable in terms of music theory. There is one circle of fifths for major keys (... F#/Gb, Db, Ab, Eb, Bb, F, C, G, D, A, E, B, F#/Gb ...) and one circle of fifths for minor keys (... f#, c#, g#, d#/eb, bb, f, c, g, d, a, e, b, f# ...). These wrap diagonally around the torus such that each major key is located near both its relative minor (for example, C major and A minor) and its parallel minor (for example, C major and C minor).


Figure 1. a) The configuration of a toroidal SOM trained with the 24 K-K profiles; b) the response of one subject, displayed on the SOM, at a point with a clear tonality (at 9.5 measures); c) the response of Model 1 at the same point as in b; d) the response of the subject at a point with a less clear tonality (at 49 measures); e) the response of Model 1 at the same point as in d; f) the response of the subject at a point with a weak tonality (at 89 measures); g) the response of Model 1 at the same point as in f.


Representing the sense of key on the torus

The continuous spatial medium in which the 24 major and minor keys are located makes it possible to represent the changing sense of key in graphical form. Krumhansl and Kessler (1982) used a technique called multidimensional unfolding to do this. It is a method that is closely related to multidimensional scaling. Multidimensional unfolding begins with a multidimensional scaling solution, in this case the torus representation of the 24 major and minor keys. This solution is considered fixed. The algorithm then finds a point in the multidimensional scaling solution to best represent the sense of key at each point in time. Let P1 be the probe tone ratings after the first chord in a sequence; it is a 12-dimensional vector of ratings for each tone of the chromatic scale. This vector is correlated with each of the 24 K-K vectors, giving a 24-dimensional vector of correlations. The unfolding algorithm finds a point to best represent these correlations. Suppose P1 correlates highly with the K-K profile for F major and fairly highly with the K-K profile for D minor. Then the unfolding algorithm will produce a point near these keys and far from the keys with low correlations. Then the vector of correlations is computed for P2, and this process continues until the end of the sequence.

In this manner, each of the ten nine-chord sequences used by Krumhansl and Kessler (1982) generated a series of nine points in the torus representation of keys. For nonmodulating sequences, the points remained in the neighborhood of the intended key. For the modulating sequences, the first points were near the initial intended key, then shifted to the region of the second intended key. Modulations to closely related keys appeared to be assimilated more rapidly than those to distantly related keys, that is, the points shifted to the region of the new key more rapidly.

Measurement assumptions of the multidimensional scaling and unfolding methods

The above methods make a number of assumptions about measurement, only some of which will be noted here. The torus representation is based on the assumption that correlations between the K-K profiles are appropriate measures of interkey distance. It further assumes that these distances can be represented in a relatively low-dimensional space (four dimensions). This latter assumption is supported by the low stress values (high goodness-of-fit values) of the multidimensional scaling solution. It was further supported by a subsidiary Fourier analysis of the K-K major and minor profiles, which found two relatively strong harmonics (see Krumhansl, 1990, p. 101). In fact, a plot of the phases of the two Fourier components for the 24 key profiles was virtually identical to the multidimensional scaling solution. This supports the torus representation, which consists of two orthogonal circular components. Nonetheless, it would seem desirable to see whether an alternative method with completely different assumptions reproduces the same toroidal representation of key distances.

The unfolding method also adopts correlation as a measure of distances from keys, this time using the ratings for each probe position and the K-K vectors for the 24 major and minor keys. The unfolding technique finds the best-fitting point in the four-dimensional space containing the torus. It does not provide a way of representing cases in which no key is strongly heard because it cannot generate points outside the space containing the torus. Thus, an important limitation of the unfolding method is that it does not provide a representation of the strength of the key or keys heard at each point in time. For this reason, we sought a method that is able to represent both the region of the key or keys that are heard, together with their strengths.

SOM map of keys

The self-organizing map (SOM; Kohonen, 1997) is an artificial neural network that simulates the formation of ordered feature maps. The SOM consists of a two-dimensional grid of units, each of which is associated with a reference vector. Through repeated exposure to a set of input vectors, the SOM settles into a configuration in which the reference vectors approximate the set of input vectors according to some similarity measure; the most commonly used similarity measures are the Euclidean distance and the direction cosine. The direction cosine between an input vector $\mathbf{x}$ and a reference vector $\mathbf{m}$ is defined by

$$\cos\theta = \frac{\mathbf{x} \cdot \mathbf{m}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{m} \rVert}. \tag{1}$$

Another important feature of the SOM is that its configuration is organized in the sense that neighboring units have similar reference vectors. For a trained SOM, a mapping from the input space onto the two-dimensional grid of units can be defined by associating any given input vector with the unit whose reference vector is most similar to that particular input vector. Because of the organization of the reference vectors, this mapping is smooth in the sense that similar vectors are mapped onto adjacent regions. Conceptually, the mapping can be thought of as a projection onto a non-linear surface determined by the reference vectors.

We trained the SOM with the 24 K-K profiles. The SOM had a toroidal configuration, that is, the left and the right edges of the map were connected to each other as were the top and the bottom edges. The resulting map is displayed at the top of Figure 1. The configuration of the SOM is highly similar to the multidimensional scaling solution (Krumhansl & Kessler, 1982) and the Fourier-analysis-based projection (Krumhansl, 1990) obtained with the same set of vectors. Furthermore, Euclidean distance and direction cosine used as similarity measures in training the SOM yielded identical maps.
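A minimal sketch of training a toroidal SOM on the 24 profiles is given below, using Euclidean distance as the similarity measure. The grid size, number of epochs, learning-rate and neighborhood schedules, and the function names are all assumptions chosen for illustration; the paper does not report the training settings, and this is not the authors' implementation.

```python
import numpy as np

KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
profiles = np.vstack([np.roll(KK_MAJOR, k) for k in range(12)] +
                     [np.roll(KK_MINOR, k) for k in range(12)])

def best_matching_unit(weights, x):
    """Grid position of the unit whose reference vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_toroidal_som(data, rows=12, cols=18, epochs=100, seed=0):
    """Train a SOM whose grid wraps around in both directions (a torus).

    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    weights = rng.uniform(data.min(), data.max(),
                          size=(rows, cols, data.shape[1]))
    r_idx, c_idx = np.meshgrid(np.arange(rows), np.arange(cols),
                               indexing="ij")
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac) + 0.01          # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5     # shrinking neighborhood
        for x in rng.permutation(data):
            br, bc = best_matching_unit(weights, x)
            # Wrap-around grid distances: left/right and top/bottom
            # edges are connected.
            dr = np.abs(r_idx - br)
            dr = np.minimum(dr, rows - dr)
            dc = np.abs(c_idx - bc)
            dc = np.minimum(dc, cols - dc)
            h = np.exp(-(dr ** 2 + dc ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, :, None] * (x - weights)
    return weights

weights = train_toroidal_som(profiles)
key_positions = [best_matching_unit(weights, p) for p in profiles]
```

The entries of `key_positions` give, for each key, the grid unit that best matches its profile, which is the localized mapping described above.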

Representing the sense of key on the SOM

In addition to this localized mapping, a distributed mapping can be defined by associating each unit with an activation value. For each unit, this value depends on the similarity between the input vector and the reference vector of the unit. Specifically, the units whose reference vectors are highly similar to the input vector have a high activation, and vice versa. The activation value of each unit can be calculated, for instance, using the direction cosine of Equation 1. Dynamically changing data from either probe-tone experiments or key-finding models can be visualized as an activation pattern that changes over time. The location and spread of this activation pattern provide information about the perceived key and its strength. More specifically, a focused activation pattern implies a strong sense of key and vice versa.
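The distributed mapping can be sketched as a direction cosine computed between the input vector and every reference vector on the grid. The function name `activation_map` and the tiny 2 x 2 grid of two-dimensional reference vectors are hypothetical and serve only to make the arithmetic easy to follow; a trained SOM would supply a (rows, cols, 12) array instead.

```python
import numpy as np

def activation_map(weights, x):
    """Direction cosine between input x and each unit's reference vector.

    weights has shape (rows, cols, dim); the result has shape (rows, cols).
    """
    weights = np.asarray(weights, dtype=float)
    x = np.asarray(x, dtype=float)
    dots = weights @ x
    norms = np.linalg.norm(weights, axis=2) * np.linalg.norm(x)
    return dots / norms

# Toy reference vectors (illustrative only, not a trained map).
toy_weights = np.array([[[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 1.0], [2.0, 0.0]]])
activation = activation_map(toy_weights, [1.0, 0.0])
```

Units whose reference vectors point in the same direction as the input receive an activation of 1 regardless of vector length, and orthogonal units receive 0.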

Tone transitions and key-finding

All the key-finding models presented to date are static in the sense that they ignore the temporal order of tones. The order in which tones are played may, however, provide additional information that is useful for key-finding. This is supported by studies on both tone transition probabilities (Fucks, 1962; Youngblood, 1958; Knopoff & Hutchinson, 1978) and perceived stability of tone pairs in a tonal context (Krumhansl, 1979, 1990). Fucks (1962) found that, in samples of compositions by Bach, Beethoven, and Webern, only a small fraction of all the possible tone transitions were actually used (the fractions were 23, 16, and 24 percent, respectively). Furthermore, Youngblood (1958) showed that, in a sample of 20 songs by Schubert, Mendelssohn, and Schumann, there is an asymmetry in the transition frequencies in the sense that certain tone transitions were used more often than the same transitions in reverse order. For instance, the transition B-C was used 93 times, whereas the transition C-B was used only 66 times. A similar asymmetry was found in the study on perceived stability of tone pairs in a tonal context by Krumhansl (1990). The study showed that, after the presentation of a tonal context, tone pairs that ended with a tone that was high in the tonal hierarchy were given higher ratings than the same pairs in reverse order. For instance, in the context of C major, the ratings for the transitions B-C and C-B were 6.42 and 3.67, respectively.

Determining tone transitions in a piece of polyphonic music is not a trivial task, especially if one aims at a representation that corresponds to perceptual reality. Even in a monophonic piece, the transitions can be ambiguous in the sense that their perceived strengths may depend on the tempo and may vary from one individual to another. Consider, for example, the tone sequence C4-G3-D4-G3-E4, where all the tones have equal durations. When played slowly, this sequence is heard as a succession of tones oscillating in pitch. With increasing tempi, however, the subsequence C4-D4-E4 becomes increasingly prominent. This is because it is segregated from the stream of tones due to the temporal and pitch proximity of its members. With polyphonic music, the ambiguity of tone transitions becomes even more obvious. Consider, for instance, the sequence consisting of a C major chord followed by a D major chord, where the tones of each chord are played simultaneously. In principle, this passage contains nine different tone transitions. Some of these transitions are, however, perceived as stronger than the others. For instance, the transition G-A is, due to pitch proximity, perceived as stronger than the transition G-D.

It thus seems that the analysis of tone transitions in polyphonic music should take into account principles of auditory stream segregation (see Bregman, 1990). Furthermore, it may be necessary to code the presence of transitions on a continuous instead of a discrete scale. In other words, each transition should be associated with a strength value instead of just coding whether that particular transition is present or not. Below, a dynamical system that embraces these principles is described. In regard to the evaluation of transition strength, the system bears a resemblance to the model of apparent motion in music presented by Gjerdingen (1994).

Pitch transition model

Let the piece of music under examination be represented as a sequence of tones, where each tone is associated with pitch, onset time, and duration. The main idea of the model is the following: given any tone in the sequence, there is a transition from that tone to all the tones following that particular tone. The strength of each transition depends on three factors: pitch proximity, temporal proximity, and duration of tones. More specifically, a transition between two tones has the highest strength when the tones are close in both pitch and time and have long durations. These three factors are included in the following dynamical model.

Representation of input. The pitches of the chromatic scale are numbered consecutively. The onset times of tones having pitch are denoted by , , and the offset times by,, where is the total number of times the kth pitch occurs.

Pitch vector . Each component of the pitch vector has non-zero value whenever a tone with the respective pitch is sounding. It has the value of 1 at each onset at the respective pitch, decays exponentially after that, and is set to zero at the tone offset. The time evolution of is governed by the equation

, (2)

where denotes the time derivative of and the Dirac delta function (unit impulse function). The time constant has the value of . With this value, the integral of saturates at about 1 sec after tone onset, thus approximating the durational accent as a function of tone duration (Parncutt, 1994).
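The verbal description of the pitch vector (value 1 at onset, exponential decay while the tone sounds, zero at offset) can be sketched in closed form for a single pitch. Everything named here is an assumption for illustration: the function names are hypothetical, and the time constant `TAU_P = 0.5` seconds is only a placeholder, because the value used in the model did not survive in this copy of the text.

```python
import numpy as np

TAU_P = 0.5  # seconds; placeholder assumption, not the model's stated value

def pitch_component(t, onsets, offsets, tau=TAU_P):
    """Value of one pitch-vector component at times t.

    Equal to 1 at each onset, decaying as exp(-(t - onset) / tau) while the
    tone sounds, and 0 from the offset onward.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    p = np.zeros_like(t)
    for on, off in zip(onsets, offsets):
        sounding = (t >= on) & (t < off)
        p[sounding] = np.exp(-(t[sounding] - on) / tau)
    return p

def integrated_component(duration, tau=TAU_P):
    """Integral of the component over one tone of the given duration."""
    return tau * (1.0 - np.exp(-duration / tau))

# One tone from 0.0 s to 1.0 s, sampled at onset, mid-tone, and after offset.
samples = pitch_component([0.0, 0.5, 2.0], onsets=[0.0], offsets=[1.0])
accents = integrated_component(np.array([0.25, 0.5, 1.0, 2.0, 4.0]))
```

The integral grows quickly for short tones and levels off toward `tau` for long ones, which is the saturating behavior the text relates to durational accent.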

Pitch memory vector . The pitch memory vector provides a measure of both the perceived durational accent and the recency of notes played at each pitch. In other words, a high value of indicates that a tone with pitch and a long duration has been played recently. The dynamics of are governed by the equation


The time constant determines the dependence of transition strength on the temporal distance between the tones. In the simulations, the value of has been used, corresponding to typical estimates of the length of the auditory sensory memory (Darwin, Turvey & Crowder, 1972; Fraisse, 1982; Treisman, 1964).

Transition strength matrix . The transition strength matrix provides a measure of the instantaneous strengths of transitions between all pitch pairs. More specifically, a high value of indicates that a long tone with pitch has been played recently and a tone with pitch is currently sounding. The temporal evolution of is governed by the equation

. (4)

In this equation, the non-linear term is used for distinguishing between simultaneously and sequentially sounding pitches. This term is non-zero only when , that is, when the most recent onset of pitch has occurred more recently than that of pitch . The term weights the transitions according to the interval size. For the parameter , the value has been used. With this value, a perfect fifth gets a weight of about 0.37 times the weight of a minor second.

Dynamic tone transition matrix . The dynamic tone transition matrix is obtained by temporal integration of the transition strength matrix. At a given point of time, it provides a measure of the strength and recency of each possible tone transition. The time evolution of is governed by the equation


where the time constant is equal to , that is, .

To examine the role of tone transitions in key-finding, we developed two key-finding models. Model 1 is based on pitch class distributions. Model 2 is based on tone transition distributions. Below, a brief description of the models is given.

Key-finding Model 1

Model 1 is based on pitch class distributions only. It uses a pitch class vector , which is similar to the pitch vector used in the dynamic tone transition matrix, except that it ignores octave information. Consequently, the vector has 12 components that represent the pitch classes. The pitch class memory vector is obtained by temporal integration of the pitch class vector according to the equation

. (6)

Again, the time constant has the value . To obtain estimates for the key, vector is correlated with the probe-tone rating profiles for each key.

Key-finding Model 2

Model 2 is based on tone transitions. Using the dynamic transition matrix , it calculates the octave-equivalent transition matrix according to


In other words, transitions whose first and second tones have identical pitch classes are considered equivalent, and their strengths are added. Consequently, the pitch direction of a transition (ascending or descending) is not taken into account, although the order of the two pitch classes is retained. To obtain estimates for the key, the pitch class transition matrix is correlated with the matrices representing the perceived stability of two-tone transitions for each key (Krumhansl, 1990).
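The folding of a pitch-by-pitch transition matrix into a 12 x 12 pitch class matrix can be sketched as below. The function name and the assumption that pitch index 0 corresponds to a C (so that the pitch class is the index modulo 12) are choices made for this illustration, and the example strengths are invented.

```python
import numpy as np

def fold_to_pitch_classes(transitions):
    """Sum a (P, P) pitch transition matrix into a (12, 12) matrix.

    Entry [k, l] of the input is the strength of the transition from pitch
    k to pitch l; entries with the same pair of pitch classes are added.
    Pitch index 0 is assumed to be a C.
    """
    transitions = np.asarray(transitions, dtype=float)
    pcs = np.arange(transitions.shape[0]) % 12
    folded = np.zeros((12, 12))
    # Unbuffered add, so repeated (row, column) targets accumulate.
    np.add.at(folded, (pcs[:, None], pcs[None, :]), transitions)
    return folded

# Invented example over two octaves: an ascending B-to-C step and a
# descending B-to-C leap land in the same pitch class cell.
example = np.zeros((24, 24))
example[11, 12] = 1.0   # B (lower octave) up to C (upper octave)
example[23, 12] = 2.0   # B (upper octave) down to C (upper octave)
pc_transitions = fold_to_pitch_classes(example)
```

Both example transitions end up in the B-to-C cell, while the C-to-B cell stays empty, so the order of the pitch classes survives the folding even though the registral direction does not.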

Sample results

Figure 1 shows some sample results from one of the participants in the experiment, a highly trained musician. This musician is a graduate student of composition with more than twenty years of performance experience on the piano and some additional years on other instruments. Figure 1 b shows the results for the listener at measure 9.5. A V-I cadence in A minor has just occurred and the melody contains a descending diatonic line ending on a half-note A, followed by a tonic - leading tone - tonic alternation. This is the conclusion of the opening passage, which is played by the left hand only; the right hand joins at this point in time. As can be seen, the sense of tonality is strongly focused on A minor. Figure 1 c shows the results for Model 1, which are highly similar, again with a strong focus on A minor. (Model 2 results were in general highly similar to Model 1, agreeing with the subject slightly more than Model 1. Because of issues about how best to visualize the results of Model 2, we show only Model 1 here.) Figure 1 d, e show the results at measure 49. The right hand contains what would be a tonic - leading tone - tonic in E major and E minor; the mode is ambiguous because both G and G# appear. This leads to an ambiguity that spreads to other closely related keys, which contain the other chromatic tones, C#, D#, and A#, that appear in this passage. Figure 1 f, g show the results at measure 89. As can be seen, no clear tonal focus is found. The music is highly chromatic; of the 12 tones of the chromatic scale, all but G# appear in the three preceding measures. Thus, these results suggest that both listeners and the algorithm can generate musically interpretable and highly dynamic representations of tonality.


Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: M.I.T. Press.

Darwin, C. J., Turvey, M. T., & Crowder, R. G. (1972). An auditory analogue of the Sperling partial report procedure: Evidence for brief auditory storage. Cognitive Psychology, 3, 255-267.

Fraisse, P. (1982). Rhythm and tempo. In D. Deutsch (Ed.), The psychology of music. San Diego, CA: Academic.

Fucks, W. (1962). Mathematical analysis of the formal structure of music. IRE Transactions on Information Theory, 8, 225-228.

Knopoff, L. & Hutchinson, W. (1978). An index of melodic activity. Interface, 7, 205-229.

Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.

Krumhansl, C. L. (1990). Cognitive foundations of musical pitch. New York: Oxford.

Krumhansl, C. L., & Kessler, E. J. (1982). Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89, 334-368.

Krumhansl, C. L. & Shepard, R. N. (1979). Quantification of the hierarchy of tonal functions within a diatonic context. Journal of Experimental Psychology: Human Perception and Performance, 5, 579-594.

Parncutt, R. (1994). A perceptual model of pulse salience and metrical accent in musical rhythms. Music Perception, 11, 409-464.

Treisman, A. M. (1964). Verbal cues, language, and meaning in selective attention. American Journal of Psychology, 77, 206-219.

Youngblood, J. E. (1958). Style as information. Journal of Music Theory, 2, 24-35.

