Proceedings paper


Towards an on-line model of music perception


Dirk-Jan Povel and Erik Jansen

Nijmegen Institute for Cognition and Information

Nijmegen, The Netherlands



1 Introduction

1.1 Aim

1.2 Theoretical framework

1.3 Influential earlier studies

1.3.1 Van Dyke Bingham (1910)

1.3.2 Cuddy, Cohen, & Mewhort (1981)

2 Tracing the perceptual mechanisms in music processing.

2.1 Introduction

2.2 Experimental studies

2.2.1 Series containing diatonic and chromatic tones

2.2.2 Series only containing diatonic tones

2.2.3 Accentuation patterns

2.2.4 Harmonic factors

3 Conclusions

4 References


  1. Introduction
    1. Aim
    2. In the past fifty years a great many studies have revealed a large number of factors that play a role in the perception of music (Krumhansl, 2000). However, knowing which factors play a role is only a first step towards understanding music perception. Real insight presupposes a theory that specifies how these factors function in the processing of music, or more precisely a theory that specifies what transformations are performed on an input leading to a mental representation. Only such a theory can make specific predictions about how a concrete series of tones is perceived. Frameworks for a theory of music perception have been proposed by Deutsch & Feroe (1981) and Lerdahl & Jackendoff (1983). Although experimental evidence has been reported supporting these frameworks, no concrete predictions can be derived from these theories.

      The goal of this study is to develop a computational model, based on a set of assumptions, that captures the on-line processing of music. The model construes music perception in terms of 1) the activation of pertinent musical knowledge stored in the listener's long term memory, and 2) the application of perceptual mechanisms that organize the elements in the input into a coherent mental representation. The viability of the model is investigated in experiments that examine how perception evolves while the stimulus is presented incrementally, by studying goodness judgments and the expectations that arise in the process. The model we are developing mainly pertains to the stage in which the elements in the input are transformed into a mental representation.

      The points of departure of this study are: 1) a theoretical framework and 2) two earlier experimental studies.

    3. Theoretical framework
    4. First we present a global outline of a model of music perception that schematically represents the primary processes in music perception. See Figure 1.

      First, the scheme indicates that music perception is a process in which two types of information interact: bottom-up information, consisting of the series of pitches presented to the listener (represented as f1, f2, f3,... in the figure), and top-down information represented by all knowledge relevant to music perception stored in long term memory. Second, the scheme conveys the incremental character of music perception by the cyclic pattern in which pitch input is entered sequentially and a succession of processes is executed repeatedly. Third, three groups of processes are displayed, all relying on information stored in long term memory (LTM) denoted by the arrows, and each generating different perceptual products. The first group of processes relates to the establishment of the interpretative frames required: key-inference and meter-inference respectively. The second group of processes is concerned with encoding in which series of pitches in the input are grouped into chunks on the basis of structural regularities. In the third group of processes, the chunks generated in the first encoding phase are integrated into even larger chunks, leading to a complete mental representation of the input. Next we shall describe these processes in more detail.

      The process of music perception may be conceived as the mapping of the input onto musical knowledge stored in the long term memory of the listener, and as the application of perceptual mechanisms to an input consisting of a sequence of pitches. The aim of the process is to transform a series of unconnected pitches into an integrated mental representation in musical terms. A sequence of sounds that is conceived musically (rather than linguistically or otherwise) will be mapped onto two dimensions: the pitch-height dimension yielding the pitch of the sound, and the key-dimension yielding the attribute of scale-degree. The key-dimension is the hierarchically organized mental tone space in which the relations between tones and chords are specified (Krumhansl, 1990). It serves as an interpretational frame that supplies the musical function of the sounds. As soon as the pitches of a sequence have activated a specific key, they are identified as tones in a scale. All tones in a key are associated with a certain degree of 'stability' (e.g., the first tone of the scale is the most stable tone, the last tone, the 'leading tone', the least stable), and with a tendency to resolve to other tones (Cooke, 1959; Povel, 1996; Zuckerkandl, 1956). Thus, in making a musical interpretation of a tone series, the tones function simultaneously in these two dimensions. Each dimension plays a specific role in the formation of musical percepts:

      Figure 1. A global model of music processing

      The main characteristic of the pitch-height dimension contributing towards melody formation is obviously pitch, especially pitch proximity. Melodies tend to proceed by step, rather than by leap, thus forming smooth contours; segmentation of tones is based in part on the principle of proximity: tones relatively close in pitch will form perceptual clusters or groups.

      The characteristics of the key-dimension contributing towards melody formation are related to the properties of scales and chords. The availability of a scale makes it possible to describe a sequence of consecutive scale tones as a 'run' using the 'next' relation between the tones (the sequence C D E F G, for instance, can be represented as 4N(C): start with C and add four times the next (N) element). The key-dimension also allows the description of a series of tones as a chord. For instance, the first three tones of the series C E G A B C may be recognized as a major triad and encoded as such. Another principle associated with the key-dimension is that of anchoring. This mechanism, first described by Bharucha (1984), links or 'anchors' an unstable tone to a more stable tone. Anchoring is based on the notion of a hierarchical relationship between the tones in a key, with for instance the diatonic tones (the tones of the scale) being hierarchically higher than the chromatic (non-scalar) tones (Lerdahl, 1988). Tones lower in the hierarchy, the less stable tones, are attracted by tones higher in the hierarchy, the more stable tones (Povel, 1996; Zuckerkandl, 1956). Bharucha (1984, 1996) has shown that a tone can only be anchored to a tone close in pitch (1 or 2 semitones) that follows the tone to be anchored, usually, but not necessarily, the immediately succeeding tone. Phenomenally, the anchored tone is perceived as an ornament of the tone to which it is anchored.
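The run notation described above can be illustrated with a minimal sketch. The function name `encode_run`, the C-major scale list, and the restriction to ascending runs are assumptions made for this illustration; they are not part of the model as reported.

```python
# Hypothetical sketch: encode consecutive ascending scale tones as a run code.
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]  # assumed scale, octave ignored


def encode_run(tones, scale=C_MAJOR):
    """Return a code such as '4N(C)' if `tones` are consecutive ascending
    scale steps; return None if the series is not such a run."""
    if len(tones) < 2 or any(t not in scale for t in tones):
        return None
    positions = [scale.index(t) for t in tones]
    for a, b in zip(positions, positions[1:]):
        # Each tone must be the 'next' (N) scale element after its predecessor.
        if (a + 1) % len(scale) != b:
            return None
    return f"{len(tones) - 1}N({tones[0]})"


print(encode_run(["C", "D", "E", "F", "G"]))  # 4N(C)
print(encode_run(["C", "E", "G"]))            # None (a chord, not a run)
```

A series that fails this test, such as C E G, would be left for another mechanism (for example chord recognition) to encode.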

      Within the theoretical frame proposed, the process of music perception can now be conceived as follows: Given a tone sequence presented in a specific tonal and metrical context, perceptual mechanisms are applied to the input, establishing relations between its elements and leading to a representation in terms of clusters of tones. Thus we assume that the aim of a listener is to generate a mental description or code that encompasses as many of the elements in the input as possible. If this process is successful, the listener will have the impression that (s)he understands the input, that it makes sense musically. If, conversely, a listener does not succeed in finding such relations, no coherent musical percept will result.

    5. Influential earlier studies
      1. Van Dyke Bingham (1910)
      2. Two studies have played a role in shaping this research. The first is Van Dyke Bingham (1910) who studied the factors that determine whether or not a tone series is perceived as a melody. As an example he describes two sequences respectively containing the pitches: c' e' g' e' f' d' c' and c' f' d' g' e' f' c'. The first of these was judged by listeners as a coherent sequence in which the sounds seem to follow each other naturally, thus forming an esthetic unity, i.e. a melody. The second sequence, however, was judged to be a non-melody. Van Dyke Bingham asserts that the concept of tonality plays a decisive role in the processing of tone series as expressed in his definition of the term:

        'By a tonality is meant a group of mutually related tones, organized about a single tone, the tonic, as the center of relations. Subjectively, a tonality is a set of expectations, a group of melodic possibilities within which the course of the successive tones must find its way, or suffer the penalty of not meeting these expectations or demands of the hearer and so of being rejected as no melody.' (pp. 36-37)

        From this study we borrowed some of the theoretical ideas proposed above, as well as the idea for a response, asking people whether a tone series can be conceived as a melody.

      3. Cuddy, Cohen, & Mewhort (1981)

    The second study is the seminal article by Cuddy, Cohen & Mewhort (1981) in which the authors studied the perception of tone sequences having "varying degrees of musical structure". Starting from the prototypical sequence {C4 E4 G4 F4 D4 B3 C4}, they constructed a set of sequences by altering one or more tones, thereby gradually degrading the "harmonic structure", contour complexity, and excursion size (interval between first and last tone). From the results of Experiment 1, in which subjects judged the "tonality or tone structure" of 32 seven-tone sequences, 5 levels of harmonic structure were constructed by combining 3 rules: 1) diatonicism (whether or not a series consists of only diatonic tones); 2) leading-note-to-tonic ending; 3) the extent to which a sequence follows a I - V - I harmonic progression. These levels of harmonic structure were factorially combined with 2 levels of contour complexity and 2 levels of excursion, yielding 20 stimuli. The stimuli were recognized under transposition (Experiment 2), and the tonal structure of the stimuli was rated in Experiment 3. Findings indicate that the ratings were mostly influenced by the factor harmonic structure and less by contour and excursion.

    The importance of this study is twofold: the use of similarity judgments and goodness ratings to measure the perception of tone sequences, and its aim to discover factors that play a role in the perception of tone series. Yet the study is limited in a number of respects. First, the concept of harmonic structure is rather ambiguous: the ordering of the 5 levels of harmonic structure is not theoretically but empirically determined. This means that it is unclear how the three rules precisely determine the variable harmonic structure. Second, it is not clear what the subjects actually judged: besides being asked to judge the tonality or tonal structure, they were instructed 'to reserve the highest ratings for sequences with "musical keyness" or "completeness" and to assign lower scale values to sequences that contained "unexpected" or "jarring" notes.' (p. 875). Thus it seems likely that the subjects judged how well the tone series sounded as a melody; this is supported by a study of Smith & Cuddy (1986) that obtained comparable results when listeners rated the same sequences on "pleasingness". Third, although the rules affect the 20 sequences used in the study, it is unclear to what extent the rules can be generalized to other tone sequences. For instance, the sequences {C4 E4 G4 F4 B3 D4 C4} and {C4 E4 G4 B3 D4 G4 C4}, which violate the leading-tone-to-tonic ending, will probably be rated about as high as the prototypical sequence {C4 E4 G4 F4 D4 B3 C4}, which obeys that rule; and the sequence {C4 E4 F#4 G4 F4 D4 B3 C4}, which violates the rule of diatonicism (but allows anchoring), will probably be rated much higher than the sequence {C4 E4 G4 F4 D#4 B3 C4} used in the study. These examples do not undermine the general finding that harmonic structure plays a role, but indicate that the definition of harmonic structure in terms of stimulus characteristics is still incomplete. Finally, although the study shows that perception is strongly influenced by the presence of detectable structure in tone sequences, it does not indicate the concrete processes that are performed on the input resulting in a mental description.

  2. Tracing the perceptual mechanisms in music processing.
    1. Introduction
    2. As stated before, the general aim of our research is to understand the on-line processes that a listener performs when perceiving music. This processing is conceived as the application of mechanisms that combine elements in the input into larger chunks. The concrete goal is to develop a computational model that describes how mechanisms are applied to the input leading to a more or less successful mental description of the input. The success of the undertaking is determined by how well predictions derived from the model are borne out in experiments.

      Thus the specific goal of this study is to understand why some tone series are perceived as a melody and other series are not. On the assumption that a tone series is considered a melody if the perceiver can create an efficient code that includes, if possible, all tones of the series, the challenge for the approach is to discover all perceptual mechanisms that listeners use in coding music.

      There are a number of tasks one may use to examine the perception of tone series. Listeners may be asked to judge the goodness or pleasantness (tonal structure etc.) of a series, or they can be asked to indicate whether the series contains jarring notes. In other tasks subjects judge the similarity between notes (for which a transposition paradigm may be used), or indicate which tones they expect at different moments in the series. In our experiments we have used goodness judgments and expected continuations. These experiments are described below.

    3. Experimental studies
      1. Series containing diatonic and chromatic tones
      2. In a few experiments (Povel & Jansen, 1998) we studied the perception of a series of tone sequences consisting of a subset of all orderings of the collection {C4 E4 F#4 G4 Bb4}. The presentation of a tone sequence was preceded by the chords C7 - F to induce the key of F-major.

        Based on a pilot study in which subjects judged how well fragments of the tone sequences sounded as a melody, it was hypothesized that a tone series is judged a melody if either one or both of the mechanisms chord recognition and anchoring can be applied to the series. Chord recognition is the mechanism that describes a series of tones as a chord, and anchoring is the mechanism that links a tone to a (chord) tone occurring later in the series. Applied to the stimuli used in the experiments, a sequence of tones may be conceived as a chord, namely C7, which is feasible when the F#4, which does not belong to the chord, can be "anchored" to a subsequent G4. Anchoring (Bharucha, 1984) may either be immediate, when the G follows the F# as in the tone series {C4 E4 F#4 G4 Bb4}, or more or less delayed, when one or more tones intervene between the F# and G as in the series {E4 F#4 C4 G4 Bb4} or {Bb4 F#4 E4 C4 G4}.
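The distinction between immediate and delayed anchoring can be made concrete with a small sketch that counts the tones intervening between the F#4 and a later G4. The function name `anchoring_delay` and its return convention are assumptions for illustration only.

```python
# Hypothetical sketch: classify the anchoring of F#4 to a subsequent G4.
def anchoring_delay(series, unstable="F#4", anchor="G4"):
    """Return the number of tones between `unstable` and a later `anchor`
    (0 means immediate anchoring); return None when the anchor does not
    follow the unstable tone, so that anchoring is not possible."""
    if unstable not in series or anchor not in series:
        return None
    i, j = series.index(unstable), series.index(anchor)
    return j - i - 1 if j > i else None


print(anchoring_delay(["C4", "E4", "F#4", "G4", "Bb4"]))  # 0 (immediate)
print(anchoring_delay(["E4", "F#4", "C4", "G4", "Bb4"]))  # 1 (delayed)
print(anchoring_delay(["Bb4", "F#4", "E4", "C4", "G4"]))  # 2 (delayed)
print(anchoring_delay(["G4", "C4", "F#4", "E4", "Bb4"]))  # None (G precedes F#)
```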

        This hypothesis was tested in two experiments using a paradigm in which the participants heard stepwise lengthened fragments (beginning with a fragment of length three) and rated the melodic goodness of the fragment (Experiment 1) or played a few tones that completed the fragment (Experiment 2). It was found that goodness ratings were highest if the fragment only contained elements of the C7 chord, lower if the F# was immediately followed by the G, still lower if one tone intervened between F# and G, and lowest if the G preceded the F#. Unexpectedly, it was found that series in which two or three tones intervened between F# and G were rated higher than those with only one tone between F# and G. As in these series the non-fitting F# occurred relatively early and the last three or four tones formed a C7 chord, this finding was tentatively explained by assuming that goodness ratings are mainly based on the most recent tones heard. Listeners' expectations collected in the second experiment corroborated the above findings: series that activate the chord C or C7 (according to the hypothesis) tended to be continued with the tone F, whereas series ending with the tone F# tended to be continued with the tone G, later followed by the tone F.

        Overall, the results support the hypothesis that the coding of these sequences was based on the application of the mechanisms chord recognition and anchoring. As the interaction between the two mechanisms is still not quite understood, we decided to subsequently study tone sequences only containing diatonic tones.

      3. Series only containing diatonic tones
      4. In this experiment 20 subjects rated the goodness of 60 orderings of the collection {D4 E4 F4 G4 A4 B4} on a 5-point scale. In the experiment each series was preceded by the chords G7 and C to induce the key of C-major and presented at a different pitch height. To explain the results a number of computational models were developed based on a set of general assumptions concerning music perception and a number of specific assumptions regarding the processing of music. The general assumptions were that the pitches that are the basic constituents of a tone sequence can be conceived in two ways: 1) as a sequence of pitches forming a contour, the sequential regularities of which can be described in a code; 2) as a sequence of tones conceived within a key, as a result of which the tones acquire the perceptual attributes stability and expectation. These assumptions lead to the hypothesis that a tone sequence will be judged a melody if the listener can mentally construct a code that includes all tones and in which the raised expectations are resolved.

        Regarding the coding aspect we assume a number of mechanisms that organize elements in the input into higher order mental units such as: runs, chords, trills, motives, ornaments etc.

        The expectations that are created when the input is interpreted in a key, are described in terms of vectors. A vector has a direction, that points to some future musical unit, and a magnitude representing the strength of the expectation. Specific assumptions regarding vector assignment are: 1) vectors may be created by all mental units in which the listener codes the input, e.g. tones and chords; 2) vectors are assigned by reference to the currently activated region. For example in the series {B4 F4 G4 D4 A4}, the first four tones will induce the chord G7, as a result of which the tone A will get a vector pointing towards the closest most stable element in the G7 chord, namely G. Specific assumptions regarding vector resolution are: a vector will resolve (disappear) 1) with time (the magnitude decreasing with some time function); 2) if the expected tone occurs either immediately or after some delay; 3) if the vector-carrying tone is integrated in a code (e.g. if the series {D4 E4 F4 G4 A4 B4} is conceived as a run, only the last tone B4 will carry a vector).
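The notion of an expectation vector can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `ExpectationVector`, the exponential decay function, and the decay rate are hypothetical, since the text specifies only that magnitude decreases "with some time function".

```python
# Hypothetical sketch of an expectation vector and two ways it can resolve.
from dataclasses import dataclass
import math


@dataclass
class ExpectationVector:
    source: str              # unit (tone or chord) that carries the vector
    target: str              # expected future unit the vector points to
    magnitude: float = 1.0   # strength of the expectation
    resolved: bool = False

    def decay(self, steps=1, rate=0.5):
        # Assumption: exponential decay; the rate is an arbitrary choice.
        self.magnitude *= math.exp(-rate * steps)

    def hear(self, tone):
        # The vector resolves when the expected tone occurs,
        # whether immediately or after some delay.
        if tone == self.target:
            self.resolved = True
            self.magnitude = 0.0


# In {B4 F4 G4 D4 A4} the final A4 points to G4, the closest stable
# element of the induced G7 chord.
vector = ExpectationVector(source="A4", target="G4")
vector.decay()        # time passes: the expectation weakens
vector.hear("D4")     # an unexpected tone leaves the vector in place
vector.hear("G4")     # the expected tone resolves it
print(vector.resolved)  # True
```

Resolution by integration in a code (the third case in the text) is not modeled here; it would amount to marking the vectors of all but the last tone of a recognized run as resolved.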

        Based on these assumptions, a model was developed that describes the coding of the tone series in terms of runs and chords, and the resolution of expectations in terms of the logic of the succession of recognized chords. The model was implemented as follows: Neighboring tones having an interval of 1 or 2 semitones are chunked into runs, while the remaining tones are recognized as triads on one of the seven scale degrees. Several assumptions regarding chord recognition were made as a tone series may in principle allow for several harmonic interpretations. For instance, the series {E4 A4 F4 D4 B4 G4} presented in the key of C-major, may activate several chords: vi, (E4 A4); IV, (A4 F4); ii, (A4 F4 D4); vii, (F4 D4 B4); V, (D4 B4 G4); V7, (F4 D4 B4 G4); and V9, (A4 F4 D4 B4 G4). Chord recognition was implemented as follows: 1) three different subsequent tones always lead to the unique identification of a chord; 2) two tones forming an interval of a fifth also always lead to the identification of a unique chord; 3) a series of two different tones forming an interval of a third is interpreted as the major triad (I, IV, or V) in which that third occurs; 4) vii is interpreted as V.
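The chunking and chord recognition rules above might be sketched as follows for the key of C-major. All names (`midi`, `chunk_runs`, `recognize_chord`) are hypothetical, and two simplifications are assumed: run chunking ignores changes of direction, and tone pairs not covered by the four rules (such as D4 F4 or a tritone) are left unrecognized.

```python
# Hypothetical sketch of run chunking and chord recognition in C major.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
TRIADS = {  # diatonic triads as sets of note names (octave ignored)
    "I": {"C", "E", "G"}, "ii": {"D", "F", "A"}, "iii": {"E", "G", "B"},
    "IV": {"F", "A", "C"}, "V": {"G", "B", "D"}, "vi": {"A", "C", "E"},
    "vii": {"B", "D", "F"},
}
MAJOR_TRIADS = ("I", "IV", "V")


def midi(note):
    """Convert a name such as 'F#4' or 'Bb4' to a MIDI number (C4 = 60)."""
    name, octave = note[:-1], int(note[-1])
    pitch_class = NOTE_OFFSETS[name[0]] + name.count("#") - name[1:].count("b")
    return 12 * (octave + 1) + pitch_class


def chunk_runs(series):
    """Group neighboring tones 1 or 2 semitones apart into runs; tones that
    join no run are returned as single-tone groups."""
    if not series:
        return []
    groups = [[series[0]]]
    for prev, cur in zip(series, series[1:]):
        if 1 <= abs(midi(cur) - midi(prev)) <= 2:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def recognize_chord(tones):
    """Apply recognition rules 1-4 to two or three diatonic tones;
    return a scale degree, or None when no rule applies."""
    names = {t[:-1] for t in tones}
    fits = [d for d, triad in TRIADS.items() if names <= triad]
    chord = None
    if len(names) == 3 and fits:            # rule 1: three different tones
        chord = fits[0]
    elif len(names) == 2 and fits:
        a, b = (NOTE_OFFSETS[n] for n in names)
        interval = abs(a - b)
        if interval in (5, 7):              # rule 2: fifth (or its inversion)
            chord = fits[0]
        elif interval in (3, 4, 8, 9):      # rule 3: third -> major triad
            chord = next((d for d in fits if d in MAJOR_TRIADS), None)
    return "V" if chord == "vii" else chord  # rule 4: vii is read as V


print(chunk_runs(["B4", "F4", "G4", "D4", "A4"]))
# [['B4'], ['F4', 'G4'], ['D4'], ['A4']]
print(recognize_chord(["A4", "F4", "D4"]))  # ii
print(recognize_chord(["F4", "D4", "B4"]))  # V
print(recognize_chord(["A4", "F4"]))        # IV
```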

        Finally, the logic of chord order is based on Piston's (1941/1989) Table of usual root progressions, in which three categories of progressions were distinguished occurring in descending order of frequency. Examples of progressions in the three categories are respectively: I - V; I - vi, and I - iii.

        All these assumptions were incorporated in a series of computational models that describe the incremental processing of a tone series by specifying which chords are activated at each point in the series and computing the degree of logic of the chord progression. Depending on the weights assigned to each of the parameters the models explain 45% - 62% of the variance.
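How a "degree of logic" might be computed from a sequence of recognized chords is sketched below. The category table contains only the three example progressions named in the text (I - V, I - vi, I - iii); the weights, the averaging, and the names `CATEGORY`, `WEIGHT`, and `progression_logic` are assumptions for illustration and are not the parameters of the reported models.

```python
# Hypothetical sketch: scoring the logic of a chord progression.
# Only the three example progressions from the text are filled in;
# a complete table would have to be taken from Piston (1941/1989).
CATEGORY = {("I", "V"): 1, ("I", "vi"): 2, ("I", "iii"): 3}
WEIGHT = {1: 1.0, 2: 0.5, 3: 0.25}  # assumed weights, most usual = highest


def progression_logic(chords):
    """Average weight over successive chord pairs; pairs missing from
    the table score 0."""
    pairs = list(zip(chords, chords[1:]))
    if not pairs:
        return 0.0
    return sum(WEIGHT.get(CATEGORY.get(p), 0.0) for p in pairs) / len(pairs)


print(progression_logic(["I", "V"]))    # 1.0
print(progression_logic(["I", "iii"]))  # 0.25
```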

        The attempt to design a model that completely specifies all steps in the incremental processing of the tone series used in this study has been most instructive. Among other things it shows the considerable complexity of the process due to the parallel operation of the different mechanisms and their intricate interactions. As a result, a fairly large number of assumptions are needed to develop a computational model that renders concrete predictions about the processing of these apparently simple tone sequences. It should be noted that one of the reasons why we need so many assumptions is that the stimuli used are musically quite ambiguous: although the sequences are supposedly conceived in a specific key, several alternative chordal interpretations are still possible (as shown in the example above). We shall return to this issue later.

        Besides the large number of assumptions needed, there is the problem that the reliability of the responses provided by the subjects is relatively low (average correlation between subjects being .33), indicating that different subjects may have based their judgments on different aspects of the tone series. This poses a threat to the potential explanatory value of the model.
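One common way to obtain an average correlation between subjects is to take the mean Pearson correlation over all pairs of subjects. The text does not state how the reported value was computed, so the sketch below is an assumption, and the ratings in the usage example are invented toy data.

```python
# Hypothetical sketch: mean pairwise Pearson correlation between subjects.
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long rating lists
    (undefined when either list has zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_intersubject_correlation(ratings):
    """`ratings` holds one list per subject, all over the same stimuli."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Invented toy ratings for three subjects and four stimuli.
toy_ratings = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]]
print(round(mean_intersubject_correlation(toy_ratings), 3))  # -0.333
```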

      5. Accentuation patterns
      6. In this study we examined whether inter-subject correspondence increases if, besides the key, a metrical interpretation is also induced. For this purpose we presented the same 60 tone series from the previous study in four different metrical conditions. In condition 1, the first and fourth tone of the tone series were stressed in order to induce a three-part metric pattern. In condition 2, tones 1, 3, and 5 of the tone sequence were stressed, inducing a two-part metric pattern starting on a downbeat. In condition 3, a two-part metric pattern starting on an upbeat was induced by stressing tones 2, 4, and 6 of the sequence. In condition 4, finally, none of the tones were stressed; stimulus presentation in this condition was therefore the same as in the previous study. Twenty subjects participated in the experiment. Surprisingly, it was found that the agreement between respondents was not higher for conditions 1, 2, and 3 than for condition 4 (mean correlations for the 4 conditions were respectively: .25, .29, .32, and .35). At present we have no explanation for this puzzling result, but the low correlations between subjects are rather disquieting. (See Jansen & Povel, 1999.)

      7. Harmonic factors

    In the previous studies we have examined the perception of tone series that contained both steps (intervals of 1 or 2 semitones) and leaps (intervals larger than 2 semitones). We have seen that the processing of these series presupposes the operation of several perceptual mechanisms, notably chunking into a run, anchoring, and chord recognition. Because it is not yet completely understood how these mechanisms interact, several ad-hoc assumptions had to be posited making the modeling of the complete process rather problematic. Therefore we decided to study the perception of sequences only containing leaps, assuming that in these sequences only the mechanism of chord recognition will take effect. This allows us to study the mechanism of chord recognition in isolation.

    In an experiment, 32 six-tone sequences were constructed in which a segmentation into two groups of 3 tones was induced, each group consisting of one of the three triads I, IV and V. Chord progressions were formed by 4 different combinations of these triads. Contour complexity of the sequences was also manipulated. Listeners rated these tone sequences on a 7-point scale for melodic goodness. The results indicate that goodness responses are determined by the usualness of the perceived implied harmonic progression and the contour complexity of the sequences. This leads to the conclusion that the mental representation of these sequences consists of a description of their underlying harmony as well as their sequential structure. (See Jansen & Povel (2000) for details.)

  3. Conclusions

The studies described in this paper lead to the following conclusions.

  1. In the procedure applied in the studies reported above we have attempted to discover the perceptual mechanisms that operate in different sets of well-defined stimuli. Next, the mechanisms are incorporated in computational models that describe the transformations performed on the elements in the input. Predictions derived from the model are subsequently tested in experiments. Although the enterprise so far has not been completely successful, we have gained considerable insight into the workings of a number of mechanisms and into the ways they interact. We are confident that this approach will ultimately lead to a greater insight into how music is processed in the experimental settings used.
  2. One of the responses we have used, the judgment of goodness, may not be the best suited to examine the perception of tone series because it tends to yield rather unreliable data. Although we initially thought that this response is directly related to the process of perception, it turns out that because of its relative vagueness, listeners seem to use different criteria in their replies. However, this is a general problem in music perception studies: it is difficult to ask questions that directly relate to concrete aspects of the stimulus. The expectations that arise while listening to music may be better suited as a firsthand indicator of the processes involved.
  3. As shown above, the single tone sequences used pose a problem in that they tend to be musically underspecified and therefore ambiguous. It should be noted that such sequences are not representative of actual music, in which the interpretative context is usually much richer, thereby reducing the amount of ambiguity. Therefore it may seem better to study more realistic musical samples. From an experimental viewpoint, however, using existing pieces of music is not a solution, because they do not allow the systematic manipulation of the variables involved. The ambiguity of the single tone series can be reduced by presenting them in a context that unambiguously establishes the musical frames of interpretation: key and meter.
  4. The starting point of this study has been to develop a model of music perception that describes the transformations that are performed on the input leading to a mental representation. The tacit assumption underlying this goal, namely that it is feasible to develop one comprehensive theory of music perception, may not be realistic. Given that music is a highly complex multidimensional event, it seems likely that the context in which music is listened to greatly determines what aspect(s) the listener attends to. This could imply that separate models of music processing have to be developed for the various tasks studied.
  4. References

Bharucha, J. J. (1984). Anchoring effects in music: The resolution of dissonance. Cognitive Psychology, 16, 485-518.

Bharucha, J. J. (1996). Melodic anchoring. Music Perception, 13, 383-401.

Cooke, D. (1959). The Language of Music. Oxford: Oxford University Press.

Cuddy, L. L., Cohen, A., & Mewhort, D. J. (1981). Perception of structure in short melodic sequences. Journal of Experimental Psychology: Human Perception and Performance, 7, 869-883.

Deutsch, D., & Feroe, J. (1981). The internal representation of pitch sequences in tonal music. Psychological Review, 88, 503-522.

Jansen, E. L., & Povel, D. J. (1999). Mechanisms in the perception of accented melodic sequences. Proceedings of the 1999 Conference of the Society for Music Perception and Cognition. Evanston, Ill. p. 29.

Jansen, E. L., & Povel, D. J. (2000). The role of implied harmony in the perception of brief tone sequences. This proceedings.

Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch. New York: Oxford University Press.

Krumhansl, C. L. (2000). Rhythm and pitch in music cognition. Psychological Bulletin, 126, 159-179.

Lerdahl, F. (1988). Tonal pitch space. Music Perception, 5, 315-350.

Lerdahl, F., & Jackendoff, R. (1983). A generative theory of tonal music. Cambridge, MA: MIT Press.

Piston, W. (1941/1989). Harmony. London: Victor Gollancz Ltd.

Povel, D. J. (1996). Exploring the elementary harmonic forces in the tonal system. Psychological Research, 58, 274-283.

Povel, D. J., & Jansen, E. (1998). Perceptual mechanisms in music perception. Internal Report NICI.

Smith, K. C., & Cuddy, L. L. (1986). The Pleasingness of Melodic Sequences: Contrasting Effects of Repetition and Rule-familiarity. Psychology of Music, 14, 17-32.

Van Dyke Bingham, W. (1910). Studies in melody. Psychological Review, Monograph Supplements. Vol. XII, Whole No. 50.

Zuckerkandl, V. (1956). Sound and Symbol. Princeton University Press.

