Proceedings paper



Eric D. Scheirer, Richard B. Watson, Barry L. Vercoe
Machine Listening Group
MIT Media Laboratory
E15-401D, Cambridge MA, 02139-4307 USA
{eds, watsonr, bv}


We conducted a listening experiment with 5-sec segments of natural music to investigate the human perception of musical complexity, and to discover physical features of stimuli that might underlie this percept. The judgments elicited were consistent across listeners and within different segments of a piece of music. A multiple-regression model based on a psychophysical model of sound processing in the auditory system was able to predict listeners' judgments of complexity. These results are consistent with the hypothesis that the perceived complexity of a musical signal is an important surface feature of music.

  1. Introduction
    When listeners hear a musical stimulus, they immediately orient themselves in the sound and use surface cues to make musical judgments, such as "this is by Bach" or "I hate this kind of music." This orientation process is apparently pre-conscious, relating to basic auditory organization rather than to high-level cognitive musical abilities. These musical judgments may be immanent or may lead to overt acts such as foot-tapping, speech acts, musical gestures (such as vocalization) or other observable behaviors.

    Naturalistic real-world settings exist that provide opportunities to see these behaviors in action. Perhaps the most significant is scanning the radio dial. A preliminary report on scanning-the-dial behavior and its implications was recently presented by Perrott and Gjerdigen (1999). They found that college students were able to accurately judge the genre of a piece of music (about 50% correct in a ten-way forced-choice paradigm) after listening to only 250-ms samples. The kind of musical information that is available after only 250 ms is quite different from the kind of information that is treated in the traditional sort of music-psychology experiment (notes, chords, and melodies).

    Immediate music-listening behaviors like these are fundamentally inexplicable with present models of music perception. It is not at all clear what sort of cognitive structures might be built that could support this sort of decision-making. The stimuli are too short to contain melodies, harmonic rhythms, or much hierarchical structure. On the other hand, the spectral content, in many styles of music, is not at all stationary even within this short duration. Thus, it seems quite possible that listeners are using dynamic cues in the short-time spectrum at least in part to make these judgments. This sort of description makes genre seem very much like timbre classification. Such a viewpoint is in concert with the writing of many modern-day composers on the relationship between timbre and orchestration (Erickson, 1985).

    We define the musical surface to be the set of representations and processes that result from immediate, preconscious, perceptual organization of an acoustic musical stimulus and that enable a behavioral response. There are then three questions that immediately concern us. First, what sorts of representations and processes are these? Second, what sorts of behaviors do they afford the human listener? Third, what is the interaction between the representations and the processes as the listening evolves in time?

    In this paper, we present exploratory experimental and computer-modeling research that investigates the role of perceived complexity in the musical surface.

  2. Listening Experiment
    As part of a larger project on the perception and modeling of immediate music-listening behavior (Scheirer, 2000), we conducted an experiment dealing directly with the human perception of musical complexity (along with a number of other perceptual attributes that will not be reported here). We define this perceptual feature to be the sense of "how much is going on." It is the scale on which listeners can rate sounds along a range from simple to complicated. This experiment was investigatory in nature and was not designed to test any particular hypotheses.

    2.1. Overview of procedure

    Thirty musically trained and untrained subjects listened to two five-second excerpts taken from each of 75 pieces of music. The subjects used a computer interface to listen to the stimuli and make judgments about them. Among the judgments elicited was the subjects' sense of the music as simple or complex.

      2.2. Subjects

      The subjects were drawn from the MIT community, recruited with posts on electronic and physical bulletin boards. Most (67%) were between 18 and 23 years of age; the rest ranged from 25 to 72 years. The median age was 21 years. Of the 30 subjects, 10 were male and 20 were female, although no gender-based differences were hypothesized in this experiment. All but four subjects reported normal hearing. Twenty-two reported that they were native speakers of English, and six reported that they were not.

      Nine subjects reported that they had absolute-pitch (AP) ability in response to the question "As far as you know, do you have perfect pitch?" No attempt was made to evaluate this ability, and it is not clear that all respondents understood the question. However, as reported below, there were small but significant differences on the experimental tasks between those who claimed AP and those who did not. The subjects had no consistent previous experience with musical or psychoacoustic listening tasks.

      After completing the listening task, subjects were given a questionnaire regarding their musical background, and thereby classified into three groups: M0 (nonmusicians, N = 12), M1 (some musical training, N = 15) and M2 (experienced musicians, N = 3). No formal tests of audiology or musical competence were administered.

      Breakdowns of musical ability by age and by gender are shown in Table 1. Note that the experiment was not counterbalanced properly for the evaluation of consistent demographic differences.

Table 1

    2.3. Materials
      The experimental stimuli were 5-second segments of real, natural music. Two non-overlapping segments were selected at random from each of 75 musical compositions. The 75 source compositions were selected by randomly sampling the Internet music site, which hosts a wide variety of musical performances in all musical styles by amateur and professional musicians. Samples were mixed down to mono by averaging the left and right channels, resampled to 24000 Hz, and amplitude-scaled such that the most powerful frame in the 5-second segment had power 10 dB below the full-power digital DC. The music was not otherwise manipulated or simplified. The stimulus set contains jazz, classical, easy-listening, country, and a variety of types of rock-and-roll music.

      It is worthwhile to explore the implications of this method of selecting experimental materials. is presently the largest music web site on the Internet, containing about 400,000 freely-available songs by 30,000 different performing ensembles. Using materials from such a site enables studies to more accurately reflect societal uses of than does selecting materials from personal music collections. The materials are certainly more weighted toward rock-and-roll and less toward music in the "Western classical" style than is typical in music-psychology experiments. However, this weighting is only a reflection of the fact that the listening population is more interested in rock-and-roll than it is in "Western classical" music.

      A second advantage of selecting music this way is that scientific principles may be used to choose the particular materials. In this case, since the set to be studied is a random sample of all the music on, it follows from the sampling principle that the results we will show below are applicable to all of the music on (within the limit of sampling variance, which is still large for such a small subset). This would not be the case if we simply selected pieces from a more limited collection to satisfy our own curiosity (or the demands of music theorists).

    2.4. Detailed procedure
      Subjects were seated in front of a computer terminal that presented the listening interface, as shown in Figure 1. The interface presented six sliders, each eliciting a different semantic judgment from the listener. The scales were labeled simple-complex, slow-fast, loud-soft, interesting-boring, and enjoyable-annoying (only the first will be directly discussed here). The subject was instructed that his task was to listen to short musical excerpts and report his judgments about them. Three practice trials were used to familiarize the subject with the experimental procedure and to set the amplification at a comfortable listening level. The listening level was allowed to vary between subjects, but was held fixed for all experimental trials for a single subject.

      Figure 1

      Each of the 150 stimuli (75 musical excerpts × 2 stimuli/excerpt) was presented in a random order, different for each subject. When the subject clicked on the Play button, the current stimulus was presented. After the music completed, the subject moved the sliders as he felt appropriate to rate the qualities of the stimulus. The subject was allowed to freely replay the stimulus as many times as desired, and to make ratings in any order after any number of playings. When the subject felt that the current settings of the rating sliders reflected his perceptions accurately, he clicked the Next button to go on to the next trial. The sliders were recentered for each trial.

      The subjects were encouraged to proceed at their own pace, taking breaks whenever necessary. A typical subject took about 45 minutes to complete the listening task.

    2.5. Dependent measures
      For each trial, the final setting of the simple-complex slider was recorded to a computer file. The computer interface produced a value from 0 (the bottom of the slider) to 100 (the top) for this rating on each trial. Any trial on which the slider was not moved at all (that is, for which the slider value was 50) was rejected and treated as missing data for that stimulus. Approximately 5.2% of the ratings were rejected on this basis.

      The response variables were shifted to zero-mean and scaled by a cube-root function to improve the normality of the distribution. After this transformation, the responses (labeled SIMPLE for brevity) lie on a continuous scale in the interval [-3.68, +3.68] and are bimodally distributed, with modes at about ±2.5. Two additional dependent variables were derived. The SIGN variable indicates only whether the response was above or below the center of the scale; it is a binary variable. The OFFSET variable indicates the magnitude of response deviation from the center of the scale on each trial, without regard to direction. It is calculated by collapsing the two lobes of the bimodal response distribution and is normally distributed.
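The transformation of the raw slider values into the three dependent variables can be sketched as follows (a minimal reconstruction; the function and variable names are ours, and we assume the slider center, 50, as the zero point):

```python
import numpy as np

def transform_ratings(raw):
    """Map raw slider values (0-100) onto the SIMPLE, SIGN, and OFFSET
    variables described in the text (our reconstruction)."""
    raw = np.asarray(raw, dtype=float)
    centered = raw - 50.0            # shift so the scale center is zero
    simple = np.cbrt(centered)       # signed cube root; range [-50^(1/3), +50^(1/3)], i.e. about [-3.68, +3.68]
    sign = (simple > 0).astype(int)  # binary above/below-center variable
    offset = np.abs(simple)          # deviation magnitude, direction collapsed
    return simple, sign, offset
```

Trials with a raw value of exactly 50 (slider untouched) would be rejected before this step, as described above.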

    2.6. Results

    A correlation (Pearson's r) test was run in order to investigate the relationship between the trial number (that is, the position of a particular stimulus in the random sequence of stimuli for a subject) and the dependent variable SIMPLE. This test explores possible learning or fatigue effects. The test was not significant (r(4272) = 0.005, n.s.). This is consistent with the null hypothesis that there are no learning or fatigue effects in this task.

    The pairwise intersubject correlations of subject responses were computed. Of the 435 intersubject pairs, 183 (42.1%) were significantly correlated at p < 0.05 or better. The mean intersubject correlation was r(150) = 0.177, p = 0.033. Thus, we may conclude that overall, subjects agreed on the judgment of complexity. However, the proportion of variance explained is rather small; if we choose two subjects at random, on average the ratings given by one subject explain only 3% of the variance in the ratings of the other. There were no differences between musicians' agreement with one another and non-musicians' agreement; 39.4% of the M0 subject ratings and 41.9% of the M1 subject ratings were significantly correlated. (2 of the 3 inter-M2-subject pairs were significant, which may bear further investigation).

    Since the stimuli were taken in pairs from the original musical sources, we may compare the ratings from the first excerpt of a song to the ratings from the second. Given a stimulus, we term the other excerpt from the same song the counterpart stimulus. On a subject-by-subject basis, the two excerpts elicited strongly correlated ratings (r(2040) = .340, p < .001). That is, given a subject's rating of one excerpt of a song, that rating explains on average 11.6% of the variance in the counterpart. This is so even though the two excerpts were selected at random and do not necessarily have any obvious similarity. When the ratings are pooled across subjects, the mean ratings of each stimulus and its counterpart are even more strongly correlated (r(150) = .502, p < .001).
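Both the intersubject agreement and the counterpart analysis reduce to Pearson correlations over rating vectors. A sketch (the data layout, and a fixed critical |r| standing in for the per-pair p < 0.05 test, are our assumptions):

```python
import numpy as np

def counterpart_correlation(first_excerpts, second_excerpts):
    """Pearson r between ratings of each song's first excerpt and the
    ratings of its counterpart (second) excerpt, paired song-by-song."""
    return np.corrcoef(first_excerpts, second_excerpts)[0, 1]

def significant_pair_fraction(ratings, r_crit):
    """Fraction of subject pairs whose rating vectors correlate beyond a
    critical |r| (a stand-in for the per-pair significance test in the
    text). `ratings` is a subjects-by-stimuli array."""
    r = np.corrcoef(ratings)
    n = len(ratings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    hits = sum(abs(r[i, j]) > r_crit for i, j in pairs)
    return hits / len(pairs)
```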

    Several analyses of variance were conducted to explore the relationship between subject demographics and the rated judgments of complexity. Results are summarized in Table 2. In each case, the dependent variable was OFFSET, calculated by collapsing the two lobes of the bimodal response distribution, since the main judgment was not normally distributed. OFFSET measures the degree to which subjects use the ends of the scale relative to the center. Rejecting the null hypothesis in an analysis of variance of OFFSET (that there is no effect of the subject condition) is a sufficient condition to reject the null hypothesis for the main variable, SIMPLE.

    As seen in the table, each of the independent variables had a significant effect on the subject ratings. The first two effects, based on subject and stimulus number, were expected. These effects indicate that some subjects consistently find all stimuli to be more complex than do other subjects, and that some stimuli are consistently rated more complex than others. The rest of the effects were unexpected and are difficult to interpret. The means and 95% confidence intervals of OFFSET, broken down by each of these independent variables, are plotted in Figure 2.

    Table 2. Analyses of variance of OFFSET by independent variable (subject, stimulus, musical ability, absolute pitch, native English, gender, and age); several effects were significant at p < 0.001.


    Figure 2

    Experienced musicians (M2 subjects) used the ends of the scale slightly more than other subjects. Subjects claiming absolute pitch used the ends of the scale slightly more. Subjects whose native language was not English, female subjects, and older subjects also used the ends of the scale more. Without many more subjects to fill out a complete multidimensional ANOVA, it is difficult to interpret these small but significant differences. One possibility is that the independent variables shown here are actually covariates of some unmeasured demographic variable that is more fundamental, perhaps corresponding to social cohort. Small but consistent effects of subject demographics similar to these have been measured in previous research on loudness judgments of natural music examples by Fucci et al. (1994).


  3. Computational modeling
    In parallel to the experimental research, we developed a psychoacoustic model that incorporates submodels of tempo and rhythm perception (Scheirer, 1998), auditory scene analysis (Scheirer, 1999), and the extraction of sensory features from musical stimuli. The auditory model is implemented as a set of signal-processing computer programs. It operates directly on the acoustic signal, not from symbolic models of stimuli, and so can be used to study naturalistic samples of music taken from compact discs or other acoustic sources.

    3.1. Modeling technique
      The psychoacoustic model extracted 16 features from each of the 150 musical excerpts. Brief descriptions of the features are shown in Table 3. Scheirer (2000) provides more details on these features and how they are extracted from musical signals. Note that there are no features that relate to the cognitive structure of the musical signal. All of the features deal with sensory aspects of the musical sound, such as loudness, pitch, tempo, and auditory scene analysis.




      Table 3. Features of the psychoacoustic model:
      Coherence of spectral assignment to auditory streams
      Stability of within-auditory-stream pitches over time
      Mean number of auditory streams present in signal
      Variance of number of auditory streams present in signal
      Mean amount of modulation (spectrotemporal change) in signal
      Entropy of loudness estimates in auditory streams
      Entropy of pitch estimates in auditory streams
      Loudness of loudest moment in signal
      Dynamic range (measured in loudness) of signal
      Most-likely tempo of signal
      Entropy of tempo-energy distribution in signal
      Stability of tempo estimates over time course of signal
      Centroid of tempo-energy distribution
      Number of beats elicited from foot-tapping model applied to signal
      Mean time between beats
      Variance of time between beats


      The features were entered in a multiple-regression procedure, where they were used to predict the mean complexity ratings for each stimulus that were collected in the experiment of Section 2. (Even though the individual ratings were bimodally distributed, the mean stimulus-by-stimulus ratings across all subjects were normally distributed, and so can be modeled with linear regression). Two kinds of multiple regressions were computed. The first entered all features at once, to determine how much of the mean complexity could be explained with this psychoacoustic model. The second entered the features one-at-a-time in a stepwise regression procedure, to see which features are most useful for explaining the primary degrees of freedom of the complexity judgments.
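The two regression procedures can be sketched with ordinary least squares (a schematic stand-in, not the study's software: the feature matrix and ratings below are synthetic, and the greedy forward selection uses a fixed step count rather than a significance-based stopping rule):

```python
import numpy as np

def fit_r2(X, y):
    """Ordinary least-squares fit of y on the columns of X (plus an
    intercept); returns the coefficient of determination R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def stepwise_forward(X, y, n_steps):
    """Greedy forward selection: at each step, enter the feature that most
    improves R^2 over the current model (a simplified stand-in for the
    stepwise procedure described in the text)."""
    chosen = []
    for _ in range(n_steps):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: fit_r2(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen, fit_r2(X[:, chosen], y)
```

Entering all features at once corresponds to `fit_r2(X, y)`; the stepwise model corresponds to `stepwise_forward(X, y, 5)`.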

    3.2. Modeling results
      The first model, in which all features were entered, was strongly significant, with R = 0.536 (p < 0.001). Thus, compared to the correlations with the counterpart stimuli calculated in Section 2.6, the psychoacoustic model explains slightly more of the variance in the ratings (R2 = 0.294 for the psychoacoustic model, r2 = .250 for the segment-to-segment correlation.) A scatterplot of the predicted ratings vs. the observed mean ratings is shown in Figure 3.

      Figure 3

      Further, when the psychoacoustic features and the counterpart ratings were included in a single regression, the combined R2 value was 0.448. This is remarkably close to the result (0.294 + 0.250 = 0.544) that would be obtained if the covariance explained by the counterpart ratings were precisely orthogonal to that explained by the psychoacoustic model. This finding is compatible with the hypothesis that the sources of complexity shared between each stimulus and its counterpart are primarily cognitive (musical style, genre, use of lyrics) while the sources of complexity captured in the psychoacoustic model are primarily sensory.

      The second model, in which the psychoacoustic features were entered in a stepwise regression, was strongly significant at every step, as shown in Table 4. The +/- signs on each feature in Table 4 indicate the direction of the partial correlation of that feature with the residual at that stage of the stepwise regression (recall that larger values for SIMPLE indicate simpler stimuli). In total, five features are entered in the stepwise model. Two of these are features that relate to the auditory scene analysis of the signal (CHANCOH and VARIM) and two are features that relate to the tempo and beat structure of the signal (VARIBI and BESTT). In some cases, the sign of the partial correlation seems counterintuitive. For example, the negative VARIM partial correlation indicates that, once the effects of CHANCOH are accounted for, stimuli are simpler when they have a more-frequently changing number of auditory streams. However, since each stage of the stepwise regression explains only the residual from the previous stage, it is impossible to interpret the role of the later features without a more detailed analysis of the feature covariance. The most important conclusion is that a model based on only five psychoacoustic features can explain nearly 20% of the variance in mean ratings of stimulus complexity.


      Table 4. Features entered in the stepwise regression of mean complexity ratings on psychoacoustic features; each step was significant at p < 0.001.

    3.3. Individual differences

    The results in the previous section indicate only that the overall mean ratings can be predicted with psychoacoustic models. It is also useful to explore individual within-subject ratings to examine whether they, too, can be predicted with such a model. Since the individual ratings are not normally distributed, a linear regression model is not appropriate. Rather, we converted the ratings into a binary response variable (above center/below center) and used logistic regression to model this variable, called SIGN. We computed 30 separate logistic regressions, one for each subject, using the 16 psychoacoustic features to predict SIGN. That is, the logistic regression for a subject tries to predict whether the subject gave a response above center, or below center, for each stimulus.

    In the grand average, 73.6% of the responses were correctly predicted (50% is the chance level). There is a clear advantage to using separate models for each subject. If a single model is used to predict the responses of all subjects, only 58.6% of the responses can be predicted correctly.
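The per-subject models can be sketched as follows (a minimal logistic regression fit by gradient descent, standing in for a statistics package; the features and responses below are synthetic, not the study's data):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit P(y=1) = sigmoid(w.x + b) by batch gradient descent on the
    log-loss; returns the weight vector and intercept."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of above/below-center (SIGN) responses predicted correctly."""
    pred = (X @ w + b) > 0
    return np.mean(pred == (y > 0.5))
```

One such model would be fit per subject, predicting that subject's SIGN responses from the 16 psychoacoustic features.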

    Of the 30 subjects, the responses of 14 of them (46.7%) could be modeled significantly well at the p < 0.05 level. The other 16 subjects could not be modeled in this fashion. For some of the nonmodeled subjects, the difficulty was that the responses given by that subject were so heavily weighted to one side of the complexity scale that the constant term in the model explains nearly all of the log-probability, leaving no residual for the predictors. For example, for subject #30, more than 80% of his/her responses were correctly predicted by the model, yet this performance can be expected reasonably often by chance (p = 0.18). Such results indicate that a larger set and even broader range of stimuli are required to evaluate these models more carefully.

    There were no significant effects of the demographic variables on the proportion of responses that could be predicted. The null hypotheses that musicians' responses are as easy as nonmusicians' to predict, males' as easy as females', and so forth, cannot be rejected with this testing methodology.

    Thirty independent stepwise logistic-regression analyses were also computed, to examine which features helped to predict the different subjects' ratings. In these analyses, since the number of predictors and thus the degrees of freedom are fewer, more of the analyses reach significance. 25 of the 30 subjects (83.3%) had their responses predicted significantly well with a logistic model containing between one and five predictors. The most frequently entered features were MEANMOD, entered in 8 of the 25 models; TEMPENT and VARIM, entered in 7 of the models; and BESTT, entered in 6 of the models. All of the features except DYNRNG were entered in at least one model.

    If many more subjects had been used in the study, it might be possible to divide them into groups based on the features that predict their responses. But this is difficult when, as here, the number of features is more than half the number of subjects. As one example of this sort of analysis, we divided the subjects into two groups. The first group consisted of those subjects (N = 12) for whom MEANMOD or CHANCOH were entered as predictors in the stepwise regression (these two features are strongly correlated, r = .270). The second group consisted of the rest (N = 18). Using these two groups, we determined how many of the intersubject correlations in rating patterns were significant, as was done for the whole subject pool in Section 2.6. 62.1% of the 66 intersubject correlations in the first group and 49.0% of the 153 intersubject correlations in the second group were significant at p < 0.05 or better. Thus both groups seem to be more homogeneous, according to this metric for homogeneity, than the subject pool as a whole.

    This argument by itself is not conclusive, as it is somewhat circular (second-order statistics are used to identify subjects to put into groups, and the groups are then found, with related second-order statistics, to have something in common). However, it suggests a method that a larger study might use to identify groups of subjects who share common strategies for making complexity judgments. This is a first step toward a broader study of individual differences in listening behavior.

  4. Discussion

Let us return to the concepts put forth in the introduction. We assume for the moment that there is a stage of perceptual processing that can reasonably be called the musical surface. How could we determine whether a particular feature of music (complexity, in this case) is a surface feature, and whether a particular judgment or behavior is based partly, mostly, wholly, or not at all on the surface features of music?

Of course we do not mean to argue that only surface information is used for making musical judgments. Surely, low-level surface information and high-level cognitive information interact in complicated ways in any music-perception situation. However, most previous research on music perception has focused exclusively on cognitive cues, such as tonal constraints, melody construction and identification, and other structural aspects of music. This approach limits both the styles of music that can be addressed (since the overwhelming majority of cognitive-structural hypotheses about music perception narrowly target Western classical music) and the explanatory power of the models. It is difficult to see how theories of music perception could ever relate to the acoustic signal when the basic theoretical elements are so distinct from the sensory aspects of hearing.

The modeling of cognitive aspects of music perception must be considered in relationship to the sensory modeling results that we have presented. The statistical results shown here demonstrate that significant proportions (more than a quarter) of the variance in human judgments of complexity can be explained without recourse to cognitive models. In other words, we have demonstrated that a sensory model suffices to explain a significant proportion of the variance in this judgment. The only explanatory space left to cognitive models remains in the residual.

The independence of the variance in judgments explained by the counterpart ratings, and that explained by the psychoacoustic model, allows us to formulate a coherent hypothesis regarding these two factors. Namely, the variance explained by the counterpart ratings is primarily due to cognitive or structural similarities and differences among a set of stimuli, while the variance explained by the psychoacoustic model is primarily due to sensory similarities and differences. One test for this hypothesis would be to control the length of the stimuli used in the listening task, as done by Perrott and Gjerdigen in their scanning-the-dial experiment. If the hypothesis is correct, as stimuli become very short, the counterpart ratings should be able to explain relatively less variance than the psychoacoustic model, because there will be little basis for examining structural similarities and differences among the stimuli. In contrast, as the stimuli become longer, the counterpart ratings should be able to explain relatively more variance, as the structural properties of the music become more important for mentally summarizing it for comparison.

A pressing question regarding experimental judgments of the sort we have reported here is that of individual differences. Although the intersubject variance in this task was small enough that experimental effects could be observed, it still seems large relative to the ratings being made. It is obviously inadequate to divide listeners so crudely into categories by their musical backgrounds.

Considering again the modeling results from Section 3, we can formulate several hypotheses regarding individual differences. The question at hand is what sorts of differences there are among listeners. We distinguish three hypotheses targeting only the sensory aspects of musical hearing (although we do not mean to claim that this list is exhaustive):

H1. There are no important differences among listeners. Different listeners use essentially the same features weighted the same way to make judgments.

H2. Individual differences are based on different weights applied to a single feature set. Each listener extracts the same auditory cues from sounds, and then these cues are combined with different weights to form judgments.

H3. Individual differences are based on different features of sound. Different listeners extract different cues and combine them in idiosyncratic ways to form judgments.

The present results are not compatible with hypothesis H1. If H1 were true, then a single regression model would be as good a model for subjects' judgments as the individually-adapted models. But we found that individual models could predict the subjects' judgments much more accurately than a single model.

We did not collect enough data in this experiment to distinguish H2 and H3. Although it is clear that different stepwise models enter different features, a few of the features are entered very often, and the overall space of features is really quite small. In the music-psychology literature, there seems to be almost no discussion of different listening strategies that listeners might adopt, the reason that different listeners (even those with similar musical experience) hear different things in music, or the perceptual and cognitive bases of musical preference. These topics must be considered crucial if we wish to develop a coherent psychology of music-listening behavior. Continuing evaluation of these hypotheses, and other hypotheses regarding individual differences in listening, awaits future research.


References

Erickson, R. (1985). Sound Structure in Music. Berkeley, CA: University of California Press.

Fucci, D., Petrosino, L., & Banks, M. (1994). Effects of genre and listeners' preference on magnitude-estimation scaling of rock music. Perceptual and Motor Skills, 78(3), 1235-1242.

Perrott, D., & Gjerdigen, R. O. (1999). Scanning the dial: An exploration of factors in the identification of musical style. Paper presented at the Society for Music Perception & Cognition, Evanston, IL.

Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1), 588-601.

Scheirer, E. D. (1999). Sound scene segmentation by dynamic detection of correlogram comodulation. Paper presented at the International Joint Conference on AI Workshop on Computational Auditory Scene Analysis, Stockholm.

Scheirer, E. D. (2000). Music-Listening Systems. Unpublished Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.

