Proceedings paper
Unresolved Issues in Continuous Response Methodology: The Case of Time Series Correlations
Emery Schubert
School of Music and Music Education
University of New South Wales
Sydney 2052 NSW
AUSTRALIA
Phone: +61-2-9385 6808
Fax: +61-2-9313-7326
Email: E.Schubert@unsw.edu.au
ABSTRACT
Background: While continuous response methodologies have become increasingly popular among researchers of emotional response to music, the literature is very light on critical analysis of the methodology.
Aims: This paper investigates the common formats in which the methodology has appeared: open-ended, checklist, and rating scale; the kinds of problems for which it has been used: validation, comparison, the relationship between stimulus and response, and the dynamic lag structure of the music–emotion system; and the analytic techniques which have been applied: interocular tests, correlation analysis, repeated measures approaches, and traditional time series analytic techniques.
Main Contribution: The most popular continuous response format is the rating scale; however, there is little experimental evidence to support the reliability of this format over the checklist or the open-ended format. Also unclear is the kind of rating scale to use (unipolar or bipolar), the number of scales to use simultaneously (one, two, or three), the response sampling rate, and the label identifiers of the scales. Another serious problem facing continuous response research is the analysis of data. While time-series textbooks have for a long time warned against the use of visual inspection as the sole method of analysing continuous data, the literature is riddled with conclusions based on just such a technique.
Implications: In this paper I argue that elementary methods of time series analysis can be applied by researchers to produce a more valid basis for investigating their data. I also argue that continuous response methodology in music–emotion research is in its infancy, as evidenced by the large proportion of validation and comparative studies. If and when the methodology matures, its most beneficial application will be in helping to understand the dynamic structure of the music–emotion system, rather than basic stimulus-response relationships, which traditional asynchronous approaches can address more efficiently.
FULL PAPER
Introduction
A common problem in studying emotional responses to music is that of lacking ecological (naturalistic) validity. In the typical study, a listener will hear an excerpt of music and at the end of the excerpt he or she will be asked to indicate the emotion expressed by the music or experienced by the listener (e.g., Gabrielsson & Juslin, 1996; Heinlein, 1928). This is a highly efficient way of collecting data on emotional response to music. However, such instantaneous responses cannot tap into the subtle patterns of emotion which change from moment to moment through the course of a listening. For example, they cannot provide information about the lag structure between one response and another, or between response and stimulus (Schubert & Dunsmuir, 1999).
One of the remedies for this problem is to measure responses to the musical stimulus continuously. Instead of making a response at the end of an excerpt, the individual is continually assessing the expressed or experienced emotion during listening. Such continuous responses enable the researcher to build up a profile of the relationship between the stimulus and the response within a more realistic musical context and psychologically valid framework. However, this methodology brings with it a range of problems, many of which are yet to surface.
In this presentation I will mention methodological issues and concerns of which researchers using continuous response devices should be aware. I will then focus on one such problem, specifically the question of correlating comparative time series data.
General Methodological Issues
Two broad categories of problems in continuous response methodology are the response task requirement and the analysis of continuous response data. An example of the response task problem is that continuous concentration on the response task is itself unnaturalistic. This should be a cause for concern, for it contradicts the initial drive toward the methodology (ecological validity). However, continuous response researchers have found what appear to be reasonably adequate solutions to the problem. A typical solution is to make the continuous task a simple one by making responses on a single scale, such as amount of emotion (Krumhansl, 1998), tension (Madsen & Fredrickson, 1993; Nielsen, 1983) or aesthetic experience (Madsen, Brittin & Capperella-Sheldon, 1993), which a computer samples automatically in the background.
More serious are the issues regarding the analysis of continuous, time-series data. Many researchers of emotional response to music who have chosen to adopt continuous response methodologies have yet to come to terms with the issues that are pertinent in time series data analysis (Schubert, in press). Amongst the analytic issues there are problems which lie at either end of a spectrum of methodological issues (Figure 1). At one pole a large amount of continuous data is obtained but the point of the collection is not immediately apparent (I call the extreme of this pole 'no analysis'). For example, if a researcher is going to collect time series data and then report on the time-average (perhaps because he or she cannot find an appropriate way to analyse the data in its time series form), the researcher should consider whether the extra effort in collecting continuous data was worthwhile. At the other end of the spectrum, analysis is often applied which is appropriate for parametric data, but not for serially correlated data (this is a problem of using parametric inferential statistical analysis). For example, Analysis of Variance is generally not an appropriate form of analysis for time series data because the assumption of independent, normally distributed data is usually violated (Gibbons, 1993). Somewhere along this spectrum lies the most common problem of emotional response studies which analyse continuous responses: the interpretation of a visual inspection of the time series. Gottman (1981) refers to this as an 'interocular test' and warns against the use of this descriptive approach as the sole means of analysing time series data. Many of these issues are well known, particularly in the fields of Economics, Geography and Engineering, and there exists a firmly grounded literature to explain and correct these problems (e.g., Box & Jenkins, 1976; Hamilton, 1994).
In this presentation I will focus on one issue: the use of Pearson's product-moment correlation technique for comparing two or more time series.
Figure 1. Spectrum of problems associated with time series data analysis of emotional response literature.
No Analysis | Descriptive Analysis | Inferential Analysis
Continuous data collected but not really required or not used effectively | Interocular testing (i.e. describing the time series response profile by visual inspection alone) | Continuous data analysed without attention to serial correlation
Correlation Analysis of Time-Series Data
The growing number of emotional response studies using continuous response has placed pressure on researchers to find and appraise appropriate analytic techniques. For example, a common method for comparing bivariate responses is to perform a Pearson product-moment correlation analysis (e.g., see Howell, 1997). The Pearson product-moment correlation procedure is used regularly for comparing emotional response time series data (e.g., Fredrickson, 1997; 1999; Frego, 1999; Krumhansl, 1996; 1997; 1998; Madsen, 1997; 1998; Madsen, Brittin & Capperella-Sheldon, 1993). Many of these studies did not report the type of correlation analysis performed; however, convention suggests, and the findings of the present study support, that Pearson's method was probably used in each case. Also, some of these studies did not report the significance of the tests. Again, as will become apparent, probably all studies would have reported a significant correlation coefficient with p ≤ 0.05. Given that time series data generally contain serial correlation, it follows that the Pearson product-moment correlation will produce an inaccurate estimate of the correlation coefficient, r. Provided the correlation coefficients are compared to one another within data sets produced in response to the same stimuli, this may not pose a major problem. The researcher can still make assertions about the ranking of the correlation coefficients and hence report with confidence that time series A is more correlated to time series C than is time series B. However, problems arise (1) when the correlation coefficient is compared with coefficients from other sources and (2) when the significance of the coefficient is taken "literally", without consideration for the effect of serial correlation.
An example of the first kind of problem is in reliability estimation, such as test-retest analysis. All the studies cited which investigated test-retest scores reported high and significant correlation coefficients. However, none of these studies determined whether the results were confounded by serial correlation. In all likelihood, these time dependent data do contain serial correlation: a response will be determined not by instantaneous changes in the musical signal, but by a collection of musical events or contexts. The memory of the listener for the recently passing musical events is a psychological manifestation of serial correlation. The large number of data points collected by computer controlled continuous response devices further ensures that the correlation coefficient will be significant. This is because the critical correlation coefficient, above which significance is identified (testing the null hypothesis of a zero correlation), decreases as the number of samples increases. There are instances in the literature when a result appears surprising, but the analytic technique is not questioned. For example, Madsen, Byrnes, Capperella-Sheldon and Brittin (1993) report that "no two people seem to relate to the same piece of music in exactly the same way, although in test-retest situations each person responds similarly at the same points in the music, even after an extended period of time, if the subject is listening in a similar situation" (p. 188). Madsen and his associates refer to a correlation coefficient of .9 for the test-retest analysis (Madsen, Brittin & Capperella-Sheldon, 1993, p. 65) but present no correlation coefficient for comparison with intersubject response.
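The dependence of the critical coefficient on sample size can be illustrated with a short sketch. It assumes independent samples and, for simplicity, uses the large-sample normal approximation to the t distribution, so the values are only a rough guide at small N. The function name approx_critical_r is hypothetical and introduced only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_critical_r(n, alpha=0.05):
    """Approximate two-tailed critical Pearson r for n independent samples.

    Assumption: the t distribution with n - 2 degrees of freedom is replaced
    by the standard normal, which is reasonable only for fairly large n.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # r and t are related by r = t / sqrt(t^2 + df), with df = n - 2.
    return z / sqrt(z * z + (n - 2))


# The threshold shrinks steadily as the number of samples grows.
for n in (20, 50, 200, 1000):
    print(n, round(approx_critical_r(n), 3))
```

At 200 samples (for instance, 200 seconds sampled once per second) the threshold is roughly 0.14, so even weak coefficients are declared significant when the samples are treated as independent.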
This example identifies a major problem with correlation analyses in the literature, in that the studies I investigated do not attempt to support their findings by falsification (Stanovich, 1998). They tend to report positive relationships (significant correlations) without examining correlations that should not be significant. Consequently, these researchers cannot know whether their significant correlations (and they almost always are, or appear to be) are meaningful (that is, whether the correlation gives information about the music stimulus, not just the measuring instrument), or whether they are in fact false correlations that have appeared due to underlying serial correlation. The assumption of independent sampling underlying the Pearson product-moment correlation is quite likely violated in time series data.
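A falsification check of this kind can be sketched with simulated data. The example below is not based on the studies cited or on the data analysed later; it assumes a simple first-order autoregressive (AR(1)) process as a stand-in for a serially correlated response, and all function names and parameter values (phi, 200 samples, 500 trials) are hypothetical choices. It counts how often two series that are independent by construction nevertheless exceed the conventional significance threshold.

```python
import random
from math import sqrt
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ar1_series(n, phi, rng):
    """Simulated AR(1) series: each value is phi times the previous plus noise."""
    value, out = 0.0, []
    for _ in range(n):
        value = phi * value + rng.gauss(0.0, 1.0)
        out.append(value)
    return out


def false_positive_rate(phi, n=200, trials=500, alpha=0.05, seed=1):
    """Share of independent series pairs whose |r| exceeds the critical value.

    The critical value uses a normal approximation and assumes independent
    samples, which is exactly the assumption that serial correlation breaks.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r_crit = z / sqrt(z * z + (n - 2))
    hits = 0
    for _ in range(trials):
        x = ar1_series(n, phi, rng)
        y = ar1_series(n, phi, rng)
        if abs(pearson_r(x, y)) > r_crit:
            hits += 1
    return hits / trials


print("no serial correlation (phi = 0.0):", false_positive_rate(0.0))
print("strong serial correlation (phi = 0.9):", false_positive_rate(0.9))
```

With no serial correlation the rate should sit near the nominal 5%; with strong serial correlation a much larger share of unrelated pairs passes the test, which is the pattern a falsification check is designed to expose.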
A particular study prompted me to investigate the situation with correlations and time series data further. Fredrickson (1997) reported the results of continuous tension responses as sample-by-sample mean responses from 2nd grade, 5th grade, 8th grade, 11/12 grade students, professional and non-professional musician listeners. All groups were highly correlated with one another. Fredrickson ranked the coefficients to demonstrate the strength of similarity between the various groups. He then reported that "of particular note is that the correlations, including the lowest one of .71 between the second graders and the musicians, were all significant at the [sic] p = 0.001 level" (p. 630). However, if the correlation of 0.71 is inflated or it is a false correlation due to underlying serial correlation, it leaves the door open for a flood of studies to report incorrectly high correlation coefficients and to treat them as meaningful results. While I believe that something like this is already happening in the literature, I do not argue that Fredrickson's analysis is necessarily wrong. Instead I felt that the use of correlation analysis of time series emotional response data required some investigation.
In this paper I present some data which address the issue of correlation coefficients for serially correlated, time-series responses. I do not claim to find a solution to the problem, but I do intend to make researchers cognisant of issues concerning the application of correlation coefficients.
Monte Carlo Study
Using a sample of data from a study by Schubert (1999a), I constructed a pseudo-Monte Carlo study to investigate the effect of serial correlation on correlation coefficient calculation. The study is pseudo-Monte Carlo because I did not select data from a predetermined distribution (Mooney, 1997). Instead, I used actual time series data which were collected in the form of emotional responses to music. The data come from continuous responses to three pieces of music: Edvard Grieg's 'Morning Mood' from Peer Gynt, Joaquin Rodrigo's Adagio movement from Concierto de Aranjuez for Guitar and Orchestra and Antonin Dvorak's Slavonic Dance No 1, Op. 46. For each piece there existed two bipolar time series responses: the arousal response (the amount of arousal or sleepiness expressed by the music) and the valence response (the amount of happiness or sadness expressed by the music). Each response was recorded by computer once per second on a scale of -100 to +100 for arousal and valence. Over seventy participants' data were available for each piece and emotional response dimension.
Hypothesis
Because this is a Monte Carlo-type study, the hypothesis is assumed to be true, and the data are evaluated according to how well they fit the hypothesis (cf. Mooney, 1997). In the present study, the hypothesis is that responses by different participants will be correlated for the same dimension and piece of music, and in all other conditions they will be uncorrelated (falsification). For example, all subjects' arousal responses to Morning will be correlated; however, their valence responses to Morning, or their responses to any other piece (or dimension), will not be correlated with this arousal response.
Method
For the present study I randomly sampled 16 responses from each of the three pieces. The first 200 seconds of each piece was selected. For simplicity and to conserve space, I will make reference to a subset of five responses from each piece, but the processes and findings discussed apply to the entire sample of sixteen.
Analysis
The data were factor analysed with a six-factor varimax-rotated solution in SPSS 6.1.1 for the Macintosh. The analysis was conducted using the original data sets. A second analysis was conducted using the first-order differenced data sets. Differencing works with changes in responses rather than absolute responses, and is a technique used to reduce serial correlation in time series data (Gottman, 1981). By subtracting the previous (in time) sample from each data point, a first-order difference series is generated. This series corresponds to the gradient of the original series.
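First-order differencing is simple to carry out. The sketch below uses two made-up toy series (not data from the study) that both drift upward over time; the function names are hypothetical. It shows how a high correlation between the raw series can disappear, or even reverse, once each series is replaced by its sample-to-sample changes.

```python
from math import sqrt


def first_difference(series):
    """Return the changes between consecutive samples (length n - 1)."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy responses: both rise steadily, but their
# moment-to-moment changes move in opposite directions.
series_a = [0, 12, 18, 32, 38, 52, 58, 72]
series_b = [0, 6, 20, 26, 40, 46, 60, 66]

raw_r = pearson_r(series_a, series_b)
diff_r = pearson_r(first_difference(series_a), first_difference(series_b))

print("raw series r:", round(raw_r, 2))
print("differenced series r:", round(diff_r, 2))
```

The shared upward trend dominates the raw coefficient, whereas the differenced coefficient reflects only how the two series change from one sample to the next.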
A sample of the factor loadings for each analysis is shown in Table 1. Only factor loadings above .4 are considered. Relative to the undifferenced data, the differenced data are much closer to the expected model stated by the hypothesis. When the data are differenced, they tend to load fairly neatly onto separate factors grouped by the response dimension (Arousal or Valence) and musical item. For example, the first-order differenced arousal responses to the Dvorak (responses labelled A_Dxx in Table 1) load onto factor 3 for each of the five participants shown. When the same data are not differenced, there is still a grouping of the data but, in the case of the Dvorak arousal data, the loading occurs on two factors (1 and 3), contrary to the hypothesis. Further, undifferenced data factors are more frequent and more scattered.
Table 1. Factor loadings for undifferenced (untreated) data and differenced (serial correlation adjusted) data. Within each cell, loadings are listed in ascending factor order.
Undifferenced Data Factors (UF 1 to UF 6) | RS | Differenced Data Factors (DF 1 to DF 6)
-0.42, 0.71 | A_D11 | 0.56
-0.58, 0.65 | A_D12 | 0.72
-0.46, 0.62 | A_D15 | 0.60
-0.41, 0.71 | A_D17 | 0.65
-0.55, 0.61 | A_D18 | 0.65
0.78 | A_GAL |
0.80, 0.45 | A_GAN | 0.55
0.71 | A_GAI |
0.87 | A_GCH | 0.44
0.75 | A_GCO |
0.43, 0.43 | A_RCH | 0.59
0.72 | A_RDI | 0.78
0.53 | A_RGR | 0.47
-0.63, 0.59 | A_RJO | 0.74
-0.41, 0.62 | A_RJU | 0.80
0.46, -0.45 | V_D11 | -0.51
0.43 | V_D12 | 0.46
-0.46 | V_D15 |
0.56 | V_D17 | 0.44
-0.59, 0.48 | V_D18 | -0.45
0.50 | V_GAL | 0.42
0.71 | V_GAN |
0.47, -0.55 | V_GAI | 0.44
0.44, -0.48 | V_GCH |
0.72, 0.45 | V_GCO | 0.44
-0.54, -0.41, 0.48 | V_RCH | 0.79
-0.57 | V_RDO | 0.62
-0.69, 0.41 | V_RJO | 0.92
 | V_RJU | 0.69
0.68 | V_RKA | 0.74
UF = Undifferenced Factor
DF = Differenced Factor
RS = Response Sample
Code used in Response Sample column:
A_ = Arousal response
V_ = Valence response
D = Dvorak Slavonic Dance
G = Grieg Morning
R = Rodrigo Adagio
The remaining letters/numbers are arbitrary participant codes.
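The response sample codes make the hypothesised pattern easy to state programmatically. The following sketch, with hypothetical helper names, parses a code into its dimension, piece and participant parts and reports whether the hypothesis expects two responses to be correlated (same dimension and same piece) or uncorrelated (every other combination).

```python
def parse_label(label):
    """Split a code such as 'A_D11' into (dimension, piece, participant)."""
    dimension, rest = label.split("_", 1)
    return dimension, rest[0], rest[1:]


def expected_correlated(label_a, label_b):
    """True when the hypothesis predicts a correlation between two responses.

    Responses are expected to correlate only when they share both the
    response dimension (A or V) and the piece (D, G or R).
    """
    dim_a, piece_a, _ = parse_label(label_a)
    dim_b, piece_b, _ = parse_label(label_b)
    return dim_a == dim_b and piece_a == piece_b


print(expected_correlated("A_D11", "A_D12"))  # same dimension, same piece
print(expected_correlated("A_D11", "V_D11"))  # different dimension
print(expected_correlated("A_D11", "A_GAL"))  # different piece
```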
The factor analyses provide evidence that serial correlation is present in undifferenced data, and suggest that correlations between any pairs of participants are more likely to lead to a misleadingly high correlation coefficient than when the data are first-order difference transformed. For example, the undifferenced arousal response to the Dvorak for any particular participant is likely to correlate with another participant's arousal response to the same piece. This result is as expected; however, Table 1 also demonstrates that a significant correlation is likely to occur with any of the other pieces or dimensions, because there exist reasonably large factor loadings onto factors 1 and 3 for each of the other examples. This is an incorrect result according to the hypothesis.
The differenced data still posed some problems. Dvorak Valence and Dvorak Arousal load onto the same factor for all but one of the participants. Further, there appear to be some contradictions in the sign of the factor loadings, which are inconsistent within the rest of the group. For example, factor 3 in the 'Dvorak Valence' group has two negative loadings and two positive loadings. (Note: Factor 4 contains no loadings, probably because of sampling error and because only factor loadings greater than 0.4 are shown.) The first problematic finding can be reconciled by a closer examination of the Dvorak responses. For this piece the valence and arousal were more correlated than for other pieces (meaning that the hypothesis requires correction, or that different-dimension responses to the same piece should not have been compared). The second problematic finding could be explained by sampling error. With such a small sample chosen for analysis (Monte Carlo studies are considerably larger) the effect of sampling error becomes quite problematic (16 per group in the original study, five shown in Table 1). The important point, however, is that the differenced responses are considerably better grouped than the undifferenced responses.
Discussion and Conclusions
The Monte Carlo-type study demonstrates that Pearson product-moment correlations between time series responses tend to be inflated and misleading. A better result was obtained when each time series was first-order differenced. Differencing, in this case, removes some of the serial correlation from the data. The amount of serial correlation in the data can be diagnosed by examining the autocorrelation function, not discussed here (see Schubert & Dunsmuir, 1999). Another possible method of controlling the inflation of the correlation coefficient and the possibility of false correlation is to use more conservative correlation analyses such as Spearman's rho or Kendall's tau (Howell, 1997). However, the mathematical derivation of these methods is not based on principles of time series. My own investigation of correlation coefficient matrices using these methods (again a Monte Carlo-like study on the above data) demonstrated minimal reduction in the number of false correlations when the data are left undifferenced (matrices not shown here to conserve space). Consequently, the conclusion drawn from the present investigation is that it is appropriate to control serial correlation before calculating correlation coefficients, and that a simple method of controlling serial correlation is to apply a first-order difference transformation to the data.
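As a rough illustration of the diagnostic mentioned above, the sample autocorrelation at a given lag can be computed in a few lines. The sketch below uses a simulated random walk as a stand-in for a serially correlated response (an assumption made for illustration, not the study data), and the function names are hypothetical. It compares the lag-1 autocorrelation before and after first-order differencing.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom


def first_difference(series):
    """Return the changes between consecutive samples."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Simulated random walk of 200 samples (hypothetical stand-in data).
rng = random.Random(0)
walk, level = [], 0.0
for _ in range(200):
    level += rng.gauss(0.0, 1.0)
    walk.append(level)

print("lag-1 autocorrelation, raw:", round(autocorrelation(walk, 1), 2))
print("lag-1 autocorrelation, differenced:",
      round(autocorrelation(first_difference(walk), 1), 2))
```

A lag-1 value close to 1 signals strong serial correlation in the raw series; a value near zero after differencing suggests that the transformation has removed most of it for this simulated example. Real response data may need further treatment.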
While there are many issues that are of concern to emotion-in-music researchers who adopt continuous response methodologies, the present investigation and those discussed elsewhere (Beran and Mazzola, 1999; Schubert, 1999b; Schubert & Dunsmuir, 1999) suggest that there are simple techniques available for dealing with many of these matters. However, for continuous response methodology to be a plausible alternative to conventional, more efficient approaches, researchers must become aware of the issues and the solutions. In particular, the issue of serial correlation needs to receive more consideration than is currently the case in the literature.
References
Beran, J. & Mazzola, G. (1999). Analysing musical structure and performance - A statistical approach. Statistical Science, 14, 47-79.
Box, G. E. P. & Jenkins, G. M. (1976). Time series analysis: Forecasting and control (Rev. ed.). San Francisco: Holden-Day.
Fredrickson, W. E. (1997). Elementary, middle, and high school perceptions of tension in music. Journal of Research In Music Education, 45, 626-635.
Fredrickson, W. E. (1999). Effect of musical performance on perception of tension in Gustav Holst's first suite in E-flat. Journal of Research in Music Education, 47, 44-52.
Frego, R. J. D. (1999). Effects of aural and visual conditions on response to perceived artistic tension in music and dance. Journal of Research in Music Education, 47, 31-43.
Gabrielsson, A. & Juslin, P. N. (1996). Emotional expression in music performance: Between the performer's intention and the listener's experience. Psychology of Music, 24, 68-91.
Gibbons, J. D. (1993). Nonparametric Measures of Association. Newbury Park: Sage.
Gottman, J. M. (1981). Time-series analysis: A comprehensive introduction for social scientists. Cambridge: Cambridge University Press.
Hamilton, J. D. (1994). Time series analysis. Princeton, New Jersey: Princeton University Press.
Heinlein, C. P. (1928). The affective characters of the major and minor modes in music. Journal of Comparative Psychology, 8, 101-142.
Howell, D. C. (1997). Statistical methods for psychology (4th ed.). Belmont, CA: Duxbury.
Krumhansl, C. L. (1996). A perceptual analysis of Mozart's Piano Sonata K.282 - segmentation, tension, and musical ideas. Music Perception, 13, 401-432.
Krumhansl, C. L. (1997). An exploratory study of musical emotions and physiology. Canadian Journal of Experimental Psychology, 51, 336-352.
Krumhansl, C. L. (1998). Topic in music: An empirical study of memorability, openness, and emotion in Mozart's String Quintet in C major and Beethoven's String Quartet in A Minor. Music Perception, 16, 119-134.
Madsen, C. K. & Fredrickson, W. E. (1993). The experience of musical tension: A replication of Nielsen's research using the continuous response digital interface. Journal of Music Therapy, 30, 46-63.
Madsen, C. K. (1997). Emotional response to music as measured by the two-dimensional CRDI. Journal of Music Therapy, 34, 187-199.
Madsen, C. K. (1998). Emotion versus tension in Haydn's Symphony No. 104 as measured by the two-dimensional continuous response digital interface. Journal of Research In Music Education, 46, 546-554.
Madsen, C. K., Brittin, R. V. & Capperella-Sheldon, D. A. (1993). An empirical investigation of the aesthetic response to music. Journal of Research In Music Education, 41, 57-69.
Mooney, C. Z. (1997). Monte Carlo Simulation. Thousand Oaks: Sage.
Nielsen, F. V. (1983). Oplevelse af musikalsk spænding (the experience of musical tension). Akademisk Forlag, Copenhagen.
Schubert, E. & Dunsmuir, W. (1999). Regression modelling continuous data in music psychology. In Suk Won Yi (Ed.), Music, Mind, and Science (pp. 298-352). Seoul National University Press.
Schubert, E. (in press). Continuous measurement of self-report emotional response to music. In Patrik Juslin and John Sloboda (Eds.), Music and Emotion: Theory and Research. Oxford University Press.
Schubert, E. (1999a). Measuring emotion continuously: Validity and reliability of the two-dimensional emotion-space. Australian Journal of Psychology, 51, 154-165.
Schubert, E. (1999b). Measurement and Time Series Analysis of Emotion in music. Unpublished doctoral dissertation, University of New South Wales, Sydney.
Stanovich, K. E. (1998). How to Think Straight about Psychology (5th ed.). New York: Longman.