Proceedings paper



By Maud Hickey, Northwestern University


The purpose of this paper is to explore the use of consensual assessment as a tool to rate creative artworks, and more specifically to look at consensual assessment as a tool for rating children's musical compositions. I begin by describing the conceptual background of, and technique for using, consensual assessment to rate the creativity of works of art. I then share results of research in which this particular assessment technique is used in music composition. Two studies currently in progress are summarized; both use computer-generated compositions as a base and employ music educators and other experts as evaluators.


The identification of creative products or people as more or less creative is a difficult and controversial task. Guilford, and subsequently Torrance, had an immense influence on the field of psychometric measurement of creative people and products. The divergent thinking factors of fluency, flexibility, originality and elaboration that Guilford first hypothesized (1950, 1957) and that are used in the Torrance Tests of Creative Thinking (Torrance, 1966, 1974, 1981) are still widely used in some combination or variation in many current creative thinking tests and measurements (Brown, 1989).

However, the concept that divergent thinking and its components (i.e. fluency, flexibility, originality and elaboration) are synonymous with the outcome of creative thinking has been challenged. The greatest criticism is that the theoretical constructs came first, and then were validated using specialized tests, such as factor analyses, to identify the factors (Brown, 1989). They were not validated against any external measure of creative productivity. "The basic problem seems to be that creativity tests had only apparent construct validity and certainly not criterion validity" (Brown, 1989, p. 8). While Guilford spent many years validating the construction of test items for creative thinking through complex factorial analysis, he ignored the need to validate these factors against real-life processes and products of creative people. More specifically, Cattell, among others (Hocevar & Bachelor, 1989; Michael & Wright, 1989), criticized the use of fluency as a factor of creative thinking, as well as the often well-regarded written tests used to measure it: " . . . output per minute is unimportant compared to quality and aptness. The speed and productivity measures taken on artificial test situations are on a very different and possibly irrelevant level in relation to the productivity we encounter in real life originality" (1987, p. 509).

Amabile has proposed that we abandon "the hope of finding objective criteria for creativity and, instead, to adopt a definition that relies upon clearly subjective criteria" (1996, p. 34). To fill this need and work toward a more social-psychological perspective of creativity assessment, Amabile developed, and has since repeatedly tested, a "consensual assessment technique" for rating various forms of artistic creativity (Amabile, 1982, 1983, 1996). This technique is based upon Amabile's operational definition of creativity, which is:

A product or response is creative to the extent that appropriate observers independently agree it is creative. Appropriate observers are those familiar with the domain in which the product was created or the response articulated. Thus, creativity can be regarded as the quality of products or responses judged to be creative by appropriate observers, and it can also be regarded as the process by which something so judged is produced (1996, p. 33).

The consensual assessment technique for rating creativity, then, is to rate the creativity of products using experts' agreed-upon (consensual) ratings of these products.

Amabile outlines clear task, judge, and procedural protocols that must be met for correct implementation of the consensual assessment technique. These are: 1. the task must lead to a clearly observable product; 2. the task should be open-ended enough to allow for flexibility; 3. all subjects should be presented with the same set of materials, instructions, and working conditions; 4. there should not be large individual differences in "baseline performance" skills required by the task; 5. judges should have experience in the domain in question; 6. judges should make their assessments independently; and 7. judges could rate products on dimensions other than creativity (such as aesthetic appeal or craftsmanship), but judges should rate all products on one dimension before rating products on any other dimension (Amabile, 1996).

Amabile, along with colleagues and others, has successfully used the consensual assessment technique for rating the creativity of products in several artistic as well as problem-solving domains, ranging from visual art portfolios to computer programs to business solutions. Amabile (1996) reports the reliability of approximately 53 different studies that utilized the consensual assessment technique, listed by author, task/product, subjects, and judges used. The judge reliability for these reported studies is remarkably and consistently high.

Consensual Assessment in Music

To date, there are few studies that have explicitly used or examined the use of the consensual assessment technique in music composition. In an analysis of creativity assessment literature, Webster and Hickey (1995) found an inconsistent and wide variety of techniques used for rating musical creativity. Rating scales used for the measurement of creative musical thinking not only cover a wide range of methods, but also lack concurrent validity, that is, a comparison of creative thinking "test" scores to overall "best" and "worst" compositions or products (Webster & Hickey, 1995). Utilizing their own test, which employed different kinds of forms (objective and subjective) for rating the creativity of musical compositions, Webster and Hickey discovered that scores from implicit (subjective) rating forms proved to be most predictive for the constructs of originality/creativity qualities and aesthetic value of children's compositions, and that scores from more explicit forms were most predictive for the constructs of craftsmanship and technical qualities of musical compositions (1995).

Bangs (1992) successfully adapted Amabile's consensual assessment technique to rate the creative quality of children's musical compositions. The Dimension of Judgment tool was utilized to rate the musical creativity of pre- and post-treatment compositions of 37 third-grade children in order to compare the effects of intrinsic and extrinsic motivation factors on musical creativity. All of the compositions were rated by three independent judges using the Dimension of Judgment form which required judges to rate the compositions on 19 different criteria (adapted from Amabile's Dimension of Judgments for Artist-Judges [1982]). Interjudge reliability for the "creativity" item among the three judges was .76 and .82 indicating a reliable assessment form.

In a study seeking to understand the effect of problem finding and creativity style on creative musical products, Brinkman (1994) used a modified version of Amabile's consensual assessment technique. Brinkman asked 32 high school instrumental music students to compose two melodies. Three judges independently rated the melodies using a consensual assessment technique. That is, each judge was asked to rate each melody on a 7-point scale ("low" marked on one end and "high" on the other) in the categories of originality, craftsmanship and aesthetic value. The reliability of the three judges' creativity ratings of the 64 melodies ranged from .77 to .96.

Reliability figures for 3 judges' ratings of 14 children's musical compositions ranged from .62 to .73 for creativity and from .81 to .95 for craftsmanship in a study by Hickey (1995).

The reliability of "creativity" ratings for children's musical compositions was .93 in another study by Hickey (1996).

Most recently, Hickey (in process) examined one of the assumptions of the consensual assessment technique that states that "experts" must be used as assessors of the creative products. In the domain of music, just who are the "experts" when it comes to dealing with children's compositional products? Amabile answers this question for visual art and problem solving studies which used the consensual assessment technique by reporting: ". . . . we are now convinced that for most products in most domains, judges need not be true experts in order to be considered appropriate" [for judging products] (1996, p. 72). She qualifies this, however, by stating that in many domains, some form of training in the field may be necessary for judges "to even understand the products they are assessing" (p. 72) and specifically cites computer-programming tasks and judging portfolios of professional artists. Based on analyses of several studies, Amabile concludes with a suggestion:

...the level of judge expertise becomes more important as the level of subjects' expertise in the domain increases. In other words, the judges should be closely familiar with works in the domain at least at the level of those being produced by the subjects. (1996, pp. 72-73)

The purpose of the present study is to report the findings from two recent experiments which use the consensual assessment technique in order to refine this technique in music and to find which group of "experts" is best qualified to assess children's musical compositions. The studies and results are reported next.

Study A

The purposes of this study were to: a) determine which group of judges (composers, theorists, music teachers, or children) would make the most reliable creativity ratings of children's musical compositions; and b) determine the relationships of mean creativity scores between these groups of judges.


Five groups of judges' creativity assessments of children's musical compositions were compared. The groups were: 17 music teachers, 3 composers, 4 music theorists, 14 seventh-grade children, and 25 second-grade children. The music teachers were broken down into the following groups for analysis: 10 "instrumental" music teachers (teachers who taught only junior or senior high school band/orchestra); 4 "mixed experience" teachers (teachers who taught a combination of instrumental and choral or instrumental and general music); and 3 "general/choral" music teachers (elementary general music teaching with some choral music). Of the composers, two were college composition professors and the third was a graduate student in composition. All had at least 15 years of experience writing music in a wide variety of genres ranging from jazz to classical. The music theorists were college theory professors. The two groups of children came from contained classrooms in a private grade school.

The 11 musical compositions which were rated by all of the judges were randomly selected from a pool of 21 compositions generated by fourth- and fifth-grade subjects in a previous research study (Hickey, 1995). In the 1995 study, the subjects were given unlimited time to create an original composition using a synthesizer connected via MIDI interface to a Macintosh computer. The final compositions were captured in MIDI file format using a computer program that allowed the recording of up to three simultaneous tracks of music. No compositional parameters were given. Students were encouraged to re-record their compositions as often as necessary until they were satisfied with their finished product.


Amabile (1983) recommends that, in order to assure discriminant validity between creativity and other areas, dimensions such as craftsmanship and aesthetic appeal be included on the rating form. The form used by the theorists and composers for this study was developed by combining and adapting items from Amabile's Dimensions of Creative Judgment (1982) and Bangs' Dimension of Judgment rating forms (1992). The final form was used and tested in two previous studies (Hickey, 1995; 1996). The rating form contained 18 items which fell under one of three dimensions: creativity, craftsmanship, and aesthetic appeal. The items consisted of 7-point Likert-type scales with anchors marked "low," "medium," and "high." The music teachers in this study used a 3-item form with 7-point rating scales for creativity, craftsmanship and aesthetic appeal. The creativity item for the music teachers, theorists, and composers was worded: "Using your own subjective definition of creativity, rate the degree to which the composition is creative."

The children rated the compositions first for "Liking," and on a second listening, for "Creativity," using a separate form for each scale. The Creativity form asked the students to rate each composition on a 5-point scale with "Not Creative" and "Very Creative" marked on the low and high ends. The second-grade children's form had icons (from plain to more elaborate/silly faces) at each point on the scale to aid them in understanding the continuum.

The groups of children listened to the compositions together in their respective classrooms. Before listening to and rating the compositions, the author engaged the children in discussion about "Liking" music and/or thinking that music is "Creative." The children shared ideas about what "creative" meant to them and the discussion was guided to help them focus on understanding this term for rating music. They then rated each composition first for "Liking," and on a second listening, for "Creativity."

All of the judges were informed that the compositions were composed by fourth- and fifth-grade children. Following Amabile's suggestion for proper consensual assessment technique procedures (1996), the judges rated the compositions independently and were instructed to rate them relative to one another rather than against some "absolute" standard in the domain of music.


The analyses in this report are based on the judges' ratings on only the Creativity item of the assessment forms. Interjudge reliabilities were calculated using "Hoyt's analysis," an intra-class correlation technique which reports a coefficient alpha (Nunnally & Bernstein, 1994). The statistics were computed using GB-Stat™ (1994) software. Because each group had a different number of judges, reliability coefficients were adjusted in order to compare the groups as if only 3 judges were used in each group. The adjusted interjudge reliabilities for each group's creativity ratings on the musical compositions were: composers, 0.4; all music teachers, .64; instrumental music teachers, .65; "mixed teachers," .59; general/choral teachers, .81; music theorists, .70; seventh-grade children, .61; and second-grade children, .50.
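The reliability procedure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the GB-Stat computation itself: `hoyt_alpha` derives coefficient alpha from a two-way analysis of variance (compositions by judges), and `spearman_brown` rescales a coefficient to a common number of judges. The paper does not name the adjustment formula that was used; the Spearman-Brown formula is assumed here, and the function names and sample ratings are hypothetical.

```python
from statistics import mean


def hoyt_alpha(ratings):
    """Coefficient alpha via Hoyt's analysis-of-variance approach.

    ratings: list of rows, one row per composition, one column per judge.
    """
    n = len(ratings)       # number of compositions
    k = len(ratings[0])    # number of judges
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # compositions
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # judges
    ss_resid = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return 1 - ms_resid / ms_rows


def spearman_brown(reliability, k_actual, k_target):
    """Estimate reliability as if k_target judges had been used (assumed formula)."""
    factor = k_target / k_actual
    return factor * reliability / (1 + (factor - 1) * reliability)


# Invented ratings: 4 compositions (rows) rated by 3 judges (columns).
example = [[2, 3, 3], [5, 4, 6], [1, 2, 1], [4, 4, 5]]
alpha = hoyt_alpha(example)
# A coefficient obtained from 17 judges, expressed as if only 3 had rated.
adjusted = spearman_brown(0.90, k_actual=17, k_target=3)
```

Adjusting every group to the same number of judges matters because alpha rises with the number of raters, so unadjusted coefficients from 17 teachers and 3 composers would not be comparable.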

The correlations of mean creativity ratings among the different groups of judges are presented in Table 1. Due to the lack of agreement among the composers, each composer is represented separately rather than using the group mean for correlation with the other groups. Significant correlations were found between the three groups of music teachers, between the music teachers and music theorists, and between the two groups of children. Though music teachers and music theorists agreed with each other, and the groups of children had a high positive correlation with each other, the theorists and teachers showed moderate to low correlations with the groups of children. There were no strong positive correlations among the composers nor between the composers and the other groups.

Table 1

Correlations of Mean Creativity Ratings Between Groups of Judges

Group                       1      2      3      4      5      6      7      8      9      10
1. Composer A
2. Composer B
3. Composer C
4. Music Theorists
5. All Music Teachers                            .90**
6. Instrumental Teachers                         .88**
7. Mixed Teachers                                .86**         .78**
8. General/Choral Teachers                       .63*          .68*   .72*
9. 7th-grade Children
10. 2nd-grade Children                                                              .83**

** p < .01, * p < .05
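The kind of comparison summarized in Table 1 can be illustrated with a short sketch. It assumes Pearson product-moment correlations between the groups' mean ratings and two-tailed critical values for 11 compositions (df = 9), which are .602 for p < .05 and .735 for p < .01; the paper does not state which correlation or test was used, so both are assumptions. The function names and the ratings below are invented for illustration.

```python
from math import sqrt
from statistics import mean


def composition_means(ratings):
    """Mean rating per composition across a group's judges (rows = compositions)."""
    return [mean(row) for row in ratings]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def significance_flag(r, crit_05=0.602, crit_01=0.735):
    """Mark r against two-tailed critical values for df = 9 (assumed test)."""
    if abs(r) >= crit_01:
        return "**"
    if abs(r) >= crit_05:
        return "*"
    return ""


# Invented ratings for two groups of judges on the same 11 compositions.
teachers = [[4, 5], [2, 3], [6, 6], [3, 2], [5, 5], [1, 2],
            [4, 4], [7, 6], [3, 3], [5, 6], [2, 2]]
theorists = [[5, 4], [3, 2], [6, 7], [2, 3], [5, 4], [2, 1],
             [3, 4], [6, 7], [3, 4], [6, 5], [2, 3]]

r = pearson_r(composition_means(teachers), composition_means(theorists))
label = f"{r:.2f}{significance_flag(r)}"
```

Correlating group means, rather than individual judges, answers whether two groups rank the compositions similarly even when judges within a group disagree.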

Study B

The purpose of this most recent study was to test the reliability of a one-item creativity rating form using the consensual assessment technique for rating children's musical compositions and to test the reliability of a small group of judges.


The judges in this study were 6 music teachers who came from slightly varied teaching backgrounds. Three of these teachers actively taught music composition to students in their general music classes: one was a high school band and general music teacher, and the other two were middle school general music teachers. Two judges were elementary general music teachers who had taught music composition to their students on only a few occasions (music composition was not a regular part of their curricula). The final judge was a student teacher in elementary-level general music.

The 53 compositions that were rated in this study were created by 28 third-grade children (8 and 9 years of age). The children were volunteers who came to the University for three 2-hour Saturday sessions to learn about music composition using Macintosh computers and synthesizers. The students were shown how to use a simple music sequencing program with Korg X5D synthesizers to create original music compositions. The compositions collected for this study were composed on the first and third day of the sessions. The children's instructions were simply to create a composition that they liked. They could use as many tracks as they wished, and any combination of the available 128 General MIDI timbres. They were given as much time as needed (no child needed more than 45 minutes) and could revise and re-record as much as needed until they were satisfied with their composition. Several children recorded more than 1 composition during each of these sessions. They were asked to choose their favorite composition for purposes of this study. Twenty-five of the children completed 2 compositions for the project, while three children completed only the first session.


The MIDI compositions were converted to audio files and saved onto a CD-ROM for judges to listen to. Each judge received a CD with the 53 compositions in a different and random order. The judges then independently listened to the compositions and rated each on creativity using a 7-point Likert-type scale with anchors marked "low," "medium," and "high." The instructions for rating each composition were: "Using your own subjective definition of creativity, rate the degree to which the composition is creative."


The average creativity score for all 53 compositions was 3.8, with a range from 1.34 to 6.17. Interjudge reliabilities were calculated using "Hoyt's analysis," an intra-class correlation technique which reports a coefficient alpha (Nunnally & Bernstein, 1994). The statistics were computed using GB-Stat™ (1994) software. The reliability coefficient for all 6 judges was .61 (p < .01). To test the hypothesis formed from the results found in study "A" (that general/choral elementary teachers are the best experts in judging children's compositions), I calculated reliability coefficients with a variety of combinations of judges to see which produced the best reliability. The best reliability figure was .65 (p < .01) when calculated without the high school band/general music teacher.
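Comparing combinations of judges, as described above, can be sketched as follows. This is an illustration only: it uses the variance form of coefficient alpha (algebraically equivalent to Hoyt's analysis-of-variance form), the names `coefficient_alpha` and `best_subset` are hypothetical, and the ratings are invented rather than taken from the study.

```python
from itertools import combinations
from statistics import variance


def coefficient_alpha(ratings, judges):
    """Coefficient alpha for the selected judge columns.

    ratings: one row per composition, one column per judge.
    judges:  tuple of column indices to include (at least two).
    """
    k = len(judges)
    columns = [[row[j] for row in ratings] for j in judges]
    totals = [sum(row[j] for j in judges) for row in ratings]
    judge_variance = sum(variance(col) for col in columns)
    return k / (k - 1) * (1 - judge_variance / variance(totals))


def best_subset(ratings, size):
    """Return (alpha, judges) for the most reliable combination of `size` judges."""
    n_judges = len(ratings[0])
    scored = [(coefficient_alpha(ratings, subset), subset)
              for subset in combinations(range(n_judges), size)]
    return max(scored)


# Invented ratings: 5 compositions, 4 judges; judge 3 disagrees with the rest.
example = [
    [1, 2, 1, 5],
    [3, 3, 4, 2],
    [5, 4, 5, 3],
    [6, 6, 7, 1],
    [2, 1, 2, 6],
]
alpha, judges = best_subset(example, size=3)
```

Because searching over subsets picks the most favorable coefficient after the fact, a figure found this way is best treated as exploratory.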

Discussion and Implications for Further Research

The main purposes of this paper were to describe the conceptual background of and technique for using consensual assessment to rate the creativity of children's music compositions and to share results of research in which this particular assessment technique is used. Study "A" sought to determine who might be the best group of experts to judge the creativity of children's musical compositions when using a consensual assessment technique. Based on the results of this study, it seems that the best "experts," or at least the most reliable judges, may be the very music teachers who teach the children: the general/choral music teachers. Perhaps the extensive music training that music teachers have along with their experience in the classroom with children provides them with the tools necessary to make consistent and valid judgments about the creative quality of children's original musical products. It is of interest to note that the composers used in this study were the group least able to come to an agreement on the creativity of the children's compositions. In music education in the United States, music composition is sometimes viewed as "mysterious," and often, the only experts considered in this realm are the professional composers. Perhaps music teachers should have reason to feel more confident in their ability to accurately assess the relative creativity of their students' musical compositions.

Study "B" further tested the reliability of subjective creativity assessment of children's musical compositions and also examined the differences in judges' ratings based upon their teaching backgrounds. The best reliability figure was obtained without the high school band/general music teacher. This corresponds with study "A" in that perhaps the high school teacher does not have the same sense, hence criteria, of what young children are capable of creating in musical compositions and may not be the best "judge" for rating creativity in children's musical compositions.

The reliability of .65 is significant, yet lower than figures obtained previously. One reason may be that the rating form asked judges to rate the musical compositions on only one item for "creativity." Amabile suggests that at least aesthetic appeal and craftsmanship (in addition to creativity) be used as items to rate creative products in order to force judges to think more carefully about the "creative" aspects of the product. Though rating on only 1 item is easier and quicker for judges, this result may suggest that the one-item method is not as reliable for consensual assessment as using at least 3 rating items.

Another way to make this procedure more reliable is to include a general creativity definition for the judges. This definition would be that creative musical compositions are both original and "appropriate" (this seems to be the most common definition in the literature [Amabile, 1996]). Amabile suggests that a definition may be needed when judges are uncomfortable with the idea of rating the creativity of products in the absence of a guiding definition (1996).

A final hypothesis for the unsatisfactory reliability coefficient in Study "B" is that children at this age (8- and 9-years) are not developmentally "ready" or able to create an original and musically satisfying composition. The compositions that were rated very high or very low may have been done so by chance and not with any real intent or ability. We need more research in our field to understand the developmental trend of creative musical thinking in children in order to test this hypothesis.

Why bother with this pursuit of consensual assessment for rating creativity in children's composition? For one, as mentioned briefly above, it is to show that teachers do know, and can reliably assess, the creative quality of children's compositions without the need for clear-cut objective criteria. Of course, criteria for assessing compositions should be made clear to children when the consequence might be a grade, but these studies show that teachers naturally have a subjective idea of which compositions are more or less creative when compared to others.

Using a subjective consensual assessment technique, one might collect and examine the compositions from children which are consistently rated as highly "creative." What are the features of these successful compositions? From these compositions we may be able to formulate sensible rubrics to aid in assessing children's musical compositions in schools. Furthermore, compositions rated highly "creative" could also be used as models for elementary music classrooms; models are desperately needed for teachers who strive to do more musical composition activities in their classrooms.

The subjective consensual assessment of children's musical compositions, for the most part, has worked. It may provide the most appropriate measure because of its subjectivity and because it does not presume objective criteria for creativity. This line of research may prove fruitful for better understanding the genesis of, and factors surrounding, a creative musical "aptitude" in children. In order for consensual assessment to be the procedure for this identification, however, the next step is to identify children who repeatedly produce musical compositions that experts rate as creative, and then to pursue answers to questions about these children: What are the social and external factors that surround these children's backgrounds? Is there a relationship between scores on general as well as musical creativity tests and creative musical production? Is there a relationship between musical creativity (based on musical composition assessment) and musical "aptitude" (based on a standardized musical aptitude test)?

Creative musical thinking in children is a complex phenomenon in need of further study. The use of the consensual assessment technique for identifying creative musical compositions and their creators may prove to be the most reliable measure to aid in this research endeavor.



Amabile, T. M. (1982). Social psychology of creativity: A consensual assessment technique. Journal of Personality and Social Psychology, 43, 997-1013.

Amabile, T. M. (1983). The social psychology of creativity. New York: Springer-Verlag.

Amabile, T. M. (1996). Creativity in Context. Update to The social psychology of creativity. Boulder, CO: Westview Press.

Bangs, R. L. (1992). An application of Amabile's model of creativity to music instruction: A comparison of motivational strategies. Unpublished doctoral dissertation, University of Miami, Coral Gables, Florida.

Brinkman, D. (1994). The effect of problem finding and creativity style on the musical compositions of high school students. Unpublished doctoral dissertation, University of Nebraska, Lincoln.

Brown, R. T. (1989). Creativity. What are we to measure? In J. A. Glover, R. R. Ronning, & C. R. Reynolds (Eds.), Handbook of creativity, (pp. 3-32). New York: Plenum Press.

Cattell, R. B. (1987). Intelligence: its structure, growth and action. Amsterdam: Elsevier Science Publishers.

GB-Stat™ [Computer software]. (1994). Silver Spring, MD: Dynamic Microsystems, Inc.

Guilford, J. P. (1950). Creativity. American Psychologist, 5, 444-454.

Guilford, J. P. (1967). The nature of human intelligence. New York: McGraw-Hill.

Hickey, M. (1995). Qualitative and Quantitative Relationships Between Children's Creative Musical Thinking Processes and Products. Unpublished doctoral dissertation, Northwestern University, Evanston, IL.

Hickey, M. (1996). Consensual Assessment of Children's Musical Compositions. Unpublished paper presented at the Research Poster Presentation, New York State School Music Association Convention.

Hocevar, D., & Bachelor, P. (1989). A taxonomy and critique of measurements used in the study of creativity. In J. A. Glover, R. R. Ronning, and C. R. Reynolds (eds.), Handbook of creativity, pp. 53-76. New York: Plenum Press.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Torrance, E. P. (1966). Torrance tests of creative thinking. Princeton, NJ: Personnel Press.

Torrance, E. P. (1974). The Torrance tests of creative thinking: Technical-norms manual. Bensenville, IL: Scholastic Testing Services.

Torrance, E. P. (1981). Thinking creatively in action and movement: administration, scoring, testing manual. Bensenville, IL: Scholastic Testing Service, Inc.

Webster, P. & Hickey, M. (1995, Winter). Rating scales and their use in assessing children's compositions. The Quarterly Journal of Music Teaching and Learning, VI (4), 28-44.

