Summary:
The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher

Thomas Bolton Horton
University of Edinburgh, Department of Computer Science
Ph.D. Dissertation, 1987

Current Address:
Dept of Computer Science
Florida Atlantic University
Boca Raton, FL 33431
407/367-2674
Internet: tom@cs.fau.edu
BITNET: HortonT@fauvax

A number of recent successful authorship studies have relied on a statistical analysis of language features based on function words. However, stylometry has not been extensively applied to Elizabethan and Jacobean dramatic questions. To determine the effectiveness of such an approach in this field, language features were studied in twenty-four plays by Shakespeare and eight by Fletcher. The goal was to develop procedures that might be used to determine the authorship of individual scenes in _The Two Noble Kinsmen_ (TNK) and _Henry VIII_ (H8).

Of the 32 texts of known authorship, 6 were set aside as a test set. These were treated as if their authorship were unknown. All procedures that I evaluated were applied to samples from these 6 plays, and the results were used to judge the procedures' effectiveness. The remaining 20 Shakespeare texts and 6 Fletcher texts made up the control set, which was used to establish each dramatist's characteristics of composition.

Homonyms, spelling variants and contracted forms in old-spelling dramatic texts present problems for a computer analysis. Many common function words have several variant spellings (e.g. "been" can be spelled "beene", "bene", "bin", etc.). Other forms can represent a number of lexical forms; for example, besides the indefinite article, the single-letter word "a" can mean "he", "of", "on", "ah", etc. Some forms of compound contractions involving function words are frequent (for example, "let's", "o'th'", "'tis" and the many contracted forms of "is" like "it's", "Caesar's", "he's", etc.).
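
As a rough illustration of the spelling and contraction problem just described, the Python sketch below maps a few variant spellings to a standard form and expands a few compound contractions by table lookup. The tables and the names VARIANTS, EXPANSIONS and normalize are hypothetical and hold only the examples quoted above; they are not the lists compiled for the dissertation. A plain lookup of this kind cannot decide what an ambiguous form such as "a" or "bin" stands for, so the sketch leaves such forms unchanged.

```python
import re

# Illustrative tables only (assumption): a handful of the forms quoted in
# the summary, not the dissertation's actual replacement/expansion lists.
VARIANTS = {"beene": "been", "bene": "been"}
EXPANSIONS = {
    "let's": ["let", "us"],
    "o'th'": ["of", "the"],
    "'tis": ["it", "is"],
}

# A token is a run of lowercase letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def normalize(text):
    """Lowercase, tokenize, expand contractions, standardize variants.

    Ambiguous forms such as "a" or "bin" are passed through untouched,
    because a lookup table alone cannot tell which word is meant.
    """
    tokens = []
    for tok in TOKEN.findall(text.lower()):
        if tok in EXPANSIONS:
            tokens.extend(EXPANSIONS[tok])
        else:
            tokens.append(VARIANTS.get(tok, tok))
    return tokens


if __name__ == "__main__":
    print(normalize("'Tis beene said, let's go o'th' sea"))
```

The point of the sketch is its limitation: variants and fixed contractions can be handled mechanically, but homonyms need a human decision recorded in the text itself.
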
A program (called REPLACE) that uses a system of pre-edit codes and replacement/expansion lists was developed to prepare versions of the texts in which all forms of common words can be recognized automatically. Homonyms and variants were found and marked in each text using a system of hash suffixes. For example, "a#1" represents occurrences of "he", "a#3" represents "on", etc. Occurrences of "beene", "bene", etc. are replaced by the standard form (but occurrences of "bin" meaning a container are marked "bin#1"). In addition, a list of compound contractions and their full forms was compiled (from experience and with help from Partridge's book _Orthography in Shakespeare and Elizabethan Drama_). A simple replacement strategy is not powerful enough to handle apostrophe-s and -t forms. Hash suffixes were therefore also used to distinguish apostrophe-s contractions involving "is", "us", "his", etc. from possessive forms, and contractions of "it" from forms like "banish't" (i.e. "banished"). Program REPLACE made use of these special markings and the expansion lists to prepare "expanded" versions of the plays.

These versions were then used to determine the extent of each author's use of compound contractions. In almost every case, Fletcher uses more of these forms than Shakespeare, although the latter uses more contractions as his career progresses. Because of this secular change and the possibility of alterations introduced by scribes, compositors or revisers, the expanded versions of the plays were used in the remainder of the study.

To evaluate some procedures for determining authorship developed by A. Q. Morton and his colleagues, occurrences of 30 common collocations and 5 proportional pairs are analyzed in the texts. Within-author variation for these features is greater than had been found in previous studies. Univariate chi-square tests are shown to be of limited usefulness because of the statistical distribution of these textual features and correlation between pairs of features.
Contrary to some earlier claims, the best of the collocations do not discriminate as well as most of the individual words from which they are composed.

Turning to the rate of occurrence of individual words and groups of words, distinctiveness ratios and t-tests are used to select variables that best discriminate between Shakespeare and Fletcher. Variation due to date of composition and genre within the Shakespeare texts is examined using analysis of variance. A number of potential markers of authorship were eliminated because the rate of use for one of the subgroups (such as comedies, or late plays) was too close to Fletcher's overall rate.

Some of the observed variations are interesting in their own right. Shakespeare's comedies are characterized by high rates for pronouns and "a", together with low rates for "the". Tragedies have low rates for "a". The histories have very low rates for personal pronouns (as noted by Brainerd) and high rates for "in", "of" and "and". "In" occurs infrequently in the romances, while "so" is much more frequent in this genre than in the other three. The late plays have a high rate for "the".

The rates for several word classes were examined (pronouns, forms of "have", "be" and "do", and modal verbs), but none of these group variables proved useful. However, when compiling the list of spelling variants, I noticed that Shakespeare uses more forms that begin with "there-" or "where-" (such as "therefore", "therein", "wheresoever", etc.); the forms that Fletcher does use occur less frequently than in Shakespeare's texts. When the rates for all these "there/where- compounds" are combined, Shakespeare's overall rate is almost 12 times that of Fletcher. These forms are rather infrequent, and some forms appear in stock phrases or songs and may not reflect the author's normal usage; for these reasons, this group was not used as a variable in the main analysis of function words.
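
The kind of rate comparison reported above for the there/where- compounds can be sketched in Python as follows. This is only an assumed reading of the computation: the rule in is_compound (an alphabetic token longer than five letters that begins with "there" or "where"), the per-thousand scaling, and all function names are hypothetical choices for illustration. Whether this simple ratio of two rates matches the exact definition of the distinctiveness ratio used in the dissertation is likewise an assumption.

```python
def is_compound(token):
    """Assumed rule: 'there'/'where' plus at least one more letter."""
    return (
        token.isalpha()
        and len(token) > 5
        and token.startswith(("there", "where"))
    )


def rate_per_thousand(tokens, predicate):
    """Occurrences per 1000 words of tokens that satisfy predicate."""
    if not tokens:
        raise ValueError("empty sample")
    hits = sum(1 for tok in tokens if predicate(tok))
    return 1000.0 * hits / len(tokens)


def rate_ratio(tokens_a, tokens_b, predicate):
    """Rate in sample A divided by rate in sample B."""
    rate_b = rate_per_thousand(tokens_b, predicate)
    if rate_b == 0:
        raise ZeroDivisionError("feature never occurs in sample B")
    return rate_per_thousand(tokens_a, predicate) / rate_b


if __name__ == "__main__":
    # Toy token lists, not data from the plays.
    sample_a = ["therefore", "the", "wherein", "there"]
    sample_b = ["therefore", "a", "b", "c", "d", "e", "f", "g"]
    print(rate_ratio(sample_a, sample_b, is_compound))
```

With rare features like these, a ratio computed from a short sample rests on very few occurrences, which is consistent with the caution expressed above about using the group as a variable.
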
However, two scenes usually attributed to Fletcher, _Henry VIII_ I.iii and _The Two Noble Kinsmen_ IV.iii, contain what I feel are significant occurrences of there/where- compounds. (Later analysis indicated that the use of function words in both scenes is also much closer to Shakespeare's known work.)

A multivariate and distribution-free discriminant analysis procedure (using kernel estimation) was used to determine whether data from a single scene resembled the scenes from Shakespeare or Fletcher more closely. The classifiers based on the best marker words and the kernel method were carefully evaluated using the texts of known authorship.

To study the effect of characterization, I extracted the speeches of 62 characters (who speak at least 500 words) from the 6 test-set plays. The procedure was only slightly less accurate in classifying these character samples than the set of test-set scenes, which suggests that characterization does not affect the use of these word-rate variables with this procedure to any great degree (at least for the purpose of distinguishing Fletcher from Shakespeare). I also tested the procedures with smaller and smaller scenes, and found that they performed well for samples as short as 500 words.

When the final procedure is used to assign the 459 scenes of known authorship (containing at least 500 words), 94.8% are assigned to the correct author. Only two scenes are incorrectly classified, and 4.8% of the scenes cannot be assigned to either author by the procedure.

When applied to individual scenes of at least 500 words in _The Two Noble Kinsmen_ and _Henry VIII_, the procedure indicates that both plays are collaborations and generally supports the usual division. However, the marker words in a number of scenes often attributed to Fletcher are very much closer to Shakespeare's pattern of use. Some of the more interesting results include the assignment to Shakespeare of TNK IV.iii, which most scholars have regarded as decent imitation.
(The function-word result for this scene is supported to some extent by an occurrence of a there/where- compound and by rates for "hath" and "you" that are unlike rates in known Fletcher scenes.)

In H8, I.iii is another scene that resembles Shakespeare in the use of function words and there/where- compounds. It also contains a number of convincing Fletcher stylistic traits, so perhaps revision should be proposed. The function-word results for Shakespeare are extremely strong for the prose scene V.iv, which contains occurrences of "ye" and "'em" (Fletcher traits) that are unparalleled in Shakespeare's texts. This contradictory evidence raises questions about the copy-text; again, revision is a plausible explanation (although the results are so strong in this case that the scene appears to be mainly Shakespeare's work).

The function-word results support Foakes' and Hoy's suggestions that Fletcher touched up Shakespeare's work in II.i-ii, III.iib, and IV.i-ii. The use of function words in Act IV is very unlike Fletcher, and my results indicate that he had little or nothing to do with it. Results for II.ii and III.iib are less clear but may support the theory of revision.

___________________________________________________________________

The contents of this electronic file are copyright (c)1990 Thomas B. Horton. Quotation for scholarly (non-commercial) purposes is permitted, but please contact the author to verify the material in question and advise him of your intention. Distribution is only permitted if the file is not changed in any way and if this notice is included in the file. The author may be contacted by e-mail at tom@cs.fau.edu (Internet) or HortonT@fauvax (BITNET), or by regular mail at: Department of Computer Science, Florida Atlantic University, Boca Raton, FL 33431 USA