Corpus Linguistics Report – 2000 words




As Charles F. Meyer suggests, “because corpora consist of texts (or parts of texts), they enable linguists to contextualise their analysis of language; consequentially corpora are very well suited to more functionally based discussions of language” (Meyer 2002: 6). It is important to start with a particular hypothesis in order to avoid a mindless collection of data with no appropriate analysis. I am going to use a corpus linguistics analysis in order to test my hypothesis that ‘adjectives are used more frequently, and differently, by women than by men’. This scientific report will test whether this hypothesis is correct or incorrect.


An adjective is defined as a word that modifies a noun. It describes the quality, state, or action that a noun refers to. Adjectives follow a variety of rules:


  • They can come before nouns: “a new car”
  • Adjectives can come after verbs such as be, become, seem, look, etc.: “that car looks fast”
  • They can be modified by adverbs: “a very expensive car”
  • They can be used as complements to a noun: “the extras make the car expensive”



I am interested in the analysis of adjectives in terms of gender because it has particular socio-linguistic relevance. The analysis may be used to confirm or challenge stereotypes of women using more ‘descriptive’ language (increased use of adjectives, “the beautiful red flowers”) and men being less descriptive in their use of language (“the flowers”). On the other hand, it may confirm or challenge other stereotypes that men use language in a more ‘specific’ way (also requiring more adjectives, “the industrial double-edge saw” / “the saw”). The functional approach of corpus linguistics is appropriate here: it provides a useful set of tools for testing my hypothesis in terms of how language is commonly used and which adjectives are used more commonly by men and by women. It is also a very useful tool for providing data on how adjectives are used in sentence structure, and I want to examine whether this differs between the male and female subcorpora.




Descriptive Analyses from Data-collection in ICECUPIII (ICE Corpus Utility Programme)


I began my research by using the ICECUP tool to make a frequency list of adjectives in the female and male subcorpora:


Data-collection from the ICECUPIII











































This table shows that ICE is not a balanced corpus with respect to gender: it contains more data from males (70%) than from females (30%). The expected number of adjectives used by females in the corpus was 12892, while the observed figure was 11890, a difference of –7.8%. For males, on the other hand, the observed figure of 31082 was 3.3% higher than the expected figure of 30080. These data suggest that my initial hypothesis may be problematic, as males seem to use adjectives more than expected and females less. I need to develop this analysis with further tools.
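
The expected figures and percentage differences can be checked with a short calculation. The following Python sketch assumes that the expected counts were obtained by dividing the total number of observed adjectives (11890 + 31082) according to the 30% / 70% split of the corpus; this is an assumption about the method, and the function name is illustrative only.

```python
def expected_and_difference(observed, share, total):
    """Return the expected count and the percentage difference from it."""
    expected = share * total
    pct_diff = (observed - expected) / expected * 100
    return expected, pct_diff

observed = {"female": 11890, "male": 31082}
shares = {"female": 0.30, "male": 0.70}  # assumed share of each group in the corpus
total = sum(observed.values())

results = {}
for group in observed:
    results[group] = expected_and_difference(observed[group], shares[group], total)
    expected, pct_diff = results[group]
    print(f"{group}: expected {expected:.0f}, observed {observed[group]}, "
          f"difference {pct_diff:+.1f}%")
```

Under this assumption the sketch reproduces the expected counts and the percentage differences quoted above.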



Wordsmith-Tools (Concordance of Collocates and Patterns)


In Wordsmith-Tools I created a concordance list to analyse whether there are any patterns, collocates, clusters or plots in both subcorpora. I used the following steps:


  • I made datasets of the written and spoken subcorpora and saved them as plain text and then as a tag file (name tag). I deactivated the Wordsmith setting ‘numbers included’.


  • I then converted the plain text (name.txt) with the convert file, so that a wordlist of the clean text could be made and saved. This list can then be used to create a stop list (as a .txt file in Notepad); the first lines have to be deleted so that the list contains only words. The clean file has to be saved again (x/wsmith/text/folder).


The saved text file (name.txt) can be converted using ‘convtags’ by following the same procedure as above. The stop list mentioned above has to be activated, and the ‘tags to ignore’ option in the wordlist settings has to be deactivated. A wordlist of the clean tag file can now be made and saved both as an .LST file and as a .txt file.
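
To make the idea of a stop-listed wordlist concrete, here is a minimal Python sketch of what such a list amounts to: a frequency count over the words of a text, with numbers left out and stop-listed words removed. It is not how Wordsmith-Tools works internally; the sample text, the stop list and the function name are all hypothetical.

```python
import re
from collections import Counter

def make_wordlist(text, stop_words=frozenset()):
    """Count word frequencies, ignoring case, numbers and stop-listed words."""
    # Only alphabetic tokens are kept, so numbers such as "2000" are excluded.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical sample text and stop list, for illustration only.
sample = "The car looks fast. The new car is a very expensive car, 2000 model."
stop_list = {"the", "a", "is"}

wordlist = make_wordlist(sample, stop_list)
for word, freq in wordlist.most_common(3):
    print(word, freq)
```

Excluding numbers here plays the same role as deactivating the ‘numbers included’ setting, and the stop list removes function words so that only the words of interest remain.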



Male Subcorpora Data:


Female Subcorpora Data:


These subcorpora data sets show that certain adjectives are used more frequently by females than by males. While ‘good’ is the most common adjective for both groups (422 uses for females and 636 for males), other words such as ‘nice’ suggest a gender difference. ‘Nice’ is the 4th highest ranked adjective in the female data but only the 19th in the male data, suggesting that it is more commonly used by female speakers. On the other hand, ‘honourable’ is 7th on the male wordlist but does not feature at all on the female wordlist, showing that it is more popular with male speakers. This may reflect a socio-linguistic gendering of language in which women stereotypically engage in more ‘small talk’ (“they are nice shoes”), while male speakers stereotypically spend proportionally more time discussing world events, which may explain why words like ‘military’ and ‘Iraqi’ feature on the male but not the female wordlist. These stereotypes are, however, undercut by anomalies such as ‘political’ and ‘economic’ featuring with proportionally greater frequency in the female than in the male data. Descriptive words such as ‘big’ and ‘small’ feature prominently on both lists: ‘big’ is 12th on the male list and 17th on the female list. Words dealing with emotions, such as ‘happy’ and ‘funny’, are more common in the female data: ‘happy’ occurs 74 times in the female data but does not appear on the male list.

For female speakers, ‘good’ is most strongly collocated in position L1, the first position to the left of the tag, where it occurs 77.2% of the time. For male speakers, 494 of the 636 uses of ‘good’ occur in L1, giving a similar figure of 77.7%. ‘Right’ occurs most frequently in R1 for males (184 of 546 occurrences, or 33.7%). For females, ‘right’ is also relatively strongly collocated in L1 (84 of 231 times, or 36.4%).

This analysis suggests that for words that are strongly collocated in L1, such as ‘good’, there is little difference in position between the male and female subcorpora. Equally, for a word relatively weakly collocated in L1, such as ‘right’, there is little difference. Slight differences in percentage could be due to the different amounts of data analysed in the male and female subcorpora. Some words, such as ‘only’, are distributed more evenly across the positions. In the male data, it occurs 11 times (4%), 18 times (7%), 22 times (8.5%) and 19 times (7.2%) in positions R2, R3, R4 and R5 respectively. This contrasts with a word like ‘difficult’, which occurs very infrequently outside L1 (it occurs in L1 88.5% of the time). As no data are available for ‘only’ in the female list, a comparison in terms of gendered use cannot be made. ‘Difficult’, however, follows a similar pattern in the female data, occurring in L1 86.4% of the time.
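
The position labels used above (L1 for the first word to the left of a node, R1 for the first word to the right, and so on) can be illustrated with a small Python sketch. It is only an approximation of what a concordancer does, under the assumption of a five-word span on each side; the token sequence and function names are hypothetical, and only the 494 and 636 figures come from the data discussed above.

```python
from collections import Counter

def collocate_positions(tokens, node, span=5):
    """Count the words found at each position L5..L1 and R1..R5 around a node word."""
    positions = {}
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue
            label = f"L{-offset}" if offset < 0 else f"R{offset}"
            positions.setdefault(label, Counter())[tokens[j]] += 1
    return positions

def share(count, total):
    """Percentage of a word's occurrences that fall in one position."""
    return round(count / total * 100, 1)

# Hypothetical token sequence, for illustration only.
tokens = "that is a good idea and a good plan".split()
positions = collocate_positions(tokens, "good")
print(positions["L1"])   # words immediately to the left of 'good'
print(share(494, 636))   # share of 'good' in L1, male figures reported above
```

The `share` helper is the calculation behind each percentage quoted for a position: the count in that position divided by the total number of occurrences.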






Collocations in Female Subcorpora








Collocations in Male Subcorpora





My next set of data shows the most common collocates. ‘And’, for example, is the second most common after ‘to’ in both sets of data. It occurs 3901 times in the male data and 3013 times in the female data. For females, ‘and’ most commonly comes after the adjective (1431 of 2437 times, or 58.7%), and it occurs most frequently in position R1 (511 times, or 21.0% of the total uses in the female subcorpora data). In the male data, it occurs after the adjective 2103 of 3464 times (60.7%) and in R1 679 times (19.6%). Although there is slight variation in these figures, with females using ‘and’ after the adjective slightly less often than males and using it in R1 slightly more often, the differences are not large enough to suggest any major difference in gendered adjectival collocation.


To give another example, ‘uhm’ is ranked 21st in both the female and the male subcorpora lists, suggesting equal usage of this expression of hesitation. For males, however, ‘uh’ is ranked 3rd, which may suggest a problem with data collection and with subtle distinctions such as the difference between ‘uh’ and ‘uhm’. For females it occurs most commonly in R2 (107 of 584 times, or 18.3%). In the male data, it is again most frequent in R2 (174 of 859 times, or 20.3%). There are some variations. ‘People’ ranks higher in the female data (8th most frequent) than in the male data (16th). In both cases, it occurs most frequently in R1 (66 of 233 times, or 28.3%, for females and 78 of 304 times, or 25.7%, for males).


Such analysis suggests that, against my original hypothesis, there is little difference between the collocations of adjectives in male and female subcorpora.






This comparison table shows collocations in both male and female subcorpora. It confirms my findings so far that there is little difference in the frequency of adjective use between genders. ‘Adjgen’ for example is tagged 4.02% in the male data and 4.41% in the female data. There are no comparisons here that can be used to confirm my original hypothesis.


Chi Square


There are two types of random variables, and they yield two types of data: numerical and categorical. A chi-square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. Categorical variables yield data in categories, and numerical variables yield data in numerical form. Responses to such questions as “What is your name?” or “Do you have a cat?” are categorical because they yield data such as “Anna” or “no”. In contrast, responses to such questions as “How tall are you?” or “What is your income?” are numerical. Numerical data can be either discrete or continuous (see ‘The Chi Square Statistic’).


In my case the chi-square value is 111.2 and the associated p-value is smaller than 0.05, so the difference is statistically significant. The reported p-value of 0.00000 means that a difference of this size would be extremely unlikely to arise by chance; it does not amount to a 100% mathematical proof. The chi-square also has its problems as a methodology. As Meyer argues, “It is very useful for evaluating corpus data but does have its limitations. If the analyst is dealing with fairly small numbers…then the reliability of the chi-square is reduced” (Meyer 2002: 130). In the case of my data, I think it was a useful tool. My data set was large enough to avoid empty cells or other problems for the chi-square.
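
The chi-square figure can be checked by hand. The Python sketch below assumes a goodness-of-fit test with one degree of freedom, comparing the observed female and male adjective counts with the expected counts given earlier; the report does not state how the figure was calculated, so this is an assumption, and the function names are illustrative. For one degree of freedom the p-value can be obtained from the complementary error function, so no statistics library is needed.

```python
import math

def chi_square_gof(observed, expected):
    """Goodness-of-fit chi-square: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

observed = [11890, 31082]   # female, male adjective counts
expected = [12892, 30080]   # expected counts given earlier in the report

chi2 = chi_square_gof(observed, expected)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.1f}, p = {p:.5f}")
```

Under these assumptions the statistic comes out at roughly 111.3, in line with the value reported above, and the p-value is far below 0.05.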





In conclusion, my research suggests that there is only a slight difference between male and female adjective use, so my original hypothesis is not supported. My analysis has found no patterns that suggest different placement or increased frequency of adjectival use for male or female speakers. Although different words may be used, they generally occupy the same positions in sentence structure, and in general adjectives are used with approximately equal frequency. In analysing a corpus, many things must be taken into consideration. Meyer suggests considering “whether the corpus to be analysed is lengthy enough for the particular linguistic study and whether the samples in the corpus are balanced and representative” (Meyer 2002: 100). In order to test my hypothesis further, I would have to undertake further research. The data for the female subcorpora could be expanded to bring them into line with the data available for the male subcorpora. I could also obtain further information on the balance of the corpus and how representative it is. Separating the data purely according to gender does not allow for other factors such as age, race or class. There are also limitations with the tools and the data provided here. They cannot, for example, reveal the context in which each word was used (e.g. male-male or male-female conversation), which may be important here and could lead to other conclusions about adjective use in different social or gendered situations.






‘The Chi Square Statistic’




Meyer, Charles F. (2002) English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.