

Biology and computer science are two broad and rapidly developing fields of study. They are very different in nature, yet some topics concern both. It is therefore interesting to analyze how language is used in each field and to compare the two. The purpose of this project is to conduct such an analysis and comparison and to draw conclusions about the lexical structure of texts in biology and computer science.

In this project, I seek to answer four main questions about a corpus of texts from biology and computer science. The questions are:

  1. What are the proportions of AWL, GSL, and technical words in texts in computer science and biology, and how do they compare?
  2. How do contexts where the same AWL words appear differ between the fields?
  3. What are the most popular three-word combinations used in both fields?
  4. What is the distribution of the word “compute”, and what are the contexts of its use?

Considering how biology involves extensive classification of objects and how computer science often uses common words to describe algorithms and ideas, I expect a larger proportion of GSL words in CS, which implies a greater use of technical terms in biology. AWL proportions are expected to be very similar between the fields. The contexts in which certain AWL words are used, including the word “compute”, should be quite different, since biology is often less theoretical and more practical than computer science. Still, a prominent overlap of AWL words is expected between the fields.


Corpus Description

The overall corpus for this project consists of 16 journal articles: eight from biology and eight from computer science. Texts for both fields were taken from popular journals, namely Biology Journal and IEEE Access. Both journals are renowned platforms for authors to publish their articles, and thus are representative of general academic texts in the respective fields. When selecting the articles, I excluded those whose topics overlapped between computer science and biology. This made the comparison more general and reduced the possible overlap of traits unique to each field, avoiding possible distortion of the results. The number of articles in each field is sufficient to build a representative corpus and to gather a statistically significant number of words for the analysis. The computer science corpus includes 135,842 words and 11,167 unique word types, and the biology corpus includes 89,338 words and 8,900 word types. This provides a good amount of data for an effective comparison of language use in both corpora.


The primary tool that I used in this project is AntConc. Four features of this software package were particularly helpful: the Word List tool, which sorts words by frequency and helped me answer research question 1; the Concordance tool, which shows the contexts in which words appear and which I used for questions 2 and 4; the Concordance Plot tool, which I used for questions 3 and 4; and the Cluster/N-Gram tool, which looks up word combinations and which I used for question 3.


When collecting the articles, for convenience I picked random ones from a single representative scientific journal for each discipline. To answer research question 1, I used the Word List tool to determine the number of words belonging to each word list in each discipline and calculated the relevant ratios. To answer question 2, I used the Word List tool to look at the most frequently used AWL words in both fields, picked the ones that overlapped, and then compared the contexts in which these words appeared using the Concordance tool. This allowed me to analyze qualitatively the contexts in which the same words were used in the two disciplines.

Findings and Discussion

Question 1

The computer science corpus had 135,842 word tokens and 11,167 word types in total. It contained 19,176 AWL word tokens across 1,503 word types, and 87,581 GSL word tokens across 2,983 word types. As such, the computer science corpus consisted of 14.1% AWL words and 64.5% GSL words. This left about 21.4% for technical terms and other words.

The biology corpus had 89,336 word tokens and 8,900 word types. It had 8,891 AWL word tokens across 981 word types, as well as 50,397 GSL word tokens across 1,860 word types. Thus, the percentage of AWL words used in the biology papers was almost exactly 10%, while the proportion of GSL words was 56.4%. This left 33.6% for technical terms and other words.

Evidently, the analyzed computer science corpus used considerably more AWL words than the biology papers did. This result was rather unexpected. On the other hand, the proportion of GSL words was considerably greater in the computer science articles, as expected, and this also meant that computer science used far fewer technical terms than biology. Because computer science often uses concept names inspired by the real world to describe algorithms, its GSL and AWL proportions were much greater. For example, some of the most important CS terms are common words: “loop”, “stack”, “function”, “tree”, “program”, “return”, and so on, while biology relies heavily on scientific terms such as “genes”, “protein”, “yeast”, “phenotype”, etc.
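The percentages reported for Question 1 follow directly from the token counts. As a minimal sketch (the function name `proportions` is my own and not part of AntConc), the arithmetic can be checked as follows, treating everything that is neither AWL nor GSL as "technical terms and other words":

```python
def proportions(total_tokens, awl_tokens, gsl_tokens):
    """Return the AWL, GSL, and remaining shares as percentages (one decimal)."""
    awl = 100 * awl_tokens / total_tokens
    gsl = 100 * gsl_tokens / total_tokens
    other = 100 - awl - gsl  # technical terms and other words
    return round(awl, 1), round(gsl, 1), round(other, 1)

# Token counts as reported in the text above.
cs = proportions(135_842, 19_176, 87_581)
bio = proportions(89_336, 8_891, 50_397)

print("CS  (AWL, GSL, other):", cs)
print("Bio (AWL, GSL, other):", bio)
```

The remainder is computed before rounding, so the three shares may not sum to exactly 100 after rounding.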

Question 2

The most common AWL words used in both subjects were “data”, “detect”, “analysis”, and “function”. Notably, the biology papers used these words to describe experiments or the data-gathering process that the researchers were conducting, whereas the computer science papers used them to describe theoretical and abstract concepts. This distinction is very interesting and may be indicative of the nature of the two subjects. Computer science often focuses on simulations and analysis of data, while biology relies heavily on the empirical method, gathering and analyzing data through experimentation. Figures 1-4 illustrate contexts for the words “data” and “detect” in biology and computer science papers.
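The kind of keyword-in-context view used for this comparison can be approximated outside AntConc. The sketch below is not AntConc's implementation; the function `kwic` and the sample sentence are invented purely for illustration:

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Made-up sample text, not taken from the corpus.
sample = "We collected data from yeast cultures. The data structure stores each node."
for line in kwic(sample, "data"):
    print(line)
```

Reading the left and right context of each hit side by side is what makes it possible to judge whether a word such as “data” refers to experimental measurements or to an abstract concept.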


Question 3

The most popular three-word combinations in the computer science corpus were “et al proposed”, “in order to”, and “internet of things”. The most popular three-word combinations in biology were “from biology journal”, “materials and methods”, “on the basis”, and “of gene expression”. I noticed that the computer science papers used the phrase “et al proposed” to cite and engage with the wider academic community. For example, the Concordance tool shows that “et al proposed” appears frequently in most CS papers and is used to reference the works of other authors. This is illustrated in Figure 5. “On the basis” is a very popular construct in the biology papers. As seen in Figure 6, this phrase is used to link the discussion to gathered data; this agrees with my expectation that biology is more experiment-oriented.
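Counting three-word combinations is straightforward to sketch in code. The example below is only an approximation of what an N-gram tool does: the tokenization rule (lowercased runs of letters), the function name `top_trigrams`, and the sample text are all my own assumptions:

```python
import re
from collections import Counter

def top_trigrams(text, n=3):
    """Return the n most frequent three-word sequences with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Slide a window of three consecutive tokens over the text.
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return Counter(" ".join(t) for t in trigrams).most_common(n)

# Made-up sample text, not taken from the corpus.
sample = ("In order to test this, X et al proposed a model. "
          "In order to compare, Y et al proposed another.")
print(top_trigrams(sample, 2))
```

Because this sketch ignores sentence and document boundaries, it can also count sequences that span two sentences or pick up page headers, which is consistent with a phrase such as “from biology journal” appearing among the top results.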


Question 4

When comparing the use of the word “compute” in both disciplines, I saw that in biology it appears mostly in references to other articles. In computer science, on the other hand, the word “compute” is much more often used in the body of the article to describe a certain concept. By contrast, the biology papers use the word “compute” very sparsely. The distribution of the word “compute” is shown in Figure 9. Every computer science paper includes the word “compute”, and it is common there, while only five out of eight biology articles use this word, and only a few times per paper. Figures 7-8 illustrate the use of the word “compute” in computer science and biology papers.

This project used the AntConc software package to analyze the use of the English language in 16 papers and to compare how lexical usage differs between computer science and biology. Overall, after analyzing the corpora for computer science and biology, I identified some major differences, as well as some similarities. Firstly, CS uses far fewer technical terms than biology, which was expected. However, CS also makes greater use of AWL words, about 14% of all the words used, which was rather surprising to see. Biology often uses common AWL words to describe experimentation, while computer science focuses on high-level concepts and thus uses these words to explain ideas. Both disciplines cite other authors extensively, as is visible from the frequency of three-word N-grams that cite or mention other papers. Lastly, the word “compute” is used much more extensively in CS than in biology, which was an interesting observation. I believe that this study has shed some light on the use of language in biology and computer science, and hopefully it will be of use to someone who is interested in adjusting the language of their own writing in one of these disciplines. One major limitation of the corpus is that the text data was not cleaned and includes references, titles, and headers from the original papers.
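A per-document distribution like the one described here can be sketched as follows. This is an illustration rather than AntConc's Concordance Plot algorithm; the function `dispersion`, the document names, and the sample texts are hypothetical:

```python
import re

def dispersion(docs, word):
    """Map each document name to the relative positions (0 to 1) of `word`."""
    result = {}
    for name, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        hits = [i for i, tok in enumerate(tokens) if tok == word]
        # Scale token indices so that 0 is the start and 1 the end of the text.
        last = max(len(tokens) - 1, 1)
        result[name] = [round(i / last, 2) for i in hits]
    return result

# Made-up sample documents, not taken from the corpus.
docs = {
    "cs_sample": "we compute the hash then compute again",
    "bio_sample": "cells were grown in yeast medium",
}
plot = dispersion(docs, "compute")
for name, positions in plot.items():
    print(name, len(positions), positions)
```

Only the exact form “compute” is matched in this sketch; inflected forms such as “computed” or “computing” would need to be added explicitly.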

