Using Gale Digital Scholar Lab: Utilizing n-grams

An introduction to GDSL and its tools has already been given in a previous blog post. In this blog, I will attempt to explain the utility of another GDSL tool: n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. These items can be characters, words, or even other units like phonemes or syllables, depending on the context. N-grams are widely used in natural language processing (NLP) and computational linguistics for various tasks, including language modeling, text analysis, and machine learning.

The “n” in n-gram represents the number of items in the sequence. Commonly used n-grams include the following (a short code sketch follows this list):

  1. Unigrams (1-grams): These are single items, which are typically individual words. For example, in the sentence “The quick brown fox,” the unigrams are “The,” “quick,” “brown,” and “fox.”
  2. Bigrams (2-grams): These consist of pairs of adjacent items. In the same sentence, the bigrams would be “The quick,” “quick brown,” and “brown fox.”
  3. Trigrams (3-grams): These consist of sequences of three adjacent items. For the same sentence, the trigrams would be “The quick brown” and “quick brown fox.”
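To make this concrete, here is a minimal Python sketch (standard library only; the ngrams helper is mine, not part of GDSL) that extracts the unigrams, bigrams, and trigrams from the example sentence above:

```python
# Minimal n-gram extraction using only the Python standard library.
def ngrams(tokens, n):
    """Return the contiguous n-grams from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()

print(ngrams(tokens, 1))  # ['The', 'quick', 'brown', 'fox']
print(ngrams(tokens, 2))  # ['The quick', 'quick brown', 'brown fox']
print(ngrams(tokens, 3))  # ['The quick brown', 'quick brown fox']
```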

N-grams are often used in language modeling to estimate the probability of a specific word or sequence of words occurring in a given context. They are also used in various NLP tasks, such as text generation, machine translation, and sentiment analysis. N-grams provide a way to capture some of the context and relationships between words in a text, which can be useful for many language-related applications.

In GDSL, the n-gram analysis can be used in two ways:

  1. Word Cloud: Word Cloud is a visual representation of a collection of words, where the size of each word is proportional to its frequency or importance in the text. Typically, word clouds are used to quickly and visually convey the most prominent words in a piece of text, making it easy to identify the most common or significant terms at a glance.
  2. Term Frequency: Term Frequency (TF) is a fundamental concept in natural language processing, information retrieval, and computational linguistics. It is a quantitative measure of how often a specific term or word occurs within a document or text corpus, which helps in assessing the term’s significance and relevance in a particular textual context. In essence, TF offers a means to quantify the emphasis placed on individual terms within documents (see the short sketch after this list).
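GDSL computes these measures for you, but the underlying idea of term frequency is easy to sketch in a few lines of Python (a rough illustration with a made-up sentence, not GDSL’s actual implementation):

```python
# Rough illustration of term frequency using the standard library.
from collections import Counter

document = "the war ended and the war began again"   # made-up example text
tokens = document.lower().split()

term_frequency = Counter(tokens)            # raw count of each term
total_terms = sum(term_frequency.values())

for term, count in term_frequency.most_common(2):
    # Relative term frequency: occurrences of the term divided by total terms.
    print(term, count, count / total_terms)
# the 2 0.25
# war 2 0.25
```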

Both these tools can provide a useful way to understand the main concepts, ideas, and words in a textual corpus. Here is an example of a word cloud made from our test content set.
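The word cloud itself is generated inside the Lab. For readers who want to reproduce the idea outside GDSL, a comparable cloud can be built with the open-source wordcloud Python package; the sketch below assumes the content set has been exported as a plain-text file, and the file names are made up:

```python
# Build a word cloud from exported text with the "wordcloud" package
# (pip install wordcloud). File names here are purely illustrative.
from wordcloud import WordCloud

with open("content_set.txt", encoding="utf-8") as f:
    text = f.read()

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("word_cloud.png")   # word size reflects term frequency
```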

To attain precision in n-grams, search qualifiers can be utilized. First, create a content set A with parameters X and Y. Then generate a hypothesis Z about A. Z could concern the influence of another factor, an explanation behind certain events, or a correlation with other factors. Once the hypothesis has been generated, incorporate it into your search by adding another parameter that corresponds to Z. The new content set B created by parameters X, Y and Z will be a subset of the prior content set, so B ⊆ A. Analyzing the set difference A \ B, that is, the documents that match X and Y but not Z, gives insight into what data was left out when parameter Z was introduced. This can usually aid in identifying different clusters of data within the same corpus. The word clouds can also aid in visual identification, since they will look different for the two content sets.
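In set terms, the documents of interest are simply those in A that are not in B. The sketch below (with invented document identifiers; in GDSL this comparison happens through the search interface and word clouds rather than code) shows the idea:

```python
# Hypothetical document IDs standing in for two GDSL content sets.
content_set_xy = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # parameters X and Y (set A)
content_set_xyz = {"doc2", "doc4"}                          # parameters X, Y and Z (set B, a subset of A)

# Documents that drop out once parameter Z is added: A \ B.
left_out = content_set_xy - content_set_xyz
print(sorted(left_out))  # ['doc1', 'doc3', 'doc5']
```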


For example, consider the first word cloud of the content set with parameters X and Y, where X = Pakistan, Y = War, and the operator is AND. The hypothesis here was that this content set contains two clusters: one that reports on the war between India and Pakistan, and another that reports on the war between Pakistan and Afghanistan (and the Soviet Union). To check this, parameter Z was added (Z = India), and the difference A \ B, the documents matching X and Y but not Z, was analyzed. And rightly so: “Soviet” is not found in the XYZ word cloud but is present in the XY word cloud. This confirms the hypothesis.

Although this might be a little complex, it can help greatly in understanding and qualifying data.

The position of the missing terms in the XY word cloud can also indicate how frequent they were in XY as a whole.

Using Gale Digital Scholar Lab: Achieving Precision In Document Clustering

One tool that can be used for Digital Humanities is the Gale Digital Scholar Lab (henceforth: GDSL). GDSL is a database of various texts that can be analyzed, searched, cleaned, and organized using natural language processing (NLP). The toolset for textual analysis provided by GDSL includes document clustering, named entity recognition, n-grams, parts of speech, sentiment analysis, and topic modeling. All these analyses can be used to understand and categorize data in different ways, and they are useful for scholars who aim to study trends and correlations in particular types of texts. Currently, Carleton has access to 21 textual databases, including American Fiction 1774-1920, American Historical Periodicals from the American Antiquarian Society, Archives of Sexuality and Gender, Archives Unbound, British Library Newspapers, Decolonization: Politics and Independence in Former Colonial and Commonwealth Territories, and more.

In this blog, I aim to study one of the tools provided by GDSL and present ways to make its use more precise and to draw on more of its capabilities. This tool is Document Clustering. To begin document clustering, we first need to search for appropriate data that can be used to create a Content Set. The Advanced Search feature can be used to generate Content Sets with specific characteristics, and search operators and special characters can further help in creating precise content sets.

A combination of different search terms, operators, and special characters will result in an appropriate dataset. One important parameter is “word1 nx word2”, where x stands for the maximum number of words between word1 and word2. For example, if you want to see all the sources in which “Ireland” is mentioned within 10 words of “Finland”, you can search “Finland n10 Ireland”. After searching, you will see all your results, and they can be added to a content set by selecting the “Select All” and “Add To Content Set” options.
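To make the meaning of the proximity operator concrete, here is a rough Python sketch of what “Finland n10 Ireland” asks for: the two terms occurring within ten words of each other. This only illustrates the semantics; the actual matching is performed by GDSL’s search engine.

```python
# Illustration of a proximity ("nX") match: True if the two terms occur
# within max_distance words of each other in the text.
def within_n_words(text, term1, term2, max_distance):
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    positions1 = [i for i, t in enumerate(tokens) if t == term1.lower()]
    positions2 = [i for i, t in enumerate(tokens) if t == term2.lower()]
    return any(abs(i - j) <= max_distance for i in positions1 for j in positions2)

sample = "Trade between Finland and its partners, including Ireland, grew steadily."
print(within_n_words(sample, "Finland", "Ireland", 10))  # True
```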

Once you have created the Content Set, it can be used for further analysis. As you can see, I got 53 results and added all of them to a test content set. Now, I will use the Document Clustering tool on this content set. It can be accessed via My Content Sets > Analyze > Document Clustering.

By clicking the “Run” option, you can run the analysis on the given dataset. I have run a basic analysis on my dataset, and I will now show how the output of the analysis can be better understood and utilized. This is the initial output of my first run with two clusters.

Please note that GDSL does not tell you what the y-axis or x-axis represents, but there are ways to understand the output in a more comprehensive manner. The very first thing to do is simply to compare and contrast the data points in the two clusters manually. I attempted to do this with the clusters I generated and saw that cluster 2 (the orange cluster) contained more philosophical works, whereas cluster 1 (the blue cluster) contained more general works such as history, literature, and news. This gives a general idea of what the x-axis (or perhaps the y-axis) might mean for this graph: the higher the x value, the more philosophical the work might be.

Another good way to understand the output is to increase the number of clusters. You can change the cleaning configuration and the number of clusters by going to Document Clustering > Tool Setup (grey toolbar on the left) > Cleaning Configuration/Number of Clusters. Below is the setup I used for my second test run: rather than using only 2 clusters, I used 3.
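GDSL does not document its clustering internals, but document clustering of this kind is commonly implemented as k-means over TF-IDF document vectors, and the Number of Clusters setting corresponds to the k parameter. The sketch below (using scikit-learn and made-up toy documents, so only an analogy to what the Lab does) shows the effect of re-running with a different cluster count:

```python
# Hedged sketch of document clustering with TF-IDF + k-means (scikit-learn).
# The toy documents stand in for a GDSL content set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "a treatise on metaphysics and the nature of being",
    "an essay concerning human understanding and reason",
    "news report on the harvest and local market prices",
    "a short history of the town and its founders",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

for n_clusters in (2, 3):   # comparable to changing "Number of Clusters" in Tool Setup
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    print(n_clusters, "clusters:", labels.tolist())
```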

The graph generated for 3 clusters looked like this:

Given this output, I aimed to find out the main differences between the three clusters. I found that the third cluster in this graph included only magazines. The second cluster also included magazines, but of a more academic than literary nature.

In addition to this, you can also revise your dataset and search for terms within it. This can help you find out what classifications are being made in the clusters. It will not always be obvious what a cluster contains, but a close look and some analysis can provide more information.