An introduction to GDSL and its tools has already been given in a previous blog post. In this blog, I will attempt to explain the utility of another GDSL tool, namely n-gram. An n-gram is a contiguous sequence of n items from a given sample of text or speech. These items can be characters, words, or even other units like phonemes or syllables, depending on the context. N-grams are widely used in natural language processing (NLP) and computational linguistics for various tasks, including language modeling, text analysis, and machine learning.
The “n” in n-gram represents the number of items in the sequence. Commonly used n-grams include:
- Unigrams (1-grams): These are single items, which are typically individual words. For example, in the sentence “The quick brown fox,” the unigrams are “The,” “quick,” “brown,” and “fox.”
- Bigrams (2-grams): These consist of pairs of adjacent items. In the same sentence, the bigrams would be “The quick,” “quick brown,” and “brown fox.”
- Trigrams (3-grams): These consist of sequences of three adjacent items. For the same sentence, the trigrams would be “The quick brown” and “quick brown fox.”
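The extraction described above is straightforward to express in code. The following is a minimal Python sketch (not GDSL itself) that slides a window of size n over a token list to produce the unigrams, bigrams, and trigrams from the example sentence:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()
print(ngrams(tokens, 1))  # unigrams: one tuple per word
print(ngrams(tokens, 2))  # bigrams: adjacent word pairs
print(ngrams(tokens, 3))  # trigrams: adjacent word triples
```

Note that a sequence of k tokens yields k − n + 1 n-grams, which is why the four-word sentence produces three bigrams but only two trigrams.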
N-grams are often used in language modeling to estimate the probability of a specific word or sequence of words occurring in a given context. They are also used in various NLP tasks, such as text generation, machine translation, and sentiment analysis. N-grams provide a way to capture some of the context and relationships between words in a text, which can be useful for many language-related applications.
In GDSL, n-gram analysis can be used in two ways:
- Word Cloud: Word Cloud is a visual representation of a collection of words, where the size of each word is proportional to its frequency or importance in the text. Typically, word clouds are used to quickly and visually convey the most prominent words in a piece of text, making it easy to identify the most common or significant terms at a glance.
- Term Frequency: Term Frequency (TF) is a fundamental concept in natural language processing, information retrieval, and computational linguistics. It is a quantitative measure of how often a specific term occurs within a document or text corpus, which aids in assessing the term’s significance and relevance in a particular textual context. In essence, TF offers a means to quantify the emphasis placed on individual terms within documents.
Both these tools can provide a useful way to understand the main concepts, ideas, and words in a textual corpus. Here is an example of a word cloud made from our test content set.
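Both views rest on the same underlying computation: counting terms. As a rough illustration of what drives them, here is a sketch of raw term frequency (count of a term divided by total terms) in Python; this is the simplest TF variant and is not necessarily the exact formula GDSL uses. The same counts would also determine the relative word sizes in a word cloud:

```python
from collections import Counter

def term_frequency(text):
    """Map each lowercased term to its relative frequency in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the war in Pakistan the war")
# The most frequent terms ("the", "war") would render largest in a word cloud.
```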
To attain precision in n-grams, qualifiers in search can be utilized. First, create a content set with parameters X and Y. Then generate a hypothesis Z about that content set; Z could concern the influence of another factor, an explanation for certain events, or a correlation with other factors. Once the hypothesis has been formed, incorporate it into your search by adding another parameter that corresponds to Z. The new content set created by parameters X, Y, and Z is a subset of the prior content set: call the original set A and the new subset B. Analyzing the set difference A \ B (the documents in A that are not in B) gives insight into which data was excluded when parameter Z was introduced. This can aid in identifying distinct clusters of data within the same corpus. Word clouds can also aid in visual identification here, since the word clouds for the two content sets will look different.
For example, compare the first word cloud of the data set with parameters X and Y, where X = Pakistan, Y = War, and function = AND. The hypothesis was that this content set contains two clusters: one reporting the war between India and Pakistan, and another reporting the war between Pakistan and Afghanistan (and the Soviet Union). To check this, parameter Z was added (Z = India), and the difference set A \ B, i.e. the documents matching XY but not XYZ, was analyzed. Indeed, “Soviet” is not found in XYZ but does appear in XY, which confirms the hypothesis.
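The comparison above can be sketched as a set difference over the vocabularies of the two content sets. The document strings below are toy stand-ins, not real GDSL search results; in practice XY and XYZ would be the documents returned by the two searches:

```python
def vocab(docs):
    """Collect the set of lowercased terms appearing across documents."""
    return {token.lower() for doc in docs for token in doc.split()}

# Hypothetical documents matching X = Pakistan, Y = War.
xy = ["Pakistan war with Soviet forces", "Pakistan India war report"]
# The subset that also matches Z = India.
xyz = ["Pakistan India war report"]

# Terms present in XY but filtered out once Z was introduced.
difference = vocab(xy) - vocab(xyz)
print(difference)  # includes 'soviet', supporting the two-cluster hypothesis
```

The difference set surfaces exactly the vocabulary of the excluded cluster, which is what makes the word clouds for the two content sets visibly diverge.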
Although this might be a little complex, it can help greatly in understanding and qualifying data.
The position of the missing terms in the XY word cloud can also indicate how frequent they were in XY as a whole.