Analysis of n-grams (simple text analysis)

Static Badge Static Badge

N-grams are sequences of n words. Knowing which n-grams and how frequent they appear in a given text is useful in various search related problems and text analysis. Given an input corpus(text), construct a list of frequencies of n-grams where \(n\) is the program parameter. We update the list by weighting the elements by the probability of occurrence of parts of n-grams.\ Counting n-grams Write a program that given an input \(n\) (size of \(n-gram\)) and a given corpus returns lists of all n-grams with he corresponding frequencies (number of occurrences in the given corpus).\ Relative frequencies Some \((n) grams\) can occur frequently simply because one of their words is very frequent. An interesting statistic can be obtained by dividing the number of occurrences of \(n-gram\) \"A B\" with the total number of occurrences of all n-grams that begin with a letter \"A\". This gives \(P (B | A)\) probability of seeing \"B\", following a letter \"A\". The program should print the n-grams with aforementioned metric.

Implementation guidelines

  1. Running the program:

    • The program can be ran in different modes (sequential, parallel, distributed) by specifying a parameter.

    • User can specify the n (n-gram) and the input file (text) considered a corpus.

    • The program measures run-time needed to complete.

  2. Problem specific implementation requirements

    • Every version (sequential, parallel, and distributed) measures cycles passed. Every update of positions of all particles is considered a cycle. All three implementations run the simulation until they reach a specified number of cycles.

    • The implementation must adapt automatically to the hardware it is being ran on (Physical CPU's, Cores, Memory, etc..);

Testing

The report must include extensive testing and explanation of results (numeric and graphical). All three versions must be tested. The parameters that influence the runtime are the size of the input corpus and \(n\). Consequently, both need to be tested independently to show how the implementation scales. Present the results with informative charts/figures and explain them in detail. To obtain corpus use the Internet. Project Opus http://opus.nlpl.eu/ is a good source of interesting texts.