Sentiment analysis

In large marketplaces like Amazon, customer feedback is crucial and is typically gathered through concise product reviews. These reviews are publicly accessible, which benefits users and manufacturers alike, and they also provide valuable insights for Amazon itself. The more reviews are analyzed, the more useful the resulting insights become. Consequently, handling such a large dataset requires robust and fast analytical tools.

The challenge for students in this project is to develop a high-performance online data analysis tool. This tool must be adept at managing parallel data streams effectively. To facilitate this, we have developed a websocket-based API, which allows clients to subscribe to a variety of topics. Each topic continuously streams reviews, and clients have the flexibility to subscribe to multiple topics simultaneously.

In this setup, the API serves as the producer, and your program acts as the consumer. Incoming reviews should be stored as tasks and distributed to worker threads that perform the sentiment analysis. The analysis itself can be performed using Stanford's CoreNLP library (https://nlp.stanford.edu/software/corenlp.shtml). The stream can be read and processed in Java, adapting relevant libraries as necessary.
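
The producer/consumer arrangement described above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the class name ReviewPipeline, the queue capacity of 10,000, and the analyzer callback are all assumptions made for the example. In a real solution the callback would wrap a CoreNLP sentiment call; here it is a plain function so that the sketch depends only on the Java standard library.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

// Hypothetical helper: a bounded queue feeding a fixed set of worker threads.
final class ReviewPipeline {
    // Sentinel compared by identity, so it can never collide with a real review.
    private static final String POISON = new String("<stop>");

    // Assumed capacity; a bounded queue makes the producer wait when workers fall behind.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final AtomicInteger processed = new AtomicInteger();
    private final ExecutorService workers;
    private final int workerCount;

    ReviewPipeline(int workerCount, ToIntFunction<String> analyzer) {
        this.workerCount = workerCount;
        this.workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.execute(() -> consume(analyzer));
        }
    }

    // Worker loop: take a review, analyze it, count it; stop on the sentinel.
    private void consume(ToIntFunction<String> analyzer) {
        try {
            while (true) {
                String review = queue.take();   // blocks until work arrives
                if (review == POISON) return;
                analyzer.applyAsInt(review);    // sentiment score, unused in this sketch
                processed.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Called by the producer (for example the stream reader); blocks while the queue is full.
    void submit(String review) {
        try {
            queue.put(review);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Lets the workers drain the queue, stops them, and returns the number of analyzed reviews.
    int shutdown() {
        try {
            for (int i = 0; i < workerCount; i++) queue.put(POISON); // one sentinel per worker
            workers.shutdown();
            workers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

A caller would construct the pipeline once, call submit for every review received from the stream, and call shutdown when the measurement run ends.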

Write a program capable of consuming data from this stream and performing sentiment analysis on each item. This project is best suited to students who are comfortable with research-oriented tasks.

Reviews API

Open a WebSocket connection to the following address: wss://prog3.student.famnit.upr.si/sentiment
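
Connecting with the JDK's built-in java.net.http.WebSocket client (Java 11 or later) might look like the sketch below. The names ReviewListener and ReviewClient are made up for illustration. The message format and the way topics are subscribed to are not specified in this document, so the sketch only opens the connection and forwards each complete text message, unparsed, to a caller-supplied sink.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

// Hypothetical listener: reassembles fragmented text messages and hands them to a sink.
final class ReviewListener implements WebSocket.Listener {
    private final StringBuilder partial = new StringBuilder();
    private final Consumer<String> sink;

    ReviewListener(Consumer<String> sink) {
        this.sink = sink;
    }

    // A text message may arrive in several fragments; `last` marks the final one.
    void accept(CharSequence data, boolean last) {
        partial.append(data);
        if (last) {
            sink.accept(partial.toString());
            partial.setLength(0);
        }
    }

    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        accept(data, last);
        webSocket.request(1); // ask the client to deliver the next message or fragment
        return null;          // the data was copied, so it may be reclaimed immediately
    }
}

// Hypothetical entry point for opening the connection.
final class ReviewClient {
    static WebSocket connect(String url, Consumer<String> sink) {
        return HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create(url), new ReviewListener(sink))
                .join(); // waits for the handshake; throws if the connection fails
        // Any subscription message would be sent here with sendText(...);
        // its format is not given in this document, so none is sent.
    }
}
```

The sink is the natural place to hand reviews over to the task queue, for example ReviewClient.connect("wss://prog3.student.famnit.upr.si/sentiment", queue::add) for some queue of strings.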

Implementation guidelines

Running the program:
- The program can be run in different modes (sequential, parallel, distributed) by specifying a parameter.
- The program measures the number of reviews per second that were successfully analyzed.
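
One possible way to handle the mode parameter and the reviews-per-second measurement is sketched below. The names Mode and ThroughputMeter are illustrative assumptions, as is the choice to sample the rate in windows; any equivalent bookkeeping would do.

```java
import java.util.Locale;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical run modes, selected by a command-line argument such as "parallel".
enum Mode {
    SEQUENTIAL, PARALLEL, DISTRIBUTED;

    // Throws IllegalArgumentException for an unknown mode name.
    static Mode parse(String arg) {
        return valueOf(arg.trim().toUpperCase(Locale.ROOT));
    }
}

// Hypothetical meter: workers call recordOne(), a timer thread calls sampleAndReset().
final class ThroughputMeter {
    private final LongAdder count = new LongAdder(); // cheap to update from many threads
    private long windowStartNanos = System.nanoTime();

    void recordOne() {
        count.increment();
    }

    // Returns reviews per second since the previous sample and starts a new window.
    synchronized double sampleAndReset() {
        long now = System.nanoTime();
        double rate = ratePerSecond(count.sumThenReset(), now - windowStartNanos);
        windowStartNanos = now;
        return rate;
    }

    static double ratePerSecond(long reviews, long elapsedNanos) {
        return elapsedNanos <= 0 ? 0.0 : reviews * 1_000_000_000.0 / elapsedNanos;
    }
}
```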

Problem-specific implementation requirements:
- The implementation should use a thread pool policy that scales the number of active threads down when few reviews arrive. A caching approach could be beneficial: when there is not enough work for all threads, the thread pool must keep the idle threads cached until more work arrives.
- The implementation must adapt automatically to the hardware it is running on (physical CPUs, cores, memory, etc.).
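
A pool with roughly this behaviour can be built from the standard ThreadPoolExecutor, as in the sketch below. The class name AdaptivePool and the 30-second idle timeout are assumptions chosen for illustration. Note that Runtime.availableProcessors() reports the processors available to the JVM, which are usually logical cores rather than physical CPUs, so it is only an approximation of the hardware.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory for a pool that grows up to a limit and shrinks when idle.
final class AdaptivePool {

    // Threads are created on demand up to maxThreads. An idle thread stays cached
    // for idleTimeout and is then released, so the pool can shrink to zero threads.
    static ThreadPoolExecutor create(int maxThreads, long idleTimeout, TimeUnit unit) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,        // core size equals max size, see the note below
                idleTimeout, unit,
                new LinkedBlockingQueue<>());  // excess tasks wait here
        pool.allowCoreThreadTimeOut(true);     // lets even core threads time out when idle
        return pool;
    }

    // Sizes the pool from the processor count visible to the JVM.
    static ThreadPoolExecutor forThisMachine() {
        int cores = Runtime.getRuntime().availableProcessors();
        return create(cores, 30, TimeUnit.SECONDS); // assumed timeout
    }
}
```

Setting the core size equal to the maximum is deliberate: with an unbounded queue, ThreadPoolExecutor only starts threads beyond the core size when the queue rejects a task, which never happens here. Runtime.getRuntime().maxMemory() could be used in a similar way to choose a bound for the task queue.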

Testing

The report must include extensive testing and an explanation of the results (numeric and graphical). Testing should be done by measuring the number of processed reviews per second for each implementation. Since the stream is not guaranteed to be steady, measure the processed reviews per second over a sufficiently long period of time (minutes). Compute the average, median, minimum, maximum, and standard deviation of the obtained measurements.
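
The summary statistics listed above could be computed from the collected per-second samples along these lines. The class name ThroughputStats is hypothetical, and the sketch uses the sample standard deviation (dividing by n - 1); the text does not say which variant is expected, so treat that choice as an assumption.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical summary of reviews-per-second samples collected during one run.
final class ThroughputStats {
    final double mean, median, min, max, stdDev;

    ThroughputStats(double[] samples) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;

        min = sorted[0];
        max = sorted[n - 1];
        // Middle element for odd n, mean of the two middle elements for even n.
        median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        double sum = 0;
        for (double s : sorted) sum += s;
        mean = sum / n;

        double squaredDiffs = 0;
        for (double s : sorted) squaredDiffs += (s - mean) * (s - mean);
        // Sample standard deviation; defined as 0 for a single measurement.
        stdDev = n > 1 ? Math.sqrt(squaredDiffs / (n - 1)) : 0.0;
    }

    // One line per run, suitable for appending to a plain-text results file.
    String toLine() {
        return String.format(Locale.ROOT, "mean=%.2f median=%.2f min=%.2f max=%.2f sd=%.2f",
                mean, median, min, max, stdDev);
    }
}
```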

Prepare all results in a text file and analyze them in detail in your report.

Note: This project might require more communication to decide on details of the implementation.