ICT337: Explain in detail the significance of the Apache Spark framework and discuss in detail FIVE (5) limitations of using the Apache Spark framework in big data processing. Big Data Computing in the Cloud assignment
Module / Subject / School:
ICT337 Big Data Computing in the Cloud
Singapore University of Social Sciences
Requirements:
Question 1
Question 1a
Explain in detail the significance of the Apache Spark framework. Next, discuss in detail FIVE (5) limitations of using the Apache Spark framework in big data processing.
(15 marks)
Question 1b
Explain in detail the Spark job execution process and how the Directed Acyclic Graph (DAG) works in Spark.
(10 marks)
Question 2
Question 2a
Explain the key logic for both PySpark built-in programs in Figures Q2(a)(1) and Q2(a)(2).
Question 2b
Appraise in detail the concepts of PySpark Resilient Distributed Datasets (RDD) and PySpark DataFrames. Use a table to highlight the main differences between RDDs and DataFrames. (15 marks)
Question 3
In your local machine’s Spark setup, design and develop a PySpark program using PySpark RDD APIs to perform the following tasks. Show your full PySpark program and provide screenshots and results for all key steps where applicable.
Data sources used in this question are: (i) Alice’sAdventuresInWonderland.txt, and (ii) TheAdventuresOfSherlockHolmes.txt. Note that these data files can be downloaded from the ICT337 Canvas webpage.
Question 3a
Read both text files and store their content using Spark RDDs. Show the total number of records.
Question 3b
Perform the following tasks and show the results in each step:
- Write a function to remove all characters that are not alphanumeric or spaces, change all characters to lower case, and remove any leading or trailing spaces.
- Extract individual word tokens from the sentences.
- Remove white spaces between sentences and find the word occurrence counts in terms of (word, count).
(6 marks)
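To make the cleaning and counting steps concrete, here is a plain-Python sketch of the same logic (function names are illustrative, not part of the assignment). In the actual submission, the equivalent PySpark pipeline would be `rdd.map(clean_line).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.

```python
import re

def clean_line(line):
    """Keep only alphanumeric characters and spaces, lower-case, and strip ends."""
    return re.sub(r"[^A-Za-z0-9 ]", "", line).lower().strip()

def word_counts(lines):
    """Tokenize cleaned lines and build (word, count) occurrence counts."""
    counts = {}
    for line in lines:
        # str.split() with no argument also discards repeated internal whitespace
        for word in clean_line(line).split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```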
Question 3c
Set up the Python Natural Language Toolkit ( https://www.nltk.org/ ) library in your program and download the Stop Words. Show the list of English Stop Words and its total count. Based on the results in Q3(b), remove all Stop Words from the resultant RDDs and then randomly sample 0.5 percent of the RDD words. Show the results and their total word count. Finally, find the TOP TEN (10) most frequent words. (7 marks)
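A minimal sketch of the stop-word filtering, sampling, and top-N steps, using a small inline stop-word set as a stand-in for NLTK's `stopwords.words("english")` (which requires `nltk.download("stopwords")`). In PySpark, the sampling step would be `rdd.sample(False, 0.005)`.

```python
import random
from collections import Counter

# Stand-in for the full NLTK English stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def remove_stop_words(word_counts):
    """Drop stop words from a {word: count} mapping."""
    return {w: c for w, c in word_counts.items() if w not in STOP_WORDS}

def sample_words(words, fraction=0.005, seed=42):
    """Bernoulli sample roughly `fraction` of the words, like RDD.sample()."""
    rng = random.Random(seed)
    return [w for w in words if rng.random() < fraction]

def top_n(word_counts, n=10):
    """Return the n most frequent (word, count) pairs."""
    return Counter(word_counts).most_common(n)
```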
Question 3d
Compute an average word occurrence frequency.
(3 marks)
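One reasonable reading of "average word occurrence frequency" is total occurrences divided by the number of distinct words; a sketch (the interpretation is an assumption, and in PySpark it maps to `counts.map(lambda kv: kv[1]).mean()`):

```python
def average_frequency(word_counts):
    """Average occurrences per distinct word: total count / vocabulary size."""
    if not word_counts:
        return 0.0
    return sum(word_counts.values()) / len(word_counts)
```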
Question 3e
Find the common words between the two text files. Show the TOP THIRTY (30) most common words.
(3 marks)
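A plain-Python sketch of finding the shared vocabulary of the two books. The ranking key (combined occurrence count across both files) is an assumption, since the brief does not specify how "most common" is measured; in PySpark this is essentially a `join` on word followed by a sort.

```python
from collections import Counter

def common_words(counts_a, counts_b, n=30):
    """Words appearing in both books, ranked by combined occurrence count."""
    shared = set(counts_a) & set(counts_b)
    combined = Counter({w: counts_a[w] + counts_b[w] for w in shared})
    return combined.most_common(n)
```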
Question 4
In your local machine’s Spark setup, develop a PySpark program using PySpark RDD APIs to perform the following tasks. Show your full PySpark program and provide screenshots and results for all key steps where applicable.
Data sources used in this question are: (i) mov_rating.dat, (ii) mov_item.dat, (iii) mov_genre.dat, (iv) mov_user.dat, and (v) mov_occupation.dat. Note that these data files can be downloaded from ICT337 Canvas webpage.
Question 4a
Construct a program to perform the following tasks and show the results in each step:
- Read the “mov_rating.dat” and “mov_item.dat” files and store the content using Spark RDDs.
- Find the Top FIVE (5) users that review the most movies. Show the user ID and total number of occurrences.
- Find the Top TEN (10) most reviewed movies. Show the movie ID, movie name, and total number of occurrences. Note that you may use Spark Broadcast variable to store the mapping of movie ID to movie name.
(5 marks)
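The counting logic above can be sketched in plain Python as follows. The `(user_id, movie_id, rating)` tuple layout is an assumption about the parsed file format, and the `id_to_name` dict stands in for the Spark Broadcast variable the brief suggests.

```python
from collections import Counter

def top_reviewers(ratings, n=5):
    """ratings: iterable of (user_id, movie_id, rating); returns (user_id, count)."""
    return Counter(u for u, _, _ in ratings).most_common(n)

def top_movies(ratings, id_to_name, n=10):
    """Return (movie_id, movie_name, review_count), most reviewed first."""
    counts = Counter(m for _, m, _ in ratings).most_common(n)
    return [(m, id_to_name.get(m, "?"), c) for m, c in counts]
```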
Question 4b
Based on the RDDs of “mov_rating.dat” and “mov_item.dat” in Question 4(a), create new RDDs of (movie ID, ((user ID, rating), genre_list)) using a Spark join operation.
There are NINETEEN (19) different genre categories. For each category, find the TOP THREE (3) most reviewed movies, sorted by average review ratings (i.e., highest to lowest ratings). Show the genre name, movie ID, movie name, and average rating.
Save the TOP THREE (3) movie results using RDD file saving mechanism and show the content. Note that the genre name can be referenced from “mov_genre.dat” and the mapping can be stored using Spark Broadcast variable.
(6 marks)
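A plain-Python sketch of the per-genre ranking, assuming `movie_genres` and `id_to_name` are broadcast-style lookup dicts built from “mov_item.dat” and “mov_genre.dat”. Here "top" is taken to mean sorted by average rating within each genre; in PySpark the same shape comes from a `join` plus `aggregateByKey`.

```python
from collections import defaultdict

def top_by_genre(ratings, movie_genres, id_to_name, n=3):
    """ratings: (movie_id, rating) pairs; returns genre -> top-n (id, name, avg)."""
    sums = defaultdict(lambda: [0.0, 0])          # movie_id -> [rating sum, count]
    for movie_id, rating in ratings:
        sums[movie_id][0] += rating
        sums[movie_id][1] += 1
    averages = {m: total / count for m, (total, count) in sums.items()}

    by_genre = defaultdict(list)
    for movie_id, avg in averages.items():
        for genre in movie_genres.get(movie_id, []):
            by_genre[genre].append((movie_id, id_to_name.get(movie_id, "?"), avg))
    return {g: sorted(ms, key=lambda t: t[2], reverse=True)[:n]
            for g, ms in by_genre.items()}
```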
Question 4c
Read the “mov_user.dat” file to obtain each user’s occupation and create new RDDs of (movie ID, ((user ID, rating), occupation)).
There are TWENTY-ONE (21) categories of occupation. For each category, find the TOP THIRTY (30) most reviewed movies, sorted by average review ratings (i.e., highest to lowest ratings). Show the total movie count for the occupation category, as well as the occupation type, movie ID, movie name, and average rating.
Save the TOP THIRTY (30) movie results using the RDD file saving mechanism and show the content. Note that the occupation name can be referenced from “mov_occupation.dat”.
(6 marks)
Question 4d
We would like to build a simple movie recommendation engine with the available movie data.
To accomplish this, perform the following tasks and show the results in each step (i.e., sample RDD content and its total count):
- Create movie rating RDD with key-value pairs of: (user ID, (movie ID, rating))
- Perform an RDD self-join operation so as to find all combinations of movie pairs rated by a given user. The resultant RDD should have the structure of: (user ID, ((movie #1, rating #1), (movie #2, rating #2))).
- Filter out duplicate movie pairs using the condition movie #1 < movie #2. This should greatly reduce the RDD size.
(4 marks)
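The pair-building steps above can be sketched in plain Python. Instead of a literal self-join followed by the `movie #1 < movie #2` filter, this sketch sorts each user's rated movies and uses `itertools.combinations`, which yields exactly the deduplicated pairs (assuming one rating per user-movie pair); the output structure matches the brief.

```python
from collections import defaultdict
from itertools import combinations

def movie_pairs(ratings):
    """ratings: (user_id, movie_id, rating) tuples.
    Returns (user_id, ((m1, r1), (m2, r2))) tuples with m1 < m2."""
    by_user = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append((movie, rating))
    pairs = []
    for user, rated in by_user.items():
        # sorting by movie id guarantees m1 < m2 in every emitted pair
        for (m1, r1), (m2, r2) in combinations(sorted(rated), 2):
            pairs.append((user, ((m1, r1), (m2, r2))))
    return pairs
```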
Question 4e
Based on the RDD results from Q4(d), perform the following tasks and show the results in each step (i.e., sample RDD content and its total count):
- Organize the RDD into key-value pairs of: ((movie #1, movie #2), (rating #1, rating #2)). Then, collect all movie ratings for each movie pair, whereby the resultant RDD structure should be in terms of ((movie #1, movie #2), ((rating 1, rating 2), (rating 3, rating 4), …)).
- For each movie pair, compute the Cosine Similarity (https://en.wikipedia.org/wiki/Cosine_similarity) for the collection of movie rating pairs. This is the key algorithm for measuring the degree of similarity between two movies based on their ratings. The cosine similarity score ranges from -1 to +1, where -1 refers to movies that are opposite in nature and +1 refers to movies that are highly similar. Figure 3 shows the definition of Cosine Similarity.
Figure 3: Definition of Cosine Similarity
On top of the cosine similarity score, your function should also count the total number of rating pairs (i.e., your function should have output information of: ((movie #1, movie #2), (score, numberOfRatingPairs))).
Note that you may use the RDD cache operation to store the final results.
(6 marks)
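The similarity function for one movie pair can be sketched as follows, using the standard cosine similarity definition (the one Figure 3 refers to): sim = Σxᵢyᵢ / (√Σxᵢ² · √Σyᵢ²), applied to the collected rating pairs.

```python
from math import sqrt

def cosine_similarity(rating_pairs):
    """rating_pairs: iterable of (rating1, rating2) for one movie pair.
    Returns (score, number_of_rating_pairs)."""
    sum_xy = sum_xx = sum_yy = 0.0
    count = 0
    for x, y in rating_pairs:
        sum_xy += x * y
        sum_xx += x * x
        sum_yy += y * y
        count += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0.0
    return (score, count)
```

In the PySpark program this would be applied with `mapValues` to the grouped rating pairs, with the result cached as the brief suggests.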
Question 4f
Find the TOP TEN (10) movies that are similar to movie ID = 50 (i.e., Star Wars). Constrain your movie similarity search using a threshold of 0.97 for the Cosine Similarity score and a threshold of 50 for the number of movie rating pairs (i.e., numberOfRatingPairs). This allows us to consider only “good” quality search results. (3 marks)
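The threshold-based search can be sketched in plain Python over the `((movie #1, movie #2), (score, numberOfRatingPairs))` results from Q4(e); the dict representation stands in for the cached RDD.

```python
def top_similar(similarities, target_movie, n=10, min_score=0.97, min_pairs=50):
    """similarities: {(m1, m2): (score, num_pairs)} with m1 < m2.
    Returns up to n (other_movie, score, num_pairs), best score first."""
    candidates = []
    for (m1, m2), (score, num_pairs) in similarities.items():
        if target_movie in (m1, m2) and score >= min_score and num_pairs >= min_pairs:
            other = m2 if m1 == target_movie else m1
            candidates.append((other, score, num_pairs))
    return sorted(candidates, key=lambda t: t[1], reverse=True)[:n]
```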
What we score:
73%
Our Writer’s Comment
This assignment tests students on their understanding of the concepts.
To score well in this assignment, focus on the following:
Question 1a:
- Significance of Apache Spark: Provide a comprehensive explanation of why Apache Spark is a powerful and widely-used framework for big data processing. Mention its distributed computing capabilities, in-memory processing, fault tolerance, and support for various data sources.
- Limitations of Apache Spark: Discuss five significant limitations that users may encounter when using Apache Spark for big data processing. These limitations could include high memory requirements, a steep learning curve, limited support for certain data formats, and challenges with debugging.
Question 1b:
- Spark job execution process: Explain the step-by-step process of how Spark executes jobs, including stages and tasks, and how it optimizes data processing.
- Directed Acyclic Graph (DAG): Elaborate on how the Directed Acyclic Graph works in Spark and its role in optimizing Spark’s execution plan.
Question 2a:
- Logic for PySpark built-in programs: Provide a detailed explanation of the logic behind the PySpark built-in programs shown in Figures Q2(a)(1) and Q2(a)(2). Use comments or pseudocode if necessary to clarify the steps.
Question 2b:
- PySpark Resilient Distributed Datasets (RDD) and DataFrames: Clearly explain the concepts of RDD and DataFrames in PySpark and highlight their key characteristics and differences. Use a well-organized table to present the comparison.
Question 3:
- Comprehensive PySpark program: Develop a PySpark program that performs the tasks mentioned in Question 3. Ensure that your code is well-structured, readable, and includes appropriate comments to explain the steps.
- Screenshots and results: Provide screenshots of the relevant output/results from each key step. Ensure the screenshots are clear and well-labeled.
Question 4:
- Well-structured PySpark program: Develop a PySpark program for Question 4, addressing each sub-question clearly and coherently.
- Screenshot of results: Include screenshots of the relevant output/results for each step. Clearly label the screenshots to match the corresponding sub-question.
General Tips:
- Clarity and coherence: Write your answers with clarity and logical coherence. Ensure that the information flows smoothly, and each point supports the overall argument.
- Code efficiency: Optimize your PySpark code for performance. Use caching and partitioning when appropriate to improve data processing speed.
- Proper referencing: Use proper APA citation style for any external sources or references you include in your answers.
- Proofreading: Carefully proofread your work to eliminate any grammatical or typographical errors.
- Meeting word count: Ensure that you meet the specified word count for each question while providing comprehensive answers. Avoid unnecessary repetitions or irrelevant information.
Why are we trusted by Singaporean part-time students?
- Assurance of Academic Success: We are confident in the quality of our work so much so that we offer a 200% money-back guarantee if our work is not of sure-pass quality. This showcases our unwavering commitment to delivering exceptional essays that consistently meet the highest academic standards.
- Well-Established and Trusted: With a decade of experience in the industry, we have built a strong and reputable presence. Our long-standing track record speaks to our ability to consistently deliver outstanding results. Students can have confidence in our extensive expertise and our proven ability to help bachelor and master’s students excel in their academic pursuits.
- Valuable Testimonials: We are proud to have received numerous testimonials from hundreds of satisfied students who have benefited from our services. These testimonials serve as a testament to the trust and satisfaction students place in our essay writing service. The positive feedback highlights the quality of our work and the positive impact we have had on our clients’ academic journeys.
- High Ratings on Google: Our commitment to excellence is reflected in the positive ratings we consistently receive from students. With an impressive 4.6 rating on Google, students can rely on the feedback shared by others who have had successful experiences with our service. These positive reviews further reinforce the trust students can place in the quality of our work and the positive outcomes we deliver.
- Strict Confidentiality: We prioritize the confidentiality and privacy of our clients. Students can trust that their personal information and engagement with our service will be treated with the utmost confidentiality and security. We never disclose any client details to third parties.
Sample Assignment
While we have scored well for this assignment, we do not provide any sample assignment for this. If you’d like us to work on this, you can try our model assignment writing service.