pa02 : Application of data structures to a movie dataset

num	ready?	description	assigned	due
pa02	true	Application of data structures to a movie dataset	Mon 02/28 09:00AM	Fri 03/11 10:00PM

This assignment must be done solo or in pairs

Introduction

Get the starter code: https://github.com/ucsb-cs24-w22/pa02-w22-data

Goal of this assignment

Practice using the data structures available in the C++ Standard Template Library (STL) to solve problems efficiently on a real-world dataset
Learn to perform runtime complexity analysis and express your analysis using Big-O notation
Learn to organize a project’s code structure on your own (not just filling in a template)

Instructions

In this assignment, we have given you datasets containing movie names and their ratings. Your goal is to use an appropriate data structure from the C++ STL to store this information, measure and analyze the running time of your algorithms.

An overview of your goals are provided below:

Use an appropriate C++ STL class to store movie names and ratings.
Your choice of data structure should be informed by how to efficiently find all movies that start with a specific prefix. For example, duplicating the entire dataset wouldn’t be an efficient approach.
Specifically, your program should take as input m prefix strings via the command line. For each one, it must find all movies that start with that prefix and print the one with the highest rating among them.
Analyze the Big-O running time for your algorithm.

Required Files

movies.cpp, movies.h // These files should contain any abstractions that you need to define. We recommend not implementing any data structure from scratch
main.cpp // This file should read in the movies from input files and produce the expected output.
Makefile // generate the executable runMovies

Starter Code

The starter code has the following files

main.cpp // Skeleton code to read in movies from input files
input_20_ordered.csv
input_20_random.csv
input_100_ordered.csv
input_100_random.csv
input_1000_ordered.csv
input_1000_random.csv

You are given 3 pairs of input files, each pair contains 20,100 and 1000 movies respectively. The only difference between files in a pair is the way movies are ordered. In one case movies appear alphabetically in the file while in the other case they appear in random order. The files are of format File1 and File2.

Problem overview and sample execution of the program

There are two main sub-problems that you need to solve for this assignment. It’s important that you keep your solution to these two sub-problems separate because later in part 3 of the assignment you need to analyze the running time complexity of your solution to part 2 (only)

In part 1, your task is to store all the movies (that are in the file) into a data structure of your choice. Note, this can only be one data structure (not multiple)
In part 2, you are given m movie prefix strings and you need to search the data structure to find matching movies for each of them as well as the highest rated movie. You may use other data structures to help you solve this problem.
Finally, in part 3 you will analyze the running time complexity of your solution for part 2.

Please make sure that you have a clear separation between your solution to part 1 and part 2.

A sample execution of the program is as follows:

./runMovies filename prefix_1 prefix_2 prefix_3 ... prefix_m

filename represents the input file containing movies and ratings (as described before)
prefix_i represents starting letters of a movie name (movie prefix). There may be upto m such movie prefixes provided to your program

If a prefix contains white spaces it must be placed within quotes. Here is an example call to the program passing in two prefix string, one of which contains white spaces

./runMovies input_1000_random.csv "the american" ab

Part 1: Print the movie names and ratings in alphabetical order of names

If no movie prefixes are provided as arguments to your program, then your program should print all the movies and their rating in alphabetical order of movie names. Below is an example run of the program for the given input file input_20_random.csv

./runMovies input_20_random.csv

This should produce the output

ace ventura: when nature calls, 6.1
balto, 7.1
casino, 7.8
cutthroat island, 5.7
dracula: dead and loving it, 5.7
father of the bride part ii, 5.7
four rooms, 6.5
goldeneye, 6.6
grumpier old men, 6.5
heat, 7.7
jumanji, 6.9
money train, 5.4
nixon, 7.1
sabrina, 6.2
sense and sensibility, 7.2
sudden death, 5.5
the american president, 6.5
tom and huck, 5.4
toy story, 7.7
waiting to exhale, 6.1

Part 2: Find the highest rated movie that starts with a prefix

If there is one or more movie prefix arguments provided to your program, then your program shouldn’t print all the movies in ascending order of name.

Instead, for each prefix, your program should

find all movies that start with that prefix
find the movie with the highest rating beginning with the prefix a

Note that you can make use of additional data structures from the C++ STL to help you efficiently solve this part of the assignment.

Your program should first print out all the movies that match each prefix in decreasing order of their rating. If two movies have the same rating then the movies should be printed in alphabetical order (for example the american president is printed before the confessional in the example run below). Furthermore, if there are multiple competing movies that have the highest rating, then the one that appears earlier in the alphabet should be selected as the best movie (see one of the later examples).

Below is a sample run of the program

./runMovies input_100_random.csv to th w

This should produce the following output:

toy story, 7.7
to die for, 6.7
tom and huck, 5.4

the usual suspects, 8.1
the city of lost children, 7.6
the postman, 7.6
the white balloon, 7.5
the journey of august king, 6.7
things to do in denver when you're dead, 6.7
the american president, 6.5
the confessional, 6.5
the crossing guard, 6.1
the indian in the cupboard, 5.9
the juror, 5.5
the big green, 5.2

wings of courage, 6.8
white squall, 6.3
waiting to exhale, 6.1
when night is falling, 5.9

Best movie with prefix to is: toy story with rating 7.7
Best movie with prefix th is: the usual suspects with rating 8.1
Best movie with prefix w is: wings of courage with rating 6.8

If a movie with the given prefix is not present, your program should print the message:

No movies found with prefix <prefix_value>

Here is an example run

./runMovies input_100_random.csv t xyz

should produce the output

the usual suspects, 8.1
toy story, 7.7
the city of lost children, 7.6
the postman, 7.6
the white balloon, 7.5
twelve monkeys, 7.4
the journey of august king, 6.7
things to do in denver when you're dead, 6.7
to die for, 6.7
the american president, 6.5
the confessional, 6.5
the crossing guard, 6.1
the indian in the cupboard, 5.9
the juror, 5.5
tom and huck, 5.4
two bits, 5.4
the big green, 5.2
two if by sea, 4.5

No movies found with prefix xyz

Best movie with prefix t is: the usual suspects with rating 8.1

Note in the following example that there are two movies that start with the prefix “be”. In this case the one that appears earlier in the alphabet must be selected as the best movie.

./runMovies input_1000_random.csv be

should produce the output:

before sunrise, 7.7
before the rain, 7.7
beauty and the beast, 7.5
belle de jour, 7.3
beautiful girls, 6.6
beyond rangoon, 6.4
beat the devil, 6.2
before and after, 5.8
beverly hills cop iii, 5.5
bed of roses, 5.1
being human, 5.0
beyond bedlam, 4.0

Best movie with prefix be is: before sunrise with rating 7.7

Part 3: Analyze the Big-O running time of your algorithm from part 2

Assume your program is provided with m prefix values and you have a dataset of n movies. Further assume that each prefix matches with at most k movies. Analyze the worst case (Big-O) time complexity of your algorithm for part 2 of the assignment (described above) in terms of the variables m, n, and k. The starting point of your analysis is that all the n movies have already been stored in some data structure and the m prefix values are available in a string array (as read in from the command line).

You must provide the overall time complexity analysis as a commented block right after your main function in the file main.cpp. Note that your answer will depend on your choice of data structures and your specific algorithm.

You will be graded for both the efficiency of your algorithms, the accuracy and clarity of your analysis. However, we are not giving you a target Big-O efficiency. Solutions that have similar Big-O running time will receive the same grade. So, there isn’t a single magic Big-O efficiency that you need to reach. A range of equally good solutions will be acceptable.

Requirements

For this programming assignment, you will have a lot of flexibility on your implementation (which just means we won’t be providing a code framework for you to fill in). However, there are a few requirements that you need to keep in mind as you think about your solution:

You must make appropriate use of data structures from the standard template library
Your code should be readable
Your classes should define clear interfaces and hide implementation details as much as possible.

Submission instructions

Submit your code on Gradescope. You must organize your program in the files: main.cpp, movies.cpp, movies.h Also, you must create a Makefile that compiles your program to an executable called runMovies. Make sure main.cpp has your runtime complexity analysis within a commented block right after the main function.