pa02 : BST on movie dataset

num ready? description assigned due
pa02 true BST on movie dataset Thu 11/14 12:00AM Thu 12/05 11:59PM

This assignment may be done solo or in pairs.

Introduction

  • Create a GitHub repo following the correct naming convention.
  • If you have a partner, add your partner as a collaborator to the repo. The partner should accept the invitations to complete this process
  • You and your partner should agree to use pair programming to complete the pa instead of dividing the work and coding separate parts of the pa individually
  • Minimal starter code is provided for this lab, but be sure to grab it from GitHub

Goal of this assignment

  • Practice implementing and using Binary Search Trees
  • Learn to perform runtime complexity analysis and express your analysis using Big-O notation
  • Learn to organize a project’s code structure on your own (not just filling in a template)
  • Learn to design classes using good OOP design principles discussed in class
  • Learn to test your code using unit test cases

Instructions

In this assignment, we have given you datasets containing movie names and their ratings. Your goal is to implement a Binary Search Tree to store this information, measure and analyze the running time of searching for movies.

An overview of your goals are provided below:

  • Implement a Binary Search Tree to store movie names and ratings. Your BST should be sorted by movie name.
  • Your BST should provide a public function to search for a particular movie, search for movies that start with a specific prefix, and find the movie with the highest rating starting with a particular prefix.
  • Measure the absolute running time of your search functions and how it scales with input size.
  • Analyze the Big-O running time for different types of search.
  • Write a report to discuss the trends in running times using your mathematical analysis to explain the performance data that you collected.

Required Files

  • movies.cpp, movies.h // These files should contain your implementation of the binary search tree to store movies information
  • utility.cpp, utility.h // These files should contain any other classes that you need for your implementation.
  • tests.cpp, tests.h // These files should contain test code for all the classes and methods for your implementation.
  • main.cpp // This file should read in the movies from input files and create BST along with functionality to time the searches / other functions.
  • Makefile // generate two executables- the first should be called runMovies and the second should be called tests. The command make tests should RUN (not just compile) all the test code to unit test your classes and methods

Starter Code

As in previous assignments the starter code is available in the repo: https://github.com/ucsb-cs24-f19-nichols/cs24-f19-starter-code The starter code has the following files

  • main.cpp // Skeleton code to read in movies from input files
  • input_20_ordered.csv
  • input_20_random.csv
  • input_100_ordered.csv
  • input_100_random.csv
  • input_1000_ordered.csv
  • input_1000_random.csv

You are given 3 pairs of input files, each pair contains 20,100 and 1000 movies respectively. The only difference between files in a pair is the way movies are ordered. In one case movies appear alphabetically in the file while in the other case they appear in random order. The files are of format File1 and File2.

Approach

As discussed in the lecture, the order of inserting values into a BST affects its structure and hence the running time. Your goal is to investigate how the running time of search relates to the depth of a movie in the BST. The depth of a movie in the BST is the number of nodes on the path from the root to the node containing the movie. For example, the root will be at depth 0.

In this assignment, we ask that you implement your own BST class. One of the variations to your BST compared to an earlier lab is that each node in your BST should store information such as movie_name, rating, depth. Your main program should expect three arguments from the user. A sample execution of the program will be as follows:

./runMovies arg1 arg2 arg3

arg1 is a boolean flag, which can be true or false arg2 represents the input file containing movies and ratings (as described before) arg3 is either the starting letters of a movie name (movie prefix) or a number

You should use the value of the flag(arg1) to interpret the third argument and accordingly do one of two things:

Case 1: Find the highest rated movie that starts with a given prefix

If arg1 is true, you should expect the third argument to be a movie prefix. In this case,

  • In your main program insert the movies in the same order as they appear in the input file into a BST, using the public functions of your class.
  • Print the pre-order traversal of your tree to stdout, for each node visited, output the movie name, rating, and depth.
  • Find the movie with the highest rating beginning with the prefix was passed in as the third argument, using the public functions of your BST.

Below is a sample run of the program with arg1 as true

./runMovies true input_20_random.csv the

This should produce the following output:

toy story, 7.7, 0
jumanji, 6.9, 1
grumpier old men, 6.5, 2
father of the bride part ii, 5.7, 3
dracula: dead and loving it, 5.7, 4
balto, 7.1, 5
ace ventura: when nature calls, 6.1, 6
cutthroat island, 5.7, 6
casino, 7.8, 7
goldeneye, 6.6, 4
four rooms, 6.5, 5
heat, 7.7, 3
sabrina, 6.2, 2
nixon, 7.1, 3
money train, 5.4, 4
tom and huck, 5.4, 3
sudden death, 5.5, 4
sense and sensibility, 7.2, 5
the american president, 6.5, 5
waiting to exhale, 6.1, 1

Best movie is the american president with rating 6.5

Case 2: Analyze the time to search for keys in a BST and write a report

If arg1 is false, your goal is to collect two types of timing data to answer the following questions: (1) What are the statistics of the average time to search in a BST? Justify your answer (2) How does the number of nodes visited (proportional to the primitive operations) when inserting a new key in the BST vary with the number of nodes in the bst? Justify your answer.

Note that you need to collect two different types of data to answer these questions. The first is an absolute time, while the second is a count.

To answer (1), calculate the average time to search for any key. You can calculate this as the total time it takes to search for all the keys in your BST divided by the number of nodes in the BST. To get a more reliable value, don’t time each search. If your BST has N keys, record the time it takes to perform all N * W searches (search for all N nodes in a BST W times) and then divide that value by W. You should expect this average time to have some random variation, which is why we ask that you compute it over multiple (W) runs and report the min, max and average statistics. The value of W is the third argument to your program. We recommend that you use a W value of 50 in all cases, but your program should work for any W.

To answer (2), implement your BST insert function in a way that tracks the number of nodes visited (N_visited) during the insert operation for each key and the number of nodes in the BST before that key was inserted (N). For example, inserting a key into an empty tree requires visiting 0 nodes. In this case, N_visited and N are both 0.

Collect two sets of data, one for each of these input files: input_1000_ordered and input_1000_random. For each input file record (N_visited) for and the corresponding N, for all the 1000 insertions. Store each data set in an output file in two columns: N and N_visited. Your dataset should have a 1000 entries. Plot your data as a scatter plot of the number of nodes visited (N_visited) vs. number of nodes already in the BST before a specific key was inserted(N), for N values between 0 and 999. Here is a sample sheet that helps you generate a plot for one data set.

Note: (1) Your data set should be much larger than the sample provided. (2) In general, N_visited should be less than or equal to N, for each key insert. Since these values are completed deterministic, you don’t need to calculate averages like in part 1.

Finally write a report to present the data and reason about the data collected for (1) and (2). Your report should answer the following questions for each part

Part 1: About the statistics of the average time to search in a BST:

  1. Explain the trends that you observe in the above table.
  2. Are the trends as you expect? Why or why not?
  3. (Optional) Add any other related plots that you think are relevant to these questions

Part 2: About the number of nodes visited when inserting a new key in the BST for random and ordered datasets with 1000 entries:

  1. How does the number of operations for insert vary with the number of the current nodes present in the tree when (a) nodes are inserted into the BST in alphabetical order (b) nodes are inserted in random order?
  2. Include the code for your insert function, do a Big O analysis and use it to explain the trends you observed?
  3. Are the trends you observed as you expect? Why or Why not?

Here is a sample report format expected for the report.

Submit your assignment as a .pdf along with your code on gradescope. We will manually grade this part of the assignment.

Requirements

For this programming assignment, you will have a lot of flexibility on your implementation (which just means we won’t be providing a code framework for you to fill in). However, there are a few requirements that you need to keep in mind as you think about your solution:

  • You must use a binary search tree you implemented yourself to solve the problem, not another data structure or a class in the standard library.
  • Your binary search tree must implement a constructor, a destructor and other methods as needed
  • You must time and report your search/lookup function calls
  • Your code should be readable
  • Your classes should define clear interfaces and hide implementation details as much as possible.
  • Your program must properly free all memory it allocates, including your binary tree nodes and any dynamically allocated data stored inside them. We will also check this with valgrind when you turn in your code to Gradescope.

Submission instructions

Submit your code on Gradescope. You must organize your program in the files: main.cpp, movies.cpp, tests.cpp, movies.h,tests.h, utility.cpp and utility.h Also, you must create a Makefile that compiles your program to an executable called runMovies and tests. Also submit your report as a pdf file.