Front page of the student’s submission (the following are compulsory):

Module Code:

Assignment report Title:

Date (when the work completed):

Actual hrs spent for the assignment:


CS2PP22 Programming in Python for Data Science

Coursework 1: Python Basics, Data Preprocessing, Exploratory Data Analysis, and Python Classes

Instructions:

Items to be submitted:

  1. A modified version of this Jupyter notebook file (.ipynb)

    • This is to be submitted already fully executed in a serial fashion, from top to bottom.
    • Try Kernel --> Restart & Run All to verify that this works as intended.
    • Add a cell at the top of the notebook to note the installation method and version of any additional Python packages that are not included in the module's Anaconda distribution or that were not instructed to be installed during the module.
  2. A copy of this notebook (with the CS2PP22_CW1 file name as in Item 1.) but in .html format, which displays all content independently. This can be included in the overall assessment archive, but an unarchived copy should be submitted to Blackboard alongside the archive.

  3. airbnb_modified.csv

  4. Optional: Any functions and classes created for this Task may be written in a separate .py module file and imported to this space. These should be stored in the submitted archive at the same directory level as the modified notebook (Item 1.).

Marking scheme:

Marks  Section
5      Organisation: Preparation and submission of all required files
10     1: Network Representations
5      2.0: Analysis Preparation
15     2.1: Data Cleaning
5      2.2: Creating New Columns
15     2.3: Exploratory Data Analysis
5      3.1: Data Preparation
30     3.2: Class Definition
5      3.3: Class Execution
5      3.4: Class Comparisons

Task 1

[10 marks]

G = (V,E)

CompareNetworks.png

V = a, b, c, d, e

Edges.png


Data Reference:

The file data/influencers.tsv contains a representation of a social network dataset where enumerated influencer personalities have links between them if they have been frequently associated with one another.

To read in the tab separated network file, we need to read each line in the file. To do this we just treat the file object as an iterator.

This code reads the file and places the data in an edge list representation:

# Create an empty edge set
edges = set()

with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)
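
As a variation on the loop above, the same edge set could be built with the standard-library csv module, which splits on tabs and strips the line ending for us. This is only a sketch: the in-memory `sample` text is a hypothetical stand-in for data/influencers.tsv, and the blank line in it is there to show how empty rows might be skipped.

```python
import csv
import io

# Hypothetical in-memory stand-in for data/influencers.tsv (tab-separated pairs)
sample = io.StringIO("55\t57\n55\t25\n\n57\t41\n")

edges = set()
for row in csv.reader(sample, delimiter="\t"):
    if not row:  # csv.reader yields an empty list for a blank line
        continue
    a, b = row
    edges.add((int(a), int(b)))

print(sorted(edges))
```

With a real file, `sample` would be replaced by the object returned from `open('data/influencers.tsv', newline='')`.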
        
        

Instructions:

  1. Begin with the edges variable resulting from the above code.

  2. [1 mark] Use a for loop to populate a dict that will contain the neighbour list network representation.

  3. [1 mark] Verify that your solution matches that for nodes 55, 57, and 25 (shown below).

  4. [1 mark] With your entire neighbour list code in a single Jupyter notebook cell, use the timeit Magic command to report the performance of your code, executing 27 runs of 1500 loops each.

    • There should be no network data output from this cell.
  5. [1 mark] Discuss your solution, describing the Python constructs/tools you have used, and the design of your implementation.

  6. [3 marks] Design, implement, and verify the results of a second, less efficient implementation and record its equivalent performance metrics, as done above in the more efficient case.

  7. [1 mark] Discuss your second solution, describing the Python constructs/tools you have used and the design of your less efficient implementation.

  8. [2 marks] Explain why the efficiency differs between your two implementations.
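
The timing and comparison steps in the list above could be sketched roughly as follows. This is one possible design under stated assumptions, not the required solution: the toy `edges` set stands in for the real data, and the names `neighbours_fast` and `neighbours_slow` are hypothetical. The slower version stores neighbours in lists, so every duplicate check scans the list, whereas the faster version relies on hash-based sets.

```python
import timeit

# Hypothetical toy edge set standing in for the one read from data/influencers.tsv
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}


def neighbours_fast(edges):
    """Dict of sets: membership tests and insertions take O(1) time on average."""
    result = {}
    for a, b in edges:
        result.setdefault(a, set()).add(b)
        result.setdefault(b, set()).add(a)
    return result


def neighbours_slow(edges):
    """Dict of lists: each 'not in' check scans the whole list linearly."""
    result = {}
    for a, b in edges:
        for node, other in ((a, b), (b, a)):
            if node not in result:
                result[node] = []
            if other not in result[node]:
                result[node].append(other)
    # Convert to sets at the end so both versions return the same structure
    return {node: set(others) for node, others in result.items()}


# In a notebook cell the equivalent magic would be: %%timeit -r 27 -n 1500
fast_times = timeit.repeat(lambda: neighbours_fast(edges), repeat=27, number=1500)
slow_times = timeit.repeat(lambda: neighbours_slow(edges), repeat=27, number=1500)

# Each entry is the total for 1500 loops, so divide to get a per-loop figure
print(f"fast: {min(fast_times) / 1500:.3e} s per loop")
print(f"slow: {min(slow_times) / 1500:.3e} s per loop")
```

Neither function prints the network, so the timed cell produces no network data output. On an edge set this small the two timings may be close; the gap would be expected to widen as nodes gain more neighbours.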


There will be many nodes in the final dataset. The result should have the form:

{55: {11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65},
 57: {41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67},
 25: {11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75}...
   }.
 

This shows that influencer 55 is frequently associated with influencers 11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, and 65.
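
Because an association is mutual, each link should appear in both nodes' neighbour sets: 57 is listed under 55, and 55 is listed under 57. A minimal sketch of that check, assuming a hypothetical four-edge subset taken from the expected output above and using collections.defaultdict as one possible way to create the empty sets:

```python
from collections import defaultdict

# Hypothetical subset of edges taken from the expected output above
edges = {(55, 57), (25, 55), (41, 57), (25, 41)}

neighbour_list = defaultdict(set)  # a missing key starts as an empty set
for a, b in edges:
    neighbour_list[a].add(b)
    neighbour_list[b].add(a)

# Freeze into a plain dict so later lookups cannot create new keys by accident
neighbour_list = dict(neighbour_list)

# In an undirected network every link must be visible from both ends
symmetric = all(
    node in neighbour_list[other]
    for node, others in neighbour_list.items()
    for other in others
)
print(neighbour_list)
print("symmetric:", symmetric)
```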

# Begin with edges
edges = set()


with open('data/influencers.tsv', 'r') as file: 
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)

def edge_to_neighbour(edges):  # build the neighbour-list dictionary from the edge set
    neighbour_list = {}
    for edge in edges:
        node1, node2 = edge  # unpack the two endpoints of the edge
        if node1 not in neighbour_list:  # create an empty neighbour set the first time a node is seen
            neighbour_list[node1] = set()
        if node2 not in neighbour_list:
            neighbour_list[node2] = set()
        neighbour_list[node1].add(node2)  # record the link in both directions
        neighbour_list[node2].add(node1)

    return neighbour_list  # dict mapping each node to the set of its neighbours


neighbour_list = edge_to_neighbour(edges)  # neighbour-list representation of the network
    

for node, neighbours in neighbour_list.items():
    neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {neighbours}")
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
# Verifying my more efficient solution
expected_neighbours = {
55: {11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65}, # Expected neighbours for node 55
57: {41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67}, # Expected neighbours for node 57
25: {11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75} # Expected neighbours for node 25
   }

# Check if the generated neighbour lists match the expected neighbours for each node
for node, expected_neighbour_set in expected_neighbours.items():
    if node not in neighbour_list:
        print(f"Node {node} not found in the neighbour list.")
    elif neighbour_list[node] == expected_neighbour_set:
        print(f"Solution for node {node} Verified")
    else:
        print(f"Solution for node {node} is not Verified") 
Solution for node 55 Verified
Solution for node 57 Verified
Solution for node 25 Verified
# Testing the performance of my more efficient solution
import timeit

# Read edges from file and store them in a set
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)

# Define the function to convert edges to neighbor list
def edge_to_neighbour(edges):
    neighbour_list = {}
    for edge in edges:
        node1, node2 = edge
        if node1 not in neighbour_list:
            neighbour_list[node1] = set()
        if node2 not in neighbour_list:
            neighbour_list[node2] = set()
        neighbour_list[node1].add(node2)
        neighbour_list[node2].add(node1)
        # Assuming undirected graph, add both directions
    return neighbour_list

# Time the conversion of the edge set to a neighbour list (19 repeats of 2000 calls each)
time_taken = timeit.repeat(lambda: edge_to_neighbour(edges), repeat=19, number=2000)

# Report the mean total time in seconds across the repeats
print("Average Performance:", sum(time_taken) / len(time_taken))
Average Performance: 0.6957541842104643

Discussion of more efficient solution characteristics:

# Verifying my less efficient solution
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)

neighbour_list = {}
for edge in edges:
    # Introduce unnecessary conversion to a list
    edge_list = list(edge)
    node1, node2 = edge_list[0], edge_list[1]
    if node1 not in neighbour_list:
        neighbour_list[node1] = set()
    if node2 not in neighbour_list:
        neighbour_list[node2] = set()
    neighbour_list[node1].add(node2)
    neighbour_list[node2].add(node1)

for node, neighbours in neighbour_list.items():
    # Introduce redundant sorting of neighbors
    sorted_neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {sorted_neighbours}")
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
# Testing the performance of my less efficient solution
import timeit


# Read edges from file and store them in a set
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)

neighbour_list = {}
for edge in edges:
    # Introduce unnecessary conversion to a list
    edge_list = list(edge)
    node1, node2 = edge_list[0], edge_list[1]
    if node1 not in neighbour_list:
        neighbour_list[node1] = set()
    if node2 not in neighbour_list:
        neighbour_list[node2] = set()
    neighbour_list[node1].add(node2)
    neighbour_list[node2].add(node1)

for node, neighbours in neighbour_list.items():
    # Introduce redundant sorting of neighbors
    sorted_neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {sorted_neighbours}")

# Time repeated calls to edge_to_neighbour(edges), the function defined in the earlier cell (19 repeats of 2000 calls each)
time_taken = timeit.repeat(lambda: edge_to_neighbour(edges), repeat=19, number=2000)

# Report the mean total time in seconds across the repeats
print("Performance:", sum(time_taken) / len(time_taken))
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
Performance: 0.6348627842104945

Discussion of less efficient solution characteristics:

Discussion of differing efficiency:


Task 2


Data Reference:

Column                          Description
ID                              id number that identifies the property
name                            Property name
Host ID                         id number that identifies the host
Host Name                       Host name
neighbourhood_group             The main regions of the city
neighbourhood                   The neighbourhoods
latitude                        Property latitude
longitude                       Property longitude
Room Type                       Type of the room
price                           The price for one night ($)
minimum_nights                  Minimum number of nights to book the place
number_of_reviews               Number of reviews received
last_review                     Date of the last review
reviews_per_month               Number of reviews per month
calculated_host_listings_count  Number of properties available on Airbnb owned by the host
availability_365                Number of days of availability within 365 days
city                            Property city

2.0. Analysis Preparation

[5 Marks]


2.0.1. Import [2 marks]

Import the libraries that will be used in your solutions for Task 2.

import pandas as pd
import numpy as np

2.0.2. Load data [1 mark]

Locate the data file airbnb.csv within the zipped file you have downloaded from Blackboard (under ./data/) and read the data from file to a pandas DataFrame variable called air.

df = pd.read_csv('data/airbnb.csv') # read data from csv into DataFrame

print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_10644\3209262031.py:1: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv') # read data from csv into DataFrame
              ID                                               name  \
0          38585    Charming Victorian home - twin beds + breakfast   
1          80905                                   French Chic Loft   
2         108061  Walk to stores/parks/downtown. Fenced yard/Pet...   
3         155305                 Cottage! BonPaul + Sharky's Hostel   
4         160594                                Historic Grove Park   
...          ...                                                ...   
226025  45506143                          DC Hidden In Plain "Site"   
226026  45511428  DC 3 BR w/ screen porch 3 blck to metro w/ par...   
226027  45514685  Charming Penthouse Apt w/ Rooftop Terrace in L...   
226028  45516412                Adams Morgan/Nat'l Zoo 1 BR Apt #32   
226029  45517735    Beautiful large one-bedroom w/ washer and dryer   

          Host ID  Host Name neighbourhood_group  \
0          165529    Evelyne                 NaN   
1          427027    Celeste                 NaN   
2          320564       Lisa                 NaN   
3          746673    BonPaul                 NaN   
4          769252  Elizabeth                 NaN   
...           ...        ...                 ...   
226025   25973146      Marci                 NaN   
226026  231133074     Thomas                 NaN   
226027   33758935     Bassem                 NaN   
226028   23193071    Michael                 NaN   
226029   17789858       Adam                 NaN   

                                            neighbourhood   latitude  \
0                                                   28804  35.651460   
1                                                   28801  35.597790   
2                                                   28801  35.606700   
3                                                   28806  35.578640   
4                                                   28801  35.614420   
...                                                   ...        ...   
226025  Downtown, Chinatown, Penn Quarters, Mount Vern...  38.903880   
226026                      Brookland, Brentwood, Langdon  38.920820   
226027                                 Shaw, Logan Circle  38.911170   
226028     Kalorama Heights, Adams Morgan, Lanier Heights  38.926630   
226029  Edgewood, Bloomingdale, Truxton Circle, Eckington  38.911569   

        longitude        Room Type  price  minimum_nights  number_of_reviews  \
0      -82.627920     Private room   60.0               1                138   
1      -82.555400  Entire home/apt  470.0               1                114   
2      -82.555630  Entire home/apt    NaN              30                 89   
3      -82.595780  Entire home/apt   90.0               1                267   
4      -82.541270     Private room  125.0              30                 58   
...           ...              ...    ...             ...                ...   
226025 -77.029730  Entire home/apt  104.0               1                  0   
226026 -76.990980  Entire home/apt  151.0               2                  0   
226027 -77.033540  Entire home/apt  240.0               2                  0   
226028 -77.044360  Entire home/apt   60.0              21                  0   
226029 -77.009431  Entire home/apt   79.0               7                  0   

       last_review  reviews_per_month  calculated_host_listings_count  \
0       16 02 2020               1.14                               1   
1       07 09 2020               1.03                              11   
2       30 11 2019               0.81                               2   
3       22 09 2020               2.39                               5   
4       19 10 2015               0.52                               1   
...            ...                ...                             ...   
226025         NaN                NaN                               2   
226026         NaN                NaN                               1   
226027         NaN                NaN                               1   
226028         NaN                NaN                               5   
226029         NaN                NaN                               2   

        availability_365             city  
0                      0        Asheville  
1                    288        Asheville  
2                    298        Asheville  
3                      0        Asheville  
4                      0        Asheville  
...                  ...              ...  
226025                99  Washington D.C.  
226026               300  Washington D.C.  
226027               173  Washington D.C.  
226028               362  Washington D.C.  
226029                62  Washington D.C.  

[226030 rows x 17 columns]
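
The DtypeWarning above arises when pandas infers different types for the same column in different chunks of the file. One way it might be avoided is to state the column type explicitly, or to pass low_memory=False as the warning itself suggests. The sketch below assumes a tiny hypothetical CSV with the same column labels; the group value "Hypothetical Group" is invented purely for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of airbnb.csv: 'neighbourhood' mixes postcodes and names,
# and 'neighbourhood_group' is empty in some rows
csv_text = (
    "name,neighbourhood_group,neighbourhood,price\n"
    "French Chic Loft,,28801,470.0\n"
    'Charming Penthouse,Hypothetical Group,"Shaw, Logan Circle",240.0\n'
)

# dtype pins the mixed columns to strings; low_memory=False parses the file in one pass
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"neighbourhood_group": str, "neighbourhood": str},
    low_memory=False,
)
print(df.dtypes)
print(df)
```

Reading postcodes as strings also keeps a value such as 28801 from being treated as a number.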

2.0.3. Display data [1 mark]

Display the first 5 rows of the DataFrame.

print(df.head())

2.0.4. Display data information [1 mark]

In one or more code cells, display the following information of the DataFrame:

Following the display of this information, use a Markdown cell to write:

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

print("Column Names")
print(df.columns)

print ("\nData Types")
print(df.dtypes)
Number of rows: 226030
Number of columns: 15
Column Names
Index(['name', 'Host Name', 'neighbourhood_group', 'neighbourhood', 'latitude',
       'longitude', 'Room Type', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365', 'city'],
      dtype='object')

Data Types
name                               object
Host Name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
Room Type                          object
price                             float64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
city                               object
dtype: object
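
The same kind of summary could also be gathered more compactly. The exact list of required items is not reproduced above, so the following is only a sketch of calls that are commonly used for this purpose, run on a hypothetical three-row frame built from values shown earlier in the notebook.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Airbnb data
df = pd.DataFrame({
    "name": ["French Chic Loft", "Historic Grove Park", "Walk to stores"],
    "price": [470.0, 125.0, None],
    "minimum_nights": [1, 30, 30],
})

n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

df.info()  # column names, non-null counts and dtypes in one summary

missing_per_column = df.isna().sum()  # number of missing values in each column
print(missing_per_column)
```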

Markdown...


2.1. Data Cleaning

[15 Marks]


2.1.1. Remove columns [2 marks]

Drop the following features from the DataFrame:

Then, store and display the resulting DataFrame and write a statement, explaining why one might choose to exclude these features.

import pandas as pd

df_modified = df.drop(columns=['ID', 'Host ID', 'Neighbourhood Group'])

df_modified



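# One might exclude these features because ID and Host ID are identifiers that describe no property of a listing, and neighbourhood_group is NaN in every row displayed above
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[16], line 3
      1 import pandas as pd
----> 3 df_modified = df.drop(columns=['ID', 'Host ID', 'Neighbourhood Group'])
      5 df_modified
      9 # One might exclude these features because ID and Host ID are identifiers that describe no property of a listing, and neighbourhood_group is NaN in every row displayed above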

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\frame.py:5581, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5433 def drop(
   5434     self,
   5435     labels: IndexLabel | None = None,
   (...)
   5442     errors: IgnoreRaise = "raise",
   5443 ) -> DataFrame | None:
   5444     """
   5445     Drop specified labels from rows or columns.
   5446 
   (...)
   5579             weight  1.0     0.8
   5580     """
-> 5581     return super().drop(
   5582         labels=labels,
   5583         axis=axis,
   5584         index=index,
   5585         columns=columns,
   5586         level=level,
   5587         inplace=inplace,
   5588         errors=errors,
   5589     )

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\generic.py:4788, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4786 for axis, labels in axes.items():
   4787     if labels is not None:
-> 4788         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4790 if inplace:
   4791     self._update_inplace(obj)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\generic.py:4830, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
   4828         new_axis = axis.drop(labels, level=level, errors=errors)
   4829     else:
-> 4830         new_axis = axis.drop(labels, errors=errors)
   4831     indexer = axis.get_indexer(new_axis)
   4833 # Case for non-unique axis
   4834 else:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\indexes\base.py:7070, in Index.drop(self, labels, errors)
   7068 if mask.any():
   7069     if errors != "ignore":
-> 7070         raise KeyError(f"{labels[mask].tolist()} not found in axis")
   7071     indexer = indexer[~mask]
   7072 return self.delete(indexer)

KeyError: "['ID', 'Host ID', 'Neighbourhood Group'] not found in axis"
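
The KeyError above is raised because drop() only accepts labels that exist exactly as written. 'Neighbourhood Group' does not match the actual label neighbourhood_group, and labels removed by an earlier run of the cell are no longer present. A minimal sketch of a more forgiving version, assuming neighbourhood_group is the intended third column and using a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with the same column labels as the raw airbnb.csv
df = pd.DataFrame({
    "ID": [38585, 80905],
    "Host ID": [165529, 427027],
    "neighbourhood_group": [None, None],
    "price": [60.0, 470.0],
})

# Labels must match exactly, including case and underscores
to_drop = ["ID", "Host ID", "neighbourhood_group"]
missing = [col for col in to_drop if col not in df.columns]
print("labels not present:", missing)

# errors="ignore" skips labels that are already gone, so the cell can be re-run safely
df_modified = df.drop(columns=to_drop, errors="ignore")
print(df_modified.columns.tolist())
```

Checking `missing` first makes a misspelt label visible, which errors="ignore" on its own would silently hide.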

2.1.2. Rename columns [2 marks]

Rename the following columns:

Column                          New name
Host Name                       host_name
Room Type                       room_type
reviews_per_month               rpm
calculated_host_listings_count  listing_count
availability_365                availability

Then, display the resulting DataFrame and write a statement, discussing why one might change the names in this way.

import pandas as pd

# Load data from CSV file
df = pd.read_csv('data/airbnb.csv')

# Check the column names in the DataFrame
print("Original column names:")
print(df.columns)

# Perform the renaming of columns
df = df.rename(columns={'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
                       })

# Verify the renamed column names
print("\nRenamed column names:")
print(df.columns)


"""It may be considered useful to change the names in this manner because it ensures that the column headings are all recorded in the same manner, where previously some were recorded as
capitalised headings and some were named as traditional file names e.g. city mpg. This disparity housed inconsistency within the format of the table, meaning that not every column's purpose was clear, especially if headed by an abbreviation e.g. MSRP. By changing the headings to the proposed names, consistency and uniformity is reintroduced to the table."""
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\272338296.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
Original column names:
Index(['ID', 'name', 'Host ID', 'Host Name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'Room Type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'city'],
      dtype='object')

Renamed column names:
Index(['ID', 'name', 'Host ID', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review', 'RPM',
       'listing_count', 'availability', 'city'],
      dtype='object')

2.1.3. Remove duplicated rows [3 marks]

Use an f-string to print() the number of irrelevant duplicate rows (count repeats only, not both originals and repeats).

Drop the duplicated rows, retain the resulting DataFrame, and show the resulting number of rows in the new DataFrame.

import pandas as pd

# reload dataset
df = pd.read_csv('data/airbnb.csv')

# renaming columns code
new_column_names = {
    'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
}


df.rename(columns=new_column_names, inplace=True)

# counting duplicate rows
num_duplicate_rows = df.duplicated().sum()
print(f"Number of irrelevant duplicate rows: {num_duplicate_rows}")

C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\1593544922.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
Number of irrelevant duplicate rows: 0
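
The task also asks for the duplicates to be dropped and the resulting row count shown, which the cell above does not do. A minimal sketch of both steps on toy data (the toy frame is an assumption for illustration only):

```python
import pandas as pd

# Toy data: the second and third rows repeat the first
toy = pd.DataFrame({
    "name": ["Loft", "Loft", "Loft", "Cabin"],
    "price": [100, 100, 100, 80],
})

# duplicated() flags repeats only; the first occurrence stays False
num_repeats = toy.duplicated().sum()
print(f"Number of duplicate rows: {num_repeats}")

# drop_duplicates() returns a new frame that keeps each first occurrence
deduped = toy.drop_duplicates()
print(f"Rows after dropping duplicates: {len(deduped)}")
```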

2.1.4. Null values [2 marks]

Report the number of null values remaining in each column.

import pandas as pd

# reload dataset
df = pd.read_csv('data/airbnb.csv')

# renaming columns code
new_column_names = {
    'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
}

df.rename(columns=new_column_names, inplace=True)

num_null_values = df.isnull().sum()
print(f"Number of null values: {num_null_values}")
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3555504588.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
Number of null values: ID                          0
name                       28
Host ID                     0
host_name                  33
neighbourhood_group    115845
neighbourhood               0
latitude                    0
longitude                   0
room_type                   0
price                       9
minimum_nights              0
number_of_reviews           0
last_review             48602
RPM                     48602
listing_count               0
availability                0
city                        0
dtype: int64

2.1.5. Handling missing values [3 marks]

import pandas as pd
import numpy as np

# Load data from the CSV file
df = pd.read_csv('data/airbnb.csv')

# Calculate the mean of the 'price' column using a numpy routine
price_mean = np.nanmean(df['price'])

# Replace missing values in the 'price' column with the calculated mean
df['price'].fillna(price_mean, inplace=True)

# Drop all other rows that still contain missing values
df.dropna(inplace=True)

# Display the final number of rows left in the resulting DataFrame
num_rows_left = len(df)
print(f"Final number of rows left in the resulting DataFrame: {num_rows_left}")

# Optionally, display the modified DataFrame
print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\987998301.py:5: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\987998301.py:11: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['price'].fillna(price_mean, inplace=True)
Final number of rows left in the resulting DataFrame: 85144
              ID                                               name  \
48150       5065                                           MAUKA BB   
48151       5269          Upcountry Hospitality in the 'Auwai Suite   
48152       5387                Hale Koa Studio & 1 Bedroom Units!!   
48153       5389                                      Keauhou Villa   
48154       5390                              STAY AT PRINCE KUHIO!   
...          ...                                                ...   
212157  43531859              Chic Seattle Apartment near Greenlake   
212159  43554835  SL6 - Private·Modern·Quality·Convenient·N ...   
212160  43554849  SL4 - Modern,Private,Quality,Convenient,N Seattle   
212161  43589616     Perfect spot for group events or family stays!   
212164  43603849               Rooftop Tent in the Heart of Seattle   

          Host ID  Host Name  neighbourhood_group    neighbourhood  latitude  \
48150        7257      Wayne               Hawaii          Hamakua  20.04095   
48151        7620  Lea & Pat               Hawaii     South Kohala  20.02740   
48152        7878     Edward               Hawaii       South Kona  19.43119   
48153        7878     Edward               Hawaii       North Kona  19.56413   
48154        7887       Todd                Kauai      Koloa-Poipu  21.88305   
...           ...        ...                  ...              ...       ...   
212157    6601753     Isabel  Other neighborhoods        Roosevelt  47.68457   
212159  287025852         Li            Lake City    Olympic Hills  47.73462   
212160  287025852         Li            Lake City    Olympic Hills  47.73357   
212161  347974040     Marcus             Delridge    Highland Park  47.51284   
212164   51816582        Fei           Queen Anne  East Queen Anne  47.63913   

        longitude        Room Type  price  minimum_nights  number_of_reviews  \
48150  -155.43251  Entire home/apt   85.0               2                 42   
48151  -155.70200  Entire home/apt  124.0              30                 10   
48152  -155.88079  Entire home/apt   85.0               5                168   
48153  -155.96347  Entire home/apt  239.0               6                 20   
48154  -159.47372  Entire home/apt   92.0               3                143   
...           ...              ...    ...             ...                ...   
212157 -122.31550  Entire home/apt  100.0               3                  1   
212159 -122.29509     Private room   79.0               1                  2   
212160 -122.29651  Entire home/apt   79.0               1                  1   
212161 -122.33587  Entire home/apt  200.0               1                  2   
212164 -122.34293     Private room   35.0               1                  2   

       last_review  reviews_per_month  calculated_host_listings_count  \
48150   22 03 2020               0.45                               2   
48151   01 03 2020               0.09                               5   
48152   18 03 2020               1.30                               3   
48153   22 03 2020               0.24                               3   
48154   10 08 2020               1.03                               1   
...            ...                ...                             ...   
212157  06 06 2020               1.00                               1   
212159  10 06 2020               2.00                               3   
212160  13 06 2020               1.00                               3   
212161  14 06 2020               2.00                               1   
212164  14 06 2020               2.00                               2   

        availability_365     city  
48150                365   Hawaii  
48151                261   Hawaii  
48152                242   Hawaii  
48153                287   Hawaii  
48154                116   Hawaii  
...                  ...      ...  
212157                99  Seattle  
212159               337  Seattle  
212160               351  Seattle  
212161               330  Seattle  
212164                 0  Seattle  

[85144 rows x 17 columns]
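
The FutureWarning in the output above comes from calling `fillna(..., inplace=True)` on `df['price']`. The sketch below shows the assignment form that the warning message itself suggests, using a toy frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing price and one missing name
toy = pd.DataFrame({
    "price": [100.0, np.nan, 200.0],
    "name": ["a", "b", None],
})

# Mean of the non-missing prices, computed with a numpy routine
price_mean = np.nanmean(toy["price"])

# Assign the filled column back instead of using inplace=True on toy['price']
toy["price"] = toy["price"].fillna(price_mean)

# Drop rows that still hold a missing value in any column
toy = toy.dropna()
print(len(toy))
```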

2.1.6. Reasoning [3 marks]

In the previous step, we eliminated entries with missing data from the dataset. If this dataset were to be used to train a machine learning model to predict price for one night, what is a potential drawback of this approach? Describe an alternative approach and any related caveats of which we should be aware.

Dropping every row that contains a missing value shrinks the dataset considerably: only 85144 of the 226030 rows survive. Missing values are not spread evenly across the data (for example, neighbourhood_group is null in 115845 rows), so the rows that remain may over-represent some cities or listing types. This can bias the model and limit how fairly it can be trained, because it will see very different amounts of data in each category.

An alternative to removing rows is regression imputation: fit a regression model on the complete rows and use it to predict each missing value from the other variables. The caveats are that this is only appropriate when there is a strong relationship between the variables, and that imputed values fall exactly on the fitted relationship, so they can understate the real variability in the data.
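
As a rough illustration of regression imputation, the sketch below fits a straight line with `numpy.polyfit` on the complete rows and fills only the missing prices with its predictions. The `bedrooms` predictor and all values are hypothetical and do not come from the Airbnb dataset.

```python
import numpy as np
import pandas as pd

# Toy data: price is linear in a hypothetical 'bedrooms' predictor
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 5],
    "price": [50.0, 100.0, np.nan, 200.0, 250.0],
})

# Use only the rows where price is known to fit the model
known = toy.dropna(subset=["price"])

# Fit price = slope * bedrooms + intercept
slope, intercept = np.polyfit(known["bedrooms"], known["price"], deg=1)

# Predict for every row, then use the predictions only where price is missing
predicted = slope * toy["bedrooms"] + intercept
toy["price"] = toy["price"].fillna(predicted)
print(toy)
```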


2.2. Creating New Columns

[5 Marks]


2.2.1. Add name_len [2 marks]

Using an implementation of list comprehension, create a new column, name_len, such that,

Then, display the modified DataFrame and report the number of entries in these two new categories.

import pandas as pd

# Load data from the CSV file
df = pd.read_csv('data/airbnb.csv')


#df = pd.DataFrame()

# Create a new column 'name_len' using list comprehension with a conditional
df['name_len'] = ['verbose' if len(name) >= 35 else 'brief' for name in df['name']]

# Display the modified DataFrame
print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\1541629390.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[68], line 10
      4 df = pd.read_csv('data/airbnb.csv')
      7 #df = pd.DataFrame()
      8 
      9 # Create a new column 'name_len' using list comprehension with a conditional
---> 10 df['name_len'] = ['verbose' if len(name) >= 35 else 'brief' for name in df['name']]
     12 # Display the modified DataFrame
     13 print(df)

TypeError: object of type 'float' has no len()
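
The TypeError above arises because missing names are stored as float NaN, which has no length. A minimal sketch of a list comprehension that guards against this, on toy data; the 35-character threshold is taken from the cell above, and labelling missing names as 'brief' is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with one missing name
toy = pd.DataFrame({
    "name": ["MAUKA BB", np.nan, "Upcountry Hospitality in the 'Auwai Suite"],
})

# Only call len() on strings, so NaN (a float) falls through to 'brief'
toy["name_len"] = [
    "verbose" if isinstance(name, str) and len(name) >= 35 else "brief"
    for name in toy["name"]
]

# Report the number of entries in each category
print(toy["name_len"].value_counts())
```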

2.2.2. Add price_class [2 marks]

Using an implementation of a function, create a new column, price_class, such that it becomes equal to:

Then, display the modified DataFrame and report the number of entries in these two new categories.

import pandas as pd

def classify_price(price):
    if price >= 5000:
        return 'astronomical'
    elif 1000 <= price < 5000:
        return 'high'
    elif 200 <= price < 1000:
        return 'mid'
    else:
        return 'low'

# Assuming you have a DataFrame called 'df' with a column 'price'
df['price_class'] = df['price'].apply(classify_price)

    

print(df)
              ID                                               name  \
0          38585    Charming Victorian home - twin beds + breakfast   
1          80905                                   French Chic Loft   
2         108061  Walk to stores/parks/downtown. Fenced yard/Pet...   
3         155305                 Cottage! BonPaul + Sharky's Hostel   
4         160594                                Historic Grove Park   
...          ...                                                ...   
226025  45506143                          DC Hidden In Plain "Site"   
226026  45511428  DC 3 BR w/ screen porch 3 blck to metro w/ par...   
226027  45514685  Charming Penthouse Apt w/ Rooftop Terrace in L...   
226028  45516412                Adams Morgan/Nat'l Zoo 1 BR Apt #32   
226029  45517735    Beautiful large one-bedroom w/ washer and dryer   

          Host ID  Host Name neighbourhood_group  \
0          165529    Evelyne                 NaN   
1          427027    Celeste                 NaN   
2          320564       Lisa                 NaN   
3          746673    BonPaul                 NaN   
4          769252  Elizabeth                 NaN   
...           ...        ...                 ...   
226025   25973146      Marci                 NaN   
226026  231133074     Thomas                 NaN   
226027   33758935     Bassem                 NaN   
226028   23193071    Michael                 NaN   
226029   17789858       Adam                 NaN   

                                            neighbourhood   latitude  \
0                                                   28804  35.651460   
1                                                   28801  35.597790   
2                                                   28801  35.606700   
3                                                   28806  35.578640   
4                                                   28801  35.614420   
...                                                   ...        ...   
226025  Downtown, Chinatown, Penn Quarters, Mount Vern...  38.903880   
226026                      Brookland, Brentwood, Langdon  38.920820   
226027                                 Shaw, Logan Circle  38.911170   
226028     Kalorama Heights, Adams Morgan, Lanier Heights  38.926630   
226029  Edgewood, Bloomingdale, Truxton Circle, Eckington  38.911569   

        longitude        Room Type  price  minimum_nights  number_of_reviews  \
0      -82.627920     Private room   60.0               1                138   
1      -82.555400  Entire home/apt  470.0               1                114   
2      -82.555630  Entire home/apt    NaN              30                 89   
3      -82.595780  Entire home/apt   90.0               1                267   
4      -82.541270     Private room  125.0              30                 58   
...           ...              ...    ...             ...                ...   
226025 -77.029730  Entire home/apt  104.0               1                  0   
226026 -76.990980  Entire home/apt  151.0               2                  0   
226027 -77.033540  Entire home/apt  240.0               2                  0   
226028 -77.044360  Entire home/apt   60.0              21                  0   
226029 -77.009431  Entire home/apt   79.0               7                  0   

       last_review  reviews_per_month  calculated_host_listings_count  \
0       16 02 2020               1.14                               1   
1       07 09 2020               1.03                              11   
2       30 11 2019               0.81                               2   
3       22 09 2020               2.39                               5   
4       19 10 2015               0.52                               1   
...            ...                ...                             ...   
226025         NaN                NaN                               2   
226026         NaN                NaN                               1   
226027         NaN                NaN                               1   
226028         NaN                NaN                               5   
226029         NaN                NaN                               2   

        availability_365             city price_class  
0                      0        Asheville         low  
1                    288        Asheville         mid  
2                    298        Asheville         low  
3                      0        Asheville         low  
4                      0        Asheville         low  
...                  ...              ...         ...  
226025                99  Washington D.C.         low  
226026               300  Washington D.C.         low  
226027               173  Washington D.C.         mid  
226028               362  Washington D.C.         low  
226029                62  Washington D.C.         low  

[226030 rows x 18 columns]

2.2.3 Save modified data [1 mark]

Save the cumulatively modified DataFrame to a new comma-separated file called airbnb_modified.csv to be stored under the ./data/ directory.

Do not include the row indices in the file.

import pandas as pd

df = pd.read_csv('data/airbnb.csv')

airbnb_modified = df

# Save under ./data/ using the file name required by the task
file_path = 'data/airbnb_modified.csv'

# index=False keeps the row indices out of the file
airbnb_modified.to_csv(file_path, index=False)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3801390647.py:3: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')

2.3. Exploratory Data Analysis

In this section, you will not need to retain any DataFrame modifications.

[15 Marks]


2.3.1. Mean price [1 mark]

Find the mean price of all properties. Report the solution rounded to 2 decimal places.

import pandas as pd

# Calculate the mean price
mean_price = df['price'].mean()

# Report the solution rounded to 2 decimal places
mean_price_rounded = round(mean_price, 2)

print("Mean price of all properties:", mean_price_rounded)
Mean price of all properties: 219.71

2.3.2. Count properties by room type [1 mark]

Report the number of properties for each room type.

import pandas as pd

# Count the number of properties for each room type
properties_per_room_type = df['room_type'].value_counts()

print("Number of properties for each room type:")
print(properties_per_room_type)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'room_type'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[71], line 4
      1 import pandas as pd
      3 # Count the number of properties for each room type
----> 4 properties_per_room_type = df['room_type'].value_counts()
      6 print("Number of properties for each room type:")
      7 print(properties_per_room_type)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'room_type'
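
The KeyError above indicates that the DataFrame in scope still carries the original label `Room Type`, because it was reloaded from the raw CSV without the rename being applied. A minimal sketch on toy data, applying the rename before counting:

```python
import pandas as pd

# Toy frame with the original (pre-rename) column label
toy = pd.DataFrame({
    "Room Type": ["Private room", "Entire home/apt", "Private room"],
})

# Apply the rename from section 2.1.2 so that 'room_type' exists
toy = toy.rename(columns={"Room Type": "room_type"})

# Count the number of properties for each room type
counts = toy["room_type"].value_counts()
print(counts)
```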

2.3.3. Unique Cities [1 mark]

Report a list of unique property cities, sorted in ascending alphabetical order.

import pandas as pd

# Extract unique property cities and sort them
unique_cities = sorted(df['city'].unique())

print("List of unique property cities sorted in ascending alphabetical order:")
print(unique_cities)
List of unique property cities sorted in ascending alphabetical order:
['Asheville', 'Austin', 'Boston', 'Broward County', 'Cambridge', 'Chicago', 'Clark County', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles', 'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem', 'San Clara Country', 'San Diego', 'San Francisco', 'San Mateo County', 'Santa Cruz County', 'Seattle', 'Twin Cities MSA', 'Washington D.C.']

2.3.4. Unique Cities [2 marks]

Report the number of unique cities.

Display the mean price and mean number of reviews for each room type with each city by storing the information in a MultiIndex DataFrame. Report the values rounded to 2 decimal places.

Verify that the expected number of unique cities are represented in the resulting DataFrame.

import pandas as pd

# 1. Calculate the number of unique cities
num_unique_cities = df['city'].nunique()

# 2. Group the DataFrame by room type ('Room Type') and city
grouped_data = df.groupby(['Room Type', 'city'])

# 3. Calculate the mean price and mean number of reviews for each group
mean_data = grouped_data.agg({'price': 'mean', 'number_of_reviews': 'mean'})

# 4. Store the information in a MultiIndex DataFrame
multiindex_df = mean_data.round(2)

# 5. Verify the expected number of unique cities
num_cities_in_df = multiindex_df.index.get_level_values('city').nunique()
expected_num_cities = num_unique_cities

print("Number of unique cities:", num_unique_cities)
print("\nMean price and mean number of reviews for each room type with each city:")
print(multiindex_df)

# Verify that the expected number of unique cities are represented in the resulting DataFrame
if num_cities_in_df == expected_num_cities:
    print("\nThe expected number of unique cities are represented in the resulting DataFrame.")
else:
    print("\nThe expected number of unique cities are not represented in the resulting DataFrame.")
Number of unique cities: 28

Mean price and mean number of reviews for each room type with each city:
                                    price  number_of_reviews
Room Type       city                                        
Entire home/apt Asheville          217.91              77.34
                Austin             315.40              33.62
                Boston             209.97              37.92
                Broward County     262.55              21.42
                Cambridge          224.80              49.45
...                                   ...                ...
Shared room     San Mateo County    40.12              17.82
                Santa Cruz County   51.29              24.57
                Seattle             48.19              21.15
                Twin Cities MSA    309.47               5.71
                Washington D.C.     46.53              17.19

[109 rows x 2 columns]

The expected number of unique cities are represented in the resulting DataFrame.

2.3.5. Cross-tabulation [2 marks]

Display in a single DataFrame the number of property entries for each room_type and city combination, with a Total column (showing the sum of the rows) and a Total row (showing the sum of the columns) in the margins.

import pandas as pd

# reload dataset
df = pd.read_csv('data/airbnb.csv')

# Cross-tabulate room type against city. margins=True adds the row and
# column sums, and margins_name='Total' labels both of them 'Total',
# so no further renaming is needed.
cross_tab = pd.crosstab(df['Room Type'], df['city'], margins=True, margins_name='Total')

print("Number of property entries for each room_type and city combination:")
print(cross_tab)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3283301359.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/airbnb.csv')
Number of property entries for each room_type and city combination:
city             Asheville  Austin  Boston  Broward County  Cambridge  \
Room Type                                                               
Entire home/apt       1684    8085    2162            8267        536   
Hotel room              19      15      27             134          0   
Private room           364    2202    1142            2295        484   
Shared room              7     134       8             162          9   
Total                 2074   10436    3339           10858       1029   

city             Chicago  Clark County  Columbus  Denver  Hawaii  ...  Salem  \
Room Type                                                         ...          
Entire home/apt     4401          5721      1043    3186   20050  ...    101   
Hotel room            73           330         3      35     195  ...      0   
Private room        1833          2291       346     935    2119  ...     98   
Shared room           90            66        17      44      70  ...      3   
Total               6397          8408      1409    4200   22434  ...    202   

city             San Clara Country  San Diego  San Francisco  \
Room Type                                                      
Entire home/apt               3378       9213           4282   
Hotel room                       2         25             98   
Private room                  3402       3013           2496   
Shared room                    309        153            177   
Total                         7091      12404           7053   

city             San Mateo County  Santa Cruz County  Seattle  \
Room Type                                                       
Entire home/apt              1620               1236     5028   
Hotel room                      2                  2       51   
Private room                 1064                326     1399   
Shared room                   169                  7       97   
Total                        2855               1571     6575   

city             Twin Cities MSA  Washington D.C.   Total  
Room Type                                                  
Entire home/apt             4563             5222  154173  
Hotel room                    13               41    1941  
Private room                1845             1902   65887  
Shared room                   49              185    4029  
Total                       6470             7350  226030  

[5 rows x 29 columns]

2.3.6. Isolating specific features [1 mark]

Display data corresponding to hosts with the name 'Mary' who offer private rooms with verbose property names.

How many of these are present?

# Filter the DataFrame for hosts with the name 'Mary' and private rooms
filtered_df = df[(df['Host Name'] == 'Mary') & (df['Room Type'] == 'Private room')]

# Keep only the name, host and room type columns for display
verbose_property_names = filtered_df[['name', 'Host Name', 'Room Type']]

print("Data corresponding to hosts with the name 'Mary' who offer private rooms with verbose property names:")
print(verbose_property_names)
Data corresponding to hosts with the name 'Mary' who offer private rooms with verbose property names:
                                                     name Host Name  \
58460                           Diamond Head-Waikiki View      Mary   
73473                        MY HOLLYWOOD SPACE RENTAL 15      Mary   
73558                        MY HOLLYWOOD SPACE RENTAL 16      Mary   
73559                        MY HOLLYWOOD SPACE RENTAL 17      Mary   
73580                        MY HOLLYWOOD SPACE RENTAL 18      Mary   
73995                        MY HOLLYWOOD SPACE RENTALS 3      Mary   
73996                        MY HOLLYWOOD SPACE RENTALS 5      Mary   
73997                        MY HOLLYWOOD SPACE RENTALS 4      Mary   
73998                        MY HOLLYWOOD SPACE RENTALS 2      Mary   
73999                        MY HOLLYWOOD SPACE RENTALS 1      Mary   
74196                Hilltop 1 Bedroom View of LA Lights!      Mary   
74492                        MY HOLLYWOOD SPACE RENTAL 12      Mary   
75650                           Spacious and Cozy Bedroom      Mary   
79432                 Hilltop Gorgeous Views MasterSuite!      Mary   
86563      Cozy&bright room 5-10 min to Santa Monica Pier      Mary   
86592    Big private room&bathroom in heart of SM/Parking      Mary   
96107                                       Hotel Jackman      Mary   
100784  Private Bedroom near LAX, El Segundo, Beach Ci...      Mary   
102879        Private Bedroom close to LAX & Beach Cites!      Mary   
117624                 Sunny, calm room in Victorian home      Mary   
125450                 Large comfy room in Victorian home      Mary   
126692    A room for you in Williamsburg! Only for Women.      Mary   
130985                 Clean bright room in spacious apt!      Mary   
131182    Bright, quiet room in 2br close to park/trains.      Mary   
131457          Comfy bed in Cozy Home - GRAND ARMY PLAZA      Mary   
131689                    Great quiet room,great location      Mary   
139221  Cozy room in great Prospect Heights neighborhood.      Mary   
143458    Secured Apartment Queens NY - 15 mins from JFK.      Mary   
147407     ***Beautiful quiet small room in noisy city***      Mary   
155106       Private Room in Upper West Side (no kitchen)      Mary   
158373         Full floor in Park Slope Brownstone duplex      Mary   

           Room Type  
58460   Private room  
73473   Private room  
73558   Private room  
73559   Private room  
73580   Private room  
73995   Private room  
73996   Private room  
73997   Private room  
73998   Private room  
73999   Private room  
74196   Private room  
74492   Private room  
75650   Private room  
79432   Private room  
86563   Private room  
86592   Private room  
96107   Private room  
100784  Private room  
102879  Private room  
117624  Private room  
125450  Private room  
126692  Private room  
130985  Private room  
131182  Private room  
131457  Private room  
131689  Private room  
139221  Private room  
143458  Private room  
147407  Private room  
155106  Private room  
158373  Private room  
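
The listing above includes short names such as 'Hotel Jackman', because the filter does not test name length, and the count requested by the task is not printed. A minimal sketch that adds both, on toy data; 'verbose' is assumed to mean 35 characters or more, as in the name_len cell, and the host 'Anna' is invented for illustration.

```python
import pandas as pd

# Toy data: two listings from a host named Mary, one from another host
toy = pd.DataFrame({
    "name": [
        "Hotel Jackman",
        "Cozy&bright room 5-10 min to Santa Monica Pier",
        "Private Room in Upper West Side (no kitchen)",
    ],
    "Host Name": ["Mary", "Mary", "Anna"],
    "Room Type": ["Private room", "Private room", "Private room"],
})

# Assumed definition of 'verbose': a name of 35 characters or more
is_verbose = toy["name"].str.len() >= 35

# Combine the three conditions with element-wise AND
mask = (toy["Host Name"] == "Mary") & (toy["Room Type"] == "Private room") & is_verbose
matches = toy[mask]

print(matches)
print(f"Number of matching entries: {len(matches)}")
```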

2.3.7. Isolating specific features [2 marks]

Display data corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' and 'Rita' that have a 30-night minimum stay together in a single DataFrame.

How many of these entries are present?

# Filter for properties located north of 40 degrees latitude
# and hosted by 'Lori' or 'Rita' with a minimum stay of at least 30 nights
filtered_df = df[(df['latitude'] > 40) & 
                 ((df['Host Name'] == 'Lori') | (df['Host Name'] == 'Rita')) & 
                 (df['minimum_nights'] >= 30)]

# Print the number of entries present
num_entries = len(filtered_df)
print("Number of entries present:", num_entries)

# Display the filtered DataFrame
print("\nData corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' or 'Rita' with a 30-night minimum stay:")
print(filtered_df)
Number of entries present: 7

Data corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' or 'Rita' with a 30-night minimum stay:
              ID                                               name  \
128293  10192564      Room in Lovely, Spacious Upper West Side Apt.   
206979  10295151                MODERN, CLEAN Capitol Hill LOCATION   
207998  17589837         Modern 1 Bdrm Full Kitchen Retreat Near UW   
208821  21717949  Bright, charming, newly remodeled apt.- W Seattle   
209324  24560661   Capitol Hill & Broadway Home Room#A,Free Parking   
209326  24561629  Capitol Hill & Broadway Home Room#B, Free Parking   
209409  24996483  Capitol Hill & Broadway Home Room#C , Free Par...   

          Host ID Host Name  neighbourhood_group    neighbourhood  latitude  \
128293   52320041      Lori            Manhattan  Upper West Side  40.80114   
206979   32689598      Lori         Capitol Hill         Broadway  47.62351   
207998  117593676      Lori  Other neighborhoods           Bryant  47.66977   
208821  158136273      Rita             Delridge        Riverview  47.54253   
209324   18668299      Rita         Capitol Hill         Broadway  47.62411   
209326   18668299      Rita         Capitol Hill         Broadway  47.62275   
209409   18668299      Rita         Capitol Hill         Broadway  47.62226   

        longitude        Room Type  price  minimum_nights  number_of_reviews  \
128293  -73.96772     Private room   70.0              60                 17   
206979 -122.32594  Entire home/apt  100.0              30                 71   
207998 -122.28766  Entire home/apt  130.0              30                 25   
208821 -122.35022  Entire home/apt   80.0              45                 45   
209324 -122.31979     Private room   90.0              30                 51   
209326 -122.31835     Private room   90.0              30                 80   
209409 -122.31990     Private room   80.0              30                 68   

       last_review  reviews_per_month  calculated_host_listings_count  \
128293  2018-09-05               0.34                               1   
206979  2020-03-03               1.31                               1   
207998  2019-09-08               0.65                               1   
208821  2019-08-28               1.47                               1   
209324  2019-08-31               1.97                               6   
209326  2019-09-02               3.16                               6   
209409  2019-08-30               2.71                               6   

        availability_365           city  
128293               212  New York City  
206979               365        Seattle  
207998               323        Seattle  
208821                83        Seattle  
209324               280        Seattle  
209326               294        Seattle  
209409               289        Seattle  
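
The host-name comparison can also be written with `Series.isin`, and it may be worth checking whether "a 30-night minimum stay" means exactly 30 nights or at least 30. The following is a minimal sketch on a small made-up DataFrame: the rows in `toy` are hypothetical and not taken from the dataset, while the column names follow those used in the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names match the notebook's DataFrame.
toy = pd.DataFrame({
    "Host Name": ["Lori", "Rita", "Rita", "Mary", "Lori"],
    "latitude": [40.8, 47.5, 47.6, 41.0, 34.0],
    "minimum_nights": [60, 45, 30, 30, 30],
})

# Build each condition once so the two readings of the task can share them.
north = toy["latitude"] > 40
hosts = toy["Host Name"].isin(["Lori", "Rita"])

# Reading 1: minimum stay of at least 30 nights (what the >= comparison does).
at_least_30 = toy[north & hosts & (toy["minimum_nights"] >= 30)]

# Reading 2: minimum stay of exactly 30 nights.
exactly_30 = toy[north & hosts & (toy["minimum_nights"] == 30)]

print("At least 30 nights:", len(at_least_30))
print("Exactly 30 nights:", len(exactly_30))
```

The two readings give different counts whenever some matching listings require more than 30 nights, so stating which one is intended avoids ambiguity in the reported total.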

2.3.8. Counting specific features [2 marks]

For properties in Boston with a price greater than $250, report the number of entries for each room type category.

import pandas as pd

# Filter the DataFrame for properties in Boston with a price greater than $250
filtered_df = df[(df['city'] == 'Boston') & (df['price'] > 250)]

# Report the number of entries for each room type category
entries_per_room_type = filtered_df['Room Type'].value_counts()

print("Number of entries for each room type category for properties in Boston with a price greater than $250:")
print(entries_per_room_type)
Number of entries for each room type category for properties in Boston with a price greater than $250:
Series([], Name: count, dtype: int64)
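
An empty Series means that no rows satisfied both conditions. The output alone does not say whether Boston is absent from the DataFrame or whether its label is spelled differently, so one possible check is to list the city labels that are actually present. The sketch below uses a hypothetical `toy` DataFrame in which the label carries a trailing space; this is only one assumed cause, chosen for illustration.

```python
import pandas as pd

# Hypothetical data; the trailing space in "Boston " is deliberate.
toy = pd.DataFrame({
    "city": ["Hawaii", "Seattle", "Seattle", "Boston "],
    "price": [300.0, 260.0, 120.0, 400.0],
    "Room Type": ["Entire home/apt", "Private room", "Private room", "Entire home/apt"],
})

# An exact string match misses the row because of the stray whitespace.
raw_match = toy[(toy["city"] == "Boston") & (toy["price"] > 250)]
print("Rows matched before cleaning:", len(raw_match))

# Listing the labels (repr shows hidden whitespace) reveals the mismatch.
print([repr(label) for label in toy["city"].unique()])

# Strip whitespace before comparing, then count entries per room type.
clean_city = toy["city"].str.strip()
clean_match = toy[(clean_city == "Boston") & (toy["price"] > 250)]
counts = clean_match["Room Type"].value_counts()
print(counts)
```

If the label check shows that the city is simply not in the DataFrame, the filter is correct and the empty result should be reported as such.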

2.3.9. Grouping data [2 marks]

For properties with a minimum stay of 3 or fewer nights, show the minimum and maximum price, as well as minimum and maximum reviews per month for each city.

Display the values as integers in a single DataFrame.

import pandas as pd

# Filter the DataFrame for properties with a minimum stay of 3 or fewer nights
filtered_df = df[df['minimum_nights'] <= 3]

# Group the filtered DataFrame by city and calculate the required statistics
grouped_data = filtered_df.groupby('city').agg({
    'price': ['min', 'max'],
    'reviews_per_month': ['min', 'max']
})

# Convert the calculated values to integers
grouped_data = grouped_data.astype(int)

# Rename the columns for better clarity
grouped_data.columns = ['min_price', 'max_price', 'min_reviews_per_month', 'max_reviews_per_month']

# Display the DataFrame
print("Minimum and maximum price, and minimum and maximum reviews per month for properties with a minimum stay of 3 or fewer nights in each city:")
print(grouped_data)
Minimum and maximum price, and minimum and maximum reviews per month for properties with a minimum stay of 3 or fewer nights in each city:
               min_price  max_price  min_reviews_per_month  \
city                                                         
Hawaii                10      25000                      0   
Los Angeles           10       4280                      0   
New York City          0      10000                      0   
Seattle               10       1650                      0   

               max_reviews_per_month  
city                                  
Hawaii                            17  
Los Angeles                       33  
New York City                     44  
Seattle                           15  
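
The same summary can be produced with pandas named aggregation, which assigns the flat column names inside `agg` so that no separate renaming step is needed. This is a minimal sketch on hypothetical rows (`toy` is made up); the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "city": ["Hawaii", "Hawaii", "Seattle", "Seattle"],
    "minimum_nights": [1, 3, 2, 30],
    "price": [10.0, 250.0, 80.0, 500.0],
    "reviews_per_month": [0.4, 17.9, 1.2, 3.3],
})

summary = (
    toy[toy["minimum_nights"] <= 3]      # keep stays of 3 or fewer nights
    .groupby("city")
    .agg(
        min_price=("price", "min"),
        max_price=("price", "max"),
        min_reviews_per_month=("reviews_per_month", "min"),
        max_reviews_per_month=("reviews_per_month", "max"),
    )
    .astype(int)                         # truncates toward zero; fails if NaN is present
)
print(summary)
```

Note that `astype(int)` truncates rather than rounds (17.9 becomes 17), and it raises an error if any aggregated value is missing, so missing values would need handling first.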

2.3.10. Reporting extreme values [1 mark]

Determine which city has the highest mean property price. Display this city and its corresponding mean price (rounded to 2 decimal places).

import pandas as pd

# Group the DataFrame by city and calculate the mean price for each city
mean_price_per_city = df.groupby('city')['price'].mean()

# Find the city with the highest mean price
highest_mean_price_city = mean_price_per_city.idxmax()
highest_mean_price = mean_price_per_city.max()

# Display the city with the highest mean price and its corresponding mean price (rounded to 2 decimal places)
print("City with the highest mean property price:")
print("City:", highest_mean_price_city)
print("Mean Price:", round(highest_mean_price, 2))
City with the highest mean property price:
City: Hawaii
Mean Price: 207.95

Task 3

[40 Marks]

Design a Market class to represent groups of investors in the AirBnB market within individual cities. The investors can collect portfolios of rental properties by purchasing them with a budget that you will allocate.

Once investor characteristics are set, they will compete by reporting portfolio scores each month during a simulated year. These values will determine a winning investor for the year.

You will operate at least 2 instances of this class in different markets to compare the performance of winners in each area.


3.1 Prepare data [5 marks]

Begin with the data you have saved in the airbnb_modified.csv file.

Read in and modify this data so that:

Investors will be able to choose properties for their portfolio from this modified DataFrame.

import pandas as pd
import numpy as np
df = pd.read_csv('data/modified_dataairbnb.csv')
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3646692886.py:1: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/modified_dataairbnb.csv')
df['price'] = df['price'].round(-1)
# Convert the 'last_review' column to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\2186937433.py:2: UserWarning: Parsing dates in %d %m %Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  df['last_review'] = pd.to_datetime(df['last_review'])
# Filter the DataFrame to retain only properties with their last reviews in March, August, and September
df = df[df['last_review'].dt.month.isin([3, 8, 9])]
# Filter the DataFrame to retain only cities with more than 960 entries
cities_counts = df['city'].value_counts()
cities_to_keep = cities_counts[cities_counts > 960].index
df = df[df['city'].isin(cities_to_keep)]
# Display the modified DataFrame
print("Modified DataFrame:")
print(df)
Modified DataFrame:
              ID                                               name  \
1          80905                                   French Chic Loft   
3         155305                 Cottage! BonPaul + Sharky's Hostel   
10        304379                         Refocus Cottage - paradise   
11        353092                Athena's Loft:  Find yourself here!   
12        427497           Luxurious Mountain Guest Suite Apartment   
...          ...                                                ...   
225936  45326130      Modern and cozy home located in Washington DC   
225950  45349877      Penthouse w/ Patio ☆ Capitol Hill Condo ☆ ...   
225955  45352724      MODERN ☆ Well-Located Shaw Town Home ☆ 2BR...   
225964  45385834      Brand new modern apartment with private entry   
226002  45472150  LUXE 1-BR STUDIO SPACE / Desirable Location + ...   

          Host ID Host Name neighbourhood_group  \
1          427027   Celeste                 NaN   
3          746673   BonPaul                 NaN   
10        1566145     Gayle                 NaN   
11        1788071      Beth                 NaN   
12        1909922     Milan                 NaN   
...           ...       ...                 ...   
225936   55489711      Amir                 NaN   
225950    3850096       Ije                 NaN   
225955    3850096       Ije                 NaN   
225964   16561471    Victor                 NaN   
226002  367917574      Tara                 NaN   

                                            neighbourhood  latitude  \
1                                                   28801  35.59779   
3                                                   28806  35.57864   
10                                                  28804  35.64453   
11                                                  28806  35.58217   
12                                                  28803  35.49111   
...                                                   ...       ...   
225936          Union Station, Stanton Park, Kingman Park  38.90101   
225950                         Capitol Hill, Lincoln Park  38.88703   
225955     Howard University, Le Droit Park, Cardozo/Shaw  38.91626   
225964               Brightwood Park, Crestwood, Petworth  38.94358   
226002  Edgewood, Bloomingdale, Truxton Circle, Eckington  38.91457   

        longitude        Room Type  price  minimum_nights  number_of_reviews  \
1       -82.55540  Entire home/apt  470.0               1                114   
3       -82.59578  Entire home/apt   90.0               1                267   
10      -82.52586  Entire home/apt  290.0              30                 24   
11      -82.59997  Entire home/apt   80.0               4                497   
12      -82.48438  Entire home/apt  120.0               2                 40   
...           ...              ...    ...             ...                ...   
225936  -77.00283  Entire home/apt  140.0               1                  1   
225950  -77.00586  Entire home/apt  130.0               2                  1   
225955  -77.02074  Entire home/apt  110.0               2                  1   
225964  -77.01283  Entire home/apt   80.0               1                  1   
226002  -77.00911  Entire home/apt   80.0               1                  1   

       last_review  reviews_per_month  calculated_host_listings_count  \
1       2020-09-07               1.03                              11   
3       2020-09-22               2.39                               5   
10      2019-08-03               0.23                               2   
11      2020-03-15               5.76                               1   
12      2020-08-17               0.42                               1   
...            ...                ...                             ...   
225936  2020-09-18               1.00                               1   
225950  2020-09-16               1.00                               8   
225955  2020-09-13               1.00                               8   
225964  2020-09-19               1.00                               1   
226002  2020-09-18               1.00                               1   

        availability_365             city  
1                    288        Asheville  
3                      0        Asheville  
10                     0        Asheville  
11                   315        Asheville  
12                   339        Asheville  
...                  ...              ...  
225936               328  Washington D.C.  
225950               162  Washington D.C.  
225955               171  Washington D.C.  
225964                75  Washington D.C.  
226002               173  Washington D.C.  

[70758 rows x 17 columns]
##################################
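
The two warnings emitted during this step each suggest a remedy: `low_memory=False` for the mixed-type column and `dayfirst=True` for the date parsing. The sketch below applies both to a small in-memory CSV. The contents of `csv_text` are hypothetical, `min_entries` stands in for the 960 threshold, and whether the real file stores day-first dates is an assumption taken from the warning text.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the saved file.
csv_text = """city,price,last_review
Hawaii,104.0,07/09/2020
Hawaii,96.0,15/03/2020
Hawaii,52.0,25/12/2019
Seattle,130.0,31/08/2019
"""

# low_memory=False reads the file in one pass, avoiding mixed-type chunk inference.
df_demo = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Round prices to the nearest 10.
df_demo["price"] = df_demo["price"].round(-1)

# dayfirst=True tells pandas that 07/09/2020 means 7 September, not 9 July.
df_demo["last_review"] = pd.to_datetime(df_demo["last_review"], dayfirst=True)

# Keep listings last reviewed in March, August, or September.
df_demo = df_demo[df_demo["last_review"].dt.month.isin([3, 8, 9])]

# Keep cities with more than `min_entries` remaining listings (960 in the notebook).
min_entries = 1
counts = df_demo["city"].value_counts()
df_demo = df_demo[df_demo["city"].isin(counts[counts > min_entries].index)]

print(df_demo)
```

Parsing day-first dates without this option can silently swap day and month for ambiguous values, which would change which rows survive the month filter.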

3.2. Define a Market class [30 marks]

You are to translate these requirements (with consideration of their usage, defined in 3.3 and 3.4) into working code.

Overall: Coding efficiency and structure, including comments and docstrings, where appropriate, will contribute to the mark in this section.

$$$$

import pandas as pd
import numpy as np
from random import choice, sample, randint

class Market:
    def __init__(self, properties_df, city_name, num_investors=10):
        # Check if properties_df has any missing values
        assert not properties_df.isnull().values.any(), "DataFrame must not contain missing values"
        
        # Initialize properties DataFrame, city name, and number of investors
        self.properties_df = properties_df
        self.city_name = city_name
        self.num_investors = num_investors
        
        # Check if the number of investors is an integer
        if not isinstance(self.num_investors, int):
            raise TypeError("The number of investors must be an integer.")
        
        # Check if the number of investors is positive and non-zero
        assert self.num_investors > 0, "The number of investors must be positive and non-zero."
        
        # Ensure that the number of investors is evenly divisible by 5
        assert self.num_investors % 5 == 0, "The number of investors must be evenly divisible by 5."
        
        # Select neighbourhoods containing the most properties in the city
        self.selected_neighbourhoods = self.select_neighbourhoods()
        
        # Generate investors
        self.generate_investors()
        
        # Initialise portfolios for investors
        self.initialise_portfolios()
        
        # Store champion investor
        self.champion = None
        
    def select_neighbourhoods(self, size=None, specified_neighbourhoods=None):
        # Implementation to select neighbourhoods
        pass
    
    def generate_investors(self):
        # Implementation to generate investors
        pass
    
    def initialise_portfolios(self):
        # Implementation to initialise portfolios
        pass
    
    def simulate_year(self):
        # Implementation to simulate a year
        pass
    
    def _purchase_portfolio(self, investor):
        # Implementation for purchasing portfolio
        pass
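
The methods above are placeholders, so `select_neighbourhoods` currently returns None. As one possible starting point, the comment "Select neighbourhoods containing the most properties in the city" could be sketched as a standalone function. The function name `top_neighbourhoods`, the default `size`, and the toy data are all assumptions made for illustration, since the full requirements are not reproduced here.

```python
import pandas as pd

def top_neighbourhoods(properties_df, city_name, size=3):
    """Return the `size` neighbourhoods with the most listings in `city_name`.

    Hypothetical helper: the name, signature, and default are illustrative only.
    """
    in_city = properties_df[properties_df["city"] == city_name]
    # value_counts sorts from most to fewest listings.
    counts = in_city["neighbourhood"].value_counts()
    return counts.head(size).index.tolist()

# Hypothetical listings; neighbourhood labels are placeholders.
toy = pd.DataFrame({
    "city": ["Hawaii"] * 6 + ["Seattle"] * 2,
    "neighbourhood": ["Nbhd A", "Nbhd A", "Nbhd A", "Nbhd B", "Nbhd B", "Nbhd C",
                      "Nbhd D", "Nbhd D"],
})

selected = top_neighbourhoods(toy, "Hawaii", size=2)
print(selected)
```

Inside the class, the same logic would read from `self.properties_df` and `self.city_name` and return the list so that `__init__` stores something other than None.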

3.3. Execute your Market class [5 marks]

m1 = Market(air, "Hawaii")
m1.select_neighbourhoods()
m1.generate_investors()
m1.initialise_portfolios()
m1.simulate_year()
print(f'The champion of {m1.name} Market is the {m1.champion}')
...
The champion of Hawaii Market is Kaulana with a score of 24298578.
m1.show_win_record()

Example

# Check for missing values in the DataFrame
missing_values = df.isnull().sum()

# If there are missing values, handle them accordingly
if missing_values.any():
    # Drop rows with missing values so the Market constructor's check passes
    df = df.dropna()


# Create a Market instance
m1 = Market(df, "Hawaii")

# Select neighbourhoods
m1.select_neighbourhoods()

# Generate investors
m1.generate_investors()

# Initialise portfolios
m1.initialise_portfolios()

# Simulate a year
m1.simulate_year()

# Print the champion of the market
print(f'The champion of {m1.city_name} Market is the {m1.champion}')

# Show win record
m1.show_win_record()
The champion of Hawaii Market is the None
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[110], line 29
     26 print(f'The champion of {m1.city_name} Market is the {m1.champion}')
     28 # Show win record
---> 29 m1.show_win_record()

AttributeError: 'Market' object has no attribute 'show_win_record'
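
Two things explain this output: `simulate_year` is still a placeholder, so `champion` is never assigned, and the class defines no `show_win_record` method. The sketch below shows one way the two pieces could fit together. `MiniMarket`, its random monthly scores, and the rule that the champion is the investor with the most monthly wins are all assumptions for illustration, not the coursework's scoring rules.

```python
import random
from collections import Counter

class MiniMarket:
    """Toy stand-in for the Market class; scoring here is random by assumption."""

    def __init__(self, city_name, investor_names, seed=0):
        self.city_name = city_name
        self.investor_names = list(investor_names)
        self.rng = random.Random(seed)   # seeded so repeated runs agree
        self.win_record = Counter()      # investor name -> number of monthly wins
        self.champion = None

    def simulate_year(self):
        """Run 12 monthly rounds and set the champion to the most frequent winner."""
        for _month in range(12):
            scores = {name: self.rng.random() for name in self.investor_names}
            monthly_winner = max(scores, key=scores.get)
            self.win_record[monthly_winner] += 1
        self.champion = self.win_record.most_common(1)[0][0]

    def show_win_record(self):
        """Print each investor's monthly wins, most wins first."""
        for name, wins in self.win_record.most_common():
            print(f"{name}: {wins} monthly wins")

demo = MiniMarket("Hawaii", ["Investor 1", "Investor 2", "Investor 3"])
demo.simulate_year()
print(f"The champion of {demo.city_name} Market is {demo.champion}")
demo.show_win_record()
```

The key point is that `simulate_year` must assign `self.champion` and update a stored record before `show_win_record` has anything to display.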

3.4. Compare the results of multiple Market executions [5 marks]

Market rankings:
Position     Name                     Champion                  Score
1            Hawaii                   Kaulana                   24298578
2            Boston                   Nat                       21440474

class Investor:
    def __init__(self, name):
        self.name = name

class Market:
    def __init__(self):
        self.champion = None  # Assume champion is not set initially

    def set_champion(self, investor):
        self.champion = investor

# Create a Market instance
market = Market()

# The champion is None until set_champion() is called

print("Champion before setting:", market.champion)

# Try to access the 'name' attribute on the champion
try:
    champion_name = market.champion.name
    print("Champion name:", champion_name)
except AttributeError:
    print("Champion is None, cannot access 'name' attribute")

# Set a valid investor as the champion
valid_investor = Investor("John")
market.set_champion(valid_investor)

# Now try to access the 'name' attribute again
champion_name = market.champion.name
print("Champion name:", champion_name)

print(f"{i}\t\t{market['Name']}\t\t{market['Champion']}\t\t{market['Score']}")
Champion before setting: None
Champion is None, cannot access 'name' attribute
Champion name: John
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[116], line 34
     31 champion_name = market.champion.name
     32 print("Champion name:", champion_name)
---> 34 print(f"{i}\t\t{market['Name']}\t\t{market['Champion']}\t\t{market['Score']}")

NameError: name 'i' is not defined
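
The NameError arises because `i` is never defined; in addition, `market` is a Market instance, which cannot be indexed with `market['Name']`. A ranking table like the example under "Market rankings" could be built from a list of result records, as in this minimal sketch. The `results` list reuses the values from that example table; the dictionary layout and column widths are assumptions.

```python
# Values copied from the example rankings table in the task description.
results = [
    {"Name": "Boston", "Champion": "Nat", "Score": 21440474},
    {"Name": "Hawaii", "Champion": "Kaulana", "Score": 24298578},
]

# Highest score first.
ranked = sorted(results, key=lambda record: record["Score"], reverse=True)

# Left-align each column to a fixed width so the rows line up.
lines = [f"{'Position':<13}{'Name':<25}{'Champion':<26}{'Score'}"]
for position, record in enumerate(ranked, start=1):
    lines.append(
        f"{position:<13}{record['Name']:<25}{record['Champion']:<26}{record['Score']}"
    )

print("Market rankings:")
print("\n".join(lines))
```

With real Market instances, each record would be filled from the instance after `simulate_year` has run, for example from its city name and champion.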