Front page of the student’s submission (the following are compulsory):
Module Code:
Assignment report Title:
Date (when the work completed):
Actual hrs spent for the assignment:
Instructions:
Write Python code to perform each of the following sub-tasks.
Most of the required techniques are covered in the Lectures and Practicals during Weeks 1-4.
Some parts of this assignment may require further self-study of Python documentation or other resources.
To locate appropriate methods to complete these activities and for details regarding their proper implementation, you may wish to refer to:
You may also refer to other documentation/self-study resources, such as those suggested in the Lecture Notes or a multitude of other resources that you have independently discovered.
When you are asked to display a `pandas` `DataFrame`, it is acceptable to show the `head()`, `tail()`, or both of these, as display of the full dataset would otherwise be excessive.
Blank code and markdown cells are provided for each sub-task; however, you will likely need to create additional cells to provide further explanation or to sensibly separate the steps toward a solution.
Items to be submitted:
1. A modified version of this Jupyter notebook file (`.ipynb`)
2. A copy of this notebook (with the `CS2PP22_CW1` file name as in Item 1.) but in `.html` format, which displays all content independently. This can be included in the overall assessment archive, but an unarchived copy should be submitted to Blackboard alongside the archive.
3. `airbnb_modified.csv`
4. Optional: Any functions and classes created for this Task may be written in a separate `.py` module file and `import`ed to this space. These should be stored in the submitted archive at the same directory level as the modified notebook (Item 1.).
Marking scheme:
Marks | Section |
---|---|
5 | Organisation: Preparation and submission of all required files |
10 | 1: Network Representations |
5 | 2.0: Analysis Preparation |
15 | 2.1: Data Cleaning |
5 | 2.2: Creating New Columns |
15 | 2.3: Exploratory Data Analysis |
5 | 3.1: Data Preparation |
30 | 3.2: Class Definition |
5 | 3.3: Class Execution |
5 | 3.4: Class Comparisons |
[10 marks]
A network or graph, G, consists of a set of nodes (or vertices), V, and edges, E.
An edge is a pair of nodes (a,b) denoting the nodes connected by the edge.
G = (V,E)
Networks can be represented in various different structures.
We can represent networks using a simple list of all the nodes and all the edges in the network:
V = a, b, c, d, e
Another data structure that we could use is to build a neighbour list for every node.
For every node, x, we maintain a list of all the neighbours, y.
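The two representations described above can be illustrated with a minimal sketch. The five node labels and the four edges below are hypothetical and chosen only for illustration; the sketch also assumes that edges are undirected.

```python
# Hypothetical five-node network; node labels and edges are illustrative only.
nodes = {"a", "b", "c", "d", "e"}
edge_list = {("a", "b"), ("a", "c"), ("b", "d"), ("d", "e")}

# Neighbour list: map every node x to the set of nodes y it shares an edge with.
neighbours = {x: set() for x in nodes}
for x, y in edge_list:
    neighbours[x].add(y)  # y is a neighbour of x
    neighbours[y].add(x)  # assumption: edges are undirected, so x is a neighbour of y

print(sorted(neighbours["a"]))  # ['b', 'c']
print(sorted(neighbours["d"]))  # ['b', 'e']
```

The edge list is compact, while the neighbour list makes it cheap to look up everything connected to a given node.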
Data Reference:
The file `data/influencers.tsv` contains a representation of a social network dataset where enumerated influencer personalities have links between them if they have been frequently associated with one another.
To read in the tab separated network file, we need to read each line in the file. To do this we just treat the file object as an iterator.
This is an example of using a context manager: https://book.pythontips.com/en/latest/context_managers.html
Here `line` will be a string for each line in the file. The `.split(x)` method splits a string into a list of substrings for each occurrence of the character `x` (in this case the tab: `\t`).
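As a small illustration of iterating over a file object and splitting on tabs, the sketch below parses an in-memory stand-in for a tab-separated file. The two sample lines are hypothetical, and `io.StringIO` is used only so that the example runs without the real data file.

```python
import io

# Hypothetical two-line sample standing in for a tab-separated edge file.
sample = io.StringIO("0\t1\n0\t2\n")

parsed = []
with sample as file:                     # the context manager closes the object on exit
    for line in file:                    # iterating yields one string per line
        a, b = line.split("\t")          # "0\t1\n" becomes ["0", "1\n"]
        parsed.append((int(a), int(b)))  # int() ignores the trailing newline

print(parsed)  # [(0, 1), (0, 2)]
```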
This code reads the file and places the data in an edge list representation:
# Create an empty edge set
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)
Instructions:
Begin with the `edges` variable resulting from the above code.
[1 mark] Use a `for` loop to populate a `dict` that will contain the neighbour list network representation.
[1 mark] Verify that your solution matches that for nodes 55, 57, and 25 (shown below).
[1 mark] With your entire neighbour list code in a single Jupyter notebook cell, use the `timeit` Magic command to report the performance of your code, executing 27 runs of 1500 loops each.
[1 mark] Discuss your solution, describing the Python constructs/tools you have used, and the design of your implementation.
[3 marks] Design, implement, and verify the results of a second, less efficient implementation and record its equivalent performance metrics, as done above in the more efficient case.
[1 mark] Discuss your second solution, describing the Python constructs/tools you have used and the design of your less efficient implementation.
[2 marks] Explain why the efficiency differs between your two implementations.
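The timing requirement (27 runs of 1500 loops each) can be sketched outside a notebook with the standard `timeit` module. The function `build_neighbours` and the edge set `toy_edges` are hypothetical names introduced only for this example; in a notebook cell, the roughly equivalent magic would be `%%timeit -r 27 -n 1500`.

```python
import timeit

# Hypothetical small edge set, used only so the example is self-contained.
toy_edges = {(0, 1), (0, 2), (1, 2), (2, 3)}

def build_neighbours(edges):
    """Build a neighbour-list dict from an iterable of (a, b) pairs."""
    result = {}
    for a, b in edges:
        result.setdefault(a, set()).add(b)
        result.setdefault(b, set()).add(a)
    return result

# 27 runs (repeat) of 1500 loops each (number).
runs = timeit.repeat(lambda: build_neighbours(toy_edges), repeat=27, number=1500)

# Each entry is the total time in seconds for 1500 loops; divide for a per-loop figure.
best_per_loop = min(runs) / 1500
print(f"{len(runs)} runs, best {best_per_loop:.2e} s per loop")
```

Reporting a per-loop time rather than the raw totals makes results comparable between implementations that were timed with different loop counts.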
There will be many nodes in the final dataset. The result should have the form:
{55: {11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65},
 57: {41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67},
 25: {11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75}...
}.
This shows that influencer `55` is frequently associated with influencers `11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, and 65`.
# Begin with edges
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)

def edge_to_neighbour(edges):  # build the neighbour list dictionary from the edge set
    neighbour_list = {}
    for edge in edges:
        node1, node2 = edge  # unpack the two nodes joined by this edge
        if node1 not in neighbour_list:  # create an empty set the first time a node is seen
            neighbour_list[node1] = set()
        if node2 not in neighbour_list:
            neighbour_list[node2] = set()
        # add each node to the other's set of neighbours
        neighbour_list[node1].add(node2)
        neighbour_list[node2].add(node1)
    return neighbour_list  # return the neighbour list rather than the edge list

neighbour_list = edge_to_neighbour(edges)  # build the neighbour_list dictionary

for node, neighbours in neighbour_list.items():
    neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {neighbours}")
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
# Verifying my more efficient solution
expected_neighbours = {
    55: {11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65},  # Expected neighbours for node 55
    57: {41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67},  # Expected neighbours for node 57
    25: {11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75}  # Expected neighbours for node 25
}
# Check if the generated neighbour lists match the expected neighbours for each node
for node, expected_neighbour_set in expected_neighbours.items():
    if node not in neighbour_list:
        print(f"Node {node} not found in the neighbour list.")
    elif neighbour_list[node] == expected_neighbour_set:
        print(f"Solution for node {node} Verified")
    else:
        print(f"Solution for node {node} is not Verified")
Solution for node 55 Verified
Solution for node 57 Verified
Solution for node 25 Verified
# Testing the performance of my more efficient solution
import timeit
# Read edges from file and store them in a set
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)
# Define the function to convert edges to neighbour list
def edge_to_neighbour(edges):
    neighbour_list = {}
    for edge in edges:
        node1, node2 = edge
        if node1 not in neighbour_list:
            neighbour_list[node1] = set()
        if node2 not in neighbour_list:
            neighbour_list[node2] = set()
        neighbour_list[node1].add(node2)
        neighbour_list[node2].add(node1)  # Assuming undirected graph, add both directions
    return neighbour_list
# Time the conversion of the edge set to a neighbour list
time_taken = timeit.repeat(lambda: edge_to_neighbour(edges), repeat=19, number=2000)
# Report the mean total time (in seconds) across the 19 runs of 2000 loops each
print("Average Performance:", sum(time_taken) / len(time_taken))
Average Performance: 0.6957541842104643
Discussion of more efficient solution characteristics:
# Verifying my less efficient solution
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)
neighbour_list = {}
for edge in edges:
    # Introduce unnecessary conversion to a list
    edge_list = list(edge)
    node1, node2 = edge_list[0], edge_list[1]
    if node1 not in neighbour_list:
        neighbour_list[node1] = set()
    if node2 not in neighbour_list:
        neighbour_list[node2] = set()
    neighbour_list[node1].add(node2)
    neighbour_list[node2].add(node1)
for node, neighbours in neighbour_list.items():
    # Introduce redundant sorting of neighbours
    sorted_neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {sorted_neighbours}")
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
# Testing the performance of my less efficient solution
import timeit
# Read edges from file and store them in a set
edges = set()
with open('data/influencers.tsv', 'r') as file:
    for line in file:
        a, b = line.split('\t')
        e = (int(a), int(b))
        edges.add(e)
neighbour_list = {}
for edge in edges:
    # Introduce unnecessary conversion to a list
    edge_list = list(edge)
    node1, node2 = edge_list[0], edge_list[1]
    if node1 not in neighbour_list:
        neighbour_list[node1] = set()
    if node2 not in neighbour_list:
        neighbour_list[node2] = set()
    neighbour_list[node1].add(node2)
    neighbour_list[node2].add(node1)
for node, neighbours in neighbour_list.items():
    # Introduce redundant sorting of neighbours
    sorted_neighbours = sorted(neighbours)
    print(f"Node {node}: Neighbours {sorted_neighbours}")
# Time the conversion of the edge set to a neighbour list.
# Note: this call times edge_to_neighbour() as defined in the earlier cell, not the inline loop above.
time_taken = timeit.repeat(lambda: edge_to_neighbour(edges), repeat=19, number=2000)
# Report the mean total time (in seconds) across the 19 runs of 2000 loops each
print("Performance:", sum(time_taken) / len(time_taken))
Node 55: Neighbours [11, 16, 25, 26, 39, 41, 48, 49, 51, 54, 56, 57, 58, 59, 61, 62, 63, 64, 65]
Node 57: Neighbours [41, 48, 55, 58, 59, 61, 62, 63, 64, 65, 67]
Node 25: Neighbours [11, 23, 24, 26, 27, 39, 40, 41, 42, 48, 55, 68, 69, 70, 71, 75]
Node 23: Neighbours [11, 12, 16, 17, 18, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31]
Node 16: Neighbours [17, 18, 19, 20, 21, 22, 23, 26, 55]
Node 20: Neighbours [16, 17, 18, 19, 21, 22, 23]
Node 59: Neighbours [48, 55, 57, 58, 60, 61, 62, 63, 64, 65, 66]
Node 58: Neighbours [11, 27, 48, 55, 57, 59, 60, 61, 62, 63, 64, 65, 66, 70, 76]
Node 65: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66, 76]
Node 11: Neighbours [0, 2, 3, 10, 12, 13, 14, 15, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 48, 49, 51, 55, 58, 64, 68, 69, 70, 71, 72]
Node 8: Neighbours [0]
Node 0: Neighbours [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
Node 51: Neighbours [11, 26, 49, 52, 53, 54, 55]
Node 22: Neighbours [16, 17, 18, 19, 20, 21, 23]
Node 17: Neighbours [16, 18, 19, 20, 21, 22, 23]
Node 68: Neighbours [11, 24, 25, 27, 41, 48, 69, 70, 71, 75]
Node 48: Neighbours [11, 25, 27, 47, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 71, 73, 74, 75, 76]
Node 66: Neighbours [48, 58, 59, 60, 61, 62, 63, 64, 65, 76]
Node 60: Neighbours [48, 58, 59, 61, 62, 63, 64, 65, 66]
Node 21: Neighbours [16, 17, 18, 19, 20, 22, 23]
Node 19: Neighbours [16, 17, 18, 20, 21, 22, 23]
Node 18: Neighbours [16, 17, 19, 20, 21, 22, 23]
Node 69: Neighbours [11, 24, 25, 27, 41, 48, 68, 70, 71, 75]
Node 34: Neighbours [11, 29, 35, 36, 37, 38]
Node 37: Neighbours [11, 29, 34, 35, 36, 38]
Node 54: Neighbours [26, 49, 51, 55]
Node 49: Neighbours [11, 26, 50, 51, 54, 55, 56]
Node 63: Neighbours [48, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 76]
Node 41: Neighbours [24, 25, 42, 55, 57, 62, 68, 69, 70, 71, 75]
Node 24: Neighbours [11, 23, 25, 26, 27, 41, 42, 50, 68, 69, 70]
Node 72: Neighbours [11, 26, 27]
Node 27: Neighbours [11, 23, 24, 25, 26, 28, 29, 31, 33, 43, 48, 58, 68, 69, 70, 71, 72]
Node 43: Neighbours [11, 26, 27]
Node 42: Neighbours [24, 25, 41]
Node 73: Neighbours [48, 74]
Node 74: Neighbours [48, 73]
Node 70: Neighbours [11, 24, 25, 27, 41, 58, 68, 69, 71, 75]
Node 61: Neighbours [48, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66]
Node 44: Neighbours [11, 28]
Node 29: Neighbours [11, 23, 27, 34, 35, 36, 37, 38]
Node 76: Neighbours [48, 58, 62, 63, 64, 65, 66]
Node 62: Neighbours [41, 48, 55, 57, 58, 59, 60, 61, 63, 64, 65, 66, 76]
Node 7: Neighbours [0]
Node 71: Neighbours [11, 25, 27, 41, 48, 68, 69, 70, 75]
Node 47: Neighbours [46, 48]
Node 35: Neighbours [11, 29, 34, 36, 37, 38]
Node 9: Neighbours [0]
Node 64: Neighbours [11, 48, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 76]
Node 26: Neighbours [11, 16, 24, 25, 27, 43, 49, 51, 54, 55, 72]
Node 2: Neighbours [0, 3, 11]
Node 50: Neighbours [24, 49]
Node 31: Neighbours [11, 23, 27, 30]
Node 15: Neighbours [11]
Node 56: Neighbours [49, 55]
Node 32: Neighbours [11]
Node 52: Neighbours [39, 51]
Node 5: Neighbours [0]
Node 75: Neighbours [25, 41, 48, 68, 69, 70, 71]
Node 33: Neighbours [11, 27]
Node 39: Neighbours [25, 52, 55]
Node 30: Neighbours [23, 31]
Node 36: Neighbours [11, 29, 34, 35, 37, 38]
Node 38: Neighbours [11, 29, 34, 35, 36, 37]
Node 13: Neighbours [11]
Node 28: Neighbours [11, 27, 44, 45]
Node 45: Neighbours [28]
Node 4: Neighbours [0]
Node 14: Neighbours [11]
Node 12: Neighbours [11, 23]
Node 6: Neighbours [0]
Node 3: Neighbours [0, 2, 11]
Node 67: Neighbours [57]
Node 53: Neighbours [51]
Node 46: Neighbours [47]
Node 40: Neighbours [25]
Node 1: Neighbours [0]
Node 10: Neighbours [11]
Performance: 0.6348627842104945
Discussion of less efficient solution characteristics:
Discussion of differing efficiency:
Data Reference:
United States AirBnB Listings, 2020: airbnb.csv
This dataset includes information about individual rental accommodations and about associated guest reviews. These have been compiled by insideairbnb.com.
Each row corresponds to a single rental accommodation property.
The columns correspond to:
Column | Description |
---|---|
ID | id number that identifies the property |
name | Property name |
Host ID | id number that identifies the host |
Host Name | Host name |
neighbourhood_group | The main regions of the city |
neighbourhood | The neighbourhoods |
latitude | Property latitude |
longitude | Property longitude |
Room Type | Type of the room |
price | The price for one night ($) |
minimum_nights | Minimum number of nights to book the place |
number_of_reviews | Number of reviews received |
last_review | Date of the last review |
reviews_per_month | Number of reviews per month |
calculated_host_listings_count | Number of properties available on Airbnb owned by the host |
availability_365 | Number of days of availability within 365 days |
city | Property city |
[5 Marks]
Import the libraries that will be used in your solutions for Task 2.
import pandas as pd
import numpy as np
Locate the data file `airbnb.csv` within the zipped file you have downloaded from Blackboard (under `./data/`) and read the data from file to a `pandas` `DataFrame` variable called `air`.
df = pd.read_csv('data/airbnb.csv') # read data from csv into DataFrame
print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_10644\3209262031.py:1: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv') # read data from csv into DataFrame
ID name \
0 38585 Charming Victorian home - twin beds + breakfast
1 80905 French Chic Loft
2 108061 Walk to stores/parks/downtown. Fenced yard/Pet...
3 155305 Cottage! BonPaul + Sharky's Hostel
4 160594 Historic Grove Park
... ... ...
226025 45506143 DC Hidden In Plain "Site"
226026 45511428 DC 3 BR w/ screen porch 3 blck to metro w/ par...
226027 45514685 Charming Penthouse Apt w/ Rooftop Terrace in L...
226028 45516412 Adams Morgan/Nat'l Zoo 1 BR Apt #32
226029 45517735 Beautiful large one-bedroom w/ washer and dryer
Host ID Host Name neighbourhood_group \
0 165529 Evelyne NaN
1 427027 Celeste NaN
2 320564 Lisa NaN
3 746673 BonPaul NaN
4 769252 Elizabeth NaN
... ... ... ...
226025 25973146 Marci NaN
226026 231133074 Thomas NaN
226027 33758935 Bassem NaN
226028 23193071 Michael NaN
226029 17789858 Adam NaN
neighbourhood latitude \
0 28804 35.651460
1 28801 35.597790
2 28801 35.606700
3 28806 35.578640
4 28801 35.614420
... ... ...
226025 Downtown, Chinatown, Penn Quarters, Mount Vern... 38.903880
226026 Brookland, Brentwood, Langdon 38.920820
226027 Shaw, Logan Circle 38.911170
226028 Kalorama Heights, Adams Morgan, Lanier Heights 38.926630
226029 Edgewood, Bloomingdale, Truxton Circle, Eckington 38.911569
longitude Room Type price minimum_nights number_of_reviews \
0 -82.627920 Private room 60.0 1 138
1 -82.555400 Entire home/apt 470.0 1 114
2 -82.555630 Entire home/apt NaN 30 89
3 -82.595780 Entire home/apt 90.0 1 267
4 -82.541270 Private room 125.0 30 58
... ... ... ... ... ...
226025 -77.029730 Entire home/apt 104.0 1 0
226026 -76.990980 Entire home/apt 151.0 2 0
226027 -77.033540 Entire home/apt 240.0 2 0
226028 -77.044360 Entire home/apt 60.0 21 0
226029 -77.009431 Entire home/apt 79.0 7 0
last_review reviews_per_month calculated_host_listings_count \
0 16 02 2020 1.14 1
1 07 09 2020 1.03 11
2 30 11 2019 0.81 2
3 22 09 2020 2.39 5
4 19 10 2015 0.52 1
... ... ... ...
226025 NaN NaN 2
226026 NaN NaN 1
226027 NaN NaN 1
226028 NaN NaN 5
226029 NaN NaN 2
availability_365 city
0 0 Asheville
1 288 Asheville
2 298 Asheville
3 0 Asheville
4 0 Asheville
... ... ...
226025 99 Washington D.C.
226026 300 Washington D.C.
226027 173 Washington D.C.
226028 362 Washington D.C.
226029 62 Washington D.C.
[226030 rows x 17 columns]
Display the first 5 rows of the `DataFrame`.
print(df.head())
In one or more code cells, display the following information of the `DataFrame`:
Following the display of this information, use a Markdown cell to write:
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
print("Column Names")
print(df.columns)
print("\nData Types")
print(df.dtypes)
Number of rows: 226030
Number of columns: 15
Column Names
Index(['name', 'Host Name', 'neighbourhood_group', 'neighbourhood', 'latitude',
'longitude', 'Room Type', 'price', 'minimum_nights',
'number_of_reviews', 'last_review', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365', 'city'],
dtype='object')
Data Types
name object
Host Name object
neighbourhood_group object
neighbourhood object
latitude float64
longitude float64
Room Type object
price float64
minimum_nights int64
number_of_reviews int64
last_review object
reviews_per_month float64
calculated_host_listings_count int64
availability_365 int64
city object
dtype: object
Markdown...
[15 Marks]
Drop the following features from the `DataFrame`:
Then, store and display the resulting `DataFrame` and write a statement, explaining why one might choose to exclude these features.
import pandas as pd

# The column is named 'neighbourhood_group' in the data, not 'Neighbourhood Group'
df_modified = df.drop(columns=['ID', 'Host ID', 'neighbourhood_group'])

df_modified

# One may choose to exclude these features because ID and Host ID are arbitrary identifiers
# that say nothing about a property's characteristics, and neighbourhood_group is missing
# for roughly half of the rows.
Rename the following columns:
Column | New name |
---|---|
Host Name | host_name |
Room Type | room_type |
reviews_per_month | rpm |
calculated_host_listings_count | listing_count |
availability_365 | availability |
Then, display the resulting `DataFrame` and write a statement, discussing why one might change the names in this way.
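The renaming in the table above can be sketched on a small stand-in frame. The one-row `toy` DataFrame and its values are hypothetical and only mimic the original column names; `DataFrame.rename(columns=...)` returns a new frame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical one-row frame that only mimics the column names in the table above.
toy = pd.DataFrame({
    "Host Name": ["A"],
    "Room Type": ["Private room"],
    "reviews_per_month": [1.5],
    "calculated_host_listings_count": [2],
    "availability_365": [100],
})

new_names = {
    "Host Name": "host_name",
    "Room Type": "room_type",
    "reviews_per_month": "rpm",
    "calculated_host_listings_count": "listing_count",
    "availability_365": "availability",
}
renamed = toy.rename(columns=new_names)  # returns a new DataFrame

print(list(renamed.columns))
print(renamed.host_name.iloc[0])  # attribute access works once the space is gone
```

One practical reason for names without spaces is visible in the last line: `renamed.host_name` works, whereas a column called `Host Name` can only be reached with bracket indexing.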
import pandas as pd

# Load data from CSV file
df = pd.read_csv('data/airbnb.csv')

# Check the column names in the DataFrame
print("Original column names:")
print(df.columns)

# Perform the renaming of columns
df = df.rename(columns={'Host Name': 'host_name',
                        'Room Type': 'room_type',
                        'reviews_per_month': 'RPM',
                        'calculated_host_listings_count': 'listing_count',
                        'availability_365': 'availability',
                        })

# Verify the renamed column names
print("\nRenamed column names:")
print(df.columns)

"""Renaming the columns in this way makes the headings consistent: previously some were capitalised and contained spaces (e.g. Host Name, Room Type), while others were already lowercase with underscores (e.g. minimum_nights).
Names without spaces can also be used with attribute access (e.g. df.host_name), and the shorter names (e.g. listing_count, availability) are quicker to type and read."""
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\272338296.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
Original column names:
Index(['ID', 'name', 'Host ID', 'Host Name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'Room Type', 'price',
'minimum_nights', 'number_of_reviews', 'last_review',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365', 'city'],
dtype='object')
Renamed column names:
Index(['ID', 'name', 'Host ID', 'host_name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
'minimum_nights', 'number_of_reviews', 'last_review', 'RPM',
'listing_count', 'availability', 'city'],
dtype='object')
'Renaming the columns in this way makes the headings consistent: previously some were capitalised and contained spaces (e.g. Host Name, Room Type), while others were already lowercase with underscores (e.g. minimum_nights).\nNames without spaces can also be used with attribute access (e.g. df.host_name), and the shorter names (e.g. listing_count, availability) are quicker to type and read.'
Use an `f-string` to `print()` the number of irrelevant duplicate rows (count repeats only, not both originals and repeats).
Drop the duplicated rows, retain the resulting `DataFrame`, and show the resulting number of rows in the new `DataFrame`.
import pandas as pd

# reload dataset
df = pd.read_csv('data/airbnb.csv')

# renaming columns code
new_column_names = {
    'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
}
df.rename(columns=new_column_names, inplace=True)

# counting duplicate rows
num_duplicate_rows = df.duplicated().sum()
print(f"Number of irrelevant duplicate rows: {num_duplicate_rows}")
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\1593544922.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
Number of irrelevant duplicate rows: 0
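The remaining steps of this sub-task (dropping the repeats and reporting the new row count) might look like the following minimal sketch. The four-row `toy` DataFrame is hypothetical and stands in for the real data so that the effect of each call is visible.

```python
import pandas as pd

# Hypothetical frame with two repeats of the first row.
toy = pd.DataFrame({
    "name": ["Loft", "Loft", "Cabin", "Loft"],
    "price": [90.0, 90.0, 60.0, 90.0],
})

n_repeats = toy.duplicated().sum()  # counts repeats only; the first occurrence is not flagged
print(f"Number of duplicate rows: {n_repeats}")

deduped = toy.drop_duplicates()     # keeps the first occurrence of each row
print(f"Rows after dropping duplicates: {len(deduped)}")
```

Because `duplicated()` uses `keep='first'` by default, the count covers repeats only, which matches the wording of the task.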
Report the number of null values remaining in each column.
import pandas as pd

# reload dataset
df = pd.read_csv('data/airbnb.csv')

# renaming columns code
new_column_names = {
    'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
}
df.rename(columns=new_column_names, inplace=True)

num_null_values = df.isnull().sum()
print(f"Number of null values: {num_null_values}")
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3555504588.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
Number of null values: ID 0
name 28
Host ID 0
host_name 33
neighbourhood_group 115845
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 9
minimum_nights 0
number_of_reviews 0
last_review 48602
RPM 48602
listing_count 0
availability 0
city 0
dtype: int64
For the price column, replace missing values with the column mean while using a numpy routine to calculate the mean.
Drop all other rows that still contain missing values.
Display the final number of rows left in the resulting DataFrame.
import pandas as pd
import numpy as np
# Load data from CSV file
df = pd.read_csv('data/airbnb.csv')
# Calculate the mean of the 'price' column using a numpy routine
price_mean = np.nanmean(df['price'])
# Replace missing values in the 'price' column with the calculated mean
df['price'].fillna(price_mean, inplace=True)
# Drop all other rows that still contain missing values
df.dropna(inplace=True)
# Display the final number of rows left in the resulting DataFrame
num_rows_left = len(df)
print(f"Final number of rows left in the resulting DataFrame: {num_rows_left}")
# Optionally, display the modified DataFrame
print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\987998301.py:5: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\987998301.py:11: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
df['price'].fillna(price_mean, inplace=True)
Final number of rows left in the resulting DataFrame: 85144
ID name \
48150 5065 MAUKA BB
48151 5269 Upcountry Hospitality in the 'Auwai Suite
48152 5387 Hale Koa Studio & 1 Bedroom Units!!
48153 5389 Keauhou Villa
48154 5390 STAY AT PRINCE KUHIO!
... ... ...
212157 43531859 Chic Seattle Apartment near Greenlake
212159 43554835 SL6 - Private·Modern·Quality·Convenient·N ...
212160 43554849 SL4 - Modern,Private,Quality,Convenient,N Seattle
212161 43589616 Perfect spot for group events or family stays!
212164 43603849 Rooftop Tent in the Heart of Seattle
Host ID Host Name neighbourhood_group neighbourhood latitude \
48150 7257 Wayne Hawaii Hamakua 20.04095
48151 7620 Lea & Pat Hawaii South Kohala 20.02740
48152 7878 Edward Hawaii South Kona 19.43119
48153 7878 Edward Hawaii North Kona 19.56413
48154 7887 Todd Kauai Koloa-Poipu 21.88305
... ... ... ... ... ...
212157 6601753 Isabel Other neighborhoods Roosevelt 47.68457
212159 287025852 Li Lake City Olympic Hills 47.73462
212160 287025852 Li Lake City Olympic Hills 47.73357
212161 347974040 Marcus Delridge Highland Park 47.51284
212164 51816582 Fei Queen Anne East Queen Anne 47.63913
longitude Room Type price minimum_nights number_of_reviews \
48150 -155.43251 Entire home/apt 85.0 2 42
48151 -155.70200 Entire home/apt 124.0 30 10
48152 -155.88079 Entire home/apt 85.0 5 168
48153 -155.96347 Entire home/apt 239.0 6 20
48154 -159.47372 Entire home/apt 92.0 3 143
... ... ... ... ... ...
212157 -122.31550 Entire home/apt 100.0 3 1
212159 -122.29509 Private room 79.0 1 2
212160 -122.29651 Entire home/apt 79.0 1 1
212161 -122.33587 Entire home/apt 200.0 1 2
212164 -122.34293 Private room 35.0 1 2
last_review reviews_per_month calculated_host_listings_count \
48150 22 03 2020 0.45 2
48151 01 03 2020 0.09 5
48152 18 03 2020 1.30 3
48153 22 03 2020 0.24 3
48154 10 08 2020 1.03 1
... ... ... ...
212157 06 06 2020 1.00 1
212159 10 06 2020 2.00 3
212160 13 06 2020 1.00 3
212161 14 06 2020 2.00 1
212164 14 06 2020 2.00 2
availability_365 city
48150 365 Hawaii
48151 261 Hawaii
48152 242 Hawaii
48153 287 Hawaii
48154 116 Hawaii
... ... ...
212157 99 Seattle
212159 337 Seattle
212160 351 Seattle
212161 330 Seattle
212164 0 Seattle
[85144 rows x 17 columns]
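The FutureWarning above is raised by calling `fillna(..., inplace=True)` on a column selection. A hedged sketch of the assignment form that the warning itself suggests is shown here on made-up data (`toy` is an assumption, not the AirBnB file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price and one missing name
toy = pd.DataFrame({"price": [100.0, np.nan, 200.0],
                    "name": ["a", "b", None]})

# np.nanmean ignores NaN entries when averaging
price_mean = np.nanmean(toy["price"])

# Assign the filled column back instead of using inplace=True on a selection
toy["price"] = toy["price"].fillna(price_mean)

# Drop rows that still contain missing values in any other column
cleaned = toy.dropna()
print(f"Rows left: {len(cleaned)}")
```

Assigning the result back avoids relying on the chained in-place behaviour that the warning says will change in pandas 3.0.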
In the previous step, we eliminated entries with missing data from the dataset. If this dataset were to be used to train a machine learning model to predict price for one night, what is a potential drawback of this approach? Describe an alternative approach and any related caveats of which we should be aware.
Dropping these rows shrinks the dataset and, depending on which entries are removed, can bias the model and affect its ability to be trained fairly, because some categories will be left with significantly less data than others.
An alternative to removing rows with missing values is regression imputation, which predicts each missing value from a regression model fitted on the other variables. It should be used selectively, only where there is a strong relationship between the variables.
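As an illustration of regression imputation, the sketch below fills a missing `reviews_per_month` value from `number_of_reviews` using a straight-line fit. The data values and the choice of predictor are assumptions made for this example only. Whether such a relationship is strong enough in the real dataset would need checking first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: reviews_per_month is missing for one listing
toy = pd.DataFrame({
    "number_of_reviews": [10.0, 20.0, 30.0, 40.0, 50.0],
    "reviews_per_month": [1.0, 2.0, np.nan, 4.0, 5.0],
})

known = toy["reviews_per_month"].notna()

# Fit reviews_per_month = slope * number_of_reviews + intercept on complete rows only
slope, intercept = np.polyfit(toy.loc[known, "number_of_reviews"],
                              toy.loc[known, "reviews_per_month"], deg=1)

# Predict the missing entries from the fitted line
toy.loc[~known, "reviews_per_month"] = (
    slope * toy.loc[~known, "number_of_reviews"] + intercept
)
print(toy)
```

Imputed values sit exactly on the fitted line, so they understate the natural spread of the data. This is one of the caveats to keep in mind when the imputed column later feeds a model.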
[5 Marks]
name_len [2 marks]
Using an implementation of list comprehension, create a new column, name_len, such that, if a property's name length is greater than or equal to 35 characters, name_len is 'verbose'. Otherwise, name_len is 'brief'.
Then, display the modified DataFrame and report the number of entries in these two new categories.
import pandas as pd
# Load data from CSV file
df = pd.read_csv('data/airbnb.csv')
#df = pd.DataFrame()
# Create a new column 'name_len' using list comprehension with a conditional
df['name_len'] = ['verbose' if len(name) >= 35 else 'brief' for name in df['name']]
# Display the modified DataFrame
print(df)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\1541629390.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[68], line 10
4 df = pd.read_csv('data/airbnb.csv')
7 #df = pd.DataFrame()
8
9 # Create a new column 'name_len' using list comprehension with a conditional
---> 10 df['name_len'] = ['verbose' if len(name) >= 35 else 'brief' for name in df['name']]
12 # Display the modified DataFrame
13 print(df)
TypeError: object of type 'float' has no len()
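The TypeError arises because missing names are stored as NaN, which is a float, so `len()` cannot be applied to them. One possible fix is sketched below on made-up data. Labelling a missing name as 'brief' is an assumption of this sketch, and dropping such rows first would be an equally reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data including one missing name
toy = pd.DataFrame({"name": ["Rooftop Tent in the Heart of Seattle",
                             "Keauhou Villa",
                             np.nan]})

# Only call len() on real strings; anything else (such as NaN) is labelled 'brief'
toy["name_len"] = [
    "verbose" if isinstance(name, str) and len(name) >= 35 else "brief"
    for name in toy["name"]
]

print(toy)
# Report the number of entries in each of the two categories
print(toy["name_len"].value_counts())
```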
price_class [2 marks]
Using an implementation of a function, create a new column, price_class, such that it is assigned a category according to the following price ranges:
price is greater than or equal to $5,000
price is between $1,000 (inclusive) and $5,000 (exclusive)
price is between $200 (inclusive) and $1,000 (exclusive), and
price is below $200.
Then, display the modified DataFrame and report the number of entries in these new categories.
import pandas as pd
def classify_price(price):
    if price >= 5000:
        return 'astronomical'
    elif 1000 <= price < 5000:
        return 'high'
    elif 200 <= price < 1000:
        return 'mid'
    else:
        # Note: a missing price (NaN) fails every comparison above and also lands here
        return 'low'
# Assuming you have a DataFrame called 'df' with a column 'price'
df['price_class'] = df['price'].apply(classify_price)
print(df)
ID name \
0 38585 Charming Victorian home - twin beds + breakfast
1 80905 French Chic Loft
2 108061 Walk to stores/parks/downtown. Fenced yard/Pet...
3 155305 Cottage! BonPaul + Sharky's Hostel
4 160594 Historic Grove Park
... ... ...
226025 45506143 DC Hidden In Plain "Site"
226026 45511428 DC 3 BR w/ screen porch 3 blck to metro w/ par...
226027 45514685 Charming Penthouse Apt w/ Rooftop Terrace in L...
226028 45516412 Adams Morgan/Nat'l Zoo 1 BR Apt #32
226029 45517735 Beautiful large one-bedroom w/ washer and dryer
Host ID Host Name neighbourhood_group \
0 165529 Evelyne NaN
1 427027 Celeste NaN
2 320564 Lisa NaN
3 746673 BonPaul NaN
4 769252 Elizabeth NaN
... ... ... ...
226025 25973146 Marci NaN
226026 231133074 Thomas NaN
226027 33758935 Bassem NaN
226028 23193071 Michael NaN
226029 17789858 Adam NaN
neighbourhood latitude \
0 28804 35.651460
1 28801 35.597790
2 28801 35.606700
3 28806 35.578640
4 28801 35.614420
... ... ...
226025 Downtown, Chinatown, Penn Quarters, Mount Vern... 38.903880
226026 Brookland, Brentwood, Langdon 38.920820
226027 Shaw, Logan Circle 38.911170
226028 Kalorama Heights, Adams Morgan, Lanier Heights 38.926630
226029 Edgewood, Bloomingdale, Truxton Circle, Eckington 38.911569
longitude Room Type price minimum_nights number_of_reviews \
0 -82.627920 Private room 60.0 1 138
1 -82.555400 Entire home/apt 470.0 1 114
2 -82.555630 Entire home/apt NaN 30 89
3 -82.595780 Entire home/apt 90.0 1 267
4 -82.541270 Private room 125.0 30 58
... ... ... ... ... ...
226025 -77.029730 Entire home/apt 104.0 1 0
226026 -76.990980 Entire home/apt 151.0 2 0
226027 -77.033540 Entire home/apt 240.0 2 0
226028 -77.044360 Entire home/apt 60.0 21 0
226029 -77.009431 Entire home/apt 79.0 7 0
last_review reviews_per_month calculated_host_listings_count \
0 16 02 2020 1.14 1
1 07 09 2020 1.03 11
2 30 11 2019 0.81 2
3 22 09 2020 2.39 5
4 19 10 2015 0.52 1
... ... ... ...
226025 NaN NaN 2
226026 NaN NaN 1
226027 NaN NaN 1
226028 NaN NaN 5
226029 NaN NaN 2
availability_365 city price_class
0 0 Asheville low
1 288 Asheville mid
2 298 Asheville low
3 0 Asheville low
4 0 Asheville low
... ... ... ...
226025 99 Washington D.C. low
226026 300 Washington D.C. low
226027 173 Washington D.C. mid
226028 362 Washington D.C. low
226029 62 Washington D.C. low
[226030 rows x 18 columns]
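The task also asks for the number of entries per category, and the output above shows a missing price (row 2) being labelled 'low'. The following sketch keeps missing prices missing and then counts each class. It uses made-up data and reuses the category labels from the cell above.

```python
import numpy as np
import pandas as pd

def classify_price(price):
    """Return a price category; a missing price stays missing."""
    if pd.isna(price):
        return np.nan
    if price >= 5000:
        return "astronomical"
    if price >= 1000:
        return "high"
    if price >= 200:
        return "mid"
    return "low"

# Hypothetical toy prices, one of them missing
toy = pd.DataFrame({"price": [60.0, 470.0, np.nan, 1200.0, 25000.0]})
toy["price_class"] = toy["price"].apply(classify_price)

# value_counts() reports entries per category and skips NaN by default
counts = toy["price_class"].value_counts()
print(counts)
```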
Save the cumulatively modified DataFrame to a new comma-separated file called airbnb_modified.csv to be stored under the ./data/ directory.
Do not include the row indices in the file.
import pandas as pd
df = pd.read_csv('data/airbnb.csv')
airbnb_modified = df
#directory = 'CS2PP22_CW1_new/data/modified_data' #Specify directory to save file
directory = 'data/modified_data' #Specify directory to save file
file_path = directory + 'airbnb.csv' #Define file path; with no separator this is 'data/modified_dataairbnb.csv'
airbnb_modified.to_csv(file_path, index=False) #save modified dataset to new CSV file
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3801390647.py:3: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
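For comparison, here is a sketch of writing to the file name the task specifies, with the path built by `pathlib` so the separator cannot be dropped. A temporary folder stands in for the project root, and the `toy` frame is an assumption. In the notebook, the frame saved would be the cumulatively modified one rather than a fresh reload.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical stand-in for the cumulatively modified DataFrame
toy = pd.DataFrame({"ID": [1, 2], "price": [60.0, 470.0]})

# A temporary folder plays the role of the project root in this sketch
root = Path(tempfile.mkdtemp())
data_dir = root / "data"
data_dir.mkdir(parents=True, exist_ok=True)

# The "/" operator inserts the path separator that string concatenation omits
file_path = data_dir / "airbnb_modified.csv"
toy.to_csv(file_path, index=False)  # index=False leaves the row indices out

print(file_path.name)
print(pd.read_csv(file_path).columns.tolist())
```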
In this section, you will not need to retain any DataFrame modifications.
[15 Marks]
Find the mean price of all properties. Report the solution rounded to 2 decimal places.
import pandas as pd
# Calculate the mean price
mean_price = df['price'].mean()
# Report the solution rounded to 2 decimal places
mean_price_rounded = round(mean_price, 2)
print("Mean price of all properties:", mean_price_rounded)
Mean price of all properties: 219.71
Report the number of properties for each room type.
import pandas as pd
# Count the number of properties for each room type
properties_per_room_type = df['room_type'].value_counts()
print("Number of properties for each room type:")
print(properties_per_room_type)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'room_type'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[71], line 4
1 import pandas as pd
3 # Count the number of properties for each room type
----> 4 properties_per_room_type = df['room_type'].value_counts()
6 print("Number of properties for each room type:")
7 print(properties_per_room_type)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 'room_type'
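The KeyError indicates that the frame in memory has no 'room_type' column. Elsewhere in this notebook the column appears as 'Room Type', which suggests the rename was not applied to this `df`. The sketch below, on made-up data, applies the rename and checks for the column before counting.

```python
import pandas as pd

# Hypothetical toy data using the original column heading
toy = pd.DataFrame({"Room Type": ["Private room", "Entire home/apt", "Private room"]})

# The rename must be applied to the same frame that is queried afterwards
toy = toy.rename(columns={"Room Type": "room_type"})

# Guard against a missing column before indexing
assert "room_type" in toy.columns

properties_per_room_type = toy["room_type"].value_counts()
print("Number of properties for each room type:")
print(properties_per_room_type)
```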
Report a list of unique property cities, sorted in ascending alphabetical order.
import pandas as pd
# Extract unique property cities and sort them
unique_cities = sorted(df['city'].unique())
print("List of unique property cities sorted in ascending alphabetical order:")
print(unique_cities)
List of unique property cities sorted in ascending alphabetical order:
['Asheville', 'Austin', 'Boston', 'Broward County', 'Cambridge', 'Chicago', 'Clark County', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles', 'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem', 'San Clara Country', 'San Diego', 'San Francisco', 'San Mateo County', 'Santa Cruz County', 'Seattle', 'Twin Cities MSA', 'Washington D.C.']
Report the number of unique cities.
Display the mean price and mean number of reviews for each room type with each city by storing the information in a MultiIndex DataFrame. Report the values rounded to 2 decimal places.
Verify that the expected number of unique cities are represented in the resulting DataFrame.
import pandas as pd
# 1. Calculate the number of unique cities
num_unique_cities = df['city'].nunique()
# 2. Group the DataFrame by 'Room Type' and 'city'
grouped_data = df.groupby(['Room Type', 'city'])
# 3. Calculate the mean price and mean number of reviews for each group
mean_data = grouped_data.agg({'price': 'mean', 'number_of_reviews': 'mean'})
# 4. Store the information in a MultiIndex DataFrame
multiindex_df = mean_data.round(2)
# 5. Verify the expected number of unique cities
num_cities_in_df = multiindex_df.index.get_level_values('city').nunique()
expected_num_cities = num_unique_cities
print("Number of unique cities:", num_unique_cities)
print("\nMean price and mean number of reviews for each room type with each city:")
print(multiindex_df)
# Verify that the expected number of unique cities are represented in the resulting DataFrame
if num_cities_in_df == expected_num_cities:
    print("\nThe expected number of unique cities are represented in the resulting DataFrame.")
else:
    print("\nThe expected number of unique cities are not represented in the resulting DataFrame.")
Number of unique cities: 28
Mean price and mean number of reviews for each room type with each city:
price number_of_reviews
Room Type city
Entire home/apt Asheville 217.91 77.34
Austin 315.40 33.62
Boston 209.97 37.92
Broward County 262.55 21.42
Cambridge 224.80 49.45
... ... ...
Shared room San Mateo County 40.12 17.82
Santa Cruz County 51.29 24.57
Seattle 48.19 21.15
Twin Cities MSA 309.47 5.71
Washington D.C. 46.53 17.19
[109 rows x 2 columns]
The expected number of unique cities are represented in the resulting DataFrame.
Display in a single DataFrame the number of property entries for each room_type and city combination, with a Total column (showing the sum of the rows) and a Total row (showing the sum of the columns) in the margins.
import pandas as pd
# reload dataset
df = pd.read_csv('data/airbnb.csv')
# renaming columns code (defined here but not applied in this cell)
new_column_names = {
    'Host Name': 'host_name',
    'Room Type': 'room_type',
    'reviews_per_month': 'RPM',
    'calculated_host_listings_count': 'listing_count',
    'availability_365': 'availability',
}
# Create a cross-tabulation of 'Room Type' and 'city'
cross_tab = pd.crosstab(df['Room Type'], df['city'], margins=True, margins_name='Total')
# Rename the 'All' column to 'Total' if needed
cross_tab = cross_tab.rename(columns={'All': 'Total'})
# Rename the 'All' row to 'Total' if needed
cross_tab = cross_tab.rename(index={'All': 'Total'})
print("Number of property entries for each room_type and city combination:")
print(cross_tab)
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3283301359.py:4: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/airbnb.csv')
Number of property entries for each room_type and city combination:
city Asheville Austin Boston Broward County Cambridge \
Room Type
Entire home/apt 1684 8085 2162 8267 536
Hotel room 19 15 27 134 0
Private room 364 2202 1142 2295 484
Shared room 7 134 8 162 9
Total 2074 10436 3339 10858 1029
city Chicago Clark County Columbus Denver Hawaii ... Salem \
Room Type ...
Entire home/apt 4401 5721 1043 3186 20050 ... 101
Hotel room 73 330 3 35 195 ... 0
Private room 1833 2291 346 935 2119 ... 98
Shared room 90 66 17 44 70 ... 3
Total 6397 8408 1409 4200 22434 ... 202
city San Clara Country San Diego San Francisco \
Room Type
Entire home/apt 3378 9213 4282
Hotel room 2 25 98
Private room 3402 3013 2496
Shared room 309 153 177
Total 7091 12404 7053
city San Mateo County Santa Cruz County Seattle \
Room Type
Entire home/apt 1620 1236 5028
Hotel room 2 2 51
Private room 1064 326 1399
Shared room 169 7 97
Total 2855 1571 6575
city Twin Cities MSA Washington D.C. Total
Room Type
Entire home/apt 4563 5222 154173
Hotel room 13 41 1941
Private room 1845 1902 65887
Shared room 49 185 4029
Total 6470 7350 226030
[5 rows x 29 columns]
Display data corresponding to hosts with the name 'Mary'
who offer private rooms with verbose property names.
How many of these are present?
# Filter the DataFrame for hosts with the name 'Mary' and private rooms
filtered_df = df[(df['Host Name'] == 'Mary') & (df['Room Type'] == 'Private room')]
# Display verbose property names
verbose_property_names = filtered_df[['name', 'Host Name', 'Room Type']]
print("Data corresponding to hosts with the name 'Mary' who offer private rooms with verbose property names:")
print(verbose_property_names)
Data corresponding to hosts with the name 'Mary' who offer private rooms with verbose property names:
name Host Name \
58460 Diamond Head-Waikiki View Mary
73473 MY HOLLYWOOD SPACE RENTAL 15 Mary
73558 MY HOLLYWOOD SPACE RENTAL 16 Mary
73559 MY HOLLYWOOD SPACE RENTAL 17 Mary
73580 MY HOLLYWOOD SPACE RENTAL 18 Mary
73995 MY HOLLYWOOD SPACE RENTALS 3 Mary
73996 MY HOLLYWOOD SPACE RENTALS 5 Mary
73997 MY HOLLYWOOD SPACE RENTALS 4 Mary
73998 MY HOLLYWOOD SPACE RENTALS 2 Mary
73999 MY HOLLYWOOD SPACE RENTALS 1 Mary
74196 Hilltop 1 Bedroom View of LA Lights! Mary
74492 MY HOLLYWOOD SPACE RENTAL 12 Mary
75650 Spacious and Cozy Bedroom Mary
79432 Hilltop Gorgeous Views MasterSuite! Mary
86563 Cozy&bright room 5-10 min to Santa Monica Pier Mary
86592 Big private room&bathroom in heart of SM/Parking Mary
96107 Hotel Jackman Mary
100784 Private Bedroom near LAX, El Segundo, Beach Ci... Mary
102879 Private Bedroom close to LAX & Beach Cites! Mary
117624 Sunny, calm room in Victorian home Mary
125450 Large comfy room in Victorian home Mary
126692 A room for you in Williamsburg! Only for Women. Mary
130985 Clean bright room in spacious apt! Mary
131182 Bright, quiet room in 2br close to park/trains. Mary
131457 Comfy bed in Cozy Home - GRAND ARMY PLAZA Mary
131689 Great quiet room,great location Mary
139221 Cozy room in great Prospect Heights neighborhood. Mary
143458 Secured Apartment Queens NY - 15 mins from JFK. Mary
147407 ***Beautiful quiet small room in noisy city*** Mary
155106 Private Room in Upper West Side (no kitchen) Mary
158373 Full floor in Park Slope Brownstone duplex Mary
Room Type
58460 Private room
73473 Private room
73558 Private room
73559 Private room
73580 Private room
73995 Private room
73996 Private room
73997 Private room
73998 Private room
73999 Private room
74196 Private room
74492 Private room
75650 Private room
79432 Private room
86563 Private room
86592 Private room
96107 Private room
100784 Private room
102879 Private room
117624 Private room
125450 Private room
126692 Private room
130985 Private room
131182 Private room
131457 Private room
131689 Private room
139221 Private room
143458 Private room
147407 Private room
155106 Private room
158373 Private room
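The listing above includes short names such as 'Hotel Jackman', so the 'verbose' condition has not been applied, and no count is reported. The sketch below adds both steps on made-up data, taking the 35-character threshold from the earlier name_len task. The `toy` rows are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Hotel Jackman",
             "Private Room in Upper West Side (no kitchen)",
             "Sunny loft"],
    "Host Name": ["Mary", "Mary", "Lisa"],
    "Room Type": ["Private room", "Private room", "Private room"],
})

# A name is 'verbose' at 35 characters or more; missing names count as empty
is_verbose = toy["name"].fillna("").str.len() >= 35

mary_verbose = toy[(toy["Host Name"] == "Mary")
                   & (toy["Room Type"] == "Private room")
                   & is_verbose]

print(mary_verbose)
print(f"Number of matching entries: {len(mary_verbose)}")
```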
Display data corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' and 'Rita' that have a 30-night minimum stay together in a single DataFrame.
How many of these entries are present?
# Filter the DataFrame for properties north of 40 degrees latitude
# and hosted by 'Lori' or 'Rita' with a 30-night minimum stay
filtered_df = df[(df['latitude'] > 40) &
                 ((df['Host Name'] == 'Lori') | (df['Host Name'] == 'Rita')) &
                 (df['minimum_nights'] >= 30)]
# Print the number of entries present
num_entries = len(filtered_df)
print("Number of entries present:", num_entries)
# Display the filtered DataFrame
print("\nData corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' or 'Rita' with a 30-night minimum stay:")
print(filtered_df)
Number of entries present: 7
Data corresponding to properties located north of 40 degrees latitude for hosts named 'Lori' or 'Rita' with a 30-night minimum stay:
ID name \
128293 10192564 Room in Lovely, Spacious Upper West Side Apt.
206979 10295151 MODERN, CLEAN Capitol Hill LOCATION
207998 17589837 Modern 1 Bdrm Full Kitchen Retreat Near UW
208821 21717949 Bright, charming, newly remodeled apt.- W Seattle
209324 24560661 Capitol Hill & Broadway Home Room#A,Free Parking
209326 24561629 Capitol Hill & Broadway Home Room#B, Free Parking
209409 24996483 Capitol Hill & Broadway Home Room#C , Free Par...
Host ID Host Name neighbourhood_group neighbourhood latitude \
128293 52320041 Lori Manhattan Upper West Side 40.80114
206979 32689598 Lori Capitol Hill Broadway 47.62351
207998 117593676 Lori Other neighborhoods Bryant 47.66977
208821 158136273 Rita Delridge Riverview 47.54253
209324 18668299 Rita Capitol Hill Broadway 47.62411
209326 18668299 Rita Capitol Hill Broadway 47.62275
209409 18668299 Rita Capitol Hill Broadway 47.62226
longitude Room Type price minimum_nights number_of_reviews \
128293 -73.96772 Private room 70.0 60 17
206979 -122.32594 Entire home/apt 100.0 30 71
207998 -122.28766 Entire home/apt 130.0 30 25
208821 -122.35022 Entire home/apt 80.0 45 45
209324 -122.31979 Private room 90.0 30 51
209326 -122.31835 Private room 90.0 30 80
209409 -122.31990 Private room 80.0 30 68
last_review reviews_per_month calculated_host_listings_count \
128293 2018-09-05 0.34 1
206979 2020-03-03 1.31 1
207998 2019-09-08 0.65 1
208821 2019-08-28 1.47 1
209324 2019-08-31 1.97 6
209326 2019-09-02 3.16 6
209409 2019-08-30 2.71 6
availability_365 city
128293 212 New York City
206979 365 Seattle
207998 323 Seattle
208821 83 Seattle
209324 280 Seattle
209326 294 Seattle
209409 289 Seattle
For properties in Boston with a price greater than $250, report the number of entries for each room type category.
import pandas as pd
# Filter the DataFrame for properties in Boston with a price greater than $250
filtered_df = df[(df['city'] == 'Boston') & (df['price'] > 250)]
# Report the number of entries for each room type category
entries_per_room_type = filtered_df['Room Type'].value_counts()
print("Number of entries for each room type category for properties in Boston with a price greater than $250:")
print(entries_per_room_type)
Number of entries for each room type category for properties in Boston with a price greater than $250:
Series([], Name: count, dtype: int64)
For properties with a minimum stay of 3 or fewer nights, show the minimum and maximum price, as well as minimum and maximum reviews per month for each city.
Display the values as integers in a single DataFrame.
import pandas as pd
# Filter the DataFrame for properties with a minimum stay of 3 or fewer nights
filtered_df = df[df['minimum_nights'] <= 3]
# Group the filtered DataFrame by city and calculate the required statistics
grouped_data = filtered_df.groupby('city').agg({
    'price': ['min', 'max'],
    'reviews_per_month': ['min', 'max']
})
# Convert the calculated values to integers
grouped_data = grouped_data.astype(int)
# Rename the columns for better clarity
grouped_data.columns = ['min_price', 'max_price', 'min_reviews_per_month', 'max_reviews_per_month']
# Display the DataFrame
print("Minimum and maximum price, and minimum and maximum reviews per month for properties with a minimum stay of 3 or fewer nights in each city:")
print(grouped_data)
Minimum and maximum price, and minimum and maximum reviews per month for properties with a minimum stay of 3 or fewer nights in each city:
min_price max_price min_reviews_per_month \
city
Hawaii 10 25000 0
Los Angeles 10 4280 0
New York City 0 10000 0
Seattle 10 1650 0
max_reviews_per_month
city
Hawaii 17
Los Angeles 33
New York City 44
Seattle 15
Determine which city has the highest mean property price. Display this city and its corresponding mean price (rounded to 2 decimal places).
import pandas as pd
# Group the DataFrame by city and calculate the mean price for each city
mean_price_per_city = df.groupby('city')['price'].mean()
# Find the city with the highest mean price
highest_mean_price_city = mean_price_per_city.idxmax()
highest_mean_price = mean_price_per_city.max()
# Display the city with the highest mean price and its corresponding mean price (rounded to 2 decimal places)
print("City with the highest mean property price:")
print("City:", highest_mean_price_city)
print("Mean Price:", round(highest_mean_price, 2))
City with the highest mean property price:
City: Hawaii
Mean Price: 207.95
[40 Marks]
Design a Market class to represent groups of investors in the AirBnB market within individual cities. The investors can collect portfolios of rental properties by purchasing them with a budget that you will allocate.
Once investor characteristics are set, they will compete by reporting portfolio scores each month during a simulated year. These values will determine a winning investor for the year.
You will operate at least 2 instances of this class in different markets to compare the performance of winners in each area.
Begin with the data you have saved in the airbnb_modified.csv file.
Read in and modify this data so that:
DataFrame.
DataFrame.
Investors will be able to choose properties for their portfolio from this modified DataFrame.
import pandas as pd
import numpy as np
df = pd.read_csv('data/modified_dataairbnb.csv')
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\3646692886.py:1: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('data/modified_dataairbnb.csv')
df['price'] = df['price'].round(-1)
# Convert the 'last_review' column to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
C:\Users\OttosMum\AppData\Local\Temp\ipykernel_18424\2186937433.py:2: UserWarning: Parsing dates in %d %m %Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
df['last_review'] = pd.to_datetime(df['last_review'])
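The UserWarning above can be avoided by stating the date format explicitly, which also removes any doubt over which field is the day and which the month. This is a minimal sketch on made-up values laid out like those in the warning (day, month, year separated by spaces):

```python
import pandas as pd

# Hypothetical toy column in 'DD MM YYYY' layout, with one missing entry
toy = pd.DataFrame({"last_review": ["16 02 2020", "07 09 2020", None]})

# An explicit format removes the day/month ambiguity; missing entries become NaT
toy["last_review"] = pd.to_datetime(toy["last_review"], format="%d %m %Y")

print(toy["last_review"])
```

With the format given, '07 09 2020' is read as 7 September 2020 rather than 9 July 2020.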
# Filter the DataFrame to retain only properties with their last reviews in March, August, and September
df = df[df['last_review'].dt.month.isin([3, 8, 9])]
# Filter the DataFrame to retain only cities with more than 960 entries
cities_counts = df['city'].value_counts()
cities_to_keep = cities_counts[cities_counts > 960].index
df = df[df['city'].isin(cities_to_keep)]
# Display the modified DataFrame
print("Modified DataFrame:")
print(df)
Modified DataFrame:
ID name \
1 80905 French Chic Loft
3 155305 Cottage! BonPaul + Sharky's Hostel
10 304379 Refocus Cottage - paradise
11 353092 Athena's Loft: Find yourself here!
12 427497 Luxurious Mountain Guest Suite Apartment
... ... ...
225936 45326130 Modern and cozy home located in Washington DC
225950 45349877 Penthouse w/ Patio ☆ Capitol Hill Condo ☆ ...
225955 45352724 MODERN ☆ Well-Located Shaw Town Home ☆ 2BR...
225964 45385834 Brand new modern apartment with private entry
226002 45472150 LUXE 1-BR STUDIO SPACE / Desirable Location + ...
Host ID Host Name neighbourhood_group \
1 427027 Celeste NaN
3 746673 BonPaul NaN
10 1566145 Gayle NaN
11 1788071 Beth NaN
12 1909922 Milan NaN
... ... ... ...
225936 55489711 Amir NaN
225950 3850096 Ije NaN
225955 3850096 Ije NaN
225964 16561471 Victor NaN
226002 367917574 Tara NaN
neighbourhood latitude \
1 28801 35.59779
3 28806 35.57864
10 28804 35.64453
11 28806 35.58217
12 28803 35.49111
... ... ...
225936 Union Station, Stanton Park, Kingman Park 38.90101
225950 Capitol Hill, Lincoln Park 38.88703
225955 Howard University, Le Droit Park, Cardozo/Shaw 38.91626
225964 Brightwood Park, Crestwood, Petworth 38.94358
226002 Edgewood, Bloomingdale, Truxton Circle, Eckington 38.91457
longitude Room Type price minimum_nights number_of_reviews \
1 -82.55540 Entire home/apt 470.0 1 114
3 -82.59578 Entire home/apt 90.0 1 267
10 -82.52586 Entire home/apt 290.0 30 24
11 -82.59997 Entire home/apt 80.0 4 497
12 -82.48438 Entire home/apt 120.0 2 40
... ... ... ... ... ...
225936 -77.00283 Entire home/apt 140.0 1 1
225950 -77.00586 Entire home/apt 130.0 2 1
225955 -77.02074 Entire home/apt 110.0 2 1
225964 -77.01283 Entire home/apt 80.0 1 1
226002 -77.00911 Entire home/apt 80.0 1 1
last_review reviews_per_month calculated_host_listings_count \
1 2020-09-07 1.03 11
3 2020-09-22 2.39 5
10 2019-08-03 0.23 2
11 2020-03-15 5.76 1
12 2020-08-17 0.42 1
... ... ... ...
225936 2020-09-18 1.00 1
225950 2020-09-16 1.00 8
225955 2020-09-13 1.00 8
225964 2020-09-19 1.00 1
226002 2020-09-18 1.00 1
availability_365 city
1 288 Asheville
3 0 Asheville
10 0 Asheville
11 315 Asheville
12 339 Asheville
... ... ...
225936 328 Washington D.C.
225950 162 Washington D.C.
225955 171 Washington D.C.
225964 75 Washington D.C.
226002 173 Washington D.C.
[70758 rows x 17 columns]
##################################
Market class [30 marks]
You are to translate these requirements (with consideration of their usage, defined in 3.3 and 3.4) into working code.
Overall: Coding efficiency and structure, including comments and docstrings, where appropriate, will contribute to the mark in this section.
The Market class is initialised with:
DataFrame of AirBnB properties, as designed above (3.1).
DataFrame
raise the appropriate type of exception, and a message saying, "The number of investors must be an integer."
assert that this value is positive and non-zero.
Include sensible object representation dunder methods (i.e., __repr__ and __str__).
There is a method to select_neighbourhoods.
size containing the most properties in the city.
low, high, and incr keyword parameters to this select_neighbourhoods method. Choose sensible default values for these.
Market class.
There is a method to generate_investors.
Investor objects inside the Market class.
The investors are members of the Investor class, a class internal to the Market class.
Investor object holds information about its:
host_name entries in this Market's cities).
__str__ representation.
There is a method to initialise_portfolios.
Investors to each purchase their initial portfolios of rental properties.
There is a method to _purchase_portfolio.
Investor
object.Investor
, isolate
properties that are available from their
neighbourhood.Investor
budget, select the set of
properties representing the optimal choice for their
budget. That is, purchase as many properties as the budget permits,
while trying to maximise a property's potential return
value.
Investor
's budget
and inventory should be updated._purchase_inventory
action.$$$$
simulate_year (i.e., simulate a year of investing and collecting rental returns).
Investor performance metrics (e.g., their score and monthly ranking in the city category).
Investor to _purchase_inventory again (increasing the number of properties in their inventory) before the next month.
Investor's existing inventory, but only one of any kind of property is permitted to be purchased in an individual _purchase_inventory action.
Market.champion Investor.
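The monthly cycle described above could be sketched along these lines. Everything specific here is an assumption made for illustration: investors are plain dicts with the hypothetical keys `name`, `score` and `monthly_income`, the score is simply accumulated income, the optional `purchase` callable stands in for the purchase step, and the champion is taken to be the investor with the highest score at the end of the year.

```python
def simulate_year(investors, purchase=None, months=12):
    """Run `months` rounds and return (champion, win_record).

    `investors` is a list of dicts with the assumed keys 'name', 'score'
    and 'monthly_income'. `purchase` is an optional callable applied to
    each investor once the month's ranking has been recorded.
    """
    win_record = []  # one {name: rank} dict per month
    for _ in range(months):
        # Collect this month's rental income into each running score.
        for inv in investors:
            inv["score"] += inv["monthly_income"]
        # Rank by score, highest first; rank 1 is the monthly leader.
        ranked = sorted(investors, key=lambda inv: inv["score"], reverse=True)
        win_record.append(
            {inv["name"]: pos for pos, inv in enumerate(ranked, start=1)}
        )
        # Let each investor buy again before the next month begins.
        if purchase is not None:
            for inv in investors:
                purchase(inv)
    champion = max(investors, key=lambda inv: inv["score"])
    return champion, win_record
```

Returning the per-month rankings alongside the champion gives a method such as show_win_record something to display.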
import pandas as pd
import numpy as np
from random import choice, sample, randint


class Market:
    def __init__(self, properties_df, city_name, num_investors=10):
        # Check if properties_df has any missing values
        assert not properties_df.isnull().values.any(), "DataFrame must not contain missing values"
        # Initialize properties DataFrame, city name, and number of investors
        self.properties_df = properties_df
        self.city_name = city_name
        self.num_investors = num_investors
        # Check if the number of investors is an integer
        if not isinstance(self.num_investors, int):
            raise TypeError("The number of investors must be an integer.")
        # Check if the number of investors is positive and non-zero
        assert self.num_investors > 0, "The number of investors must be positive and non-zero."
        # Ensure that the number of investors is evenly divisible by 5
        assert self.num_investors % 5 == 0, "The number of investors must be evenly divisible by 5."
        # Select neighbourhoods containing the most properties in the city
        self.selected_neighbourhoods = self.select_neighbourhoods()
        # Generate investors
        self.generate_investors()
        # Initialise portfolios for investors
        self.initialise_portfolios()
        # Store champion investor
        self.champion = None

    def select_neighbourhoods(self, size=None, specified_neighbourhoods=None):
        # Implementation to select neighbourhoods
        pass

    def generate_investors(self):
        # Implementation to generate investors
        pass

    def initialise_portfolios(self):
        # Implementation to initialise portfolios
        pass

    def simulate_year(self):
        # Implementation to simulate a year
        pass

    def _purchase_portfolio(self, investor):
        # Implementation for purchasing portfolio
        pass
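The select_neighbourhoods stub above is left empty. A minimal sketch of the size-based part of the requirement (pick the neighbourhoods containing the most properties in the city) is given below. It uses plain dictionaries and `collections.Counter` so that it runs without pandas; the keys `city` and `neighbourhood` and the default `size=5` are assumptions, and the `low`, `high` and `incr` parameters are left out because the sketch covers only the size-based selection.

```python
from collections import Counter


def select_neighbourhoods(rows, city, size=5):
    """Return the `size` neighbourhoods of `city` with the most listings.

    `rows` is an iterable of dicts with the assumed keys 'city' and
    'neighbourhood'; `size=5` is an arbitrary default.
    """
    counts = Counter(r["neighbourhood"] for r in rows if r["city"] == city)
    # most_common sorts by count, highest first.
    return [name for name, _ in counts.most_common(size)]
```

With a DataFrame, filtering on the city column and calling `value_counts()` on the neighbourhood column, then taking `head(size)`, should give the same ranking.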
Market class [5 marks]
m1 = Market(air, "Hawaii")
m1.select_neighbourhoods()
m1.generate_investors()
m1.initialise_portfolios()
m1.simulate_year()
print(f'The champion of {m1.name} Market is the {m1.champion}')
...
The champion of Hawaii Market is Kaulana with a score of 24298578.
Market monthly scores and rankings by invoking the Market class show_win_record method
m1.show_win_record()
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
# If there are missing values, handle them accordingly
if missing_values.any():
    # You can either drop rows with missing values
    df = df.dropna()
# Create a Market instance
m1 = Market(df, "Hawaii")
# Select neighbourhoods
m1.select_neighbourhoods()
# Generate investors
m1.generate_investors()
# Initialise portfolios
m1.initialise_portfolios()
# Simulate a year
m1.simulate_year()
# Print the champion of the market
print(f'The champion of {m1.city_name} Market is the {m1.champion}')
# Show win record
m1.show_win_record()
The champion of Hawaii Market is the None
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[110], line 29
26 print(f'The champion of {m1.city_name} Market is the {m1.champion}')
28 # Show win record
---> 29 m1.show_win_record()
AttributeError: 'Market' object has no attribute 'show_win_record'
Market executions [5 marks]
Market instances in full.
.champions and produce a ranked representation of the different Markets. For example, when we rank markets by the score of their overall champion, we might produce:
Market rankings:
Position  Name    Champion  Score
1         Hawaii  Kaulana   24298578
2         Boston  Nat       21440474
class Investor:
    def __init__(self, name):
        self.name = name


class Market:
    def __init__(self):
        self.champion = None  # Assume champion is not set initially

    def set_champion(self, investor):
        self.champion = investor


# Create a Market instance
market = Market()
# The champion is None until one is set
print("Champion before setting:", market.champion)
# Try to access the 'name' attribute on the champion
try:
    champion_name = market.champion.name
    print("Champion name:", champion_name)
except AttributeError:
    print("Champion is None, cannot access 'name' attribute")
# Set a valid investor as the champion
valid_investor = Investor("John")
market.set_champion(valid_investor)
# Now try to access the 'name' attribute again
champion_name = market.champion.name
print("Champion name:", champion_name)
print(f"{i}\t\t{market['Name']}\t\t{market['Champion']}\t\t{market['Score']}")
Champion before setting: None
Champion is None, cannot access 'name' attribute
Champion name: John
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[116], line 34
31 champion_name = market.champion.name
32 print("Champion name:", champion_name)
---> 34 print(f"{i}\t\t{market['Name']}\t\t{market['Champion']}\t\t{market['Score']}")
NameError: name 'i' is not defined
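The final print call fails because no loop ever defines `i`, and `market` is a Market object rather than a dictionary with 'Name', 'Champion' and 'Score' keys. A minimal sketch of the ranking step is shown below, assuming each market's result has already been reduced to a `(name, champion, score)` tuple. The helper names `rank_markets` and `format_rankings` are hypothetical; the sample values are the ones from the example ranking in the task description.

```python
def rank_markets(champions):
    """Sort (market, champion, score) tuples by score, highest first."""
    return sorted(champions, key=lambda row: row[2], reverse=True)


def format_rankings(champions):
    """Build the ranking table as a single string."""
    lines = [
        "Market rankings:",
        f"{'Position':<10}{'Name':<10}{'Champion':<10}{'Score':>10}",
    ]
    # enumerate supplies the position counter that the failing line expected as `i`.
    for i, (name, champion, score) in enumerate(rank_markets(champions), start=1):
        lines.append(f"{i:<10}{name:<10}{champion:<10}{score:>10}")
    return "\n".join(lines)


results = [("Boston", "Nat", 21440474), ("Hawaii", "Kaulana", 24298578)]
print(format_rankings(results))
```

Fixed-width format specifiers are used instead of tab characters so the columns stay aligned when names differ in length.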