Python Data Structures Compared – KDnuggets


Figure
Photo by Hitesh Choudhary on Unsplash

 

Choosing a structure for storing your data is an important part of solving programming tasks and implementing solutions, yet it is often not given the attention a choice of such potential weight deserves. Unfortunately, in Python I often see the list used as a catch-all data structure. The list has its advantages, of course, but also its drawbacks. There are lots of other data structure options as well.

Let’s take a look at 5 different Python data structures and see how they could be used to store data we might be processing in our everyday tasks, as well as the relative memory they use for storage and time they take to create and access.

 

Types of Data Structures

 
First, let’s lay out the 5 data structures we will consider herein, and provide some preliminary insight.

 
Class

In this case we are talking about vanilla classes (as opposed to data classes below), which are described at a high level in Python documentation as follows:

Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.

 

The advantages of using classes is that they are conventional, and are well-used and -understood. Whether or not they are overkill in terms of relative required memory or time is something to look at.

 
Data Class

Added in Python 3.7, the data class is a special class meant for mainly holding data, which comes with some freebie methods out of the box for typical functionality like instantiating and printing instance contents. Creating a data class is accomplished using the @dataclass decorator.

Although they use a very different mechanism, Data Classes can be thought of as “mutable namedtuples with defaults”. Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features. Such a class is called a Data Class, but there’s really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.

 

As you can see, the automatically generated methods and the related time-savings are the main reason to consider data classes.

 
Named Tuple

Named tuples are an elegant implementation of a useful data structure, essentially tuple subclasses with named fields.

Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. They can be used wherever regular tuples are used, and they add the ability to access fields by name instead of position index.

 

At first look, named tuples appear to be the closest thing to simple C-like struct types natively available in Python, making them naively attractive to many.

 
Dictionary

The Python dictionary is a collection of key-value pairs.

Dictionaries are mutable unordered collections (they do not record element position or order of insertion) of key-value pairs. Keys within the dictionary must be unique and must be hashable. That includes types like numbers, strings and tuples. Lists and dicts can not be used as keys since they are mutable.

 

The advantage of dictionaries is that they are simple, the data within are easily accessible, and they are well-used and -understood.

 
List

Here it is, the one-size-fits-all Python data superstructure, or so lots of code would have you believe. Here is what the list really is:

Lists are mutable ordered and indexed collections of objects. The items of a list are arbitrary Python objects. Lists are formed by placing a comma-separated list of expressions in square brackets.

 

Why the widespread use of the list? It’s very simple to understand and implement, and is usually the first structure one learns when picking up Python. Are there disadvantages related to speed and memory usage? Let’s take a look.

 

Implementations

 
First off, let’s have a look at the creation process of each of these structures and how they compare to one another.

The reason we might be using any of these data structures to store our data would vary widely, but for the unimaginative, let’s imagine we are extracting data from a SQL database and need to store each record in one such structure in order to perform some processing prior to moving the data further along our pipeline.

With that in mind, here is instantiation code for creating each of the five structures.

""" class """
class CustomerClass:
    def __init__(self, cust_num:str, f_name:str, l_name:str, address:str,
                       city:str, state:str, phone:str, age:int):
        self.cust_num = cust_num
        self.f_name = f_name
        self.l_name = l_name
        self.address = address
        self.city = city
        self.state = state
        self.phone = phone
        self.age = age

    def to_string(self):
        return(f'{self.cust_num}, {self.f_name}, {self.l_name}, {self.age}, 
                 {self.address}, {self.city}, {self.state}, {self.phone}'stgrcutures)

""" data class """
from dataclasses import dataclass
@dataclass
class CustomerDataClass:
    cust_num: int
    f_name: str
    l_name: str
    address: str
    city: str
    state: str
    phone: str
    age: int

""" named tuple """
from collections import namedtuple
CustomerNamedTuple = namedtuple('CustomerNamedTuple', 
                                'cust_num f_name l_name address city state phone age')

""" dict """
def make_customer_dict(cust_num: int, f_name: str, l_name: str, address: str, 
                       city: str, state: str, phone: str, age: int):
    return {'cust_num': cust_num,
            'f_name': f_name,
            'l_name': l_name,
            'address': address,
            'city': city,
            'state': state,
            'phone': phone,
            'age': age}

""" list """
def make_customer_list(cust_num: int, f_name: str, l_name: str, address: str, 
                       city: str, state: str, phone: str, age: int):
    return [cust_num, f_name, l_name, address,
            city, state, phone, age]

Note the following:

  • The creation of an instance of the built-in types dictionary and list have been place inside functions
  • The difference between the class and the data class implementations, in light of the discussion above
  • The (clearly subjective) elegance and simplicity of the named tuple

Let’s have a look at instantiation of these structures, and a comparison of the resources required to do so.

 

Testing and Results

 
We will create a single instance of each of the 5 structures, each housing a single data record. We will repeat this process using the same data fields for each structure 1,000,000 times to get a better sense of average time, performing this process on my modest Dell notebook, using an Ubuntu-derived operating system.

Image

Compare the code between the 5 structure instantiations below.

""" instantiating structures """

from sys import getsizeof
import time

# class
customer_1 = CustomerClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
print(f'Data: {customer_1.to_string()}')
print(f'Type: {type(customer_1)}')
print(f'Size: {getsizeof(customer_1)} bytes')

t0 = time.time()
for i in range(1000000):
    customer = CustomerClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
t1 = time.time()
print('Time: {:.3f}sn'.format(t1-t0))

# data class
customer_2 = CustomerDataClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
print(f'Data: {customer_2}')
print(f'Type: {type(customer_2)}')
print(f'Size: {getsizeof(customer_2)} bytes')

t0 = time.time()
for i in range(1000000):
    customer = CustomerDataClass('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
t1 = time.time()
print('Time: {:.3f}sn'.format(t1-t0))

# named tuple
customer_3 = CustomerNamedTuple('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
print(f'Data: {customer_3}')
print(f'Type: {type(customer_3)}')
print(f'Size: {getsizeof(customer_3)} bytes')

t0 = time.time()
for i in range(1000000):
    customer = CustomerNamedTuple('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
t1 = time.time()
print('Time: {:.3f}sn'.format(t1-t0))

# dict
customer_4 = make_customer_dict('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
print(f'Data: {customer_4}')
print(f'Type: {type(customer_4)}')
print(f'Size: {getsizeof(customer_4)} bytes')

t0 = time.time()
for i in range(1000000):
    customer = make_customer_dict('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
t1 = time.time()
print('Time: {:.3f}sn'.format(t1-t0))

# list
customer_5 = make_customer_list('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
print(f'Data: {customer_5}')
print(f'Type: {type(customer_5)}')
print(f'Size: {getsizeof(customer_5)} bytes')

t0 = time.time()
for i in range(1000000):
    customer = make_customer_list('EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56)
t1 = time.time()
print('Time: {:.3f}sn'.format(t1-t0))

And here is the output of the above:

Data: EP90210, Edward, Perez, 56, 123 Fake Street, Cityville, TX, 888-456-1234
Type: <class '__main__.CustomerClass'>
Size: 56 bytes
Time: 0.657s

Data: CustomerDataClass(cust_num='EP90210', f_name='Edward', l_name='Perez', address='123 Fake Street', city='Cityville', state='TX', phone='888-456-1234', age=56)
Type: <class '__main__.CustomerDataClass'>
Size: 56 bytes
Time: 0.630s

Data: CustomerNamedTuple(cust_num='EP90210', f_name='Edward', l_name='Perez', address='123 Fake Street', city='Cityville', state='TX', phone='888-456-1234', age=56)
Type: <class '__main__.CustomerNamedTuple'>
Size: 112 bytes
Time: 0.447s

Data: {'cust_num': 'EP90210', 'f_name': 'Edward', 'l_name': 'Perez', 'address': '123 Fake Street', 'city': 'Cityville', 'state': 'TX', 'phone': '888-456-1234', 'age': 56}
Type: <class 'dict'>
Size: 368 bytes
Time: 0.318s

Data: ['EP90210', 'Edward', 'Perez', '123 Fake Street', 'Cityville', 'TX', '888-456-1234', 56]
Type: <class 'list'>
Size: 128 bytes
Time: 0.184s

Finally, another useful piece of data would be to know the relative access times of values stored within our structures (in the case below, the address). The same retrieval will be repeated 1,000,000 times, and the average time reported below.

""" accessing an element """

# class
t0 = time.time()
for i in range(1000000):
    address = customer_1.address
t1 = time.time()
print(f'Type: {type(customer_1)}')
print('Time: {:.3f}sn'.format(t1-t0))

# data class
t0 = time.time()
for i in range(1000000):
    address = customer_2.address
t1 = time.time()
print(f'Type: {type(customer_2)}')
print('Time: {:.3f}sn'.format(t1-t0))

# named tuple
t0 = time.time()
for i in range(1000000):
    address = customer_3.address
t1 = time.time()
print(f'Type: {type(customer_3)}')
print('Time: {:.3f}sn'.format(t1-t0))

# dictionary
t0 = time.time()
for i in range(1000000):
    address = customer_4['address']
t1 = time.time()
print(f'Type: {type(customer_4)}')
print('Time: {:.3f}sn'.format(t1-t0))

# list
t0 = time.time()
for i in range(1000000):
    address = customer_5[3]
t1 = time.time()
print(f'Type: {type(customer_5)}')
print('Time: {:.3f}sn'.format(t1-t0))

And the output:

Type: <class '__main__.CustomerClass'>
Time: 0.098s

Type: <class '__main__.CustomerDataClass'>
Time: 0.092s

Type: <class '__main__.CustomerNamedTuple'>
Time: 0.134s

Type: <class 'dict'>
Time: 0.095s

Type: <class 'list'>
Time: 0.117s

The intent of this article is not to make a recommendation one way or another as to which data structure to use, nor is it to suggest that there is a universal best structure for every case. Instead, we wanted to have a look at some different options and their relative strength and weakness. As with all things, there are trade-offs to be made, and less quantitative considerations such as understandability, ease of use, etc. are to be taken into account when making these types of decisions.

That said, a few things do stand out from the above analysis:

  1. The dictionary uses the greatest amount of storage of all the structures, in our case almost 3 times as much as the next greatest — though we should be careful about generalizing until we look at the effects of scaling and internal field data types
  2. Unsurprisingly, the list is the fastest to instantiate, yet not the fastest from which to retrieve an element (it’s almost the slowest)
  3. In our case, the named tuple is the slowest structure from which to retrieve an element, yet is middle of the pack for storage space
  4. Both classes take relatively longer to instantiate (expected), but element retrieval and space used, in both cases, are very competitive with the other structures

So not only are we not looking to recommend a single structure in every case, there is no clear winner that could be recommended in every case. Even taking the caution to generalize based on our small experiment, it is clear that priorities will need to be taken into account for making a decision as to which structure you use for a particular job. At the very least, this limited experimentation has provided some small window of insight into the performance of data structures available in Python.

 
Related:



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *