5 Advanced Tips on Python Sequences


By Michael Berk, Data Scientist at Tubi




“66% of data scientists are applying Python daily.” — src

If you’re in that 66%, this post is for you.

We’re going to cover the major takeaways from chapter 2 of Fluent Python by Luciano Ramalho, which covers sequences such as lists and tuples.

 

1 — Lists vs. Tuples

 
 
Tip: lists should hold the same kind of information whereas tuples can hold different kinds of information.

Starting with the basics, let’s discuss the main difference between lists and tuples. Below we can see an example of each — lists are surrounded by square brackets [] and tuples are surrounded by parentheses ().

my_tuple = (1,'a',False)
my_list =  [1,'a',False]

 

Under the hood, lists are mutable but tuples are not. Immutable objects often require less memory, so prefer tuples when you don’t need to modify the data.
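As a rough check of both claims, here is a minimal sketch. It assumes CPython, where sys.getsizeof reports the shallow size of the container only; the exact byte counts vary by Python version, and the variable names other than my_tuple and my_list are illustrative.

```python
import sys

my_tuple = (1, 'a', False)
my_list = [1, 'a', False]

# Lists can be changed in place...
my_list[0] = 2

# ...but assigning to a tuple element raises TypeError.
try:
    my_tuple[0] = 2
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# Shallow container sizes in bytes (CPython-specific, version-dependent).
tuple_size = sys.getsizeof(my_tuple)
list_size = sys.getsizeof(my_list)
print(tuple_is_mutable, tuple_size, list_size)
```

The difference is a handful of bytes per container, so it only adds up when you hold a very large number of small records.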

However, there is a deeper note covered in Fluent Python.

Semantically, it’s best practice to store different kinds of data in a tuple and the same kind in a list. Note that both tuples and lists can hold multiple Python data types at once, but here we’re talking about what the values represent conceptually.

For instance, a tuple could be used to store the following information: (latitude, longitude, city_name). Not only are these different data types (float, float, str), but they’re also different conceptually. A list, on the other hand, should hold only latitudes, only longitudes, only city names, or only tuples of all three.

# list of lists: [float, float, str]
bad_practice_list = [[39.9526, 75.1652, 'Philadelphia'],
                     [6.2476, 75.5658, 'Medellín']]

# list of tuples: (float, float, str)
good_practice_list = [(39.9526, 75.1652, 'Philadelphia'),
                      (6.2476, 75.5658, 'Medellín')]

 

To improve the organization of your python code, you should always keep information of the same kind in a list. Tuples are for structure, lists are for sequence.
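One practical payoff of "tuples for structure, lists for sequence" is that every item in the list has the same shape, so a loop can unpack each record by position. A minimal sketch, reusing the coordinates from the example above (the names cities and labels are illustrative):

```python
cities = [(39.9526, 75.1652, 'Philadelphia'),
          (6.2476, 75.5658, 'Medellín')]

labels = []
# Each tuple is one record with a fixed layout; the list is a sequence of records.
for latitude, longitude, city_name in cities:
    labels.append(f"{city_name}: ({latitude}, {longitude})")

print(labels)
```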

 

2 — Unpacking Iterables

 
 
Tip: use * and _ to improve your unpacking.

Unpacking is a smooth and readable way to access the values inside an iterable. It’s quite common in loops, list comprehensions, and function calls.

Unpacking is done by assigning a sequence-like datatype to comma separated variable names, for example…
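A minimal sketch of basic unpacking follows. The variable names and values are illustrative and not taken from the original post.

```python
# Assign each item of a tuple to its own name.
point = (39.9526, 75.1652, 'Philadelphia')
latitude, longitude, city_name = point

# Swap two values without a temporary variable.
a, b = 1, 2
a, b = b, a

# Unpack each pair while looping.
pairs = [('a', 1), ('b', 2)]
seen = []
for letter, number in pairs:
    seen.append(letter * number)

print(latitude, longitude, city_name, a, b, seen)
```

The number of names on the left must match the number of items on the right, otherwise Python raises a ValueError.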

 

However, Fluent Python goes into some fancy unpacking methods. One example is you can use * to unpack “the rest” of the items in a long iterable. Using the asterisk notation is common when you have some items of interest, and other items that are less important.

x, *y, z = [1,2,3,4,5,6,7]
x #1
y #[2,3,4,5,6]
z #7

 

As you can see, the * operator can occur in the middle of a set of variables and python will assign all unaccounted for values to that variable.

But we can take unpacking one step further. By convention, you can assign to _ when you need to unpack a value but don’t intend to use it. This comes in handy when, unlike in the above example, you don’t need all the variables.

 

One use case for the underscore _ convention is working with dictionaries or builtin functions that return multiple values.
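Here is a sketch of those two use cases. The builtin divmod and the standard-library function os.path.split both return two values; the file path and the prices dictionary are made up for illustration.

```python
import os.path

# divmod returns (quotient, remainder); keep only the quotient.
quotient, _ = divmod(17, 5)

# os.path.split returns (directory, file name); keep only the file name.
_, filename = os.path.split('/home/user/data.csv')

# dict.items() yields (key, value) pairs; ignore the keys.
prices = {'apple': 1.5, 'pear': 2.0}
total = sum(price for _, price in prices.items())

print(quotient, filename, total)
```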

And finally, for the cherry on top, we can combine both methods to unpack and not store “the rest” of the values.

 

 

3 — Does the Function Return None?

 
 
Tip: if a function returns None, it performs in-place operations.

Many python data types have two versions of the same function, such as x.sort() and sorted(x) shown below.

x = [3,1,5,2]
x.sort()
x # [1,2,3,5]

x = [3,1,5,2]
y = sorted(x)
x # [3,1,5,2]
y # [1,2,3,5]

 

In the first example using x.sort(), we perform an in-place sort, which avoids creating a second list and so requires less memory. But in the second example using sorted(x), we retain the original order of the list.

In general, Python maintains this convention. Methods like x.sort() often return None and mutate the object in place. Functions that take the object as an argument, like sorted(x), return a new, modified copy and leave the original unchanged.
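The None return value is easy to verify, and it explains a common bug: writing x = x.sort() silently replaces your list with None. A minimal sketch, which also shows the same pattern for list.reverse() and the builtin reversed() (variable names are illustrative):

```python
x = [3, 1, 5, 2]
result = x.sort()          # sorts x in place and returns None

letters = ['c', 'a', 'b']
reverse_result = letters.reverse()       # in place, returns None
restored = list(reversed(letters))       # new list, letters is unchanged

print(result, x, reverse_result, letters, restored)
```

The word "often" matters here: some mutating methods do return a value, for example list.pop(), which removes an item and returns it.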

 

4 — GenExps vs. ListComps

 
 
Tip: use generator expressions if you’re only accessing the variable once. If not, use a list comprehension.

List comprehensions (listcomps) and generator expressions (genexps) are two ways to produce the items of a sequence.

list_comp = [x for x in range(5)]
gen_exp = (x for x in range(5))

 

As shown above, the only syntactical difference between list comps and genexps is the surrounding bracket type: parentheses () are used for genexps and square brackets [] are used for list comps.

List comps are instantiated, which means they are evaluated in full and saved in memory. Genexps are not. A genexp computes each item lazily, only when the program asks for the next one, and it is exhausted after a single pass.

That’s why generator expressions are better if you’re only iterating once: the full sequence is never stored in memory, so they can use far less of it. But if you’re repeatedly accessing a sequence or need list-specific methods, it’s better to store it in memory as a list.
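A small sketch of both points, assuming CPython (sys.getsizeof reports shallow sizes, and the exact byte counts vary by version). The names below are illustrative.

```python
import sys

squares_list = [n * n for n in range(1000)]
squares_gen = (n * n for n in range(1000))

# The list holds all 1000 results; the generator only holds its current state.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

# A generator can be consumed only once.
first_pass = sum(squares_gen)
second_pass = sum(squares_gen)   # already exhausted, so nothing is left to add

# A list can be traversed as many times as needed.
list_total = sum(squares_list)

print(list_bytes, gen_bytes, first_pass, second_pass, list_total)
```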

Fun side note — you can also create dictionaries using the list comprehension syntax…

my_dict = {k:v for k,v in zip(['a','b'], [1,2])}

 

 

5 — Slicing

 
 
Finally, let’s conclude with a quick note on slicing. Unlike with unpacking, sometimes we want to access a value in an iterable using the index. Slicing allows us to do this by using the following format: my_list[start:stop:step]

For those of you who know that my_list[::-1] reverses a list order but didn’t know why (such as myself), that’s why. By passing a -1 as our step parameter, we step through the list in reverse.

Now most Python packages abide by the [start:stop:step] syntax. Numpy and pandas are some notable examples. Let’s take a look at each parameter in turn…

  • start: starting index in your slice
  • stop: exclusive ending index in your slice
  • step: the step size (and direction) within your start and stop index

So, because each of these values is optional, we can do all sorts of cool slicing…

x = [1,2,3,4]
x[1:3]                   # [2,3]
x[2:0:-1]                # [3,2]
last = x[-1:]            # [4] (a slice always returns a list)
all_but_last = x[:-1]    # [1,2,3]
reversed_x = x[::-1]     # [4,3,2,1]
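The same syntax works on other builtin sequences, and a slice can be stored in a slice object and reused. A minimal sketch using only builtins (the names are illustrative; numpy and pandas are left out to keep it dependency-free):

```python
text = 'Fluent Python'
reversed_text = text[::-1]       # strings slice the same way as lists
first_word = text[:6]

every_other = (0, 1, 2, 3, 4, 5)[::2]   # tuples too

# slice(start, stop, step) builds a reusable slice; None means "omitted".
last_two = slice(-2, None)
tail_of_list = [1, 2, 3, 4][last_two]
tail_of_text = text[last_two]

print(reversed_text, first_word, every_other, tail_of_list, tail_of_text)
```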

 

And there you have it! 5 major tips from chapter 2 of Fluent Python. Just one more section…

 

Useful Notes for Data Scientists

 
 
Disclaimer, I’m not super qualified to add my opinions to this piece. However, these notes should be pretty intuitive. Let me know if you disagree.

  1. List comprehensions should almost always replace loops. If the body of the loop is complex, you can create a function that does the operations. By combining user-defined functions with list comprehension syntax, you make readable and efficient code. And, if you need to iterate over more than one variable, use enumerate() or zip().
  2. Being “optimal” in python doesn’t matter. If you’re writing production-level code, it may be different. But, realistically you won’t see major performance bumps when using a tuple over a list. Ensuring that your data manipulation steps are logical and efficient is 99% of the work. If the 1% matters, then you can start worrying about tuple vs. list. Moreover, if you are really in the business of efficient code, you’re probably not using python.
  3. Finally, slicing is super cool. I had always known that x[::-1] reverses a list, but never knew why until reading this chapter of Fluent Python. And it works for numpy and pandas!
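As a sketch of the first note, here is a loop body moved into a function and applied with a list comprehension, followed by enumerate() and zip() for iterating over more than one variable. The helper clean_name and the sample data are hypothetical and only serve as illustration.

```python
def clean_name(raw_name):
    """Hypothetical helper: trim whitespace and normalize capitalization."""
    return raw_name.strip().title()

raw_names = ['  philadelphia', 'MEDELLÍN ']
latitudes = [39.9526, 6.2476]

# The function holds the complex logic; the comprehension stays one readable line.
clean_names = [clean_name(name) for name in raw_names]

# zip pairs up the two lists; enumerate adds a running index.
labeled = [f"{i}: {name} ({lat})"
           for i, (name, lat) in enumerate(zip(clean_names, latitudes))]

print(labeled)
```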

Thanks for reading! I’ll be writing 35 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.

 
Bio: Michael Berk (https://michaeldberk.com/) is a Data Scientist at Tubi.

Original. Reposted with permission.
