4.4 Masks

How to mask data (a Boolean statement to fish out data that you want, square brackets after a dataframe)


Watch this video from 8:06 to 12:06

# To load the video, execute this cell by pressing shift + enter

from IPython.display import YouTubeVideo
from datetime import timedelta
start=int(timedelta(hours=0, minutes=8, seconds=6).total_seconds())
end=int(timedelta(hours=0, minutes=12, seconds=6).total_seconds())

YouTubeVideo("jEQRU55x0e4",start=start,end=end,width=640,height=360)

The following is a transcript of the video.

💡 Remember: Import pandas and read in the dataset below to complete this lesson.

# Import pandas

import pandas as pd
# Download the dataset from the
# Jupyter Book to read in locally or 
# read in from GitHub, below:

data = pd.read_csv('https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/co2_mlo_weekly.csv')

Next we’re going to talk about something called masking. Masking is very important: it’s a way to find the specific data that fulfill a criterion, and that criterion is a Boolean statement. Remember, a Boolean statement is something that can only evaluate to True or False. Masking finds all the data in your dataframe for which the Boolean statement is True.

So this is what masking does: it gives us specific data. And as I said, it’s very simple: a mask is just a Boolean statement.

We can create a mask variable and set it equal to a Boolean statement. For the data in the month column, let’s make a Boolean statement saying that month equals August. We use a double equals sign here because, remember, in Python a single equals sign assigns a value to a variable, while the double equals sign tests whether two things are truly equal (just as you can test less than or equal to, or greater than or equal to). So the Boolean statement asks: is the month August, True or False? That is the mask, and we’re setting the mask variable equal to this Boolean statement.

So we hit shift + enter and then we can see what the mask is.

# How do we get specific data?
# What if we want all data points from just the month of August?
# We create a mask!
# A mask is a Boolean statement where the data you want is TRUE

mask = data['month']=='aug'

And we can see what the mask is: for every row of the dataframe, it reports whether that row fulfills the Boolean statement (True) or not (False). This is a large dataframe, so the output starts at the beginning, then breaks, and then goes all the way to the end. If you remember, the month of August was the very first data in the dataset. So you can see that these August dates, which are at the top of the dataframe, evaluate as True. But the rows at the end, where the month is not August, evaluate as False.

# The mask returns which rows evaluate as TRUE for the Boolean statement

mask
0       True
1       True
2       True
3       True
4       True
       ...  
709    False
710    False
711    False
712    False
713    False
Name: month, Length: 714, dtype: bool
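To make the idea of a mask more concrete, here is a minimal sketch on a small made-up dataframe. The name `toy` and its values are hypothetical stand-ins for the CO2 dataset, not part of the lesson data. It shows that the mask holds one True or False value per row, and that summing the mask counts how many rows are True.

```python
import pandas as pd

# Hypothetical toy data standing in for the CO2 dataset
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "aug", "oct"],
    "CO2ppm": [405.20, 404.54, 403.10, 407.46, 405.90],
})

# The mask is a Boolean statement evaluated for every row
toy_mask = toy["month"] == "aug"

# The mask holds one True/False value per row, so its dtype is bool
mask_dtype = toy_mask.dtype

# True counts as 1 and False as 0, so the sum is the number of matching rows
n_aug = int(toy_mask.sum())

print(toy_mask.tolist())
print(mask_dtype, n_aug)
```

Counting the True values like this is a quick way to check that a mask picks out about as many rows as you expect before you use it.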

We can get the full data back as well. The mask by itself only tells us which rows are True and which are False, but if you put the mask within the dataframe brackets, what you get back is the dataframe itself: all of the columns, but only those rows that evaluate as True.

For example, here you can see that what we get returned is only data from the month of August. These were the only rows for which the Boolean statement, that the month is August, is True.

So that’s all the mask is: remember a mask is just a Boolean statement and we put the mask in the brackets of the dataframe to return those rows that evaluate as True for the Boolean statement.

# If we place the mask inside the dataframe brackets
# then the rows where the statement is True are returned

data[mask]
        date  running_date  month  year  CO2ppm
0    8/13/17             1    aug  2017  405.20
1    8/14/17             2    aug  2017  405.20
2    8/15/17             3    aug  2017  405.20
3    8/16/17             4    aug  2017  405.20
4    8/17/17             5    aug  2017  405.20
5    8/18/17             6    aug  2017  405.20
6    8/19/17             7    aug  2017  405.20
7    8/20/17             8    aug  2017  404.54
8    8/21/17             9    aug  2017  404.54
9    8/22/17            10    aug  2017  404.54
10   8/23/17            11    aug  2017  404.54
11   8/24/17            12    aug  2017  404.54
12   8/25/17            13    aug  2017  404.54
13   8/26/17            14    aug  2017  404.54
14   8/27/17            15    aug  2017  404.23
15   8/28/17            16    aug  2017  404.23
16   8/29/17            17    aug  2017  404.23
17   8/30/17            18    aug  2017  404.23
18   8/31/17            19    aug  2017  404.23
353   8/1/18           354    aug  2018  407.46
354   8/2/18           355    aug  2018  407.46
355   8/3/18           356    aug  2018  407.46
356   8/4/18           357    aug  2018  407.46
357   8/5/18           358    aug  2018  407.23
358   8/6/18           359    aug  2018  407.23
359   8/7/18           360    aug  2018  407.23
360   8/8/18           361    aug  2018  407.23
361   8/9/18           362    aug  2018  407.23
362  8/10/18           363    aug  2018  407.23
363  8/11/18           364    aug  2018  407.23
364  8/12/18           365    aug  2018  407.07
365  8/13/18           366    aug  2018  407.07
366  8/14/18           367    aug  2018  407.07
367  8/15/18           368    aug  2018  407.07
368  8/16/18           369    aug  2018  407.07
369  8/17/18           370    aug  2018  407.07
370  8/18/18           371    aug  2018  407.07
371  8/19/18           372    aug  2018  406.84
372  8/20/18           373    aug  2018  406.84
373  8/21/18           374    aug  2018  406.84
374  8/22/18           375    aug  2018  406.84
375  8/23/18           376    aug  2018  406.84
376  8/24/18           377    aug  2018  406.84
377  8/25/18           378    aug  2018  406.84
378  8/26/18           379    aug  2018  406.26
379  8/27/18           380    aug  2018  406.26
380  8/28/18           381    aug  2018  406.26
381  8/29/18           382    aug  2018  406.26
382  8/30/18           383    aug  2018  406.26
383  8/31/18           384    aug  2018  406.26
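Notice that the returned rows keep their original row labels, which is why the index jumps from one August to the next. The sketch below shows the same behavior on a small made-up dataframe (`toy` and its values are hypothetical, not the lesson data). It also shows `.loc`, a pandas indexer that accepts the same mask and, in this case, gives the same result as the plain brackets.

```python
import pandas as pd

# Hypothetical toy data standing in for the CO2 dataset
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "aug", "oct"],
    "CO2ppm": [405.20, 404.54, 403.10, 407.46, 405.90],
})

toy_mask = toy["month"] == "aug"

# Mask inside the dataframe brackets: every column, only the True rows
aug_rows = toy[toy_mask]

# The same selection written with .loc
aug_rows_loc = toy.loc[toy_mask]

# The original row labels are kept, so label 2 (sep) is simply missing
print(aug_rows)
print(aug_rows.index.tolist())
```

If you would rather have the returned rows renumbered from zero, `aug_rows.reset_index(drop=True)` is one way to do that.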

You can also use this technique with a specific column. For example, if you’re only interested in the CO2 parts per million values, you refer to that column in square brackets to get it, and then you add a second set of brackets with the mask. When you do this, only the CO2 parts per million values are returned, and none of the rest of the dataframe, because you specified that column.

# We can limit what is returned by referring to a specific column
# and placing the mask in a second set of brackets

data['CO2ppm'][mask]
0      405.20
1      405.20
2      405.20
3      405.20
4      405.20
5      405.20
6      405.20
7      404.54
8      404.54
9      404.54
10     404.54
11     404.54
12     404.54
13     404.54
14     404.23
15     404.23
16     404.23
17     404.23
18     404.23
353    407.46
354    407.46
355    407.46
356    407.46
357    407.23
358    407.23
359    407.23
360    407.23
361    407.23
362    407.23
363    407.23
364    407.07
365    407.07
366    407.07
367    407.07
368    407.07
369    407.07
370    407.07
371    406.84
372    406.84
373    406.84
374    406.84
375    406.84
376    406.84
377    406.84
378    406.26
379    406.26
380    406.26
381    406.26
382    406.26
383    406.26
Name: CO2ppm, dtype: float64
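As a side note, the two sets of brackets can also be written as a single `.loc` call that takes the mask for the rows and the column name for the columns. This is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical), not the approach used in the video; it only shows that both spellings select the same values.

```python
import pandas as pd

# Hypothetical toy data standing in for the CO2 dataset
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "aug", "oct"],
    "CO2ppm": [405.20, 404.54, 403.10, 407.46, 405.90],
})

toy_mask = toy["month"] == "aug"

# Two sets of brackets, as in the lesson: column first, then mask
aug_co2_chained = toy["CO2ppm"][toy_mask]

# One .loc call: rows chosen by the mask, column chosen by name
aug_co2_loc = toy.loc[toy_mask, "CO2ppm"]

print(aug_co2_loc.tolist())
print(aug_co2_chained.equals(aug_co2_loc))
```

For reading values, either form works. The single `.loc` call is usually the safer choice if you later want to assign new values to the selected rows.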

You saw that we created a mask variable, called it mask, and then wrote that variable into our code. But usually people don’t do that. They do it much more simply: they just write the Boolean statement out directly.

You just put the mask, all in one line of code, with the data or the specific column that you want. This is a little complicated to look at, but always remember to find the mask first. It will be in the second set of brackets if you are returning a specific column, or in the first set of brackets if you are returning all the data, and it’s just a Boolean statement. It will return those rows that are True for the Boolean statement. This is how you get the specific data that you want.

# Usually, the mask is written out straight into the line of code
# Always look to the second set of brackets to see
# what the mask is

data['CO2ppm'][data['month']=='aug']
0      405.20
1      405.20
2      405.20
3      405.20
4      405.20
5      405.20
6      405.20
7      404.54
8      404.54
9      404.54
10     404.54
11     404.54
12     404.54
13     404.54
14     404.23
15     404.23
16     404.23
17     404.23
18     404.23
353    407.46
354    407.46
355    407.46
356    407.46
357    407.23
358    407.23
359    407.23
360    407.23
361    407.23
362    407.23
363    407.23
364    407.07
365    407.07
366    407.07
367    407.07
368    407.07
369    407.07
370    407.07
371    406.84
372    406.84
373    406.84
374    406.84
375    406.84
376    406.84
377    406.84
378    406.26
379    406.26
380    406.26
381    406.26
382    406.26
383    406.26
Name: CO2ppm, dtype: float64
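The Boolean statement inside the brackets does not have to use the double equals sign. Any comparison that evaluates to True or False for each row can serve as a mask. Here is a minimal sketch with a made-up dataframe (`toy`, its values, and the 405.0 cutoff are all hypothetical) that writes two other masks straight into the line of code.

```python
import pandas as pd

# Hypothetical toy data standing in for the CO2 dataset
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "aug", "oct"],
    "CO2ppm": [405.20, 404.54, 403.10, 407.46, 405.90],
})

# Mask in the second set of brackets: CO2 values at or above a cutoff
high_co2 = toy["CO2ppm"][toy["CO2ppm"] >= 405.0]

# Mask in the first set of brackets: every row that is NOT August
not_aug = toy[toy["month"] != "aug"]

print(high_co2.tolist())
print(not_aug["month"].tolist())
```

Reading each line the same way as before, find the brackets that hold the Boolean statement first, then look at what it is applied to.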