4.4 Masks
How to mask data (a Boolean statement to fish out data that you want, square brackets after a dataframe)
Watch this video from 8:06 to 12:06
# To load the video, execute this cell by pressing shift + enter
from IPython.display import YouTubeVideo
from datetime import timedelta
start=int(timedelta(hours=0, minutes=8, seconds=6).total_seconds())
end=int(timedelta(hours=0, minutes=12, seconds=6).total_seconds())
YouTubeVideo("jEQRU55x0e4",start=start,end=end,width=640,height=360)
The following is a transcript of the video.
💡 Remember: Import `pandas` and read in the dataset below to complete this lesson.
# Import pandas
import pandas as pd
# Download the dataset from the
# Jupyter Book to read in locally or
# read in from GitHub, below:
data = pd.read_csv('https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/co2_mlo_weekly.csv')
Next we’re going to talk about something called masking. Masking is very important: it’s a way to find the specific data that fulfills a criterion, and that criterion is a Boolean statement. Remember, a Boolean statement is something that can only evaluate to `True` or `False`. A mask finds all the data in your dataframe for which the Boolean statement is `True`.
So this is what masking does: it gives us specific data. And as I said, it’s very simple, because a mask is just a Boolean statement.
We can create a variable called `mask` and set it equal to a Boolean statement. For the data in the `month` column, let’s make a Boolean statement saying that the month equals August (stored as `'aug'` in the dataset). We use a double equals sign here because, remember, in Python a single `=` assigns a value, while `==` tests whether two things are truly equal, just as `<=` tests less than or equal to and `>=` tests greater than or equal to. So the Boolean statement asks: is the month August, `True` or `False`? That is the mask, and we assign the result of this Boolean statement to the variable `mask`.
So we hit shift + enter and then we can see what the mask is.
# How do we get specific data?
# What if we want all data points from just the month of August?
# We create a mask!
# A mask is a Boolean statement where the data you want is TRUE
mask = data['month']=='aug'
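To see what a comparison like this produces, here is a minimal sketch on a tiny dataframe. The name `toy` and its values are made up for illustration and are not taken from the CO2 dataset; only the column names match.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CO2 dataset)
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "oct"],
    "CO2ppm": [405.2, 404.5, 403.1, 404.0],
})

# == tests equality row by row and gives one True/False per row
is_aug = toy["month"] == "aug"

# The other comparison operators behave the same way
at_most_404 = toy["CO2ppm"] <= 404.0
at_least_404 = toy["CO2ppm"] >= 404.0

print(is_aug.tolist())        # [True, True, False, False]
print(at_most_404.tolist())   # [False, False, True, True]
print(at_least_404.tolist())  # [True, True, False, True]
```

Each comparison checks every row of the column, so the result always has as many entries as the column itself.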
And we can see what the mask is: it holds one `True` or `False` value for every row of the dataframe, marking whether that row fulfills the Boolean statement. This is a large dataframe, so the output starts at the beginning, then it breaks, and then it goes all the way to the end. If you remember, the first rows of the data were from August. So you can see that these August dates, which were at the top of the dataframe, evaluate as `True`. The rows at the end, where the month is not August, evaluate as `False`.
# The mask returns which rows evaluate as TRUE for the Boolean statement
mask
0 True
1 True
2 True
3 True
4 True
...
709 False
710 False
711 False
712 False
713 False
Name: month, Length: 714, dtype: bool
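The output above reports `Length: 714` and `dtype: bool`, so the mask has one Boolean entry per row. As a small sketch of how you might inspect a mask, the example below uses a made-up dataframe called `toy` (a hypothetical stand-in for the real data) and relies on `True` counting as 1 when summed.

```python
import pandas as pd

# Toy data (hypothetical, standing in for the real dataset)
toy = pd.DataFrame({"month": ["aug", "aug", "sep", "oct", "aug"]})
toy_mask = toy["month"] == "aug"

# The mask is a column of Booleans with one entry per row
print(toy_mask.dtype)               # bool
print(len(toy_mask) == len(toy))    # True

# Summing a Boolean column counts the True values
print(toy_mask.sum())               # 3
```

Counting the `True` values this way is a quick check of how many rows a mask will select before you apply it.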
We can get the full data back as well. The mask on its own only tells us which rows are `True` and which are `False`, but if you put the mask within the dataframe’s square brackets, what you get back is the dataframe itself, limited to those rows that evaluate as `True`.
For example, here you can see that what is returned is only data where the `month` is August. These were the only rows for which the Boolean statement evaluated as `True`.
So that’s all the mask is: remember, a mask is just a Boolean statement, and we put the mask in the brackets of the dataframe to return those rows that evaluate as `True` for the Boolean statement.
# If we place the mask inside the dataframe brackets
# then the rows where the statement is True are returned
data[mask]
| | date | running_date | month | year | CO2ppm |
---|---|---|---|---|---|
0 | 8/13/17 | 1 | aug | 2017 | 405.20 |
1 | 8/14/17 | 2 | aug | 2017 | 405.20 |
2 | 8/15/17 | 3 | aug | 2017 | 405.20 |
3 | 8/16/17 | 4 | aug | 2017 | 405.20 |
4 | 8/17/17 | 5 | aug | 2017 | 405.20 |
5 | 8/18/17 | 6 | aug | 2017 | 405.20 |
6 | 8/19/17 | 7 | aug | 2017 | 405.20 |
7 | 8/20/17 | 8 | aug | 2017 | 404.54 |
8 | 8/21/17 | 9 | aug | 2017 | 404.54 |
9 | 8/22/17 | 10 | aug | 2017 | 404.54 |
10 | 8/23/17 | 11 | aug | 2017 | 404.54 |
11 | 8/24/17 | 12 | aug | 2017 | 404.54 |
12 | 8/25/17 | 13 | aug | 2017 | 404.54 |
13 | 8/26/17 | 14 | aug | 2017 | 404.54 |
14 | 8/27/17 | 15 | aug | 2017 | 404.23 |
15 | 8/28/17 | 16 | aug | 2017 | 404.23 |
16 | 8/29/17 | 17 | aug | 2017 | 404.23 |
17 | 8/30/17 | 18 | aug | 2017 | 404.23 |
18 | 8/31/17 | 19 | aug | 2017 | 404.23 |
353 | 8/1/18 | 354 | aug | 2018 | 407.46 |
354 | 8/2/18 | 355 | aug | 2018 | 407.46 |
355 | 8/3/18 | 356 | aug | 2018 | 407.46 |
356 | 8/4/18 | 357 | aug | 2018 | 407.46 |
357 | 8/5/18 | 358 | aug | 2018 | 407.23 |
358 | 8/6/18 | 359 | aug | 2018 | 407.23 |
359 | 8/7/18 | 360 | aug | 2018 | 407.23 |
360 | 8/8/18 | 361 | aug | 2018 | 407.23 |
361 | 8/9/18 | 362 | aug | 2018 | 407.23 |
362 | 8/10/18 | 363 | aug | 2018 | 407.23 |
363 | 8/11/18 | 364 | aug | 2018 | 407.23 |
364 | 8/12/18 | 365 | aug | 2018 | 407.07 |
365 | 8/13/18 | 366 | aug | 2018 | 407.07 |
366 | 8/14/18 | 367 | aug | 2018 | 407.07 |
367 | 8/15/18 | 368 | aug | 2018 | 407.07 |
368 | 8/16/18 | 369 | aug | 2018 | 407.07 |
369 | 8/17/18 | 370 | aug | 2018 | 407.07 |
370 | 8/18/18 | 371 | aug | 2018 | 407.07 |
371 | 8/19/18 | 372 | aug | 2018 | 406.84 |
372 | 8/20/18 | 373 | aug | 2018 | 406.84 |
373 | 8/21/18 | 374 | aug | 2018 | 406.84 |
374 | 8/22/18 | 375 | aug | 2018 | 406.84 |
375 | 8/23/18 | 376 | aug | 2018 | 406.84 |
376 | 8/24/18 | 377 | aug | 2018 | 406.84 |
377 | 8/25/18 | 378 | aug | 2018 | 406.84 |
378 | 8/26/18 | 379 | aug | 2018 | 406.26 |
379 | 8/27/18 | 380 | aug | 2018 | 406.26 |
380 | 8/28/18 | 381 | aug | 2018 | 406.26 |
381 | 8/29/18 | 382 | aug | 2018 | 406.26 |
382 | 8/30/18 | 383 | aug | 2018 | 406.26 |
383 | 8/31/18 | 384 | aug | 2018 | 406.26 |
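Notice in the output above that the row labels jump from 18 to 353: masking keeps each row’s original index label. The sketch below shows the same behavior on a made-up dataframe (`toy` and its values are hypothetical). Calling `reset_index(drop=True)` is one optional way to renumber the result, if that is what you want.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CO2 dataset)
toy = pd.DataFrame({
    "month": ["aug", "sep", "oct", "aug"],
    "year": [2017, 2017, 2017, 2018],
})
toy_mask = toy["month"] == "aug"

# Masking keeps the original row labels of the selected rows
subset = toy[toy_mask]
print(subset.index.tolist())       # [0, 3]

# Optionally renumber the selected rows starting from 0
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
```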
You can also use this technique with a specific column. For example, if you’re only interested in the CO2 parts per million values, you refer to that column in square brackets and then add a second set of brackets containing the mask. When you do this, only the CO2 parts per million values are returned, and none of the rest of the dataframe, because you specified that column.
# We can limit what is returned by referring to a specific column
# and placing the mask in a second set of brackets
data['CO2ppm'][mask]
0 405.20
1 405.20
2 405.20
3 405.20
4 405.20
5 405.20
6 405.20
7 404.54
8 404.54
9 404.54
10 404.54
11 404.54
12 404.54
13 404.54
14 404.23
15 404.23
16 404.23
17 404.23
18 404.23
353 407.46
354 407.46
355 407.46
356 407.46
357 407.23
358 407.23
359 407.23
360 407.23
361 407.23
362 407.23
363 407.23
364 407.07
365 407.07
366 407.07
367 407.07
368 407.07
369 407.07
370 407.07
371 406.84
372 406.84
373 406.84
374 406.84
375 406.84
376 406.84
377 406.84
378 406.26
379 406.26
380 406.26
381 406.26
382 406.26
383 406.26
Name: CO2ppm, dtype: float64
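The two sets of brackets above select the column first and then apply the mask. pandas also offers `.loc`, which takes the row mask and the column name together. This is an alternative to the form used in the video, not a replacement for it; the sketch below checks on a made-up dataframe (`toy`, hypothetical values) that both forms return the same values.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CO2 dataset)
toy = pd.DataFrame({
    "month": ["aug", "sep", "oct", "aug"],
    "CO2ppm": [405.2, 403.1, 404.0, 407.4],
})
toy_mask = toy["month"] == "aug"

# Form used in this lesson: column first, then the mask
two_step = toy["CO2ppm"][toy_mask]

# Alternative: rows (mask) and column in a single .loc call
one_step = toy.loc[toy_mask, "CO2ppm"]

print(two_step.tolist())           # [405.2, 407.4]
print(two_step.equals(one_step))   # True
```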
You saw that we created a variable called `mask` and then wrote that variable into our code. But usually people don’t do that. They do it much more simply: they write the Boolean statement itself directly inside the brackets.
You put everything in one line of code, with the data or the specific column that you want. This is a little complicated to look at, but always find the mask first. It will be in the second set of brackets if you are selecting a specific column, or in the first set of brackets if you are returning all the data. It is just a Boolean statement, and the line returns those rows for which the statement is true. This is how you get the specific data that you want.
# Usually, the mask is written out straight into the line of code
# Always look to the second set of brackets to see
# what the mask is
data['CO2ppm'][data['month']=='aug']
0 405.20
1 405.20
2 405.20
3 405.20
4 405.20
5 405.20
6 405.20
7 404.54
8 404.54
9 404.54
10 404.54
11 404.54
12 404.54
13 404.54
14 404.23
15 404.23
16 404.23
17 404.23
18 404.23
353 407.46
354 407.46
355 407.46
356 407.46
357 407.23
358 407.23
359 407.23
360 407.23
361 407.23
362 407.23
363 407.23
364 407.07
365 407.07
366 407.07
367 407.07
368 407.07
369 407.07
370 407.07
371 406.84
372 406.84
373 406.84
374 406.84
375 406.84
376 406.84
377 406.84
378 406.26
379 406.26
380 406.26
381 406.26
382 406.26
383 406.26
Name: CO2ppm, dtype: float64
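As a quick check that the one-line form and the named mask do the same thing, here is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical). It also takes the mean of the selected values as one example of what you might do with them.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CO2 dataset)
toy = pd.DataFrame({
    "month": ["aug", "sep", "aug"],
    "CO2ppm": [405.0, 403.0, 407.0],
})

# Version 1: store the Boolean statement in a variable first
named_mask = toy["month"] == "aug"
with_name = toy["CO2ppm"][named_mask]

# Version 2: write the Boolean statement straight into the brackets
inline = toy["CO2ppm"][toy["month"] == "aug"]

print(with_name.equals(inline))   # True

# The selected values can be used like any other column
print(inline.mean())              # 406.0
```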