4.4 Masks

How to mask data (a Boolean statement to fish out data that you want, square brackets after a dataframe)

Watch this video from 8:06 to 12:06

# To load the video, execute this cell by pressing shift + enter

from IPython.display import YouTubeVideo
from datetime import timedelta
start=int(timedelta(hours=0, minutes=8, seconds=6).total_seconds())
end=int(timedelta(hours=0, minutes=12, seconds=6).total_seconds())

YouTubeVideo("jEQRU55x0e4",start=start,end=end,width=640,height=360)

The following is a transcript of the video.

💡 Remember: Import pandas and read in the dataset below to complete this lesson.

# Import pandas

import pandas as pd

# Download the dataset from the
# Jupyter Book to read in locally or 
# read in from GitHub, below:

data = pd.read_csv('https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/co2_mlo_weekly.csv')

Next we’re going to talk about something called masking. Masking is very important: it is a way to find the specific data that fulfills a criterion, where the criterion is a Boolean statement. Remember, a Boolean statement is something that can only evaluate to True or False. A mask finds all the data in your dataframe for which the Boolean statement is True.

So this is what masking does: it gives us specific data. And as I said, it’s very simple: a mask is just a Boolean statement.

We can create a mask variable and set it equal to a Boolean statement. For the data in the month column, let’s make a Boolean statement saying that month equals August. We use a double equals here because, remember, in Python a single equals sign assigns a value to a variable, while the double equals (==) tests whether two values are truly equal, just as <= tests less than or equal to and >= tests greater than or equal to. So the Boolean statement asks: is the month August, True or False? That is the mask, and we assign this Boolean statement to the variable mask.

So we hit shift + enter and then we can see what the mask is.

# How do we get specific data?
# What if we want all data points from just the month of August?
# We create a mask!
# A mask is a Boolean statement where the data you want is True

mask = data['month'] == 'aug'

And we can see what the mask is: it shows, for every row of the dataframe, whether that row fulfills the Boolean statement (True) or not (False). This is a large dataframe, so the output starts at the beginning, then it breaks, and then it picks up again at the end. If you remember, the month of August held the very first data values. So you can see that these August dates, which were at the top of the dataframe, evaluate as True. The rows at the end, where the month is not August, evaluate as False.

# The mask shows which rows evaluate as True or False for the Boolean statement

mask

0       True
1       True
2       True
3       True
4       True
       ...  
709    False
710    False
711    False
712    False
713    False
Name: month, Length: 714, dtype: bool
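As a small illustration of what the mask itself contains, here is a minimal sketch using a made-up five-row column rather than the real dataset. The names `months` and `toy_mask` and the values are assumptions for illustration only. Because True counts as 1 when summed, adding up the mask gives the number of rows that match.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'month' column of the real dataset
months = pd.Series(['aug', 'aug', 'sep', 'oct', 'aug'], name='month')

# == compares every element to 'aug' and gives back one True/False per row
toy_mask = months == 'aug'

print(toy_mask)         # a Boolean value for each row
print(toy_mask.dtype)   # the data type of the mask
print(toy_mask.sum())   # True counts as 1, so this is the number of matching rows
```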

We can get the full data back as well. On its own, the mask only reports which rows are True and which are False. But if you put the mask within the dataframe brackets, what you get back is the dataframe itself, limited to those rows that evaluate as True.

For example, here you can see that what is returned is only data from the month of August. These were the only rows for which the Boolean statement, that the month is August, was True.

So that’s all the mask is: remember, a mask is just a Boolean statement, and we put the mask in the brackets of the dataframe to return those rows that evaluate as True for the Boolean statement.

# If we place the mask inside the dataframe brackets
# then the rows where the statement is True are returned

data[mask]

	date	running_date	month	year	CO2ppm
0	8/13/17	1	aug	2017	405.20
1	8/14/17	2	aug	2017	405.20
2	8/15/17	3	aug	2017	405.20
3	8/16/17	4	aug	2017	405.20
4	8/17/17	5	aug	2017	405.20
5	8/18/17	6	aug	2017	405.20
6	8/19/17	7	aug	2017	405.20
7	8/20/17	8	aug	2017	404.54
8	8/21/17	9	aug	2017	404.54
9	8/22/17	10	aug	2017	404.54
10	8/23/17	11	aug	2017	404.54
11	8/24/17	12	aug	2017	404.54
12	8/25/17	13	aug	2017	404.54
13	8/26/17	14	aug	2017	404.54
14	8/27/17	15	aug	2017	404.23
15	8/28/17	16	aug	2017	404.23
16	8/29/17	17	aug	2017	404.23
17	8/30/17	18	aug	2017	404.23
18	8/31/17	19	aug	2017	404.23
353	8/1/18	354	aug	2018	407.46
354	8/2/18	355	aug	2018	407.46
355	8/3/18	356	aug	2018	407.46
356	8/4/18	357	aug	2018	407.46
357	8/5/18	358	aug	2018	407.23
358	8/6/18	359	aug	2018	407.23
359	8/7/18	360	aug	2018	407.23
360	8/8/18	361	aug	2018	407.23
361	8/9/18	362	aug	2018	407.23
362	8/10/18	363	aug	2018	407.23
363	8/11/18	364	aug	2018	407.23
364	8/12/18	365	aug	2018	407.07
365	8/13/18	366	aug	2018	407.07
366	8/14/18	367	aug	2018	407.07
367	8/15/18	368	aug	2018	407.07
368	8/16/18	369	aug	2018	407.07
369	8/17/18	370	aug	2018	407.07
370	8/18/18	371	aug	2018	407.07
371	8/19/18	372	aug	2018	406.84
372	8/20/18	373	aug	2018	406.84
373	8/21/18	374	aug	2018	406.84
374	8/22/18	375	aug	2018	406.84
375	8/23/18	376	aug	2018	406.84
376	8/24/18	377	aug	2018	406.84
377	8/25/18	378	aug	2018	406.84
378	8/26/18	379	aug	2018	406.26
379	8/27/18	380	aug	2018	406.26
380	8/28/18	381	aug	2018	406.26
381	8/29/18	382	aug	2018	406.26
382	8/30/18	383	aug	2018	406.26
383	8/31/18	384	aug	2018	406.26
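Notice that the returned rows keep their original index labels (0 to 18 and 353 to 383 in the output above). The sketch below shows the same behaviour on a tiny made-up dataframe. The name `toy` and its values are hypothetical and are not taken from the CO2 dataset.

```python
import pandas as pd

# Hypothetical four-row dataframe with the same column names as the lesson data
toy = pd.DataFrame({
    'month': ['aug', 'sep', 'aug', 'oct'],
    'CO2ppm': [405.20, 403.10, 407.46, 404.00],
})

# Build the mask, then place it inside the dataframe brackets
toy_mask = toy['month'] == 'aug'
august_rows = toy[toy_mask]

print(august_rows)               # only the rows where the mask is True
print(list(august_rows.index))   # the original row labels are kept
```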

You can also use this technique with a specific column. For example, if you are only interested in the CO2 parts per million values, you refer to that column in square brackets to get it, and then add a second set of brackets containing the mask. When you do this, only the CO2 parts per million values are returned, and none of the rest of the dataframe, because you specified this one column.

# We can limit what is returned by referring to a specific column
# and placing the mask in a second set of brackets

data['CO2ppm'][mask]

0      405.20
1      405.20
2      405.20
3      405.20
4      405.20
5      405.20
6      405.20
7      404.54
8      404.54
9      404.54
10     404.54
11     404.54
12     404.54
13     404.54
14     404.23
15     404.23
16     404.23
17     404.23
18     404.23
353    407.46
354    407.46
355    407.46
356    407.46
357    407.23
358    407.23
359    407.23
360    407.23
361    407.23
362    407.23
363    407.23
364    407.07
365    407.07
366    407.07
367    407.07
368    407.07
369    407.07
370    407.07
371    406.84
372    406.84
373    406.84
374    406.84
375    406.84
376    406.84
377    406.84
378    406.26
379    406.26
380    406.26
381    406.26
382    406.26
383    406.26
Name: CO2ppm, dtype: float64
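The video uses two sets of square brackets. As an aside, pandas also offers `.loc`, which takes the mask and the column name together in one set of brackets. The sketch below, on a made-up dataframe (the name `toy` and its values are hypothetical), suggests that both spellings select the same values. It is an alternative shown for comparison, not the approach taken in the video.

```python
import pandas as pd

# Hypothetical toy dataframe; the values are made up for illustration
toy = pd.DataFrame({
    'month': ['aug', 'sep', 'aug', 'oct'],
    'CO2ppm': [405.20, 403.10, 407.46, 404.00],
})
toy_mask = toy['month'] == 'aug'

# Approach from the lesson: pick the column, then apply the mask
chained = toy['CO2ppm'][toy_mask]

# Alternative: .loc takes the row mask and the column label together
with_loc = toy.loc[toy_mask, 'CO2ppm']

# Compare the two results element by element
print(chained.equals(with_loc))
```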

You saw that we created a mask, called it mask, and wrote that variable name into our code. But usually people don’t do that. They do it much more simply: they write the Boolean statement itself straight into the brackets.

You just put the mask, all in one line of code, with the data or the specific column that you want. This is a little complicated to look at, but always remember to find the mask first. It will be in the second set of brackets if you are selecting a specific column, or in the first set of brackets if you are returning all the data, and it is just a Boolean statement. The line returns those rows that are True for the Boolean statement. This is a way to get the specific data that you want.

# Usually, the mask is written out straight into the line of code
# Always look to the second set of brackets to see
# what the mask is

data['CO2ppm'][data['month']=='aug']

0      405.20
1      405.20
2      405.20
3      405.20
4      405.20
5      405.20
6      405.20
7      404.54
8      404.54
9      404.54
10     404.54
11     404.54
12     404.54
13     404.54
14     404.23
15     404.23
16     404.23
17     404.23
18     404.23
353    407.46
354    407.46
355    407.46
356    407.46
357    407.23
358    407.23
359    407.23
360    407.23
361    407.23
362    407.23
363    407.23
364    407.07
365    407.07
366    407.07
367    407.07
368    407.07
369    407.07
370    407.07
371    406.84
372    406.84
373    406.84
374    406.84
375    406.84
376    406.84
377    406.84
378    406.26
379    406.26
380    406.26
381    406.26
382    406.26
383    406.26
Name: CO2ppm, dtype: float64
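Since a mask is just a Boolean statement, the inline form works with other comparisons too, such as greater than or equal to. The following is a minimal sketch on a made-up dataframe. The name `toy`, the values, and the 405 threshold are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe; the values are made up for illustration
toy = pd.DataFrame({
    'month': ['aug', 'sep', 'aug', 'oct'],
    'CO2ppm': [405.20, 406.50, 404.10, 403.00],
})

# Inline mask with ==, as in the lesson: CO2 values for August only
august_co2 = toy['CO2ppm'][toy['month'] == 'aug']

# Inline mask with >=: CO2 values at or above an arbitrary threshold
high_co2 = toy['CO2ppm'][toy['CO2ppm'] >= 405]

print(august_co2)
print(high_co2)
```

In both lines the mask is the expression inside the second set of brackets, so reading that part first tells you which rows come back.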
