4.1 Head and Tail
How to quickly assess dataframe properties, using head and tail
Watch this video from 2:25 to 4:12
# To load the video, execute this cell by pressing shift + enter
from IPython.display import YouTubeVideo
from datetime import timedelta
start=int(timedelta(hours=0, minutes=2, seconds=25).total_seconds())
end=int(timedelta(hours=0, minutes=4, seconds=12).total_seconds())
YouTubeVideo("jEQRU55x0e4",start=start,end=end,width=640,height=360)
The following is a transcript of the video.
💡 Remember: Import pandas and read in the dataset below to complete this lesson.
# Import pandas
import pandas as pd
# Download the dataset from the
# Jupyter Book to read in locally or
# read in from GitHub, below:
data = pd.read_csv('https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/co2_mlo_weekly.csv')
Now let’s see how to look at our dataframe using head and tail.
It’s important to know what you’re working with, to “see” the dataframe. Head and tail allow us to do that. Remember that the data is stored in the variable data. We use .head() to see a preview of the first five rows, along with the column names.
# It's important to know what you are working with
# to "see" the dataframe
# .head() shows the first rows and column names
data.head()
| | date | running_date | month | year | CO2ppm |
|---|---|---|---|---|---|
| 0 | 8/13/17 | 1 | aug | 2017 | 405.2 |
| 1 | 8/14/17 | 2 | aug | 2017 | 405.2 |
| 2 | 8/15/17 | 3 | aug | 2017 | 405.2 |
| 3 | 8/16/17 | 4 | aug | 2017 | 405.2 |
| 4 | 8/17/17 | 5 | aug | 2017 | 405.2 |
We can also use tail. If the head is the beginning, then the tail is the end, so we get the last rows in our dataset. Tail is very useful for seeing how many rows you have overall. Remember that the index begins at zero, so a last index of 713 means we have 714 rows, or data points, in this dataset.
# .tail() shows the last rows
data.tail()
| | date | running_date | month | year | CO2ppm |
|---|---|---|---|---|---|
| 709 | 7/23/19 | 710 | jul | 2019 | 410.87 |
| 710 | 7/24/19 | 711 | jul | 2019 | 410.87 |
| 711 | 7/25/19 | 712 | jul | 2019 | 410.87 |
| 712 | 7/26/19 | 713 | jul | 2019 | 410.87 |
| 713 | 7/27/19 | 714 | jul | 2019 | 410.87 |
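As a rough sketch of the same idea, both .head() and .tail() accept a number of rows, and len() or .shape report the row count directly instead of reading it off the last index label. The small toy dataframe below is a made-up stand-in, not the CO2 dataset, so the cell runs without a download.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's `data` (NOT the real CO2 file)
toy = pd.DataFrame({
    "running_date": range(1, 11),
    "CO2ppm": [405.2] * 5 + [405.6] * 5,
})

# Pass a number to preview more or fewer rows than the default of five
first_three = toy.head(3)
last_two = toy.tail(2)

# Count rows directly rather than adding one to the last index label
n_rows = len(toy)
n_rows_from_shape, n_cols = toy.shape

print(first_three)
print(last_two)
print(n_rows, n_cols)
```

With the default zero-based index, the row count is always the last index label plus one, which is the arithmetic used in the transcript above.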
Describe is a very useful function that gives you back summary statistics for your continuous variables. If we use describe on our data, we get back information for three columns: running date, which is just a number that increases to keep track of the day; year, which is included because it is stored as a number, even though we do not want to treat it as a continuous variable; and CO2 parts per million, which is of course a continuous variable. We get back how many entries we have for each variable, which is 714 of each. We also get the mean; the standard deviation; the minimum; the quartiles at the 25th, 50th, and 75th percentiles; and the maximum value.
# .describe() is very useful, showing summary statistics
# it provides stats for continuous variables
data.describe()
| | running_date | year | CO2ppm |
|---|---|---|---|
| count | 714.000000 | 714.000000 | 714.000000 |
| mean | 357.500000 | 2018.093838 | 408.977059 |
| std | 206.258333 | 0.693299 | 3.189098 |
| min | 1.000000 | 2017.000000 | 402.760000 |
| 25% | 179.250000 | 2018.000000 | 406.530000 |
| 50% | 357.500000 | 2018.000000 | 409.010000 |
| 75% | 535.750000 | 2019.000000 | 411.450000 |
| max | 714.000000 | 2019.000000 | 415.390000 |
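One possible way to keep year out of the summary, sketched here on a made-up toy dataframe rather than the CO2 data, is to convert it to a categorical column before calling .describe(). This is an assumption about how one might handle it, not a step shown in the video. By default, .describe() on a dataframe with mixed column types summarizes only the numeric columns.

```python
import pandas as pd

# Hypothetical stand-in with the same kinds of columns as the lesson's data
toy = pd.DataFrame({
    "running_date": [1, 2, 3, 4],
    "year": [2017, 2017, 2018, 2018],
    "CO2ppm": [405.2, 405.2, 406.0, 406.0],
})

# year is stored as integers, so it is summarized like a continuous variable
summary_all = toy.describe()

# Work on a copy and mark year as categorical instead of numeric
toy_cat = toy.copy()
toy_cat["year"] = toy_cat["year"].astype("category")

# Only the remaining numeric columns are summarized now
summary_numeric = toy_cat.describe()

# For a categorical column, counts per value are often more informative
year_counts = toy_cat["year"].value_counts()

print(summary_all)
print(summary_numeric)
print(year_counts)
```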
So that’s how to read in a dataframe, quickly look at it, and get summary statistics of the continuous variables.