In the broadest definition, a time series is any data set where the values are measured at different points in time. Many time series are uniformly spaced at a specific frequency, for example, hourly weather measurements, daily counts of web site visits, or monthly sales totals. Time series can also be irregularly spaced and sporadic, for example, timestamp data in a computer system’s event log or a history of 911 emergency calls. Pandas time series tools apply equally well to either type of time series.
Why We Need Time Series
Time series analysis helps us understand past trends so we can forecast and plan for the future. For example, suppose you own a coffee shop. You probably track how many coffees you sell each day or month, and to see how the shop has performed over the past six months, you would simply add up six months of sales. But what if you want to forecast sales for the next six months or the next year? In that scenario, the only variable known in advance is time (in seconds, minutes, days, months, years, and so on), so you need time series analysis to estimate the unknown components such as trend and seasonality.
Time Series data structures
Before we dive into the OPSD data, let’s briefly introduce the main pandas data structures for working with dates and times. In pandas, a single point in time is represented as a Timestamp. We can use the to_datetime() function to create Timestamps from strings in a wide variety of date/time formats. Let’s import pandas and convert a few dates and times to Timestamps.
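A minimal sketch of creating Timestamps with to_datetime(), using a couple of illustrative date strings:

```python
import pandas as pd

# A single point in time is a Timestamp
ts = pd.to_datetime('2018-01-15 3:45pm')
print(ts)  # 2018-01-15 15:45:00

# to_datetime() handles many common formats automatically
d = pd.to_datetime('7/8/1952')
print(d)   # 1952-07-08 00:00:00
```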
As we can see, to_datetime() automatically infers a date/time format based on the input. In the example above, the ambiguous date ‘7/8/1952’ is assumed to be month/day/year and is interpreted as July 8, 1952. Alternatively, we can use the dayfirst parameter to tell pandas to interpret the date as August 7, 1952.
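A quick sketch of the dayfirst behavior on that same ambiguous string:

```python
import pandas as pd

# Default: month first -> July 8, 1952
month_first = pd.to_datetime('7/8/1952')
# dayfirst=True: day first -> August 7, 1952
day_first = pd.to_datetime('7/8/1952', dayfirst=True)
print(month_first, day_first)
```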
If we’re dealing with a sequence of strings all in the same date/time format, we can explicitly specify it with the format parameter. For very large data sets, this can greatly speed up to_datetime() compared to the default behavior, where the format is inferred separately for each individual string. Any of the format codes from the strftime() and strptime() functions in Python’s built-in datetime module can be used. The example below uses the format codes %m (numeric month), %d (day of the month), and %y (2-digit year) to specify the format.
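Sketching this with a few illustrative strings that all share the %m/%d/%y format:

```python
import pandas as pd

# One explicit format for the whole sequence; no per-string inference
dates = pd.to_datetime(['2/25/10', '8/6/17', '12/15/12'], format='%m/%d/%y')
print(dates)
```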
Manipulating Time Series Data
Here we pass the name of the date column, in a list, to the parse_dates parameter, which tells read_csv() to convert that column from strings to datetime objects.
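A small self-contained sketch of this; the CSV content and column names here are hypothetical stand-ins for a real file on disk:

```python
import io
import pandas as pd

# Hypothetical CSV data (normally you would pass a file path)
csv = io.StringIO("date,sales\n2019-01-01,100\n2019-01-02,150\n")

# parse_dates converts the 'date' column from strings to datetime64
df = pd.read_csv(csv, parse_dates=['date'])
print(df.dtypes)
```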
Just as the .str accessor exposes string methods, the .dt accessor exposes datetime properties and methods. For example, we can extract the hour from a datetime column, which tells us the hour at which each item was sold. Similarly, we can access year, month, day, minute, second, and so on.
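A sketch with hypothetical sale timestamps:

```python
import pandas as pd

# Hypothetical sale times
df = pd.DataFrame({'sold_at': pd.to_datetime(['2019-01-01 09:30',
                                              '2019-01-01 14:05'])})

# .dt exposes datetime attributes; .year, .month, .minute, etc. work the same way
df['hour'] = df['sold_at'].dt.hour
print(df)
```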
We can set the time zone of our datetime column with the .dt accessor’s tz_localize() method; if we later want to change to a different time zone, we can use dt.tz_convert().
We can perform both steps at once using method chaining. Notice that we apply the .dt accessor again when converting the time zone, because .dt.tz_localize() returns a new Series.
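A sketch of that chain, using US/Eastern and UTC as illustrative time zones:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-03-01 12:00', '2019-03-02 12:00']))

# Localize, then convert, in one chain; the second .dt is needed because
# .dt.tz_localize() returns a new Series
utc = s.dt.tz_localize('US/Eastern').dt.tz_convert('UTC')
print(utc)
```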
Resampling involves changing the frequency of our time series observations; for example, in upsampling we increase the frequency of the data, such as from hours to minutes.
In the case of upsampling, care must be taken in how the new fine-grained observations are calculated, typically by interpolation. In downsampling, care must be taken in selecting the summary statistic used to compute the new aggregated values.
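Both directions can be sketched on a tiny hourly series (the values here are illustrative):

```python
import pandas as pd

idx = pd.date_range('2019-01-01', periods=4, freq='H')
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Downsampling: hourly -> 2-hourly, choosing the mean as the summary statistic
down = s.resample('2H').mean()

# Upsampling: hourly -> half-hourly, filling the new rows by interpolation
up = s.resample('30T').interpolate(method='linear')
print(down)
print(up)
```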
The final manipulation we will look at here is resampling to a coarser frequency and filling in missing values.
Here, too, we have a date column, which we have set as the index. Notice that each date is December 31st: ‘A’ is passed to the resample() method to denote annual (year-end) frequency, and first() returns the first entry of each year.
If we do not have an entry for a particular year, that year’s value will be NaN. To fill it, we can use ffill() or bfill(), or we can use interpolation, which produces a much smoother result.
For interpolation, we can call the .interpolate() method after first(), passing “linear” or “spline” as the method. With interpolate(“linear”), pandas draws a straight line between the available data points: if, say, eight values are missing between the 11th and 20th entries, it draws a straight line between the 11th and 20th values and fills the gap with values on that line.
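The whole sequence can be sketched on hypothetical annual data with a missing year (2016 is deliberately absent):

```python
import pandas as pd

# Hypothetical data: no entry for 2016
idx = pd.to_datetime(['2014-12-31', '2015-12-31', '2017-12-31'])
s = pd.Series([10.0, 20.0, 40.0], index=idx)

annual = s.resample('A').first()  # 'A' = annual; 2016 comes out as NaN
print(annual)

print(annual.ffill())                         # 2016 filled with 2015's value
print(annual.interpolate(method='linear'))    # 2016 filled on the straight line
```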
Another common interpolation method is to use a polynomial or a spline to connect the values. This produces more curvature and can look more natural on many datasets. Spline interpolation requires you to specify the order (the degree of the polynomial).
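A sketch on an illustrative series with two missing values; note that pandas delegates spline interpolation to SciPy, so SciPy must be installed:

```python
import pandas as pd

# Hypothetical series with a two-value gap
s = pd.Series([1.0, 4.0, None, None, 25.0, 36.0])

# order sets the degree of the spline (requires SciPy)
filled = s.interpolate(method='spline', order=2)
print(filled)
```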
When the data points of a time series are uniformly spaced in time (e.g., hourly, daily, monthly, etc.), the time series can be associated with a frequency in pandas. For example, let’s use the date_range() function to create a sequence of uniformly spaced dates from 1998-03-10 through 1998-03-15 at a daily frequency.
The resulting DatetimeIndex has an attribute freq with a value of ‘D’, indicating daily frequency. Available frequencies in pandas include hourly (‘H’), calendar daily (‘D’), business daily (‘B’), weekly (‘W’), monthly (‘M’), quarterly (‘Q’), annual (‘A’), and many others. Frequencies can also be specified as multiples of any of the base frequencies, for example, ‘5D’ for every five days.
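The above can be sketched as:

```python
import pandas as pd

# Uniformly spaced daily dates
idx = pd.date_range('1998-03-10', '1998-03-15', freq='D')
print(idx)
print(idx.freq)   # daily frequency

# Multiples of a base frequency, e.g. every five days
every_5d = pd.date_range('2004-09-20', periods=4, freq='5D')
print(every_5d)
```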