Author: Yashveer Singh Sohi

Statistical Modeling of Time Series Data Part 1: Data Preparation and Preprocessing

Photo by Chris Liverani on Unsplash.

This series of articles analyzes the S&P 500 market index using popular statistical models: SARIMA (Seasonal Autoregressive Integrated Moving Average) and GARCH (Generalized Autoregressive Conditional Heteroskedasticity). This first part shows how the time series can be scraped and preprocessed to create additional series that indicate the stability and profitability of the market. The code used in this article is in Preprocessing.ipynb.

Table of Contents
- A brief introduction to time series data
- Downloading the data from Yahoo Finance
- Handling missing data
- Deriving S&P 500 returns and volatility
- Conclusion
- Links to other parts of this series

A brief introduction to time series data

A time series is a series of observations recorded at regular time intervals. Examples include daily pollutant levels (such as SO2 and NO2), birth rates, and the closing price of a market index (such as the S&P 500) for each day. Time series analysis uses models that uncover and exploit the dependence between the current and past values of the data. In the case of the S&P 500, these models try to determine how the current price is correlated with the price a few days or weeks ago, and use this dependence to forecast future price trends.

Downloading the data from Yahoo Finance

Yahoo Finance is a popular site for stock price data, and downloading that data with the Python library yfinance takes only a few lines of code. For this series of articles, the S&P 500 stock prices from 1994-01-06 (6th January 1994) to 2019-08-30 (30th August 2019) are downloaded via the yfinance API.

https://medium.com/media/381bca262485a5c6bfef06f56631c3fc/href

In the code cell above, two standard Python libraries used in almost all data analysis projects, pandas and numpy, are imported, along with the plotting libraries matplotlib.pyplot and seaborn. The line sns.set() applies seaborn styling to all plots.
Except for styling changes, this line won't have any effect on the outputs. The yfinance library is imported next; follow these instructions to download it. The yfinance download function takes the following arguments: tickers (a unique identifier that Yahoo Finance uses to identify each time series), interval (the time between consecutive data points, which in this case is one day, or "1d"), and the start and end dates. The downloaded data is stored in the raw_data dataframe. The first five rows (shown with raw_data.head()) and the last five rows (shown with raw_data.tail()) of raw_data confirm the download.

Extracting the relevant series

In this series of articles, we analyze the closing prices of the S&P 500 index. Here, we extract the series we are interested in:

https://medium.com/media/2c24b8ecafb6aca5486176595884a04c/href

Since this is stock market data, there are no observations for weekends. Even so, the gap between a Friday and the following Monday is treated as a single step of one business day. We use pandas' asfreq method with the argument "b" to set the frequency of the dates to business days (5 days per week).

Handling missing data

Next, it is important to check whether the data contains missing values.

https://medium.com/media/dcd2c27cb1c6686b920d93893a99ddbe/href

Output of the cleaning_spx.py code block

In the code cell above, data.spx.isnull().sum() takes the dataframe (data), extracts the column (spx), and applies isnull() to it. The result is a boolean series that contains True for each null value. sum() then adds up these booleans, counting each True as 1 and each False as 0, which gives the total number of missing values in the spx series. The describe function provides a few summary statistics. The number of null values (233) is clearly very low compared to the number of observations (6459), so a simple pandas imputation function suffices in this case.
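The extraction, business-day conversion, and missing-value handling described above can be sketched as follows. The column name spx comes from the text; the toy price data is a stand-in (the real raw_data comes from yfinance), with one business day removed to simulate a market holiday.

```python
import numpy as np
import pandas as pd

# Stand-in for the downloaded data: business days with one date removed
# to simulate a market holiday. (The real raw_data comes from yfinance.)
idx = pd.bdate_range("1994-01-06", periods=10).delete(3)
raw_data = pd.DataFrame({"Close": np.linspace(469.0, 477.0, len(idx))}, index=idx)

# Extract the closing prices under the column name used in the article
data = raw_data[["Close"]].rename(columns={"Close": "spx"})

# Set the frequency to business days (5 days per week); business dates
# missing from the index become rows of NaN.
data = data.asfreq("b")

print(data.spx.isnull().sum())  # one missing value: the simulated holiday

# Fill each missing value with the value found just before it
# (forward fill, equivalent to fillna(method="ffill"))
data.spx = data.spx.ffill()
print(data.spx.isnull().sum())  # no missing values remain
```

With the real series, the same check reports the 233 nulls mentioned above before filling, and 0 after.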
fillna fills in each empty value using the value found just before it. This behavior is controlled by the argument "ffill" (forward fill) passed to the method parameter. For the other options this parameter accepts, click here.

Deriving S&P 500 returns and volatility

The S&P 500 Returns and Volatility can now be calculated. Returns are the percentage changes in the stock price over a certain period of time; the daily returns are stored in the column spx_ret. Volatility refers to the fluctuations in the market returns. Often, either the squared returns or the magnitude of the returns is used to gauge the stability (or fluctuations) of the market; in this series, the magnitude of the returns is used, and the Volatility of spx is stored in the column spx_vol. In short, the Returns indicate the gain or loss of the market, and the Volatility, being the magnitude of the Returns, indicates the stability of the index.

Formulas for Market Returns and Volatility:

Returns_t = ((P_t − P_{t−1}) / P_{t−1}) × 100
Volatility_t = |Returns_t|

where P_t is the closing price on day t.

https://medium.com/media/5a33a9017b7d97c60c2b4fd22840837e/href

In the code cell above, the Returns and Volatility of spx are calculated. The function pct_change calculates the percentage change between the current and previous values in the series. Its numeric argument controls how far back to go for the previous value: the argument 1 calculates the percentage change between the current value and the one immediately before it. mul(100) only scales the result from the 0-1 range to the 0-100 range. After the Returns are calculated, the abs function retrieves their magnitude; this gives the Volatility. Notice that the Returns and Volatility are calculated with reference to data from a previous time step. This means the first observation has no value (Null/NA) for the Returns or the Volatility: for the first value (recorded on 1994-01-06 in this case), there is no previous value, and hence the Returns and Volatility cannot be calculated there.
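The calculation described above can be sketched as follows. The column names spx, spx_ret, and spx_vol come from the text; the closing prices here are stand-in values, since the real series comes from Yahoo Finance.

```python
import pandas as pd

# Stand-in closing prices; the real spx series comes from Yahoo Finance.
data = pd.DataFrame(
    {"spx": [469.90, 468.64, 472.27, 474.13, 474.17]},
    index=pd.bdate_range("1994-01-06", periods=5),
)

# Daily Returns: percentage change from the previous value, scaled to 0-100
data["spx_ret"] = data.spx.pct_change(1).mul(100)

# Volatility: the magnitude (absolute value) of the Returns
data["spx_vol"] = data.spx_ret.abs()

print(data)
# The first row of spx_ret and spx_vol is NaN: there is no previous
# value from which to compute a change for the first observation.
```

Note that pct_change(2) would instead compare each value with the one two business days earlier; the argument 1 gives day-over-day returns.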
We can now take a look at the first and last five rows of the preprocessed data using the head and tail functions.

First 5 rows of the preprocessed S&P 500 data.

Last 5 rows of the preprocessed S&P 500 data.

Conclusion

In the next part of this series, the 3 generated series will be visualized with common time series exploration methods.

Links to other parts of this series:

Statistical Modeling of Time Series Data, Part 1: Preprocessing
Statistical Modeling of Time Series Data, Part 2: Exploratory Data Analysis
Statistical Modeling
We monitor and write about new technologies in areas such as innovation, digitization, space, Earth, IT and AI.