Forecasting the Yen to Dollar Exchange Rate

Part 1: Making Time Series Data Stationary

Scott Okamura
Feb 12, 2021 · 6 min read

Time series analysis is a very common topic in data science and in business. Being able to accurately predict the movements of stocks and foreign exchange rates can make or break a company’s entire business model. This article outlines the process of generating forecasts from time series data. The basic outline follows the OSEMN model, just like most other data science projects.

  • Obtain time series data related to the problem
  • Scrub/clean the data
  • Explore data to identify patterns and trends
  • Model training/validating/testing
  • iNterpret model results

This article will outline the steps in building a basic time series forecasting model. This first part covers obtaining, cleaning, and exploring the data to make it stationary; the second part will demonstrate how to fit and train a model and interpret its results.

Obtain

Obtaining international exchange rate data is relatively simple. A large number of sites track and compile that information for you, and I explored several of them while gathering data for this project.

These sites also typically allow users to choose the time frame to observe and to download all of the exchange rates from that period as a .csv file. After downloading the data, all that was left was to read the file into a Jupyter Notebook with Pandas using pd.read_csv().
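As a minimal sketch of that step (the filename jpy_usd.csv is just a placeholder for whatever the downloaded file is named):

import pandas as pd

# read the downloaded exchange rate history into a data frame
# ('jpy_usd.csv' is a placeholder filename)
df = pd.read_csv('jpy_usd.csv')
df.head()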

Scrub

Scrubbing and cleaning the data is almost always the longest step, and this section of the OSEMN process is also very crucial. “Scrubbing” the data involves ensuring that the values are in the proper format, dealing with any null values, and removing or replacing any corrupt or invalid data.

When working with this dataset, the first order of business was to convert the 'Date' column into the proper datetime format. This reformatted datetime column is also set as the index for the data frame. Although not required, this step makes plotting and visualizing the time series data a bit easier.

# convert the 'DATE' column to datetime and set it as the index
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)

# rename the remaining value column
df.columns = ['yen']

For this project, we are only interested in the previous exchange rates of the yen. This makes the cleaning process much simpler as we only have 1 column of values that we need to investigate.

# convert object dtype to numeric
df['yen'] = pd.to_numeric(df['yen'], errors='coerce')

The errors='coerce' parameter replaces any value that cannot be converted to a numeric datatype with NaN instead of raising an error.
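A quick toy illustration of that behavior (the values here are made up, not taken from the dataset):

# strings that cannot be parsed as numbers become NaN instead of raising an error
pd.to_numeric(pd.Series(['107.51', '.', '108.02']), errors='coerce')
# 0    107.51
# 1       NaN
# 2    108.02
# dtype: float64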

df.isnull().sum()

yen    109
dtype: int64

to_numeric() produced 109 null values in our data frame. These can be dealt with in various ways, and the right choice will differ for every project and every data scientist (a brief sketch of some of these options follows the list).

  • delete/drop using df.dropna()
  • replace using df.fillna()
  • impute using sklearn.impute.SimpleImputer
  • interpolate using df.interpolate()
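For reference, a rough sketch of what the first three options could look like on our 'yen' column (illustrative only; none of these are the approach taken below):

from sklearn.impute import SimpleImputer

# option 1: drop the rows that contain missing values
df_dropped = df.dropna()

# option 2: fill missing values, e.g. by carrying the previous rate forward
df_filled = df.fillna(method='ffill')

# option 3: impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed['yen'] = imputer.fit_transform(df_imputed[['yen']]).ravel()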

The method I chose for this particular case was to interpolate the missing values. The null values were all isolated cases: a missing exchange rate always occurred between two dates that had rates in the proper format. Therefore, by applying df.interpolate(method='linear'), each null value can be replaced with the average of the rates on the day before and the day after. Although this isn’t perfect, it’s a good estimate of what the exchange rates on our NaN dates were, barring any significant one-day spikes or dips.

df.interpolate(method='linear', inplace=True)
df.isnull().sum()

yen    0
dtype: int64

Explore

Now that the dataset is indexed by datetime and all null values have been interpolated, we can start exploring and visualizing the data. An easy place to start for any time series is a simple line plot.

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(16,10))
plt.plot(df.index, df['yen'], c='b')
plt.title('Yen to USD Exchange Rate 2011 - 2021')
plt.xlabel('Date')
plt.ylabel('Yen Value per USD')
plt.show()
# alternative: df.plot()
Plot of JPY to USD Exchange Rates from 2011 to 2021

This plot gives an overview of the dataset and how it changed over time. When modeling time series data, the series needs to be stationary. This doesn’t mean that the exchange rate must stay constant over time; if that were the case, predicting future prices wouldn’t be such a hard problem. Stationarity is concerned with the trends and patterns in the dataset. If the series has a roughly constant mean and variance over time and no seasonal trends, it is said to be stationary. Once stationarity is established, modeling and training can begin.

Stationarity can be defined [as] a flat looking series, without trend, constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations (seasonality) — NIST

How can we figure out whether our dataset is stationary? Two tests that can be used are the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. Both are available in the statsmodels library, but they approach stationarity from opposite directions: the ADF test’s null hypothesis is that the series has a unit root (is non-stationary), while the KPSS test’s null hypothesis is that the series is stationary, so their p-values are interpreted in opposite ways. The function below runs both tests on our dataset and prints out the results.

import warnings
from statsmodels.tsa.stattools import adfuller, kpss

def stationarity_tests(series):
    '''
    Input: time series data

    Output: ADF and KPSS test results for the series
    '''
    warnings.filterwarnings("ignore")

    # ADF test (null hypothesis: the series has a unit root, i.e. is non-stationary)
    result = adfuller(series)
    print(f'ADF Statistic: {round(result[0], 3)}')
    print(f'p-value: {round(result[1], 3)}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'  {key}: {round(value, 3)}')
    if result[1] > 0.05:
        print('Probably not stationary\n')
    else:
        print('Probably stationary\n')

    # KPSS test (null hypothesis: the series is stationary)
    statistic, p_value, n_lags, critical_values = kpss(series)
    print(f'KPSS Statistic: {round(statistic, 3)}')
    print(f'p-value: {p_value}')
    print(f'n_lags: {n_lags}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'  {key}: {value}')
    if p_value < 0.05:
        print('Probably not stationary\n')
    else:
        print('Probably stationary\n')
Output:
ADF Statistic: -1.676
p-value: 0.444
Critical Values:
  1%: -3.433
  5%: -2.863
  10%: -2.567
Probably not stationary

KPSS Statistic: 4.943
p-value: 0.01
n_lags: 28
Critical Values:
  10%: 0.347
  5%: 0.463
  2.5%: 0.574
  1%: 0.739
Probably not stationary

Since our data frame was found to be “Probably not stationary”, we are going to have to transform our data before it can be used in our model. To see where to start, statsmodels has a seasonal_decompose function that breaks the series into trend, seasonal, and residual components and plots each one. This can give you an idea of which component is contributing the most to the non-stationarity of the data.

from statsmodels.tsa.seasonal import seasonal_decompose

decomp = seasonal_decompose(df, model='additive', period=1)
fig = decomp.plot()
fig.set_size_inches(16, 10)

The residual and seasonal components appear to be essentially non-existent in our dataset. Looking back at the original plot above, there are no clear seasonal patterns, whether quarterly, annual, or otherwise. The observed series and the trend component look nearly identical, so virtually all of the non-stationarity in our data is coming from the trend.

To remove the trends in our data, there are a few ways to approach the problem:

  • log transforming using np.log
  • square and cube powers
  • square and cube roots using np.sqrt and np.cbrt (sketched below)
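As a rough sketch of what those transformations look like in code (illustrative only; the project itself uses just the log transform, shown below):

import numpy as np

df_log = np.log(df)       # log transform compresses the range of values
df_sq = df ** 2           # squaring stretches the range instead
df_sqrt = np.sqrt(df)     # square root dampens large values
df_cbrt = np.cbrt(df)     # cube root dampens them even more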

For this project, I chose to log transform the data, mainly because I prefer to work with a smaller range of values. After the log transformation, I also subtracted the rolling mean from the series to further improve stationarity. After differencing out the rolling mean, the first few values need to be dropped from the dataset: with a rolling window of n days, the rolling mean is undefined (NaN) for the first n - 1 observations.

import numpy as np

df_log = np.log(df)
roll_mean = df_log.rolling(window=20, center=False).mean()
df_log_mean = df_log - roll_mean
df_log_mean.dropna(inplace=True)
stationarity_tests(df_log_mean)
Output:
ADF Statistic: -8.332
p-value: 0.0
Critical Values:
  1%: -3.433
  5%: -2.863
  10%: -2.567
Probably stationary

KPSS Statistic: 0.344
p-value: 0.1
n_lags: 28
Critical Values:
  10%: 0.347
  5%: 0.463
  2.5%: 0.574
  1%: 0.739
Probably stationary

Great! Our tests found the newly transformed dataset to be stationary. If we plot it now, the trend is much harder to spot and the series looks more like random noise. There are a couple of significant dips (most notably in Q1 of 2020) that can be investigated further in future work. For the sake of this project, we will assume our tests are correct and move on to modeling.
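A quick way to check visually (a sketch reusing the earlier matplotlib setup; the title and labels are just suggestions):

plt.figure(figsize=(16,10))
plt.plot(df_log_mean.index, df_log_mean['yen'], c='b')
plt.title('Log-Transformed Yen Rate with 20-Day Rolling Mean Removed')
plt.xlabel('Date')
plt.ylabel('Log Yen per USD minus rolling mean')
plt.show()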

Part two of this project will cover the remaining sections: modeling and interpreting the results.
