Distributed ARIMA models for ultra-long time series (2024)

Xiaoqian Wangxiaoqianwang@buaa.edu.cnYanfei Kangyanfeikang@buaa.edu.cnRob J Hyndmanrob.hyndman@monash.eduFeng Lifeng.li@cufe.edu.cnSchool of Economics and Management, Beihang University, Beijing,100191, ChinaDepartment of Econometrics and BusinessStatistics, Monash University, Clayton, VIC 3800, AustraliaSchool of Statistics and Mathematics, Central University of Financeand Economics, Beijing 102206, China

Abstract

Providing forecasts for ultra-long time series plays a vital role in various activities, such as investment decisions, industrial production arrangements, and farm management. This paper develops a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series by using the industry-standard MapReduce framework. The proposed model combination approach retains the local time dependency and utilizes a straightforward implementation of splitting across samples to facilitate distributed forecasting by combining the local estimators of time series models delivered from worker nodes and minimizing a global loss function. In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods. We investigate the performance of the proposed approach with AutoRegressive Integrated Moving Average (ARIMA) models using the real data application as well as numerical simulations. Compared to directly fitting the whole data with ARIMA models, our approach results in improved forecasting accuracy and computational efficiency both in point forecasts and prediction intervals, especially for longer forecast horizons. Moreover, we explore some potential factors that may affect the forecasting performance of our approach.

keywords:

Ultra-long time series, Distributed forecasting, ARIMA models, Least squares approximation, MapReduce

^†^†journal: arXiv.org\xpatchbibmacro

volume+number+eid\DeclareDelimFormat[cbx@textcite]nameyeardelim

1 Introduction

Ultra-long time series (i.e.time series data observed over a long time interval) arebecoming increasingly common. Examples include hourly electricity demands spanning severalyears, stock indices observed every minute over several months, daily maximum temperaturesrecorded for hundreds of years, and streaming data continuously generated inreal-time. Attempts to forecast these data play a vital role in investment decisions,industrial production arrangement, farm management, and business riskidentification. However, it is challenging to deal with such long time series withtraditional time series forecasting approaches.

We identify three significant challenges associated with forecasting ultra-long timeseries. First, the optimization of parameters in training forecasting algorithms istime-consuming due to the time dependency nature of time series. Second, processing timeseries spanning such a long time interval drives significant storage requirements,especially in the algorithms’ training process, where a standalone computer could hardlytackle. The third and most serious difficulty is that the standard time series models donot perform well for ultra-long time series hyndman2018forecasting (22). One possiblereason is that it is usually unrealistic to assume that the data generating process (DGP)of time series has remained invariant over an ultra-long time. Hence, there is an apparentdifference between the models we use and the actual DGP. The more realistic idea is toassume that the DGP stays locally stable for short time-windows.

Forecasters have made some attempts to address these limitations in forecasting ultra-longtime series. A straightforward approach is to discard the earliest observations and usethe resulting shorter time series for model fitting. But this approach only works well forforecasting a few future values, and is not an efficient use of the available historicaldata. A better approach is to allow the model itself to evolve over time. For example, ARIMA(AutoRegressive Integrated Moving Average) models and ETS (ExponenTial Smoothing) models canaddress this issue by allowing the trend and seasonal components to vary over timehyndman2018forecasting (22). An alternative, proposed by (das2020predictive, 11, ), isto apply a model-free prediction assuming that the series changes slowly and smoothly withtime. However, the aforementioned methods require considerable computational time inmodel fitting and parameter optimization, making them less practically useful in modernenterprises.

In industry, distributed computing platforms usually lack forecastingmodules. For example, it is well known that Spark supports time series forecasting poorly,especially multi-step forecasting. To support large-scale time seriesforecasting on such platforms, practitioners commonly have to adopt inadequate but availablemethods on distributed platforms meng2016mllib (39, 16). For instance, onehas to utilize the regression models in Spark’s MLlib to implement an autoregressivetype regression and artificially convert the multi-step prediction problem intomulti-subproblems, to fit the Spark framework for time series forecasting with improvedcomputational efficiency galicia2018novel (16).

On the other hand, the unprecedented scale of data collection has driven a vast literatureon studying statistical estimation and inference problems in a distributed manner, both ina frequentist settingkleiner2014scalable (29, 56, 26, 9), and a Bayesiansettingsuchard2010understanding (47, 53, 36, 10).Among the existing methods, the divide-and-conquer approach, one of the most important andeasy to implement algorithms on distributed platforms, provides an algorithm designparadigm in which a given problem is divided into a set of related subproblems that aresimple to process. Their solutions are then aggregated using proper aggregation strategies. Ideally, the subproblems can be solved in parallel, which is computationally moremanageable than dealing with the original problem. Due to its simplicity for parallelprocessing, the divide-and-conquer strategy has been widely used in the statistical literatureto untangle large-scale problems with independent data (zhang2013communication, 57, 29, 32, 58, 41, see, e.g.,).

In this paper, we follow the idea of divide-and-conquer and provide a novel approach toaddressing the large-scale time series forecasting problems in distributed environments.Specifically, we propose a distributed time series forecasting framework, in which thelong time series is first divided into several subseries spanning short time periods, andmodels can be applied to each subseries under the reasonable assumption that the DGPremains invariant over a short time. In this view, our framework has the flavor of a“varying coefficient model” fan2008statistical (15) for a long time series. However,unlike the varying coefficient models, we combine the local estimators trained on eachsubseries using weighted least squares to minimize a global loss function.

Considering a general loss function enables our framework to, in principle, be applied tostatistical time series forecasting models that can be considered as parametric regressionproblems. However, properly combining the local estimators is of great importance and challenge in practice.For example, directly combining the original parameters of ARIMA models trained on subseriessometimes leads to a global ARIMA model with roots inside the unit circle, thus causingthe stationary, causality, or invertibility problems. Therefore, when applying the proposeddistributed forecasting framework to a specific type of model, the key issue is how to processlocal estimators based on the model settings so that they can be combined properly.Conventionally, ARIMA models are among the most widely used forecasting models because(i) they can handle non-stationary and seasonal patterns, (ii) ARIMA models also frequentlyserve as benchmark methods because of their excellent performancemontero2020fforma (40, 54).Nonetheless, such models are difficult to scale up with the current Spark distributed platformdue to the nature of time dependency, making it infeasible for large scale time seriesforecasting. Therefore, we restrict our attention to ARIMA models and proposethe Distributed ARIMA (DARIMA) models, in which a necessary linear transformation stepis involved to address the aforementioned issue of estimator combination.

The proposed method preserves the local time dependency and performs a straightforwardachievement of splitting across samples to make distributed forecasting possible forultra-long time series with only one round of communication. Thus, the proposed methodcan be naturally integrated into industry-standard distributed systems with aMapReduce architecture. Such a MapReduce algorithm requires only one “master-and-worker”communication for each worker node and avoids further iterative steps. No direct communicationbetween workers is required. To this end, it is highly efficient in terms of communication.Additionally, our proposed method is scalable for a large forecast horizon, which is thetypical case when dealing with ultra-long time series, and is therefore superior to thedistributed forecasting methods (reviewed in Section2.1) in which multipletrainings or iterations are required in order to perform multi-step forecasting.

Our Distributed ARIMA modeling framework is built using an efficientdistributed computing algorithm without modifying the underlying estimation scheme forindividual time series models, making it possible to incorporate a large variation offorecasting models. In both the real data application and the simulations, we show thatour approach consistently achieves an improved forecasting accuracy over conventional globaltime series modeling approaches, both in point forecasts and prediction intervals. Theachieved performance improvements become more pronounced with increasing forecasthorizon. Moreover, our approach delivers substantially improved computational efficiencyfor ultra-long time series.

The rest of the paper is organized as follows. Section2 describes thedistributed systems, ARIMA models, and highlights of our contributions to ultra-long timeseries forecasting. Section3 introduces the framework of the proposedforecasting approach in distributed systems, further elaborated by its corecomponents. Section4 applies the proposed method with a real data andprovides a sensitivity analysis. A simulation study is performed in Section5to further investigate and justify our proposed method in terms of forecasting accuracyand computational cost. Section6 discusses other potentialsand suggests possible avenues of research. Finally, Section7 concludesthe paper.

2 Background

This section surveys the forecasting challenges on distributed systems with a specialfocus on the ARIMA models, and highlights the contributions of our framework.

2.1 Forecasting with Distributed Systems

A typical distributed system consists of two core components: the Distributed File System(DFS) and the MapReduce framework. See A for an overview for distributedsystems. The DFS provides the primary storage infrastructure for distributed systems. Bystoring files in multiple devices, the DFS effectively eliminates the negative impacts ofdata loss and data corruption. It enables devices to handle large-scale data sets andaccess data files in parallel. MapReduce provides the batch-based computing framework fordistributed systems. The MapReduce framework refers to two steps, namely Map andReduce. The input data set is first split into a collection of independent data tuples,represented by key/value pairs. The Map task takes the assigned tuples, processes them ina parallel manner, and creates a set of key/value pairs as the output, illustrated by $\langle k1,v1\rangle\rightarrow Map(\cdot)\rightarrow list(\langle k2,v2\rangle)$ . Subsequently, the Reduce task takes the intermediate outputs that come fromthe Map tasks as its inputs, combines these data, and produces a new set of key/valuepairs as the output, which can be described as $\langle k2,list(v2)\rangle\rightarrow Reduce(\cdot)\rightarrow list(\langle k3%,v3\rangle)$ . The main advantage of the MapReduce computing paradigm is that data processingis enabled to be easily extended on multiple computing nodes with high computationalefficiency.

Many distributed systems were designed for processing massive independent data. However,a distinct feature of time series data is that the observations are serially dependent,and so additional considerations are required in processing time series with distributedsystems. How to efficiently bridge time series forecasting with the distributed systemsis of crucial importance in the forecasting domain.

We identify four main challenges associated with using distributed systems for timeseries forecasting: (i) time series partitioning for MapReduce tasksli2014rolling (33); (ii) independent subseries modeling with distributed systems;(iii) model communication between worker nodes; and (iv) distributed time seriesforecasting, especially for multi-step prediction galicia2018novel (16).

Some attempts have been made by researchers to allow for efficient implementations oflarge-scale time series analysis. For example, (mirko2013hadoop, 27, ) propose aHadoop-based analysis framework for efficient data collection and accurate preprocessing,which are at least as important as the algorithm selection for a given forecasting problem.Unlike (mirko2013hadoop, 27, ), (li2014rolling, 33, ) focus on model selection.Specifically, MapReduce is used to perform cross-validation and facilitate parallelrolling-window forecasting using the training set of a large-scale time series data. Theforecasting errors are then collected and used to calculate the accuracy, allowing modelcomparison and selection. Therefore, the method is not designed to address the challengesassociated with forecasting using ultra-long historical data. (talavera2018big, 48, ) and(galicia2018novel, 16, ) manage multi-step forecasting for ultra-long time series from theperspective of machine learning by computing with the Spark platform. Specifically,(talavera2018big, 48, ) require $H$ iterations to perform multi-step forecasting, while(galicia2018novel, 16, ) formulate the forecasting problem as $H$ parallel forecastingsubproblems to support multivariate regression with the MLlib library, where $H$ is theforecast horizon. Consequently, the approaches they propose have poor scalability for a largeforecast horizon. However, ultra-long time series forecasting is typically characterized with theneed to forecast quite a few future values in practice. A recent overview of forecasting withbig data time series is provided in Section 2.7 of the encyclopedic review in(petropoulos2020forecasting, 42, ). In summary, the attempts mentioned above provide a premisefor distributed time series forecasting but it is still impractical to apply these methods tohandle ultra-long time series for the direct purpose of forecasting quite a few futureobservations.

In recent years, research has been deeply engaged in distributed statistical inferencewith concerns about practical computational cost for large-scale data processing. The mostprominent papers build upon the divide-and-conquer strategy.The communication cost, which refers to the time cost for data communication between different computernodes, is often recognized as one of the key challenges faced by distributed statistical inference.The first stream of researchemploys one-shot aggregation and averages the local estimators computed on each batch ofthe entire data to obtain the global estimators. Examples include but are not limited tothe subsampled average algorithm based on an additional level of sampling on worker nodeszhang2013communication (57), the generative model parameters estimation via the maximumlikelihood estimator liu2014distributed (35), quantile regression processesvolgushev2019distributed (51), parametric regression handled by least squareapproximation zhu2021least (58), and nonparametric and infinite-dimensional regressionzhang2015divide (56).

Another stream of research looks at the underlying iterative algorithms with multi-roundcommunications and aggregations for distributed optimization. Multiple iterations areconsidered with the purpose of further matching the aggregated estimators to the globalestimators. These iterations unfortunately result in considerable “master-and-worker”communication, which is communicationally expensive.For example, (shamir2014communication, 44, ) propose a distributedapproximate Newton algorithm to solve general subproblems available locally beforeaveraging computations, requiring a constant number of iterations (and thus rounds ofcommunication). Based on their framework, recent works by (wang2017efficient, 52, ) and(Jordan2019, 26, ) propose iterative methods with communication efficiency for distributedsparse learning and Bayesian inference. (chen2019quantile, 9, ) restrict themselves torefine the estimator of quantile regression via multiple rounds of aggregations underdistributed computing environment. In addition, another popular approach is thealternating direction method of multipliers (boyd2011distributed, 5, ADMM,) developedfor distributed convex optimization. It blends the decomposability of dual ascent with thesuperior convergence properties of the method of multipliers.

Both the streams are difficult to apply directly to time series forecasting models. Theone-shot averaging strategy is straightforward to implement and requires only a singleround of communication. While naively merging the local estimators that are processedseparately may yield inference procedures that are highly biased and variable, leading toinefficient estimations in most occasions; see (zhang2013communication, 57, ),(shamir2014communication, 44, ), (Jordan2019, 26, ), and (pan2021note, 41, ) for furtherdiscussions. The divide-and-conquer strategy is difficult to the time seriesforecasting. So some research focuses on distributed learning by splitting across thefeatures in the frequency domain for different time series rather than samples (differenttimestamps) sommer2021online (46, 19). Distributed algorithms likeADMM suffer a crucial limitation that (i)they require a reimplementation of eachestimation scheme with distributed systems, and (ii)they can be very slow to converge tohigh accuracy compared to existing algorithm designed for standalone computers, see(boyd2011distributed, 5, ) for more details. Additionally, communication cost isrecognized as a key challenge faced by statistical computation on a distributed systemJordan2019 (26, 58).

The studies reviewed above focus on scaling up optimization algorithms for solvinglarge-scale statistical tasks in a distributed manner, while our study adopts a modelcombination strategy in which model parameters obtained from multiple subseries areaggregated to minimize the global loss function. In this way, our proposed method can achievelower computational and implementation complexity than modeling the whole ultra-long timeseries with a single model.

2.2 ARIMA Models

An ARIMA (AutoRegressive Integrated Moving Average) model is composed of differencing,autoregressive (AR), and moving average (MA) components box2015time (4). We refer toan ARIMA model as ARIMA $(p,d,q)$ , where $p$ is the order of the AR component, $d$ is thenumber of differences required for a stationary series, and $q$ is the order of the MAcomponent. An ARIMA model can be extended to a seasonal ARIMA model by includingadditional seasonal terms to deal with time series exhibiting strong seasonal behavior. Aseasonal ARIMA model is generally denoted as ARIMA $(p,d,q)(P,D,Q)_{m}$ , where theuppercase $P,D,Q$ refer to the AR order, the number of differences required for aseasonally stationary series, and the MA order for the seasonal component, respectively,while $m$ denotes the period of the seasonality tsay2005analysis (50).

An ARIMA $(p,d,q)(P,D,Q)_{m}$ model for time series $\{y_{t},t\in\mathbb{Z}\}$ can bewritten with the backshift notation as

	$\displaystyle\bigg{(}1-\sum_{i=1}^{p}\phi_{i}B^{i}\bigg{)}\bigg{(}1-\sum_{i=1}%^{P}\Phi_{i}B^{im}\bigg{)}(1-B)^{d}(1-B^{m})^{D}y_{t}$		(1)
	$\displaystyle=\bigg{(}1+\sum_{i=1}^{q}\theta_{i}B^{i}\bigg{)}\bigg{(}1+\sum_{i%=1}^{Q}\Theta_{i}B^{im}\bigg{)}\varepsilon_{t},$

where $B$ is the backward shift operator, $\varepsilon_{t}$ is white noise, $m$ is thelength of the seasonal period, $\phi_{i}$ and $\Phi_{i}$ refer to the AR parameters ofnon-seasonal and seasonal parts, $\theta_{i}$ and $\Theta_{i}$ refer to the MA parametersof non-seasonal and seasonal parts respectively.

Combinations of the non-seasonal orders $p,d,q$ and seasonal orders $P,D,Q$ provide arich variation of ARIMA models. Consequently, identifying the best model among thesepossibilities is of crucial importance in obtaining good forecasting performance usingARIMA models. Fortunately, vast automatic ARIMA model selection schemes are developed. Oneof the most widely used algorithms is the auto.arima() function developed forautomatic time series forecasting with ARIMA models in the R packageforecast Hyndman2008b (23). Despite that those algorithms allow us to implementthe order selection process with relative ease in a standalone computer for short timeseries, efficient ARIMA model selection for ultra-long time series is challenging withmodern distributed computational environments.

Distributed ARIMA models for ultra-long time series (1)

We take the algorithm in the auto.arima() function to describe the ARIMA modelselection process. Other algorithms follow a similar fashion. Figure1depicts how the auto.arima() function is applied to conduct a search process overpossible models. The algorithm consists of three main steps: stationary tests, orderselection, and model refit. First, the stationary tests aim to decide the order offirst-differencing and seasonal-differencing, using a KPSS testkwiatkowski1992testing (31) for estimating $d$ and either a Canova-Hansen testcanova1995seasonal (8) or a measure of seasonality hyndman2018forecasting (22) forestimating $D$ . Second, the order selection process chooses the model orders via aninformation criterion such as AIC, AICc, or BIC values. There are two options for theorder selection approach, namely (greedy) stepwise selection and global selection, whichcan be customized according to time series characteristics such as time series length andseasonality. Such selection can be time-consuming because each information criterion isobtained by a model fitting process. Finally, the selected model orders are applied torefit best models without approximation if the information criteria used for modelselection are approximated.

The automatic ARIMA modeling has been extended in many ways by forecasting researchers(calheiros2014workload, 7, 45, 38, e.g.,). Despite itssuperb performance in forecasting time series, several difficulties hinder the extensionof this approach to ultra-long time series. We use the electricity demand for theNortheast Massachusetts (NEMASSBOST) zone of the Global Energy Forecasting Competition2017 (hong2019global, 20, GEFCom2017;) to elaborate on the following challenges ofextending the auto.arima() function to ultra-long time series data.

1.
Modeling the whole time series with a single model relies on an unrealisticassumption that the DGP of time series has remained invariant over an ultra-longperiod. Note that the assumption is made on the DGP of the target time series, ratherthan the time series itself.
2.
Order selection is an extremely time-consuming process, which requires to fit allavailable models. Even though we can select model orders by adopting global orderselection approach with parallelism, it still takes a lot of time to run a single timeseries model for ultra-long time series. The computational time grows exponentially withthe length of time series increasing.
3.
Multiple models may be considered in the model refit process because theauto.arima() function carries out several strict checks for unit roots, alsoresulting in a loss of computing efficiency.
4.
The existing approaches for model fitting, such as CSS (conditional sum-of-squares),ML (maximum likelihood), and hybrid CSS-ML (find starting values by CSS, then ML), arehard to parallel due to the nature of time dependency. The ML approach is the mostcommonly used but time-consuming approach for fitting ARIMA modelsHyndman2008b (23). Figure2 compares the execution time of theauto.arima() function under the CSS and CSS-ML fitting methods, and shows theimpact of fitting methods on the function’s execution time.When considering minimizing the conditional sum-of-squares as the estimation methodfor ARIMA models, the auto.arima() function in the forecast package forR commonly uses the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithmfor optimization purposes. The computational complexity of ARIMA modeling is $\mathcal{O}(n^{2}T)$ when the non-seasonal orders ( $p,d,q$ ) and seasonal orders( $P,D,Q$ ) are given, where $n$ is the number of parameters (i.e. $n=p+q+P+Q$ ),and $T$ is the length of the time series of interest.
5.
The length of time series has a significant impact on automatic ARIMA modeling. Wenotice that a standalone computer may not have sufficient resources to fit an ultra-longtime series. From Figure2, we find that time series with longer lengthyield much longer computation time, which provides another good explanation of why theorder selection and model refit processes are time-consuming.
6.
Most model selection schemes only allow a small range of lag values in ARIMAmodeling to limit the computation. The maximum values of model orders directly determinethe available models in the order selection process. If the model orders are allowed totake a broader range of values, the number of candidate models will increaserapidly. Therefore, relaxing the restriction of maximum values of model orders becomesan obstacle in automating ARIMA modeling for ultra-long time series.

Distributed ARIMA models for ultra-long time series (2)

In this study, the proposed algorithm is designed to tackle these challenges by estimating ARIMAmodels on distributed systems, making it suitable for the processing of big data time series.More specifically, the proposed algorithm retains the local time dependency and utilizes a straightforwardimplementation of splitting across samples to facilitate distributed forecasting forultra-long time series with only one round of communication. To the best of our knowledge,this study is the first distributed forecasting approach that integrates distributedcomputing and forecast combinations to process time series spanning large-scale timeperiods, in which we use weighted least squares to combine the local estimators trained oneach subseries by minimizing a global loss function. The proposed framework helps inextending existing forecasting methods to distributed ultra-long time series forecastingby merely making assumptions about the DGP of subseries spanning a short timeinterval. Our proposed approach makes the following contributions compared to the existingliterature and implementations.

1.
We extend the distributed least-square approximation (zhu2021least, 58, DLSA,)method, which is designed to solve pure regression-type problems with observablecovariates, to address the challenges associated with forecasting ultra-long time seriesin a distributed manner. While the theoretical and empirical work by(zhu2021least, 58, ) guarantees the statistical properties for independent data.
2.
Although the DLSA method ensures the estimators being consistent with the globalmodel, where all data are used to fit a single model, this is not our only concern withtime series forecasting because a conventional time series model usually unrealisticallyassumes the DGP of an ultra-long time series is invariant. In our paper, the DGPs ofeach subseries are allowed to vary over time, thus enabling the possible evolution oftrend and seasonality across consecutive subseries. With the support of DLSA, the localestimators computed on each subseries are aggregated using weighted least squares inorder to minimize the global loss function. This further prevents overfitting as aresult of averaging multiple models rather than selecting a single model.
3.
Compared with algorithm level parallelization, such as (boyd2011distributed, 5, ),our framework is intuitive and easy to implement. In principle, manyforecasting models are possible to elaborate with our framework. Moreover, our frameworkoutperforms competing methods with improved computational efficiency, especially forlong-term forecasting, which is necessary for many fields, such as investment decisions,industrial production arrangements, and farm management.
4.
Time series models, such as the ARIMA model or GARCH models, also model the errorterms in a parametric form. Our study shows that directly applying the DLSA, whichfocuses on the coefficients of regression type problems, to all the parameters in timeseries models may cause the stationary, causality, or invertibility problems. Our workfurther tackles this issue by a necessary ARIMA transformation step.
5.
Our approach retains a solid theoretical foundation and our proposed scheme can alsobe viewed as a model combination approach in the sense that it combines model parametersfrom the multiple subseries, in contrast to the classic forecast combinations ofdifferent forecasting methods (montero2020fforma, 40, 28, 34, 54, e.g.,).

3 Distributed Forecasting for Ultra-long Time Series

Given a time series spanning a long stretch of time, $\{y_{t};t=1,2,\dots,T\}$ , we aimto develop a novel framework to work well for forecasting the future $H$ observations. Define $\mathcal{S}=\{1,2,\dots,T\}$ to be the timestamp sequence of timeseries $\{y_{t}\}$ . Then the parameter estimation problem can be formulated as $f(\theta,\Sigma\mid y_{t},t\in\mathcal{S})$ , where $f$ is a parameter estimationalgorithm, $\theta$ denotes the global parameters, and $\Sigma$ denotes the covariancematrix for the global parameters.

Nevertheless, the above statement relies on the assumption that the underlying DGP of thetime series remains the same over a long stretch of time, which is unlikely inreality. Alternatively, suppose the whole time series is split into $K$ subseries withcontiguous time intervals; that is $\mathcal{S}=\cup_{k=1}^{K}\mathcal{S}_{k}$ , where $\mathcal{S}_{k}$ collects the timestamps of $k$ th subseries. We know that $T=\sum_{k=1}^{K}T_{k}$ , where $T_{k}$ is the length of the $k$ th subseries. Consequently,we divide an ultra-long time series into several subseries with a realistic assumptionmade about the DGP of each subseries, as illustrated in Figure3. In thisway, the parameter estimation problem is transformed into $K$ subproblems and onecombination problem as follows:

\displaystyle f(\theta,\Sigma\mid y_{t},t\in\mathcal{S})=g\big{(}f_{1}(\theta_%{1},\Sigma_{1}\mid y_{t},t\in\mathcal{S}_{1}),\ldots,f_{K}(\theta_{K},\Sigma_{%K}\mid y_{t},t\in\mathcal{S}_{K})\big{)},

where $f_{k}$ denotes the parameter estimation problem for the $k$ th subseries, and $g$ is acombination algorithm applied to combine the local estimators of subseries. Here we arecombining the models before forecasting, rather than forecasting from each model and thencombining the forecasts kang2020gratis (28, 54). In the simplestsituation, $g(\cdot)$ could be just a single mean function, and our framework could beviewed as a model averaging approach. Note that the optimal length of split subseries canbe different for different time series. One has to balance the computational resourceswith the needed length to replicate the underlying structure in the time series. Forexample, the subseries of an hourly time series should be long enough to capture the dailypattern, while the monthly pattern can be ignored due to the limited computationalresources and processed by preprocessing steps such as time series decomposition beforedistributed computing.

3.1 Linear Representations of ARIMA Models

The order selection process identifies the model with the minimum specified informationcriterion for each split subseries by using the automatic ARIMA modeling implemented inthe forecast package for R Hyndman2008b (23). Moreover, several checksare carried out to avoid an output ARIMA model with roots inside the unit circle, thusensuring time series properties such as stationarity, causality, and invertibility.Employing distributed systems to forecast ultra-long time series requires the local modelsfitted on the subseries capable of being combined to result in the global model for thewhole series. However, directly combining the original parameters of ARIMA models trainedon subseries may sometimes lead to a global ARIMA model with roots inside the unit circle.As a result, directly combining all parameters to produce an auxiliary global ARIMA modelcould be ill-behaved, thus resulting in numerically unstable forecasts. Consequently, inthis subsection, we are devoted to converting a general ARIMA model into a linearrepresentation which can be regarded as a regression problem to facilitate theparameters-based combination.

Following Equation(1), a general seasonal ARIMA model with intercept,time trend, and covariates used in the forecast R package is formallygiven by

	$\displaystyle\bigg{(}1-\sum_{i=1}^{p}\phi_{i}B^{i}\bigg{)}\bigg{(}1-\sum_{i=1}%^{P}\Phi_{i}B^{im}\bigg{)}(1-B)^{d}(1-B^{m})^{D}\bigg{(}y_{t}-\mu_{0}-\mu_{1}t%-\sum_{j=1}^{l}\eta_{j}\gamma_{j,t}\bigg{)}$
	$\displaystyle=\bigg{(}1+\sum_{i=1}^{q}\theta_{i}B^{i}\bigg{)}\bigg{(}1+\sum_{i%=1}^{Q}\Theta_{i}B^{im}\bigg{)}\varepsilon_{t},$		(2)

where $\mu_{0}$ is the intercept, $\mu_{1}$ is called the slope of the linear time trend, $\gamma_{j,t}(j=1,2,\cdots,l)$ is a covariate at time $t$ and $\eta_{j}$ is itscoefficient. The automatic ARIMA modeling provides flexibility in whether to include theintercept and time trend terms. The time trend coefficient may be non-zero when $d+D=1$ ,while the intercept may be non-zero when $d+D=0$ .

Let $x_{t}=y_{t}-\mu_{0}-\mu_{1}t-\sum_{j=1}^{l}\eta_{j}\gamma_{j,t}$ . Then theseasonal ARIMA model for time series $\{y_{t},t\in\mathbb{Z}\}$ is transformed into aseasonal ARIMA model for time series $\{x_{t},t\in\mathbb{Z}\}$ without the intercept,time trend, and covariates terms. First, we convert the seasonal ARIMA model into anon-seasonal ARMA model.By using the polynomial multiplication, we assumethat the converted non-seasonal ARMA model is denoted ARMA $(u,v)$ , where $u=p+mP+d+mD$ and $v=q+mQ$ . The (possibly non-stationary) ARMA $(u,v)$ is defined as

\displaystyle\bigg{(}1-\sum_{i=1}^{u}\phi_{i}^{\prime}B^{i}\bigg{)}x_{t}=\bigg%{(}1+\sum_{i=1}^{v}\theta_{i}^{\prime}B^{i}\bigg{)}\varepsilon_{t},

(3)

where $\phi_{i}^{\prime}$ and $\theta_{i}^{\prime}$ refer to the AR and MA parametersrespectively. The converted ARMA model after polynomial multiplications still satisfiesthe invertibility condition because of the unit root checks performed in the originalARIMA modeling, so that all roots of the MA characteristic polynomial lie outside the unitcircle.

The next task involves converting the ARMA $(u,v)$ model for time series $\{x_{t}\}$ to itslinear representation.Given polynomials $\phi^{\prime}(B)=\left(1-\sum_{i=1}^{u}\phi_{i}^{\prime}B^{i}\right)$ , and $\theta^{\prime}(B)=\left(1+\sum_{i=1}^{v}\theta_{i}^{\prime}B^{i}\right)$ with rootsoutside the unit circle, we have

\displaystyle\pi(B)x_{t}=\frac{\phi^{\prime}(B)}{\theta^{\prime}(B)}x_{t}=%\varepsilon_{t},

where $\pi(B)=\left(1-\sum_{i=1}^{\infty}\pi_{i}B^{i}\right)$ . The parameters of theconverted AR( $\infty$ ) model can be obtained by a recursion process. Consequently, thelinear representation of the original seasonal ARIMA model in Equation(3.1)is given by

y_{t}=\beta_{0}+\beta_{1}t+\sum_{i=1}^{\infty}\pi_{i}y_{t-i}+\sum_{j=1}^{l}%\eta_{j}\bigg{(}\gamma_{j,t}-\sum_{i=1}^{\infty}\pi_{i}\gamma_{j,t-i}\bigg{)}+%\varepsilon_{t},

(4)

where

\beta_{0}=\mu_{0}\bigg{(}1-\sum_{i=1}^{\infty}\pi_{i}\bigg{)}+\mu_{1}\sum_{i=1%}^{\infty}i\pi_{i}\qquad\text{and}\qquad\beta_{1}=\mu_{1}\bigg{(}1-\sum_{i=1}^%{\infty}\pi_{i}\bigg{)}.

Thus, the infinite order in the AR representation can be approximated by a large order of $p^{*}$ to make the AR $(p^{*})$ model infinitely close to the true AR process.

3.2 The Distributed Least Squares Approximation Method

Suppose we have obtained the appropriate models for individual subseries by traversing themodel space. The next stage entails solving the problem of combining the local estimatorsof each subseries model to perform multi-step forecasting.Inspired by (zhu2021least, 58, ), we aim to solve the time series modeling problem withthe dependency structure. The local ARIMA models trained for the subseries are unifiedinto the AR representations with a large order, making it possible to estimate the globalparameters in the master node by combining the local estimators delivered from a group ofworker nodes.

Let the model parameters in Equation(4) be given by $\theta=(\beta_{0},\beta_{1},\pi_{1},\pi_{2},\cdots,\pi_{p},\eta_{1},\cdots,%\eta_{l})^{\top}$ . If $\mathcal{L}(\theta;y_{t})$ is a twice differentiable loss function,we have the global loss function given by $\mathcal{L}(\theta)=T^{-1}\sum_{t=1}^{T}\mathcal{L}(\theta;y_{t})$ and the local lossfunction for the $k$ th subseries given by $\mathcal{L}_{k}(\theta)=T_{k}^{-1}\sum_{t\in\mathcal{S}_{k}}\mathcal{L}(\theta%;y_{t})$ . By using Taylor’s theorem and the relationship between the Hessian and covariancematrix for Gaussian random variables yuen2010bayesian (55), we have

\displaystyle\mathcal{L}(\theta)

\displaystyle=\frac{1}{T}\sum_{k=1}^{K}\sum_{t\in\mathcal{S}_{k}}\mathcal{L}(%\theta;y_{t})\approx\sum_{k=1}^{K}(\theta-\widehat{\theta}_{k})^{\top}\left(%\frac{T_{k}}{T}\widehat{\Sigma}_{k}^{-1}\right)(\theta-\widehat{\theta}_{k})+c%_{2},

(5)

where $\widehat{\theta}_{k}$ is the minimizer of the local loss function. That is $\widehat{\theta}_{k}=\arg\min\mathcal{L}_{k}(\theta)$ , $c_{1}$ and $c_{2}$ are constants, and $\widehat{\Sigma}_{k}$ is the covariance estimate for local estimator of the $k$ th subseries.

Consequently, the objective of minimizing the global loss function is achieved by minimizing the weighted least squares expression in Equation(5). The global estimator takes the following form

\displaystyle\widetilde{\theta}=\left(\sum_{k=1}^{K}T_{k}\widehat{\Sigma}_{k}^%{-1}\right)^{-1}\left(\sum_{k=1}^{K}T_{k}\widehat{\Sigma}_{k}^{-1}\widehat{%\theta}_{k}\right).

(6)

Then the covariance matrix of the estimated global parameters is given by $\widetilde{\Sigma}=T\left(\sum_{k=1}^{K}T_{k}\widehat{\Sigma}_{k}^{-1}\right)^%{-1}$ .

Note that instead of simply averaging the estimated parameters, the analytical solution inEquation(6) approximates global estimators by taking a weighted average oflocal parameters using weights $\widehat{\Sigma}_{k}^{-1}$ , eliminating the influence ofoutliers on the forecast performance when there are subseries that are poorly fitted byARIMA models. Furthermore, the simple averaging method assumes subseries as hom*ogeneousfrom worker to worker to guarantee statistical efficiency. This is highly questionable inpractice because heteroskedasticity and imbalance are common phenomenons for distributedstored data.

The analytical form of the global estimator in Equation(6) can be used toestimate the global parameters in distributed forecasting. The difficulty with this methodis that it requires knowledge of $\widehat{\Sigma}_{k}$ . Specifically, the localparameters of a subseries in Equation(4) are derived from a seasonal ARIMAmodel and it may not be possible to straightly obtain a good estimate of the covariancematrix. Moreover, a large order $p^{*}$ is often considered for a better approximation ofthe initial ARIMA model trained by the split subseries, resulting in a set of localparameters in which some entries are close to or equal to zero and a high-dimensionalcovariance matrix whose dimension $p^{*}$ is likely to be larger than the length ofsubseries. Therefore, following the literature of high-dimensional covariance matrixestimation fan2008high (12, 13, 21), we assume $\widehat{\Sigma}_{k}$ is sparse and approximate it using $\hat{\sigma}_{k}^{2}I$ for eachsubseries in this study, which greatly simplifies the computations and further reduces thecommunication costs in distributed systems.

3.3 Point Forecasts

After combining the local estimators from each subseries by minimizing the global loss function, the coefficients of the global estimators are calculated as illustrated in Section3.2. By using a large order $p^{*}$ instead of the infinite order in the converted AR representation for each subseries, the combined global model can be written generally as follows:

\displaystyle y_{t}=\tilde{\beta}_{0}+\tilde{\beta}_{1}t+\sum_{i=1}^{p^{*}}%\tilde{\pi}_{i}y_{t-i}+\sum_{j=1}^{l}\tilde{\eta}_{j}\bigg{(}\gamma_{j,t}-\sum%_{i=1}^{p^{*}}\tilde{\pi}_{i}\gamma_{j,t-i}\bigg{)}+e_{t},

(7)

where $\tilde{\theta}=(\tilde{\beta}_{0},\tilde{\beta}_{1},\tilde{\pi}_{1},\cdots,%\tilde{\pi}_{p^{*}},\tilde{\eta}_{1},\cdots,\tilde{\eta}_{l})^{\top}$ is a vector of global model coefficients, and $e_{t}$ is the observed residual.

Given the time series $\{y_{t}\}$ , suppose that we are at the time $T$ and are interested in forecasting the next $H$ observations, where the time index $T$ is the forecast origin. The $h$ -step-ahead forecast can be calculated with relative ease as

	$\displaystyle\hat{y}_{T+h\|T}=$	$\displaystyle\tilde{\beta}_{0}+\tilde{\beta}_{1}(T+h)+\sum_{j=1}^{l}\tilde{%\eta}_{j}\bigg{(}\gamma_{j,T+h}-\sum_{i=1}^{p^{*}}\tilde{\pi}_{i}\gamma_{j,T+h%-i}\bigg{)}$
		$\displaystyle+\begin{cases}\sum_{i=1}^{p^{}}\tilde{\pi}_{i}y_{T+1-i},&h=1\\\sum_{i=1}^{h-1}\tilde{\pi}_{i}\hat{y}_{T+h-i\|T}+\sum_{i=h}^{p^{}}\tilde{\pi}%_{i}y_{T+h-i},&1<h<p^{}\\\sum_{i=1}^{p^{}}\tilde{\pi}_{i}\hat{y}_{T+h-i\|T},&h\geq p^{*}\end{cases}.$

In this way, the point forecasts of the next $H$ observations can be calculated recursively for $h=1,\dots,H$ .

3.4 Prediction Intervals

As described in Section3.1, the linear representation of the seasonal ARIMA model, which is trained for each subseries of $\{y_{t}\}$ , is derived from the following AR model for time series $\{x_{t}\}$ :

\displaystyle x_{t}=\sum_{i=1}^{p^{*}}\pi_{i}x_{t-i}+\varepsilon_{t},

where the infinite order for the converted AR model is replaced by a large order $p^{*}$ . Asis well-known, once we estimate the coefficients of the AR regression and the standarddeviation of the residuals, the standard error of the $h$ -step ahead forecast can beuniquely determined. Thus, the forecast variances of the linear representation inEquation(4) are not affected by the intercept term and time trend term of theseasonal ARIMA model (ignoring estimation error). Consequently, the forecast variances ofthe combined global model in Equation(7) depend only on the AR part of themodel, that is the term $\sum_{i=1}^{p^{*}}\tilde{\pi}_{i}y_{t-i}+e_{t}$ .

To compute these variances, we convert the AR model to a MA model with infinite orderbrockwell2016introduction (6):

\displaystyle e_{t}+\sum_{i=1}^{\infty}\tilde{\psi}_{i}e_{t-i}.

Then, in the global model, the standard error of the $h$ -step ahead forecast is given by

\displaystyle\tilde{\sigma}_{h}^{2}=\begin{cases}\tilde{\sigma}^{2},&h=1\\\tilde{\sigma}^{2}\left(1+\sum_{i=1}^{h-1}\tilde{\psi}_{i}^{2}\right),&h>1,%\end{cases}

where $\tilde{\sigma}$ is the standard deviation of the residuals for the combined global model and is unknown. As illustrated in Section3.2, we suggest replacing the covariance estimate $\widehat{\Sigma}_{k}$ of local estimators with $\hat{\sigma}_{k}^{2}I$ . Subsequently, the covariance estimate of the global estimators is calculated by $\widetilde{\Sigma}=\left(\sum_{k=1}^{K}\left(T_{k}/T\right)(\hat{\sigma}_{k}^{%2}I)^{-1}\right)^{-1}$ , and we can estimate $\tilde{\sigma}^{2}=\operatorname{tr}(\widetilde{\Sigma})/p$ .

Assuming normally distributed errors, the central $100(1-\alpha)\%$ prediction interval for the $h$ -step ahead forecast is given by

\displaystyle\hat{y}_{T+h|T}\pm\Phi^{-1}(1-\alpha/2)\tilde{\sigma}_{h},

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

4 Application to Electricity Data

In the section, the electricity demand data set we use to investigate the performance ofthe proposed distributed ARIMA models and the experimental design are shown in detail. Weanalyze the performance of the proposed distributed ARIMA models and explore the factorsthat affect its forecasting performance.

4.1 Data Description

To illustrate our proposed approach, we forecast the time series of the GEFCom2017hong2019global (20). The data, publicly available athttps://github.com/camroach87/gefcom2017data, was initially made available by ISONew England. It comprises the electricity load, holiday information, and weather datacomposed of dew point and dry bulb temperatures. To assess the benefits of the proposeddistributed ARIMA model, we restrict our attention to the electricity load data in thefollowing analysis.

Distributed ARIMA models for ultra-long time series (5)

The electricity load data set consists of $10$ time series of hourly data, ranging from 1March 2003 to 30 April 2017. As aforementioned in Section2.2, thecomputational time for the automatic ARIMA modeling grows exponentially over the timelength, reaching $10$ minutes at the length of $1,000$ . Allowing a wider range of modelorders will further increase the execution time beyond an acceptable range (seeSection4.5 for more experimental results). In this respect, the electricityload data, spanning $124,171$ time points, is long enough so that distributed computing isdesired, consistent with the applicable scenarios of our proposed approach.Dealing with data of this magnitude is uncommon in traditional forecasting applications,but it should be noted that load forecasting problems routinely have cases with availabledata spanning more than one million time points, most likely for several locations at the same time.The forecasts are primarily used in power systems operations, such as energy trading and unit commitment.

Figure5 presents the hourly electricity load for allbottom-level zones and two aggregated zones. We train the distributed ARIMA model usingdata from 1 March 2003 to 31 December 2016, while data from 1 January 2017 to 30 April2017 are used for testing. In this way, we provide the four-month ( $2879$ -step) aheadpoint forecasts, and the corresponding prediction intervals with multiple confidencelevels. The original GEFCom2017 hong2019global (20) only requires one month aheadforecasting. However, forecasting at longer horizons in energy and many otherhigh-frequency time series domains is in great demand as it allows for earlier managementplans. Note that we restrict our attention to time series forecasting on distributedsystems without considering the data’s hierarchical configuration.

Electricity demand may exhibit periodic patterns such as time of the day, day of the week,and month of the year. Although day of the week can be easily involved in our proposedmodel as covariates, our ex-ante analysis shows no significant distinction in patternsbetween days of the week: whether the day-of-the-week pattern is included in ARIMA modelsor distributed ARIMA models results in little difference in both point and intervalforecasting accuracy. Moreover, the subseries is not long enough to enable us to considermonthly seasonality under distributed computing environment, while the monthly seasonalitycan be handled with preprocessing steps such as time series decomposition. To better focuson assessing the benefits of the proposed distributed ARIMA models over normal ARIMAmodels, we only consider the time-of-the-day effect in the following analysis by using the seasonalcomponents of ARIMA models for hourly subseries ( $m=24$ ).

4.2 Experimental Setup

We partition each time series into $150$ subseries in the experiment with the length ofeach subseries about $800$ . The setup is inspired by the M4 competitionmakridakis2020m4 (38): the length of the hourly time series ranges from $700$ to $900$ . For time series with such lengths, traditional forecasting models perform well on astandalone computer, and the time consumed by automatic ARIMA modeling process is within $5$ minutes, which is empirically acceptable. The analysis exploring the forecastingperformance on different settings of the number of subseries is presented inSection4.5.

As illustrated in Section3.1, the AR representation with infinite order isobtained from the seasonal ARIMA model for each subseries to facilitate theparameter-based combination. We approximate the infinite order AR model with one sizeablefinite order, balancing model complexity, and approximating the original ARIMA model.In practice, one can perform an ex-ante analysis using several candidate order values andselect the smallest one from the order values resulting in no significant differencesin the forecasting results of the original ARIMA model and its converted linearrepresentation. Alternatively, if no ex-ante analysis has been conducted, a sizeableorder is always recommended as it will only lead to negligible computationalcomplexity in the combination of local parameters in the master node, while allowing abroader range of order values for the original ARIMA models. Accordingly, we set the AR order $p^{*}$ to $2000$ in this experiment, see Section4.4 for more details.

In this paper, the distributed ARIMA model is designed with the aim of facilitating the ARIMAmodeling for ultra-long time series in a distributed manner with high computational efficiencyand potentially improved forecasting performance, rather than running a horse race between theproposed method and other forecasting methods. Therefore, the main comparison we are interestedin is DARIMA versus ARIMA. Moreover, the most widely used forecastingapproaches developed with distributed systems are poorly scalable for large forecast horizons,making it impractical to apply these approaches to forecasting quite a few future values(see Section2.1 for more details). Inthis regard, we only compare the proposed approach to ARIMA models for the whole time series,as well as an additional standard for comparison: ETS models.

For comparison purposes, the argument configuration of the automatic ARIMA modeling forthe whole series and subseries is shown in Table1. To make the algorithmexecution time comparable, we consider the global order selection with parallelism infitting models for the whole time series, while using non-parallel stepwise orderselection when modeling the subseries. Furthermore, we apply the CSS method to fit ARIMAmodels instead of CSS-ML (see Section2.2 for details). With the fittingmethod CSS-ML, we may have to fit more than one model in the model refit process, sincethe model with the appropriate order identified in the order selection process may berejected by several strict checks for unit roots. Due to the uncertainty, the comparisonof execution time between the ARIMA model on the whole series and the distributed ARIMAmodel would be unreliable if we used CSS-ML. Finally, the experiment is limited tospecific maximum values of model orders. We further discuss the importance of model ordersto forecasting performance in Section4.5. Moreover, implementations of ETSmodels are available in the forecast R package with the ets() function.Given a single time series data, ETS modeling can not be extended to parallel computing.

Argument	ARIMA	DARIMA
max.p; max.q	5	5
max.P; max.Q	2	2
max.order	5	5
fitting method	CSS	CSS
parallel (multicore)	True	False
stepwise	False	True

As for the system environment, the experiments are carried out on a Spark-on-YARN cluster onAlibaba Cloud Server composed of one master node and two worker nodes. Each node contains $32$ logical cores, $64$ GB RAM and two $80$ GB SSD local hard drives. The algorithm isdeveloped on Spark platform (2.4.5), and both Python as well as Rinterfaces are freely available at https://github.com/xqnwang/darima.

4.3 Evaluation Measures

To assess the performance of the point forecasts, we consider the mean absolute scalederror (hyndman2006another, 24, MASE;) as the measure of forecasting accuracy. MASE isrecommended because of its excellent mathematical properties, such as scale-independentand less insensitive to outliers. Besides, (hyndman2006another, 24, ) suggest MASE as thestandard measure for comparing forecasting accuracy across multiple time series. The formulafor computing the MASE is the following:

MASE

\displaystyle=\frac{\frac{1}{H}\sum\limits_{t=T+1}^{T+H}|y_{t}-\hat{y}_{t|T}|}%{\frac{1}{T-m}\sum\limits_{t=m+1}^{T}|y_{t}-y_{t-m}|}.

We evaluate the accuracy of prediction intervals using the mean scaled interval score(gneiting2007strictly, 18, MSIS;), given by

\displaystyle\text{MSIS}=\frac{\displaystyle\frac{1}{H}\sum_{t=T+1}^{T+H}(U_{t%|T}-L_{t|T})+\frac{2}{\alpha}(L_{t|T}-y_{t})\bm{1}\{y_{t}<L_{t|T}\}+\frac{2}{%\alpha}(y_{t}-U_{t|T})\bm{1}\{y_{t}>U_{t|T}\}}{\displaystyle\frac{1}{T-m}\sum_%{t=m+1}^{T}|y_{t}-y_{t-m}|},

where $L_{t|T}$ and $U_{t|T}$ are lower and upper bounds of the generated $100(1-\alpha)$ % prediction interval, respectively. The scoring rule balances the width ofthe generated prediction intervals and the penalty for true values lying outside theprediction intervals.

4.4 Distributed Forecasting Results

We now investigate the performance of the proposed distributed ARIMA models on theGEFCom2017 data set compared to that of ARIMA models as well as ETS models in termsof MASE as well as MSIS. Execution time isalso considered as an important metric describing the computational efficiency ofalgorithms. For conciseness, our proposed algorithm, the distributed ARIMA model, ishereinafter referred to as DARIMA.

To verify whether the approximating order of the AR representation of $2000$ (as describedin Section4.2) is large enough to make the AR model close to its originalseasonal ARIMA model, we present the forecasting results of the ARIMA model and itsAR $(2000)$ representation on the GEFCom2017 data set in Table2. We observethat there is no difference between the forecasting performance of the ARIMA model andthat of the converted AR model (as measured by MASE and MSIS), to the degree of $10^{-3}$ .

	MASE			MSIS
	Mean	Median	SD	Mean	Median	SD
DARIMA	1.297	1.218	0.284	15.078	14.956	1.021
ARIMA	1.430	1.325	0.351	19.733	16.498	7.446
AR representation	1.430	1.325	0.351	19.733	16.498	7.446
ETS	1.491	1.338	0.408	53.783	49.109	15.834

Table2 also compares the forecasting performance of DARIMA against ARIMA and ETSfor the whole time series in terms of the mean, median, and standard deviation (SD) of the MASE and MSIS values. Asexpected, DARIMA always outperforms the benchmark methods regardless of point forecasts orprediction intervals. Specifically, for point forecasting, DARIMA achieves substantialperformance improvements compared to the benchmark methods, approximately at least $9.3\%$ for themean MASE value and $8.1\%$ for the median, with a smaller degree of variation. Meanwhile, DARIMA yields a statisticallysignificant improvement (at least $23.6\%$ ) over the benchmark methods in terms of the mean ofMSIS values. Therefore, implementing ARIMA models on distributed systems by splitting thewhole time series into several subseries dominates ARIMA and ETS in bothpoint forecasting and interval forecasting.

We proceed by observing how the forecasting performance of distributed ARIMA modelschanges with the forecast horizon. Figure6 depicts the accuracy ofDARIMA over various forecast horizons against the benchmark methods: ARIMA and ETS. First, theleft panel shows that the point forecasting performance of DARIMA displays smalldifferences with ARIMA and ETS when we are interested in obtaining the forecasts of the first fewfuture values. We also observe that DARIMA yields slightly larger values than ARIMA interms of MASE when focusing on forecasting the next $1000$ observations. This differencetapers off with increasing forecast horizon, and finally, DARIMA significantly outperformsboth ARIMA and ETS for the forecasting of long-term observations. On the other hand, the right panelillustrates that DARIMA provides much better performance than ARIMA and ETS according to MSISvalues when we turn our attention to forecasting more than $100$ futurevalues. The achieved performance improvements become more pronounced as theforecast horizon increases. Furthermore, one possible reason for the result thatDARIMA and ARIMA models perform substantially better than ETS modelsis that, for seasonal data, there are many more ARIMA models than the possiblemodels in the exponential smoothing class Hyndman2008b (23).In simple terms, we conclude that if long-term observations areconsidered, DARIMA is favorable, both in point forecasts and prediction intervals.

Figure7 shows the forecasting performance of DARIMA compared toARIMA on the electricity demand series for the NEMASSBOST zone. We observe from theforecasts that, compared to ARIMA, DARIMA captures the yearly seasonal pattern from theoriginal series. Even for large forecast horizons, DARIMA results in forecastscloser to the true future values than ARIMA. These conclusions are consistent with theprevious results shown in Table2 and Figure6.

Distributed ARIMA models for ultra-long time series (6)

Distributed ARIMA models for ultra-long time series (7)

Figure8 presents the MSIS results of forecasting with DARIMA,ARIMA, and ETS across different confidence levels varying from $50\%$ to $99\%$ . We observe thatDARIMA persistently results in better forecasting accuracy than ARIMA and ETS in terms of MSISacross various confidence levels. Besides, the superiority of DARIMA tends to be moresubstantial as the confidence level increases.

Distributed ARIMA models for ultra-long time series (8)

The aforementioned results mainly focus on the forecasting accuracy of DARIMA against thebenchmark methods. Now we compare DARIMA to ARIMA and ETS in terms of execution time toinvestigate the computational efficiency of DARIMA, as shown in Figure9.The forecasting performance of DARIMA varies with the number of subseries (seeSection4.5 for details), but not with the number of executorsin the Spark system. More specifically, the number of executors determines how many tasks(i.e.subproblems of ARIMA modeling) are assigned to each executor, which greatly affectsthe runtime in individual executors. When it comes to ARIMAmodeling for the whole time series, increasing the number of cores used can also give a significantspeedup as the specification search is done in parallel. However, for a given ultra-longtime series, ETS modeling can not be implemented in a parallel manner.So the execution time of ETS modeling does not change along with the number of virtual cores used.Figure9 shows improved computational efficiency of both ARIMA and DARIMA with increasingnumbers of executors/cores. Besides, DARIMA persistently results in less execution timethan ARIMA and ETS when using more than two executors/cores. In our application, modeling a DARIMAmodel for an ultra-long time series with the length of about $120,000$ takes an average of $1.22$ minutes with $32$ cores, while ARIMA modeling takes an average of $5.16$ minutes and ETS takes an average of $5.38$ minutes. Therefore, our approach results in significantlyimproved forecasting accuracy with remarkably less execution time compared to ARIMA and ETS models.

Distributed ARIMA models for ultra-long time series (9)

4.5 Sensitivity Analysis

This subsection focuses on the factors that may affect the forecast performance of thedistributed ARIMA models. In the following analysis, we consider two main factors: thenumber of split subseries and the maximum values of model orders. Other potential factorswill be discussed in Section6.

We first explore the relationship between the forecasting performance of the distributedARIMA models and the number of subseries $K$ preset in the preprocessing process, aspresented in Figure10. In essence, the relationship also depicts theimportance of the length of subseries to the functioning of the distributed ARIMAmodels. With the number of subseries $K$ varying from $10$ to $100$ , there is aconsiderable drop in the MASE values of DARIMA. It then slightly fluctuates when $K$ isbetween $100$ and $300$ , and has an enormous growth when $K$ equals to $350$ . Subsequently, the MASE values of DARIMA go up and down widely with a larger $K$ . Besides, the MSIS of DARIMA shows an overall trend of decreasing first and thenincreasing. Therefore, we conclude that the number of subseries should be controlledwithin a reasonable range, with too few or too many subseries causing poor forecastingperformance. In our study, we should limit the number of subseries between $100$ to $300$ .

Distributed ARIMA models for ultra-long time series (10)

Table3 compares the forecasting performance of DARIMA with that of ARIMAunder different settings of the maximum values of model orders in terms of MASE andMSIS. The maximum value of $p+q+P+Q$ only works for the process of global orderselection. Therefore, when we keep the maximum values of non-seasonal and seasonal partsfixed, the changes in the maximum value of $p+q+P+Q$ result in some changes in theforecasting accuracy of ARIMA, but no changes in that of the DARIMA. If the model ordersare allowed to range more widely, ARIMA achieves better forecasting performance on bothpoint forecasts and prediction intervals. The main reason is that the broader range ofmodel orders provides more possible models in the order selection process. In contrast,DARIMA performs higher MASE when more possible models are provided. One possible reasonfor this result is that allowing more extensive choices of model orders may lead to overfittingfor subseries with short lengths. Moreover, Table3 shows that themaximum values of model orders have a limited impact on forecasting performance: thechanges in performance both for ARIMA and DARIMA gradually disappear as the maximum ordersincreases. We also compare the results using the symmetric mean absolute percentage error(makridakis1993accuracy, 37, sMAPE;) and obtain almost identical results from thosewith MASE. As expected, DARIMA always outperforms ARIMA on different settings of themaximum values of model orders for both point forecasts and prediction intervals.

We proceed by comparing our proposed DARIMA with ARIMA regarding execution time ondifferent settings of the maximum values of model orders, as shown inTable3. The results indicate that DARIMA is more computationallyefficient than ARIMA in multiple settings of the maximum values of model orders. When themodel orders are allowed to take a broader range of values, both DARIMA and ARIMA takemore time modeling the time series. The execution time spent on ARIMA modeling has amarked increase, while DARIMA keeps its modeling time within a reasonable and acceptablerange. For example, DARIMA is $53$ times more efficient than ARIMA on the setting ofmax.orders being equal to $(8,4,10)$ . The improved efficiency makes it possible for DARIMA to search for anappropriate model for each subseries in a broader range of candidate models.

Max orders	Method	MASE	MSIS	Execution time
				(mins)
(5, 2, 5)	ARIMA	1.430	19.733	4.596
	DARIMA	1.297	15.078	1.219
(5, 2, 7)	ARIMA	1.410	18.695	14.189
	DARIMA	1.297	15.078	1.211
(6, 2, 7)	ARIMA	1.410	18.695	15.081
	DARIMA	1.298	15.108	1.326
(6, 3, 7)	ARIMA	1.413	15.444	21.072
	DARIMA	1.324	12.590	1.709
(6, 3, 10)	ARIMA	1.413	15.654	76.272
	DARIMA	1.324	12.590	1.769
(7, 3, 10)	ARIMA	1.413	15.654	83.077
	DARIMA	1.327	12.561	1.829
(7, 4, 10)	ARIMA	1.409	13.667	111.292
	DARIMA	1.338	12.079	2.267
(8, 4, 10)	ARIMA	1.409	13.667	117.875
	DARIMA	1.335	12.076	2.224

5 Numerical Simulations

In this section, we perform a simulation study to further investigate and justify our proposedDARIMA method in terms of forecasting accuracy and computational cost.

5.1 Simulation Setup

We consider daily, hourly, and half-hourly series and generate $1,000$ timeseries samples in each case. Each series is generated by an ARIMA $(p,d,q)(P,D,Q)_{m}$ process with $d$ being randomly sampled from Bernoulli( $0.9$ ), $D$ being randomly sampledfrom Bernoulli( $0.4$ ), $p$ and $q$ each taking values from a uniform distribution on $\{0,1,2,3,4,5\}$ , and $P$ and $Q$ each taking values $0$ , $1$ and $2$ with equalprobability. The periods $m$ of the simulated series are set to be $7$ , $24$ and $48$ tomatch daily, hourly and half-hourly time series. For each generated series, theparameters of each process are randomly chosen from the uniform distribution U( $-2,2$ )over the stationary and invertible space.

We divide each series into three parts: the first $m\times 10$ observations are discarded asburn-in, the following $T$ observations are used as a training set for estimating parameters, andthe last $H$ observations are used for testing. For daily, hourly, and half-hourly series, $T$ takes the values $8,000$ , $100,000$ and $200,000$ respectively, while $H$ takes thevalues $100$ , $2,000$ and $4,000$ respectively.

Final forecasts are produced using four different methods:(i) distributed ARIMA modeling as introduced in Section3 (DARIMA);(ii) simply averaging the estimated parameters for split subseries when implementing DARIMA (DARIMA_SA);(iii) automatic ARIMA modeling for the whole series with a single model (ARIMA);and (iv) ETS modeling for the whole time series (ETS).For comparison purposes, DARIMA and DARIMA_SA share the samesettings for time series partitions. Specifically, each daily series is partitioned into $21$ subseries with each subseries spanning at least $52$ weeks, and each hourly andhalf-hourly series is split into $138$ subseries so that each subseries spans at least $30$ days. Other settings are consistent with that in the application to electricitydata, see Section4.2 for more details.

5.2 Results

We choose three metrics to evaluate the performance of our proposed method, which are MASE,MSIS, and ACD (absolute coverage difference). To assess prediction intervals, we set $\alpha=0.05$ (corresponding to $95\%$ prediction intervals). As asupplemental scoring rule, ACD measures the absolute difference between the actual coverageof the target method and the nominal coverage, where the coverage measures the rate at which thetrue values lie inside the prediction intervals the method provides.

Table4 displays the forecasting accuracy of DARIMA as well as three methodsconsidered as benchmarks in this study. The accuracy is reported for short-term (four weeks)and long-term (remaining periods) horizons separately as well as across all forecast horizons.Moreover, the multiple comparisons with the best (koning2005m3, 30, MCB,) test isperformed on each data frequency to identify whether the average ranks, based on MASE andMSIS, of the four models considered are significantly different, as presented in Figure11.If the intervals of two methods do not overlap, this indicates a statistically different performance.

The results indicate that, for daily and half-hourly series, the DARIMA method consistently achievesthe best forecast accuracy in terms of the mean values of MASE and MSIS, especially for long-termforecasting. The corresponding MCB results indicate that DARIMA achieves the best-ranked performanceas well, except that it ranks second in MASE for the half-hourly frequency but without significantlydiffering from the best.

In terms of hourly time series, DARIMA provides better forecasts, measured by the average MASEand MSIS values, than ARIMA and DARIMA_SA, but worse forecasts than ETS. However, the main comparisonwe are interested in is DARIMA versus ARIMA. Moreover, the correspondingMCB results show that DARIMA ranks third but is very close to the top two methods interms of MASE, while it ranks best and displays a clear gap from the remaining ones when MSISis used for conducting the MCB test.A more in-depth analysis shows that DARIMA provides about $20.77\%$ more accurate intervalforecasts than ETS, based on the median of MSIS, with a smaller degree of variation.This demonstrates that DARIMA tends to provide more stable forecasts than the competing methods.Additionally, when there are subseries that are poorly fitted by ARIMA models, the use of DARIMA helps eliminatethe influence of outliers on the forecast performance, which can not be achieved using theDARIMA_SA method.

However, we observe that DARIMA may result in lower-than-nominal coverage andyield ACD values that are higher than ARIMA, but lower than or comparable to DARIMA_SA. The lossof the efficiency of the estimator may be attributed to the simplified treatment ofsetting $\widehat{\Sigma}_{k}$ with $\hat{\sigma}_{k}^{2}I$ in Section3.2.Furthermore, DARIMA, on average, ranks higher in MSIS than ARIMA, DARIMA_SA, and ETS as shown inFigure11, thus enabling optimal decision making with a comprehensive understandingof the uncertainty and the resulting risks. Thus, we conclude that DARIMA does not only provide stableforecasts across different frequencies, but also achieve improved or at least comparable forecastscompared to the benchmark methods.

	Daily			Hourly			Half-hourly
	Short	Long	Total	Short	Long	Total	Short	Long	Total
	MASE
DARIMA	0.700	1.985	1.626	3.277	11.543	8.766	2.622	8.472	6.506
DARIMA_SA	0.704	1.997	1.635	$4.002^{\dagger}$	$12.911^{\dagger}$	$9.917^{\dagger}$	$2.879^{\dagger}$	$9.044^{\dagger}$	$6.973^{\dagger}$
ARIMA	0.793	2.670	2.144	3.609	14.517	10.852	3.991	15.144	11.397
ETS	0.838	2.143	1.778	2.442	7.764	5.976	8.705	24.663	19.301
	MSIS
DARIMA	3.921	11.574	9.431	44.671	200.993	148.469	24.608	104.287	77.515
DARIMA_SA	4.027	11.815	9.634	$74.645^{\dagger}$	$256.492^{\dagger}$	$195.391^{\dagger}$	$36.263^{\dagger}$	$134.006^{\dagger}$	$101.164^{\dagger}$
ARIMA	4.683	20.993	16.426	62.030	490.435	346.491	90.331	710.849	502.355
ETS	5.688	17.713	14.346	30.671	130.105	96.696	137.094	614.441	454.053
	ACD
DARIMA	0.003	0.006	0.005	0.099	0.140	0.126	0.067	0.125	0.106
DARIMA_SA	0.007	0.002	0.003	0.131	0.173	0.159	0.097	0.150	0.132
ARIMA	0.002	0.011	0.009	0.000	0.001	0.001	0.003	0.012	0.009
ETS	0.004	0.032	0.024	0.107	0.140	0.129	0.039	0.086	0.070

Distributed ARIMA models for ultra-long time series (11)

Distributed ARIMA models for ultra-long time series (12)

In order to investigate the computational cost of our proposed method, we proceed byvisualising the computational time of DARIMA and the three benchmark methods for each datafrequency, as presented in Figure12. It should be noted that the executiontime of DARIMA_SA is not included because there is no significant difference in the execution timebetween DARIMA and DARIMA_SA: they differ only in terms of how to combine the estimated parameterstrained on each subseries, and the time required for the combination step is negligible.Moreover, the execution time of ETS does not change along with the number of virtualcores used, since ETS modeling for a single time series cannot be extended to parallel processing.We observe that, as the length of the time series data of interest increases, both DARIMAand ARIMA take more time modeling the time series. The runtime required for both ARIMAand DARIMA modeling decreases as the number of executors/cores increases. However,DARIMA is computationally more efficient than ARIMA and ETS and keeps its runtime and forecastswithin a reasonable and acceptable range when using more than eight executors.

6 Discussion

Advances in technology have given rise to increasing demand for forecasting time seriesdata spanning a long time interval, which is extremely challenging to achieve. Attempts totackle the challenge by using MapReduce technology typically focus on two mainstreamdirections: combining forecasts from multiple models bendre2019time (3) and splittingthe multi-step forecasting problem into $H$ (forecast horizon) subproblemsgalicia2018novel (16). On the other hand, the statistical computation can beimplemented on a distributed system by aggregating the information about local estimatorstransmitted from worker nodes fan2019distributed (14, 58). The approachresults in the combined estimator proven to be statistically as efficient as the globalestimator. Inspired by the solution, this study provides a new way to forecast ultra-longtime series on a distributed system.

One of our developed framework highlights is that the distributed forecasting framework isdedicated to averaging the DGP of subseries to develop a trustworthy global model for timeseries forecasting. To this end, instead of unrealistically assuming the DGP of timeseries data remains invariant over an ultra-long time periodhyndman2018forecasting (22), we customize the optimization process of model parametersfor each subseries by only assuming that the DGP of subseries stays invariant over a shortperiod, and then aggregate these local parameters to produce the combined global model. Inthis way, we provide a complete novel perspective of forecasting ultra-long time series,with significantly improved computational efficiency.

As illustrated in Section3, this study focuses on implementing thedistributed time series forecasting framework using general ARIMA models that allow theinclusion of additional seasonal terms to deal with strong seasonal behavior.Nevertheless, it is also possible to apply the framework with other statistical models,such as state-space models, VAR models, and ETS, as a general loss function is considered.However, special concerns should be given on how to properly convert the local estimatorsto avoid the inefficiency in the combined estimators. In this work, we restrict our attentionto ARIMA models and involve a linear transformation step, in which ARIMA models trained onsubseries are converted into linear representations to avoid the stationary, causality,and invertibility problems that may be caused by directly combining the original parametersof ARIMA models. Similar to ARIMA models, ETS models share the virtue of allowing the trend andseasonal components to vary over time hyndman2002state (25). We hope to shed some lighton using distributed ETS models in the future.

The forecasting performance of the distributed ARIMA models is affected by variousfactors. Two factors, the number of split subseries and the maximum values of modelorders, are taken into consideration as described in Section4.5. Ourresults show that the number of subseries should be limited to a reasonable range toachieve improved performance in point forecasts and prediction intervals. Specifically, werecommend that subseries’ length ranges from $500$ to $1200$ for hourly timeseries. Moreover, compared to ARIMA models, a smaller maximum value of model order issufficient for the distributed ARIMA models to fit models for all subseries and obtainimproved forecasting results according to the combined estimators.

Many other potential factors may hold sway over the forecasting performance of ourproposed approach. For example, whether to set an overlap between successive subseries maybe a practical consideration when implementing the proposed distributed forecastingframework. Through repeated surveys, (scott1974analysis, 43, ) explore the effect ofwhether to overlap the random samples at each period on the estimation of populationparameters. They illustrate that considering the overlap between samples offers reductionsin the variance; they also discuss the optimum proportion of overlap. Therefore, webelieve that a study on setting overlap between successive subseries will further improveour framework, and our framework and computer code are generally applicable to such ascenario. To take another example, we may consider adding time-dependent weights for eachsubseries when combining the local estimators delivered from a group of worker nodes. Thetime-dependent weights for subseries help assign higher weights to subseries closer to theforecast origin, while smaller weights to subseries that are further away from theforecast origin.

7 Conclusions

In this paper, we propose a novel framework for ultra-long time series forecasting on adistributed system. Unlike previous attempts in the forecasting literature, this studyfacilitates distributed time series forecasting by taking a weighted average of the localestimators delivered from worker nodes to minimize the global loss function. To this end,an ultra-long time series spanning a long stretch of time is divided into severalsubseries spanning short time periods. Specifically, in this study, we focus onimplementing our proposed framework on ARIMA models to enable the ARIMA estimation ofultra-long time series in a distributed manner.

We investigate the performance of the distributed ARIMA models in both the real dataapplication and the simulations and compare the proposed approach against ARIMA modelsconcerning point forecasts, prediction intervals, and execution time. We find that thedistributed ARIMA models outperform ARIMA models in both point forecasting and uncertaintyestimation. The achieved performance improvements become more pronounced as the forecasthorizon increases. Finally, the comparison of execution time shows that our approach alsoachieves better forecasting performance with improved computational efficiency. We alsoexplore various factors that may affect the forecasting performance of the distributedARIMA models, such as the number of split subseries, the maximum values of model orders,overlap between successive subseries, and time-dependent weights for subseries. To furtherimprove the research on distributed forecasting methods, we suggest some possible researchavenues. For example, it would be meaningful to explore distributed ETS models in thefuture.

Acknowledgments

The authors are grateful to the editors and two anonymous reviewers for helpful commentsthat improved the contents of the paper.

Yanfei Kang is supported the National Natural Science Foundation of China (No. 72171011)and Feng Li is supported by the Emerging Interdisciplinary Project ofCUFE and the Beijing Universities Advanced Disciplines Initiative (No. GJJ2019163). Thisresearch is supported by Alibaba Group through the Alibaba Innovative Research Programand the high-performance computing (HPC) resources at Beihang University.

References

(1)Robin Anil et al.“Apache Mahout: Machine Learning on Distributed Dataflow Systems”In Journal of Machine Learning Research 21.127, 2020, pp. 1–6
(2) Apache Software Foundation“Apache Spark”, 2020URL: https://spark.apache.org
(3)Mininath Bendre and Ramchandra Manthalkar“Time series decomposition and predictive analytics using MapReduce framework”In Expert Systems with Applications 116Elsevier, 2019, pp. 108–120
(4)George EP Box, Gwilym M Jenkins, Gregory C Reinsel and Greta M Ljung“Time series analysis: forecasting and control”John Wiley & Sons, 2015
(5)Stephen Boyd, Neal Parikh and Eric Chu“Distributed optimization and statistical learning via the alternating direction method of multipliers”In Foundations and Trends in Machine Learning 3.1, 2011, pp. 1–122
(6)Peter J Brockwell and Richard A Davis“Introduction to time series and forecasting”Switzerland: Springer International Publishing, 2016
(7)Rodrigo N Calheiros, Enayat Masoumi, Rajiv Ranjan and Rajkumar Buyya“Workload prediction using ARIMA model and its impact on cloud applications’ QoS”In IEEE Transactions on Cloud Computing 3.4IEEE, 2014, pp. 449–458
(8)Fabio Canova and Bruce E Hansen“Are seasonal patterns constant over time? A test for seasonal stability”In Journal of Business & Economic Statistics 13.3Taylor & Francis Group, 1995, pp. 237–252
(9)Xi Chen, Weidong Liu and Yichen Zhang“Quantile regression under memory constraint”In Annals of Statistics 47.6Institute of Mathematical Statistics, 2019, pp. 3244–3273
(10)Angelo Coluccia and Giuseppe Notarstefano“A Bayesian framework for distributed estimation of arrival rates in asynchronous networks”In IEEE Transactions on Signal Processing 64.15IEEE, 2016, pp. 3984–3996
(11)Srinjoy Das and Dimitris N Politis“Predictive inference for locally stationary time series with an application to climate data”In Journal of the American Statistical Association 116.534Taylor & Francis, 2020, pp. 919–934
(12)Jianqing Fan, Yingying Fan and Jinchi Lv“High dimensional covariance matrix estimation using a factor model”In Journal of Econometrics 147.1Elsevier, 2008, pp. 186–197
(13)Jianqing Fan, Yuan Liao and Martina Mincheva“High dimensional covariance matrix estimation in approximate factor models”In Annals of Statistics 39.6NIH Public Access, 2011, pp. 3320
(14)Jianqing Fan, Dong Wang, Kaizheng Wang and Ziwei Zhu“Distributed estimation of principal eigenspaces”In Annals of Statistics 47NIH Public Access, 2019, pp. 3009–3031
(15)Jianqing Fan and Wenyang Zhang“Statistical methods with varying coefficient models”In Statistics and its Interface 1NIH Public Access, 2008, pp. 179–195
(16)Antonio Galicia, José F Torres, Francisco Martínez-Álvarez and A Troncoso“A novel spark-based multi-step forecasting algorithm for big data time series”In Information Sciences 467Elsevier, 2018, pp. 800–818
(17)Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung“The Google file system”In Proceedings of the nineteenth ACM symposium on Operating systems principles, 2003, pp. 29–43
(18)Tilmann Gneiting and Adrian E Raftery“Strictly proper scoring rules, prediction, and estimation”In Journal of the American Statistical Association 102.477Taylor & Francis, 2007, pp. 359–378
(19)Carla Gonçalves, Ricardo J Bessa and Pierre Pinson“A critical overview of privacy-preserving approaches for collaborative forecasting”In International Journal of Forecasting 37.1Elsevier, 2021, pp. 322–342
(20)Tao Hong, Jingrui Xie and Jonathan Black“Global energy forecasting competition 2017: Hierarchical probabilistic load forecasting”In International Journal of Forecasting 35.4Elsevier, 2019, pp. 1389–1399
(21)Rob J Hyndman, Roman A Ahmed, George Athanasopoulos and Han Lin Shang“Optimal combination forecasts for hierarchical time series”In Computational Statistics & Data Analysis 55.9Elsevier, 2011, pp. 2579–2589
(22)Rob J Hyndman and George Athanasopoulos“Forecasting: principles and practice”OTexts, 2021URL: https://OTexts.com/fpp3
(23)Rob J Hyndman and Yeasmin Khandakar“Automatic Time Series Forecasting: The forecast Package for R”In Journal of Statistical Software 27jstatsoft.org, 2008, pp. 1–22DOI: 10.18637/jss.v027.i03
(24)Rob J Hyndman and Anne B Koehler“Another look at measures of forecast accuracy”In International Journal of Forecasting 22.4Elsevier, 2006, pp. 679–688
(25)Rob J Hyndman, Anne B Koehler, Ralph D Snyder and Simone Grose“A state space framework for automatic forecasting using exponential smoothing methods”In International Journal of Forecasting 18.3Elsevier, 2002, pp. 439–454
(26)Michael I Jordan, Jason D Lee and Yun Yang“Communication-Efficient Distributed Statistical Inference”In Journal of the American Statistical Association 114.526Taylor & Francis, 2019, pp. 668–681
(27)Mirko Kämpf and Jan W Kantelhardt“Hadoop.TS: large-scale time-series processing”In International Journal of Computer Applications 74.17Foundation of Computer Science, 2013, pp. 1–8
(28)Yanfei Kang, Rob J Hyndman and Feng Li“GRATIS: GeneRAting TIme Series with diverse and controllable characteristics”In Statistical Analysis and Data Mining 13, 2020, pp. 354–376
(29)Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I Jordan“A scalable bootstrap for massive data”In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.4Wiley Online Library, 2014, pp. 795–816
(30)Alex J Koning, Philip Hans Franses, Michele Hibon and Herman O Stekler“The M3 competition: Statistical tests of the results”In International Journal of Forecasting 21.3Elsevier, 2005, pp. 397–409
(31)Denis Kwiatkowski, Peter CB Phillips, Peter Schmidt and Yongcheol Shin“Testing the null hypothesis of stationarity against the alternative of a unit root”In Journal of Econometrics 54.1-3, 1992, pp. 159–178
(32)Jason D Lee, Qiang Liu, Yuekai Sun and Jonathan E Taylor“Communication-efficient sparse regression”In Journal of Machine Learning Research 18.5, 2017, pp. 1–30
(33)Lei Li, Farzad Noorian, Duncan JM Moss and Philip HW Leong“Rolling window time series prediction using MapReduce”In Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), 2014, pp. 757–764IEEE
(34)Xixi Li, Yanfei Kang and Feng Li“Forecasting with time series imaging”In Expert Systems with Applications 160, 2020, pp. 113680
(35)Qiang Liu and Alexander Ihler“Distributed estimation, information loss and exponential families”In Advances in Neural Information Processing Systems, 2014, pp. 1098–1106
(36)Dougal Maclaurin and Ryan Prescott Adams“Firefly Monte Carlo: Exact MCMC with subsets of data”In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015
(37)Spyros Makridakis“Accuracy measures: theoretical and practical concerns”In International Journal of Forecasting 9.4Elsevier, 1993, pp. 527–529
(38)Spyros Makridakis, Evangelos Spiliotis and Vassilios Assimakopoulos“The M4 Competition: 100,000 time series and 61 forecasting methods”In International Journal of Forecasting 36.1Elsevier, 2020, pp. 54–74
(39)Xiangrui Meng et al.“MLlib: Machine learning in apache spark”In Journal of Machine Learning Research 17.1JMLR. org, 2016, pp. 1235–1241
(40)Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman and Thiyanga S Talagala“FFORMA: Feature-based forecast model averaging”In International Journal of Forecasting 36.1Elsevier, 2020, pp. 86–92
(41)Rui Pan et al.“A Note on Distributed Quantile Regression by Pilot Sampling and One-Step Updating”In Journal of Business and Economic Statistics In Press, 2021DOI: 10.1080/07350015.2021.1961789
(42)Fotios Petropoulos et al.“Forecasting: theory and practice”In International Journal of Forecasting, 2022
(43)AJ Scott and TMF Smith“Analysis of repeated surveys using time series methods”In Journal of the American Statistical Association 69.347Taylor & Francis Group, 1974, pp. 674–678
(44)Ohad Shamir, Nati Srebro and Tong Zhang“Communication-efficient distributed optimization using an approximate Newton-type method”In International Conference on Machine Learning, 2014, pp. 1000–1008PMLR
(45)Han Lin Shang and Rob J Hyndman“Grouped functional time series forecasting: An application to age-specific mortality rates”In Journal of Computational and Graphical Statistics 26.2Taylor & Francis, 2017, pp. 330–343
(46)Benedikt Sommer, Pierre Pinson, Jakob W Messner and David Obst“Online distributed learning in wind power forecasting”In International Journal of Forecasting 37.1Elsevier, 2021, pp. 205–223
(47)Marc A Suchard et al.“Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures”In Journal of Computational and Graphical Statistics 19.2Taylor & Francis, 2010, pp. 419–438
(48)R Talavera-Llames, Rubén Pérez-Chacón, A Troncoso and Francisco Martínez-Álvarez“Big data time series forecasting based on nearest neighbours distributed computing with Spark”In Knowledge-Based Systems 161Elsevier, 2018, pp. 12–25
(49)Andrew S Tanenbaum and Maarten Van Steen“Distributed systems: principles and paradigms”Prentice-Hall, 2007
(50)Ruey S Tsay“Analysis of financial time series”Hoboken, NJ, USA: John Wiley & Sons, 2010
(51)Stanislav Volgushev, Shih-Kang Chao and Guang Cheng“Distributed inference for quantile regression processes”In Annals of Statistics 47.3Institute of Mathematical Statistics, 2019, pp. 1634–1662
(52)Jialei Wang, Mladen Kolar, Nathan Srebro and Tong Zhang“Efficient distributed learning with sparsity”In International Conference on Machine Learning, 2017, pp. 3636–3645PMLR
(53)Xiangyu Wang and David B Dunson“Parallelizing MCMC via Weierstrass sampler”In arXiv preprint arXiv:1312.4605, 2013
(54)Xiaoqian Wang, Yanfei Kang, Fotios Petropoulos and Feng Li“The uncertainty estimation of feature-based forecast combinations”In Journal of the Operational Research SocietyTaylor & Francis, 2021DOI: 10.1080/01605682.2021.1880297
(55)K.V. Yuen“Bayesian methods for structural dynamics and civil engineering”Singapore: John Wiley & Sons, 2010
(56)Yuchen Zhang, John Duchi and Martin Wainwright“Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates”In Journal of Machine Learning Research 16.1JMLR. org, 2015, pp. 3299–3340
(57)Yuchen Zhang, John C Duchi and Martin J Wainwright“Communication-efficient algorithms for statistical optimization”In Journal of Machine Learning Research 14.1JMLR. org, 2013, pp. 3321–3363
(58)Xuening Zhu, Feng Li and Hansheng Wang“Least-Square Approximation for a Distributed System”In Journal of Computational and Graphical Statistics 30.4, 2021, pp. 1004–1018DOI: 10.1080/10618600.2021.1923517

Appendix A Background of Distributed Systems

A distributed system, usually used for distributed computing, is a system with a group ofinteracting computing nodes connected by a network tanenbaum2007distributed (49). Theseautonomous computers share resources, work together, and coordinate their activities tofulfill specified tasks, just like a single computer via a MapReduce framework. Whendealing with large-scale problems, distributed systems provide a new solution that sendsthe computing code to each computer node where data are also distributed stored. TheMapReduce is short for the “move-code-to-data” computing architecture that enables us toscale horizontally by adding more computing nodes, rather than scale vertically, byupgrading a single node’s hardware.

Inspired by the Google File System paper ghemawat2003google (17) that describesGoogle’s algorithm for distributed data-intensive applications,Hadoop ecosystem has been developed in the data science communityas an open-source platform that allows for thedistributed storage and processing of large-scale data sets. Such an ecosystem is thede-facto standard for large scale distributed computing in data analyticssectors. Nonetheless, the existing distributed systems equip with machine learning libraries meng2016mllib (39, 1) butlack forecasting support. Forecasters have to make unrealisticindependence assumptions for modeling large scale time series data to fit in theecosystem. Developing and integrating forecasting methods into such distributed systems hasgreat potential.

Apache Sparkspark (2) is the most popular distributed execution engine used for bigdata processing in the distributed ecosystem. With in-memory processing, Spark does notspend excess time moving data in and out of the disk, which achieves significantly faster(up to $100$ times) computation. Besides, Spark supports computer languages (e.g., Java, Scala, R, andPython) that are widely used in the machine learning and forecasting domains, making itdeveloper-friendly. Spark also offers a stack of libraries, such as MLlib for machinelearning, Spark Streaming for real-time processing, Spark SQL for interactive queries, andGraphX for graph processing, which provides easy-to-use analytics in many cases.

Appendix B Pseudocode for Distributed Forecasting

This section provides the pseudocode for Mapper and Reducer of the proposed distributedforecasting approach in Section3.

$\big{\langle}key,value\big{\rangle}$

$\big{\langle}key,(\widehat{\theta},\widehat{\Sigma})\big{\rangle}$

1:Start

2: $y_{t}\leftarrow\text{time series data}$

3: $T\leftarrow\text{time series length}$

4: $K\leftarrow\text{number of split subseries}$

5: $\text{index}\leftarrow\text{index vector of the assigned subseries}$

7: $n=\text{floor}(T/K)$

8:for $i$ in indexdo

9: $\text{lbound}=n\times(i-1)+1$

10: $\text{ubound}=\text{ifelse}(i\geq K,T,n\times i)$

11: $y=y_{t}[\text{lbound},\text{ubound}]$ $\triangleright$ Step $1$

12: $\text{fit}=\text{model}(y)$ $\triangleright$ Step $2$

13: $\text{fit}^{{}^{\prime}}=\text{model.to.linear}(\text{fit})$ $\triangleright$ Step $3$

14: $\widehat{\theta}=\text{fit}^{{}^{\prime}}.\text{coef}$

15: $\widehat{\Sigma}=\text{fit}^{{}^{\prime}}.\text{var.coef}$

16:endfor

17:Stop

$\big{\langle}\text{key},\text{list}(\widehat{\theta},\widehat{\Sigma})\big{\rangle}$

$\big{\langle}\text{key},\text{forecvalues}\big{\rangle}$

1:Start

2: $y_{t}\leftarrow\text{time series data}$

3: $H\leftarrow\text{forecast horizon}$

4: $level\leftarrow\text{confidence levels for prediction intervals}$

6: $\widetilde{\text{fit}}=\text{Comb.method}\big{(}\text{list}(\widehat{\theta},%\widehat{\Sigma})\big{)}$ $\triangleright$ Step $4$

7: $\text{forec}=\text{forecast}\big{(}\widetilde{\text{fit}},y_{t},H,\text{level}%\big{)}$ $\triangleright$ Step $5$

8: $\text{pred}=\text{forec.mean}$ $\triangleright$ point forecasts

9: $\text{lower}=\text{forec.lower}$ $\triangleright$ lower bound of prediction intervals

10: $\text{upper}=\text{forec.upper}$ $\triangleright$ upper bound of prediction intervals

11:Stop