1. Introduction
Recurrent Neural Networks (RNNs) are a class of Deep Learning models that are fundamentally designed to handle sequential data. Unlike feedforward neural networks, RNNs possess the unique feature of maintaining a memory of previous inputs by using their internal state to process sequences of inputs. This makes them ideally suited for applications where the order of data points is crucial [1]. With the increase in the size of the dataset and the improvement of the complexity of the model, the demand for computational strength and storage space in the training process also increases proportionally. The present work is placed in the context of the design of parallel algorithms for RNNs in large-scale applications where parallelizing RNNs can be challenging due to their sequential nature. We focus on RNNs with an additive loss function consisting of a large number of component functions, let us say fi, such as
where each term corresponds to the error between some data and the output of a parametric model, with being the vector of parameters. An example is linear least squares problems, where fi has a quadratic structure, except for a regularization function. A more general class of additive cost problems is nonlinear least squares [2,3].
Minimization of the Loss Function (LF) causes the most training overhead. The structure of the additive cost function has facilitated the use of parallel computing approaches that are well-suited for the incremental approach [4]. Very often, besides the exploitation of Graphics Processing Units' acceleration [5], concurrency is introduced inside the operations of each minimization step, at the cost of an all-to-one data synchronization at the end of each step (this is the simplest way to achieve what is defined in the context of deep learning as data parallelism or the mini-batch gradient descent approach). In the alternative, the incremental computing of the values and subgradients of the components fi could be performed in a distributed manner. This is defined as model parallelism. Naive model parallelism is relatively simple. It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between inputs and outputs of layers, so a naive implementation can lead to a large amount of idle time while a worker waits for outputs from the previous machine to be used as its inputs. We can reuse the ideas from data parallelism by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computations with wait time. Pipeline parallelism allows the execution of a model to be partitioned such that multiple micro-batches can execute different parts of the model code concurrently [6-11].
In the presence of evolutionary problems, in the last decades, Parallel-In-Time methods have been investigated for reducing the temporal dimensionality of such problems. Since Nievergelt, in 1964, proposed for the first time the decomposition algorithm for finding the parallel solutions of evolutionary ordinary differential equations [12], the methods of time-parallel time integration have been extensively expanded and several relevant works can be found in the literature. An extensive and updated literature list can be found on the website [13] collecting information about people, methods, and software in the field of parallel-in-time integration methods. By relying on results we have obtained on the Kalman Filter and on variational methods on the uncertainty quantification models [14-16], in this work we present the main idea underlying an ab-initio model decomposition in time of RNNs. This approach implies both the time series and the objective function decomposition among computing resources. Local functionals are suitably modified, by imposing a regularization constraint in order to enforce the matching of their solutions between adjacent sub-sequences. Finally, according to the Additive Schwarz Method (ASM), synchronization of local solutions is imposed iteratively. Such synchronization guarantees the model convergence [16].
2. Model decomposition-in-time of RNNs
In the following, we summarize the key components of the hybrid decomposition of RNNs. In particular, we introduce the operators defining data reduction and localization then, finally, we give the mathematics underlying decomposition in time of RNNs. For simplicity, we consider the simplest RNN, with a self-connected hidden state, through which information cycles across time steps [17]. The decomposition approach could be extended to other types of RNN.
2.1 Recurrent neural networks
RNNs are a class of deep learning models that are fundamentally designed to handle sequential data. At each time step, t = 1,...,q, the RNN takes an input vector,
and returns the output vector
, s.t.
where
is the hidden state vector that captures information about previous inputs, Wxh is the weight matrix between the input and hidden layer, Whh is the weight matrix for the recurrent connection (hidden-to-hidden), bt and bh are the bias vectors, and finally σy and σh are the activation functions. The weight matrices and the bias functions characterize the unknown model. These parameters are obtained by minimizing a predefined objective Function.
For convenience, at each time t = 1,..., q we concatenate vectors xt and ht−1 to form a m0 + m1 dimensional vector at. Then, if we pose
where
, we write all the weights together in the matrix
It results that equations (2) and (3) can be written as
Let
be the matrix obtained by each r.h.s. of equation in (4);
be the vector obtained by stacking the t r.h.s of the equation in (4) on top of one another; let
be the column vector obtained collecting the vector yt (output of the RNN) q times. Finally, let
be the matrix whose t-th row is obtained by the vectors at, t = 1...,q. The equation in (4) can be written as the overdetermined linear system [18,19].
Ag = z
(Least Square Minimization in RNNs) The least-square minimization problem refers to the computation of g such that [20]:
with objective function
2.2 Model decomposition in time
The Model Decomposition in Time direction implies the reduction of both the time series and the objective function among computing resources. This is obtained by data reduction operation and functional localization, respectively. Local functions are obtained as a suitable modification of the original objective function. In particular, local functions are composed of two parts: the first part is the restriction of the original objective function to the sub-interval, then a regularization constraint is added in order to enforce the matching of their solutions between adjacent sub-sequences.
(Data Reduction) Let w = [wt wt+1 ... wn]T ∈ Rs be a vector with t ≥ 1 , n > 0, s = n − t and Ir = {1,...,r}, r > n and n > t. The extension of w to Ir is:
where for i = 1,...,r
Let B = [B1 B2 ... Bn] ∈ Rm×n be such that Bj is the j −th column of B. For i < j (i = 1,...,n − 1 and j = 2,...,n) we define the set Ii,j = {i,...,j}. The restriction of B to the set Ij is:
and to Ii,j is:
(Objective Function Reduction) Let n1 ,n2 > 0 two natural numbers less than q. Let
denote the restriction of J defined in (6).
For simplicity of notations, we let Ji,jm ≡ J|(Ii,Ij) with i, j = 1,2. We consider the decomposition of RNNs in two subsets only.
(In Time DD-RNN setup) Let I = {1,...,q} be the index set of rows of A. DD-RNN setup consists in decomposing I into 2 sets:
where
,
, and
is the overlap set. If s = 0, then I1 ∩ I2 = ∅ and I1,2 ≠ ∅. Restrictions of A to I1 and I2 are:
(In Time DD-RNN Algorithm) According to the ASM in [21], we introduce the loop for k = 0,1,2,..., the model decomposition-in-time leads to two RNNs solving iteratively the following reduced least square minimization problems:
O1,2 is the overlapping operator representing the data exchange on the overlap set I1,2, and µ > 0 is the regularization parameter.
In particular, we pose
with
,
be the extension to Ii, of restriction to I1,2 in (13) of
and
, respectively.
As a solution, we get the sequence
such that:
with the sets I1, I2 defined in (12) and I1,2 in (13).
Algorithm implementing local RNN on In1 and In2,
1: repeat
2: k:= k + 1
3: Call Loc-RNN
4: Exchange xki between adjacent subdomains and compute O1,2
5: until
3. Discussion
The aim of this work is only to illustrate the framework underlying such innovative ideas. Many aspects have been left intentionally unexplained. For example, regarding the size of the overlapping interval, we note that the majority of parallelization techniques first decompose the domain in nonoverlapping subdomains and then extend each subdomain with halo regions to enable stencil operations in a parallel context. Instead, the algorithm allows one to reduce communications among adjacent subdomains to only those nodes lying on the interfaces. In particular, if the widths of the halo regions are equal to one grid node wide, the halo nodes are actually the nodes of the overlapping region. Otherwise, halo regions contain more nodes than overlapping nodes and this means more communications among adjacent subdomains. Hence, the algorithm reduces communication times which depends on the number of inner nodes in overlap regions. In addition, the extra work performed on the overlapped region with an increasing size can be seen as the effect of a preconditioner on the overlapping region which can overestimate the solution [21].
Regarding the regularization parameter, we observe that the regularization parameter plays an important trade-off role. As is evident, when the regularization parameter overfitting occurs larger values are expected to produce reliable solutions even if the DD-step number increases. The choice of a good regularization parameter is one of the most important issues in solving unconstrained minimization problems and there exists a significant amount of research in the literature on the development of appropriate strategies for selecting regularization parameters. Parameter choice methods can be classified according to the input they require. There are two basic types: a-priori methods, requiring information about the noise level on data. The discrepancy principle, developed and analyzed by Morozov is the oldest one of them; data-driven methods, require no extra information. Generalized Cross-Validation (GCV), due to Wahba, is one of the most popular methods [22].
Another issue is the employment of a dynamic load balancing scheme based on adaptive and dynamic redefining of initial decomposition. Specifically, in order to optimally choose the domain decomposition configuration, the partitioning into subdomains must satisfy certain conditions. First, the computational load assigned to subdomains must be equally distributed. Good quality partitioning also requires the volume of communication during calculation to be kept at its minimum. In [15] the authors employed a dynamic load balancing scheme based on adaptive and dynamic redefining of initial decomposition, aimed to balance load between processors according to data location. In particular, the authors focused on the introduction of a dynamic redefining of initial DD in order to deal with problems where the observations are non uniformly distributed and generally sparse.
The most significant finding is that we can solve several smaller problems improving the accuracy-per-parameter metric. Most importantly, subproblems can be solved in parallel, leading to a scalable algorithm where the workers locally exchange parameter updates via a nearest-neighbourhood communication scheme, which does not require a fully connected network. In contrast to other decomposition-in-time approaches, in our approach local solvers run concurrently from the beginning. Overall, the employment of the framework on hybrid high-performance computing systems seems to be a fruitful research area [23].