Tracking Time Evolving Data Streams for Short-Term Traffic Forecasting

Abstract:

Data stream mining has emerged in recent years as an efficient way of extracting knowledge from big data. In the robust layered ensemble model (RLEM) proposed in this paper for short-term traffic flow forecasting, incoming traffic flow data from all connected road links are organized in chunks corresponding to an optimal time lag. The RLEM is composed of two layers. In the first layer, we cluster the chunks by using the Graded Possibilistic c-Means method.
The second layer consists of an ensemble of forecasters, each trained for short-term traffic flow forecasting on the chunks belonging to a specific cluster. In the operational phase, when a new chunk of traffic flow data is presented as input to the RLEM, its memberships to all clusters are evaluated and, if it is not recognized as an outlier, the outputs of all forecasters are combined in an ensemble to obtain a forecast of traffic flow over a short-term horizon. The proposed RLEM is evaluated on a synthetic data set, on a traffic flow data simulator, and on two real-world traffic flow data sets. The model forecasts traffic flow rates accurately, detects outliers, and adapts well to non-stationary traffic regimes. Given its outlier detection, accuracy, and robustness, RLEM can be fruitfully integrated into traffic flow management systems.

Introduction:

Data streams are ordered, potentially unbounded sequences of observations (data elements) made available over time. Data stream mining, the process of extracting knowledge from them, has arisen as a relevant topic in the machine learning field during the past decade. In many data stream mining applications where data exhibit a time series nature, the goal is to predict information about future instances in the data stream given some knowledge about previous ones. This can be approached either by modeling the dynamics of the system or by autoregressive models. Within the field of road traffic analysis and forecasting, the latter approach has rapidly become widespread in recent years due to the increase in both the availability of sensed data and the processing power to deal with them. A common requirement in mining data streams is the ability to distinguish useful information from useless information. This may be required for limiting the usage of resources, for instance, transmission bandwidth or storage memory; for summarization purposes; or even for relieving the user from information overload. As an example of this last case, a sensor network may provide just the information that requires attention by the human supervisor rather than transmitting all records. This task goes by the name of anomaly or outlier detection. One common approach to anomaly detection makes use of unsupervised learning: we learn a baseline model of the phenomenon of interest and then measure the discrepancy of subsequent data from the baseline. An anomalous observation is one that is not well explained by the model. When a system operates in a non-stationary environment for an extended time, the source of the stream may change. We distinguish between two types of change: for evolutionary, smooth changes, we use the term concept drift, while a radical, sudden change is labelled concept shift.
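The baseline-and-discrepancy idea described above can be illustrated with a minimal sketch. It assumes a deliberately simple baseline (per-feature mean and standard deviation) in place of the clustering model used in this paper; the function names and the threshold value are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_baseline(train):
    """Learn a simple baseline: per-feature mean and standard deviation.
    Assumption: a stand-in for any unsupervised model of normal behaviour."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-12  # avoid division by zero
    return mean, std

def discrepancy(x, baseline):
    """Root-mean-square z-score of an observation with respect to the baseline."""
    mean, std = baseline
    z = (np.asarray(x, dtype=float) - mean) / std
    return float(np.sqrt(np.mean(z ** 2)))

def is_anomalous(x, baseline, threshold=3.0):
    """Flag observations that the baseline does not explain well.
    The threshold is an illustrative choice, not a value from the paper."""
    return discrepancy(x, baseline) > threshold

baseline = fit_baseline([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
typical = is_anomalous([3.0, 4.0], baseline)    # observation at the baseline mean
unusual = is_anomalous([30.0, 40.0], baseline)  # observation far from the baseline
```

In a non-stationary stream the baseline itself would have to be updated over time; otherwise concept drift would eventually be reported as a run of anomalies.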
In this paper, we approach the problem of short-term traffic forecasting by employing the autoregressive approach, which is more suitable than a model-based one in the short term because it can exploit the local time information contained in recent observations and is computationally less demanding. To tackle the issues of anomalies and non-stationarity, we employ an extension of the possibilistic clustering approach, named Graded Possibilistic c-Means, to cluster the non-stationary streaming data, and we use the knowledge accumulated in the clusters to build a robust, accurate short-term traffic forecaster. Our proposed method prevents outliers in the data stream from having a strong effect on the forecasting accuracy and is capable of both learning the data stream and analysing its evolution for the purpose of tracking it. To this end, an index to measure data stream change is proposed, based solely on the memberships to clusters and not on additional measures. We focus on the online approach to tracking and adapting to concept drift and shift, and on using this knowledge to improve the previously proposed ensemble forecasting model, making it able not only to detect outliers but also to track changes in data streams. An incremental retraining strategy is adopted, where the amount of retraining, and therefore the required computational effort, is modulated by the proposed measure of model change. This paper is organized as follows: The next section summarizes the state of the art in streaming data clustering and traffic modelling, motivating the specific design choices of our proposal.
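As a rough illustration of the approach outlined in this introduction, the sketch below organizes a series into lagged chunks, evaluates cluster memberships for a new chunk, rejects it as an outlier when all memberships are low, and otherwise combines the forecasters' outputs. Several parts are assumptions rather than the method of this paper: the exponential membership function is a generic possibilistic-style form and not the Graded Possibilistic c-Means update, the membership-weighted average is one possible way of combining the ensemble, and the names `make_chunks`, `memberships`, `ensemble_forecast`, `beta`, and `min_membership` are hypothetical.

```python
import numpy as np

def make_chunks(series, lag):
    """Sliding windows of length `lag`, each paired with the next value as target."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X, y

def memberships(chunk, centroids, beta):
    """Memberships in [0, 1], one per cluster; they need not sum to 1.
    Assumption: exponential decay with squared distance, width `beta`."""
    d2 = np.sum((np.asarray(centroids, dtype=float) - chunk) ** 2, axis=1)
    return np.exp(-d2 / beta)

def ensemble_forecast(chunk, centroids, forecasters, beta=1.0, min_membership=0.05):
    """Return a combined forecast, or None if the chunk is treated as an outlier."""
    chunk = np.asarray(chunk, dtype=float)
    u = memberships(chunk, centroids, beta)
    if u.max() < min_membership:
        return None  # no cluster explains this chunk well enough
    preds = np.array([f(chunk) for f in forecasters], dtype=float)
    # Assumption: forecasters are weighted by the chunk's cluster memberships.
    return float(np.dot(u, preds) / u.sum())

# Toy usage: two clusters, each with a constant forecaster (purely illustrative).
X, y = make_chunks([1, 2, 3, 4, 5], lag=2)
centroids = [[0.0, 0.0], [10.0, 10.0]]
forecasters = [lambda c: 1.0, lambda c: 2.0]
near = ensemble_forecast([0.0, 0.0], centroids, forecasters)
far = ensemble_forecast([100.0, 100.0], centroids, forecasters)  # rejected as outlier
```

Because the memberships are not forced to sum to one, a chunk far from every centroid has low membership everywhere, which is what makes the outlier test possible without any additional measure.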

Keywords:

Traffic forecasting, Fuzzy clustering, Big data, Ensemble model, Evolving data streams.