
Dealing with Inaccurate and Incomplete Labels in Industrial Streaming Data

Andrea Castellani, "Dealing with Inaccurate and Incomplete Labels in Industrial Streaming Data", Uni Bielefeld, 2023.

Abstract

The pressure to increase the energy efficiency of industrial facilities has led to a strong increase in the number of installed measurement sensors. These collect large volumes of data that need to be processed and analyzed. As manual data processing methods are not appropriate due to the sheer amount of data, automated and intelligent solutions are needed. Machine learning techniques are a viable option for processing large volumes of data and are capable of capturing complex relationships within it. However, obtaining meaningfully annotated data is a real challenge and typically incurs large costs, especially in an industrial setting, where the data typically arrives as a constantly evolving stream. Thus, devising machine learning models that can perform a desired task in industrial environments with few labeled data samples and drifting data features poses a severe challenge. In this thesis, we address two main technical challenges in the field of analyzing industrial streaming data: (1) how to efficiently train models with only partially labeled data, and (2) how to train models when a sizable fraction of the label information is incorrect or changes over time. We propose several strategies for dealing with these questions and evaluate their performance on stationary and non-stationary benchmark data sets, as well as on real-world industrial application data. As one central aspect of the approaches, we propose to use constrained embedding representations for the raw input data. These representations are shown to be efficient for dealing with limited annotated data by analyzing the labeled and unlabeled data based on similarities in the embedding space. They allow for robust semi-supervised training of deep neural networks in the presence of label noise, and even gradually correct the mislabeled samples during training.
Similarly, connecting these latent representations to a network performing predefined tasks is shown to be useful for accurate concept drift detection. Another core aspect of these approaches is their capability to handle sparsely labeled data in streaming environments. By propagating the available labels to unlabeled samples, based on their proximity in the embedding space and the time of arrival of the labels, we can successfully train accurate models with incomplete and delayed labels in resource-constrained settings. Our work shows that the proposed methods are very effective for analyzing streaming data with sparse and even incorrect or delayed labels, as well as concept drift. We apply our methods to real-world industrial data in different tasks, such as robust anomaly detection with few labeled samples, predictive modeling of industrial machinery in the presence of label noise and delayed labels, and semi-supervised concept drift detection.
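
To make the idea of propagating labels by proximity in an embedding space more concrete, the following is a minimal sketch and not the method of the thesis. It assumes the embeddings are already computed, uses a plain k-nearest-neighbor majority vote, and ignores the time of arrival of the labels that the abstract also mentions. The function name `propagate_labels` and the parameters `k` and `max_dist` are hypothetical and chosen for illustration only.

```python
import math


def propagate_labels(embeddings, labels, k=3, max_dist=1.0):
    """Give each unlabeled point the majority label of its nearest labeled neighbors.

    embeddings: list of equal-length sequences of floats (assumed precomputed)
    labels:     list of the same length; None marks an unlabeled sample
    k:          number of nearest labeled neighbors that may vote (illustrative)
    max_dist:   neighbors farther away than this do not vote (illustrative)
    """
    labeled = [(e, y) for e, y in zip(embeddings, labels) if y is not None]
    result = list(labels)

    for i, (e, y) in enumerate(zip(embeddings, labels)):
        if y is not None:
            continue  # keep the labels that were given

        # Euclidean distance to every labeled sample, nearest first.
        neighbors = sorted(
            ((math.dist(e, le), ly) for le, ly in labeled),
            key=lambda pair: pair[0],
        )
        voters = [ly for d, ly in neighbors[:k] if d <= max_dist]
        if not voters:
            continue  # nothing close enough: the sample stays unlabeled

        # Majority vote; on a tie the label of the nearest voter wins,
        # because dicts keep insertion order and max() returns the first maximum.
        votes = {}
        for ly in voters:
            votes[ly] = votes.get(ly, 0) + 1
        result[i] = max(votes, key=votes.get)

    return result


if __name__ == "__main__":
    emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
    lab = [0, None, 1, None, None]
    print(propagate_labels(emb, lab, k=1, max_dist=1.0))
```

In this toy input the two unlabeled points that lie next to a labeled one inherit its label, while the isolated point at (20, 20) remains unlabeled because no labeled sample falls within `max_dist`. A streaming variant would additionally have to account for when each label arrives, which this sketch leaves out.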


