Leveraging channels and augmentation to improve sensor datasets
Maintaining the same fidelity with 90% less sensor data
For many real-world use-cases, we do not have much sensor data. For example, we may only have 500 time windows of a vibration / acoustic signal for leak detection. These windows may also be severely imbalanced, with 300 windows of normal data, 150 windows of fault 1, and just 50 windows of fault 2. The conventional approach to analyzing such data involves domain experts computing transforms such as the FFT, spectrograms, and power spectral density, then looking for specific peaks and the area under the curve corresponding to particular faults. However, this manual expertise takes a lot of time and does not scale across many deployments. How do we deal with small, imbalanced datasets? And how much data is sufficient for a given task, such as anomaly detection or classification?
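To make that conventional workflow concrete, here is a minimal sketch (not tied to any particular toolchain) of computing an FFT magnitude spectrum and a Welch power spectral density for a single vibration window, then measuring the area under the PSD curve in a fault band. The sampling rate, the synthetic signal, and the 100-140 Hz band are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch

fs = 10_000                      # assumed sampling rate, in Hz
t = np.arange(0, 1.0, 1 / fs)    # one 1-second time window
# Stand-in for a measured vibration window: a 120 Hz component plus noise.
window = np.sin(2 * np.pi * 120 * t) + 0.5 * np.random.randn(t.size)

# FFT magnitude spectrum: where are the dominant peaks?
freqs = np.fft.rfftfreq(window.size, d=1 / fs)
spectrum = np.abs(np.fft.rfft(window))
print(f"Dominant peak at {freqs[np.argmax(spectrum)]:.1f} Hz")

# Power spectral density via Welch's method.
psd_freqs, psd = welch(window, fs=fs, nperseg=1024)

# Area under the PSD curve in a hypothetical fault band (100-140 Hz).
band = (psd_freqs >= 100) & (psd_freqs <= 140)
band_energy = psd[band].sum() * (psd_freqs[1] - psd_freqs[0])
print(f"Energy in 100-140 Hz band: {band_energy:.4f}")
```

An expert would compare such peak locations and band energies against known fault signatures, which is exactly the manual step that is hard to scale.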
Using signal processing / information-theoretic approaches, we find a phase-transition curve like the one shown below. It is a generalization of the Shannon-Nyquist sampling theorem, which states that the more information (bandwidth) a signal contains, the higher the sampling rate needs to be to capture it faithfully.
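As a rough numerical illustration of that statement (not Lightscline's actual phase-transition curve), the sketch below samples a 100 Hz tone at rates above and below the Nyquist rate of 200 Hz: above it the dominant spectral peak stays at 100 Hz, below it the tone aliases to the wrong frequency. The tone frequency, duration, and sampling rates are assumptions chosen for the demonstration.

```python
import numpy as np

f_signal = 100                 # highest frequency present, in Hz
duration = 2.0                 # seconds
nyquist_rate = 2 * f_signal    # 200 Hz

for fs in (1000, 400, 250, 210, 190, 150):   # sampling rates above and below 200 Hz
    n = np.arange(int(fs * duration))
    x = np.sin(2 * np.pi * f_signal * n / fs)
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    peak = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
    status = "ok" if fs > nyquist_rate else "aliased"
    print(f"fs = {fs:4d} Hz  ->  dominant peak at {peak:6.1f} Hz  ({status})")
```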
Collecting more data, however, is not always feasible. Some of the common obstacles include:
1. Difficulty of experimentation in remote, expensive, or people-intensive environments.
2. The normal class contains far more data than the anomaly classes, because anomalies by definition occur much less often than normal operation.
3. Dedicated resources required for experimentation, which may not always be economically feasible.
4. Data sharing restrictions.
Moreover, when using a data-driven approach to this problem, we see that collecting more than a certain fraction of raw data from a channel does not improve accuracy. In fact, other variables improve test accuracy more than the amount of raw data collected. In the following examples, we show how we used just 10% of the raw data with Lightscline’s SDK and still obtained performance improvements by using more channels and data augmentation.
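As a generic illustration of the "10% of the raw data" idea (this is not Lightscline's SDK or its actual method), the sketch below keeps a fixed random 10% of the samples in every time window before any feature extraction or training. The dataset shape and the random-selection scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dataset shape: 500 windows, 2048 samples each, 4 channels.
n_windows, window_len, n_channels = 500, 2048, 4
raw = rng.standard_normal((n_windows, window_len, n_channels))

# Keep the same randomly chosen 10% of sample positions in every window.
keep_fraction = 0.10
keep_idx = np.sort(rng.choice(window_len, size=int(window_len * keep_fraction), replace=False))

reduced = raw[:, keep_idx, :]               # same windows and channels, 10% of the samples
print(raw.shape, "->", reduced.shape)       # (500, 2048, 4) -> (500, 204, 4)
```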
1. Number of channels
This refers to the number of parallel data streams being collected. For example, in a leak-detection scenario we might collect tri-axial vibration data and acoustic data, for a total of 4 channels. This can grow to 50+ channels for wearables or 300+ channels of in-flight data from aircraft.
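For concreteness, here is a small sketch, under assumed shapes, of how such parallel streams are typically stacked into a single (window, time, channel) array that a model consumes; the leak-detection example above gives 4 channels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows, window_len = 500, 2048   # assumed dataset shape

# Stand-ins for the four parallel streams from the leak-detection example.
vib_x = rng.standard_normal((n_windows, window_len))
vib_y = rng.standard_normal((n_windows, window_len))
vib_z = rng.standard_normal((n_windows, window_len))
acoustic = rng.standard_normal((n_windows, window_len))

# Stack the streams along a new channel axis.
multichannel = np.stack([vib_x, vib_y, vib_z, acoustic], axis=-1)
print(multichannel.shape)   # (500, 2048, 4)
```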
As can be seen below in a wearables-based human activity recognition use-case, using 48 channels of data instead of 5 increased test accuracy by 19.5 percentage points, from 75.9% to 95.4%.
2. Data augmentation
This refers to increasing the amount of training data with techniques such as windowing: cutting overlapping windows from the raw signal. Done correctly, this augments the existing dataset into one several times larger, as sketched below.
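Here is a minimal sketch of windowing-based augmentation, assuming overlapping windows are cut from one long recording: shrinking the hop from a full window length to one fifth of a window yields roughly a 5x data multiplier. The recording length, window size, and hop values are illustrative assumptions.

```python
import numpy as np

def sliding_windows(signal, window_len, hop):
    """Cut windows of length `window_len`, advancing the start by `hop` samples."""
    n = (len(signal) - window_len) // hop + 1
    return np.stack([signal[i * hop: i * hop + window_len] for i in range(n)])

rng = np.random.default_rng(0)
recording = rng.standard_normal(100_000)   # one long raw recording
window_len = 2048

baseline = sliding_windows(recording, window_len, hop=window_len)        # no overlap
augmented = sliding_windows(recording, window_len, hop=window_len // 5)  # ~5x multiplier

print(len(baseline), "->", len(augmented))   # 48 -> 240
```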
As can be seen below in the wearables-based human activity recognition use-case, using a 5x data multiplier leads to a 4-percentage-point increase in test accuracy, from 91.4% to 95.4%.
From the above examples, we can see that we can reach 95%+ test accuracy while using 90% less sensor data than conventional techniques. We also saw that collecting more than a certain fraction of the raw data does not improve performance further; instead, using more channels and data augmentation are two important techniques for improving model performance. Most of the time we are already collecting multiple channels anyway, because the marginal cost of adding a sensor modality or metadata is low. This additional information is valuable and can help improve model performance. At Lightscline, our SDK includes all these capabilities and much more. You can learn more about us here.