Single-pass analytics are a special class of data analytics applied to input data that is read only once, for example while it is being ingested into a big-data pipeline as a pre-processing step for a follow-on (multi-pass) batch job, or during on-the-fly processing of continuously flowing streams of data.
Single-pass analytics are particularly useful when early, possibly approximate results are desirable ahead of a lengthier multi-pass processing job. Such analytics are often implemented as continuous queries that process streams of data incrementally, updating their results as more data becomes available. This article provides an overview of the single-pass analytics microservice that is being developed as part of the EVOLVE platform.
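The incremental processing described above can be illustrated with a minimal sketch in plain Python. It is not taken from the EVOLVE microservice: the names RunningStats and continuous_mean are hypothetical, and Welford's method is used here only as one well-known way to maintain a mean and variance while reading each record exactly once.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class RunningStats:
    """Count, mean and variance maintained incrementally (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one new record into the state; earlier records are never re-read.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; reported as 0.0 before any record has arrived.
        return self.m2 / self.count if self.count else 0.0


def continuous_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (records seen, mean so far) after every record of the stream."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
        yield stats.count, stats.mean
```

Because the state is a fixed handful of numbers, an early result is available after every record, and memory use does not grow with the length of the stream.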
The EVOLVE single-pass analytics microservice encapsulates the Apache Spark Structured Streaming system at its core, and can be adapted to other stream-processing engines as well. The choice of Apache Spark Structured Streaming as a reference platform for the microservice was made based on the scalability, robustness, tools ecosystem, and community adoption of the Spark platform. Single-pass analytics are expressed as continuous queries over streaming data. The current prototype of the microservice is fully integrated with the EVOLVE ecosystem, its advanced computing platform, and Kubernetes-driven management framework.
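To make the notion of a continuous query more concrete, the following is a small stand-alone sketch in plain Python. It deliberately does not use the Spark API; the function running_count_by_key and the representation of micro-batches as lists of keys are assumptions made for illustration only. It mimics a grouped count whose complete result is re-emitted each time a new batch of records arrives.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def running_count_by_key(
    micro_batches: Iterable[List[str]],
) -> Iterator[Dict[str, int]]:
    """Emit the updated per-key counts after each micro-batch.

    The Counter is the query state: it persists across batches, so a batch
    is folded into the result once and never read again.
    """
    state: Counter = Counter()
    for batch in micro_batches:
        state.update(batch)   # merge the newly arrived records into the state
        yield dict(state)     # snapshot of the full result so far


if __name__ == "__main__":
    # Two hypothetical micro-batches of event keys.
    for result in running_count_by_key([["a", "b"], ["a"]]):
        print(result)
```

A stream-processing engine adds what this sketch leaves out, such as distributing the state across workers and recovering it after failures.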
The single-pass analytics microservice can be interconnected with the EVOLVE data ingest microservice based on Apache Kafka, forming a data path through which streams of incoming data are typically fed into the EVOLVE platform.
The single-pass analytics microservice can be managed via the EVOLVE management service. Users can instantiate the service through the user-friendly Karvdash UI, where they can create workflows that use the single-pass analytics service. They can also use an Apache Zeppelin notebook to develop their analytics tasks interactively on top of the underlying single-pass analytics service, as seen in Figure 2.
The single-pass analytics service takes advantage of the scalability potential of the EVOLVE platform to achieve high performance under increasing levels of load. Scalability can be vertical, when the amount of resources (number of CPU cores, memory, etc.) per container is increased, or horizontal, when the number of containers is increased. Our experiments demonstrate the scalability that the microservice achieves under increasing load by drawing on progressively more processing resources from the EVOLVE advanced computing platform.
Figures 3 and 4 depict the performance of an analytics task processing a bounded stream of data, expressed as throughput (processed rows per second), as the processing capacity of the microservice increases. These results demonstrate that the analytics task processes data faster as capacity grows, allowing users to gain valuable insights into the data sooner. Figure 3 depicts the horizontal scaling of the application as the number of executors increases, where each executor has 2 CPU cores and 4 GB of memory. Figure 4 depicts vertical scalability with increasing resources (number of CPU cores and memory size) per container hosting the entire service. In both scalability scenarios the application shows similar performance improvement trends as more resources become available to the service.
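The throughput metric used in these figures, and the speedup it implies as resources are added, can be computed as in the sketch below. The helper names throughput and scaling_report are hypothetical, and the efficiency measure (speedup divided by the ideal linear speedup) is a common convention assumed here rather than something the experiments above report.

```python
from typing import Dict


def throughput(rows_processed: int, elapsed_seconds: float) -> float:
    """Processed rows per second for one run over a bounded stream."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return rows_processed / elapsed_seconds


def scaling_report(
    baseline_units: int, runs: Dict[int, float]
) -> Dict[int, Dict[str, float]]:
    """Speedup and efficiency of each run relative to the baseline run.

    `runs` maps a resource count (e.g. number of executors, or CPU cores
    per container) to the throughput measured with that many resources.
    """
    base = runs[baseline_units]
    report: Dict[int, Dict[str, float]] = {}
    for units, rows_per_sec in sorted(runs.items()):
        speedup = rows_per_sec / base
        ideal = units / baseline_units  # linear scaling would reach this
        report[units] = {"speedup": speedup, "efficiency": speedup / ideal}
    return report
```

The same report applies to both scenarios: for horizontal scaling the resource count is the number of executors, and for vertical scaling it is the number of CPU cores in the single container. An efficiency close to 1.0 indicates near-linear scaling.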
In conclusion, we have provided a brief overview of the EVOLVE single-pass analytics microservice and demonstrated the ability of the current prototype to provide scalable performance when operating within the EVOLVE advanced computing platform. Future work will focus on further exploring the scalability potential of the single-pass analytics microservice in larger-scale experiments over EVOLVE pilot datasets, taking advantage of the unique features of the EVOLVE advanced computing platform.