Bigqa: Declarative big data quality assessment

H Fadlallah, R Kilany, H Dhayne, R El Haddad… - ACM Journal of Data …, 2023 - dl.acm.org
ACM Journal of Data and Information Quality, 2023
In the big data domain, data quality assessment operations are often complex and must run in a distributed and timely manner. This article generalizes quality assessment operations by proposing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment across different domains and contexts. It helps data domain experts and data management specialists plan and execute big data quality assessment operations at any phase of the data life cycle. This work implements BIGQA to demonstrate that it can produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans from straightforward operators that are designed to handle big data and to guarantee a high degree of parallelism when executed. It also supports incremental data quality assessment, which avoids reading the whole dataset each time an assessment is required. The results were validated using radiation wireless sensor data and Stack Overflow users' data, showing that the framework can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
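The idea of incremental assessment mentioned in the abstract can be illustrated with a small sketch. This is not BIGQA's API or its operators, which the abstract does not describe; it is a hypothetical example that assumes a simple completeness metric (the share of non-null values) and shows how a mergeable summary lets a new batch be assessed without rereading earlier data. All names (`CompletenessState`, `assess_chunk`, `assess_partitions`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class CompletenessState:
    """Hypothetical mergeable summary: counts only, no raw rows retained."""
    total: int = 0
    non_null: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Merging is associative and commutative, so partial states can be
        # computed independently (e.g. one per partition) and combined later.
        return CompletenessState(self.total + other.total,
                                 self.non_null + other.non_null)

    @property
    def ratio(self) -> float:
        # An empty dataset is treated as fully complete (an assumption).
        return self.non_null / self.total if self.total else 1.0


def assess_chunk(values: Iterable[Optional[object]]) -> CompletenessState:
    """Scan one chunk once and summarize it."""
    total = 0
    non_null = 0
    for value in values:
        total += 1
        if value is not None:
            non_null += 1
    return CompletenessState(total, non_null)


def assess_partitions(
    partitions: Sequence[Iterable[Optional[object]]],
) -> CompletenessState:
    """Combine per-partition summaries; each call to assess_chunk is
    independent and could run in parallel in a real system."""
    state = CompletenessState()
    for part in partitions:
        state = state.merge(assess_chunk(part))
    return state


# Initial assessment over two partitions of made-up sensor readings.
historical = assess_partitions([[1.2, None, 0.9], [1.1, 1.0]])

# Later, only the new batch is scanned; the stored state is merged in.
new_batch = assess_chunk([None, 1.3, 1.4])
updated = historical.merge(new_batch)
```

Because the stored state holds only two counters, updating the metric costs one pass over the new batch rather than a pass over the full dataset. Metrics that cannot be summarized this way (for example, exact uniqueness) would need a different state representation.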