A survey on data collection for machine learning: a big data-AI integration perspective
Data collection is a major bottleneck in machine learning and an active research topic in
multiple communities. There are largely two reasons data collection has recently become a …
Table-GPT: Table-tuned GPT for diverse table tasks
P Li, Y He, D Yashar, W Cui, S Ge, H Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable abilities to
follow diverse human instructions and perform a wide range of tasks. However, when …
Auto-suggest: Learning-to-recommend data preparation steps using data science notebooks
C Yan, Y He - Proceedings of the 2020 ACM SIGMOD International …, 2020 - dl.acm.org
Data preparation is widely recognized as the most time-consuming process in modern
business intelligence (BI) and machine learning (ML) projects. Automating complex data …
BlinkFill: Semi-supervised programming by example for syntactic string transformations
R Singh - Proceedings of the VLDB Endowment, 2016 - dl.acm.org
The recent Programming By Example (PBE) techniques such as FlashFill have shown great
promise for enabling end-users to perform data transformation tasks using input-output …
Ten years of WebTables
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of
structured databases casually published online in the form of HTML tables. The past decade …
Uni-detect: A unified approach to automated error detection in tables
P Wang, Y He - Proceedings of the 2019 International Conference on …, 2019 - dl.acm.org
Data errors are ubiquitous in tables. Extensive research in this area has resulted in a rich
variety of techniques, each often targeting a specific type of error, e.g., numeric outliers …
Pytheas: pattern-based table discovery in CSV files
C Christodoulakis, EB Munson, M Gabel… - Proceedings of the …, 2020 - dl.acm.org
CSV is a popular Open Data format widely used in a variety of domains for its simplicity and
effectiveness in storing and disseminating data. Unfortunately, data published in this format …
Auto-join: Joining tables by leveraging transformations
E Zhu, Y He, S Chaudhuri - Proceedings of the VLDB Endowment, 2017 - dl.acm.org
Traditional equi-join relies solely on string equality comparisons to perform joins. However,
in scenarios such as ad-hoc data analysis in spreadsheets, users increasingly need to join …
Auto-detect: Data-driven error detection in tables
Z Huang, Y He - Proceedings of the 2018 International Conference on …, 2018 - dl.acm.org
Given a single column of values, existing approaches typically employ regex-like rules to
detect errors by finding anomalous values inconsistent with others. Such techniques make …
TEXUS: A unified framework for extracting and understanding tables in PDF documents
Tables in documents are a widely-available and rich source of information, but not yet well-
utilised computationally because of the difficulty in automatically extracting their structure …