TY - GEN
T1 - The exception that improves the rule
AU - Freire, Juliana
AU - Glavic, Boris
AU - Kennedy, Oliver
AU - Mueller, Heiko
N1 - Funding Information:
This work was supported in part by gifts from Oracle and NSF Grant CNS-1229185. Juliana Freire is partially supported by Defense Advanced Research Projects Agency (DARPA) MEMEX program award FA8750-14-2-023. Opinions, findings and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of Oracle, the NSF, or DARPA.
Publisher Copyright:
© 2016 ACM.
PY - 2016/6/26
Y1 - 2016/6/26
N2 - The database community has developed numerous tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more. Yet, there is currently no consensus on how to best expose these powerful tools to an analyst in a simple, intuitive, and above all, flexible way. Thus, analysts continue to rely on tools such as spreadsheets, imperative languages, and notebook style programming environments like Jupyter for data curation. In this work, we explore the integration of spreadsheets, notebooks, and relational databases. We focus on a key advantage that both spreadsheets and imperative notebook environments have over classical relational databases: ease of exception. By relying on set-at-a-time operations, relational databases sacrifice the ability to easily define singleton operations, exceptions to a normal data processing workflow that affect query processing for a fixed set of explicitly targeted records. In comparison, a spreadsheet user can easily change the formula for just one cell, while a notebook user can add an imperative operation to her notebook that alters an output "view". We believe that enabling such idiosyncratic manual transformations in a classical relational database is critical for curation, as curation operations that are easy to declare for individual values can often be extremely challenging to generalize. We explore the challenges of enabling singletons in relational databases, propose a hybrid spreadsheet/relational notebook environment for data curation, and present our vision of Vizier, a system that exposes data curation through such an interface.
UR - http://www.scopus.com/inward/record.url?scp=84979792064&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84979792064&partnerID=8YFLogxK
U2 - 10.1145/2939502.2939509
DO - 10.1145/2939502.2939509
M3 - Conference contribution
AN - SCOPUS:84979792064
T3 - HILDA 2016 - Proceedings of the Workshop on Human-In-the-Loop Data Analytics
BT - HILDA 2016 - Proceedings of the Workshop on Human-In-the-Loop Data Analytics
PB - Association for Computing Machinery, Inc
T2 - 1st Workshop on Human-in-the-Loop Data Analytics, HILDA 2016
Y2 - 26 June 2016
ER -