TY - GEN
T1 - Bridging workflow and data provenance using strong links
AU - Koop, David
AU - Santos, Emanuele
AU - Bauer, Bela
AU - Troyer, Matthias
AU - Freire, Juliana
AU - Silva, Cláudio T.
PY - 2010
Y1 - 2010
N2 - As scientists continue to migrate their work to computational methods, it is important to track not only the steps involved in the computation but also the data consumed and produced. While this provenance information can be captured, in existing approaches, it often contains only weak references between data and provenance. When data files or provenance are moved or modified, it can be difficult to find the data associated with the provenance or to find the provenance associated with the data. We propose a persistent storage mechanism that manages input, intermediate, and output data files, strengthening the links between provenance and data. This mechanism provides better support for reproducibility because it ensures the data referenced in provenance information can be readily located. Another important benefit of such management is that it allows caching of intermediate data which can then be shared with other users. We present an implemented infrastructure for managing data in a provenance-aware manner and demonstrate its application in scientific projects.
UR - http://www.scopus.com/inward/record.url?scp=77955045251&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955045251&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-13818-8_28
DO - 10.1007/978-3-642-13818-8_28
M3 - Conference contribution
AN - SCOPUS:77955045251
SN - 3642138179
SN - 9783642138171
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 397
EP - 415
BT - Scientific and Statistical Database Management - 22nd International Conference, SSDBM 2010, Proceedings
T2 - 22nd International Conference on Scientific and Statistical Database Management, SSDBM 2010
Y2 - 30 June 2010 through 2 July 2010
ER -