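1. School of Software, Northeastern University, Shenyang 110169, Liaoning, China
2. Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen 518060, Guangdong, China
3. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, Liaoning, China
4. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Minghe YU (1989- ), female, Ph.D., is a lecturer at the School of Software, Northeastern University. Her main research interests include big data and information retrieval.
Tiezheng NIE (1980- ), male, Ph.D., is an associate professor at the School of Computer Science and Engineering, Northeastern University. His main research interests are data integration, big data processing, and blockchain.
Guoliang LI (1980- ), male, Ph.D., is a professor at the Department of Computer Science and Technology, Tsinghua University. His main research interests include data cleaning, data integration, and crowdsourced data management.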
Published online: 2019-11
Published in print: 2019-11-15
Minghe YU, Tiezheng NIE, Guoliang LI. Data curation technologies and applications[J]. Big Data Research, 2019, 5(6): 2019048-1. DOI: 10.11959/j.issn.2096-0271.2019048.
Data curation has emerged to process, store, and apply massive data fully and effectively. It is the active and continuous management of data throughout its entire lifecycle, so that data is used to the maximum extent and its useful life is extended as far as possible. Centered on the goals, solutions, and applications of data curation, this paper systematically describes the data curation process and its key techniques, and analyzes existing solutions for those techniques. It also introduces several applications of data curation in various domains and compares their technical characteristics. Finally, it discusses the development prospects and future challenges of data curation.