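1. Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education, Renmin University of China, Beijing 100872, China
2. School of Information, Renmin University of China, Beijing 100872, China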
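Ju FAN (1984- ), male, Ph.D., is an associate professor at the Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education and the School of Information, Renmin University of China. He is a member of the China Computer Federation (CCF) and of its Technical Committee on Databases. His main research interests are databases and big data, crowdsourced data management, and data preparation.
Yueguo CHEN (1978- ), male, Ph.D., is a professor and doctoral supervisor at the School of Information, Renmin University of China. He is a senior member of CCF, a member of its Technical Committee on Databases, and a corresponding member of its Big Data Expert Committee. His main research interests are big data analytics systems and semantic search.
Xiaoyong DU (1963- ), male, Ph.D., is a professor and doctoral supervisor at the School of Information, Renmin University of China, and director of the Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education. He is a CCF Fellow, chair of the CCF Technical Committee on Databases, deputy director of the editorial board of Big Data Research, and an editorial board member of ACM Transactions on Data Science. His main research interests are databases and big data, intelligent information retrieval, and knowledge engineering.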
Published online: 2019-11
Published in print: 2019-11-15
Ju FAN, Yueguo CHEN, Xiaoyong DU. Progress on human-in-the-loop data preparation[J]. Big Data Research, 2019, 5(6): 2019046-1. DOI: 10.11959/j.issn.2096-0271.2019046.
With the rapid development of data analytics, data preparation has increasingly become a bottleneck. Using real-world data analysis scenarios as background, this paper analyzes the two core challenges of data preparation: high labor cost and long turnaround time. On this basis, it reviews research progress on human-in-the-loop data preparation. Interactive data preparation targets end users: it predicts a user's intent through interaction with that user and applies effective prediction algorithms to reduce the time spent on data preparation. Crowdsourced data preparation brings in large numbers of Internet users as crowd workers to extend computing power, thereby supporting fundamental data preparation tasks, and studies how to control result quality and optimize cost in crowdsourcing. Finally, the paper summarizes human-in-the-loop data preparation and discusses challenging open problems for future research.