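1. Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China
2. School of Information, Renmin University of China, Beijing 100872, China
3. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
4. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100086, China
5. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
6. School of Computer Science, Beijing Institute of Technology, Beijing 100081, China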
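Hong Mei (1963- ), male, Ph.D., is a professor at Peking University and director of the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education. He is an academician of the Chinese Academy of Sciences, a fellow of The World Academy of Sciences, a foreign member of Academia Europaea, and president of the China Computer Federation. His main research interests are software engineering and system software.
Xiaoyong Du (1963- ), male, Ph.D., is a professor and assistant to the president at Renmin University of China, and chair of the Big Data Expert Committee of the China Computer Federation. His main research interests are databases and big data.
Hai Jin (1966- ), male, Ph.D., is a professor at the School of Computer Science and Technology, Huazhong University of Science and Technology, and vice president of the China Computer Federation. His main research interests are computer architecture and parallel and distributed computing.
Xueqi Cheng (1971- ), male, Ph.D., is a research professor and deputy director of the Institute of Computing Technology, Chinese Academy of Sciences. His main research interests include big data analysis systems, Web information retrieval, and data mining.
Yunpeng Chai (1983- ), male, Ph.D., is a professor and head of the Department of Computer Science at the School of Information, Renmin University of China. His main research interests include database systems, cloud computing, and storage systems.
Xuanhua Shi (1978- ), male, Ph.D., is a professor at the School of Computer Science and Technology, Huazhong University of Science and Technology. His main research interests are parallel and distributed computing and heterogeneous computing.
Xiaolong Jin (1976- ), male, Ph.D., is a research professor at the Institute of Computing Technology, Chinese Academy of Sciences. His main research interests include knowledge graphs, knowledge engineering, social computing, and social networks.
Yasha Wang (1975- ), male, Ph.D., is a professor at the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education. His main research interests are big data analysis, pervasive computing, and urban computing.
Chi Liu (1984- ), male, Ph.D., is a professor and vice dean of the School of Computer Science, Beijing Institute of Technology. His main research interests are big data analysis and the intelligent Internet of Things.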
Published online: 2023-01
Published in print: 2023-01-15
Hong MEI, Xiaoyong DU, Hai JIN, et al. Big data technologies forward-looking[J]. Big Data Research, 2023, 9(1): 1-20. DOI: 10.11959/j.issn.2096-0271.2023009.
Major countries around the world attach great importance to the development of big data, and China has made it a national strategy, so developing big data technology is of great significance. Big data technology spans the full data life cycle, from collection and transmission to management, processing, analysis, and application, together with the data governance associated with each stage. This paper selects the management, processing, and analysis technologies within the data life cycle, along with big data governance technology, to survey the state of development in China and abroad, and in particular to assess the gap between China's big data technology and the international state of the art. In addition, driven by the demands of big data applications, the computing technology system is being restructured, shifting from "computation-centric" to "data-centric". Under the new system, a series of fundamental theoretical and core technical problems urgently need to be solved, and new big data system technology has become an important direction of development. Against the background of this restructuring, four major technical challenges and ten development trends of big data technology are proposed.