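ETL is mainly used to integrate heterogeneous data from different information systems and to make that data transparent to upper-layer business users; it is the key to building a high-quality data warehouse. To address the data ETL requirements of a provincial operator, a distributed ETL solution is proposed: non-real-time ETL is performed on the MapReduce framework, while for real-time ETL the ETL cluster is co-located with the Hadoop nodes, making full use of the cluster management functions provided by the Hadoop cluster to schedule real-time ETL tasks. This improves coordination among multiple servers, makes full use of the servers' hardware capacity, and saves equipment investment.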
To make data transparent to upper-layer business users, ETL technology is applied to integrate heterogeneous data from different information systems; it is the key to constructing a high-quality data warehouse. This paper proposes a distributed ETL scheme to satisfy the ETL requirements of a provincial telecommunications operator: a method based on the MapReduce framework fulfils the non-real-time ETL function. For the real-time ETL needs, the ETL service is co-located with the Hadoop cluster so as to take advantage of the "cluster management" function of the Hadoop system. The real-time ETL jobs can therefore be completed while fully using the hardware capability and saving equipment investment.