As an emerging technique, crowdsourcing has attracted considerable attention in recent years. Crowdsourced data, however, are difficult to fuse into usable applications because they are collected by different users, at different locations and times, and with different noise and distortion characteristics. Although various crowdsourcing services have proposed their own data-fusion methods, we find that these methods may not fully leverage data collected across multiple dimensions, which could lead to better fusion results. To harness this opportunity, we propose a more general solution that fuses multi-dimensional crowdsourced data and aligns the data with consistent time and location stamps using only the features of the sensory data, and can thus provide a high-quality crowdsourcing service from the raw samples collected from the environment. We conduct evaluations and experiments on a variety of commercial smartphones to verify the effectiveness of the proposed method.