Recent advances in Convolutional Neural Networks (CNNs) have achieved remarkable success in numerous applications. The record-breaking performance of CNNs usually comes at prohibitive training cost, so all training data are typically processed on a powerful centralized server, which raises privacy concerns. Federated learning (FL) is a distributed machine learning method that trains a global model over mobile devices while keeping decentralized data on the devices to preserve data privacy. However, there are two major limitations to deploying FL on mobile clients. First, on the client side, the limited communication and computation resources of mobile devices cannot well support full training iterations. Second, on the server side, conventional FL aggregates only a common model for all clients without personalizing it to each client, an important missing feature when clients have heterogeneous data distributions. In this work, we aim to enable low-cost personalized FL by focusing on the weight gradients, which are the most important parameters exchanged in FL and which, meanwhile, dominate the computation and communication cost. We first observe that a client's calculated weight gradients are highly sparse, and that the sparse pattern of the weight gradients can be predicted via very simple bit-wise operations on a sequence of bits (named a bit-stream) instead of expensive high-precision calculations. Furthermore, each client's uploaded weight gradients exhibit a unique pattern that reflects the distribution of its local training data. Guided by this pattern, each client can obtain a personalized aggregated model that fits its own data. Hence, we leverage bit-streams to predict weight-gradient sparsity for low-cost training on each device and, meanwhile, to represent each client's unique sparse gradient pattern, which guides model personalization.
In our experiments, the proposed framework improves computation efficiency by 3.5× on average (up to 4.2×) and reduces communication cost by 23% on average (up to 41%) while still achieving state-of-the-art personalized accuracy.
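The bit-stream idea above can be illustrated with a minimal sketch. Here a gradient tensor is encoded as a bit-mask of (near-)zero entries, and the entries expected to stay sparse next round are predicted with a single bit-wise AND over the masks of two previous rounds. The helper names (`to_bitmask`, `predict_sparse_mask`), the magnitude threshold, and the AND-of-histories rule are our assumptions for illustration, not the paper's exact prediction scheme:

```python
import numpy as np

def to_bitmask(grad, threshold=1e-3):
    """Encode a gradient tensor as a bit-stream: 1 = entry below threshold (sparse)."""
    return (np.abs(grad) < threshold).astype(np.uint8)

def predict_sparse_mask(mask_prev, mask_curr):
    """Predict entries that remain sparse next round via a bit-wise AND of histories."""
    return mask_prev & mask_curr

# Toy example: two rounds of gradients for four weights.
g1 = np.array([0.0, 0.5, 1e-4, 0.0])
g2 = np.array([1e-4, 0.4, 0.0, 0.2])
m = predict_sparse_mask(to_bitmask(g1), to_bitmask(g2))
print(m.tolist())  # -> [1, 0, 1, 0]
```

Entries flagged in the predicted mask could then skip the expensive high-precision gradient computation and be omitted from the uploaded update, which is the source of the computation and communication savings the abstract describes.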