Zero-shot singing voice synthesis (SVS), the task of synthesizing the singing voice of an arbitrary target singer, has gained increasing attention in the past few years. Several recently proposed systems have demonstrated promising results on this task. However, these systems require detailed frame-level musical features as the musical content. To address this issue, we propose a model that performs zero-shot SVS with only the musical score as the musical content condition. To facilitate model training, we build an acoustic encoder that extracts linguistic features from audio and train it with a lyrics transcription objective. The output of the acoustic encoder serves as an alternative to the musical score, allowing the SVS model to learn from weakly labeled data. Results show that the proposed method outperforms a baseline semi-supervised method in both subjective and objective tests.