Scan chains are used in design for test by providing controllability and observability at each register. Scan optimization is run during physical design after placement where scannable elements are re-ordered along the chain to reduce total wirelength (and power). In this paper, we present a machine learning based technique that leverages constrained clustering and reinforcement learning to obtain a wirelength efficient scan chain solution. Novel techniques like next-min sorted assignment, clustered assignment, node collapsing, partitioned Q-Learning and in-context start-end node determination are introduced to enable improved wire length while honoring design-for-test constraints. The proposed method is shown to provide up to 24% scan wirelength reduction over a traditional algorithmic optimization technique across 188 moderately sized blocks from an industrial 7nm design.