Visual imitation learning methods demonstrate strong performance, yet they generalize poorly under visual input perturbations such as variations in lighting and texture, which impedes their real-world application. We propose Stem-Ob, which leverages pretrained image diffusion models to suppress low-level visual differences while preserving high-level scene structure. This image inversion process is akin to
transforming the observation into a shared representation, from which other observations stem, with
extraneous details removed. Stem-Ob contrasts with data-augmentation approaches in that it is robust to a wide range of unspecified appearance changes without requiring additional training. Our method is a simple yet highly
effective plug-and-play solution. Empirical results confirm the effectiveness of our approach in simulated
tasks and show a particularly significant improvement in real-world applications, with an average increase of 22.2% in success rate over the best baseline.
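To make the inversion idea concrete, the sketch below illustrates one way such an observation-"stemming" step could look: a deterministic DDIM-style inversion that maps an image part-way into the diffusion latent space before it is fed to the policy, so that low-level appearance details collapse while scene structure survives. The noise-prediction network `eps_model`, the beta schedule, and the inversion depth `invert_to` are illustrative assumptions for this sketch, not the authors' actual implementation (which builds on a pretrained image diffusion model).

```python
# Minimal sketch of partial DDIM inversion as an observation preprocessor.
# Assumptions (hypothetical, not from the paper): eps_model(x, t) returns the
# predicted noise for latent x at timestep t; a standard linear beta schedule;
# inversion depth and stride chosen for illustration only.
import torch


def make_alpha_bars(num_steps: int = 1000) -> torch.Tensor:
    # alpha_bar_t = prod_{s<=t} (1 - beta_s) under a linear beta schedule.
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)


@torch.no_grad()
def stem_observation(x0: torch.Tensor,
                     eps_model,                # assumed: eps_model(x, t) -> noise
                     alpha_bars: torch.Tensor,
                     invert_to: int = 200,     # deeper inversion removes more detail
                     stride: int = 20) -> torch.Tensor:
    """Deterministically invert an observation x0 up to step `invert_to`.

    The partially inverted latent replaces the raw image observation, so
    visually different renderings of the same scene map to nearby inputs.
    """
    x = x0
    timesteps = list(range(0, invert_to, stride))
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_next = alpha_bars[t], alpha_bars[t_next]
        eps = eps_model(x, torch.tensor([t]))
        # Predict the clean image implied by the current latent, ...
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        # ... then step it forward one noise level (DDIM inversion update).
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps
    return x


# Usage sketch: obs is a camera image scaled to [-1, 1], shape (1, 3, H, W);
# the resulting latent z is what the imitation policy would consume.
# z = stem_observation(obs, pretrained_eps_model, make_alpha_bars())
```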