Style transfer is a popular and ever-growing field whereby visual content is stylized to match a reference image. Moving style transfer to the 3D domain is challenging: we require a low-latency system that preserves accurate 3D structure while flexibly styling the scene. We present a real-time 3D style transfer framework that generates fully stylized virtual scenes while preserving the underlying geometry. Our method leverages Gaussian splatting for an explicit, differentiable scene representation and uses a three-phase joint optimization pipeline. First, we reconstruct a photorealistic Gaussian splat from posed images; second, we transfer color coefficients to match a reference style; and third, we unfreeze all Gaussian parameters to minimize a combined content and nearest-neighbor VGG style loss. We compare against 2D-first baselines (StyleShot, GPT-Image-1) and a 3D method, Artistic Radiance Fields, qualitatively demonstrating superior geometric consistency, stylization fidelity, and real-time performance suitable for VR/AR. Integration with Meta Quest 3 and Vision Pro headsets confirms smooth, immersive experiences. Our approach supports diverse artistic styles and explicit scene editing, and opens new avenues for end-to-end, geometry-aware stylization in virtual and augmented reality environments.
Please be patient while the interactive visualizations load; they can take a second or two. Use Chrome or Safari if visuals are slow to appear.
Style transfer is a well-studied area of research whereby input, or content, images are regenerated in the style of some reference, or style, image. This ability to 'stylize' images has been used in various applications, from artistic rendering to image editing. The most common approach to style transfer is to use a neural architecture to extract features from the content and style images, then combine them to create a new image that retains the original content while adopting the reference style. This highlights the central difficulty of 2D style transfer: how can we inherit style while preserving content?
Our work studies this problem in the 3D domain, where it is exacerbated. Specifically, we not only want to maintain content; we have the stricter requirement of maintaining the 3D geometry of the scene. This is a significant challenge, since existing 2D style transfer methods are unaware of a scene's 3D geometry and thus cannot create stylized images consistent with the underlying structure. An effective solution to the 3D style transfer problem would allow the creation of stylized 3D scenes consistent with the underlying geometry, enabling a wide range of applications in animation and graphics. We are further interested in using this technology in virtual reality (VR), where the user can be immersed in a stylized 3D scene; this requires real-time rendering of the scene, adding additional difficulties with computational efficiency.
There are existing works enabling the rendering of full scenes in the style of a single image, such as ARF (Zhang et al., 2022) and ARF+ (Li et al., 2023). They work by optimizing a NeRF with a style-transfer loss to produce an implicitly stylized scene. However, these methods run at low frame rates, making them impractical for real-time applications. This matters because a major application of 3D scene style transfer is creating stylized virtual reality (VR) scenes for immersive user experiences, which requires reworking existing 3D style methods to run in real time. Nonetheless, the joint optimization of style with 3D geometry is a compelling idea that we experiment with.
There is a body of work that attempts to make these NeRF-based methods run in real time to create walkable spaces (VR-NeRF, Xu et al., 2023), but this sidesteps a major issue with NeRF-style methods: as implicit models, NeRFs only predict RGB and density values, which prevents users from freely editing scenes, a highly desirable feature for immersive VR experiences. This motivates us to create a more explicit method for 3D style transfer that can be easily style-optimized and run at a real-time rate for a workable user experience. Gaussian splatting (Kerbl et al., 2023) represents a scene as a set of 3D Gaussians, which allows fast, differentiable rendering via direct summation of Gaussian kernels, providing a real-time rendering solution. Furthermore, as an explicit technique, it represents scenes in point-cloud form, allowing easy editing of the scene if desired. Lastly, splatting methods allow us to flexibly choose optimization parameters without requiring 'deferred backpropagation', a time-exhaustive method used with NeRFs to make style optimization feasible within realistic memory limits.
However, some methods nonetheless use NeRFs to create stylized 3D scenes, such as InstructNeRF2NeRF (Haque et al., 2023), which could feasibly be used to create walkable spaces if optimized to run at high frame rates and adapted to image conditioning rather than language conditioning. It works by iteratively using InstructPix2Pix (Brooks et al., 2023) to stylize the dataset images. We explore a similar setup in our work, compare it to our method, and analyze why this kind of pipeline can be detrimental to performance.
In style transfer literature, there is typically a desired output image, a content image, and a style image. Generative methods aim to train a model that one-shot generates an output image retaining information from the content image while being stylized like the style image. Other methods are optimization-oriented: they optimize parameters at test time to create the desired output, again tuning the content image toward the style of the style image. In either case, a few main objective functions dictate how this process is done. Most neural style transfer systems use a pre-trained image model to extract features from the content and style images. The most common choice is VGG-19 (Simonyan et al., 2015), a convolutional neural network (CNN) trained on ImageNet (Russakovsky et al., 2014). The first loss is the content loss, which keeps the output close to the content image. We define the feature activations at layer \(l\) for image \(\mathbf{x}\) as \(F^l(\mathbf{x})\). In standard feature extraction pipelines, we expect some number of features, so we denote feature $i$ in layer $l$ by $F_i^l$. Given a content image \(\mathbf{p}\) and style image \(\mathbf{a}\), the content loss at layer \(l\) is $$ \mathcal{L}_{\mathrm{content}}(\mathbf{p},\mathbf{x},l) = \frac{1}{2} \sum_{i} \bigl(F_{i}^l(\mathbf{x}) - F_{i}^l(\mathbf{p})\bigr)^2 $$ This loss pushes the feature activations of the output image toward those of the content image, and the layers to use can be strategically selected to ensure certain features are preserved. For example, the first few layers of VGG-19 are known to capture low-level features such as edges and textures, while later layers capture high-level features such as object parts and semantic information. By selecting the appropriate layers, we can control which features are preserved in the output image.
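To make the objective concrete, here is a minimal NumPy sketch of this per-layer content loss. It operates on pre-extracted feature maps (in a real pipeline these would be VGG-19 activations); the array shapes are illustrative assumptions, not our actual implementation:

```python
import numpy as np

def content_loss(F_x, F_p):
    """Per-layer content loss: half the sum of squared differences
    between feature activations of the render F_x and those of the
    content image F_p. Both have shape (num_features, H*W)."""
    return 0.5 * np.sum((F_x - F_p) ** 2)
```

When the rendered features match the content features exactly, the loss is zero; it grows quadratically as they diverge.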
The second loss is the style loss, which ensures that the output image resembles the style image. The style of an image can be captured by the feature statistics in its Gram matrix, the matrix of inner products between the image's feature maps. The Gram matrix captures the correlations between different feature maps and is used to represent the style of an image. $$ G_{ij}^l(\mathbf{x}) = \sum_{k} F_{ik}^l(\mathbf{x})\,F_{jk}^l(\mathbf{x}), $$ and the overall style loss is $$ \mathcal{L}_{\mathrm{style}}(\mathbf{a},\mathbf{x}) = \sum_{l=0}^L \sum_{ij} (G_{ij}^l(\mathbf{x}) - G_{ij}^l(\mathbf{a}))^2 $$ Some works define different metrics over the Gram matrices, such as cosine similarity, and often do not use the full Gram matrix. NNST (Kolkin et al., 2022) instead takes a nearest-neighbor approach, similar to ARF: each feature vector of the output is matched to its nearest neighbor among the style features, and those distances are minimized. This gives the nearest-neighbor feature matching loss, as described in ARF: $$ \mathcal{L}_{\mathrm{NNFM}}(\mathbf{a},\mathbf{x}) = \sum_{l=0}^L \sum_{i} \min_{j} D\bigl(F_{i}^l(\mathbf{x}), F_{j}^l(\mathbf{a})\bigr) $$ where $D$ is a feature distance, commonly the cosine distance.
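The same conventions give short NumPy sketches of the Gram-based style loss and the nearest-neighbor matching idea (cosine distance is one common choice for $D$; the shapes and normalization details are illustrative assumptions):

```python
import numpy as np

def gram(F):
    """Gram matrix G_ij = sum_k F_ik F_jk for features F of shape
    (num_features, H*W)."""
    return F @ F.T

def gram_style_loss(F_x, F_a):
    """Squared Frobenius distance between the two Gram matrices."""
    return np.sum((gram(F_x) - gram(F_a)) ** 2)

def nnfm_loss(F_x, F_a, eps=1e-8):
    """Nearest-neighbor feature matching: each rendered feature vector
    (a column of F_x) is matched to its closest style feature vector
    under cosine distance, and those distances are averaged."""
    X = F_x / (np.linalg.norm(F_x, axis=0, keepdims=True) + eps)
    A = F_a / (np.linalg.norm(F_a, axis=0, keepdims=True) + eps)
    cos_dist = 1.0 - X.T @ A              # pairwise cosine distances
    return cos_dist.min(axis=1).mean()    # nearest style match per location
```

Note that the Gram loss compares global feature statistics, while the nearest-neighbor variant matches local features individually, which tends to transfer sharper stylistic details.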
Gaussian splatting represents scene geometry with a set of 3D Gaussians, each parameterized by a mean $\mu$, covariance $\Sigma$, opacity $\alpha$, and color $c$ (through spherical harmonics). With $\sigma(\cdot)$ the sigmoid function, a 3D Gaussian is defined by $$ G_i(x) = \sigma(\alpha_i)e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^\top \boldsymbol\Sigma_i^{-1} (\mathbf{x}-\boldsymbol\mu_i)}, $$ $$ \Sigma_i = R_iS_iS_i^TR_i^T $$ where $S_i$ is a scaling matrix and $R_i$ is a rotation matrix, derived from two independently optimizable vectors: a scale vector $s$ and a quaternion $q$. When rendering, we project the 3D means to 2D pixel coordinates through a perspective transformation and the 3D covariances into screen space via $$ \Sigma' = (JW)\Sigma(W^TJ^T) $$ where $J$ is the Jacobian of the projection and $W$ is the desired view transformation. The key to the method is a rasterizer that depth-sorts anisotropic splats and alpha-blends them: each point is 'splatted' with a 2D Gaussian kernel in screen space, and the kernels are composited front to back to form the final image. The color contribution at pixel $p$ is $$ C(p) = \sum_{i=1}^N c_i g_i(p) \prod_{j=1}^{i-1} (1 - g_j(p)) $$ where $g_i$ is the projected 2D Gaussian kernel for splat $i$ and $c_i$ is its color. In practice, this equation lets us differentiably form images from our parameters. Once an image is rendered, we can compare it to the ground truth and backpropagate to optimize the parameters of our Gaussians. The standard loss combines an L1 difference with a Structural Dissimilarity (DSSIM) loss: $$ L_{splat} = \lambda L_{L1} + (1 - \lambda) L_{DSSIM} $$
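As a sanity check on the compositing equation, here is a direct per-pixel NumPy sketch. The real rasterizer does this in parallel over tiles of depth-sorted splats; this scalar version is only illustrative:

```python
import numpy as np

def composite_color(colors, alphas):
    """Front-to-back alpha compositing at one pixel, implementing
    C(p) = sum_i c_i g_i prod_{j<i} (1 - g_j).
    colors: (N, 3) splat colors; alphas: (N,) evaluated 2D kernel
    values g_i, already sorted front to back."""
    transmittance = 1.0
    out = np.zeros(3)
    for c, g in zip(colors, alphas):
        out += np.asarray(c) * g * transmittance
        transmittance *= (1.0 - g)
    return out
```

A fully opaque front splat hides everything behind it, while a half-transparent front splat blends equally with an opaque splat behind it.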
As we begin our journey to create a stylized 3D scene, an intuitive first step is to take advantage of the extensive work done in the 2D setting. Specifically, we take inspiration from neural style transfer techniques (Gatys et al., 2015), which are effective at transferring the style of a reference image to a target image. In our case, we use a set of posed images as the targets and a reference style image as the style. As a first attempt, we transfer all dataset images to the style of the reference image and train a 3D scene representation on the stylized images. We hypothesize that stylizing in 2D first and then learning a 3D model, albeit simple, will be a poor way to create a stylized yet realistic scene. This is because 2D style transfer methods are unaware of the spatial structure of a scene and thus cannot create stylized images consistent with the underlying geometry. To test this, we use a state-of-the-art 2D style transfer method, StyleShot (Gao et al., 2024), a forward-pass generative style transfer method built on CLIP and Stable Diffusion. We use it to stylize a set of posed images of a scene one by one, given the same style image. We then use these stylized images to train a 3D scene representation with Gaussian splatting. Below, you can interactively inspect the results.
Notice how the StyleShot baseline cannot preserve the scene's geometry, resulting in many floating Gaussian artifacts. We explore why this occurs by inspecting how the stylization changes the images and why it prevents good scene reconstruction. Looking at the generated images below, it is easy to see that each image stylizes different parts of the scene differently. Thus, the geometry implied by one stylized image does not necessarily agree with that of another, resulting in a corrupted reconstruction.
StyleShot 2D image transfer
As can be seen, the color assigned to the bulldozer's wheels and the intricacies of the geometry vary greatly, making it impossible to apply differentiable rendering techniques for reconstruction.
To further test this hypothesis, we try the recently popularized image generation capabilities of GPT-Image-1 (OpenAI, 2025), which yields much higher-fidelity style transfer. Here, we first generate a natural-language description of a style image, then use it to generate a set of images with GPT-Image-1 by prompting with the style description and one scene image at a time. In the following example, the description is
Turn this image into a hand-drawn anime illustration with soft
lighting and whimsical details inspired by classic Studio Ghibli films.
Apart from the computational infeasibility of this method (each prompt costs approximately $0.25), it is very apparent in the graphic below that the geometry is not preserved at all. Training a Gaussian splatting model on this output would yield nonsense, as no spatial coherence exists between image pairs. It would be interesting to investigate whether the generated images could be registered to their counterparts to estimate a fundamental matrix using DINOv2 features (Oquab et al., 2023), which excel at finding correspondences between semantically related parts of an image, and then train a Gaussian splatting model on the transformed images. However, we leave this investigation for future work due to the computational cost.
A priori 2D style transfer using GPT-Image-1 leads to geometric inconsistencies (top: GPT-generated images, bottom: original images).
We find that a priori image stylization is insufficient to create a good stylized scene, which leads us to consider how we can integrate style optimization jointly with 3D structure optimization instead of prior lossy stylization. We mostly take inspiration from ARF (Zhang et al., 2022), whereby a NeRF is pre-trained on the actual images and then optimized for style by matching the feature maps of rendered images to the style reference image while keeping the renders close to the actual content images. We argue that a NeRF is not a good choice for our application for the following reasons:
The deferred backpropagation method used in ARF is slow, and it forces the style transfer loss to be computed on patches of the image, which can lose information about the global structure of the scene and prevent large-scale style elements from emerging.
ARF cannot control which parameters are optimized during the style transfer process. The loss simply optimizes the NeRF parameters. With Gaussian splatting, we can control which parameters are optimized during the style transfer process. This enables us to ensure that spatial structure is preserved while still allowing for local changes to accommodate the style transfer. Furthermore, we can tune the optimization process per parameter class, allowing for more flexibility in the optimization process.
Being an explicit representation, Gaussian splatting better enforces the scene's geometry during the style transfer process (for reasons stated above). Additionally, it enables us to easily edit the scene geometry if desired and, for instance, mask out 3D regions from the style transfer process, which is not straightforward with NeRFs. Likewise, we can easily convert the splats to a mesh representation if desired.
Crucially, Gaussian splatting unlocks fast training and inference, which is a major requirement for AR/VR applications. In our evaluation section, we show a training- and runtime comparison between our method and ARF/Plenoxels.
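The per-parameter-class control described above can be sketched as a gradient step with its own learning rate per Gaussian parameter class; freezing a class is simply a zero rate. The rate values and class names here are hypothetical, not our tuned settings:

```python
import numpy as np

# Hypothetical learning rates per Gaussian parameter class for the
# style phase; setting a rate to zero freezes that class.
STYLE_LRS = {"means": 1e-4, "scales": 1e-3, "quats": 1e-3,
             "colors": 1e-2, "opacity": 0.0}

def style_step(params, grads, lrs=STYLE_LRS):
    """One gradient-descent step where each Gaussian parameter class
    gets its own learning rate (opacity stays frozen here)."""
    return {k: params[k] - lrs[k] * grads[k] for k in params}
```

In a real PyTorch implementation, the same effect is achieved with optimizer parameter groups, one per class of Gaussian parameters.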
This leads us to formulate our method for 3D style transfer, which roughly follows the ARF pipeline but crucially uses Gaussian splatting. We implement this through a three-phase training sequence, as ARF does. First, we train a Gaussian splat on the training images using $L_{splat}$ described in the Background. Next, we train for a small number of iterations to regress the color coefficients toward the colors of the style image. Finally, we run the style training with $L_{NNFM}$ and $L_{content}$ described in the Background. We show an overview of our pipeline below, using the Berkeley Redwood scene stylized as a dark cubist Picasso as an example.
Our method pipeline. The videos show the progress at each step of the pipeline: After 3D reconstruction (left), after color transfer (middle), and after style transfer (right).
The figure above shows the three phases with intermediate outputs from each phase. After the first phase, we get a classic photorealistic Gaussian splat with low structural-dissimilarity and L1 losses between renders and ground-truth images. From this, we get a set of 3D splats, each with color, opacity, covariance, and position. Next, we match the colors to the style image by comparing rendered colors and minimizing their distance. The Gaussian shapes remain identical, but the colors change; this is evident in the second video, where the texture of the tree stays the same but the scene's color shifts. We would also like to change texture, shape, and geometry for a more complete style transfer, so after color transfer, we unfreeze all Gaussian parameters and optimize the style transfer loss described earlier. The final tree output shows this effect: the tree trunk becomes more diagonally striped, as in the style of the art image we are matching. Below, we share what an example scene looks like during training and then dive into some analysis of our method and how it compares to similar setups.
Our method in action. The video shows the progress of our method as it trains on a scene.
Please inspect the result of the above training process.
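The color-transfer phase of our pipeline is optimization-based, but its effect can be approximated by a simple per-channel moment-matching sketch. This shortcut (and its shapes) is an illustrative assumption, not our actual implementation:

```python
import numpy as np

def match_color_stats(colors, style_colors, eps=1e-8):
    """Shift and rescale per-channel color statistics of the splat
    colors toward those of the style image.
    colors: (N, 3) splat base colors; style_colors: (M, 3) style
    image pixels. Returns recolored (N, 3) array whose per-channel
    mean and std match the style image's."""
    mu_c, std_c = colors.mean(0), colors.std(0) + eps
    mu_s, std_s = style_colors.mean(0), style_colors.std(0) + eps
    return (colors - mu_c) / std_c * std_s + mu_s
```

This kind of statistics matching gives the scene the style image's palette before the full style loss reshapes textures and geometry.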
For our first experiment, we compare our method to the baseline described earlier, which does not use any joint style optimization but instead performs style transfer a priori in 2D and uses the same Gaussian splatting model as our method to reconstruct the scene from the stylized images. We observe that our joint optimization method outperforms this stylization for the reasons described earlier.
Comparison of our method (right) to a priori stylization (left). The left image is the result of a Gaussian Splatting model trained on the style-shot images, while the right image is the result of our method.
Using 3D Gaussians provides excellent flexibility for style transfer. For a given style loss, any selection of the following Gaussian parameters can be optimized: means (positions), covariances (orientation and 3D scale), opacity, color, and spherical harmonics. In the following, we compare three flavors of our method: (1) RGB-only, which optimizes only color and spherical harmonics; (2) Naive, which optimizes all parameters except opacity; and (3) Ours, which optimizes all parameters and regularizes the Gaussians to be more round. Qualitatively, we found our method to perform best. (1) does not fully stylize the scene, as it cannot rotate or move Gaussians; (2) heavily stylizes the scene but produces thin Gaussians that break the immersion up close (in the image below, this phenomenon is visible on the round pot of the tree in the rightmost image); (3) stylizes the scene while keeping the geometry intact and the Gaussians rounder. The image below compares the three methods on the same scene with Van Gogh's Starry Night, Studio Ghibli, and Xu Beihong styling (left to right, controlled by navigation dots).
Three different viewpoint renders across three different optimization groups. We show results across three different styles.
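The roundness regularization used in flavor (3) can be sketched as a simple anisotropy penalty on the per-Gaussian axis scales. The exact form we use may differ; this ratio-based version is illustrative:

```python
import numpy as np

def roundness_penalty(scales, eps=1e-8):
    """Penalize anisotropic Gaussians via the ratio of the largest to
    the smallest axis scale, averaged over splats; perfectly isotropic
    splats give (approximately) zero. scales: (N, 3) positive axis
    scales of the Gaussians."""
    ratio = scales.max(axis=1) / (scales.min(axis=1) + eps)
    return float(np.mean(ratio - 1.0))
```

Added to the style objective with a small weight, such a term discourages the needle-like Gaussians that the naive variant produces.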
Below is a comparison of our method to ARF. Note that our method can be tuned to stylize more or less (by choosing which parameters to optimize); in this instance, we want the scene to be fully stylized. Notice that the background in the ARF scenes is often insufficiently stylized.
Qualitative comparison of our method (top) and ARF (bottom). The style images are overlaid.
Below is a training and runtime comparison between our method and ARF, both evaluated on an RTX 3090 workstation. Our method requires significantly less training time and renders in real time, which is crucial for augmented reality applications. ARF, on the other hand, is much slower to train and uses Plenoxels (Yu et al., 2021) as a base model, which renders at roughly 15 FPS; this is not fast enough for AR, as low rendering frame rates on a headset can cause dizziness and nausea.
Comparison of training and inference times between ARF and our method (ARF data as reported in ARF (Zhang et al., 2022) and Plenoxels (Yu et al., 2021)).
Please interact with the visualizer below to see how our method works across many styles. The leftmost image shows the scene reconstructed with 3D Gaussian Splatting, while the other images demonstrate the results of our method with different styles.
As seen, our method successfully transfers a variety of styles while retaining geometric consistency. Similar to prior art in 3D style transfer, abstract, geometric art yields qualitatively better results. To demonstrate how our method can be extended to VR spaces, we integrate our system with Meta Quest 3 and Vision Pro headsets and share a video demonstration of our system. The rendering runs locally on the headset in real time.
Here is a video of a user entering the 3D scene as a walkable space. You can see how the frame rate is high enough for a smooth experience, and the novel views are high quality.
In this blog post, we developed and presented a method for 3D style transfer. Our main choices are using Gaussian splatting as our 3D representation and a joint optimization method for style transfer. Our experiments demonstrate that integrating style optimization directly into the Gaussian splatting pipeline yields stylized 3D scenes that maintain both the reference artwork's visual characteristics and the scene's underlying geometric consistency. Unlike a priori 2D stylization, whether via StyleShot or GPT-Image-1, which produces per-view inconsistencies and floating artifacts in the reconstructed splats, joint scene-style optimization preserves spatial coherence. However, it depends on a well-structured style loss to optimize over. Using Gram matrices of StyleShot features failed in our tests, which could be due to a complex loss landscape ill-suited to test-time optimization. Luckily, VGG-19, albeit old, works well; this might be because its features are not trained as conditioning for a generative model, whereas StyleShot's are.
Comparing our method to NeRF-based transfer, we find higher fidelity output that is cheaper to run. Explicit scene editing allows for promising future work as well. This principle may extend beyond Gaussian splatting to other explicit geometry representations—such as point clouds, meshes, or voxel grids—suggesting a broader research direction toward end-to-end, geometry-aware stylization. From a practical standpoint, our method opens up new possibilities for immersive VR and AR experiences.
When it comes to live interaction with objects in AR or VR, we note that our method may not preserve the affordances of objects in the scene. For example, if we stylize a scene containing a chair, the style transfer may change the chair's visible affordances, making it difficult for users to interact with it. We could foreseeably address this by creating a loss that preserves affordances or interactivity, or by masking out objects and using different stylization learning rates for them, which explicit methods like Gaussian splatting make straightforward. This would allow us to stylize the background while keeping the objects in the scene intact.
We are also interested in the temporal dimension for added immersion. Our system, in its current state, cannot handle dynamic scenes. Work on dynamic 3D Gaussians (Luiten et al., 2023) could be applicable here to make stylized interactive movies, and we are excited to see how this could be applied to our method. Our method is also limited by the current state of 2D image style transfer: it works well for abstract styles but struggles with styles that require the presence of humans or objects in the artwork. For example, in styles like Keith Haring's, where human figures are central to the artwork, those figures would need to appear in the 3D transfer for the style to be recognizable. Future work could explore more generative methods to overcome this in 3D.