Publications
2025
- [ArXiv] Unsupervised Segmentation by Diffusing, Walking and Cutting. Daniela Ivanova, Marco Aversa, Paul Henderson, and John Williamson. ArXiv preprint, 2025.
We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalised Cuts directly on these self-attention activations to partition the image, minimising transition probabilities between clusters while maximising coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect of incorporating different features, of a constant versus dynamic NCut threshold, and of incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
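As a concrete illustration of the random-walk Normalised Cut step, here is a minimal sketch assuming `attn` is a self-attention matrix over image patches; the symmetrisation, the zero threshold on the eigenvector, and the fixed recursion depth are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(attn):
    W = 0.5 * (attn + attn.T)       # symmetrise the affinity matrix
    d = W.sum(axis=1)
    D = np.diag(d)
    # Second-smallest generalised eigenvector of (D - W) x = lambda D x
    # approximates the minimum Normalised Cut (Shi & Malik, 2000).
    vals, vecs = eigh(D - W, D)
    return vecs[:, 1] >= 0          # boolean patch assignment

def recursive_ncut(attn, ids, depth):
    """Recursively split the patch index set `ids` into a hierarchy."""
    if depth == 0 or len(ids) < 4:
        return [ids]
    mask = ncut_bipartition(attn[np.ix_(ids, ids)])
    left, right = ids[mask], ids[~mask]
    if len(left) == 0 or len(right) == 0:
        return [ids]
    return (recursive_ncut(attn, left, depth - 1)
            + recursive_ncut(attn, right, depth - 1))

# Toy usage: a random row-stochastic "attention" matrix over 64 patches.
rng = np.random.default_rng(0)
A = rng.random((64, 64)); A /= A.sum(axis=1, keepdims=True)
segments = recursive_ncut(A, np.arange(64), depth=3)
```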
- [WACV] ARTeFACT: Benchmarking Segmentation Models on Diverse Analogue Media Damage. Daniela Ivanova, Marco Aversa, Paul Henderson, and John Williamson. IEEE/CVF Winter Conference on Applications of Computer Vision, 2025.
Accurately detecting and classifying damage in analogue media such as paintings, photographs, textiles, mosaics, and frescoes is essential for cultural heritage preservation. While machine learning models excel in correcting degradation if the damage operator is known a priori, we show that they fail to robustly predict where the damage is even after supervised training; thus, reliable damage detection remains a challenge. Motivated by this, we introduce ARTeFACT, a dataset for damage detection in diverse types of analogue media, with over 11,000 annotations covering 15 kinds of damage across various subjects, media, and historical provenance. Furthermore, we contribute human-verified text prompts describing the semantic contents of the images, and derive additional textual descriptions of the annotated damage. We evaluate CNN, Transformer, diffusion-based segmentation models, and foundation vision models in zero-shot, supervised, unsupervised and text-guided settings, revealing their limitations in generalising across media types. By publicly sharing our dataset, we provide a benchmark for the analogue damage detection task, with the aim to advance developments in automated analogue media restoration and preservation.
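For readers implementing the benchmark, evaluation reduces to per-class intersection-over-union over the 15 damage classes; the sketch below is a generic mIoU routine, with array shapes and the ignore index chosen as illustrative assumptions (this is not the released evaluation code).

```python
import numpy as np

def mean_iou(pred, gt, num_classes=15, ignore_index=255):
    """pred, gt: integer label maps of equal shape."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth: skip
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```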
2024
- [NeurIPS] Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models. Athanasios Tragakis*, Marco Aversa*, Chaitanya Kaul, Roderick Murray-Smith, and Daniele Faccio. Advances in Neural Information Processing Systems, 2024.
In this work, we introduce Pixelsmith, a zero-shot text-to-image generative framework to sample images at higher resolutions with a single GPU. We are the first to show that it is possible to scale the output of a pre-trained diffusion model by a factor of 1000, opening the road for gigapixel image generation at no additional cost. Our cascading method uses the image generated at the lowest resolution as a baseline to sample at higher resolutions. For the guidance, we introduce the Slider, a tunable mechanism that fuses the overall structure contained in the first-generated image with enhanced fine details. At each inference step, we denoise patches rather than the entire latent space, minimizing memory demands such that a single GPU can handle the process, regardless of the image’s resolution. Our experimental results show that Pixelsmith not only achieves higher quality and diversity compared to existing techniques, but also reduces sampling time and artifacts.
@article{tragakis2024gpuenoughpushingimage,
  title   = {Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models},
  author  = {Tragakis*, Athanasios and Aversa*, Marco and Kaul, Chaitanya and Murray-Smith, Roderick and Faccio, Daniele},
  journal = {Advances in Neural Information Processing Systems},
  year    = {2024},
}
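A schematic of the patch-wise denoising loop the abstract describes: only one latent patch is resident per model call, and a slider weight blends the upsampled low-resolution result into the latent as structural guidance. The `denoiser` callable, patch geometry, and blending rule are illustrative stand-ins, not the released Pixelsmith code.

```python
import torch

def denoise_step(latent, guidance, denoiser, t, slider=0.3, patch=128, stride=96):
    """latent, guidance: (C, H, W) tensors with H, W >= patch;
    guidance is the upsampled low-resolution generation."""
    latent = (1 - slider) * latent + slider * guidance  # fuse global structure
    _, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    # Offsets cover the full image, including the bottom/right edges.
    ys = sorted({*range(0, H - patch + 1, stride), H - patch})
    xs = sorted({*range(0, W - patch + 1, stride), W - patch})
    for y in ys:
        for x in xs:
            tile = latent[:, y:y + patch, x:x + patch]
            out[:, y:y + patch, x:x + patch] += denoiser(tile, t)  # one patch per call
            weight[:, y:y + patch, x:x + patch] += 1.0
    return out / weight  # average overlapping patches
```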
- [NeurIPS] Generative Fractional Diffusion Models. Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Rembert Daems, Roderick Murray-Smith, and 8 more authors. Advances in Neural Information Processing Systems, 2024.
@article{nobis2023generative,
  title   = {Generative Fractional Diffusion Models},
  author  = {Nobis, Gabriel and Springenberg, Maximilian and Aversa, Marco and Detzel, Michael and Daems, Rembert and Murray-Smith, Roderick and Nakajima, Shinichi and Lapuschkin, Sebastian and Ermon, Stefano and Birdal, Tolga and Opper, Manfred and Knochenhauer, Christoph and Oala, Luis and Samek, Wojciech},
  journal = {Advances in Neural Information Processing Systems},
  year    = {2024},
}
2023
- [NeurIPS Spotlight] DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology. Marco Aversa, Gabriel Nobis, Miriam Hägele, Kai Standvoss, Mihaela Chirica, Roderick Murray-Smith, and 5 more authors. In Advances in Neural Information Processing Systems, 2023.
We present DiffInfinite, a hierarchical diffusion model that generates arbitrarily large histological images while preserving long-range correlation structural information. Our approach first generates synthetic segmentation masks, subsequently used as conditions for the high-fidelity generative diffusion process. The proposed sampling method can be scaled up to any desired image size while only requiring small patches for fast training. Moreover, it can be parallelized more efficiently than previous large-content generation methods while avoiding tiling artefacts. The training leverages classifier-free guidance to augment a small, sparsely annotated dataset with unlabelled data. Our method alleviates unique challenges in histopathological imaging practice: large-scale information, costly manual annotation, and protective data handling. The biological plausibility of DiffInfinite data is validated in a survey by ten experienced pathologists as well as a downstream segmentation task. Furthermore, the model scores strongly on anti-copying metrics, which is beneficial for the protection of patient data.
@inproceedings{NEURIPS2023_f64927f5,
  title     = {DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology},
  author    = {Aversa, Marco and Nobis, Gabriel and H{\"a}gele, Miriam and Standvoss, Kai and Chirica, Mihaela and Murray-Smith, Roderick and Alaa, Ahmed and Ruff, Lukas and Ivanova, Daniela and Samek, Wojciech and others},
  award     = {Spotlight},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {Oh, A. and Naumann, T. and Globerson, A. and Saenko, K. and Hardt, M. and Levine, S.},
  pages     = {78126--78141},
  publisher = {Curran Associates, Inc.},
  volume    = {36},
  year      = {2023},
}
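The two-stage structure of the method can be sketched as follows; `mask_model` and `image_model` are hypothetical callables, and the latent sharing across overlapping patches that avoids tiling artefacts is elided here.

```python
import torch

@torch.no_grad()
def diffinfinite_style_sample(mask_model, image_model, big=2048, patch=512):
    """Stage 1: diffuse a large synthetic segmentation mask.
    Stage 2: fill the image patch by patch, conditioned on the mask."""
    mask = mask_model(size=big)                  # (1, big, big) label map
    image = torch.zeros(3, big, big)
    for y in range(0, big, patch):
        for x in range(0, big, patch):
            m = mask[:, y:y + patch, x:x + patch]
            image[:, y:y + patch, x:x + patch] = image_model(cond=m)
    return image, mask
```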
- [TMLR] Data Models for Dataset Drift Controls in Machine Learning With Optical Images. Luis Oala*, Marco Aversa*, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, and 5 more authors. Transactions on Machine Learning Research, 2023.
Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important public services spanning medicine or environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is performance drops due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of machine learning’s primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between machine learning task model and data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility to create drifts that can help the task model learn better, faster, effectively optimizing the data generating process itself to support the downstream machine vision task. This is an interesting upgrade to existing imaging pipelines which traditionally have been optimized to be consumed by human users but not machine learning models. Alongside the data model code we release two datasets to the public that we collected as part of this work. In total, the two datasets, Raw-Microscopy and Raw-Drone, comprise 1,488 scientifically calibrated reference raw sensor measurements, 8,928 raw intensity variations as well as 17,856 images processed through twelve data models with different configurations. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
@article{oala2023data,
  title   = {Data Models for Dataset Drift Controls in Machine Learning With Optical Images},
  author  = {Oala*, Luis and Aversa*, Marco and Nobis, Gabriel and Willis, Kurt and Neuenschwander, Yoan and Buck, Mich{\`e}le and Matek, Christian and Extermann, Jerome and Pomarico, Enrico and Samek, Wojciech and others},
  journal = {Transactions on Machine Learning Research},
  year    = {2023},
}
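The "drift forensics" application hinges on the data model being differentiable: the task loss can be differentiated with respect to physical processing parameters. A toy sketch, with a two-parameter ISP (gain and gamma) standing in for the paper's full pipeline:

```python
import torch

gain = torch.tensor(1.0, requires_grad=True)
gamma = torch.tensor(2.2, requires_grad=True)

def data_model(raw):
    """Illustrative differentiable ISP: gain then gamma correction."""
    return (gain * raw).clamp(1e-6, 1.0) ** (1.0 / gamma)

raw = torch.rand(8, 1, 32, 32)  # stand-in raw sensor measurements
task_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 2))
loss = torch.nn.functional.cross_entropy(task_model(data_model(raw)),
                                         torch.randint(0, 2, (8,)))
loss.backward()
# d(loss)/d(gain), d(loss)/d(gamma): the task model's sensitivity to
# changes in the data-generating process.
print(gain.grad, gamma.grad)
```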
2022
- [OBPDC] Data-centric AI workflow based on compressed raw images. Marco Aversa, Ziad Malik, Phillip Geier, Fabien Droz, Andres Upegui, Roderick Murray-Smith, and 2 more authors. Proceedings of OBPDC 2022, the 8th International Workshop on On-Board Payload Data Compression, 2022.
In order to extract the full potential of the high volume of image data coming from Earth observation, image compression is needed for transfer and storage, and artificial intelligence (AI) is needed for analysis. The promise of AI is to perform complex operations with low programming effort, naturally shifting the focus of the development of machine learning systems from the code, i.e. the implementation of the neural network, to the training process, and in particular to the acquisition, selection and preparation of training data. Lossy compression (like many other image processing methods), however, was developed primarily to compress already processed images for visual inspection, with no regard for damage to invisible image properties that play an important role in machine learning, such as higher-order statistics, correlations and bias. The Jetraw image format, in contrast, was designed to compress raw image data, preserving its statistics and embedding camera calibration profile and noise model. These features facilitate the generation of accurate raw synthetic data. They allow for “Jetraw functions” to take a Jetraw image as an argument and return another Jetraw image, complete with its newly computed calibration profile and noise model. Several of these functions can be chained to build complex operations while always maintaining metrologically correct data, i.e. values that have independent errors, are unbiased and have a well-defined noise model. Jetraw images and functions may be used in end-to-end models to generate synthetic data with statistics matching those of genuine raw images, and play an important role in data-centric AI methodologies. Here we show how these features are used for a machine-learning task: the segmentation of cars in an urban, suburban and rural environment. Starting from a drone and airship image dataset in the Jetraw format (with calibrated sensor and optics), we use an end-to-end model to emulate realistic satellite raw images with on-demand parameters. First, we study the effect of various satellite parameters on the task’s performance as well as on the compressed image size. These parameters are satellite mirror size, focal length, pixel size and pattern, exposure time and atmospheric haze. Then, we discuss characterising and improving the performance and tolerances of the neural network through the use of on-the-fly generation of data that accurately reflects the statistics of the target system.
@article{aversa2022data,
  title   = {Data-centric AI workflow based on compressed raw images},
  author  = {Aversa, Marco and Malik, Ziad and Geier, Phillip and Droz, Fabien and Upegui, Andres and Murray-Smith, Roderick and Clausen, Christoph and Sanguinetti, Bruno},
  journal = {Proceedings of OBPDC 2022, the 8th International Workshop on On-Board Payload Data Compression},
  year    = {2022},
}
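The idea of chaining "Jetraw functions" while keeping metrologically correct data can be illustrated with a hypothetical value-plus-variance container; the class below is invented for illustration and is not the Jetraw API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CalibratedImage:
    values: np.ndarray    # per-pixel estimates
    variance: np.ndarray  # per-pixel noise model

    def scale(self, k: float) -> "CalibratedImage":
        # Scaling by k multiplies the variance by k^2.
        return CalibratedImage(k * self.values, (k ** 2) * self.variance)

    def add(self, other: "CalibratedImage") -> "CalibratedImage":
        # Independent errors add in quadrature.
        return CalibratedImage(self.values + other.values,
                               self.variance + other.variance)
```

Chaining such operations keeps each intermediate image paired with a well-defined noise model, which is the property the abstract describes.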
- [NeurIPS] Bessel Equivariant Networks for Inversion of Transmission Effects in Multi-Mode Optical Fibres. Joshua Mitton, Simon Mekhail, Miles Padgett, Daniele Faccio, Marco Aversa, and Roderick Murray-Smith. Advances in Neural Information Processing Systems, 2022.
@article{mitton2022bessel,
  title    = {Bessel Equivariant Networks for Inversion of Transmission Effects in Multi-Mode Optical Fibres},
  author   = {Mitton, Joshua and Mekhail, Simon and Padgett, Miles and Faccio, Daniele and Aversa, Marco and Murray-Smith, Roderick},
  journal  = {Advances in Neural Information Processing Systems},
  volume   = {35},
  pages    = {16010--16022},
  year     = {2022},
  abstract = {We develop a new type of model for solving the task of inverting the transmission effects of multi-mode optical fibres through the construction of an SO+(2, 1) equivariant neural network. This model takes advantage of the azimuthal correlations known to exist in fibre speckle patterns and naturally accounts for the difference in spatial arrangement between input and speckle patterns. In addition, we use a second post-processing network to remove circular artifacts, fill gaps, and sharpen the images, which is required due to the nature of optical fibre transmission. This two-stage approach allows for the inspection of the predicted images produced by the more robust physically motivated equivariant model, which could be useful in a safety-critical application, or of the output of both models, which produces high-quality images. Further, this model can scale to previously unachievable resolutions of imaging with multi-mode optical fibres and is demonstrated on 256 × 256 pixel images. This is a result of improving the trainable parameter requirement from $O(N^4)$ to $O(m)$, where $N$ is pixel size and $m$ is the number of fibre modes. Finally, this model generalises to new images, outside of the set of training data classes, better than previous models.},
}
2021
- [Phys. Rev. Lett.] Direct observation of fractal-dimensional percolation in the 3D cluster dynamics of a ferroelectric supercrystal. Ludovica Falsi, Marco Aversa, Fabrizio Di Mei, Davide Pierangeli, Feifei Xin, Aharon J. Agranat, and 1 more author. Physical Review Letters, 2021.
We perform percolation analysis of crossed-polarizer transmission images in a biased nanodisordered bulk KTN:Li perovskite. Two distinct percolative transitions are identified at two electric field thresholds. The low-field transition involves a directional fractal chain of dimension D = 1.65, while the high-field transition has a dimension D > 2. Direct cluster imaging in the volume is achieved using high-resolution orthographic 3D projections based on giant refraction. Percolation is attributed to a full-3D domain reorientation that mediates the transition from a ferroelectric supercrystal state to a disordered domain mosaic.
@article{falsi2021direct,
  title     = {Direct observation of fractal-dimensional percolation in the 3D cluster dynamics of a ferroelectric supercrystal},
  author    = {Falsi, Ludovica and Aversa, Marco and Di Mei, Fabrizio and Pierangeli, Davide and Xin, Feifei and Agranat, Aharon J and DelRe, Eugenio},
  journal   = {Physical Review Letters},
  volume    = {126},
  number    = {3},
  pages     = {037601},
  year      = {2021},
  publisher = {APS},
}
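The fractal dimensions reported above are the kind of quantity a box-counting estimate yields; a generic sketch, where the box sizes (and the upstream thresholding of the transmission images) are illustrative choices rather than the paper's analysis pipeline:

```python
import numpy as np

def box_counting_dimension(binary_img, sizes=(2, 4, 8, 16, 32)):
    """Estimate the fractal dimension of a binary cluster image."""
    counts = []
    for s in sizes:
        h = (binary_img.shape[0] // s) * s
        w = (binary_img.shape[1] // s) * s
        blocks = binary_img[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())  # occupied boxes at scale s
    # Slope of log N(s) versus log(1/s) estimates the dimension D.
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope
```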
Workshop Papers
2024
- [ECCV] State-of-the-Art Fails in the Art of Damage Detection. Daniela Ivanova, Marco Aversa, Paul Henderson, and John Williamson. ECCV 2024 Vision for Art (VISART) Workshop, 2024.
Accurately detecting and classifying damage in analogue media such as paintings, photographs, textiles, mosaics, and frescoes is essential for cultural heritage preservation. While machine learning models excel in correcting global degradation if the damage operator is known a priori, we show that they fail to predict where the damage is even after supervised training; thus, reliable damage detection remains a challenge. We introduce DamBench, a dataset for damage detection in diverse analogue media, with over 11,000 annotations covering 15 damage types across various subjects and media. We evaluate CNN, Transformer, and text-guided diffusion segmentation models, revealing their limitations in generalising across media types.
- [ICML] Generative Fractional Diffusion Models. Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Stefano Ermon, Shinichi Nakajima, and 5 more authors. ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024.
We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index H ∈ (0, 1) of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.
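The Ornstein-Uhlenbeck representation at the heart of GFDM can be sketched with a simple Euler-Maruyama simulation: an FBM approximation as a weighted sum of OU processes driven by one shared Brownian motion. The speeds and weights below are placeholder values, not the paper's optimised quadrature.

```python
import numpy as np

def approx_fbm(T=1.0, n=1000, gammas=(0.1, 1.0, 10.0),
               weights=(0.5, 0.3, 0.2), seed=0):
    """Markovian approximation of FBM as a weighted sum of OU processes."""
    rng = np.random.default_rng(seed)
    g, w = np.array(gammas), np.array(weights)
    dt = T / n
    dB = rng.normal(0.0, np.sqrt(dt), size=n)  # shared Brownian increments
    Y = np.zeros(len(g))
    path = np.zeros(n + 1)
    for i in range(n):
        # Euler-Maruyama for dY_k = -gamma_k * Y_k dt + dB_t
        Y = Y - g * Y * dt + dB[i]
        path[i + 1] = w @ Y
    return path
```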
2023
- [NeurIPS] Generative Fractional Diffusion Models. Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Stefano Ermon, Shinichi Nakajima, and 5 more authors. NeurIPS 2023 Workshop on Diffusion Models, 2023.
(Abstract as in the ICML 2024 workshop entry above.)
- [ICML] Data Models for Dataset Drift Controls in Machine Learning With Optical Images. Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, and 5 more authors. International Conference on Machine Learning, Differentiable Almost Everything Workshop, 2023.
(Abstract as in the TMLR article above.)
- [ICML] Data Models for Dataset Drift Controls in Machine Learning With Optical Images. Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, and 5 more authors. International Conference on Machine Learning, Spurious Correlations, Invariance, and Stability Workshop, 2023.
(Abstract as in the TMLR article above.)
2022
- [NeurIPS Contributed Talk] Physical Data Models in Machine Learning Imaging Pipelines. Marco Aversa, Luis Oala, Christoph Clausen, Roderick Murray-Smith, and Bruno Sanguinetti. Advances in Neural Information Processing Systems, Machine Learning and the Physical Sciences Workshop, 2022.
Light propagates from the object through the optics up to the sensor to create an image. Once the raw data is collected, it is processed through a complex image signal processing (ISP) pipeline to produce an image compatible with human perception. However, this processing is rarely considered in machine learning modelling because available benchmark datasets are generally not in raw format. This study shows how to embed the forward acquisition process into the machine learning model. We consider the optical system and the ISP separately. Following the acquisition process, we start from a drone and airship image dataset to emulate realistic satellite raw images with on-demand parameters. The end-to-end process is built to resemble the optics and sensor of the satellite setup. These parameters are satellite mirror size, focal length, pixel size and pattern, exposure time and atmospheric haze. After raw data collection, the ISP plays a crucial role in neural network robustness. We jointly optimize a parameterized differentiable image processing pipeline with a neural network model. This can speed up and stabilize classifier training, with gains of up to 20% in validation accuracy.
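The joint optimization described above amounts to putting the ISP parameters and the network weights under a single optimizer; a deliberately tiny sketch, with one learnable gamma parameter standing in for the full differentiable pipeline and random tensors standing in for real raw data:

```python
import torch

log_gamma = torch.zeros(1, requires_grad=True)  # gamma = exp(log_gamma) > 0
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16 * 16, 10))
opt = torch.optim.Adam(list(net.parameters()) + [log_gamma], lr=1e-3)

for _ in range(100):  # toy training loop
    raw = torch.rand(32, 1, 16, 16)           # stand-in raw batch
    labels = torch.randint(0, 10, (32,))
    # Differentiable ISP step: gradients flow to log_gamma and to the net.
    img = raw.clamp(1e-6, 1.0) ** torch.exp(-log_gamma)
    loss = torch.nn.functional.cross_entropy(net(img), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```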