
Optimising convolutional neural networks for large-scale neuroimaging studies

Ageing has a pronounced effect on the human brain, leading to cognitive decline and an increased risk of neurodegenerative diseases. Thus, the ageing population presents a significant challenge for healthcare. The use of MRI and the availability of computational methods for analysing MRI data are increasingly contributing to the understanding of healthy and diseased structural brain maturation and ageing. Increasingly, large cross-sectional and longitudinal neuroimaging studies are bec...


  • Open access
  • Published: 01 May 2024

Novel applications of Convolutional Neural Networks in the age of Transformers

  • Tansel Ersavas 1,
  • Martin A. Smith 1,2,3,4 &
  • John S. Mattick 1

Scientific Reports, volume 14, Article number: 10000 (2024)


Subjects: Computational science, Machine learning

Convolutional Neural Networks (CNNs) have been central to the Deep Learning revolution and played a key role in initiating the new age of Artificial Intelligence. However, in recent years newer architectures such as Transformers have come to dominate both research and practical applications. While CNNs still play critical roles in many newer developments such as Generative AI, they are far from being thoroughly understood and utilised to their full potential. Here we show that CNNs can recognise patterns in images with scattered pixels, and can be used to analyse complex datasets by transforming them into pseudo-images with minimal processing, representing a more general approach to the application of CNNs to high-dimensional datasets such as those in molecular biology, text, and speech. We introduce a pipeline called DeepMapper, which allows analysis of very high-dimensional datasets without intermediate filtering and dimension reduction, thus preserving the full texture of the data and enabling detection of small variations normally deemed 'noise'. We demonstrate that DeepMapper can identify very small perturbations in large datasets consisting mostly of random variables, and that it is superior in speed and on par in accuracy with prior work in processing large datasets with large numbers of features.


Introduction

There are exponential increases in data 1, especially from highly complex systems whose non-linear interactions and relationships are not well understood, and which can display major or unexpected changes in response to small perturbations, known as the 'butterfly effect' 2.

In domains characterised by high-dimensional data, traditional statistical methods and Machine Learning (ML) techniques make heavy use of feature engineering, incorporating extensive filtering, selection of highly variable parameters, and dimension-reduction techniques such as Principal Component Analysis (PCA) 3. Most current tools filter out smaller changes in data, mostly considered artefacts or 'noise', which may nevertheless contain information that is paramount to understanding the nature and behaviour of such highly complex systems 4.

The emergence of Deep Learning (DL) offers a paradigm shift. DL algorithms, underpinned by adaptive learning mechanisms, can discern both linear and non-linear data intricacies, and open avenues to analyse data in ways that are not possible or practical with conventional techniques 5, particularly in complex domains such as image and temporal sequence analysis, molecular biology, and astronomy 6. DL models, such as Convolutional Neural Networks (CNNs) 7, Recurrent Neural Networks (RNNs) 8, Generative Networks 9 and Transformers 10, have demonstrated exceptional performance in various domains, such as image and speech recognition, natural language processing, and game playing 6. CNNs and LSTMs have also been found to be effective tools for predicting the behaviour of so-called 'chaotic' systems 11. Modern DL systems often surpass human-level performance, and challenge humans even in creative endeavours.

CNNs utilise a unique architecture comprising several layers, including convolutional layers, pooling layers, and fully connected layers, to process and transform the input data hierarchically 5. CNNs have no knowledge of sequence, and are therefore generally not used for analysing time-series or similar data, which is traditionally attempted with Recurrent Neural Networks (RNNs) 12 and Long Short-Term Memory networks (LSTMs) 8, due to the ability of the latter to capture temporal patterns. Where CNNs have been employed for sequence or time-series analysis, 1-dimensional (1D) CNNs have been selected because of their vector-based 1D input structure 13. However, attempts to analyse such data with 1D CNNs do not always give superior results 14. In addition, GPUs (Graphics Processing Units) are not always optimised for processing 1D CNNs; therefore, even though 1D CNNs have fewer parameters than 2-dimensional (2D) CNNs, 2D CNNs can outperform 1D CNNs 15.

Transformers, introduced by Vaswani et al. 10, have recently come to prominence, particularly for tasks where data are in the form of time series or sequences, in domains ranging from language modelling to stock market prediction 16. Transformers leverage self-attention, a key component that allows a model to weigh and focus on various parts of an input sequence when producing an output, enabling the capture of long-range dependencies in data. Unlike CNNs, which use local receptive fields, self-attention weighs the significance of various parts of the input data 17.
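To make the mechanism concrete, scaled dot-product self-attention can be sketched in a few lines of PyTorch; this is an illustrative sketch (the names and toy tensor shapes are ours, not from the paper):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Each output position is a weighted sum of all values, with weights
    derived from query-key similarity (Vaswani et al., 2017)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise similarity
    weights = torch.softmax(scores, dim=-1)        # attention distribution
    return weights @ v                             # weighted sum of values

# toy usage: a sequence of 5 tokens with 8-dimensional embeddings
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
print(out.shape)  # torch.Size([1, 5, 8])
```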

Following their success with sequence-based tasks, Transformers are being extended to image processing. Vision Transformers for object detection 18, Detection Transformers 19 and, lately, Real-time Detection Transformers 20 all claim superiority over CNNs. However, their inference operations demand far more resources than CNNs, and they trail CNNs in flexibility. They also suffer from similar augmentation problems to CNNs. More recently, Retentive Networks have been offered as an alternative to Transformers 21 and may soon challenge the Transformer architecture.

CNNs can recognise dispersed patterns

Even though CNNs are widely used, there are some misconceptions, notably that CNNs are largely limited to image data, and require established spatial relationships between pixels in images, both of which are open to challenge. The latter is of particular importance when considering the potential of CNNs to analyse complex non-image datasets, whose data structures are arbitrary.

Moreover, while CNNs are universal function approximators 22, they may not always generalise 23, especially if they are trained on data that is insufficient to cover the solution space 24. It is also known that they can spontaneously generalise even when supplied with only a small number of samples during training, after first overfitting, a phenomenon called 'grokking' 25, 26. CNNs can generalise from scattered data if given enough samples, or if they grok, and this can be determined by observing changes in training versus testing accuracy and loss.

Non-image processing with CNNs

While CNNs have achieved remarkable success in computer vision applications, such as image classification and object detection 7, 27, they have also been employed, to a lesser degree but with impressive results, in other domains, including: (1) natural language processing, text classification, sentiment analysis and named entity recognition, by treating text data as a one-dimensional image with characters represented as pixels 16, 28; (2) audio processing, such as speech recognition, speaker identification and audio event detection, by applying convolutions over time-frequency representations of audio signals 29; (3) time-series analysis, such as financial market prediction, human activity recognition and medical signal analysis, using one-dimensional convolutions to capture local temporal patterns and learn features from time-series data 30; and (4) biopolymer (e.g., DNA) sequencing, using 2D CNNs to accurately classify molecular barcodes in raw signals from Oxford Nanopore sequencers by applying a transformation that turns a 1D signal into 2D images, improving barcode identification recovery from 38% to over 85% 31.

Indeed, CNNs are not perfect tools for image processing, as they do not develop a semantic understanding of images even though they can be trained to perform semantic segmentation 32. They cannot easily recognise negative images when trained with positive images 33. CNNs are also sensitive to the orientation and scale of objects and must rely on augmentation of image datasets, often involving hundreds of variations of the same image 34. By contrast, there are no such changes of perspective and orientation in data converted into flat 2D images.

In complex domains that generate huge amounts of data, augmentation is usually not required for non-image datasets, as the datasets will be rich enough. Moreover, introducing arbitrary augmentation does not always improve accuracy; indeed, introducing hand-tailored augmentation may hinder the analysis 35. If augmentation is required, it can be introduced in a data-oriented form, but even when using automated augmentation such as AutoAugment 35 or Faster AutoAugment 36, many of the augmentations (such as shearing, translation, rotation, inversion, etc.) should not be used, and the results should be tested carefully, as augmentation may introduce artefacts.

A frequent problem in handling non-image datasets with many variables is noise. Many algorithms have been developed for noise elimination, most of which are domain-specific. CNNs can be trained to use the whole input space with minimal filtering and no dimension reduction, and can find useful information in what might otherwise be dismissed as 'noise' 4, 37. Indeed, a key reason to retain 'noise' is to allow the discovery of small perturbations that cannot be detected by other methods 11.

Conversion of non-image data to artificial images for CNN processing

Transforming sequence data to images without resorting to dimension reduction or filtering offers a potent toolset for discerning complex patterns in time-series and sequence data, and brings out the two major advantages of CNNs compared to RNNs, LSTMs and Transformers. First, CNNs do not depend on past data to recognise current patterns, which increases their sensitivity to patterns that appear at the beginning of time-series or sequence data. Second, 2D CNNs are better optimised for GPUs and are highly parallelisable, and consequently faster than other current architectures, which accelerates training and inference while significantly reducing resource and energy consumption in all phases, including image transformation, training, and inference.

Image data such as MNIST, represented as a matrix, can be classified by basic deep networks such as Multilayer Perceptrons (MLPs) by turning the matrix representation into a vector (Fig. 1a). With this approach, the analysis becomes increasingly complex as the image size grows, rapidly increasing the number of input parameters of the MLP and the computational cost. On the other hand, 2D CNNs can handle the original matrix much faster than an MLP, with equal or better accuracy, and scale to much larger images.

Figure 1. Conversion of images to vectors and vice versa. (a) Basic operation of transforming an image into a vector, forming a sequence representation of the numeric values of the pixels. (b) Transforming a vector into a matrix, forming an image by encoding numerical values as pixels. During this operation, if the vector is shorter than the nearest m x n, it is padded with zeroes to length m x n.

Just as a simple neural network analyses a 2D image by turning it into a vector, the reciprocal is also possible: data in a vector can be converted into a 2D matrix (Fig. 1b). Vectors converted to such matrices form arbitrary patterns that are incomprehensible to the human eye. A similar technique for such mapping has also been proposed by Kovalerchuk et al., using another algorithm called CPC-R 38.
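To make Fig. 1b concrete, here is a minimal sketch of such a vector-to-matrix folding; the function name and the square-shape default are our own illustrative choices (the actual DeepMapper implementation is on GitHub 50):

```python
import numpy as np

def fold_vector(v, m=None, n=None):
    """Fold a 1D feature vector into a 2D matrix (cf. Fig. 1b).

    If no shape is given, use the smallest square that fits the data;
    shorter vectors are zero-padded, as described in the caption above.
    """
    v = np.asarray(v, dtype=np.float32)
    if m is None or n is None:
        side = int(np.ceil(np.sqrt(v.size)))
        m = n = side
    padded = np.zeros(m * n, dtype=v.dtype)
    padded[:v.size] = v
    return padded.reshape(m, n)

# usage: 7 features folded into a 3 x 3 'pseudo-image' (2 zero pixels)
img = fold_vector([0.1, 0.9, 0.3, 0.5, 0.2, 0.8, 0.4])
print(img.shape)  # (3, 3)
```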

Attribution

An important aspect of any analysis is to be able to identify the variables that are most important, and the degree to which they contribute, to a given classification. Identifying these variables is particularly challenging in CNNs due to their complex hierarchical architecture and many non-linear transformations 39. To address this problem, many 'attribution methods' have been developed to try to quantify the contribution of each variable (e.g., pixels in images) to the final output of deep neural networks and CNNs 40.

Saliency maps serve as an intuitive attribution and visualisation tool for CNNs, spotlighting regions in the input data that significantly influence the model's predictions 27. By offering a heatmap representation, these maps illuminate the key features that the model deems crucial, thus helping to demystify the model's decision-making process. For instance, when analysing an image of a cat, the saliency map would emphasise the cat's distinct features over the background. While their simplicity facilitates understanding even for those less acquainted with deep learning, saliency maps do face challenges, particularly their sensitivity to noise and occasional misalignment with human intuition 41, 42, 43. Nonetheless, they remain a pivotal tool in enhancing model transparency and bridging the interpretability gap between ML models and human comprehension.

Several methods have been proposed for attribution, including Guided Backpropagation 44, Layer-wise Relevance Propagation 45, Gradient-weighted Class Activation Mapping 46, Integrated Gradients 47, DeepLIFT 48, and SHAP (SHapley Additive exPlanations) 49. Many of these methods were developed because it is challenging to identify important input features when different images with the same label (e.g., 'bird', with many species) are presented at different scales, colours, and perspectives. In contrast, most non-image data does not have such variations, as each pixel corresponds to the same feature. For this reason, choosing attributions with minimal processing is sufficient to identify the salient input variables that have the maximal impact on classification.
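As an illustration of how an attribution method can link each pixel, and hence each feature, back to a prediction, the following sketch uses Captum's Integrated Gradients (the attribution library used by DeepMapper 51); the stand-in model and input shape here are hypothetical:

```python
import torch
from captum.attr import IntegratedGradients

# hypothetical stand-in classifier over 135 x 135 pseudo-images; any
# trained PyTorch model over folded data could take its place
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(135 * 135, 2)
)
model.eval()

ig = IntegratedGradients(model)
x = torch.randn(1, 1, 135, 135)      # one folded observation
attr = ig.attribute(x, target=0)     # per-pixel contribution to class 0

# because each pixel maps back to exactly one input feature, ranking
# attribution magnitudes ranks the original variables by importance
ranked = attr.flatten().abs().argsort(descending=True)
print(ranked[:10])                   # ten most influential features
```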

Here we introduce a new analytical pipeline, DeepMapper, which applies a non-indexed or indexed mapping to the data, representing each data value with one pixel, enabling the classification or clustering of data using 2D CNNs. This simple direct mapping has been tried by others, but has not been tested on datasets with sufficiently large amounts of data under various conditions. We use raw data with minimal filtering and no dimension reduction, in order to preserve the small perturbations in data that are normally removed and to assess their impact.

The pipeline includes data conversion, separation into training and validation sets, assessment of training quality, attribution, and accumulation of results. It is run multiple times until a consensus is reached. The significant variables can then be identified using attribution and exported appropriately.

The DeepMapper architecture is shown in Fig. 2. The complete algorithm is detailed in the "Methods" section, and the Python source code is available on GitHub 50.

Figure 2. DeepMapper architecture. DeepMapper takes sequence or multi-variate data as input. The first step is to merge and, if required, index the input files to prepare them in matrix format. The data are normalised using log normalisation, then folded into a matrix. Folding is performed either directly, in the natural order of the data, or by using an index that is generated or supplied during data import. After folding, the data are kept in temporary storage and separated into 'train' and 'test' sets using a scikit-learn train/test split. Training is done using either CNNs supplied by the PyTorch libraries or a custom CNN (ResNet18 is used by default). Intermediate results are run through attribution algorithms supplied by Captum 51 and saved to the run history log. The run is then repeated until convergence is achieved, or until a pre-determined number of iterations is performed, shuffling the training, testing and validation data each time. Results are summarised in a report with exportable tables and graphics. Attribution is applied to true positives and true negatives, and these are translated back to features to be added to the reports. Further details can be found directly in the accompanying code 50.

DeepMapper was developed to process high-dimensional data without resorting to the excessive filtering and dimension-reduction techniques that eliminate smaller perturbations in data, so that differences that would otherwise be filtered out can be identified. The following algorithm is used to achieve this (a condensed code sketch follows the steps below):

1. Read and set up the running parameters.

2. Read the data into tabulated form as observations, features, and outcome (labels or, if self-supervised, the input itself).

3. If the input data includes categorical features, convert them to numbers and normalise them before feeding them to DeepMapper.

4. Identify features and labels.

5. Do only basic filtering, eliminating observations or features whose values are all 0 or empty.

6. Normalise features.

7. Transform the tabulated data to 2-dimensional matrices, as illustrated in Fig. 1b, by applying a vector-to-matrix transformation.

8. If the analysis is supervised, transform class labels to output matrices.

9. Begin iteration:
   a. Separate the data into training and validation groups.
   b. Train on the dataset for the required number of epochs, until satisfactory testing accuracy and loss are reached or a pre-determined maximum number of iterations is performed.
   c. If satisfactory testing results are obtained, perform attributions by associating each result with its contributing input pixels using Captum, a Python library for attributions 51, and accumulate the attribution results for each class.

10. If training is satisfactory, tabulate attribution results by averaging the accumulated attributions and save the model.

11. Report results.
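A condensed sketch of this loop in PyTorch might look as follows; the data, shapes, and hyperparameters are illustrative stand-ins rather than the published configuration (see the GitHub code 50 for the real pipeline):

```python
import torch
from torchvision.models import resnet18

# stand-in data: 1000 observations already folded into pseudo-images
X = torch.rand(1000, 1, 32, 32)
y = torch.randint(0, 2, (1000,))           # stand-in class labels

perm = torch.randperm(len(X))              # shuffle, then split train/test
tr, te = perm[:750], perm[750:]

model = resnet18(num_classes=2)            # the default backbone
model.conv1 = torch.nn.Conv2d(1, 64, 7, 2, 3, bias=False)  # 1-channel input
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(15):                    # train until accuracy and loss
    opt.zero_grad()                        # are satisfactory
    loss = loss_fn(model(X[tr]), y[tr])
    loss.backward()
    opt.step()

with torch.no_grad():                      # held-out evaluation
    acc = (model(X[te]).argmax(1) == y[te]).float().mean()
    print(f"test accuracy: {acc:.2f}")
```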

The results of a DeepMapper analysis can be used in two ways:

Supervised: DeepMapper produces a list of features that played a prominent role in the differentiation of classes.

Self-supervised: Highlights the most important features in differentiating observations from each other in a non-linear fashion. The output can be used as an alternative feature selection tool for dimension reduction.

In both modes, any hidden layer can be examined as a latent space. A special bottleneck layer can be introduced to reduce dimensions for clustering purposes.

We present a simple example to demonstrate that CNNs can readily interpret data with a well-dispersed pattern of pixels, using the MNIST dataset, which is widely used for hand-written digit recognition and which humans, as well as CNNs, can easily recognise and classify based on the obvious spatial relationships between pixels (Fig. 3). Our shuffled version is a more complicated problem than datasets such as the Gisette dataset 52, which was developed to distinguish between the digits 4 and 9: it includes all ten digits and uses a full randomisation of the pixels. It can be regenerated with the script supplied 50; changing the seed will generate different patterns.

Figure 3. A sample from the MNIST dataset (left side of each image) and its shuffled counterpart (right side).

We randomly shuffled the data in Fig. 3 using the same seed 50 to obtain 60,000 training images such as those shown on the right side of each digit, and validated the results with a separate batch of 20,000 images (Fig. 3). Although the resulting images are no longer recognisable by eye, a CNN has no difficulty distinguishing and classifying each pattern, with ~2% testing error compared to the reference data (Fig. 4). This result demonstrates that CNNs can accurately recognise global patterns in images without relying on local relationships between neighbouring pixels. It also confirms the finding that shuffling images only marginally increases training loss 23, and extends it to testing loss (Fig. 4).
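The experiment hinges on applying one fixed permutation to every image, which destroys local pixel neighbourhoods while keeping the global, dispersed pattern consistent per class. A sketch of this shuffle (the seed and names are illustrative; the actual script is supplied 50):

```python
import numpy as np
from torchvision.datasets import MNIST

rng = np.random.default_rng(seed=42)       # illustrative seed
perm = rng.permutation(28 * 28)            # same permutation for all images

train = MNIST(root="data", train=True, download=True)
images = train.data.numpy().reshape(-1, 28 * 28)
shuffled = images[:, perm].reshape(-1, 28, 28)  # still one image per digit
```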

Figure 4. Results of training the MNIST dataset (a) and the shuffled dataset (b) with the PyTorch model ResNet18 50. The charts demonstrate that although training continued for 50 epochs, about 15 epochs would be enough for the shuffled images (b), as further training starts to cause overfitting. The decrease in accuracy between normal and shuffled images is about 3%, and this difference cannot be improved by using more sophisticated CNNs with more layers, meaning that shuffling images causes a measurable loss of information, yet the shuffled images still hold patterns recognisable by CNNs.

Testing DeepMapper

Finding slight changes in very few variables in otherwise seemingly random datasets with large numbers of variables is like finding a needle in a haystack. Such differences in data are almost impossible to detect using traditional analysis tools because small variations are usually filtered out before analysis.

We devised a simple test case to determine whether DeepMapper can detect one or more variables with small but distinct variations in otherwise randomly generated data. We generated a dataset of 10,000 data items with 18,225 numeric variables as an example of a high-dimensional dataset, using PyTorch's uniform random algorithms 53. The algorithm sets 18,223 of these variables to random numbers in the range 0-1, and draws the remaining two variables from two distinct groups, as seen in Table 1.

We call this type of dataset a 'Needle in a haystack' (NIHS) dataset: a very small amount of data with small variance is hidden among a set of random variables that is orders of magnitude larger than the meaningful component. We provide a script that can generate this and similar datasets with the source code supplied 50.
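A sketch of how such an NIHS-style dataset can be generated; the perturbation size and the affected feature indices are illustrative assumptions (the published values are in Table 1 and the supplied script 50):

```python
import torch

# 10,000 observations x 18,225 features, all uniform noise except
# features 0 and 1, whose means differ slightly between the classes
N, D = 10_000, 18_225
X = torch.rand(N, D)                       # 18,223 pure-noise features
y = (torch.arange(N) % 2).long()           # two balanced classes

delta = 0.05                               # illustrative perturbation
X[y == 1, 0] += delta                      # feature 0 shifted in class 1
X[y == 1, 1] -= delta                      # feature 1 shifted in class 1

# folded as 135 x 135 pseudo-images, since 135 * 135 = 18,225
images = X.reshape(N, 1, 135, 135)
```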

DeepMapper was able to accurately classify the two datasets (Fig. 5). Furthermore, using attribution, DeepMapper was also able to determine the two variables that have different variances in the two classes. Note that DeepMapper may not always find all the changes on the first attempt, as neural network weight initialisation is a stochastic process. However, DeepMapper overcomes this issue via multiple iterations to establish acceptable training and testing accuracies, as described in the Methods.

Figure 5. Analysis of high-dimensional data with very small perturbations: DeepMapper finds small variations in a few variables (in this example two) out of a very large number of random variables (here 18,225). (a) DeepMapper representations of each record. (b) The result of the test run of the classification with unseen data (3750 elements). (c) The first and second variables in the graph are measurably higher than the other variables.

Comparison of DeepMapper with DeepInsight

DeepInsight 54 is the most general approach published to date for converting non-image data into image-like structures, with the claim that these processed structures allow CNNs to capture complex patterns and features in the data. DeepInsight offers an algorithm that creates images in which similar features are collated into a "well organised image form", or applies one of several dimensionality-reduction algorithms (e.g., t-SNE, PCA or KPCA) 54. However, these algorithms add computational complexity, potentially eliminate valuable information, limit the ability of CNNs to find small perturbations, and make it more difficult to use attribution to determine the most notable features impacting the analysis, as multiple features may overlap in the transformed image. In contrast, DeepMapper uses a direct mapping mechanism in which each feature corresponds to one pixel.

To identify important input variables, the DeepInsight authors later developed DeepFeature 55, using an elaborate mechanism to associate the image areas identified by attribution methods with the input variables. DeepMapper uses a simpler approach: as each pixel corresponds to only one variable, it can use any of the attribution methods to link results back to its input space. While both DeepMapper and DeepInsight follow the general idea that non-image data can be processed with 2D CNNs, DeepMapper uses a much simpler and faster algorithm, whereas DeepInsight uses a sophisticated set of algorithms to convert non-image data to images, dramatically increasing the computational cost. The DeepInsight conversion process is not designed to utilise GPUs, so it cannot be accelerated by better hardware, and the resulting images may be larger than the number of data points, further impacting performance.

One of the biggest differences between DeepFeature and DeepMapper is that DeepFeature in many cases selects multiple features during attribution, because DeepInsight pixels represent multiple values, whereas each DeepMapper pixel represents one input feature; DeepMapper can therefore determine differentiating features with pinpoint accuracy, at a resolution of one pixel per feature.

The DeepInsight manuscript offers various examples to demonstrate its abilities. However, many of the examples are low-dimensional (20-4000 features), while today's complex datasets may regularly require tens of thousands to millions of features, as in genome analysis in biology or radio-telescope analysis in astronomy. As such, several of the examples provided by DeepInsight have insufficient dimensions for a mechanism such as DeepMapper, which is aimed at datasets with 10,000 or more dimensions, as found in modern complex data. The DeepInsight examples include a speech dataset from the TIMIT corpus, with 39 dimensions; the Relathe (text) dataset, derived from newsgroup documents and partitioned evenly across different newsgroups, with 1427 samples and 4322 dimensions; ringnorm-DELVE, an implementation of Leo Breiman's ringnorm example, a 20-dimensional, 2-class classification problem with 7400 samples 54; and Madelon, an artificially generated dataset with 2600 samples and 500 dimensions, in which only 5 principal and 20 derived variables contain information. Instead, we used a much more complicated example than Madelon: the NIHS dataset 50 that we used to test DeepMapper in the first place. We attempted to run DeepInsight with the NIHS data, but we could not get it to train properly, and for this reason we cannot supply a comparison.

The most complex problem published by DeepInsight was the analysis of a public RNA-sequencing gene expression dataset from TCGA (https://cancergenome.nih.gov/) containing 6216 samples of 60,483 genes or dimensions, of which DeepInsight used 19,319. We selected this example as the second demonstration of the application of DeepMapper to high-dimensional data, as well as a benchmark for comparison with DeepInsight.

We generated the data using the R script offered by DeepInsight 54 and ran DeepMapper as well as DeepInsight using the generated dataset to compare accuracy and speed. In this test DeepMapper exhibited much improved processing speed with near identical accuracy (Table 2 , Fig.  6 ).

Figure 6. Analysis of TCGA data by DeepInsight vs DeepMapper. The image on the top was generated by DeepInsight using its default values and the t-SNE transformer supplied by DeepInsight. The image at the bottom was generated by DeepMapper. Image conversion and training speeds, and the analysis results, can be found in Table 2.

CNNs are fundamentally sophisticated pattern matchers that can establish intricate mappings between input features and output representations 6. They excel at transforming various inputs into outputs, including identifying classes or bounding boxes, through a series of operations involving convolution, pooling, and activation functions 7, 56.

Even though CNNs are at the centre of many of today's revolutionary AI systems, from self-driving cars to generative AI systems such as DALL-E 2, Midjourney and Stable Diffusion, they are still neither well understood nor efficiently utilised, and their usage beyond image analysis has been limited.

While CNNs used in image analysis are historically and practically constrained to a 224 x 224 matrix or a similar fixed input size, this limitation applies to pre-trained models. When CNNs are trained from scratch, a much wider variety of input shapes can be selected, depending on the architecture. Some CNNs, such as ResNet18, are flexible in their input size because they are implemented with adaptive pooling layers 57. This provides the flexibility to choose optimal sizes for the task at hand in non-image applications, as most non-image applications will not use pre-trained CNNs.
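The following sketch illustrates the point: an untrained torchvision ResNet18, whose global pooling layer is adaptive, accepts several input sizes without modification (the 135 x 135 case matches the NIHS pseudo-images above; the single-channel first layer is our own adjustment):

```python
import torch
from torchvision.models import resnet18

# an untrained ResNet18 ends in AdaptiveAvgPool2d((1, 1)), which collapses
# whatever spatial size reaches it, so non-standard inputs just work
model = resnet18(num_classes=2)
model.conv1 = torch.nn.Conv2d(1, 64, 7, 2, 3, bias=False)  # 1-channel input
model.eval()                                # eval mode for single-image demo

for size in (64, 135, 224):                 # 135 x 135 fits the NIHS example
    x = torch.randn(1, 1, size, size)
    print(size, model(x).shape)             # torch.Size([1, 2]) every time
```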

Here we have demonstrated uses of CNNs that are outside the norm. There is a need for the analysis of complex data with many thousands of features that are not primarily images, and a lack of tools that offer minimal conversion of non-image data to image-like formats that can then easily be processed with CNNs in classification and clustering tasks. As much of this data comes from complex systems with very many features, DeepMapper offers a way of investigating such data that may not be possible with traditional approaches.

Although DeepMapper currently uses a CNN as its AI component, alternative analytic strategies, such as Vision Transformers 18 or RetNets 21, can easily be substituted for the CNN with minimal changes, and have great potential for this application. While Transformers and RetNets have input-size limitations for inference in terms of the number of tokens, Vision Transformers can handle much larger inputs by dividing images into segments that incorporate multiple pixels 18. This type of approach is applicable to both Transformers and RetNets, and to future architectures. DeepMapper can leverage these newer architectures, and others, in the future 57.

Data availability

DeepMapper is released as an open source tool on GitHub https://github.com/tansel/deepmapper . Data that is not available from GitHub because of size constraints can be requested from the authors.

References

1. Taylor, P. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025. https://www.statista.com/statistics/871513/worldwide-data-created/ (2023).
2. Ghys, É. The butterfly effect. In The Proceedings of the 12th International Congress on Mathematical Education: Intellectual and Attitudinal Challenges, pp. 19-39 (Springer, 2015).
3. Jolliffe, I. T. Mathematical and statistical properties of sample principal components. In Principal Component Analysis, pp. 29-61 (Springer). https://doi.org/10.1007/0-387-22440-8_3 (2002).
4. Landauer, R. The noise is the signal. Nature 392, 658-659. https://doi.org/10.1038/33551 (1998).
5. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press). http://www.deeplearningbook.org (2016).
6. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436-444. https://doi.org/10.1038/nature14539 (2015).
7. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84-90. https://doi.org/10.1145/3065386 (2017).
8. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
9. Goodfellow, I. et al. Generative adversarial nets. Commun. ACM 63, 139-144. https://doi.org/10.1145/3422622 (2020).
10. Vaswani, A. et al. Attention is all you need. In NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000-6010. https://doi.org/10.5555/3295222.3295349 (2017).
11. Barrio, R. et al. Deep learning for chaos detection. Chaos 33, 073146. https://doi.org/10.1063/5.0143876 (2023).
12. Levin, E. A recurrent neural network: Limitations and training. Neural Netw. 3, 641-650. https://doi.org/10.1016/0893-6080(90)90054-O (1990).
13. LeCun, Y. & Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks, pp. 255-258 (MIT Press, 1998). https://doi.org/10.5555/303568.303704
14. Wu, Y., Yang, F., Liu, Y., Zha, X. & Yuan, S. A comparison of 1-D and 2-D deep convolutional neural networks in ECG classification. arXiv preprint arXiv:1810.07088. https://doi.org/10.48550/arXiv.1810.07088 (2018).
15. Hu, J. et al. A multichannel 2D convolutional neural network model for task-evoked fMRI data classification. Comput. Intell. Neurosci. 2019, 5065214. https://doi.org/10.1155/2019/5065214 (2019).
16. Zhang, S. et al. A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res. 44, e32. https://doi.org/10.1093/nar/gkv1025 (2016).
17. Maurício, J., Domingues, I. & Bernardino, J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl. Sci. 13, 5521. https://doi.org/10.3390/app13095521 (2023).
18. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929 (2020).
19. Carion, N. et al. End-to-end object detection with transformers. In Computer Vision-ECCV 2020, pp. 213-229 (Springer). https://doi.org/10.1007/978-3-030-58452-8_13 (2020).
20. Lv, W. et al. DETRs beat YOLOs on real-time object detection. arXiv preprint arXiv:2304.08069. https://doi.org/10.48550/arXiv.2304.08069 (2023).
21. Sun, Y. et al. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621. https://doi.org/10.48550/arXiv.2307.08621 (2023).
22. Zhou, D.-X. Universality of deep convolutional neural networks. Appl. Comput. Harmonic Anal. 48, 787-794. https://doi.org/10.1016/j.acha.2019.06.004 (2020).
23. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64, 107-115. https://doi.org/10.1145/3446776 (2021).
24. Ma, W., Papadakis, M., Tsakmalis, A., Cordy, M. & Traon, Y. L. Test selection for deep learning systems. ACM Trans. Softw. Eng. Methodol. 30, 13. https://doi.org/10.1145/3417330 (2021).
25. Liu, Z., Michaud, E. J. & Tegmark, M. Omnigrok: Grokking beyond algorithmic data. arXiv preprint arXiv:2210.01117. https://doi.org/10.48550/arXiv.2210.01117 (2022).
26. Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177. https://doi.org/10.48550/arXiv.2201.02177 (2022).
27. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. https://doi.org/10.48550/arXiv.1312.6034 (2013).
28. Kim, Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. https://doi.org/10.48550/arXiv.1408.5882 (2014).
29. Abdel-Hamid, O. et al. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1533-1545. https://doi.org/10.1109/TASLP.2014.2339736 (2014).
30. Hatami, N., Gavet, Y. & Debayle, J. Classification of time-series images using deep convolutional neural networks. In Proceedings of the Tenth International Conference on Machine Vision (ICMV 2017) 10696, 106960Y. https://doi.org/10.1117/12.2309486 (2018).
31. Smith, M. A. et al. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res. 30, 1345-1353. https://doi.org/10.1101/gr.260836.120 (2020).
32. Emek Soylu, B. et al. Deep-learning-based approaches for semantic segmentation of natural scene images: A review. Electronics 12, 2730. https://doi.org/10.3390/electronics12122730 (2023).
33. Hosseini, H., Xiao, B., Jaiswal, M. & Poovendran, R. On the limitation of convolutional neural networks in recognizing negative images. In 16th IEEE International Conference on Machine Learning and Applications, pp. 352-358. https://ieeexplore.ieee.org/document/8260656 (2017).
34. Montserrat, D. M., Lin, Q., Allebach, J. & Delp, E. J. Training object detection and recognition CNN models using data augmentation. Electron. Imaging 2017, 27-36. https://doi.org/10.2352/ISSN.2470-1173.2017.10.IMAWM-163 (2017).
35. Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V. & Le, Q. V. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501. https://doi.org/10.48550/arXiv.1805.09501 (2018).
36. Hataya, R., Zdenek, J., Yoshizoe, K. & Nakayama, H. Faster AutoAugment: Learning augmentation strategies using backpropagation. In Computer Vision-ECCV 2020: 16th European Conference, Proceedings, Part XXV, pp. 1-16 (Springer). https://doi.org/10.1007/978-3-030-58595-2_1 (2020).
37. Xiao, K., Engstrom, L., Ilyas, A. & Madry, A. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994. https://doi.org/10.48550/arXiv.2006.09994 (2020).
38. Kovalerchuk, B., Kalla, D. C. & Agarwal, B. Deep learning image recognition for non-images. In Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery (eds Kovalerchuk, B. et al.), pp. 63-100 (Springer). https://doi.org/10.1007/978-3-030-93119-3_3 (2022).
39. Samek, W., Binder, A., Montavon, G., Lapuschkin, S. & Müller, K.-R. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 28, 2660-2673. https://doi.org/10.1109/tnnls.2016.2599820 (2017).
40. Montavon, G., Samek, W. & Müller, K.-R. Methods for interpreting and understanding deep neural networks. Digital Signal Process. 73, 1-15. https://doi.org/10.1016/j.dsp.2017.10.011 (2018).
41. De Cesarei, A., Cavicchi, S., Cristadoro, G. & Lippi, M. Do humans and deep convolutional neural networks use visual information similarly for the categorization of natural scenes? Cognit. Sci. 45, e13009. https://doi.org/10.1111/cogs.13009 (2021).
42. Kindermans, P.-J. et al. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science 11700, pp. 267-280 (Springer). https://doi.org/10.1007/978-3-030-28954-6_14 (2019).
43. Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision-ECCV 2014 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.), pp. 818-833 (Springer). https://doi.org/10.1007/978-3-319-10590-1_53 (2014).
44. Springenberg, J. T., Dosovitskiy, A., Brox, T. & Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806. https://doi.org/10.48550/arXiv.1412.6806 (2014).
45. Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R. & Samek, W. Layer-wise relevance propagation for neural networks with local renormalization layers. In Artificial Neural Networks and Machine Learning-ICANN 2016: Proceedings of the 25th International Conference on Artificial Neural Networks, pp. 63-71 (Springer). https://doi.org/10.1007/978-3-319-44781-0_8 (2016).
46. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 618-626. https://ieeexplore.ieee.org/document/8237336 (2017).
47. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning 70, 3319-3328. https://doi.org/10.5555/3305890.3306024 (2017).
48. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning 70, 3145-3153. https://doi.org/10.5555/3305890.3306006 (2017).
49. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768-4777. https://doi.org/10.5555/3295222.3295230 (2017).
50. Ersavas, T. DeepMapper. https://github.com/tansel/deepmapper (2023).
51. Kokhlikyan, N. et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896. https://doi.org/10.48550/arXiv.2009.07896 (2020).
52. Guyon, I., Gunn, S., Ben-Hur, A. & Dror, G. Gisette. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/170/gisette (2008).
53. PyTorch. torch.rand. https://pytorch.org/docs/stable/generated/torch.rand.html (2023).
54. Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A. & Tsunoda, T. DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Sci. Rep. 9, 11399. https://doi.org/10.1038/s41598-019-47765-6 (2019).
55. Sharma, A., Lysenko, A., Boroevich, K. A., Vans, E. & Tsunoda, T. DeepFeature: Feature selection in nonimage data using convolutional neural network. Brief. Bioinform. 22, bbab297. https://doi.org/10.1093/bib/bbab297 (2021).
56. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://doi.org/10.48550/arXiv.1409.1556 (2014).
57. PyTorch. AdaptiveAvgPool2d. https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html (2023).


Acknowledgements

We thank Murat Karaorman, Mitchell Cummins, and Fatemeh Vafaee for helpful advice and comments on the manuscript. This research is supported by an Australian Government Research Training Program Scholarships RSAI8000 and RSAP1000 to T.E., a Fonds de Recherche du Quebec Santé Junior 1 Award 284217 to M.A.S., and UNSW SHARP Grant RG193211 to J.S.M.

Author information

Authors and affiliations

School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, NSW, 2052, Australia

Tansel Ersavas, Martin A. Smith & John S. Mattick

Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, H3C 3J7, Canada

Martin A. Smith

CHU Sainte-Justine Research Centre, Montreal, Canada

UNSW RNA Institute, UNSW Sydney, Australia


Contributions

T.E. developed the methods, implemented DeepMapper and produced the first draft of the paper. J.S.M. provided advice, structured the paper, and edited it for improved readability and clarity. M.A.S. provided advice and edited the paper.

Corresponding authors

Correspondence to Tansel Ersavas or John S. Mattick .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Ersavas, T., Smith, M.A. & Mattick, J.S. Novel applications of Convolutional Neural Networks in the age of Transformers. Sci. Rep. 14, 10000 (2024). https://doi.org/10.1038/s41598-024-60709-z


Received: 16 January 2024

Accepted: 26 April 2024

Published: 01 May 2024

DOI: https://doi.org/10.1038/s41598-024-60709-z




Explaining Deep Neural Networks

Abstract: Deep neural networks are becoming more and more popular due to their revolutionary success in diverse areas, such as computer vision, natural language processing, and speech recognition. However, the decision-making processes of these models are generally not interpretable to users. In various domains, such as healthcare, finance, or law, it is critical to know the reasons behind a decision made by an artificial intelligence system. Therefore, several directions for explaining neural models have recently been explored. In this thesis, I investigate two major directions for explaining deep neural networks. The first direction consists of feature-based post-hoc explanatory methods, that is, methods that aim to explain an already trained and fixed model (post-hoc), and that provide explanations in terms of input features, such as tokens for text and superpixels for images (feature-based). The second direction consists of self-explanatory neural models that generate natural language explanations, that is, models that have a built-in module that generates explanations for the predictions of the model.


PhD thesis in Applied Physics

Convolution function

A major revolution in the neural network research field came with the introduction of convolution functions. Convolutional Neural Networks (CNNs) are particularly designed for image analysis. Convolution is the mathematical integration of two functions, in which the second one is translated by a given value:

$$ (f * g)(t) = \int_{-\infty}^{+\infty} f(\tau) \, g(t - \tau) \, d\tau $$

In signal processing this operation is also called cross-correlation, and it is equivalent to the correlation function computed at a given point. In image processing the first function is represented by the image I and the second by a kernel k (or filter) which shifts along the image. In this case we have a 2D discrete version of the formula:

$$ C[i, j] = \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} I[i + u, \, j + v] \cdot k[u, v] $$

where C[i, j] is the pixel value of the resulting image and N, M are the kernel dimensions.

The use of CNNs in modern image analysis applications can be traced back to multiple causes. First of all, image dimensions are increasingly large, and thus the number of variables/features, i.e. pixels, is often too big to manage with a standard DNN 1. Moreover, if we consider detection problems, i.e. the problem of detecting a set of features (or an object) inside a larger pattern, we want a system able to recognize the object regardless of where it appears in the input. In other words, we want our model to be independent of simple translations.

Both of the above problems can be overcome by CNN models using a small kernel, i.e. a weight mask, which is slid across the full input. A CNN is able to successfully capture the spatial and temporal dependencies in a signal through the application of relevant filters.

The main parameters of this function are thus the input dimensions and the filter/kernel dimensions, i.e. the number of weights which we have to tune during training. This is the basic idea behind the convolution function, but in many cases (especially in modern deep learning neural networks) we can refine it by playing with the possible movements of the filter mask. In particular, besides the kernel mask size, we can also force the filter to jump along the image, i.e. a discontinuous movement of the filter that excludes some pixels. This parameter, called stride, defines the number of pixels to jump, and it is often used to further reduce the output dimensions: for an input of width W processed with a kernel of size k and stride s, the output width is ⌊(W − k)/s⌋ + 1.

Given this theoretical background, we can implement the convolution function in many different ways using different mathematical approaches: a study of computational efficiency will tell us which approach is best. The first (naive) approach is a brute-force technique implementing the direct evaluation of the convolution function as described above. This version is certainly the easiest to implement, but its computational performance is so poor that, for the sake of brevity, we excluded it from our tests 2.
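For reference, the direct evaluation of the 2D discrete formula above can be sketched as follows (a single-channel Python version, for illustration only; the thesis code is in C++ in the Byron library):

```python
import numpy as np

def direct_convolution(image, kernel, stride=1):
    """Naive (brute-force) 2D convolution from the formula above:
    C[i, j] = sum_{u, v} I[i*s + u, j*s + v] * k[u, v].
    Single channel, 'valid' borders; O(W*H*k^2) operations."""
    H, W = image.shape
    N, M = kernel.shape
    out_h = (H - N) // stride + 1
    out_w = (W - M) // stride + 1
    C = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + N, j*stride:j*stride + M]
            C[i, j] = np.sum(patch * kernel)
    return C

# usage: a 5 x 5 image with a 3 x 3 kernel gives a 3 x 3 output
print(direct_convolution(np.random.rand(5, 5), np.ones((3, 3))).shape)
```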

Figure 1: scheme of the `im2col` algorithm using a `2 x 2` filter on an image with 3 channels. At the end of the `im2col` algorithm the `GEMM` is performed between the weights and the input image.

Taking into account what we have learned from DNN models, we can re-formulate our problem using an efficient manipulation of the involved matrices to exploit the GEMM algorithm. A direct convolution on an image of size (W x H x C) using a kernel mask of dimensions (k x k) requires O(WHCk^2) operations and thus many small matrix products. We can re-arrange the involved data so that the whole computation reduces to a single matrix product: this re-arrangement is called the im2col (or im2row) algorithm. The algorithm is a simple transformation which flattens the original input into a bigger matrix, where each column carries all the elements which have to be multiplied by the filter mask in a single step 3. In this way we can immediately apply our GEMM algorithm to the full image. In Fig. 1 the main scheme of this algorithm is reported. This kind of algorithm certainly optimizes the computational efficiency of the GEMM product, but in return we have to spend a lot of memory on the input re-organization.
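A minimal single-channel sketch of the im2col idea (illustrative Python, not the Byron implementation):

```python
import numpy as np

def im2col(image, k, stride=1):
    """Every k x k patch becomes one column, so the whole convolution
    collapses into a single GEMM between the flattened kernel and this
    matrix (cf. Fig. 1)."""
    H, W = image.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + k, j*stride:j*stride + k]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

image = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)
# one GEMM replaces the nested loops of the direct method
out = kernel.ravel() @ im2col(image, 3)   # row of the flattened weights
print(out.reshape(3, 3))                  # same result as the direct version
```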

Using the mathematical theory behind the problem, a third idea arises from the well-known Convolution Theorem: the Fourier transforms of our two functions (in this case the input image and the weight kernel) turn the convolution into a simple element-wise product in frequency space. This is certainly the most "physical" approach to the problem, and probably the easiest one, since the Fourier transform is a well-known, optimized algorithm and many efficient implementations are already available in the literature. One of the most efficient is provided by the FFTW (Fastest Fourier Transform in the West) library [FFTW05]: FFTW3 is an open-source C subroutine library for computing the discrete Fourier transform (DFT) in multiple dimensions, without constraints on input sizes or data types. The library is not only accurate in its computations but also provides an efficient parallel version for multi-threading applications.
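A sketch of the FFT-based approach using numpy's FFT in place of FFTW3 (illustrative only):

```python
import numpy as np

def fft_convolution(image, kernel):
    """Convolution theorem: pad both operands to a common shape, multiply
    their DFTs element-wise, and transform back. Note this is true
    convolution; CNN frameworks compute cross-correlation, which is the
    same operation with a flipped kernel."""
    H, W = image.shape
    N, M = kernel.shape
    shape = (H + N - 1, W + M - 1)          # full linear convolution size
    prod = np.fft.rfft2(image, shape) * np.fft.rfft2(kernel, shape)
    full = np.fft.irfft2(prod, shape)
    return full[N - 1:H, M - 1:W]           # crop to the 'valid' region

image = np.random.rand(8, 8)                # powers of 2 suit the FFT best
kernel = np.random.rand(3, 3)
print(fft_convolution(image, kernel).shape) # (6, 6), as with im2col
```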

A further kind of implementation stems from linear algebra considerations (very close to numerical ones) and is called the Coppersmith-Winograd algorithm. This algorithm was designed to optimize the matrix product and, in particular, to reduce the computational cost of its operations. Suppose we have an input image given by just 4 elements and a filter mask of size 3:

$$ d = \begin{bmatrix} d_0 & d_1 & d_2 & d_3 \end{bmatrix}, \qquad g = \begin{bmatrix} g_0 & g_1 & g_2 \end{bmatrix} $$

We can now use the im2col algorithm described above and reshape our input image and weights into

$$ F(2, 3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} $$

Given these data, we can simply compute the output as the matrix product of these two matrices. The Winograd algorithm rewrites this computation as follows 5:

$$ F(2, 3) = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix}, \qquad \begin{aligned} m_1 &= (d_0 - d_2)\, g_0 \\ m_2 &= (d_1 + d_2)\, \frac{g_0 + g_1 + g_2}{2} \\ m_3 &= (d_2 - d_1)\, \frac{g_0 - g_1 + g_2}{2} \\ m_4 &= (d_1 - d_3)\, g_2 \end{aligned} $$

which requires only 4 multiplications instead of the 6 of the direct product, trading them for cheaper additions 4 (a numerical check of this identity is sketched below).
We tested the computational time efficiency of each algorithm on different random images. The tests were performed on a classical bioinformatics server (128 GB of RAM and 2 E5-2620 CPUs, with 8 cores each), and we considered only kernel sizes equal to 3 (the Winograd constraint), varying the input dimensions and the number of filters. In Fig. 1 we show the results of our simulations, using the im2col values as reference 6.

In all our simulations we found a visible speedup using the Winograd algorithm over the other two: for small dimensions we obtain more than 5x against im2col and 25x against the fftw implementation. The worst algorithm is certainly the fftw one which, despite the efficient FFTW3 parallel library, is always more than 5 times slower than the reference. However, it is interesting to notice that the fftw implementation reaches its best performances when the dimensions are proportional to powers of 2, as expected from the mathematical theory behind the Discrete Fourier Transform.

We can conclude that the Winograd algorithm is certainly the best choice when we have to perform a 2D convolution. The payback of this method is its rigid constraints on mask sizes and strides: when applicable it remains the best solution, and in all other cases the im2col implementation is a relatively good alternative. The efficiency of the Byron library follows that of the Winograd algorithm, since most layers in modern deep learning models are convolutional layers with kernel size 3 and unitary stride.

[1] If we consider a simple 224 x 224 image with 3 color channels, we obtain a set of 224 · 224 · 3 = 150,528 features. A classical fully connected DNN layer with 1024 nodes on this input would need 150,528 × 1024 ≈ 154 million weights to tune.  ↩

[2] Compared to the other implementations, the direct (brute force) convolution algorithm exceeds their computational time by orders of magnitude. For this reason it is not taken into account in our tests. A possible C++ implementation is nevertheless provided in the Byron library.  ↩

[3] We work under the assumption that the weights matrix is already a flattened array, so that each row of the weights matrix represents a full mask.  ↩

[4] A multiplication takes 7 clock cycles on a normal CPU, while an add takes only 3 clock cycles.  ↩

[5] We would also highlight that this formulation is valid only if we consider unitary strides.  ↩

[6] The im2col algorithm can be found in most Neural Network libraries and it is also the only convolution function implemented in the darknet library, which is a sort of reference for our work.  ↩

Loughborough University

Object detection in drone imagery using convolutional neural networks

Drones, also known as Unmanned Aerial Vehicles (UAVs), are lightweight aircraft that can fly without a pilot on board. Equipped with high-resolution cameras and ample data storage capacity, they can capture visual information for subsequent processing to gather vital insights. Drone imagery provides a unique viewpoint that humans cannot access by other means, and the captured images can be valuable for both manual processing and automated image analysis. However, detecting and recognising objects in drone imagery using computer vision-based methods is difficult because the object appearances differ from those typically used to train object detection and recognition systems. Additionally, drones are often flown at high altitudes, which makes the captured objects appear small. Furthermore, various adverse imaging conditions may occur during flight, such as noise, illumination changes, motion blur, object occlusion, background clutter, and camera calibration issues, depending on the drone hardware used, interference in flight paths, changing environmental conditions, and regional climate conditions. These factors make the automated computer-based analysis of drone footage challenging.

In the past, conventional machine-based object detection methods were widely used to identify objects in images captured by cameras of all types. These methods used feature extractors to extract an object's features and an image classifier to learn and classify them, enabling the learning system to infer objects based on features extracted from an unknown object. However, the feature extractors used in traditional object detection methods were based on handcrafted features decided by humans (i.e. feature engineering was required), making it challenging to achieve robust feature representations and affecting classification accuracy. Addressing this challenge, Deep Neural Network (DNN) based learning provides an alternative approach to detecting objects in images. Convolutional Neural Networks (CNNs) are a type of DNN that can extract millions of high-level object features and can be effectively trained for object detection and classification. The aim of the research presented in this thesis is to optimally design, develop and extensively investigate the performance of CNN-based object detection and recognition models that can be used efficiently on drone imagery.

One significant achievement of this work is the successful utilization of state-of-the-art CNNs, such as SSD, Faster R-CNN and YOLO (versions 5s, 5m, 5l, 5x, 7), to generate innovative DNN-based models. We show that these models are highly effective in detecting and recognising Ghaf trees and multiple tree types (i.e., Ghaf, Acacia and Date Palm trees), and in detecting litter. Mean Average Precision (mAP@0.5) values ranging from 70% to 92% were obtained, depending on the application and the CNN architecture utilised.

The thesis places a strong emphasis on developing systems that can effectively perform under practical constraints and variations in images. As a result, several robust computer vision applications have been developed through this research, which are currently being used by the collaborators and stakeholders.



Convex neural networks


Stanford University


PhD thesis 'Focus of attention: a sensory-motor task for energy reduction in spiking neural networks'

Dear colleagues,

Applications are welcome for a fully funded PhD position, 'Focus of attention: a sensory-motor task for energy reduction in spiking neural networks'. The position will be located at the EDGE Team @ LEAT Laboratory within Université Côte d'Azur and/or at the INT in Marseille, France.

This project takes place in the context of the EMERGENCES project (ANR PEPR IA 2023-2027), which aims to advance the state of the art in machine learning models using inspiration from biology. Indeed, inspiration from brain features promises the emergence of unrivalled processing efficiency. Among the most promising features studied in the literature of bio-inspired AI are temporal data encoding using spikes, multimodal association, local learning and attention-based processing.

This PhD subject focuses on the association between attention and spiking neural networks for defining new efficient AI models for embedded systems such as drones, robots and more generally autonomous systems.

The thesis will take place between the LEAT research lab in Sophia-Antipolis and the INT institute in Marseille, which both develop complementary approaches to bio-inspired AI, from neuroscience observation to embedded systems design.

The volume as well as the diversity of visual information that reaches our eyes at every moment is huge and cannot be fully integrated by the visual system. In other words, the biological system is confronted with the same challenge as the one encountered by artificial systems (especially at the edge) when dealing with the huge amounts of information coming continuously from the real world. Interestingly, the brain has found an original approach to this issue by focusing on a sub-part of the visual information at a time. Indeed, the study of the visual cortex in neuroscience has made it possible to highlight subregions that each treat one or all of the multiple properties of information coming from the visual pathways: shapes, colors, movements, etc. [1], thus revealing the interaction of attentional processes and the concept of "saliency" used in cognitive science.

Creating a fully autonomous system remains a significant challenge, especially when operating in the dynamic real world. In recent times, machine learning has assumed a prominent role in machine vision, particularly through the implementation of deep learning algorithms. These algorithms have yielded impressive outcomes in tasks such as object detection, recognition, and tracking. However, these systems come with a high computational cost, as they must process entire camera images to generate these results. Additionally, they struggle to dynamically adapt to changes in their environment.

Our focus lies on two integrated bio-inspired approaches that leverage attentional mechanisms. The first approach, known as bottom-up, draws inspiration from Gestalt theory, the Feature Integration Theory (Treisman & Gelade) [3], and the model of visual attention of Itti & Koch [1]. This approach relies on the saliency of low-level features in the visual field, processed in parallel, including movement, color, and edges. It employs emergent mechanisms to integrate features guided by their saliency in order to detect the consistency of objects, encompassing their form, position, and speed. As shown by Gestalt theory, only the most salient data are needed in this mechanism. Thus, we can dramatically reduce the amount of data needed by extracting only the most salient regions of interest during the bottom-up phase.
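As a toy illustration of this bottom-up path, here is a crude saliency sketch only loosely inspired by Itti & Koch (the feature choices and Gaussian scales are arbitrary, not those of the original model):

    import numpy as np
    from scipy import ndimage

    def bottom_up_saliency(rgb):
        # Intensity, colour-opponency and edge features, each passed through
        # a centre-surround (difference-of-Gaussians) filter, normalised and
        # summed into a single saliency map.
        rgb = rgb.astype(float)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        intensity = rgb.mean(axis=-1)
        features = [intensity, r - g, b - (r + g) / 2,
                    ndimage.sobel(intensity, 0), ndimage.sobel(intensity, 1)]
        sal = np.zeros_like(intensity)
        for f in features:
            cs = np.abs(ndimage.gaussian_filter(f, 1) - ndimage.gaussian_filter(f, 8))
            sal += cs / (cs.max() + 1e-9)
        return sal / len(features)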

The second approach, known as top-down, considers that visual attention is guided by higher-level cognitive stages. For instance, in the Guided Search theory [4], Wolfe emphasizes the role of prior knowledge, expectations, and intentions. In this work, Wolfe proposes a guided search mechanism that relies on a "Priority map that represents the system's best guess as to where to deploy attention next". This priority map is built on multiple sources of information, such as the visual system as well as higher-level information like intention, search history and the actual visual semantics. In this way, higher-level information is used to guide the filtering of the bottom-up path, so that only the information required for a given task is selected and processed. Similar systems are proposed by Schöner [5], in which saliency maps, working memories and "priority map"-guided visual search mechanisms are implemented through Neural Field Theory (NFT). Here, Dynamic Neural Fields are used to implement the saliency of feature maps, as well as scene spatial selection mechanisms, working memory, etc.
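In its simplest reading, such a priority map is just a weighted combination of the two paths; a minimal sketch follows (the weights w_bu and w_td and the relevance map are our own illustrative names, not part of the cited models):

    import numpy as np

    def priority_map(saliency, relevance, w_bu=0.5, w_td=0.5):
        # Guided-Search-style priority map: bottom-up saliency blended with a
        # top-down task-relevance map; attention is deployed at the maximum.
        p = w_bu * saliency + w_td * relevance
        return p, np.unravel_index(np.argmax(p), p.shape)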

In a previous work from the LEAT [6], we proposed a brain-inspired attentional process implementing bottom-up and top-down paths based on dynamic neural field properties embodied in a sensory-motor loop. In a complementary work, the INT group developed a dual-pathway model of the visual system in which saliency emerges as a property of the perceptual system that performs saccades, that is, rapid shifts of the fixation point [7]. This uses a recognition model which takes a retinotopically transformed input and shows the emergence of saliency maps [8]. In the dual-pathway model, the exploration of a visual scene is based on both the saliency of the color feature (bottom-up) and the class of the last selected object recognized by a convolutional neural network (top-down). Both paths are integrated by a dynamic neural field that selects the next visual information to be explored or conserved by setting motor orders accordingly.
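For readers unfamiliar with dynamic neural fields, the selection behaviour these models rely on can be shown with a small 1D Amari-style simulation (all parameters chosen arbitrarily for illustration):

    import numpy as np

    def simulate_dnf(s, steps=200, dt=0.05, tau=1.0, h=-2.0):
        # 1D Amari field with a Mexican-hat kernel: local excitation and
        # broader inhibition make activity bumps compete over the input.
        n = s.size
        x = np.arange(n)
        d = np.abs(x[:, None] - x[None, :])
        w = 2.0 * np.exp(-d**2 / 18.0) - 1.0 * np.exp(-d**2 / 200.0)
        u = np.zeros(n)
        for _ in range(steps):
            f = 1.0 / (1.0 + np.exp(-u))         # sigmoid firing rate
            u += dt / tau * (-u + h + s + w @ f)
        return u

    s = np.zeros(100)
    s[30], s[70] = 4.0, 3.0                      # two competing stimuli
    print(np.argmax(simulate_dnf(s)))            # 30: the stronger input wins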

The main goal of the thesis is to propose a new vision of the integration of attention into machine learning models. The proposed model will draw on the dynamics at play in a sensory-motor approach to perception and will thus reconsider the classical perception tasks in order to better fit with the continuous flow of information coming from the environment.

The PhD will be co-supervised between INT in Marseille and LEAT in Nice. According to the preferences of the candidate, a main laboratory of affiliation will be selected. Weekly meetings will be organized remotely and visiting weeks will be planned throughout the year to work in person in the other lab.

Work plan

Year 1

• Study the state of the art in both neuroscience and machine learning on the use of attentional properties to make AI models more effective in environmental perception tasks.

• Write a synthesis report on this study.

• Develop a first neural model integrating attention-based selection in a specific perception task such as visual search.

• Define the specific metrics (KPIs) dedicated to evaluating the performance and efficiency of such a bio-inspired AI model.

• Submit a first publication on this preliminary study to an international conference.

Year 2

• Analyse the performance of the preliminary attention-based model.

• Extend the approach to integrate, step by step, the features related to dual-pathway perception, attention, foveation and DNFs, and make the model compatible with convolutional neural networks.

• Submit a second publication to an international journal.

Year 3

• Study the adaptation of the model to spiking neural networks.

• Evaluate and compare the different approaches.

• Submit publications on the final results of the thesis.

• Write the thesis report and prepare the defense.

Required skills

Master's thesis in one of the following domains: neuromorphic systems, spiking neural networks, neurocognition, machine learning.

Background and experience in machine learning, artificial neural networks, and/or neuroscience.

Strong motivation, teamwork skills, and fluency in English (spoken and written).

Programming skills in Python with Keras, PyTorch or equivalent.

Start: year 2024

Duration: 3 years

Location: Sophia-Antipolis and/or Marseille

Benoît Miramond is Full Professor in Electrical Engineering at the LEAT laboratory of Université Côte d'Azur (UCA). He holds the chair in bio-inspired AI at the 3IA Côte d'Azur Institute and leads the eBRAIN research group, which develops an interdisciplinary research activity on embedded Bio-inspiRed AI and Neuromorphic architectures, especially based on SNNs. LEAT is a joint research unit (UMR 7248) of UCA and CNRS.

Laurent Perrinet is a director of research at Institut des Neurosciences de la Timone (CNRS - Aix-Marseille Université). He is studying the link between brain microstructures and their macroscopic function by implementing realistic models of the primary visual cortex using spiking neural networks.

Laurent Rodriguez is associate professor at the LEAT laboratory in the eBRAIN group. He is interested in dynamic neural networks and develops neural models from biological inspiration.

Application

Apply by sending an email directly to the supervisors ([email protected], [email protected], [email protected]). The application will include:

• Letter of recommendation of the master supervisor.

• Curriculum vitæ.

• Motivation Letter.

[1] L. Itti and C. Koch, "Computational modelling of visual attention", Nature Reviews Neuroscience, vol. 2, no. 3, March 2001. doi: 10.1038/35058500

[2] W. Gerstner, W. M. Kistler, R. Naud and L. Paninski (2014). Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press.

[3] A. M. Treisman and G. Gelade, "A feature-integration theory of attention", Cognitive Psychology, vol. 12, no. 1, pp. 97-136, Jan. 1980. doi: 10.1016/0010-0285(80)90005-5

[4] J. M. Wolfe, "Guided Search 6.0: An updated model of visual search", Psychonomic Bulletin & Review 28, 1060-1092 (2021). https://doi.org/10.3758/s13423-020-01859-9

[5] R. Grieben and G. Schöner, "A neural dynamic process model of combined bottom-up and top-down guidance in triple conjunction visual search", in T. Fitch, C. Lamm, H. Leder and K. Teßmar-Raible (Eds.), Proceedings of the 43rd Annual Conference of the Cognitive Science Society.

[6] M. Rasamuel, L. Khacef, L. Rodriguez and B. Miramond, "Specialized visual sensor coupled to a dynamic neural field for embedded attentional process", IEEE conference publication. https://ieeexplore.ieee.org/abstract/document/8705979

[7] E. Daucé, P. Albigès and L. U. Perrinet (2020). "A dual foveal-peripheral visual processing model implements efficient saccade selection", Journal of Vision. doi: https://doi.org/10.1167/jov.20.8.22

[8] J.-N. Jérémie, E. Daucé and L. U. Perrinet (2024). "Retinotopic Mapping Enhances the Robustness of Convolutional Neural Networks", arXiv preprint. https://arxiv.org/abs/2402.15480



COMMENTS

  1. PDF UvA-DARE (Digital Academic Repository)

    this thesis we explore ways to leverage symmetries to improve the ability of convolutional neural networks to generalize from relatively small samples. We argue and show empirically that in the context of deep learning it is better to learn equivariant rather than invariant representations, because invari-

  2. PDF IMAGE CLASSIFICATION USING CONVOLUTIONAL NEURAL NETWORKS

    Oulu University of Applied Sciences Information Technology, Internet Services. Author: Hung Dao Title of the bachelor's thesis: Image Classification Using Convolutional Neural Networks Supervisor: Jukka Jauhiainen Term and year of completion: Spring 2020 Number of pages: 31. The objective of this thesis was to study the application of deep ...

  3. PDF Deep Learning: An Overview of Convolutional Neural Network(CNN)

    Next, Artificial Neural Networks (ANN), which works as a stepping stone to deep learning, types of ANN methods, and their limitations are explored. Chapter three focuses on deep learning and four of its main architectures including unsupervised pretrained networks, recurrent neural network, recursive neural network, and convolutional neural ...

  4. PDF by Ilya Sutskever

    The publications below describe work that is loosely related to this thesis but not described in the thesis: ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. In Advances in Neural Information Pro-cessing Systems 26, (NIPS*26), 2012. (Krizhevsky et al., 2012)

  5. PDF Medical Image Classification using Deep Learning Techniques and

    agnosis systems. More specifically, the thesis provides the following three main contributions. First, it introduces a novel entropy-based elastic ensemble of Deep Convolutional Neural Networks (DCNNs) architecture termed as 3E-Net for classifying grades of invasive breast carcinoma microscopic images. 3E-Net is based on a

  6. Design Space Exploration of Convolutional Neural Networks for Image

    DESIGN SPACE EXPLORATION OF CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE CLASSIFICATION A Thesis Submitted to the Faculty of Purdue University by Prasham Shah In Partial Fulfillment of the ... thesis advisor, Dr. Mohamed El-Sharkawy, whose wisdom and guidance has been indispensable throughout my journey. I am grateful to have had Dr. Brian King,

  7. PDF Examining the Structure of Convolutional Neural Networks

    uses a specific model called a neural network [2]. What follows in this thesis is an introduction to supervised learning, an introduction to neural networks, and my work on Convolutional Neural Networks, a specific class of neural networks. 1.2 Supervised Learning

  8. PDF Hyperparameter Optimization of Deep Convolutional Neural Networks

    Deep convolutional neural networks (CNNs) recently have shown remarkable success in a variety of areas such as computer vision [1-3] and natural language processing [4-6]. CNNs are biologically inspired by the structure of mammals' visual cortexes as presented in Hubel and Wiesel's model [7]. In 1998, LeCun et al. followed

  9. Optimising convolutional neural networks for large-scale neuroimaging

    Optimising convolutional neural networks for large-scale neuroimaging studies. ... PhD thesis, University of Oxford. Files: ndinsdale_thesis_final.pdf (Dissemination version, 47.2MB)

  10. Designing a Convolutional Neural Network for Image Recognition: A

    The aim of this thesis was to compare different convolutional neural network (CNN) architectures and training techniques for image recognition, with the goal of identifying the most effective ...

  11. PDF A review of convolutional neural networks in computer vision

    A convolutional neural network (Li et al. 2021), known for local connectivity of neurons, weight sharing, and down-sampling, is a deep feed-forward multilayered hierarchical network inspired by the receptive field mechanism in biology. As one of the deep learning models, a CNN can also achieve "end-to-end" learning.

  12. Novel applications of Convolutional Neural Networks in the age of

    Convolutional Neural Networks (CNNs) have been central to the Deep Learning revolution and played a key role in initiating the new age of Artificial Intelligence. However, in recent years newer ...

  13. PDF Analysis and Optimization of Convolutional Neural Network Architectures

    Convolutional Neural Network Architectures Master Thesis of Martin Thoma Department of Computer Science Institute for Anthropomatics and FZI Research Center for Information Technology Reviewer: Prof. Dr.-Ing. R. Dillmann Second reviewer: Prof. Dr.-Ing. J. M. Zöllner Advisor: Dipl.-Inform. Michael Weber Research Period: 03. May 2017 ...

  14. Convolutional Neural Networks: An Introduction

    Convolutional neural networks (CNNs) have made revolutionary strides in the field of computer vision. This article provides an overview of CNNs, starting with their fundamental components, including convolutional and pooling layers. This article also discusses the techniques used in their training and reviews popular variants of CNNs such as ...

  15. Andrej Karpathy Academic Website

    PhD Thesis, 2016. DenseCap: Fully Convolutional Localization Networks for Dense Captioning ... Sports-1M: a dataset of 1.1 million YouTube videos with 487 classes of Sport. This dataset allowed us to train large Convolutional Neural Networks that learn spatio-temporal features from video rather than single, static images. Andrej Karpathy ...

  16. PDF Convolutional Neural Network (CNN)

    Convolutional Neural Network (CNN) by Vinay K. Chawla May, 2021 Director of Thesis: Carol Massarra, PhD Major Department: Construction Management Assessing pavement condition is extremely essential in any effort to reduce future economic losses and improve the structural reliability and resilience. Data resulting from pavement

  17. Medical Image Segmentation using Deep Convolutional Neural Network

    This dissertation addresses these challenges and presents novel deep convolutional neural network (CNN) techniques for two different medical applications. In addressing the first application of ...

  18. PDF ROBUST CLASSIFICATION WITH CONVOLUTIONAL NEURAL NETWORKS A Thesis

    Chapter 2 of this thesis will present a literature review about the convolutional neural network. I shall present some techniques that increase the accuracy for Convolutional Neural Networks (CNNs). To test system performance, the Modified NIST or MNIST dataset demonstrated in [1] was chosen.

  19. [2010.01496] Explaining Deep Neural Networks

    Therefore, several directions for explaining neural models have recently been explored. In this thesis, I investigate two major directions for explaining deep neural networks. The first direction consists of feature-based post-hoc explanatory methods, that is, methods that aim to explain an already trained and fixed model (post-hoc), and that ...

  20. PDF Department of Information Engineering and Computer Science ...

    Convolutional Neural Network (CNN) is arguably the most utilized model by the computer vision community, which is reasonable thanks to its remarkable performance in object and scene recognition, with respect to traditional hand-crafted features. Nevertheless, it is evident that CNN naturally is availed in its two-dimensional version.

  21. Convolution function

    Convolutional Neural Networks (CNNs) are particularly designed for image analysis. Convolution is the mathematical integration of two functions in which the second one is translated by a given value: in signal processing this operation is also called cross-correlation and it is equivalent to the autocorrelation function computed at a given point.

  22. Object detection in drone imagery using convolutional neural networks

    Convolutional Neural Networks (CNNs) are a type of DNN that can extract millions of high-level features of objects that can be effectively trained for object detection and classification. The aim of research presented in this thesis is to optimally design, develop and extensively investigate the performance of CNN based object detection and ...

  23. Convex neural networks in SearchWorks catalog

    Summary. Neural networks have made tremendous advancements in a variety of machine learning tasks across different fields. Typically, neural networks have relied on heuristically optimizing a non-convex objective, raising doubts into their transparency, efficiency, and empirical performance. In this thesis, we show that a wide variety of neural ...

  24. PhD thesis 'Focus of attention: a sensory-motor task for energy

    Develop the approach in order to integrate step by step the features related to dual pathway perception, attention, foveation, DNF and make the model compatible with convolutional neural networks. Submit a second publication in an international journal. Year 3. Study the adaptation of the model to spiking neural networks