ISCB Africa ASBCB Conference on Bioinformatics 2017

SESSION 4: Bioinformatics of human genetics and population studies
Oral Presentation Abstracts


In Silico Identification of Protein-Coding and Non-Coding Regions in Next-Generation Technology Transcriptome Sequence Data: A Machine Learning Approach

Presenter:
Olaitan Awe    
University of Ibadan

Additional authors:

Angela Makolo
University of Ibadan

Segun Fatumo
Wellcome Trust Sanger Institute
Wellcome Genome Campus, Hinxton, Cambridge

With the rapid increase in the volume of sequence data and multi-species transcripts generated by next-generation sequencing technologies, processing these data efficiently and extracting biological insight from them is a growing challenge. In particular, there is no established method for discriminating between non-coding and protein-coding regions in human transcriptomes, because coding and non-coding RNAs share many sequence features. The few existing techniques mostly rely on intensive computation or multi-threading, on small and large datasets alike, to achieve small performance gains, at the risk of long tool execution times.

To solve this problem, we developed a fast, accurate and robust alignment-free predictor, based on multiple feature groups and logistic regression, for the discrimination of protein-coding regions in multi-species transcriptome sequence data. Its predictive performance is driven by the Open Reading Frame (ORF)-related and ORF-unrelated features used in the model rather than by the training datasets. The predictor thereby achieves relatively high performance and computational speed when processing small and large datasets of full-length and partial-length protein-coding and non-coding transcripts derived from transcriptome sequencing.
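
The abstract does not list the exact features or model settings; the sketch below is a minimal, hypothetical illustration of the approach it describes: logistic regression over assumed ORF-related features (longest ORF length, ORF coverage) and ORF-unrelated features (GC content, transcript length). The feature set and helper names are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def longest_orf_length(seq):
    """Length (nt) of the longest forward-strand ORF (ATG .. stop, any frame)."""
    stops = {"TAA", "TAG", "TGA"}
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in stops and start is not None:
                best = max(best, i + 3 - start)
                start = None
    return best

def transcript_features(seq):
    """ORF-related (length, coverage) and ORF-unrelated (GC, length) features."""
    orf = longest_orf_length(seq)
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    return [orf, orf / max(len(seq), 1), gc, len(seq)]

def train_coding_classifier(sequences, labels):
    """Fit logistic regression; labels are 1 = protein-coding, 0 = non-coding."""
    X = np.array([transcript_features(s) for s in sequences])
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))  # accuracy, F1, etc.
    return clf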

We describe a series of experiments on human datasets designed to test whether the predictor generally performs better than competing techniques.

Our tool identified coding and non-coding regions in the human RNA-seq dataset with 97% accuracy, 97% F1-score, 97% sensitivity and 97% specificity.

We expect this new approach to reduce the computational cost of analyzing transcripts, and hence to make genome annotation and transcriptome analysis easier.


Gene Regulatory Network Sparseness with Fuzzified Adjusted Rand Index (FARI)

Presenter:
Taiwo Adigun    
University of Ibadan

Additional authors:

Angela Makolo
Department of Computer Science, University of Ibadan

Segun Fatumo
H3Africa Bioinformatics Network (H3ABioNet) Node, National Biotechnology Development Agency (NABDA), Federal Ministry of Science and Technology (FMST), Abuja

Gene regulatory networks play an important role in every process of life, including cell differentiation, metabolism, the cell cycle and signal transduction. The sparseness of network models and the curse of dimensionality in the input data pose serious challenges to modeling them, and most existing network inference models do not account for the effect of weakly expressed genes. We propose a modified technique called the Fuzzified Adjusted Rand Index (FARI) to handle the sparseness of gene regulatory networks effectively and to drastically reduce the effect of the curse of dimensionality in the input data. Fuzzy set concepts are incorporated to calculate the sizes of the intersections of expression values across the samples of the gene objects. For each pair of gene expression profiles in the input dataset, an estimated fuzzified contingency table is generated and the Adjusted Rand Index (ARI) value of the two genes is calculated iteratively. We then generate a distance matrix containing the ARI values of all gene pairs in the dataset. For each row (gene node), the set of genes with higher values and the set of genes with lower values are fed separately into a recurrent neural network as parameters, to train the learned model and to investigate the regulatory effect of the least co-expressed genes on each gene. The result is a fast and effective regularization technique for modeling a higher-order neural network, handling large-scale biological networks and investigating the effect of weakly co-expressed genes on other genes.
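
The abstract does not specify the membership functions or the fuzzy intersection operator; the following minimal sketch assumes triangular memberships over equally spaced bins and min() as the intersection — illustrative choices, not the authors' exact formulation.

import numpy as np

def fuzzy_memberships(x, n_bins=3):
    """Triangular membership of each expression value in n_bins equally
    spaced fuzzy sets spanning the gene's expression range (n_bins >= 2)."""
    centers = np.linspace(x.min(), x.max(), n_bins)
    width = (centers[1] - centers[0]) or 1.0  # guard against a flat profile
    return np.clip(1.0 - np.abs(x[:, None] - centers[None, :]) / width, 0.0, 1.0)

def fuzzified_ari(x, y, n_bins=3):
    """ARI of two expression profiles computed from a fuzzified
    contingency table, with min() as the fuzzy intersection."""
    mx, my = fuzzy_memberships(x, n_bins), fuzzy_memberships(y, n_bins)
    # n[i, j] = sum over samples of min(membership in bin i, membership in bin j)
    n = np.minimum(mx[:, :, None], my[:, None, :]).sum(axis=0)
    comb2 = lambda c: c * (c - 1) / 2.0  # generalised "choose 2" for fuzzy counts
    a, b, tot = n.sum(axis=1), n.sum(axis=0), n.sum()
    index = comb2(n).sum()
    expected = comb2(a).sum() * comb2(b).sum() / comb2(tot)
    max_index = 0.5 * (comb2(a).sum() + comb2(b).sum())
    return (index - expected) / (max_index - expected)

def ari_distance_matrix(expr):
    """Pairwise fuzzified-ARI matrix for a genes-x-samples expression array."""
    g = expr.shape[0]
    d = np.zeros((g, g))
    for i in range(g):
        for j in range(i + 1, g):
            d[i, j] = d[j, i] = fuzzified_ari(expr[i], expr[j])
    return d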


COMBAT TB, an integrated environment for TB sequence data storage, analysis and visualization

Presenter:
Peter van Heusden    
South African National Bioinformatics Institute, University of the Western Cape

Additional authors:

Thoba Lose
South African National Bioinformatics Institute

Ziphozakhe Mashologu
South African National Bioinformatics Institute

Alan Christoffels
South African National Bioinformatics Institute

Tuberculosis (TB), caused by the bacterium M. tuberculosis (Mtb), continues to be one of the leading causes of morbidity and mortality in sub-Saharan Africa. The advent of Next Generation Sequencing (NGS) has allowed low-cost sequencing of the 4-megabase Mtb genome and rapid growth of available sequence data. Translating this genomic sequence into knowledge for both public health laboratories and TB researchers requires scalable data storage and analytic capacity. Under the auspices of the COMBAT TB project, SANBI has developed an integrated environment that couples genomic data storage with automated analysis workflows. This allows sequence data to be associated with metadata that is later available to enhance analysis. The integrated environment supports both “first pass” analysis of new sequence data using pre-built workflows and “ad-hoc” analysis of data as research needs dictate, using the Galaxy and COMBAT TB Explorer environments. The COMBAT TB Explorer (CTBE) is a graph database and visualisation environment that allows in-house data to be interpreted in the context of publicly available annotation of M. tuberculosis. The CTBE is built on a Neo4j graph database, using a data model inspired by the Chado and Global Alliance for Genomics and Health (GA4GH) schemas for genomic and sequence variant annotation respectively. A task queue allows work to be shared between the CTBE and our analysis workflow environment. The complete COMBAT TB environment is available as a set of Docker containers, with all code available on GitHub.
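
As a small illustration of querying such a graph model, the snippet below uses the Neo4j Python driver; the connection details, node labels, relationship type and properties (Variant, Gene, LOCATED_IN) are hypothetical placeholders, not the actual CTBE schema.

from neo4j import GraphDatabase

# Connection details and the schema below are illustrative assumptions.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def variants_in_gene(tx, gene_name):
    # Fetch sequence variants annotated as located within a named gene.
    query = (
        "MATCH (v:Variant)-[:LOCATED_IN]->(g:Gene {name: $name}) "
        "RETURN v.position AS position, v.ref AS ref, v.alt AS alt"
    )
    return [record.data() for record in tx.run(query, name=gene_name)]

with driver.session() as session:
    for row in session.execute_read(variants_in_gene, "katG"):
        print(row)
driver.close()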


Ensemble Clustering Algorithms: An Experimental Evaluation on Gene Expression Data

Presenter:
Itunuoluwa Isewon    
Covenant University

Additional authors:

Faridah Ameh
Department of Computer & Information Sciences, Covenant University

Efosa Uwoghiren
Department of Computer & Information Sciences, Covenant University

Olufemi Aromolaran
Department of Computer & Information Sciences, Covenant University

Jelili Oyelade
Department of Computer & Information Sciences, Covenant University & Covenant University Bioinformatics Research (CUBRe)

Ensemble clustering provides more flexibility, as the user is not constrained to a single clustering algorithm but can harness the strengths of multiple clustering approaches in one clustering solution, avoiding the risk of a poor choice of algorithm. Ensemble clustering algorithms combine the results produced by different clustering techniques through a consensus function. This study evaluated the performance of different consensus functions on five differently shaped or structured gene expression datasets. Four clustering algorithms were used to generate different clustering results from these datasets: the Fuzzy C-means algorithm, the Spherical K-means algorithm, the Complete-linkage Agglomerative Hierarchical clustering algorithm and the Self-Organising Map algorithm. The base clusterings were combined in a cluster ensemble and re-clustered using the following consensus functions: the hard least squares Euclidean consensus function, the soft least squares Euclidean consensus function, the DWH (Dimitriadou, Weingessel and Hornik) consensus function, the third model of the Gordon and Vichi consensus function, and the soft median Manhattan consensus function. To compare the final clustering results obtained from these ensemble methods, the following internal cluster validity measures were used: the Silhouette Width index, the Connectivity index and the Dunn index. The performance evaluation showed that the soft median Manhattan consensus function performed best. Furthermore, the consensus clusters generated were insensitive to misclassification by the individual clustering algorithms.
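
The consensus functions evaluated in the study are available in, e.g., R's clue package; as a simple illustration of the general ensemble idea in Python, the sketch below combines repeated k-means runs through a co-association (evidence accumulation) matrix — a different consensus function from those compared above.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_consensus(X, k, n_base=10, seed=0):
    """Build a co-association matrix from n_base k-means base clusterings,
    then re-cluster it with average-linkage hierarchical clustering."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_base):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=rng.randint(1 << 30)).fit_predict(X)
        # count how often each pair of samples lands in the same cluster
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_base
    dist = 1.0 - coassoc  # co-association frequency -> consensus distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")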


Reproducible bioinformatics workflows for heterogeneous African computational environments

Presenter:
Mamana Mbiyavanga    
University of Cape Town

Additional authors:

Mustafa Alghali
University of Khartoum

Don Armstrong
University of Illinois

Shaun Aron
University of the Witwatersrand

Shakuntala Baichoo
University of Mauritius

Hocine Bendou
University of the Western Cape

Eugene de Beste
University of the Western Cape

Scott Hazelhurst
University of the Witwatersrand

Fourie Joubert
University of Pretoria

Brian O'Connor
UCSC Genomics Institute

Oussama Souiai
Institut Pasteur de Tunis

Sumir Panji
University of Cape Town

Alex Rodriguez
University of Chicago

Yassine Souilmi
Mohammed V University, Rabat

Peter van Heusden
University of the Western Cape

Yi Long
University of the Western Cape

Azza Ahmed
University of Khartoum/Future University

Ayton Meintjes
University of Cape Town

Abayomi Mosaku
Covenant University

Phelelani Mpangase
University of the Witwatersrand

Lerato Magosi
University of Oxford

Nicky Mulder
University of Cape Town

Milt Epstein
University of Illinois

Victor Jongeneel
University of Illinois

Liudmila Mainzer
University of Illinois

Jenny Zermeno
University of Illinois

Gerrit Botha
University of Cape Town

Michael Crusoe
Common Workflow Language Project Co-founder, IT

Designing a bioinformatics pipeline that can run in any environment is challenging. Many things need to be taken into consideration: the setup of the software stack, integration with the scheduler, how to control access to data, the tuning of parameters based on the resources available, and the definition of data locations for archival, processing, publishing and reference databases. H3ABioNet is responsible for supporting data analysis for H3Africa projects, so it was necessary to create standard operating procedures (SOPs) to assist in the analysis process. We identified that these SOPs should be built into bioinformatics pipelines, and also decided to containerise the software stacks involved in order to reduce some of the aforementioned challenges. In August 2016, H3ABioNet held a five-day hackathon at which pipelines were developed and containerised for human exome variant calling, GWAS, SNP imputation and 16S rRNA analysis. The hackathon was an immense success: we produced working code and built good relationships between members. After the hackathon we spent time cleaning up our code and containers. Two of the pipelines were built using Nextflow, while the other two made use of CWL. We used Docker to containerise the software stacks. All the code is available on GitHub and the containers have been deposited on Quay. Our future plan is to package our workflows using Singularity containers. Additionally, since the containerised packages make software installation easier, we plan to include these workflows as base material for some of the H3ABioNet training modules.


Enabling the processing of bioinformatic workflows where data is located through the use of cloud and container technologies

Presenter:
Eugene de Beste
University of the Western Cape, South Africa

Additional authors:

Alan Christoffels
University of the Western Cape, South Africa

Antoine Bagula
University of the Western Cape, South Africa

The use of “big data” to inform biomedical decisions poses complex problems of storage, privacy and data security. This is especially true for fields such as e-health, which deal with human health records. Organisations holding such data need to be able to assure regulators and patients of the security of their data storage and handling. In addition, when dealing with large datasets, moving the data for processing can itself pose a challenge. Many applications used to process various types of data have strict software package dependencies, imposing competing requirements on the administrators of institutional computing platforms. Software containers are a lightweight and generally better-performing, albeit less versatile, alternative to virtual machines. The advancement and growing adoption of container technologies have led to their adaptation for use in a variety of scenarios and fields. They allow researchers to replace the shipment of data with the shipment of code by packaging their software into containers: researchers can define their own toolchains and workflows for analysis, rather than being limited to what the organisation managing the dataset allows. By utilizing the growing cloud ecosystem, with platforms such as OpenStack, it is possible to provide researchers with an easy-to-use interface for executing custom workflows remotely, removing the burden of software dependency management, lowering the technical knowledge required, and reducing the need to send potentially large datasets from one location to another.
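
As a minimal sketch of this “ship code to the data” pattern, the snippet below uses the Docker SDK for Python to run a researcher-supplied container on the host where the dataset resides, mounting the data read-only; the image name, paths and command are hypothetical placeholders, not part of any system described above.

import docker

client = docker.from_env()

# Run the researcher's containerised toolchain next to the data; only
# results leave the host, not the (potentially sensitive) dataset.
logs = client.containers.run(
    image="researcher/variant-pipeline:1.0",  # hypothetical toolchain image
    command="analyse --input /data/cohort.vcf --out /results",
    volumes={
        "/secure/store/cohort.vcf": {"bind": "/data/cohort.vcf", "mode": "ro"},
        "/scratch/results": {"bind": "/results", "mode": "rw"},
    },
    remove=True,  # clean up the container after the run
)
print(logs.decode())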

