Posters

Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email, which contains a confirmation link. Click on the link and follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category A - 'Bioinformatics of Disease and Treatment'

A001 - Identification of MiRNAs as specific biomarkers in prostate cancer diagnostics: A combined in silico and molecular approach

Short Abstract: This poster is based on a Proceedings Submission. Cancer is a class of diseases, classified by the organ of origin and characterized by uncontrollable cell growth. Our focus is on prostate cancer (PC), which starts in the prostate, a walnut-sized gland located below the bladder. Approximately 4,500 men in South Africa are diagnosed annually, and PC is the second most common cause of cancer death in men.

Current diagnostic methods include digital rectal examination (DRE), the prostate specific antigen (PSA) test, biopsy and ultrasound; however, they are invasive and lack specificity and sensitivity. Therefore, the development of a non-invasive, specific and sensitive early detection method is required. Biomarkers are biological indicators, e.g. DNA, proteins and miRNA, which have recently been identified as targets for the early detection of disease.

MiRNAs are small, naturally occurring, non-coding RNA molecules directly involved in regulating gene expression at the post-transcriptional level. They offer great potential as biomarkers for cancer detection due to their remarkable stability in blood and characteristic expression in different diseases. The aim of this study is therefore to identify miRNAs as specific biomarkers for the early detection of PC.

The identification of specific miRNAs and their targets will be done using various bioinformatics techniques including programming and statistical analyses. Once identified, these miRNAs will be experimentally validated to generate expression profiles using molecular techniques. Furthermore, newly identified, experimentally validated miRNAs will be used in combination with nanotechnology to develop a diagnostic kit for the early detection of PC.

A002 - Navigating chemical and biological space – in the search of novel pharmaceuticals

Short Abstract: Typically, virtual screening of compound libraries is based on the assumption that structurally similar compounds are likely to share similar properties and bind to the same group of proteins. This model often fails due to the rugged nature of the activity landscape. Furthermore, similarity in chemical space cannot explain the activity of compounds against a specific pathway or groups of pathways. Compounds that incur similar phenotypes and yet are structurally diverse are therefore often overlooked in automated searches. Our alternative perspective on virtual screening and library design is based solely on the interactions of compounds with the proteome. Ligands may be quantitatively grouped by the biological closeness of their targets by means of their biological fingerprints. We study similarity and diversity in biological space as necessary ingredients for compounds in screening libraries. We demonstrate here how compound-target interaction networks can be steered to find novel and biologically relevant chemical matter.

A003 - MalaCards: an integrated compendium for diseases and their annotation

Short Abstract: Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, leveraging the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16,919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches, and GeneDecks set-analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we employ four schemes to populate MalaCards sections: 1) Directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests, and anatomical context; 2) Searching GeneCards for related publications, and for associated genes with corresponding relevance scores; 3) Analyzing disease-associated gene-sets in GeneDecks to yield affiliated pathways, phenotypes, compounds, and GO terms, sorted by a composite relevance score and presented with GeneCards links; 4) Searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, implying inherent properties of such networks. Work in progress includes hierarchical malady classification and variation association, striving to make MalaCards an effective tool for biomedical research.

A004 - Sounds of silence: synonymous nucleotides as a key to biological regulation and complexity.

Short Abstract: Messenger RNA accommodates numerous nucleotide signals that overlap protein coding sequences and are responsible for multiple levels of biological regulation. A wealth of structural and regulatory information, which mRNA carries in addition to the encoded amino acid sequence, raises the question of how these signals and overlapping codes are delineated along non-synonymous and synonymous (‘silent’) positions in protein coding regions. Selection pressure on the coding gene regions follows a three-nucleotide periodic pattern of nucleotide base-pairing in mRNA, which is imposed by the genetic code. Synonymous positions of coding regions are subject to RNA-level selection and are multifunctional in their regulatory and structural roles. Synonymous positions define mRNA secondary structure and stability, affect the rate of translation, and exert downstream effects on folding and post-translational modification of nascent polypeptides. Recent experimental and bioinformatics evidence suggests that there is an evolutionary trade-off between selective pressure acting at the RNA and protein levels.

A005 - Systematic Computational Drug Repositioning

Short Abstract: Systematic drug repositioning is perhaps one of the best ways for computational biology to show clear translational value in the pharmaceutical and biotech industry. Bioinformatics methods that use genome-wide association studies (GWAS), side effects and connectivity map data are proving to have value. We built a computational pipeline to examine the relationship between the disease indications of drugs and genetic findings such as GWAS traits. When the drug indication was different from the GWAS disease trait, we hypothesized that the drug could potentially be repositioned. We identified almost 100 GWAS genes with at least one associated drug that suggest potential drug repositioning opportunities. Further investigations provided additional evidence for some of these opportunities. We will also show some recent developments in connectivity map and side effect methods to rapidly reposition drugs and ultimately benefit patients.
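
The comparison at the heart of such a pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the drug, gene and trait names are hypothetical placeholders, and plain string matching stands in for whatever disease-term mapping a real pipeline would need.

```python
# Toy records; every drug, gene and trait name below is a hypothetical placeholder.
DRUGS = {
    "drugA": {"target": "GENE1", "indication": "hypertension"},
    "drugB": {"target": "GENE2", "indication": "asthma"},
    "drugC": {"target": "GENE3", "indication": "psoriasis"},
}
GWAS_TRAITS = {
    "GENE1": {"type 2 diabetes"},
    "GENE2": {"Asthma"},
}


def normalize(term):
    """Lower-case and trim a disease term so trivially different spellings match."""
    return term.strip().lower()


def repositioning_candidates(drugs, gwas_traits):
    """Return (drug, gene, trait) triples where the GWAS trait of the drug's
    target gene differs from the drug's current indication."""
    candidates = []
    for drug in sorted(drugs):
        record = drugs[drug]
        indication = normalize(record["indication"])
        # Genes without any GWAS trait contribute no candidates.
        for trait in sorted(gwas_traits.get(record["target"], ())):
            if normalize(trait) != indication:
                candidates.append((drug, record["target"], trait))
    return candidates


candidates = repositioning_candidates(DRUGS, GWAS_TRAITS)
print(candidates)
```

In this toy data only drugA is flagged, because its target gene is associated with a trait other than its current indication. A realistic version would likely compare terms through a disease ontology rather than by exact string equality.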

A006 - Computational identification of a transiently open L1/S3 pocket for reactivation of mutant p53.

Short Abstract: The tumour suppressor p53 is the most frequently mutated gene in human cancer. Reactivation of mutant p53 by small molecules is an exciting potential cancer therapy. Although several compounds restore wild-type function to mutant p53, their binding sites and mechanisms of action are elusive. Here computational methods identify a transiently open binding pocket between loop L1 and sheet S3 of the p53 core domain. Mutation of residue Cys124, located at the centre of the pocket, abolishes p53 reactivation of mutant R175H by PRIMA-1, a known reactivation compound. Ensemble-based virtual screening against this newly revealed pocket selects stictic acid as a potential p53 reactivation compound. In human osteosarcoma cells, stictic acid exhibits dose-dependent reactivation of p21 expression for mutant R175H more strongly than does PRIMA-1. These results indicate the L1/S3 pocket as a target for pharmaceutical reactivation of p53 mutants.

A007 - Prioritization of Candidate Disease Genes based on Integrative Molecular Networks

Short Abstract: To understand the foundations and mechanisms of human diseases, it is crucial to identify and characterize the involved genes. Computational prioritization methods exploit the available biomedical knowledge to rank candidate genes for further studies. Protein interactions and functional annotations are valuable sources of information. The limitations of single data sources, such as incompleteness and low-quality, have motivated integrative prioritization approaches [1]. However, most freely available prioritization methods rely on pre-defined data and are implemented as black-boxes that the user cannot influence.

Our new Cytoscape plugin NetworkPrioritizer analyzes molecular networks to prioritize candidate genes or proteins. Its versatility facilitates the integration of any data source of interest. Candidates are ranked according to their relevance for the disease or phenotype under study based on network connectivity, which is computed by centrality measures for weighted or unweighted networks. Each measure generates another ranking of candidates. Various rank aggregation algorithms are provided to merge different rankings. The application of NetworkPrioritizer to susceptibility loci for Crohn’s disease [2] resulted in a number of putative disease genes for follow-up experiments.
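
As an illustration of the general idea of merging several centrality-based rankings, the following sketch ranks nodes of a toy weighted network by two measures and combines the rankings with a Borda count. The network, the choice of measures (degree and weighted degree) and the use of Borda count are assumptions made for this example; the abstract does not state which centrality measures or aggregation algorithms NetworkPrioritizer provides.

```python
from collections import defaultdict

# Hypothetical weighted interaction network: (node, node, edge weight).
EDGES = [
    ("A", "B", 0.9),
    ("A", "C", 0.2),
    ("A", "D", 0.1),
    ("B", "C", 0.8),
    ("B", "E", 0.7),
]


def degree_and_strength(edges):
    """Compute unweighted degree and weighted degree (strength) per node."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for u, v, weight in edges:
        for node in (u, v):
            degree[node] += 1
            strength[node] += weight
    return dict(degree), dict(strength)


def ranking(scores):
    """Order nodes from highest to lowest score; ties are broken by name."""
    return sorted(scores, key=lambda node: (-scores[node], node))


def borda(rankings):
    """Merge rankings: a node at position p in a list of n nodes earns n - 1 - p points."""
    points = defaultdict(int)
    for ranked in rankings:
        n = len(ranked)
        for position, node in enumerate(ranked):
            points[node] += n - 1 - position
    return sorted(points, key=lambda node: (-points[node], node))


degree, strength = degree_and_strength(EDGES)
by_degree = ranking(degree)
by_strength = ranking(strength)
consensus = borda([by_degree, by_strength])
print(by_degree, by_strength, consensus)
```

The two measures disagree on the top candidate (A has as many neighbours as B, but B has the heavier edges), and the aggregated ranking resolves such disagreements into a single list.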

References

[1] Doncheva, N.T., Kacprowski, T., Albrecht, M. (2012) Recent approaches to the prioritization of candidate disease genes. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 4:429-442.

[2] Franke, A., McGovern, D.P.B., Barrett, J.C., Wang, K., et al. (2010) Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nature Genetics 42:1118–1125.

A008 - Caribbean Medicinal Plant Ontology

Short Abstract: This study aims to initiate the construction of an ontology that represents the concepts and relationships of indigenous knowledge of medicinal plants in the Caribbean region. This is vital as the herbal industry has been growing rapidly in the Caribbean and the popularity of herbal products is increasing in both the developing and developed world. There are numerous concerns over the safety of herbal use and there exists the potential for unforeseen negative allopathic drug-herb reactions.
The cons of herbal commercial activities should not outweigh the pros, nor the potential for new and exciting drug discovery. It is imperative that the necessary steps be taken and that active stakeholder involvement take place to effectively document and research indigenous knowledge so that this resource is not lost. An analysis of online and other database resources on plants and medicinal plants and their attributes in the Caribbean region revealed a lack of semantically meaningful query facilities, a lack of standards and data linkages, and no unique identifiers for entries. The ontology was created utilizing information gathered from domain experts and medicinal plant literature. The success of the ontology, it is suggested, will drive the concurrent development of other databases and ensure a higher standard of data, semantically meaningful querying and interoperability.

A009 - Molecular Modeling study of 4-Phenyl-1H-Imidazole and its derivatives as potential inhibitor of indoleamine 2,3-dioxygenase (IDO)

Short Abstract: Little is known about immune escape, which is a crucial feature of cancer progression.
Many human tumors express indoleamine 2,3-dioxygenase (IDO), an enzyme which mediates immune escape in several cancer types. An approach for creating new IDO inhibitors by computer-aided structure-based drug design was developed. A molecular docking approach using the Lamarckian Genetic Algorithm was carried out to elucidate the extent of specificity of IDO towards different classes of 4-phenyl-1H-imidazole. Combining a novel algorithm for rapid binding site identification and evaluation with easy-to-use property visualization tools, the software has provided an efficient means to find and better exploit the characteristics of the ligand binding site. A total of 3,000 molecules were virtually screened from different databases on the basis of structural similarity to, and substructure of, 4-phenyl-1H-imidazole.

The docking results for the 3,000 molecules demonstrated that the binding energies were in the range of -11.28 kcal/mol to -2.35 kcal/mol, with the minimum binding energy of -11.28 kcal/mol. We report molecule AP-1, which showed 4 H-bonds with active site residues and the lowest binding energy of -11.28 kcal/mol. The molecule AP-1 showed a Drug Likeness score of 0.92 with MolPSA of 42.10 Å² and MolVol of 322.23 Å³. The MolLogS was -7.41 (in log(moles/L)) and 0.01 (in mg/L), with a Drug Likeness Score of 1.26, Drug-Score of 0.34 and Solubility of -8.93. The molecule showed no indication of mutagenicity or tumorigenicity. No indication of irritating or reproductive effects was found either. Further in vitro and in vivo study of this molecule is required.

A010 - A Comparative Meta-analysis of Prognostic Gene Signatures for Late-stage Ovarian Cancer

Short Abstract: Ovarian cancer is the most lethal gynecological cancer and a leading cause of cancer deaths among women. Numerous prognostic gene signatures have been proposed, but diverse data and methods have made these difficult to compare or use in a clinically meaningful way. We sought to identify successful prognostic gene signatures from the ovarian cancer literature through systematic validation using public data. A systematic review identified 14 prognostic models for late-stage ovarian cancer. For each we evaluated its 1) reimplementation as described by the original study, 2) performance for prognosis of overall survival in independent data, and 3) signature performance compared to random gene signatures. We compared and ranked all 14 models by validation in 10 publicly available datasets comprising 1,319 high-grade, late-stage serous ovarian cancer patients. Twelve published models performed better than 97.5% of randomized risk scores; six out-performed 97.5% of random signatures of the same size trained on the same data. The four top-ranked models achieved overall validation C-Index of 0.56 to 0.60, and shared anti-correlation with immune response pathways that was absent in lower-ranked models. Most models demonstrated lower accuracy in new datasets than in validation sets presented in their publication. This analysis provides definitive support for a handful of the proposed prognostic models, but also confirms that these require improvement to be of clinical value. This work addresses outstanding controversies in the ovarian cancer literature, and provides a reproducible framework for meta-analytic evaluation of gene expression signatures.
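
For readers unfamiliar with the C-Index reported above, the following is a minimal sketch of the concordance index for right-censored survival data (the Harrell-style pairwise definition). It is an illustration only, not the evaluation code used in the study, and the patient data are invented. A value of 0.5 corresponds to random risk scores and 1.0 to a perfect ordering.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs in which the patient who died
    earlier was assigned the higher risk score.

    times       -- follow-up time per patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output, higher means worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i had an observed event
            # strictly before patient j's last follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy cohort: four patients, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
print(c_index)
```

In this toy cohort five pairs are comparable and four are ordered correctly, giving a C-Index of 0.8.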

A011 - Less is more: using targeted DNA resequencing to effectively capture customized genomic regions in translational research

Short Abstract: With the rapid development in sequencing technology and decrease in sequencing cost, the trend is rising to sequence the whole genome of tumor samples. However, disease samples often require high sequencing coverage to support any potential genome abnormality reported. The balance of data quality, clinical cohort size, sequencing coverage, and sequencing cost is critical for a successful translational study. Targeted DNA resequencing of genomic regions of interest presents a solution. Here we present a pipeline for targeted DNA resequencing to effectively capture customized genomic regions, from probe design and optimization to the NGS data analysis to identify the abnormal genetic events in tumor samples.

The probes to capture customized genomic regions are based on a whole genome survey of the intended targets ('query seeds'), e.g. transposable elements. Any genomic segment aligned to query seeds above a threshold is reported. The collection of such genomic segments is optimized for the objectives of sequence variation, distribution on chromosomes, and probe number. Paired-end sequencing is required for captured DNA. To identify reads from query seeds in targeted genomic regions, we leverage both BLAST-based sequence alignment tools and current short read mappers for NGS data, to effectively find reads that come from query seeds, either completely ('complete read') or partially ('split read'). Their paired reads as well as the unmapped portion of split reads are subsequently mapped to the reference genome to locate novel genomic rearrangement sites of query seeds in tumors.

A012 - Identification of multiple mechanisms of resistance to vemurafenib in a patient with BRAFV600E-mutated cutaneous melanoma, successfully rechallenged after initial progression

Short Abstract: We report here a comprehensive analysis of genetic alterations in melanoma tumors harvested from two separate sites in a patient who had an excellent clinical response following reintroduction of vemurafenib. A patient with BRAFV600E cutaneous melanoma was initially treated with vemurafenib with a near complete response. Following relapse and several unsuccessful chemo/immunotherapy treatments, re-challenge with vemurafenib again achieved disease control until eventual progression. To understand the underlying mechanism(s) of resistance, exome and RNA sequencing were performed on a baseline tumor and two resistant metastases, one that was present at baseline and previously responded to vemurafenib (PV1), and one that appeared de novo after reintroduction of the drug (PV2). We have identified two different NRAS activating mutations, Q61R and Q61K, at codon 61 affecting two main subpopulations in one resistant metastasis (PV1) and BRAF alternative splicing, involving exons 4-10, in the other metastasis (PV2). These alterations were tumor-specific and were not detected in the pre-treatment tumor. This work describes the existence within the same patient of different molecular vemurafenib resistance mechanisms at different metastatic sites leading to disease progression. These findings have direct implications for the clinical management of BRAF mutant melanoma.

A013 - Machine learning analysis to identify biomarkers related to breast and colon cancer cells resistant to methotrexate

Short Abstract: Breast cancer is the second most common malignancy in the world to date and invasive breast cancer is the most common carcinoma in women, accounting for 22% of all female cancers. Colorectal cancer is the third most frequent cancer in the world in both sexes and the third most frequent cause of cancer-related deaths. Methotrexate is an antimetabolite and antifolate drug used in the treatment of cancer and autoimmune diseases. Regrettably, the efficacy of this chemotherapeutic agent is often compromised by the development of resistance in cancer cells. Bioinformatics approaches were applied to public expression data to identify useful biomarkers for breast and colon cancer, which could be further validated in clinical trials. The microarray expression dataset (platform ID GPL570) from the study conducted by Selga et al. (2009), available in the Gene Expression Omnibus online database (GSE16070, GSE16080, GSE11440, GSE9412), was used. The dataset comprises 24 samples of human colon and breast cancer cells, resistant and susceptible to methotrexate. We propose a machine learning method to predict biomarkers in colon and breast cancer cells. The performance of the models was evaluated using the area under the ROC curve. We used three approaches: in the first, all data from breast and colon cancer cells were analysed as a whole; in the second, only colon cancer cells were used; and in the third, only breast cancer cells were evaluated. Our findings suggest possible candidates that could be used as biomarkers for the resistance of breast and colon cancer cells to methotrexate.

A014 - Genome-wide Association Study of Ancestry-specific Tuberculosis Risk.

Short Abstract: The world-wide burden of tuberculosis (TB) remains an enormous problem, and is particularly severe in the admixed South African Coloured population residing in the Western Cape. Despite evidence from twin studies suggesting a strong genetic component to TB resistance, only a few loci have been identified to date. We conducted a study to determine whether ancestry-specific genetic contributions affect tuberculosis risk. We additionally conducted a genome-wide association study (GWAS) and a meta-analysis incorporating previous GWAS of TB. To further characterize the susceptibility genes, we conducted a post-GWAS analysis to identify significant sub-networks underlying ethnic differences. We demonstrate significant evidence of an association (odds ratio = 1.46, P = 1.58e-05) between SAN ancestry and TB risk that is not due to confounding by socio-economic status. This indicates that the genetic contribution to TB risk varies between continental populations, and illustrates the value of including admixed populations in studies of TB risk and other complex phenotypes. We confirm the previously identified WT1 chr11 susceptibility locus (rs2057178: odds ratio = 0.62, P = 2.71e-06). We were able to refine the association signal of 6 genes, including MEGF10, PRRC1, HNRNPK, SLC8A3, SMOC1 and CTXN3 and replicate 3 other known TB-associated genes, including IL8, SLC11A1, CCL2 and IFNGR1. We identified a novel central sub-network that is mostly implicated in acute and chronic myeloid leukemia signalling pathways, and that includes the WT1 and IL8 genes. Our work provides insights into identifying disease genes and ancestry-specific disease risk, and into tuberculosis pathogenesis.

A015 - KeyNet: Dynamic Keyword Network Web Server Using PubMed

Short Abstract: One of the most effective ways to summarize the enormous number of PubMed abstracts and to study disease-protein-drug associations is network visualization enriched with temporal information. However, there has yet to be a web system that supports the study of disease-protein-drug associations using both network visualization and temporal information. We developed a dynamic keyword-network web server (KeyNet) that combines natural language processing with more than ten million PubMed abstracts. The KeyNet server is freely available at http://syslab.nchu.edu.tw/KeyNet. To obtain a comprehensive set of keywords, we used the Unified Medical Language System (UMLS) to convert the abstract text into keywords. Since UMLS provides comprehensive vocabularies, we selected 3 major vocabularies as follows: Medical Subject Headings (MeSH), National Drug File Reference Terminology (NDFRT), and UMLS Metathesaurus. To concentrate on disease, gene, and drug studies, we selected 7 semantic types as follows: (1) Pathologic Function, (2) Injury or Poisoning, (3) Anatomical Abnormality, (4) Body Part, Organ, or Organ Component, (5) Tissue, (6) Cell, and (7) Organic Chemical. We downloaded 11,552,035 abstracts from NCBI PubMed on Aug 8, 2012, and then mapped the abstracts to UMLS concepts. Since the UMLS provides hierarchical relations, we extracted parental concepts for each concept and abstract. To construct connectivity among disease-protein-drug keywords, we systematically performed enrichment analysis between any two keywords using the hypergeometric distribution. Given two keywords, we extract the abstracts associated with them, then calculate the Intersection/Union Ratio (I/U Ratio), the number of overlapping abstracts, and the hypergeometric P-value.
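
The three quantities computed for a keyword pair can be sketched as follows. This is a minimal illustration under stated assumptions, not the KeyNet implementation: the function name and the toy abstract IDs are hypothetical, and the P-value is taken to be the upper tail of the hypergeometric distribution (the probability of an overlap at least as large as the one observed), which is the usual choice for enrichment analysis. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def keyword_association(abstracts_a, abstracts_b, total_abstracts):
    """Return (I/U ratio, overlap size, hypergeometric P-value) for two keywords.

    abstracts_a, abstracts_b -- IDs of the abstracts annotated with each keyword
    total_abstracts          -- size of the whole abstract collection
    """
    a, b = set(abstracts_a), set(abstracts_b)
    overlap = len(a & b)
    union = len(a | b)
    iu_ratio = overlap / union if union else 0.0

    # Upper-tail probability P(X >= overlap) when len(b) abstracts are drawn
    # without replacement from a collection in which len(a) carry keyword A.
    max_overlap = min(len(a), len(b))
    favourable = sum(
        comb(len(a), k) * comb(total_abstracts - len(a), len(b) - k)
        for k in range(overlap, max_overlap + 1)
    )
    p_value = favourable / comb(total_abstracts, len(b))
    return iu_ratio, overlap, p_value


# Hypothetical collection of 10 abstracts, identified by integers.
iu, n_overlap, p = keyword_association({1, 2, 3, 4}, {3, 4, 5}, 10)
print(iu, n_overlap, p)
```

For the toy input the two keywords share 2 of 5 distinct abstracts (I/U ratio 0.4), and the chance of seeing at least 2 shared abstracts at random is 40/120, about 0.33, so this pair would not be reported as enriched.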

A016 - Differential Centrality: A Novel Approach to Biomarker Detection Using Gene Co-Expression in Colorectal Cancer

Short Abstract: Being the third most frequently diagnosed cancer in the world, colorectal cancer is a major cause of death in human beings. While various treatment options have been proposed and shown to have an effect on disease outcome, the comprehension of individual responses to cancer therapy remains poor. Genes involved in such responses could serve as important predictive biomarkers for existing therapies and as targets for drug discovery and development.
We present a novel approach towards the detection of differentially regulated genes after treatment by comparing time-course expression profiles in two colorectal cancer cell lines with different genotypes.
In a first step, responsive genes are extracted based on their expression behavior over time by using a measure of variation. Those candidate genes are assembled into sample-specific co-expression networks, which also integrate gene relationships from text mining and public protein-protein interaction databases. Finally, we rank genes by their differential centrality, i.e., we identify those genes whose centrality values differ the most between the networks.
We applied our method to high-throughput time-course microarray data from various cell lines treated with different substances. We find that our approach in many cases improves the recovery of known biomarkers when compared to differential expression as a baseline algorithm.
Novel findings among these results can be interpreted in their respective biological context, possibly leading to new insights on the molecular processes underlying the disease and effects of specific treatments.
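
The final ranking step can be illustrated with a short sketch. It assumes degree centrality and two small, hypothetical co-expression networks; the abstract does not state which centrality measure is used, so this choice is an assumption made only for the example.

```python
def degree_centrality(edges, nodes):
    """Normalized degree centrality: number of neighbours divided by (n - 1)."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    scale = max(len(nodes) - 1, 1)
    return {node: d / scale for node, d in degree.items()}


def differential_centrality(edges_a, edges_b):
    """Rank genes by the absolute difference of their centrality in two networks."""
    all_edges = list(edges_a) + list(edges_b)
    nodes = sorted({node for edge in all_edges for node in edge})
    central_a = degree_centrality(edges_a, nodes)
    central_b = degree_centrality(edges_b, nodes)
    difference = {n: abs(central_a[n] - central_b[n]) for n in nodes}
    # Largest difference first; ties are broken alphabetically.
    return sorted(difference.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical co-expression networks for two cell lines (gene names are placeholders).
network_line_1 = [("G1", "G2"), ("G1", "G3"), ("G1", "G4")]
network_line_2 = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]

ranked = differential_centrality(network_line_1, network_line_2)
print(ranked)
```

Here G1 is a hub in the first network but peripheral in the second, so it receives the largest differential centrality and heads the ranking, while G4 has the same degree in both networks and comes last.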

A017 - A comprehensive pipeline for RNA-seq data analysis

Short Abstract: RNA-seq measurement technology provides a high-resolution approach to quantify transcriptional activities in a population of cells. The basic idea behind RNA-seq is massively parallel sequencing, which yields more than 100 million fragments of 70-100 bp (short reads). While experimental protocols for RNA-seq have been evolving fast, the bottleneck currently is in data management and analysis. In particular, there is a need for a framework that allows straightforward integration of several bioinformatics tools for quality control, alignment, quantification, and downstream analysis.

We have recently published a computational framework, Anduril, for large-scale data analysis (Ovaska et al. Genome Medicine 2010). Here we report an expansion of the Anduril framework for RNA-seq analysis. Briefly, an Anduril workflow relies on components (software packages) that perform specific tasks of the pipeline. Implementing the pipeline on Anduril provides several advantages, such as ease of data integration, apt visualization of results and efficient use of CPU time. Only components that have changed since the last run are executed, and tasks are automatically parallelized, which grants more flexibility in workflow construction.

We demonstrate our Anduril RNA-seq pipeline by analyzing samples from eight diffuse large B-cell lymphoma patients who have relapsed or remained in remission in response to dose-dense chemoimmunotherapy. The preprocessing steps of the pipeline include assessing the reads, removing low-quality sequences and providing statistics on the alignments. We focused the analysis on identifying differentially expressed genes, isoforms (Cufflinks and DESeq) and exon usage (DEXSeq) in relapsed and remission samples, as well as finding fusion genes (TopHat-Fusion).

A018 - PIDS: a protein structure information database of disease-related SNPs and drugs

Short Abstract: Numerous genetic variations have been found to be related to human diseases. A significant portion of those also affect drug response by changing protein structure and function.
Therefore, it is crucial to understand the trilateral relationship among genomic variations, diseases and drugs.
We present PIDS, a consolidated database containing information on diseases, related genes and genetic variations, protein structures and drugs.
PIDS was built in three steps. First, we integrated various resources systematically to deduce catalogs of disease-related genes, single nucleotide polymorphisms (SNPs), protein mutations and relevant drugs.
PIDS contains 137,195 disease related gene records (13,940 distinct genes) and 16,586 genetic variation records (1,790 distinct variations).
Next, we carried out structure modeling and docking simulation for wild-type and mutant proteins to examine the structural and functional consequences of non-synonymous SNPs in the drug-related genes.
Conformational changes in 590 wild-type and 4,437 mutant proteins from drug-related genes were included in our database.
Finally, we investigated the structural and biochemical properties relevant to drug binding such as the distribution of SNPs in proximal protein pockets, thermo-chemical stability, interactions with drugs and physico-chemical properties.
The PIDS database should be a useful resource for researchers studying the mechanisms underlying associations among genetic variations, diseases and drugs.

A019 - Integrative approach of Exome-Seq and RNA-Seq for detecting fusion genes

Short Abstract: In human cancer, gene fusions have been recognized as important genomic events because they can drive the progression of cancer and may even be the cause behind it. Recently, a number of fusion genes have been discovered in different cancers using next generation sequencing (NGS) technologies. Most of these studies use RNA-Seq data, and a few use WGS-Seq data as a part of their computational approach for fusion gene discovery. In contrast, Exome-Seq has not yet seen any use. We developed a new method to identify fusion genes by integrating Exome-Seq and RNA-Seq data, and applied the algorithm to in-house lung cancer data. In this approach, we first obtained fusion gene candidates from RNA-Seq data and then filtered the results using Exome-Seq data. We were able to predict fusion boundaries in structural variations of the genome, such as chromosomal translocation, deletion and inversion, by aligning the reads of the Exome-Seq data to a pseudo-reference sequence. Our method is limited to breaks that occur within exon-exon boundaries. Even so, we suggest that our integrative approach of Exome-Seq and RNA-Seq can identify new fusion genes in cancer.

A020 - MetExtract: A software tool for Xenobiotic metabolisation studies using LC-HRMS and stable isotopic labelling

Short Abstract: MetExtract is a novel software tool for processing of LC-HRMS based metabolomics data. It is designed for the automated and non-targeted extraction of metabolisation products derived from xenobiotics or other substances in living organisms. The software makes use of stable isotopic labelling (SIL) by applying mixtures of non-labelled and corresponding 13C labelled tracers or xenobiotics to an organism of interest. Since all isotopologues of the applied substance are metabolised, every metabolisation product is present as a 12C and a partly 13C isotopologue form. MetExtract uses the distinct isotopic patterns derived from both the 12C and 13C isotopologues of the substance to find only the metabolisation products of the tracer or xenobiotic and to annotate the extracted features with the remaining number of carbon atoms from the tracer in the metabolisation product. For confirmation the intensity ratios of the first isotopologues of both forms are compared with their theoretical ratios. Further, the chromatographic peak shapes of the mono-isotopic and corresponding labelled isotopologues are compared using the Pearson correlation coefficient and different adducts and in-source fragments of the same metabolisation product are grouped.

The workflow and developed software have been applied to study the metabolisation of the mycotoxin deoxynivalenol in wheat ears. For LC-HRMS measurements an LTQ Orbitrap XL has been used. A total of 57 features have been extracted and subsequently grouped into 9 unknown and known metabolisation products and the unprocessed deoxynivalenol. The largest group showed as many as 30 different features for one metabolisation product.
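The pairing of native (12C) and labelled (13C) signals described above can be illustrated with a minimal sketch. It is not the MetExtract implementation: it only pairs m/z values whose spacing matches a whole number of 13C-for-12C substitutions, and it omits the isotopic pattern checks, peak shape correlation and feature grouping mentioned in the abstract. The function name, the ppm tolerance, the carbon limit and the example m/z values are all assumptions made for illustration.

```python
# Mass difference between one 13C and one 12C atom, in Da.
C13_C12_DELTA = 1.003355


def pair_labelled_signals(mz_values, max_carbons=40, ppm=5.0, charge=1):
    """Pair m/z values separated by n 13C-for-12C substitutions.

    Returns (light_mz, heavy_mz, n) tuples, where n is the number of
    carbon atoms that would explain the spacing. Hypothetical helper,
    not part of MetExtract.
    """
    mz_sorted = sorted(mz_values)
    pairs = []
    for i, light in enumerate(mz_sorted):
        for heavy in mz_sorted[i + 1:]:
            # Nearest whole number of carbons that explains the spacing.
            n = round((heavy - light) * charge / C13_C12_DELTA)
            if n < 1 or n > max_carbons:
                continue
            expected = light + n * C13_C12_DELTA / charge
            # Accept the pair only if it is within the ppm tolerance.
            if abs(heavy - expected) / expected * 1e6 <= ppm:
                pairs.append((light, heavy, n))
    return pairs


# Illustrative values: a native signal, its fully labelled partner
# with 15 carbons, and an unrelated signal.
spectrum = [250.0, 297.1333, 297.1333 + 15 * C13_C12_DELTA]
print(pair_labelled_signals(spectrum))
```

The inferred n corresponds to the number of tracer-derived carbon atoms that the abstract says is annotated for each extracted feature.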

A021 - MetaPredictor for Human Leukocyte Antigens

Short Abstract: Human leukocyte antigens (HLA) are proteins involved in the human immunological response. The understanding of the HLA-peptide binding interaction is a crucial step for peptide-based vaccine design. However, the high rate of polymorphisms in HLA makes this task difficult. The in silico approach represents a useful, less time-consuming and inexpensive way to investigate the peptide binding activity. A pool of binding and non-binding peptides extracted from the literature and the Immune Epitope Database (IEDB) has been used as the training set. We performed classification of the experimental data separately for each HLA II protein using different machine learning methods, such as Support Vector Machine (SVM), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN) and K-Nearest Neighbor (K-NN), and by a consensus approach. The proposed MetaPredictor exploits the capability of various well-known supervised classifiers to yield a better solution than any single algorithm. The final prediction is made by the cluster-based similarity partitioning algorithm (CSPA). The performance of the MetaPredictor is described using accuracy, precision, recall and F-measure, together with confusion matrices. The error estimates are calculated using the leave-one-out procedure. Results show that the MetaPredictor produces a maximum gain of 16% over any tested single machine learning method. Finally, statistical significance tests have been performed to establish the superiority of the proposed predictor.

A022 - A modular framework for gene set analysis integrating multilevel omics data

Short Abstract: Modern high-throughput methods allow the investigation of biological functions across multiple "omics" levels. Levels include mRNA and protein expression profiling as well as additional knowledge on e.g. DNA methylation and microRNA regulation. The reason for this interest in multi-omics is that actual cellular responses to different conditions are best explained mechanistically when taking all omics levels into account. To map gene products to their biological functions, public ontologies like Gene Ontology (GO) are commonly used. Many methods have been developed to identify terms in an ontology, overrepresented within a set of genes. However, these methods are not able to appropriately deal with any combination of several data types. Here, we propose a new method to analyse integrated data across multiple omics-levels to simultaneously assess their biological meaning. We developed a model-based Bayesian method for inferring interpretable term probabilities in a modular framework. Our Multilevel ONtology Analysis (MONA) algorithm performed significantly better than conventional analyses of individual levels and yields best results even for sophisticated models including mRNA fine-tuning by microRNAs. The MONA framework is flexible enough to allow for different underlying regulatory motifs or ontologies. It is ready-to-use for applied researchers and is available as a standalone application from http://cmb.helmholtz-muenchen.de/mona.

A023 - Integration of genetic and epigenetic data suggests different roles for known tumor suppressor genes in subtypes of colon and renal cancer

Short Abstract: It is increasingly recognized that specific cancer types, such as colon cancer, represent a collection of different diseases. Gene expression data and clinicopathological characteristics have been utilized for identifying subtypes in various cancer types. To develop a better understanding of the biological differences between tumor subtypes, it is necessary to investigate deregulation of cellular processes by integrating information from different data types. We developed a new approach for detecting genes that are commonly targeted by genetic and epigenetic aberrations. To this end, our method compares tumor and normal samples to detect significant changes in DNA methylation and DNA copy number alterations that correlate with gene expression. Additionally, the method considers the frequency of somatic mutations found in tumor samples. First, we applied our new method to compare the types and frequency of aberrations of known tumor suppressor genes in renal clear cell carcinoma and renal papillary cell carcinoma. We found differences in, amongst others, the chromatin remodeler ARID1A and TSC2, an inhibitor of mTORC1 signaling. Second, we applied the method to gene expression-derived subtypes of colon cancer. We found that the nuclear factor NF-kappa-B p105 subunit was more frequently targeted by aberrations in a mesenchymal subtype compared with an epithelial subtype. This finding is in line with the involvement of NF-kappa-B in the induction of epithelial-to-mesenchymal transition. In summary, our method can improve the understanding of the biological differences between tumor subtypes by integrating data from different genome-wide experimental assays and help prioritize candidate genes for further investigation.

A024 - Effects of EWS-FLI1 oncogene on microRNA expression revealed by sequencing data

Short Abstract: EWS-FLI1 fusion gene is an aberrant transcription factor expressed by the large majority of Ewing tumors [1]. Since its expression is sufficient to change cell phenotype from normal to tumorigenic, it is considered as a major oncogene driving the development of this tumor.
Here, we study the effect of EWS-FLI1 on microRNA expression to identify potential microRNAs implicated in the disease. Data used in this study have been generated by Illumina sequencing of small RNAs from the A673 Ewing cell line in which EWS-FLI1 is inhibited by a small hairpin RNA. The ncPRO-seq tool [2] has been used for data quality control, read mapping and read annotation. Then, differential expression analysis has been performed to identify EWS-FLI1 modulated microRNAs.
Results from small RNAs sequencing data have been compared to array-based microRNA quantification from the same cell line to check for consistency between measurements obtained by these two techniques. From this comparison, we identified groups of commonly detected modulated microRNAs.
Enrichment analysis has been applied to investigate whether modulated miRNAs are located in genomic regions frequently altered in cancer. Moreover, we investigate whether they belong to the same microRNA families, whose members have highly similar binding sites and may coordinately regulate common target genes.

[1] Delattre O et al., N Engl J Med, 1994 ; 331:294-299
[2] Chen CJ et al, Bioinformatics, 2012 ; 28(23):3147-9

A025 - Importance of negation scope and spelling variation in text-mining of electronic patient records

Short Abstract: Electronic patient records are a potentially rich data source for biomedical research. By text-mining free‐text in such records, phenotype information can be extracted. This information can be used to investigate disease co‐morbidity, patient stratification and underlying molecular level disease etiology, which are all important prerequisites for personalized medicine.
Here we evaluated how adding functionalities to a baseline text-mining tool affected the overall performance. The purpose of the tool was to create enriched phenotypic profiles for each patient in a corpus consisting of records from 5,543 patients at a Danish psychiatric hospital, by assigning each patient additional ICD10 codes based on free-text parts of these records. The tool was evaluated by manually curating a test set consisting of all records from 50 patients and determining whether the ICD10 codes assigned to each patient were correct. The functionalities of the tool evaluated were designed to handle spelling and ending variations, shuffling of tokens within a term, and introduction of gaps in terms. Additionally, we investigated the importance of negation identification and scope.
The single most important functionality of the tool was handling of spelling variation, which greatly increased the number of phenotypes found in the records, without noticeably decreasing the precision. Additionally our results show that negations have very different scopes, some spanning only a few words, while others span up to whole sentences. This shows that negation identification and scope are important in clinical information extraction, and tools aiming at extracting information from these sources should include ways to handle negations.

A026 - A Novel Approach to Detect Disease-Related Genes using Protein-Protein Interactions and Literature-Driven Gene Network

Short Abstract: Detecting disease-related genes is an important research topic in bioinformatics because it can help in finding new drugs or treatments. There are already a number of approaches to finding disease-related genes, such as microarray data analysis, literature-based discovery, and the use of protein-protein interaction data or pathway data. Those approaches usually use only one data type. However, approaches employing just one data type can easily meet limitations, and using more than one data type can create a synergy effect. We developed a novel algorithm to detect disease-related genes using three data types: gene expression data, protein-protein interactions and literature data. The algorithm derives interactions from the co-occurrence of biomedical terms in literature data and adds those interactions to the protein-protein interactions to build a gene network. It then calculates a differential expression score from the gene expression data for each gene in the network, and assigns each gene a new score based on the average of its neighbors' differential expression scores in the network. We assume that top-scored genes have a high probability of being related to the disease. Our algorithm was tested with prostate cancer gene expression data (Singh et al., Cancer Cell, 2002), protein-protein interactions (Brown et al., Genome Biol., 2007), and 48,861 PubMed abstracts related to “Prostate cancer”. The algorithm found more prostate cancer related genes (68 related genes among the 500 top-scored genes) than the other feature selection methods it was compared with. The algorithm also showed better classification accuracy and AUC (area under the curve) than the other feature selection methods.
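The network scoring step in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether a gene's own score enters the average, how genes without neighbors are handled, or how the differential expression score is computed, so the sketch averages over neighbors only, gives isolated or unscored cases a value of 0.0, and takes the per-gene scores as given. The function names and the toy genes are hypothetical.

```python
from collections import defaultdict


def build_network(ppi_edges, literature_edges):
    """Merge two undirected edge lists into one adjacency map."""
    adjacency = defaultdict(set)
    for a, b in list(ppi_edges) + list(literature_edges):
        if a != b:  # ignore self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def neighbor_average_scores(adjacency, de_score):
    """Score each gene by the mean differential expression score
    of its network neighbors (assumption: own score excluded)."""
    smoothed = {}
    for gene, neighbors in adjacency.items():
        values = [de_score[n] for n in neighbors if n in de_score]
        smoothed[gene] = sum(values) / len(values) if values else 0.0
    return smoothed


# Toy example with made-up genes and scores.
ppi = [("A", "B")]
literature = [("B", "C")]
de = {"A": 2.0, "B": 1.0, "C": 4.0}

network = build_network(ppi, literature)
scores = neighbor_average_scores(network, de)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores)
```

In this toy network, gene B has a low score of its own but ranks first because both of its neighbors are strongly differentially expressed, which is the kind of effect the neighbor average is meant to capture.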

A027 - Disease indication identification through targeted integration of biological knowledge and data.

Short Abstract: Background: A challenge for the pharmaceutical industry is to identify the best disease population to test with a new drug. Finding a route to quickly identify indications for novel drugs is a valuable process and has been explored by a number of groups. However, the majority of published methods utilize only a single data modality, such as mRNA expression or literature or pharmaceutical pipeline data. Here, we present a method which effectively combines different data types from publicly available, independent sources, providing an efficient route to identify drug indications. Method: Connectivity Map mRNA expression profiles (Lamb et al, 2006) based on drug treatments were processed as described by Iskar et al (2012), and genes significantly dysregulated by drugs were identified. Using natural language processing (NLP) methods, a high-quality set of ‘gene causes disease’ relationships was extracted from PubMed. Manual inspection indicated that this set included many known Mendelian associations, plus some novel associations. These results were combined to identify drugs that regulate disease-causing genes, which suggests new indications for these drugs. Results & Discussion: By combining these two datasets we were able to correctly identify a small fraction of current drug indications. We are actively making further improvements both to the NLP techniques and to the identification of genes significantly dysregulated by drugs, in order to improve the methodology and obtain both quantitative estimates of accuracy and, more importantly, actionable examples of hypotheses around new indications.

A028 - Complex MicroRNA networks discovery in Follicular Thyroid Carcinomas

Short Abstract: Introduction: Follicular Thyroid Cancer (FTC) etiology is poorly understood and molecular similarity to benign Follicular Thyroid Adenoma (FTA) is a clinical challenge.

Hypothesis: Differences in microRNA (miR) regulation between FTC and FTA can help in differential diagnosis.

Materials & Methods: 37 samples (20 FTA, 17 FTC) were used for gene expression array and miR deep sequencing experiments. Spearman correlations between mRNA and miR expression were calculated separately for FTA and FTC samples. Characteristic mRNAs and miRs for both FTC and FTA were selected based on percentile values (95th and 5th for mRNAs and 98th and 2nd for miRs) and their correlations were checked against the TargetRank miR regulation prediction tool.

Results: Highly correlated mRNAs had a highest correlation percentile of 0.74 and a lowest of -0.72, and miRs a highest of 0.75 and a lowest of -0.72. In both FTC and FTA samples, characteristic mRNAs and miRs were found that are highly interconnected in miR regulatory pathways. In benign tumors (FTA), genes involved in G protein signaling pathways were highly over-represented, and these pathways are known to be important for thyroid signaling.

Conclusions: Correlation percentile analysis has the potential to shed some light on complex miR regulatory pathways. Selected miR regulations can be useful for molecular diagnosis.

Funding: FNP MPD Program “Molecular Genomics, Transcriptomics and Bioinformatics in Cancer” (BW, TS). CF was supported by a PhD scholarship of the Centro Nacional de Pesquisa e Tecnologia do Brasil (CNPq) (290023/2009-2), ME was supported by a DFG grant (ES162/4-1) and RP by DFG and Krebshilfe grants.

A029 - Transcriptomic and Proteomic Analysis of Human Pre-eclampsia tissue

Short Abstract: In an effort to map the human proteome, the Chromosome-centric Human Proteome Project (C-HPP) was recently initiated. To build a deep, comprehensive chromosome-centric map (with gene expression level, protein expression level, protein characteristic and alternative splicing product information), we selected placenta as the model tissue because it contains a large number of gene products relative to other organs and serves a similar function in mammals despite anatomical differences. To gain deeper insight into the function of placental proteins, we selected pre-eclampsia as the disease model and adopted a multi-omics approach. For genomic profiling, cDNA arrays (Illumina) will be used, and for quantification at the protein level, the iTRAQ labeling technique will be used. Once gene and protein expression levels have been obtained, we can compare gene and protein expression between normal and diseased placenta. From this study, we expect to learn about the disease mechanism and the role of the proteins involved. (This study was supported by a grant from MediStar (A112047 to S.K.J.), the National Project for Personalized Genomic Medicine (A111218-11 to Y.K.P.), the National R&D Program for Cancer Control, Ministry of Health and Welfare (1120200 to Y.K.P.) by the Ministry for Health and Welfare, and World Class University (WCU) grant (R31-2008-000-10086-0). We thank Agilent Technologies and Thermo Fisher Scientific for their generous support of mass spectrometric analysis.)

A030 - Towards a bioinformatics analysis of anti-Alzheimer’s herbal medicines from a target network perspective

Short Abstract: With the growth of the aging population all over the world, a rising incidence of Alzheimer’s disease (AD) has recently been observed. In contrast to FDA-approved western drugs, herbal medicines, characterized by abundant ingredients and multi-targeting, have been acknowledged to have notable anti-AD effects although the mechanism of action (MOA) is unknown. Investigating the possible MOA for these herbs can not only refresh but also extend the current knowledge of AD pathogenesis. In this study, clinically tested anti-AD herbs, their ingredients as well as their corresponding target proteins were systematically reviewed together with applicable bioinformatics resources and methodologies. Based on the above information and resources, we present a systematic target network analysis framework to explore the mechanism of anti-AD herb ingredients. Our results indicated that, in addition to binding the symptom-relieving targets, as FDA-approved drugs usually do, ingredients of anti-AD herbs also interact closely with a variety of successful therapeutic targets related to other diseases, such as inflammation, cancer and diabetes, suggesting possible cross-talk between these complicated diseases. Furthermore, pathways of Ca2+ equilibrium maintenance, upstream of cell proliferation and inflammation, were densely targeted by the anti-AD herbal ingredients according to rigorous statistical evaluation. In addition to the holistic understanding of the pathogenesis of AD, the integrated network analysis of the MOA of herbal ingredients may also suggest new clues for future disease-modifying strategies.

A031 - Correlation Network Balancing Test (CNBT): A multivariate nonparametric differential coexpression test

Short Abstract: Motivation. Differential gene coexpression (DC) analysis aims to identify genes with correlated expression patterns in one phenotype, but not the other. In the simplest case, using univariate tests, DC analysis identifies gene pairs that are significantly coexpressed. Methods for identifying DC pathways (gene sets) also exist and are primarily based on aggregated pairwise comparisons and ignore the full correlation structure between genes. Inspired by the power control strategies in cellular wireless systems, we propose the Correlation Network Balancing Test (CNBT), a multivariate nonparametric DC test for gene sets.

Results. Our test assigns initial weight factors to the genes in the network and iteratively modifies these factors with the objective of maximizing the ratio of autocorrelation to inter-gene correlations for all genes. The algorithm achieves guaranteed fast convergence. The weight factors converge to values related to the topological position of genes. We demonstrate that the test statistic is sensitive to topological changes in a network by examining the changes in gene expression correlation networks between p53 wild type and p53 mutated samples over several selected pathways.
The comparative power analysis of CNBT and gene set co-expression analysis (GSCA) tests demonstrates that CNBT targets network structural changes rather than differences in gene correlations. The CNBT algorithm is a new approach for the analysis of changes in pathways’ correlation structure between two phenotypes and will increase our ability to detect relevant biological processes leading to those changes.

Availability. R code is available from the first author upon request.

A032 - Feature selection from large scale clinical proteomics data sets: a promise for improved diagnosis and treatment of breast cancer patients

Short Abstract: The efficient treatment of breast cancer patients greatly depends on accurate subtype classification, which despite the available biomarkers is not always straightforward and often suboptimal. Clinical mass spectrometry-based proteomics is becoming an increasingly powerful field in addressing the need for better diagnostics and the discovery of novel biomarkers. The current advances in sample preparation and quantification of proteins in tissues enable the characterization of thousands of proteins from patient samples.
We report the implementation of a comprehensive framework for analysis of proteome profiles and selection of discriminative features and its successful application to the classification of breast cancer patients. In our study we address the main challenges, related to the tasks of signature detection and subsequent sample classification: (i) high biological variability among patient samples and (ii) large feature space combined with low sample size. Support vector machines are employed and the outcomes of different feature selection methods are compared. The framework supports means- and classifier’s weights-based feature ranking methods embedded in cross-validation procedures for maximal generalizability.
Our results showed that despite the limited number of samples it is possible to distinguish disease-related patterns and to extract biologically relevant features. Furthermore, we were able to demonstrate the effect of different feature selection methods tailored either for the detection of single biomarkers or for the identification of sets of features with high predictive power. In summary, clinical proteomics combined with rigorously applied data mining techniques holds great promise for improvements in the field of personalized medicine.

A033 - A sensitive and specific genetic marker to diagnose severe asthma using a Genome-wide association study(Exomechip).

Short Abstract: The aim of the present study was to develop a diagnostic set of SNPs for discriminating mild asthma from severe asthma groups using the 240K Exome-chip. First, the Exome-chip data were filtered according to p-values using multiple logistic regression, and the 10 candidate SNPs most closely associated with severe asthma were selected, based on 111 mild asthma (cases) and 111 severe asthma (controls) subjects. Second, using the 10 candidates, the 1023 (2^10 - 1) possible models were generated. For the 1023 models, multiple logistic regression and receiver operating characteristic (ROC) curve analysis were performed. Nine SNPs were chosen as the best model for distinguishing mild asthma from severe asthma. A combination model of the 9 SNPs among the 10 SNPs showed the highest area under the ROC curve of 0.83 (P-value: 6.0842E-17; asymptotic 95% confidence interval lower bound: 0.77, upper bound: 0.88; AUC: 0.77), with a sensitivity and specificity of 78% and 74%, respectively (rs6628742 (DMD), rs738479 (PARVB), rs2227310 (CASP7), rs32319 (PDE1C), rs1849173 (GABRB2), rs241439 (TAP2), rs4809401 (NPBWR2), rs4342887 (KIF2C), rs8176746 (ABO)). The genes of 6 of the 9 SNPs are known from a PubMed search to be related to asthma or inflammation, and CASP7 (rs2227310) and KIF2C (rs4342887) interact with each other according to gene interaction network analysis. These results suggest that the 9 candidate SNPs may be related to severe asthma and that the candidate set of 9 SNPs may be useful in predicting severe asthma.
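The figure of 1023 candidate models is the number of non-empty subsets of 10 SNPs, 2^10 - 1. The sketch below enumerates those subsets and shows a rank-based way to compute the area under the ROC curve for one model's predicted scores. It is an illustration only: the SNP names are placeholders (the abstract names nine SNPs, not all ten candidates), the logistic regression fit for each subset is left out, and the helper names are hypothetical.

```python
from itertools import combinations


def all_nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    items = list(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Placeholder SNP identifiers, not the ones used in the study.
candidate_snps = [f"snp{i}" for i in range(1, 11)]
models = list(all_nonempty_subsets(candidate_snps))
print(len(models))  # 1023

# Made-up predicted scores and class labels for a single model.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

In a full analysis, each subset would be fitted with multiple logistic regression and its predicted probabilities passed to the AUC function, and the subset with the highest AUC would be kept.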

A034 - Integrated data analysis of DNA methylation and gene expression profiles on colorectal cancer through network-based method

Short Abstract: Recent rapid advances in experimental technologies of molecular biology produce vast amounts of heterogeneous data, such as gene expression, many types of epigenetic profiles and somatic mutation data, particularly in cancer research. We need new integrated analysis methodologies to elucidate cancer subtypes, which could be valuable for tailor-made cancer treatment. To explore the possibilities of data integration, we have obtained DNA methylation data with the Illumina® 450K platform and gene expression data with Affymetrix® U133plus2.0 on 81 colorectal cancer samples (41 samples with no metastasis and 40 samples with liver metastasis). We have developed a new methodology for analyzing the 450K array data to characterize the status of CpG island methylation profiles of cancer samples. Then we applied an integrated clustering algorithm, known as iCluster, to our DNA methylation and gene expression data. iCluster discovered five clusters, and one of them was correlated with liver metastasis status. We analyzed the genes whose expression patterns were specific to the cluster with NetHiKe (Network-based Hidden Key molecule miner), which was developed by our group. We identified SNIP1 (Smad nuclear interacting protein 1), which might play a role in cancer progression through stabilizing cyclin D1 mRNA, as a key molecule for the cluster. We plan to analyze the remaining four clusters and characterize them with clinical information that is useful for building cancer treatment strategies.

A035 - Screening the Prostate Cancer Susceptibility Loci at 2q37.3 and 17q12-q21 for Novel Candidate Genes in Finnish Prostate Cancer Families

Short Abstract: According to several studies, genetic risk factors have been shown to be associated with prostate cancer susceptibility. Several chromosomal loci have been shown to be associated with familial prostate cancer. In a recent genome-wide linkage study, strong signals coming from 2q37 and 17q21-22 were discovered in the Finnish population. To study these loci in detail we performed targeted high-throughput DNA sequencing on 21 families including 65 cases and 5 controls. In addition, RNA-sequencing was performed for 33 of these cases from purified RNA from whole blood. The aim of this study was to identify variants that are associated with prostate cancer susceptibility.

Variant calling from sequencing was done using Samtools and variants were subsequently annotated using information from the UCSC genome browser database. Three pathogenicity prediction tools, Polyphen-2, Pon-P and Mutation taster, were used to elucidate the possible phenotypic effects of variants located within genes. As an alternative variant prioritization approach, we compiled a list of prostate cancer associated genes within the regions of interest from literature, Cancer Gene, Gene-Ontology and pathway databases. To study the intergenic variants in more detail, an eQTL analysis was conducted applying two statistical models: a linear model and a model based on a non-parametric directional test.

As a result of the pathogenicity prediction a ranked list of 152 variants with putative effect on protein function was obtained. 38 of these variants as well as 20 additional variants from prostate cancer associated genes were chosen for further validation in a larger population. Validation of these and eQTL targets is currently ongoing.

A036 - Analyzing Motion Characteristics of Single Molecules for Providing Evidence of Paroxysmal Nocturnal Hemoglobinuria in Microscopy Images

Short Abstract: Paroxysmal nocturnal hemoglobinuria (PNH) is a rare disease that is characterized by acquired hemolytic anemia, a decreased number of red blood cells, kidney failure, and a high incidence of life-threatening venous thrombosis. This disease prevents some proteins, such as CD59 (protectin), from anchoring to the surface of the red blood cells.
The detection of these PNH affected cells is based on an efficient approach for finding and tracking single molecules in nanoscale microscopy images. In this context, the trajectory determination of single molecules of the CD59 antigen on erythrocytes plays an important role.

First, single molecules in PNH affected cells are identified in microscopy images using Gaussian fitting methods. Afterwards, these molecules are tracked over sets of images: A tracking algorithm has been developed to characterize lists of trajectories of PNH affected cells over a series of images using the Thompson formula for calculating optimal object parameters. This approach is used for detecting, analysing and visualizing CD59 trajectories.

The trajectories determined in this way are subsequently characterized using a set of features such as their length (mean trajectory length), their shape (confined, free diffusion) and other pre-defined features. Machine learning algorithms, for instance support vector machines, neural networks, random forests, and genetic programming (all available in HeuristicLab (http://dev.heuristiclab.com)), are used to learn classifiers that distinguish between cells of healthy patients and PNH affected cells.
Using this approach we are able to characterize and recognize PNH affected cells as well as the disease states based on microscopy images.

A037 - Network information improves cancer outcome prediction

Short Abstract: Disease progression in cancer can vary substantially between patients. Yet patients often receive the same treatment.
Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. Despite first diagnostic kits on the market, there are open problems such as the choice of random gene signatures or noisy expression data. One approach to deal with these two problems employs protein-protein interaction networks and ranks genes using the random surfer model of Google's PageRank algorithm.
In this work we created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from literature and systematically evaluated the use of networks and a PageRank derivative, NetRank, for signature identification.
We show that the NetRank algorithm performs significantly better than classical methods such as fold change or t-test.
Despite an order of magnitude difference in network size, a regulatory network and a protein-protein interaction network perform equally well.
Experimental evaluation on cancer outcome prediction in all of the 25 underlying datasets suggests that the network-based methodology identifies highly overlapping signatures over all cancer types, in contrast to classical methods that fail to identify highly common gene sets across the same cancer types.
Integration of network information into gene expression analysis allows the identification of more reliable and accurate biomarkers and provides a deeper understanding of processes occurring in cancer development and progression.
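The idea of ranking genes with a random surfer model on an interaction network can be sketched as a personalized PageRank-style iteration. This is not the NetRank implementation evaluated in the abstract: the update rule, the damping value, the use of a per-gene prior (for example an expression-based relevance score) and the undirected toy network are assumptions chosen for illustration, and the function name is hypothetical.

```python
def network_rank(adjacency, prior, damping=0.3, iterations=100, tol=1e-10):
    """PageRank-style gene ranking on an undirected network.

    adjacency: dict mapping each gene to the set of its neighbors
               (assumed symmetric).
    prior:     dict of non-negative per-gene weights, e.g. a relevance
               score derived from expression data.
    """
    genes = sorted(adjacency)
    total = sum(prior[g] for g in genes)
    base = {g: prior[g] / total for g in genes}  # normalized prior
    rank = dict(base)
    for _ in range(iterations):
        new_rank = {}
        for g in genes:
            # Each neighbor passes on an equal share of its current rank.
            incoming = sum(rank[n] / len(adjacency[n]) for n in adjacency[g])
            new_rank[g] = (1 - damping) * base[g] + damping * incoming
        delta = sum(abs(new_rank[g] - rank[g]) for g in genes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy star network with a uniform prior: the hub gene H is connected
# to three otherwise unconnected genes.
adjacency = {"H": {"A", "B", "C"}, "A": {"H"}, "B": {"H"}, "C": {"H"}}
prior = {"H": 1.0, "A": 1.0, "B": 1.0, "C": 1.0}
rank = network_rank(adjacency, prior)
print(sorted(rank, key=rank.get, reverse=True))
```

With a uniform prior the hub ranks first purely because of its position in the network. Replacing the prior with expression-derived scores lets both sources of information influence the ranking, which is the combination the abstract describes.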

A038 - Implications of the expression status of VEGF and aberrant miRNA expression in breast cancer

Short Abstract: Activation of the vascular endothelial growth factor (VEGF) pathway in breast cancer is known to cause tumors with high microvascular density and to influence prognosis and response to conventional hormonal therapy. Here, we study two groups of breast cancers based on the expression status of VEGF to identify differential co-expression of genes and miRNAs. Differentially co-expressed genes based on a large independent dataset are validated against previously published findings based on another dataset. We show differential association of VEGFA status with biological processes and pathways relevant to breast cancer aggressiveness. In addition, we show the prognostic significance of VEGF on a large independent dataset.

A039 - FragExtract: a new software tool for the automated extraction of LC-MS/MS derived signals to help structure elucidation in metabolomics research

Short Abstract: Structure elucidation of (unknown) metabolites of interest is still a major bottleneck in untargeted metabolomics studies. For definitive compound identification, tandem mass spectra (MS/MS) need to be recorded and compared against mass spectral databases or MS/MS spectra of authentic reference standards obtained under the same experimental conditions. One elegant approach in this respect is full in vivo stable isotopic labelling (SIL) of whole organisms.
Here, we present FragExtract, a novel software tool for processing high-resolution liquid chromatography tandem mass spectrometry (HR-LC-MS/MS) data of SIL-assisted experiments where MS/MS spectra of native and corresponding 13C labelled compounds are acquired in the same analytical run. FragExtract provides a powerful tool to determine meaningful MS/MS fragment signals by finding pairs of corresponding signals in MS/MS spectra of native and labelled substances and calculating their number of carbon atoms. This way, artefacts and noise-related signals can be filtered out. Additionally, FragExtract generates meaningful suggestions for elemental formulas of the fragment ions. It shows the potential to assist in structural elucidation and annotation of unknowns in untargeted studies, and could contribute significantly to future SIL-assisted untargeted metabolomics studies.
The workflow and the presented software were developed based on product ion MS/MS spectra from measurements of selected reference substances. For verification of the algorithm, mixtures of 12C and 13C labelled substances were spiked at different concentration levels into culture filtrates of the filamentous fungus Fusarium graminearum. Under the tested conditions, the presence of matrix compounds did not alter FragExtract’s ability to properly and reliably filter out unspecific signals.
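The pairing idea described above can be sketched as follows. This is a minimal illustration, not FragExtract's actual implementation: it assumes singly charged fragments, a fixed m/z tolerance, and full 13C labelling, and the function name, parameters and m/z values are hypothetical. A fragment with n carbon atoms should appear in the labelled spectrum shifted by n times the 13C-12C mass difference.

```python
# Mass difference between 13C and 12C in Da.
C13_C12_DELTA = 1.003355


def pair_fragments(native_mz, labelled_mz, max_carbons=50, tol=0.005):
    """Pair native and fully 13C-labelled fragment signals.

    Assumes singly charged ions (hypothetical simplification).
    Returns a list of (native_mz, labelled_mz, n_carbons) tuples.
    """
    pairs = []
    for m_nat in native_mz:
        for m_lab in labelled_mz:
            shift = m_lab - m_nat
            n = round(shift / C13_C12_DELTA)
            # Keep the pair only if the shift matches a whole number of carbons.
            if 1 <= n <= max_carbons and abs(shift - n * C13_C12_DELTA) <= tol:
                pairs.append((m_nat, m_lab, n))
    return pairs


# Illustrative values only; 150.0000 has no labelled partner and is dropped.
native = [91.0542, 119.0491, 150.0000]
labelled = [98.0777, 127.0759]
result = pair_fragments(native, labelled)
for m_nat, m_lab, n in result:
    print(f"{m_nat:.4f} <-> {m_lab:.4f}: {n} carbon atoms")
```

Signals without a partner at a whole-carbon shift are treated as noise or artefacts in this sketch, which mirrors the filtering step described in the abstract.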

A040 - Accuracy of algorithms and databases for the prediction of deleterious and disease causing mutations in healthy individuals.

Short Abstract: Whole genome sequencing comes with the unprecedented opportunity to obtain secondary information not related to the original clinical question. In order for this information to prove useful, assessing the clinical validity is crucial. We must ensure that prediction algorithms and databases are able to accurately predict the phenotype in healthy individuals. To this end we compared the specificity of several prediction algorithms and databases like SIFT, PolyPhen and HGMD regarding the detection of damaging or disease causing mutations. We also tested a new in-house developed algorithm called eXtasy that is disease-centric and incorporates the phenotype of the individual in the analysis.

We analyzed samples from people who are considered healthy from the 1000 Genomes Project, publicly available samples from Complete Genomics, and research samples from the centre for human genetics at our university. In those samples, we looked for mutations in genes associated with severe congenital disorders characterized by extreme dysmorphologies and early onset.

We found large differences between combinations of the different prediction algorithms, databases, datasets and modes of inheritance. Depending on the dataset 98-100% of individuals had mutations predicted to be damaging by both PolyPhen and SIFT. For HGMD mutations, we found that up to 22% of the samples had mutations annotated as disease causing. Overall there were more false positives for autosomal dominant disorders compared to autosomal recessive disorders. The results obtained from eXtasy are promising and show a large increase in specificity compared to other prediction algorithms.

A041 - An ensemble method for cancer classification with microRNA expression profiles

Short Abstract: For the last decade, mRNA expression profiling with microarrays has been widely used to classify the different types of human cancers. Recently, much research has shown that miRNA expression profiling can classify human cancers more accurately than mRNA expression profiles. However, there is little work on machine learning algorithms to classify cancers with miRNA expression data. In this study, we introduce a feature subset based ensemble method for classifying multiple tissues with miRNA expression data. The proposed method has three major steps: i) Generation of multiple miRNA subsets based on the correlations among miRNAs. In this procedure, the symmetrical uncertainty is used to measure the redundancy of the miRNAs, and then the redundant miRNAs are included in the different subsets. ii) Use of the C4.5 decision tree algorithm as the base classifier to learn a model from each miRNA subset. iii) Combination of the results of the classifiers by averaging probabilities. The idea is to sum up the conditional probability vectors obtained from each classifier and then get the average by dividing the sum by the number of base classifiers. To demonstrate the effectiveness of our method, we tested it on a miRNA expression dataset available in the Gene Expression Omnibus (GEO) database. The proposed method achieved an average AUC of 0.979 with leave-one-out cross validation. Moreover, our method achieved the best prediction accuracy, 91.78%, compared with Bagging, AdaBoost and RandomForest. Clearly, the proposed method outperforms other ensemble methods on the miRNA expression data.
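Step iii) above, the average-of-probabilities combination, can be sketched in a few lines. The probability vectors below are made-up numbers standing in for the outputs of three base classifiers; the function names are hypothetical and the base classifiers themselves (C4.5 trees in the abstract) are not implemented here.

```python
def average_probabilities(prob_vectors):
    """Sum the class-probability vectors and divide by the number of classifiers."""
    n_classifiers = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    summed = [0.0] * n_classes
    for vec in prob_vectors:
        for i, p in enumerate(vec):
            summed[i] += p
    return [s / n_classifiers for s in summed]


def predict(prob_vectors):
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(prob_vectors)
    return max(range(len(avg)), key=avg.__getitem__)


# Illustrative outputs of three base classifiers for a three-class problem.
votes = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]
avg = average_probabilities(votes)
label = predict(votes)
print(avg, label)
```

Note that the second classifier alone would pick class 1, but the averaged vector favours class 0.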

A042 - AGP: A tool for prediction of genes associated with Age-Related Disorders

Short Abstract: Interplay of genetic factors has been shown to play a pivotal role in the inevitable process of aging. In the process of aging, accumulation of mutations may lead to one or many age-related disorders (ARDs). Many genes have been observed to be common links between many ARDs. Examples include HLA-DQB1 with rheumatoid arthritis, Graves’ disease and Crohn’s disease; MTHFR with cardiovascular disorders, schizophrenia and atrial fibrillation; and NOS3 with hypertension and dementia. However, to date no tool exists that may be used to prioritize

A043 - Reconstructing High-resolution Images from Multiple Low-resolution Views

Short Abstract: We propose a geometry-guided algorithm to reconstruct one high-resolution image from multiple low-resolution views of a sample. The method is based on estimating the intensity and its gradient in the low-resolution images using linear regression and/or non-local means. The high-resolution image is then estimated by solving a Helmholtz equation. We solve this equation in the image domain using a direct solver with a runtime complexity in O(n log(n)), where n is the number of pixels in the high-resolution image. We test, demonstrate, and benchmark the method on real-world images. We then show its application to 2D and 3D biological image data from fluorescence microscopy.

A044 - Comparison of available computer software for nuclear magnetic spectroscopy

Short Abstract: Many software packages currently deal with Nuclear Magnetic Spectroscopy. Some are commercial, while others are available for free, including under open-source licenses. Such software is responsible for the full analysis of a given Nuclear Magnetic Resonance (NMR) signal. The aim of the current research was to design software in the MATLAB environment that can run the full pre-processing analysis and further processing. The analysis of such a signal can be divided into a few steps. First, the investigated spectrum, in the form of raw data, has to be read from a file, taking the different scanner types into account, and saved in one common format. In the next step, the measured signal is subjected to pre-processing techniques that enhance the signal; these are performed by the software. The pre-processing techniques include noise filtering, removal of the water signal, phase correction, baseline correction and signal modeling. Because a signal presented in the time domain is unreadable for diagnostic purposes, the software has to transform it into a spectrum in the frequency domain using the Fourier Transform. The authors proposed a few solutions and compared them with available software. As a result of the project, the authors obtained a functional system for processing NMR spectra. After a series of experiments, the authors observed that the obtained result is satisfactory according to the assumed quality criterion. However, the precision of the algorithms used may still be improved. Since the software was warmly welcomed by cooperating physicians, the authors decided to continue work on the proposed system.
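The time-domain to frequency-domain step can be illustrated with a small sketch. The abstract describes a MATLAB system; the Python below is only an illustration of the transform itself, using a naive discrete Fourier transform on a synthetic, exponentially decaying signal. All function names and signal parameters are assumptions, and a real tool would use an FFT routine and include the other pre-processing steps.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def synthetic_fid(freq_hz, decay_s, sample_rate_hz, n_points):
    """Complex, exponentially decaying oscillation (a toy free induction decay)."""
    return [
        cmath.exp(2j * math.pi * freq_hz * t / sample_rate_hz)
        * math.exp(-t / (sample_rate_hz * decay_s))
        for t in range(n_points)
    ]


SAMPLE_RATE = 256.0
N_POINTS = 256

fid = synthetic_fid(freq_hz=50.0, decay_s=0.5,
                    sample_rate_hz=SAMPLE_RATE, n_points=N_POINTS)
spectrum = dft(fid)
magnitudes = [abs(c) for c in spectrum]

# The largest magnitude marks the resonance frequency of the toy signal.
peak_bin = max(range(N_POINTS), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N_POINTS
print(f"peak at {peak_hz:.1f} Hz")
```

The decaying oscillation is hard to interpret by eye in the time domain, while the spectrum shows a single peak at the oscillation frequency.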

A045 - Computational genomic analysis of the gene expression profile of fluid transporting proteins in malignant pleural mesothelioma based on Gene Ontology annotations.

Short Abstract: Background: Water and ion transporters of mesothelial cells play a pivotal role in pleural fluid absorption. In malignant pleural effusions, the effusion volume is significantly correlated with the number of pleural tumor foci. Pertinent to the above, it has been shown that the occurrence of pleural effusion is an independent prognostic factor of poor outcome in patients with malignant pleural mesothelioma (MPM).
Aims: To identify the differential gene expression of proteins involved in fluid transport according to Gene Ontology (GO) annotation, in MPM patients, using a computational genomics approach.
Methods: We extracted the components of the “fluid transport” GO annotation in the AmiGO database. We then interrogated the gene expression profile of these components in the Oncomine Cancer Microarray database using the Gordon Mesothelioma study comparing MPM versus controls, in order to detect differentially expressed genes.
Results: In AmiGO, 44 genes were involved in "fluid transport". In the Gordon Mesothelioma study 37 of them were assessed. Out of these genes, ADCY7, AQP1, GNAS, PDPN, PRKAR1B and RAB11A were significantly over-expressed, whereas AQP4, AQP5, AQP6, AVPR2 and CFTR were significantly under-expressed in MPM patients. Deming regression analysis revealed significant associations among the gene expression profiles of AQP6 with AQP4, CFTR, AVPR2, GNAS and RAB11A; of AQP1 with PDPN and RAB11A; of GNAS with CFTR; of PDPN with PRKAR1B.
Conclusions: We identified 11 genes differentially expressed in MPM, never reported before. Our results warrant the experimental investigation of these genes in order to clarify their physiological role and their potential value as drug targets in MPM.
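For readers unfamiliar with Deming regression, which the Results use to relate pairs of expression profiles, the sketch below shows the standard closed-form estimate. Unlike ordinary least squares, it allows measurement error in both variables. The parameter `delta` (ratio of the error variance in y to the error variance in x) defaults to 1 as an assumption, since the abstract does not state the value used, and the data are arbitrary numbers rather than expression values.

```python
import math


def deming_fit(x, y, delta=1.0):
    """Return (slope, intercept) of the Deming regression of y on x.

    delta is the ratio of the error variance in y to the error variance in x;
    delta = 1 corresponds to orthogonal regression.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Arbitrary, noise-free example lying exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = deming_fit(x, y)
print(slope, intercept)
```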

A046 - L1000 data processing pipeline based on Fuzzy-c-means guided Gaussian mixture model peak calling method and its application to compound signature discovery

Short Abstract: This poster is based on Proceedings Submission 174.
Motivation: The L1000 platform is a novel genome-wide expression profiling approach that has produced a variety of expression profiles based on different cellular activities. An efficient and accurate data processing tool to deal with the large data set is needed, which could provide a solid foundation for further network construction.
Methods: We proposed a parallel L1000 data processing pipeline in this paper. A peak calling method, “fuzzy c-means guided Gaussian mixture model”, was developed and embedded in the pipeline. The pipeline provides the normalized gene expression data, as well as the log fold change data between the treated wells and control wells. The quality control step is divided into plate-based and well-based levels. Based on the gene expression data calculated by the pipeline, we used a proposed biclustering method, constrained sparseness non-negative matrix factorization (csNMF), to discover the compound signatures.
Results: Compared to the peak calling method on the LINCS website, which is based on the K-means algorithm, our method is more robust to noise. The pipeline is platform-independent, time efficient, and very suitable for large data set processing. All the L1000 data across 14 cell lines with thousands of treatments were processed using this pipeline. Using the processed data, we discovered 10 different compound signatures in the A375 cell line based on the csNMF algorithm. We also found the triple relations to explain the cellular activities after the compound treatments.

A047 - A database for curating the associations between killer-cell immunoglobulin-like receptors and diseases in worldwide populations

Short Abstract: Due to their involvement in the innate immune response and their high variability, killer-cell immunoglobulin-like receptor (KIR) genes have been associated with a large number of diseases, for example infectious diseases (e.g. HIV, malaria), autoimmune disorders (e.g. lupus, rheumatoid arthritis), cancer and pregnancy-related complications. There are a total of sixteen known KIR genes and individuals may carry variable combinations of these genes, as well as different alleles of genes. The variability across individuals presents challenges to researchers in identifying which sets of genes (single genes, genotypes and other combinations) and polymorphisms are truly responsible for the disease associations observed. Here we report the development of the KIR and Diseases Database (KDDB), capturing a large amount of data from published KIR and disease association studies. A semi-automated process was performed to identify relevant studies, followed by extensive manual curation to extract relevant data. The back-end of the database was developed using a relational database schema, and interactive web pages for querying and submitting data were developed using the ASP scripting environment and JavaScript language. The graphical display was designed using HTML and CSS. KDDB is accommodated within, and linked to, the Allele Frequency Net Database (AFND), which contains various immunogenetic-related resources – including very large collections of data on global allele and gene frequencies for healthy individuals. To date, KDDB contains more than a thousand KIR-disease records, comprising more than 50,000 individuals, thus providing a new community resource for understanding how KIR genes are associated with disease.
Database URL: http://www.allelefrequencies.net/diseases/

A048 - Biomedical Text Mining for Disease Gene Discovery

Short Abstract: Background: The sheer quantity of electronic literature makes it challenging for biologists to search biomedical corpora for any kind of desired information beyond simple text retrieval. We are developing a Google-like tool that, given a free-text query about a disease or disorder, returns a list of related genes.

Methods: Our tool is based on text mining. We use the “MetaMap portal” to index all the biomedical abstracts published in “PubMed” and acquire a corresponding set of “Unified Medical Language System (UMLS)” terms. Then for each gene recorded in “Entrez Gene”, we generate a keyword profile of UMLS terms based on the gene-abstract functional annotation provided by “GeneRIF”. Similarly for each user query, we generate another keyword profile based on the query-abstract annotation provided by “PubMed”. We decide how strongly a gene is linked to a given query by examining the fraction of shared keywords between their profiles.
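The scoring step at the end of the Methods can be sketched as below. The abstract does not specify how the fraction is normalized, so this sketch assumes it is taken relative to the query profile; the gene names, keywords and function names are placeholders, not real annotations or the tool's actual code.

```python
def shared_fraction(query_profile, gene_profile):
    """Fraction of the query's keywords that also occur in the gene's profile."""
    query, gene = set(query_profile), set(gene_profile)
    if not query:
        return 0.0
    return len(query & gene) / len(query)


def rank_genes(query_profile, gene_profiles):
    """Return (gene, score) pairs sorted from strongest to weakest link."""
    scores = {
        gene: shared_fraction(query_profile, keywords)
        for gene, keywords in gene_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder keyword profiles; real profiles would hold UMLS terms.
query = {"memory loss", "amyloid", "dementia", "neurodegeneration"}
genes = {
    "GENE_A": {"amyloid", "dementia", "plaque", "neurodegeneration"},
    "GENE_B": {"insulin", "glucose", "dementia"},
    "GENE_C": {"collagen"},
}
ranking = rank_genes(query, genes)
for gene, score in ranking:
    print(gene, score)
```

Other normalizations, such as dividing by the size of the union of both profiles, would fit the same description and could change the ranking.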

Results and Conclusions: For validation, we use the phenotype-gene annotation provided by the “Human-Phenotype-Ontology”. Preliminary results show mixed recall rates. For example, we achieved recall rates of 60% and 75% for Alzheimer’s disease and Holoprosencephaly, respectively, whereas for Diabetes Mellitus and Microcephaly the recall rates were 22% and 38%, respectively. The evaluation is still ongoing; nevertheless, we expect our tool to perform well, because we apply an extensive search of the literature and look for hidden evidence in order to link a gene to a given disease.

A049 - GEPETTO : Open-source framework for Gene Prioritization

Short Abstract: In the era of omics "Big data", and in particular next-generation sequencing (NGS), gene prioritization is a crucial task, involving the integration of huge amounts of heterogeneous data and the selection and analysis of genes predicted to be involved in a specific biological process, such as pathology. Large sets of genes must be evaluated, in order to score and rank them according to their similarity to known genes and their potential viability as candidates for important applications, such as diagnostic/prognostic markers, drug targets, etc. A customizable and extensible framework is needed for gene selection that can handle large-scale, public-private biological information. To our knowledge, no other open-source framework for gene prioritization has previously been developed.
GEPETTO (Gene Prioritization Extended Tool) is an original open-source framework, distributed under the LGPL license, for gene selection and prioritization on a desktop computer that ensures confidentiality of personal data. It takes advantage of the data integration capabilities in the SM2PH-Central knowledge base, combined with in-house developed gene prioritization methods. It currently incorporates six prioritization modules, based on gene sequence, protein-protein interactions, gene expression, disease-causing probabilities, protein evolution and genomic context.
GEPETTO is written in Java/Python and supported by an advanced modular architecture, which means that it can be easily modified and extended to include alternative scoring methods and new public/private data sources. In the future, we intend to extend the system to variant prioritization, by exploiting the variant data in the MSV3D database. The software is available at http://sourceforge.net/projects/gepetto/files or http://decrypthon.igbmc.fr/sm2ph/cgi-bin.gepetto.

A050 - Computational transcriptomic analysis reveals a significant role for PARK7 and for ESC/E(Z), Sin-3, NuRD and PcG protein complexes in malignant pleural mesothelioma.

Short Abstract: Background: Over-expression of Parkinson protein 7 (PARK7) in various neoplasms is linked with enhanced neoplastic cell survival, metastatic and relapse incidence, chemoresistance and patient survival. To date, a possible implication of PARK7’s expression in malignant pleural mesothelioma (MPM) has not been evaluated.
Aims: The aim of our study was to assess the differential gene expression of PARK7 and its interactors in MPM using data mining techniques in order to identify novel candidate genes that in conjunction with PARK7 may play a role in the pathogenicity of MPM.
Methods: We constructed the gene Network of PARK7 using the ConsensusPathDB database. We then interrogated the Oncomine Cancer Microarray database using the Gordon Mesothelioma study, in order to detect the differentially expressed genes of the PARK7 network. Subsequently, we performed prediction of Gene Ontology (GO) annotations enrichment for PARK7 network regarding biological processes using the GeneMANIA algorithm.
Results: In ConsensusPathDB, 38 protein interactors of PARK7 were identified. In the Gordon Mesothelioma study, 34 of them were assessed. Out of these genes, PARK7, SOD1, SUMO1, UBC3, PIAS2, KIAA0101, HDAC2, DAXX, RBBP4, BBS1, NONO, RBBP7, HTRA2, STUB1 and HSP70 were significantly over-expressed, whereas TRAF6 and MTA2 were significantly under-expressed in MPM patients. GO annotation enrichment revealed significant roles of PARK7 gene network in ESC/E(Z), Sin-3, NuRD and PcG protein complexes.
Conclusions: We identified 17 novel genes differentially expressed in MPM, never reported before. We also identified the predicted biological processes that PARK7 network is involved in and we report novel pathways involved in MPM disease.

A051 - Regression versus Classification Based Approaches for Epistasis Detection in Case-Control GWAS

Short Abstract: Epistatic interactions between genotyping variables are essential in explaining phenotypes, especially complex diseases, in genome-wide association studies. However, the practical detection of epistatic loci based on SNP genotyping arrays poses both computational and statistical challenges, resolution of which requires development of novel techniques. In this study, we focus on comparing two epistasis detection methods, specifically, a recently developed fast method based on ROC curves for classification, and the classical regression based Fisher’s method.
Our presentation is based on an exhaustive search for epistasis in case-control studies. The analysis is based on publicly available GPU-based algorithms: for the classification-based method, we employed the GWIS algorithm [1], available via a web server, while for the regression-based method, we employed the GBOOST algorithm [2], with source code downloadable from the web. Both algorithms were run on the same hardware and on identical SNP array datasets, including seven WTCCC datasets and two Celiac-disease datasets. Our analysis focuses in particular on the issue of timing and replication of the results with respect to the same disease.
In summary, we find significant overlap of results between those approaches, but also some important differences and complementarity.

[1] B. Goudey, D. Rawlinson, Q. Wang, et al., GWIS - Model-free, Fast and Exhaustive Search for Epistatic Interactions in Case-Control GWAS, BMC Genomics 2013, accepted

[2] Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309-1310

A052 - In-silico ranking candidate molecules according to their biomarkability/druggability properties

Short Abstract: The study of biomarkers has gained momentum in part due to advances in high-throughput technologies. A biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. A system to rank and evaluate the biomarker properties of molecules in different conditions is therefore important.

In this work we present a method to score potential molecular biomarkers in silico using different biological criteria, with the aim of sorting those molecules that present better biomarker properties. This work started by selecting druggable proteins from the DrugBank database and characterizing them on the basis of different biological features. Among the properties studied, we considered the identification of the corresponding gene/peptide in plasma, hydrophobicity, length, percentage and types of amino acids in the protein, molecular weight, isoelectric point, secondary structure and post-translational modifications.

Feature selection methods (filter and wrapper) were used to determine and score those properties that better discriminate between the druggable proteins and a control set selected randomly from the genome. Finally, to determine the importance of each biological feature, machine learning algorithms were used to weight each criterion according to its discriminative capacity. The final step is the ranking of molecules on the basis of the presence or absence of those weighted properties.
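The final ranking step can be sketched as a weighted sum over the properties a molecule exhibits. The property names, weights and molecules below are all hypothetical; in the described work the weights would come from the feature selection and machine learning steps.

```python
def weighted_score(properties, weights):
    """Sum the weights of all properties that are present for a molecule."""
    return sum(weight for name, weight in weights.items() if properties.get(name))


def rank_molecules(molecules, weights):
    """Return (molecule, score) pairs sorted from best to worst score."""
    scored = [
        (name, weighted_score(props, weights)) for name, props in molecules.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical weights; higher means more discriminative.
WEIGHTS = {"detected_in_plasma": 0.5, "has_ptm": 0.3, "low_hydrophobicity": 0.2}

# Hypothetical molecules described by presence or absence of each property.
MOLECULES = {
    "mol_1": {"detected_in_plasma": True, "has_ptm": False, "low_hydrophobicity": True},
    "mol_2": {"detected_in_plasma": False, "has_ptm": True, "low_hydrophobicity": False},
    "mol_3": {"detected_in_plasma": True, "has_ptm": True, "low_hydrophobicity": True},
}

ranking = rank_molecules(MOLECULES, WEIGHTS)
for name, score in ranking:
    print(name, round(score, 2))
```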

A053 - A Bioinformatics Web-Platform for Integration and Analysis of High-Throughput Screening and Profiling Data in Cancer Research

Short Abstract: Systematic, non-hypothesis driven high-throughput approaches are an appropriate strategy for discovering new starting points for personalized anti-cancer therapies. Within the NanoCAN center, a joint project at the University of Southern Denmark, we have been building up one of Northern Europe’s largest screening platforms. Our primary intention for this platform is the discovery of new personalized treatment options in cancer based on cancer stem cell eradication. To achieve this, we facilitate comparative genome-wide screens for siRNA and miRNA lethality using established cancer cell-lines. Subsequently, we obtain proteomics data using reverse-phase-protein array (RPPA) technology. This approach allows us to measure protein expression levels of several known cancer markers in parallel. High-throughput-screening (HTS) and RPPA data each provide challenges in data processing and visualization. Furthermore, the amount of data generated demands a systematic approach for sample tracking.

Here, we present a set of tools addressing these issues in an integrated bioinformatics platform. Samples can already be tracked between screens and proteomics through an intuitive web interface. In addition, these tools enable researchers to work on these data by exporting them to the R statistical environment, where functions for processing, analysis, and visualization are made available. A variety of existing methods have been customized or wrapped to work on our data model, whereas missing functionality has been implemented. We envision that the combined analysis of HTS and RPPA data with the proposed platform will make a significant contribution in gaining a better understanding of the complex gene relations that lead to cancer development.

A054 - Improving HIV-1 coreceptor usage prediction using marginalized multi-instance kernels

Short Abstract: Approved anti-retroviral drugs against HIV-1 can only reduce viral load but not cure a patient. Due to the high mutation rate of HIV-1, many different variants can emerge in infected patients under therapy, possibly leading to drug-resistant variants.

The only entry inhibitor approved for patient treatment, Maraviroc, blocks a certain coreceptor that can be found on CD4 cells. HIV-1 variants exist that need exactly this coreceptor in an accessible form to enter the cell. Since some HIV-1 variants can utilize another coreceptor to enter CD4 cells, testing which coreceptor the viral population of an HIV-1 infected patient can use is one prerequisite for the prescription of Maraviroc.

Recent improvements in predicting coreceptor usage (tropism) from HIV-1 V3 loop sequences have increased interest in these methods in clinical practice. The first prediction models were based on data from Sanger sequencing of the V3 loop of HIV-1. Recently, a method based on next generation sequencing (NGS) data was introduced that extends previous approaches by modeling the prediction problem on the patient level taking the information of all reads from NGS data jointly into account. The method is based on an SVM with the normalized set kernel. We extend this approach by using a marginalized multi-instance kernel together with the original geno2pheno[coreceptor], which leads to a significant improvement in the prediction performance of Sanger-based sequencing data.
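To make the patient-level idea concrete, the sketch below computes a normalized set kernel between two bags of sequences, where each bag stands for all reads from one patient. It is a toy under stated assumptions: the base kernel is a simple fraction of matching positions on equal-length strings, the normalization is the feature-space form K(X,Y)/sqrt(K(X,X)K(Y,Y)), and the sequences are invented. It does not reproduce the marginalized multi-instance kernel or geno2pheno[coreceptor].

```python
import math


def match_kernel(a, b):
    """Toy base kernel: fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("this toy kernel expects equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def set_kernel(bag_x, bag_y):
    """Sum the base kernel over all pairs of instances from the two bags."""
    return sum(match_kernel(x, y) for x in bag_x for y in bag_y)


def normalized_set_kernel(bag_x, bag_y):
    """Set kernel scaled so that every bag has similarity 1 with itself."""
    return set_kernel(bag_x, bag_y) / math.sqrt(
        set_kernel(bag_x, bag_x) * set_kernel(bag_y, bag_y)
    )


# Invented toy "reads"; each list is one patient's bag of sequences.
patient_x = ["ACGT", "ACGA"]
patient_y = ["ACGT"]
patient_z = ["TTTT"]

k_xy = normalized_set_kernel(patient_x, patient_y)
k_xz = normalized_set_kernel(patient_x, patient_z)
k_xx = normalized_set_kernel(patient_x, patient_x)
print(round(k_xy, 4), round(k_xz, 4), round(k_xx, 4))
```

A kernel matrix built this way over all patients could be passed to an SVM that accepts precomputed kernels. A marginalized multi-instance kernel would additionally weight instances rather than treating all reads equally.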

This improvement is important for laboratories without next generation sequencing capabilities, which is still the case for most of the facilities responsible for the genetic tropism test.

A055 - A Catalog of Cytokine SNPs and Their Association with Specific Cancer Types

Short Abstract: Cancer cases have increased tremendously in recent years, and cancer is regarded as the disease of today’s world. However, the success rate of cancer treatment has not kept pace with the rise in cancer cases. As a result, cancer continues to be one of the top causes of death worldwide. Currently, many studies focus on identifying disease-related biomarkers for predicting susceptibility and for early detection and prevention of cancer, in addition to developing new therapeutic approaches. In our study, we investigated the single nucleotide polymorphisms (SNPs) of human cytokines, which play an important role in the immune system, as potential cancer biomarkers. Our main aim was to identify cancer-causing genetic variations in known human cytokines and then to identify associations between cytokine variations and specific cancer types. In particular, data mining approaches were used to map SNPs to known cytokine genes and to extract SNPs associated with specific cancer types from various biological databases. By gathering these data, a new relational biological database was developed. Genome-Wide Association Studies (GWAS) resources were used to validate associations between cytokine SNPs and specific cancer types. This relational database enabled us to search with different parameters and to analyze the associations from different aspects. As a result, we compiled a catalog of cytokine SNPs and their associations with specific cancer types, including the cytokine genes most often linked to cancer and their polymorphisms, the common cancer-causing SNPs, and the cytokine genes with high numbers of SNPs.

A056 - Machine Learning based Evaluation of Peak Detection Methods for MCC/IMS Measurements in Clinical Breath Studies

Short Abstract: The combination of multi-capillary columns and ion mobility spectrometry (MCC/IMS) has become an established, inexpensive, non-invasive bioanalytical technology for detecting volatile organic compounds (VOCs) in human exhaled air. To pave the way for this technology towards daily usage in medical practice, different steps still have to be taken. The goal of modern biomarker research is to distinguish "healthy" from "not healthy" patients on the basis of volatile metabolite patterns within human breath. The increasing number of measurements requires automated data analysis and classification to support physicians by proposing the most likely patient’s condition.
Although sophisticated machine learning methods exist, reliable and robust peak detection without manual intervention is an essential preprocessing step. When we compared four state-of-the-art automated peak detection approaches with the manual gold standard, all methods performed equally well in terms of classification error. In contrast, the approaches showed differences in their robustness against perturbations and overfitting. Nevertheless, in medical studies manual peak picking remains the gold standard.
However, the trade-off between a slightly higher accuracy (manual) and a huge increase in processing speed (automatic) has to be considered carefully. We conclude that all methods, though small differences exist, are largely reliable and enable a wide spectrum of real-world biomedical applications in clinical breath studies.

A057 - Linked2Safety - Providing pharmaceutical companies, healthcare professionals and patients with an innovative semantic interoperability framework facilitating the efficient and homogenized access to distributed EHRs

Short Abstract: The European healthcare information space is fragmented due to the lack of legal and technical standards, cost effective platforms, and sustainable business models. Linked2Safety (288328) is an FP7 project funded by the European Commission under the area of ICT for health. The vision of the project is to advance clinical practice and accelerate medical research, by providing pharmaceutical companies, healthcare professionals and patients with an innovative semantic interoperability framework facilitating the efficient and homogenized access to anonymized aggregated distributed EHRs, while conforming to all legal and ethical issues. The Linked2Safety platform consists of the following spaces: i) the data cube generation space, ii) the interoperable EHR data space, iii) the linked medical data space and iv) the genetic analysis space. The project uses aggregated data in the form of data cubes along with cell suppression and perturbation as further security measures for preventing the re-identification of patients. The data analysis space provides a number of algorithms for performing quality control, feature selection, data mining and single hypothesis testing on the data. It also allows the combination of these algorithms and the creation of workflows that can be applied on the data of interest. For demonstrating the impact of the outcomes of the project, three usage scenarios will be used: subject selection for a phase III clinical trial, phase IV post authorization clinical trial and identification of relations between molecular fragments and specific adverse event types.

A058 - Scientific Lenses: An Approach to Dynamically Vary the Relationship between Datasets

Short Abstract: Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. For example, searches for the chemical “Fluvastatin” on ChemSpider and DrugBank return different compounds: although their basic chemical structure matches, the compounds differ in their stereochemistry. Under some contexts, a user would deem it appropriate to relate these records for a specific task, whereas under another context the same user would deem it inappropriate to relate them for the same task. Similar problems occur when searching for proteins: should you return information for a specific species or across species? Can you accurately distinguish between genes and proteins?

We present an approach for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. Data integration in Linked Data relies on equality links between resources across different datasets. We have extracted multiple sets of links between several key data sets that hold under different contexts. The scientific lenses vary the links that are activated between the datasets, which affects the data returned for the user’s search. Lenses can be layered on top of each other to give combined effects. For example, the stereochemistry lens, which matches at the stereoisomer level, can be used in conjunction with the cross-species proteins lens.
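One way to picture lenses is as named link sets that can be switched on and combined. The sketch below is an assumption-laden toy, not the authors' system: the lens names, resource identifiers and the `equivalents` function are hypothetical, and real lenses would operate on link sets between Linked Data resources.

```python
from collections import deque

# Hypothetical link sets, one per lens. Each pair is an equality link
# that holds only when the corresponding lens is active.
LINKSETS = {
    "structure_only": {("chemspider:A", "drugbank:B")},
    "cross_species": {("protein:human_X", "protein:mouse_X")},
}


def equivalents(resource, active_lenses, linksets=LINKSETS):
    """Return all resources treated as equivalent under the active lenses."""
    links = set()
    for lens in active_lenses:
        links |= linksets[lens]  # layering lenses unions their link sets

    # Follow the activated links in both directions, breadth first.
    seen = {resource}
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for a, b in links:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return seen


strict = equivalents("chemspider:A", [])
relaxed = equivalents("chemspider:A", ["structure_only"])
layered = equivalents("protein:human_X", ["structure_only", "cross_species"])
print(sorted(strict), sorted(relaxed), sorted(layered))
```

With no lens active a search stays on the original record; activating a lens widens the set of records considered equivalent, which changes what the search returns.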

A059 - Neurocarta: aggregating and sharing disease-gene relations for the interpretation of genomics studies

Short Abstract: Understanding the genetic basis of diseases is key to the development of better diagnoses and treatments. Unfortunately, only a small fraction of the existing data linking genes to phenotypes is available through online public resources and, when available, it is scattered across multiple access tools. Neurocarta (http://neurocarta.chibi.ubc.ca) is a knowledgebase that consolidates information on genes and phenotypes across multiple resources and allows tracking and exploring of the associations. The system enables automatic and manual curation of evidence supporting each association, as well as user-enabled entry of their own annotations. Phenotypes are recorded using controlled vocabularies (e.g. Disease Ontology) to facilitate computational inference and linking to external data sources. Neurocarta currently holds more than 30,000 lines of evidence linking over 7,000 genes to 2,000 different phenotypes including in-depth annotations on neurodevelopmental disorders. We are currently assessing various schemes to categorize the evidence from different sources according to an intuitive tiered “star system”, in which evidence from animal models or transcriptome studies would get a single star, whereas very strong direct evidence from human studies would obtain five stars. The real power of this approach will be seen when evidence is integrated across sources to provide a single score for each gene-phenotype association. In summary, Neurocarta is a one-stop shop for researchers looking for candidate genes for any disorder of interest and particularly for those working on brain development. We believe that quality-based evidence aggregation will drastically change the way investigators can use gene-phenotype information in the interpretation of genomics data.

A060 - A Mathematical Model of Kinetic Analysis to Estimate Tumor Growth and Metastatic Progression

Short Abstract: Host immune response is a critical determinant of the progression and therapeutic outcome of cancer. However, the immune-tumor interaction is too complicated to be addressed by regular molecular and biochemical methods. In this project, we aim to develop mathematical models to analyze quantitative data from a mouse model of Lewis Lung Carcinoma (LLC) progression and metastasis. We have established a mouse model of spontaneous metastasis, allowing the quantitative tracking of disease progression and therapeutic responses and analysis of the isolated metastatic tissues. We are applying quantitative methodologies to build mathematical models to analyze the data collected from the LLC mouse tumor model. A traditional set of Gompertzian formulae and a set of modified empirical formulae were developed to delineate the interactions between immune cells and tumor cells in the process of metastasis. The preliminary results demonstrated that kinetic parameters of tumor growth determine therapeutic response, suggesting that tumor growth kinetics may predict clinical outcomes of cancer treatment. The upcoming results may be used to identify potential prognostic markers, which may accelerate identification of therapeutic targets for metastatic diseases. We will further test whether these results are good predictors of the survival rate and lifespan of the affected individuals.
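
To illustrate the Gompertzian growth mentioned above, here is a minimal Python sketch of the textbook closed form V(t) = K * (V0/K)^exp(-alpha*t). The function name, parameter names and numeric values are illustrative assumptions; this is not the authors' fitted model, nor their modified empirical formula.

```python
import math


def gompertz_volume(t, v0, k, alpha):
    """Tumor volume at time t under the standard Gompertz law.

    v0    -- initial volume (hypothetical units)
    k     -- plateau volume (carrying capacity)
    alpha -- rate at which growth decelerates
    """
    return k * (v0 / k) ** math.exp(-alpha * t)


if __name__ == "__main__":
    # Illustrative parameters only; not taken from the LLC mouse data.
    for day in (0, 7, 14, 28):
        print(day, round(gompertz_volume(day, v0=1.0, k=1000.0, alpha=0.1), 2))
```

Growth starts at v0, decelerates over time and saturates at k, which is the qualitative behavior that makes kinetic parameters such as alpha candidates for comparison with therapeutic response.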

A061 - Applying next generation sequencing to detect mutations in the thyroid cancer

Short Abstract: Papillary and follicular thyroid cancers are the most common carcinomas of the thyroid gland. Many of these tumors carry one of the known genetic alterations. The most frequent are point mutations of the BRAF and RAS genes and rearrangements of RET or PPARG. There is also a significant number of cases in which none of the known alterations is present. Applying next generation sequencing to thyroid cancer samples gives us an opportunity to detect other, unknown mutations that may play a role in the development of these tumors.
Up to now, we have collected two follicular and two papillary thyroid cancer samples. We sequenced their transcriptomes by the single-end method using the Illumina Genome Analyser IIx. After alignment to the hg19 reference genome, we detected genetic variants with samtools/bcftools and GATK. As we were looking for mutations, we selected the variants that were absent from dbSNP and were potentially deleterious according to the SIFT value.
Mutations were detected in all the samples, but only one mutation was present in more than one sample. We chose the four most interesting mutations and validated them on the same samples using Sanger sequencing. We successfully confirmed all of them. Validation on a larger independent group of various cancers will be performed to determine whether these mutations are specific to thyroid cancers.
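
The filtering step described above (variants absent from dbSNP and flagged as potentially deleterious by SIFT) can be sketched as follows. The record layout is hypothetical, and the 0.05 cutoff is the conventional SIFT threshold, assumed here because the abstract does not state one.

```python
from dataclasses import dataclass
from typing import List, Optional

# Conventional SIFT cutoff (scores <= 0.05 are predicted deleterious).
# Assumed for illustration; the abstract does not give the value used.
SIFT_DELETERIOUS_MAX = 0.05


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    dbsnp_id: Optional[str]      # None when the variant is absent from dbSNP
    sift_score: Optional[float]  # None when no SIFT prediction is available


def candidate_mutations(variants: List[Variant]) -> List[Variant]:
    """Keep variants that are not in dbSNP and that SIFT predicts as deleterious."""
    return [
        v for v in variants
        if v.dbsnp_id is None
        and v.sift_score is not None
        and v.sift_score <= SIFT_DELETERIOUS_MAX
    ]
```

Variants without a SIFT prediction are dropped in this sketch; whether the original pipeline treated them the same way is not stated.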

A062 - A Decision Support Model Based on Integration of Genomic and Clinical Findings for the Differential Diagnosis of Alzheimer’s Disease

Short Abstract: SNPs are DNA sequence variations that are distributed throughout the whole genome. Many SNPs are associated with susceptibility to complex diseases such as diabetes, heart diseases, joint illnesses, schizophrenia, or Alzheimer's disease (AD). Most chronic diseases are multifactorial, and might be explained by the combined effects of SNPs at different genomic locations. There is a need to determine disease-associated SNP subsets and to analyze genotyping data together with clinical findings in order to provide practical and affordable methodologies for the prediction of diseases in clinical settings. So far, there is no established approach for selecting the representative SNP subset and patients’ clinical data. Data mining methodology, which is based on finding hidden and key patterns in huge databases, has the highest potential for extracting knowledge from genomic datasets and for selecting the SNPs and clinical features that are most informative and relevant for clinical diagnosis. In this study we have applied one of the most widely used data mining classification methodologies, the “decision tree”, to associate the SNP biomarkers and clinical data provided by the dbGAP’s late-onset AD GENADA Study. Different tree construction parameters have been compared for optimization, and the most efficient and accurate tree for predicting AD is presented. Besides the development of the integrated Alzheimer’s model, the molecular etiology of AD based on our GWAS findings and strategies to further improve the prediction performance of the disease model will be discussed.

A063 - Investigation of Schizophrenia Related Genes And Pathways Through Genome Wide Association Studies

Short Abstract: Schizophrenia is a complex mental disorder that is commonly characterized as a deterioration of intellectual processes and emotional responses and affects 1% of any given population. Here we have investigated schizophrenia related genes and pathways through genome wide association studies (GWAS). The NonGAIN genotyping dataset for 1,385 controls and 1,576 cases was obtained from dbGAP of NCBI. We have used several well-known public bioinformatics tools, like PLINK, as well as the recently published METU-SNP software, which enable users to filter and reduce the dataset into a statistically significant and biologically relevant subset. Out of 909,622 SNPs, analysis of the dbGAP’s NonGAIN data identified 25,555 SNPs with a p-value lower than 5x10^-5. Next, a combined p-value approach to identify associated genes and pathways and AHP based prioritization to select biologically relevant SNPs with high statistical association were applied through the METU-SNP software. 6,000 SNPs had an AHP score above 0.4 and mapped to 2,500 genes, which are suggested to be associated with schizophrenia and related conditions. In order to identify novel genes and/or pathways related to schizophrenia, the DAVID and Reactome tools and the GeneMANIA tool were used for pathway and network discovery, respectively. In addition to previously described neurological pathways, pathway and network analysis showed enrichment of two pathways: melanogenesis and vascular smooth muscle contraction. The overall analysis of all highly related biological pathways revealed organization of all associated genes into one biological network, which might reveal further information on the molecular etiology of schizophrenia and other psychotic disorders.

A064 - Classification of disease subtypes based on genome-wide microRNA expression signatures

Short Abstract: MicroRNAs (miRNAs) are non-coding RNAs of 20-23 nucleotides in length involved in the regulation of multiple cellular processes through repression of target genes. Recent studies demonstrate that miRNAs are involved in key processes in tumorigenesis such as inflammation, cell cycle regulation, differentiation, invasion and apoptosis, and are differentially expressed in different disease states. In addition, miRNAs are stable and tissue/state specific, which makes them potential biomarkers. However, methodologies for high-throughput miRNA quantification and analysis are still developing. The main methods currently available for genome-wide miRNA profiling are based on qRT-PCR arrays or sequencing [1]. In this poster we use geNetClassifier, a new R/Bioconductor package (www.bioconductor.com), to classify subtypes of cancer based on miRNA expression signatures and to find potential miRNA biomarkers for each of the pathological subtypes. In addition, geNetClassifier also provides a network of miRNAs associated with each disease.
This method has been applied on three cancer datasets, each of them based on a different technology: Applied Biosystems TaqMan MicroRNA Arrays, Affymetrix GeneChip miRNA Arrays and Illumina miRNASeq. The results reveal the value of miRNA signatures to identify cancer subtypes.

[1] De Planell-Saguer M and Rodicio MC. Detection Methods for microRNAs in Clinic Practice. Clinical Biochemistry (2013) doi: 10.1016/j.clinbiochem.2013.02.017

A065 - Mining Functional Dependencies Between Genes in Triple Negative Breast Cancer

Short Abstract: Histologically similar cancers, such as breast cancers, have large differences at the molecular level which influence prognosis and treatment. Currently, clinical assessment of breast cancer is done using three cell membrane receptors (estrogen receptor, progesterone receptor and HER2). Tumors where none of these receptors is active are called triple-negative breast cancer (TNBC), and this subtype is the most aggressive and insensitive to current treatments. To identify key genes that drive TNBC we have developed a novel framework that allows integration of existing knowledge of signaling pathways with expression data. This enables analysis of individual TNBC samples at the network level. In particular, our approach allows us to identify cancer-specific altered signaling pathways and to provide putative targets for treating TNBC. Here we used 55 TNBC primary tumors and generated 55 individualized gene regulation networks. Our results show that there are large differences between the numbers of differentially expressed genes in the individual gene regulatory networks. Furthermore, TNBC has the largest number of differentially expressed genes among all subtypes. By mining common regulations in the individual networks we identified 56 differentially expressed genes that can distinguish TNBC from other breast cancer subtypes. We not only found consistently differentially expressed genes in TNBC patients, but also identified subtypes inside TNBC.

A066 - GRANATUM-LiSIs: Making complex in silico predictive models accessible to wet biologists

Short Abstract: Chemoprevention research aims to find drugs or natural substances that prevent the occurrence of a particular disease and to elucidate their mechanism of action. The discovery of novel chemopreventive agents is severely hampered by the lack of high throughput assays to screen promising chemical compounds quickly and reliably.

We present LiSIs, a platform developed in the context of the cross-disciplinary GRANATUM project (http://www.granatum.org) aiming to bridge the gap between biomedical researchers by ensuring their seamless access to the globally available information needed to perform complex experiments and to conduct studies on large-scale datasets.

LiSIs aims to provide cancer chemoprevention experts with a set of online tools to create, update, store and share virtual screening Scientific Workflows (SWs) for the discovery of new chemopreventive agents. LiSIs is based on the Galaxy platform, and is freely available via a web interface through a password protected, tiered login process, providing different levels of access to platform functionalities based on the user profile. “Regular” users are able to assemble SWs utilizing available in silico models and tools. “Power” users may build new models and tools through the development of custom SWs. Workflows execute on the system server and results are stored in the user’s GRANATUM workspace, enabling accessing, manipulating or sharing SWs, datasets and results with other users. The current version of LiSIs provides, among others, trained models for predicting cytotoxicity and estrogen receptor binding activity with accuracies on par with recent literature.

This work has been partially supported through the EU-FP7 GRANATUM project, contract number ICT-2009.5.3.

A067 - Prioritization of obesity candidate genes using reverse GWAS

Short Abstract: This poster is based on Proceedings Submission xxx.
With the development of molecular-genetic techniques that led to big datasets, more than 2000 loci (quantitative trait loci (QTL), protein coding and microRNA genes) have been associated with obesity. The current version of the Obesity genomic atlas includes 1515 protein coding genes and 221 microRNA genes. Prioritization of these loci is necessary to develop diagnostic molecular markers with an effect on obesity. For candidate gene prioritization, a reverse genome wide association study (reverse GWAS) was performed. The data for three phenotypic groups were downloaded from the Mouse Phenome Database (MPD): body composition (145 traits), body weight, size and growth (73 traits) and body fat pads (26 traits). Four inbred mouse strains with extreme values were selected for association studies. The analyses were performed between 70 calculated allele combinations and 221 obesity related traits. Genotypes for 1086 non-synonymous SNPs within obesity candidate genes were extracted from the MPD database and matched with results from the association studies. The candidate genes were prioritized on the basis of marker effect estimation. The analyses resulted in a list of polymorphisms significantly associated with obesity related traits. These polymorphisms were visualized on a genomic view and overlapped with all known obesity related QTL, protein coding and non-coding (ncRNA) genes. Using this approach, genomic regions with the highest marker effect on obesity traits were determined. Results of this study will contribute to the development of biomarkers for diagnostics and therapeutic targets of obesity in humans as well as body composition control in other mammals.

A068 - Enabling docking-based virtual screening within the GRANATUM-LiSIs platform

Short Abstract: Protein-ligand docking methods complement property-based predictive models in virtual screening of large compound databases for drug discovery and, recently, cancer chemoprevention research. However, the wide adoption of these methods by wet biologists is often hampered by the lack of user-friendly docking tools that are seamlessly integrated with other components of virtual screening pipelines, not to mention the need to transform relevant data between different formats.

Within the freely available LiSIs online scientific workflow environment (a component of the GRANATUM project; http://www.granatum.org), we have integrated user-friendly tools enabling expert and novice users to incorporate protein-ligand docking as a component in virtual screening experiments. Users can benefit from the LiSIs toolbox, which has been developed to assist researchers involved in the Cancer Chemoprevention sector. More specifically, generic tools are made available to (i) create, share, update and store Workflows and data, (ii) pre-process/transform data regarding ligands (e.g. generate 3D coordinates, transform between chemical formats), (iii) pre-process the protein target structures (e.g. remove water atoms, define the docking site location/size) and prepare the necessary files for docking, and (iv) perform docking of a collection of small molecules against a pre-processed protein target. Provision of auxiliary functionalities is based on established open-source software packages (e.g. RDkit, OpenBabel), whereas docking is powered by AutoDock Vina.

The LiSIs environment has been used to perform docking-based virtual screening experiments on several publicly available compound libraries against target structures, which are now validated experimentally.

This work has been partially supported through the EU-FP7 GRANATUM project, contract number ICT-2009.5.3.

A069 - Genomic Profiling of Melanoma Resistance to Combination Chemotherapy

Short Abstract: Based on a combinatorial drug screen, we chose a number of two-drug combinations to explore mechanisms of synergistic cytotoxicity in BRAF-mutant, NRAS-mutant or wildtype melanoma cell lines. To enumerate the differing contexts wherein responses occur, all lines underwent: whole-exome resequencing to identify nonsynonymous, function-altering SNV/indels; high-density SNP genotyping to infer ancestry and copy number aberrations; CpG methylation profiling to measure epigenetically-silenced loci; and transcriptional profiling of basal gene expression. To identify functional response to combinatorial treatment, a subset of cell lines was subjected to high-throughput proteomic and transcriptomic assays. While each genomic dataset reveals molecular similarities and clusterings, no correlation is seen across modalities; e.g. CpG methylation patterns do not correlate with basal gene expression patterns. Similarly, BRAF/NRAS-mutant status does not correlate with transcriptional or methylation states. With Sorafenib/Diclofenac responses, associated with induction of apoptosis and cell death, which we call emergence. In contrast, we observe a pan-tumor increase of PLX4720 (a vemurafenib analog) response in combination with lapatinib, which we call potentiation. Increased responsiveness to PLX4720 is not universal, however, and does not correlate with driver mutation status, HER2 receptor expression, or any of the various genomic clusterings; the subset of genes whose differences in basal expression most strongly correlate with combination responses is strongly enriched for genes involved in metabolism, though this does not provide a mechanistic understanding of potentiation. We are now undertaking a systems-based network approach to better integrate and interpret tumor-specific responses in the context of mutational, epigenetic, and transcriptional landscapes specific to each cell line.

A070 - Accurate Genome-Wide Survival Analysis of Somatic Mutations in Cancer

Short Abstract: Deriving clinical utility from genomic datasets requires the identification of statistically significant associations between genomic measurements and a clinical phenotype. An important instance of this problem is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. The most widely used statistical test for comparing the survival time of two (or more) classes of samples is the nonparametric log-rank test. Nearly all implementations of the log-rank test rely on the asymptotic normality of the test statistic. However, this approximation gives poor results in many high-throughput genomics applications where: (i) the populations are unbalanced; i.e. the population containing a given variant is significantly smaller than the population without that variant; (ii) one tests many possible variants and is interested in those variants with very small p-values that remain significant after multi-hypothesis correction.

Our contributions are: (1) We show empirically that the inaccuracy of the asymptotic approximation for the log-rank test results in a large number of false positives in cancer genomics applications due to unbalanced populations. (2) We develop and analyze a novel algorithm for computing the p-value of the log-rank statistic for any range of population sizes under the exact distribution. (3) We demonstrate the practicality and accuracy of our approach by testing the algorithm on somatic mutation data from The Cancer Genome Atlas (TCGA). In particular, a number of mutations statistically significant to survival, many of which are supported by the literature (e.g., BRCA2 in ovarian serous adenocarcinoma), are identified only using our algorithm.
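
To make the contrast with the asymptotic approximation concrete, the following Python sketch computes a two-sided p-value for a log-rank-type statistic by brute-force enumeration of every possible assignment of samples to the small (mutated) group. This is an illustration of an exact permutation test only, not the authors' algorithm: it uses the unnormalized log-rank numerator (observed minus expected events) as the test statistic, all names are hypothetical, and the enumeration is feasible only for small cohorts.

```python
from itertools import combinations


def logrank_numerator(times, events, in_group):
    """Sum over event times of (observed - expected) events in the flagged group."""
    total = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for i in at_risk if in_group[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d1 = sum(1 for i in at_risk
                 if times[i] == t and events[i] and in_group[i])
        total += d1 - d * n1 / n
    return total


def exact_permutation_pvalue(times, events, group_indices):
    """Two-sided p-value from enumerating all groups of the same size."""
    n = len(times)

    def stat(members):
        flags = [i in members for i in range(n)]
        return abs(logrank_numerator(times, events, flags))

    observed = stat(set(group_indices))
    hits = total = 0
    for combo in combinations(range(n), len(group_indices)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if stat(set(combo)) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With six patients and a mutated group of two, only 15 assignments exist, so the smallest attainable p-value is 1/15. A normal approximation can report far smaller values in such unbalanced settings, which is the source of the false positives the abstract describes.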

A071 - Exploring the influence of diet and lifestyle in body weight changes from large questionnaire data sets

Short Abstract: Various environmental exposures, such as diet and physical activity may be involved in the complex mechanisms of the alteration in the quantitative levels of body weight, BMI and body fat distribution. The key objective of this study is to understand the relation between diet and subsequent body weight changes.

Data on dietary composition, lifestyle habits and anthropometric measurements were obtained from 57,000 Danish women and men aged 50-64 years. We applied the newly established method ´Compass´ to identify significant associations within a large dataset of obese subjects and environmental factors in diet and lifestyle (hundreds of variables). This approach is a hybrid of two existing methods: Self-Organizing Maps (SOM) and Association Mining (AM). We utilize SOM as the initial step to reduce the search space, and then apply AM in order to find association rules. This procedure allows for recognizing local patterns in sub-populations and is able to generate variable groups which act as “hotspots” for statistically significant associations.

From the questionnaires, we successfully generated a number of interesting association rules, which relate particular dietary compositions, e.g., a combination of quantitative levels of vegetable and meat intake, with specific aspects of obesity such as BMI, weight gain/loss or changes in waist circumference. The strength of the association rules can be assessed globally by confidence scores and significance calculations. An important aspect of the method is that it scales well and can handle large cohorts.
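
As a small illustration of how the strength of an association rule can be scored, the sketch below computes the standard support and confidence of a rule "antecedent -> consequent" over a list of subjects, each represented as a set of discretized attributes. The attribute names are invented for the example, and the SOM step that narrows the search space is not reproduced here.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    records    -- list of sets of discretized attributes, one set per subject
    antecedent -- attributes on the left-hand side of the rule
    consequent -- attributes on the right-hand side of the rule
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    # Hypothetical questionnaire-derived attributes, for illustration only.
    subjects = [
        {"veg_high", "meat_low", "weight_loss"},
        {"veg_high", "meat_low", "weight_loss"},
        {"veg_high", "meat_low", "weight_gain"},
        {"veg_low", "meat_high", "weight_gain"},
    ]
    print(rule_metrics(subjects, {"veg_high", "meat_low"}, {"weight_loss"}))
```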

A072 - An Efficient Method for Predicting Dementia Drug Targets Using Multi-Relational Data Mining

Short Abstract: Dementia is a neurodegenerative condition of the brain in which there is a progressive and permanent loss of cognitive and mental performance. Current medical treatments for dementia are purely symptomatic and hardly effective, despite the advances in the molecular characterization of the disease. One of the main challenges ahead is to predict drugs and drug targets in silico to aid dementia treatment. We have developed a novel computational method to predict protein targets of drugs for dementias. We investigated molecular targets of drugs in different phases of the drug discovery process (from preclinical to marketed drugs) to provide comprehensive information on the targets of interest. Heterogeneous data were extracted from multiple databases, i.e., protein-protein interactions from the i2d database, pathways from the Reactome database, GO terms from the Gene Ontology databases, and protein domains from the Interpro database. We applied multi-relational data mining, particularly a multi-relational association rules algorithm, to combine the data and build a predictive model. Because a standard dataset of negatives for drug targets is lacking, different experiments were done with various numbers of random negative examples to obtain reliable results. In a comparative evaluation, our method achieved better performance (accuracy = 93%) than other methods: the decision tree method (accuracy = 89%), the naïve Bayes method (accuracy = 83%), and the neural network method (accuracy = 88%). Analyzing the interpretable rules, we found several interesting rules (with high confidence) related to the neuronal system, postsynaptic membrane, G protein-coupled receptor domains, etc., that are promising for further study in pharmacology.

A073 - Accurate multi-category classification of breast cancer subtypes using gene expression barcodes and machine learning

Short Abstract: Cancer is a clinically and molecularly heterogeneous disease and accurate classification of subtypes therefore improves treatment decisions, leading to improved outcomes and 5-year survival rates. While quantitative gene expression micro-array analysis has been shown to identify molecular signatures that accurately classify tumour subtypes, reproducibility between studies is very often limited, constraining its utility in clinical settings. We explored the Gene Expression Barcode method, which converts continuous expression levels into binary calls signifying genes as silenced or expressed, as a way to enable integration of data from multiple experiments and across chip platforms for the purposes of machine-learning based classification of tumour samples. The generated binary calls enabled us to develop a simple yet biologically-relevant feature selection/minimization method that simultaneously addressed the 'curse of dimensionality' and the sparsity of training samples, which are significant problems when using micro-array data in machine-learning applications. When testing the selected features with a K-means clustering classifier, we were able to correctly segregate mixtures of normal and malignant tumour tissues from various cellular origins at >95% accuracy. We implemented a multi-class feature selection variation and tested it on samples from normal and several subtypes of malignant tumours. This feature-selection filter yielded an expression barcode of 346 probes, which enables clear separation of malignant breast tumour subtypes; unseen samples from an entirely different origin than the training set were classified with 90% accuracy using simple K-means clustering. We further detail the performance of a support vector machine classifier trained on barcoded expression features selected with our methods.
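
The idea of turning continuous expression values into binary calls and then keeping only probes whose calls separate two classes can be sketched as below. This is a simplified illustration under stated assumptions: the per-probe thresholds are taken as given (how they are estimated is not shown), and the selection rule is a hypothetical stand-in, not the authors' feature selection method.

```python
def barcode(expression, thresholds):
    """Binary call per probe: 1 if expressed (above threshold), else 0."""
    return [1 if x > t else 0 for x, t in zip(expression, thresholds)]


def discriminating_probes(barcodes_a, barcodes_b):
    """Indices of probes called uniformly within each class but differently between them."""
    keep = []
    for j in range(len(barcodes_a[0])):
        calls_a = {b[j] for b in barcodes_a}
        calls_b = {b[j] for b in barcodes_b}
        if len(calls_a) == 1 and len(calls_b) == 1 and calls_a != calls_b:
            keep.append(j)
    return keep


if __name__ == "__main__":
    thresholds = [5.0, 5.0, 5.0]               # hypothetical per-probe cutoffs
    normal = [barcode(s, thresholds) for s in ([7.1, 2.0, 6.3], [6.8, 2.2, 3.9])]
    tumour = [barcode(s, thresholds) for s in ([1.5, 2.1, 6.0], [2.4, 1.9, 6.6])]
    print(discriminating_probes(normal, tumour))
```

Because the calls are binary, samples from different experiments or chip platforms can be compared probe by probe without further normalization of the intensity scale, which is the integration benefit described in the abstract.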

A074 - A web server for the functional characterization of drugs from gene expression following treatment

Short Abstract: Many drugs exert their therapeutic activities through the modulation of multiple targets. Moreover, this polypharmacology is often associated with both beneficial and adverse off-target effects. For most drugs these targets are largely unknown and identification among the thousands of gene products remains difficult. Yet a better knowledge about such drug-protein interactions, along with the molecular pathways involved and the associated diseases, could be of substantial value to drug development, in particular to predict side effects and explore potential drug repositioning.

DNA microarray technology enables us to observe the effect of drug treatment on the activity of all genes simultaneously and thus forms the perfect starting point for drug mode of action prediction. Hence we have developed an easy-to-use analysis suite for functional characterization of drugs based on gene expression changes following treatment. Our software provides all necessary tools for gaining new insights into the biological effects of a drug by integrating (1) preprocessing of gene expression data obtained from different Affymetrix array types; (2) quality assessment and exploratory analysis of these data; (3) genome-wide drug target prioritization; (4) prediction of pathways involved in the drug’s mode of effect; (5) identification of associated diseases enabling side effect prediction and drug repurposing; and (6) result visualization and reporting. Drug target prioritization is performed by means of an in-house developed algorithm for network neighborhood analysis, integrating the expression data with functional protein association information. All of the above functionalities are demonstrated on gene expression data for treatment with well-characterized drugs.

A075 - Deep Phenotyping of Multitube Flow Cytometry Data Reveals New Cell Types Associated with NPM1 Mutation in AML

Short Abstract: Acute myeloid leukaemia (AML) is a blood cancer with poor prognosis (67% five year survival). Several recurrent genetic lesions have implications for prognosis, including a 4bp insertion in NPM1, which has an associated CD34- immunophenotype. Deep and comprehensive phenotyping of AML has yet to be performed, and could help to elucidate further immunophenotypic consequences of NPM1 and other genetic lesions.

We analyzed retrospective clinical diagnostic flow cytometry data of bone marrow from 129 AML patients, which was multiplexed across several four-marker tubes. Using flowBin, a novel methodology we developed, we combined these tubes into 17-marker high-dimensional data. We then used flowType to enumerate all 616,285 cellular immunophenotypes made up of combinations of one to six markers and present in at least one patient. Finally, we applied the Wilcoxon rank sum test with Holm correction to interrogate each cell type for differences in abundance with NPM1 status.
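
Testing hundreds of thousands of immunophenotypes requires a multiple-testing correction, and the Holm step-down procedure named above can be sketched in a few lines of Python. The function below returns Holm-adjusted p-values in the original order; it is a generic illustration with a hypothetical function name, not code from flowBin or flowType.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        value = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted
```

An immunophenotype would then be reported as significant when its adjusted p-value falls below the chosen level (for example 0.05, an assumed value).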

We found 801 immunophenotypes significantly correlated with NPM1 mutation status. We used Cytoscape to visualise the immunophenotype correlation network, finding that the major association with NPM1 status was CD34+/-.
Using the RchyOptimyx visualization tool we identified four classes of interest. The first, CD34-CD13+CD33+, corresponds to known NPM1-associated expression patterns. The second, CD13+CD34−CD2−CD4+ has been reported in Acute Promyelocytic Leukemia, but not previously associated with NPM1. The remaining two, HLA+CD34+CD64− and CD34+CD61−CD14−, are also novel and extremely specific to NPM1-wt.

We have developed and validated a general methodology for comprehensively interrogating multi-tube flow cytometry data by finding novel NPM1-associated AML immunophenotypes.

A076 - Probabilistic inference of subclonal copy number and LOH in whole genome sequencing of tumours

Short Abstract: Background: Tumours are often composed of heterogeneous mixtures of cell populations having undergone clonal evolution and expansion. This intrinsic clonal diversity is often implicated in treatment resistance and metastasis. Copy number alterations (CNA) and loss of heterozygosity (LOH), which make up the structural landscape of somatic aberrations, can be measured through quantification of read abundance in whole genome sequencing (WGS). However, inference of CNA/LOH remains a challenge due to statistical under-sampling of alleles and sources of noise such as GC-content bias and normal cell contamination. In this contribution, we address these challenges in order to explore the estimation of clonal abundance of CNA/LOH events.

Methods: We present solutions to identify subclonal CNA/LOH events by deconvolving signals from mixed cell populations in WGS of individual tumour biopsies. Using hierarchical probabilistic modeling, our approach jointly analyzes allelic fractions at all germline SNP loci and read counts from tumour and normal libraries. We propose an HMM that simultaneously segments and predicts subclonal CNA/LOH, and reports estimates of cellular frequency, normal contamination and ploidy, thus providing a more complete characterization of the tumour.

Results: We present simulation results using real data from multiple, spatially separated biopsies from a high-grade serous ovarian carcinoma. For benchmarking, we generated in-silico mixtures of these samples at known proportions. We demonstrate that our method performs with increased sensitivity compared to existing tools that do not account for subclonal heterogeneity. Our work provides an important advance in identifying and quantifying fractional events in heterogeneous tumours.

A077 - Inferring The Phylogeny Of Clonal Populations In Cancer

Short Abstract: Human cancers evolve under the principles of Darwinian selection at the level of clonal populations of cells. Over time, tumour cells acquire mutations which can confer phenotypic advantages and act as substrates for Darwinian selection. As a result, when tumours are diagnosed, they are often composed of mixtures of heterogeneous clonal cell populations related by a phylogenetic tree. Accurate identification of clonal populations informs major clinical end-points such as treatment resistance and metastatic potential. Though heterogeneity and clonal evolution have been accepted features of cancer for decades, only recently have technological advances in high throughput DNA sequencing allowed for accurate quantification of these phenomena through identification and interpretation of somatic mutations.

In this work we develop a novel statistical model and sequential Monte Carlo (SMC) sampling scheme which allows us to infer the phylogenetic tree relating the clonal populations and infer the abundance of these clones using digital allelic abundance of point mutations subjected to deep sequencing. We show how this model can be applied to multiple samples taken from the same patient, which allows it to share statistical strength across the samples. In addition, the model handles the confounding effects of normal cell contamination, copy number aberrations and zygosity. We show using synthetic data and controlled mixtures of healthy tissue that the model can recover the correct phylogeny, even in the presence of complex branching patterns and multifurcation. Finally, we apply the model to a dataset consisting of multiple spatially separated samples taken from individual high grade serous ovarian cancers.

A078 - Reactome Knowledgebase: Annotating Cancer Variants and Anti-Cancer Therapeutics

Short Abstract: Reactome is an open-source, free access, manually curated and peer-reviewed biological pathway knowledgebase. Information is authored by expert biological researchers, maintained by the Reactome editorial staff and cross-referenced to publicly available web-based informatics resources. The Reactome data model describes life processes ranging from metabolism to cell signaling, in a single internally consistent, computationally navigable format. Recently, Reactome has expanded its content and data model to provide users with cancer-specific information. While the COSMIC database catalogues cancer variants of human proteins along with their frequency and distribution across different cancer types, Reactome aims to capture molecular details of functional differences between cancer variants and their normal counterparts. Biological differences between different cancer mutants have important implications for the design and application of targeted therapy, as already demonstrated in clinical practice by the different sensitivity of EGFR mutants to gefitinib and erlotinib. Reactome is able to capture and display mechanistic differences between protein variants in detail, including their interaction with drugs. Reactome has so far published Signaling by EGFR, FGFR, Notch1 and PI3K/AKT pathways in Cancer. Reactome annotations include information on the molecular function of over 200 cancer variants, cross-referencing the COSMIC database whenever possible, as well as information on approximately 50 anti-cancer drugs. Reactome includes tools for pathway enrichment analysis and large-scale data querying. Pathway data can be exported in several formats including SBML, BioPAX and derived interactions. Reactome content, the database and software interface are freely available under an open-source licence. See www.reactome.org for more information.

A079 - Applications of LitMS: A Relevance Scoring System for Biomedical Relationships

Short Abstract: Medline records report relationships among genes, diseases, tissues and other entities, with new connections published daily. The number of papers that mention a relationship typically correlates with the likelihood that two entities are related, but for new findings, using counts of documents with co-occurring terms as a validity indicator is ineffective. To efficiently distinguish documents with valid relationships from documents with mere co-mentions, we describe elsewhere development and validation of LitMS (Literature Mining System), a Lucene-based system that scores individual Medline records based on their relevance to one or two topics typically queried by biomedical researchers. LitMS uses dictionary look-up in combination with contextual and positional features in the text to assign scores of low, medium and high relevancy. We describe here how in addition to identifying novel gene-disease relationships, other applications have emerged. For example, high-confidence literature groupings of genes by disease or tissue are automatically generated and used for gene set enrichment analysis. LitMS also reveals anomalies between document frequency and maximum relevancy score. We observe that 20-40 documents that relate a pair of entities with only low relevancy scores are indicative of meaningful, indirect relationships. Furthermore, depending on the entity types, there are consistent themes in how the entities are related. For instance, frequently occurring, low relevancy protein-tissue relationships describe how immunohistochemical protein staining and tissue pathology together lead to disease diagnosis. Relevancy scoring of each relationship in a document adds a new approach for deciphering biological meaning from extracted entities.

A080 - Integrated genomics for lethal prostate cancer

Short Abstract: Prostate cancer is the most diagnosed internal malignancy in the western world. While the majority of prostate cancers are non-lethal, there is currently no reliable approach to distinguish lethal from non-lethal prostate cancer at an early, curable stage. Over the last 7 years, researchers at Epworth Medical Centre and Royal Melbourne Hospital have compiled a bio-bank of over 1500 tumour specimens to better understand the underlying molecular mechanisms governing lethality in prostate cancer. Along with excellent clinical information, the bank contains many individuals that have matched whole-blood, primary tumour and distant metastases. This provides a unique resource for molecular profiling to help understand lethal prostate cancer. Dr Chris Hovens and his group at Royal Melbourne Hospital have carried out molecular profiling of a number of samples using RNA-SEQ, Illumina 2.5M SNP Chip, whole-genome sequencing and Illumina 450K methylation chip. In collaboration with NICTA, VLSCI, and the Wellcome Trust Sanger Institute, bioinformatics analysis of this data is currently underway. Samples from two individuals have been analysed so far: whole-blood, primary tumour and castrate resistant metastasis in Patient 1, and whole-blood, primary tumour, local recurrence, two hormone naive metastases, and a castrate resistant metastasis in Patient 2. Single nucleotide variants, small insertion/deletions, copy-number variations, structural rearrangements, differentially expressed genes and differentially methylated regions are being used to uncover the molecular mechanisms driving progression to metastasis and treatment resistance. This study highlights the benefits of an integrated genomics approach for tracking the progression of a tumour in a given individual.

A081 - Inference of subclonal genomic rearrangements in sequenced tumours using deStruct

Short Abstract: Genomic heterogeneity, a hallmark of many cancers, provides a window into a tumor’s evolutionary history and alludes to the tumor’s potential to survive cancer therapy. Genomic rearrangements are key mutational events in a tumor’s evolution, driving proliferation by creating fusion genes and disabling tumor suppressors.

High-throughput genome sequencing now facilitates identification of clonally dominant and subclonal rearrangements and inference of each rearrangement’s clonal abundance. Quantifying clonal abundance is a proxy for characterizing clonal population structure and inferring evolutionary histories of tumor cell populations in the context of the clinical trajectories. Identification of subclonal events may become a necessary aspect of personalized cancer therapies, allowing for treatments that eliminate therapy resistant subclones.

We propose a computational method, deStruct, for identification and quantification of subclonal rearrangements from genome sequencing data. We leverage our previous work on low coverage genomes (comrad), to detect subclonal rearrangement breakpoints with highest possible sensitivity. deStruct uses a maximum likelihood based combinatorial algorithm to calculate the clonal abundance of each breakpoint and identify false positive breakpoint predictions. The algorithm produces sets of breakpoints clustered according to the likelihood they coexist in the same subclone.

We demonstrate the accuracy of deStruct using simulated genomes with known clonal population structure. We applied deStruct to a prostate tumor and validated several dominant and subclonal predictions. Finally, we applied deStruct to a primary/recurrence dataset from a high grade serous ovarian cancer patient to characterize the dynamics of clonal structure in the presence of therapy.

A082 - Alternative Splicing in Triple Negative Breast Cancers Suggests Differences in Precursor Differentiation State

Short Abstract: Alternative splicing (AS) compounds the complexity encoded in the genome by producing different gene products from the same genomic locus in a tissue- and developmental-stage specific manner. Whole transcriptome analysis provides a detailed look at gene expression and AS. RNA-sequencing of tumor samples may help identify cancer-specific modifications of AS profiles. Specific events could have an effect on a gene's function and hence serve as useful biomarkers in the study, diagnosis and treatment of the disease.

Messenger RNA from 90 triple negative breast cancer tumors was sequenced on Illumina machines and processed through an alignment-based pipeline (GSNAP, DEXSeq, MISO) and a reference-free pipeline (Trans-Abyss). Samples were classified according to expression-based intrinsic subtypes (basal and non-basal). Coverage across genes was used to infer differential exon usage. Alternative expression analysis between the two subtypes identified 337 exons in 114 genes as alternatively expressed (qval < 1e-10). Myoepithelial and luminal epithelial cells from healthy breast tissue were used as normal controls and were found to share gene isoform preferences with the basal and non-basal groups, respectively. Results include INPP4B, expressing a shorter isoform in myoepithelial and basal groups; and MYO6, known to be expressed in an isoform-specific way in polarized cells.

We found cell-type-specific AS events linking two breast cell types with tumor subtypes, which could be used to help determine the cellular background of a tumor sample. Further mining of our data is underway to determine other significant changes that may play a role in the disease.

A083 - Application of semantic technology in identifying functional variants in the exome of a paediatric Multiple sclerosis case study

Short Abstract: Multiple sclerosis (MS) is a prevalent multifactorial inflammatory disease, causing demyelination of axons in the brain and the Central Nervous System. Pediatric MS is a very rare form of the disease presenting unique diagnostic, treatment and management challenges. Genetic factors are presumed to be the dominant cause, with environmental contributions considered to be minuscule or irrelevant.

We have performed whole exome sequence analysis in an atypical pediatric MS case also presenting with non-anemic iron deficiency which, when rectified through iron supplementation, halted demyelination. We identified 502 ‘high impact’ variants, including 266 frameshifts, and 6987 predicted functional missense variants. As MS is known to be a multi-genic disease, we developed a variant prioritization strategy that relies on a semantic model of the disease implemented in our BioOntological Relationship Graph (BORG) database. The model links the ‘multiple sclerosis’ term in the Disease Ontology to terms relevant to the disease in other ontologies, e.g. ‘demyelination’ or ‘abnormal myelination’ in the Phenotype Ontology and functions relating to ‘myelination’ and ‘inflammation’ in the Gene Ontology. As phenotypes known to be associated with both human genes and those arising in mouse and rat gene knockout experiments are modeled in the BORG, we used a concept of ‘guilt-by-indirect-association’ to implicate candidates whose roles are unobvious yet biologically plausible.

We present a combination of high-impact variants, many of which are novel, in genes involved in myelination and the immune response as putative contributors to pediatric MS, and also present variants that may explain the iron deficiency phenotype.

A084 - Positive and Unlabeled Learning for Prioritization (PULP)

Short Abstract: Identifying disease-associated variations in exome sequencing studies often involves strategic filtering and selection from numerous identified variations of unknown significance. Common filtering strategies are often too weak to sufficiently narrow a candidate list, or make broad assumptions that oversimplify the underlying epidemiology. We demonstrate here a novel algorithm for the prioritization of variants identified in exome sequencing studies. Our algorithm—Positive and Unlabeled Learning for Prioritization (PULP)—builds a non-traditional supervised machine learning model for ranking gene sets based on expert-selected features from a broad range of genomic-scale datasets. We demonstrate an application of this algorithm in the identification of disease-associated variants in a genetically heterogeneous retinal degenerative disorder, Retinitis Pigmentosa (RP). Two classes are used, ‘positive’ and ‘unlabeled’. The ‘positive’ class consists of all genes known to harbor causative RP mutations. The ‘unlabeled’ class contains all other RefSeq genes as the disease-association states of these genes are unknown. Each gene is assigned continuous features representing RNA expression level in multiple tissues of the body and eye, CRX-binding ChIP-seq data (CRX is a photoreceptor-specific transcription factor) in the retina, and features representing gene characteristics such as length and exon count. We train several logistic regression models in an approach analogous to a leave-one-out analysis to construct an unbiased ranking of disease-association probabilities. Ranking variants by this method outperforms ENDEAVOUR, a popular non-disease-specific technique for disease gene prioritization. Future efforts will continue to investigate the use of PULP in RP and other disease models.

A085 - LRRfinder2.0 tBrowse: visualization of Toll-like receptor sequence, structure and variation

Short Abstract: Toll-like receptors (TLRs) are vital components of the innate immune system. Recent advances in high-throughput technologies have rapidly increased the availability of TLR sequences, resulting in the identification of variations associated with susceptibility and resistance to infectious and autoimmune diseases. However, less than 1% of TLR sequences have resolved structures. TLRs, like many proteins with immune-related functions, contain leucine-rich repeats (LRRs). The publicly available tool LRRfinder2.0 (www.lrrfinder.com) identifies these motifs in any LRR-containing protein and can be utilised to improve alignments used for comparative modelling. Combining sequence and structural information can lead to a better understanding of species-specific variation in innate immune responses. Our TLR database (tLRRdb) stores over 3,500 sequences containing more than 60,000 LRRs. From this, we have supplemented human, murine, bovine, porcine and ovine sequences with annotations including: sequence conservation, candidate sites of positive selection, structural features and post-translational modifications. To make this data user-friendly, we have developed tBrowse (www.lrrfinder.com/tbrowse), a graphical interface for TLR-specific features. Annotations are often spread across many different websites. By combining sequence, structure and variation data into a single browser it is possible to reduce the time spent swapping between databases and resources. tBrowse is a graphical interface which presents all of these annotations in a single place, in an accessible format. Thus, it will enable us to better understand the structural and functional implications of TLR variations, providing insight into species-specific immune responses.

A086 - Whole-genome sequencing of monozygotic twins discordant for Crohn’s Disease

Short Abstract: It has been shown that monozygotic twins can differ in their copy number variation profiles, representing a special case of somatic mosaicism. We hypothesized that genetic differences between twins discordant for Crohn’s Disease (CD) may play a role in disease etiology and performed thorough genetic analyses of three monozygotic discordant twin pairs.
For one twin pair we carried out whole genome and exome sequencing of blood samples as well as biopsies from the bowel. Additionally, blood samples of two further monozygotic discordant twin pairs were subject to exome sequencing. On these data we performed pairwise comparative analyses to detect differences in single nucleotide variants (SNVs) and copy number variation (CNVs) between twins. The availability of DNA from blood samples as well as biopsies from the affected tissue type for one of the twin pairs allows for a particularly close look at the distribution of somatic mosaicisms present between monozygotic twins.
Genetic characterization of genomes and exomes revealed several novel, possibly functional variants in known CD regions that may contribute to CD susceptibility of the twins but are shared by affected as well as healthy individuals. Comparative pairwise analyses between twin samples yielded several hundred potentially differing SNVs and CNVs. Manual inspection of alignments and validation by Sanger sequencing have so far not confirmed any genetic differences between samples. Yet, our data provide an exceptionally thorough genetic characterization of the examined twin pairs and are the first example of whole genome sequencing applied to monozygotic twins discordant for CD.

A087 - Ultrafast approximation for phylogenetic bootstrap

Short Abstract: Nonparametric bootstrap has been a widely used tool in phylogenetic analysis to assess the clade support of phylogenetic trees. However, with the rapidly growing amount of data, this task remains a computational bottleneck. Recently, approximation methods such as the RAxML rapid bootstrap (RBS) and the Shimodaira–Hasegawa-like approximate likelihood ratio test have been introduced to speed up the bootstrap. Here, we suggest an ultrafast bootstrap approximation approach (UFBoot) to compute the support of phylogenetic groups in maximum likelihood (ML) based trees. To achieve this, we combine the resampling estimated log-likelihood method with a simple but effective collection scheme of candidate trees. We also propose a stopping rule that assesses the convergence of branch support values to automatically determine when to stop collecting candidate trees. UFBoot achieves a median speed up of 3.1 (range: 0.66–33.3) to 10.2 (range: 1.32–41.4) compared with RAxML RBS for real DNA and amino acid alignments, respectively. Moreover, our extensive simulations show that UFBoot is robust against moderate model violations and the support values obtained appear to be relatively unbiased compared with the conservative standard bootstrap. This provides a more direct interpretation of the bootstrap support. We offer efficient and easy-to-use software (available at http://www.cibiv.at/software/iqtree) to perform the UFBoot analysis with ML tree inference.
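
The resampling step mentioned above can be illustrated with a minimal sketch. It covers only the general resampling estimated log-likelihood (RELL) idea, not UFBoot itself: per-site log-likelihoods of each candidate tree are resampled with replacement and summed, instead of re-optimizing each tree on every bootstrap alignment. The function name, the input layout and the tree-level (rather than split-level) tally are assumptions made for illustration.

```python
import random

def rell_bootstrap_support(site_loglik, n_replicates=1000, seed=0):
    """Fraction of RELL replicates in which each candidate tree scores best.

    site_loglik: dict mapping a tree label to its list of per-site
    log-likelihoods (hypothetical input; all lists have equal length).
    """
    rng = random.Random(seed)
    names = list(site_loglik)
    n_sites = len(site_loglik[names[0]])
    wins = {name: 0 for name in names}
    for _ in range(n_replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        # Approximate each tree's bootstrap log-likelihood by summing
        # its precomputed per-site values over the resampled columns.
        best = max(names, key=lambda t: sum(site_loglik[t][c] for c in cols))
        wins[best] += 1
    return {name: wins[name] / n_replicates for name in names}
```

In a full analysis the support of each clade would be tallied from the best tree of every replicate; the sketch stops at the tree level to keep the resampling step visible.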

A088 - Drug repositioning with large-scale electronic medical record of patients in a network frame

Short Abstract: The pharmaceutical industry today faces various hurdles to developing therapeutic agents, including rising research costs and a high rate of drug attrition during clinical trials. Therefore, the identification of novel disease indications for approved drugs (i.e., drug repositioning) offers several advantages in drug development.

Precise prediction of therapeutic indications thus offers the possibility of faster development times and broader applications for drug repositioning. Computational prediction of novel drug indications has largely been based on genomic signatures related to drugs and diseases, and has made limited systematic use of large-scale physiological data from human patients. Here, we propose a novel network-based method that utilizes extensive clinical physiomic signatures from electronic medical records (EMRs) of patients, in addition to diverse genomic signatures, for drug repositioning.

Our method uses 10 years of EMR data from a tertiary teaching hospital, covering over 530,000 patients and their clinical records, for clinical physiomic signatures, and public resources for genomic signatures. Our method represents known drug indications as a bipartite network which consists of drugs and diseases and then identifies novel drug indications using clinical physiomic signatures from the EMR and the genomic data repositories. On cross-validation, the proposed method outperformed various prediction models as well as an existing method. Moreover, the predicted indications showed a significant enrichment with current clinical trials for drug repositioning (p-value 3.08e-07). The evaluation of our method initially suggests that medical records of patients in EMRs can be a valuable source for drug repositioning.

A089 - Effects of smoking and smoking cessation on human serum metabolite profile: results from the KORA cohort study

Short Abstract: Metabolomics is an emerging approach that helps to identify links between environmental exposures and intermediate biomarkers of disturbed pathways. We quantified 140 metabolites from 1241 fasting serum samples, which were collected from the population-based cohort, Cooperative Health Research in the Region of Augsburg (KORA), at two time points (baseline survey conducted between 1999 and 2001, S4, and seven-year follow-up, F4). Metabolite profiles were compared among groups of current smokers, former smokers and never smokers, and were further assessed for their reversibility after smoking cessation. We identified 21 smoking-related metabolites in the baseline investigation (18 in men and six in women, with three overlaps) enriched in amino acid and lipid pathways, which were significantly different between current smokers and never smokers using multivariate logistic regression analysis. Moreover, 19 out of the 21 metabolites were found to be reversible in former smokers. In the follow-up study, 10 of 13 measured reversible metabolites were confirmed in male quitters in a linear mixed effects model. By Gaussian graphical modelling (GGM) of metabolite reactions, we found effects of smoking on fatty acid desaturation and beta oxidation, which increase cardiovascular disease risk. We further constructed protein-metabolite networks to illustrate the consistent reversibility of smoking effects on metabolite profile and gene expression. In conclusion, our results suggest that the metabolites could be used as biomarkers to measure success of cessation interventions and evaluate disease risk.

A090 - Identification of cancer-specific biomarkers by using microarray gene expression profiling

Short Abstract: Carcinogenesis is a complex biological process that is affected by multiple genes, some of which can be used as biomarkers for specific tumor stages or types. An effective method for predicting such tumor markers, which are important for both diagnosis and prevention, is gene expression profiling. Here, we used a classification method and survival tests to predict cancer biomarker genes from individual cancer gene expression profiles. To validate the ability of classification in our samples, an area under the curve was calculated using support vector machine classification methods for selected genes. Twenty-three of the candidate biomarkers were correlated with patient survival. To confirm classification performance in other samples, we validated our results by comparison with breast and ovarian cancer samples. We conclude that these 23 genes might be used as cancer biomarkers.

A091 - Racing driver gene teams in cancer

Short Abstract: The development of targeted cancer treatment strategies is critically dependent on the knowledge of sets of genes whose mutation drives cancer progression. A great deal of recent research is invested in the search for such driver genes, and in efforts to reconcile those drivers into functional pathways (teams). It is now accepted that functional relations between teams of driver genes are reflected in their mutational patterns. Here, we propose a significance-based approach for the identification and evaluation ("racing") of driver gene teams from mutation data measured in large cohorts of tumors. For a given set of genes that is a candidate functionally related cancer driver team, we draw two factor graphs. One encodes the probability of observed mutations under the model that the genes are truly related and show a specific pattern, which might be obscured by errors. The other evaluates the probability of the same data under the mutational independence model that assumes that these genes are unrelated. Finally, we apply a likelihood ratio test for ranking and assessment of the candidate gene sets.

A092 - Modeling distinct osteosarcoma subtypes in vivo using Cre:lox and lineage-restricted transgenic shRNA.

Short Abstract: Osteosarcoma is the most common primary cancer of bone. Osteoblastic osteosarcoma represents the major subtype of this tumor, with approximately equal representation of fibroblastic and chondroblastic subtypes. We and others have previously described murine models of osteosarcoma based on osteoblast-restricted Cre:lox deletion of Trp53 (p53) and Rb1 (Rb), resulting in a phenotype most similar to fibroblastic osteosarcoma in humans. We report a model of the most prevalent form of human osteosarcoma, the osteoblastic subtype. In contrast to other osteosarcoma models that have used Cre:lox mediated gene deletion, this model was generated through shRNA-based knockdown of p53. As is the case with the human disease, the shRNA tumors most frequently present in the long bones and preferentially disseminate to the lungs, features less consistently modeled using Cre:lox approaches. We report gene expression analysis of the two models, and relate these to both human cancer types and OS expression data from mouse and dog. In addition, we report genome wide DNA methylation profiling for the Cre:lox model. Our approach allowed direct comparison of the in vivo consequences of targeting the same genetic drivers using two different technologies, Cre:lox and shRNA. This demonstrated that the effects of Cre:lox and shRNA mediated knock-down are qualitatively different, at least in the context of osteosarcoma, and yielded distinct subtypes of osteosarcoma. Through the use of complementary genetic modification strategies we have established a model of the most common clinical subtype of osteosarcoma and more fully recapitulated the clinical spectrum of this cancer.

A093 - Predicting drug side-effect profiles from the integration of chemical and biological spaces

Short Abstract: Drug side-effects, or adverse drug reactions, have become a major public health concern, and remain one of the main causes of drug failure and of drug withdrawal once drugs have reached the market. Therefore, the identification of potential severe side-effects is a challenging issue. In this study we develop a new method to predict potential side-effect profiles of drug candidate molecules based on their chemical structures and target protein information on a large scale. We propose an extension of the kernel regression model for multiple responses to deal with heterogeneous data sets. The original feature of the method is that the prediction is based on the integration of the chemical space of drug chemical structures and the biological space of drug target proteins in a unified framework. In the results, we demonstrate the usefulness of the proposed method on the joint prediction of about one thousand side-effects for small molecule drugs from their chemical substructure and target protein profiles, and show that the prediction accuracy consistently improves owing to the proposed regression model and integration of chemical and biological information. Finally, we conduct a comprehensive side-effect prediction for uncharacterized drug molecules stored in public drug databases, and confirm interesting predictions using independent information sources. The proposed method is expected to be useful at many stages of drug development.

A094 - Phenet: Network Analysis for Gene Prioritization of Exome Sequencing Results in Syndrome Patients

Short Abstract: Exome sequencing has been successfully established as a useful tool for scientific discovery of new disease genes and diagnosis of patients with unknown syndromes. When the exome of a single human individual is sequenced and compared to a reference genome, 20,000 to 50,000 variants are usually identified. After filtering of synonymous and non-coding variants and presumably non-pathogenic variants present in public databases, several hundred variants remain. Therefore, methods are required to narrow down the number of variants that are considered candidates in the search for the disease-causing variants. Because manual review of all variants by a human expert is a very time-consuming task, a variety of computational methods to predict disease-causing variants from the set of all variants have been suggested, so-called gene prioritization methods. Among these, network analysis of protein-protein interaction networks has proven successful in the prediction of disease genes. Here, we propose a new method for the prediction of disease genes of genetic syndromes based on 1.) the HPRD protein-protein interaction network and 2.) a set of disease genes derived from patients’ phenotypes, using phenotype-gene associations available as part of the Human Phenotype Ontology. The method is validated using 1.) 100 computationally generated patient data sets, and 2.) exome sequencing data of four patients with two genetic syndromes (Nager syndrome and Coffin-Siris syndrome) whose causative genes were unknown at the time of analysis, but have been identified in the meantime.

A095 - Discovery of phenotype association networks using Association rule Mining

Short Abstract: Pleiotropic effects have been observed more with an increasing number of variants identified through Genome-wide association study, which implies potential comorbidity effects in the human population. Discovering systematic correlations or associations between disease related phenotypes could potentially unveil unknown disease mechanisms. This work reports a novel data mining approach to discover patterns of multiple phenotype associations from a large scale epidemiological data in a medical checkup database for Korean using Association Rule Mining and to infer phenotype networks from the patterns. This approach composes of an equal-frequency binning strategy for transforming continuous data into categorical data, association rule mining, a more sophisticated scheme for refining association rules to extract the patterns and visualize them with networks.
The representative patterns of phenotypic associations were informative for relating plasma lipid levels to bone mineral density (BMD) and to a cluster of common traits (obesity, hypertension, insulin resistance) related to Metabolic Syndrome (MS). A more interesting finding was that BMD was associated with high glucose levels but not with insulin levels, although the association of high glucose levels or insulin resistance with BMD has so far been inconclusive. We suggest that the multiple phenotypic associations between plasma lipid levels, BMD and common traits in MS may be affected by common genes harbouring pleiotropic effects.
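
The two basic steps named in this abstract, equal-frequency binning and association rule evaluation, can be illustrated with a minimal sketch. It assumes three bins labelled low/mid/high and uses the standard support and confidence measures. The trait names and values are made up for illustration and carry no biological meaning, and the authors' rule refinement scheme is not reproduced here.

```python
def equal_frequency_bins(values, labels=("low", "mid", "high")):
    """Assign each value to a bin so that bins hold (nearly) equal counts."""
    n, k = len(values), len(labels)
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    binned = [None] * n
    for rank, idx in enumerate(order):
        binned[idx] = labels[rank * k // n]
    return binned

def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up continuous measurements for six hypothetical individuals.
trait_x = [85, 90, 100, 110, 130, 150]
trait_y = [1.2, 0.8, 1.0, 1.1, 0.7, 0.75]

x_bins = equal_frequency_bins(trait_x)
y_bins = equal_frequency_bins(trait_y)

# One "transaction" (set of categorical items) per individual.
transactions = [
    {f"trait_x={bx}", f"trait_y={by}"} for bx, by in zip(x_bins, y_bins)
]

support, confidence = rule_metrics(
    transactions, {"trait_x=high"}, {"trait_y=low"}
)
print(x_bins, y_bins, support, confidence)
```

Rules that pass chosen support and confidence thresholds could then be drawn as edges between phenotype categories to form a network. The thresholds themselves are not given in the abstract.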

A096 - Medical sequencing: assessing the Incidentalome

Short Abstract: Advances in next-generation sequencing (NGS) technology have made exome and whole genome sequencing feasible for molecular diagnostics of rare genetic disorders. In contrast to Sanger sequencing of specific genes or gene panels, NGS provides the ability to query all human genes. However, this opportunity presents new challenges. Selecting which variants should be reported to patients poses major ethical and clinical management issues. Besides variants that can be related to the ascertained disorder(s), there are others potentially conferring risk for additional disorders, termed the “Incidentalome”.

Berg et al. (Genetics in Medicine 2012) have proposed a variant classification method for incidentalome findings. Genes are classified (a) as clinically actionable (medical intervention or prevention) or as potentially causing psychosocial harm, and (b) by their mode of inheritance. Incidentalome variants need to (i) be rare (minor allele frequency less than 5%), (ii) affect protein-coding genes and (iii) cause loss of function (stop-gain, frameshift, splicing) or have an established disease association (e.g. according to the HGMD database). Variants are finally classified (i) as likely to cause disease or recessive, and (ii) as clinically useful or harmful if reported.
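
The three variant-level criteria listed above can be expressed as a simple filter. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Variant` record, its field names and the consequence labels are hypothetical, and the disease-association flag stands in for a lookup in a database such as HGMD.

```python
from dataclasses import dataclass

# Consequence labels are illustrative; real annotation tools use their own vocabularies.
LOF_CONSEQUENCES = {"stop_gain", "frameshift", "splicing"}
MAF_THRESHOLD = 0.05  # "rare" = minor allele frequency below 5%, as in the text

@dataclass
class Variant:
    gene: str                                # hypothetical gene identifier
    maf: float                               # minor allele frequency
    protein_coding: bool                     # does the variant affect a protein-coding gene?
    consequence: str                         # predicted functional consequence
    known_disease_association: bool = False  # stand-in for a disease-database hit

def is_incidentalome_candidate(v: Variant) -> bool:
    """Apply criteria (i) to (iii): rare, protein-coding, and LoF or disease-associated."""
    rare = v.maf < MAF_THRESHOLD
    damaging = v.consequence in LOF_CONSEQUENCES or v.known_disease_association
    return rare and v.protein_coding and damaging

# Made-up example variants.
variants = [
    Variant("GENE_A", 0.01, True, "frameshift"),      # rare LoF: kept
    Variant("GENE_B", 0.20, True, "stop_gain"),       # too common: dropped
    Variant("GENE_C", 0.001, True, "missense", True), # rare, known association: kept
    Variant("GENE_D", 0.001, True, "missense"),       # rare, no LoF or association: dropped
    Variant("GENE_E", 0.001, False, "splicing"),      # not protein-coding: dropped
]

kept = [v.gene for v in variants if is_incidentalome_candidate(v)]
print(kept)
```

The subsequent gene-level classification (clinical actionability, mode of inheritance) would be applied to the variants that pass this filter.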

We have implemented Berg’s method and present results for Complete Genomics whole genomes of control samples. We discuss several key issues, such as: (1) are there additional disease-associated genes? (2) is there a consensus on Berg’s gene classification? (3) what are the optimal variant quality filters? (4) what annotation pipeline should be used for Complete Genomics data?

This incidentalome annotation pipeline will be made available in the MedSavant genetic variant browser.

A097 - A preliminary analysis on a disease progression as a transition of disease attractors on the potential field determined by G×G and G×E interactions

Short Abstract: Since Waddington proposed the "epigenetic landscape" as a metaphor for development and differentiation, understanding these processes as state transitions of cellular attractors along a "valley", where a "marble" rolls down to the point of lowest local elevation, has been a holy grail in biology. Half a century later, Kauffman proposed the idea that a cell type is an attractor of the gene regulatory network. Not only development and differentiation but also disease pathogenesis and progression can be understood as trajectories of "disease attractors" on an epigenetic landscape determined by G×G and G×E interactions. A disease is believed to be an aberration of a biological system. However, a disease is not a temporary aberration but a stable one. That is, a "disease type" is also an attractor on the potential field determined by G×G and G×E interactions. Here we present a preliminary analysis of disease progression in Alzheimer’s disease (AD) as a transition between disease attractors on the potential field determined by G×G and G×E interactions. We collected hippocampal microarray data from 9 control subjects and 22 AD subjects (7 incipient, 8 moderate, and 7 severe AD subjects). For each AD progression stage, we identified the stage-specific genes and then inferred the gene regulatory network of AD pathogenesis. We estimated a potential field on a state plane for the 9 control subjects and 22 AD subjects. We then calculated gradients on the expression potential field to show “disease progression trajectories” among disease attractors, that is, disease states.

A098 - Comprehensive genetic variant database for rheumatoid arthritis, RAvariome.

Short Abstract: Rheumatoid arthritis (RA) is the most common autoimmune inflammatory disease of the joints and is a multi-factorial disease involving genetic and environmental risk factors. In the last 5 years, genome-wide association (GWA) studies have identified many genetic variants associated with RA. However, some risk variants show poor reproducibility due to genomic differences between geographical populations, or have different effects on disease in each geographical population. Since few genetic association databases contain information on the geographical population of the subjects, our aim was to establish the first human genetic variation database providing a reliable set of RA-related genetic variants that have been confirmed in more than one study in a specific geographical ancestry. We collected 7,739 association signals between RA and genetic variants, including SNPs, HLA types, CNVs and VNTRs, from 90 published papers. The statistically significant associations from the largest GWA studies were then graded by their reproducibility among independent studies. As a result, 101 RA risk variants were confirmed as reproducible in either the same or different geographical populations. Comparison with the RA risk variants reported in recent RA review papers showed that some of these risk loci have been validated in recent meta-analyses, whereas others have attracted little attention in the reviews. Moreover, based on the confirmed variants, we developed an RA genetic risk prediction tool for use in clinical research. RAvariome is available at http://hinv.jp/hinv/rav/.
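
The notion of a variant being "confirmed in more than one study in a specific geographical ancestry" can be sketched as a grouping step. The grading scheme actually used by RAvariome is not described in the abstract, so everything below is an assumption: the record layout, the variant and study identifiers, and the significance threshold of 5e-8 (a common genome-wide convention) are illustrative only. The sketch also covers only replication within one population, not across populations.

```python
from collections import defaultdict

def reproducible_variants(signals, p_threshold=5e-8, min_studies=2):
    """Return (variant, population) pairs significant in at least min_studies studies.

    signals: iterable of dicts with keys 'variant', 'population', 'study', 'p'.
    p_threshold: assumed significance cut-off; not taken from the abstract.
    """
    studies = defaultdict(set)
    for s in signals:
        if s["p"] < p_threshold:
            studies[(s["variant"], s["population"])].add(s["study"])
    return {
        key: sorted(found)
        for key, found in studies.items()
        if len(found) >= min_studies
    }

# Made-up association signals; identifiers are hypothetical.
signals = [
    {"variant": "var1", "population": "East Asian", "study": "studyA", "p": 1e-9},
    {"variant": "var1", "population": "East Asian", "study": "studyB", "p": 3e-10},
    {"variant": "var1", "population": "European", "study": "studyC", "p": 1e-8},
    {"variant": "var2", "population": "European", "study": "studyC", "p": 1e-12},
    {"variant": "var2", "population": "European", "study": "studyD", "p": 0.01},
]

confirmed = reproducible_variants(signals)
print(confirmed)
```

Here only var1 in the East Asian population is reported, because it is the only pair with significant signals from two independent studies.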

A099 - Time to join forces to advance personalized medicine with Biomedical Informatics

Short Abstract: The EU project INBIOMEDvision (http://www.inbiomedvision.eu/) aims to become a Europe-wide initiative that monitors the evolution of Biomedical Informatics (BMI). It addresses the effective and synergistic integration of the computational methods and technologies supporting life sciences research (Bioinformatics) with the informatics supporting healthcare and medical research (Health or Medical Informatics), through the collaborative efforts of a broad group of experts with complementary perspectives on the field.
BMI comprises both the clinical translation of systems biology and the research re-use of clinical information. Patient clustering for cohort identification, interpretation of genetic data from patients, and patient stratification for treatment regimens are some of the areas that have benefitted from advances in BMI.
This poster presents the reports recently published by INBIOMEDvision. The report “Prospective analysis on Biomedical Informatics enabling personalized medicine” outlines the state of the art and the current and future work needed to address the challenges of BMI in three main areas: Health-Related Genomics, Network-based decision support for Systems Medicine, and integration of Electronic Health Record (EHR) data. The solutions needed to successfully enable the re-use of EHR clinical data for research purposes are complex and require a multidisciplinary approach. In this respect, the development and implementation of standards that help the data mining process extract data from EHRs is a main area of study.
Collaboration among specialists from the different areas involved, be they clinicians, biomedical researchers, IT specialists (including those working on security and privacy) or other stakeholders, will help ensure optimal solutions to the challenges encountered.

A100 - A Consolidated Cell Line Molecular Feature Database and Suite of Data Visualisation Tools for the Generation of Novel Hypotheses in Cancer Biology

Short Abstract: The use of cell lines in cancer research has allowed for investigations into the development and progression of cancer, with the hope of developing new therapies. Such research, conducted within a laboratory environment, allows for tumour samples to be characterised based on their molecular aberrations, whilst testing novel therapeutic agents in a model system with no risk to patient safety.

As a means of selecting appropriate cell lines that contain desired molecular features (e.g. cell lines representative of a patient population), the AstraZeneca Oncology Bioinformatics team developed a consolidated cell line and molecular feature resource consisting of proprietary AstraZeneca cell line molecular data and public data from Novartis and the Sanger Institute.

Along with this consolidated cell line molecular feature resource, a number of data analysis and visualisation tools were developed to support rapid data extraction and analysis, and subsequent visualisation and hypothesis generation. Using a combination of workflow technology, custom scripts, and distributed databases, this toolkit equips researchers with a mechanism for rapidly identifying and selecting cell lines based on their molecular features, whilst providing a platform to aid hypothesis generation and experimental design with regard to therapeutic agents, through a simple and intuitive graphical user interface.

As a result of using these toolkits, bioscientists at AstraZeneca and partnering companies have been able to rapidly access and assimilate oncology data, and generate novel hypotheses for both late-stage development and early target opportunities.
