Analysis of non-coding transcripts related to prostate and colorectal cancer

Summary:

Prostate cancer and colorectal cancer are two of the most commonly diagnosed cancers in developed countries, and both place a heavy burden on society. Long non-coding RNAs (lncRNAs) are RNAs that are not translated into protein, or that lack an open reading frame longer than 100 amino acids. The functions of most of these RNAs are unknown, but lncRNAs play a large role in cancer diagnosis, prognosis, development, metastasis, and pathophysiology. New correlation methods have been developed to predict annotations for unannotated gene transcripts (specifically lncRNAs) by correlating their RNA expression values with those of annotated genes. The correlated annotated genes are then used as the basis for predicting Gene Ontology (GO) term annotations for the transcripts. Topological information content similarity (TopoICSim) is a semantic distance measure for calculating the similarity between genes given their GO annotations. TopoICSim can be used to benchmark predictions made with the new correlation measures on already annotated genes, which may give more insight into the biases or limitations of these annotation predictions. TopoICSim may also serve as a predictor of the quality of annotation predictions, making predictions for transcripts without known annotations more tangible. For this benchmark, FANTOM 5 and GSE63733 expression data are used together with a multitude of HALLMARK/custom gene sets. To give further insight into which processes and properties might be related to lncRNAs in each cancer, KEGG gene sets from HALLMARK together with lncRNA transcripts were analyzed by their annotations and annotation predictions. Especially for poorly annotated transcripts or genes, this could give new insights into gene-gene and lncRNA-gene interactions related to the cancers. The FANTOM 5 and TCGA expression data are used for this process.
Determining the quality of annotation predictions for transcripts without an official annotation remains unreliable, but lncRNAs can be separated into several distinct groups by their molecular mechanisms. Interpreting the GO terms for each lncRNA also remains difficult because of the number of transcripts and GO terms. This may be solved either by case-by-case analysis or by incorporating lncRNAs into different visualizations or other interactive environments.

This project was done as part of an internship in collaboration with the NTNU.

Contributors:



Summary:

GAPGOM (novel Gene Annotation Prediction and other GO Metrics) is an R package with tools and algorithms for estimating correlation of gene expression, enriched terms in gene sets, and semantic distance between sets of gene ontology (GO) terms. The package was made for predicting the annotation of unannotated genes, in particular with respect to GO, and for testing such predictions. The prediction is done by comparing expression patterns between a query gene and a library of annotated genes, and then annotating the query gene with terms that are enriched in the set of genes with a similar expression pattern (often described as "guilt by association").
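
The "guilt by association" idea can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the GAPGOM API. The function names, the toy expression values, the "GO:toy" labels, the choice of Pearson correlation, and the plain hypergeometric test without multiple-testing correction are all assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def enrichment_p(k, M, n, N):
    """Hypergeometric P(X >= k): N genes drawn from M, n of which carry the term."""
    hits = sum(math.comb(n, i) * math.comb(M - n, N - i)
               for i in range(k, min(n, N) + 1))
    return hits / math.comb(M, N)


def predict_terms(query, library, annotations, top_n=3, alpha=0.1):
    """Annotate a query with terms enriched among its most correlated genes."""
    # Rank annotated genes by how similar their expression is to the query.
    ranked = sorted(library, key=lambda g: pearson(query, library[g]), reverse=True)
    neighbours = ranked[:top_n]
    candidate_terms = set().union(*(annotations.get(g, set()) for g in neighbours))
    predicted = {}
    for term in candidate_terms:
        in_library = sum(term in annotations.get(g, set()) for g in library)
        in_neighbours = sum(term in annotations.get(g, set()) for g in neighbours)
        p = enrichment_p(in_neighbours, len(library), in_library, len(neighbours))
        if p <= alpha:
            predicted[term] = p
    # Most significant terms first.
    return dict(sorted(predicted.items(), key=lambda kv: kv[1]))


# Toy data (hypothetical): expression of six annotated genes over four samples.
library = {
    "geneA": [2, 4, 6, 8], "geneB": [1, 2, 3, 5], "geneC": [1, 3, 2, 4],
    "geneD": [4, 3, 2, 1], "geneE": [5, 1, 4, 2], "geneF": [3, 3, 1, 1],
}
annotations = {
    "geneA": {"GO:toy1", "GO:toy2"}, "geneB": {"GO:toy1"},
    "geneC": {"GO:toy1", "GO:toy3"}, "geneD": {"GO:toy2"},
    "geneE": {"GO:toy3"}, "geneF": {"GO:toy2"},
}
query = [1, 2, 3, 4]  # expression of the unannotated transcript

predicted = predict_terms(query, library, annotations)
print(predicted)
```

In this toy example the three genes most correlated with the query all carry the same term, so that term is the only one reported as enriched. A real analysis would use far more genes and samples and would correct the p-values for multiple testing.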

This project was done as part of an internship in collaboration with the NTNU.

Contributors:




Summary:

Repository for the XEMCLA (energy-dispersive X-ray Electron Microscopy CLustering Analysis) project, which aims to cluster cell structures in EM images with the help of additional atomic-element X-ray diffraction matrices/images. A more in-depth description of the project can be found in our preliminary and regular reports.

Contributors:



MetaToKrona

Summary:

MetaToKrona is a web application that combines MetaPhlAn 2 and KronaTools. Krona plots are interactive and give an easier-to-understand view of a metagenomic sample than the static plots MetaPhlAn 2 produces. The web tool has a queuing system that allows multiple users to submit jobs. When a job has finished running, you should receive an email with a link to the output.
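
The conversion step between the two tools can be sketched as follows. This is not the MetaToKrona implementation. It assumes the usual MetaPhlAn 2 profile layout (a clade path such as k__Bacteria|p__Firmicutes in the first column and a relative abundance in the second) and the tab-separated "quantity, then lineage" layout read by KronaTools' ktImportText. The function name and the choice to keep only species-level rows are illustrative assumptions.

```python
def metaphlan_to_krona(lines, leaf_prefix="s__"):
    """Convert MetaPhlAn 2 profile lines into ktImportText-style rows.

    Only rows whose last clade starts with `leaf_prefix` are kept, so that
    abundances are not counted once per taxonomic level.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("\t")
        clades = fields[0].split("|")
        if not clades[-1].startswith(leaf_prefix):
            continue  # intermediate level (kingdom, phylum, ...) or strain
        # Drop the "k__", "p__", ... rank prefixes and tidy up the names.
        names = [(c.partition("__")[2] or c).replace("_", " ") for c in clades]
        rows.append("\t".join([fields[1]] + names))
    return rows


# Hypothetical profile excerpt, for illustration only.
profile = [
    "#SampleID\tMetaphlan2_Analysis",
    "k__Bacteria\t100.0",
    "k__Bacteria|p__Firmicutes\t100.0",
    "k__Bacteria|p__Firmicutes|c__Bacilli|s__Streptococcus_mitis\t60.0",
    "k__Bacteria|p__Firmicutes|c__Bacilli|s__Bacillus_subtilis\t40.0",
]

krona_rows = metaphlan_to_krona(profile)
for row in krona_rows:
    print(row)
```

Written to a file, rows like these could then be passed to ktImportText (for example with its -o option to name the HTML output) to produce the interactive plot.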

Contributors:



Evaluating the best combination of aligners for both single-cell and multi-cell data

Summary:

This project is a general RNA-seq pipeline for comparing single-cell and multi-cell RNA expression data as closely as possible, using a certain combination of commands together with a combination of aligners and mappers.

Contributors:



NCBI_SRA_TO_EBI_FASTQ

Summary:

This Python script downloads .fastq/.fq files from the EBI (http://www.ebi.ac.uk/) using any FTP link from the NCBI (https://www.ncbi.nlm.nih.gov/) that contains .sra (Sequence Read Archive) files. This avoids the need to use the poorly documented fastq-dump (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump) command-line tool and to understand all its parameters. It also avoids unnecessary conversion and wasted CPU cycles.
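
The core of such a lookup can be sketched as below. This is not the script itself. It assumes the directory layout commonly used on the ENA FTP server (the first six characters of the run accession, then, for accessions longer than nine characters, a zero-padded sub-directory built from the trailing digits), and that layout should be checked against the current ENA documentation. The function names and the example NCBI link are hypothetical.

```python
import re

# Assumed root of the ENA FASTQ tree on the EBI FTP server.
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def run_accession_from_link(link):
    """Extract a run accession (SRR/ERR/DRR plus digits) from an NCBI link."""
    match = re.search(r"[SED]RR\d{6,}", link)
    if match is None:
        raise ValueError(f"no run accession found in {link!r}")
    return match.group(0)


def ena_fastq_dir(accession):
    """Build the assumed ENA FTP directory for a run accession."""
    prefix = accession[:6]
    extra_digits = len(accession) - 9
    if extra_digits <= 0:
        # Nine-character accessions have no extra sub-directory.
        return f"{ENA_FASTQ_ROOT}/{prefix}/{accession}"
    # Longer accessions: trailing digits, zero-padded to three characters.
    subdir = accession[-extra_digits:].zfill(3)
    return f"{ENA_FASTQ_ROOT}/{prefix}/{subdir}/{accession}"


# Hypothetical NCBI link, for illustration only.
ncbi_link = ("ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/"
             "SRR/SRR101/SRR1016916/SRR1016916.sra")

accession = run_accession_from_link(ncbi_link)
print(ena_fastq_dir(accession))
```

The returned directory would typically hold one .fastq.gz file for single-end runs or a _1/_2 pair for paired-end runs, which can then be fetched with any FTP client.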

Contributors:



Re-estimating the reproduction numbers of Ebola virus disease (EVD) during the 2014-2016 outbreak in West Africa using a complete dataset.

Summary:

This project aimed to obtain better estimates of the reproduction numbers of Ebola after the 2014-2016 outbreak in West Africa. The original study could not use a complete dataset because the outbreak was still ongoing. We repeated that study and recalculated the reproduction numbers of the outbreak.
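
For readers unfamiliar with reproduction numbers, one common way to estimate them is from the exponential growth rate of case counts. The sketch below is not necessarily the method used in the original study or in this project. It fits a log-linear growth rate r and converts it with R = (1 + r * mean / shape) ** shape, which holds for a gamma-distributed generation time. The case counts are synthetic, and the generation-time mean and shape are placeholder values.

```python
import math


def growth_rate(days, cases):
    """Least-squares slope of log(cases) against time (per day)."""
    logs = [math.log(c) for c in cases]
    mean_x = sum(days) / len(days)
    mean_y = sum(logs) / len(logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx


def reproduction_number(r, mean_generation_time, shape):
    """R for growth rate r and a gamma-distributed generation time."""
    return (1 + r * mean_generation_time / shape) ** shape


# Synthetic case counts growing at exactly 5% per day (not real outbreak data).
days = list(range(10))
cases = [10 * math.exp(0.05 * d) for d in days]

r = growth_rate(days, cases)
# Placeholder generation time: mean 15 days, shape 1 (exponential).
R = reproduction_number(r, mean_generation_time=15.0, shape=1.0)
print(round(r, 3), round(R, 2))
```

With a complete dataset, the same calculation can be repeated per country or per time window to see how the estimate changes once late-reported cases are included.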

Contributors:



LeapingPymol

Summary:

This project aims to combine the Leap sensor with PyMOL using the Leap SDK and Python 3.

Contributors: