Gene expression in individual cells can now be measured for thousands of cells in a single experiment thanks to innovative sample-preparation and sequencing technologies. State-of-the-art computational pipelines for single-cell RNA-sequencing data, however, still employ computational methods that were developed for traditional bulk RNA-sequencing data, thus not accounting for the peculiarities of single-cell data, such as sparseness and zero inflated counts. Here, we present a ready-to-use pipeline named gf-icf (Gene Frequency – Inverse Cell Frequency) for normalisation of raw counts, feature selection and dimensionality reduction of scRNA-seq data for their visualization and subsequent analysis. Our work is based on a data transformation model named Term Frequency - Inverse Document Frequency (TF-IDF), which has been extensively used in the field of text-mining where extremely sparse and zero-inflated data are common. Using benchmark scRNA-seq datasets, we show that the gf-icf pipeline outperforms existing state-of-the-art methods in terms of improved visualisation and ability to separate and distinguish different cell types.
Read more.. about GF-ICF on Front. Genet. - Statistical Genetics and Methodology journal