Introduction

Research on long non-coding RNAs (lncRNAs), a previously unsuspected major output of genomes of complex organisms, has been dogged by uncertainty and controversy from its beginning. lncRNAs have the unfortunate distinction of being named for what they are not, rather than what they are. This loose description has its origins in the belief that the main role of RNA is to act as the intermediate between a gene and a protein, with other ‘housekeeping’ non-coding RNAs such as ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), small nucleolar RNAs (snoRNAs), spliceosomal RNAs and other small nuclear RNAs (snRNAs) being ancillary to this function.

Broad recognition of RNA as a regulatory molecule occurred in the early years of the first decade of the twenty-first century with the unexpected discovery of large numbers of small interfering RNAs (siRNAs), microRNAs (miRNAs) and small PIWI-interacting RNAs (piRNAs) that regulate — through Argonaute family proteins — gene expression at transcriptional, post-transcriptional and translational levels in eukaryotes1, although there were examples of other small regulatory RNAs in the literature, especially in bacteria2. A few long regulatory RNAs, notably meiRNA in the fission yeast Schizosaccharomyces pombe, hsrω, RNA on the X1 (roX1) and roX2 in Drosophila melanogaster, and H19 and X-inactive-specific transcript (XIST) in mammals, had also been reported in the preceding years3,4,5,6,7, but were regarded more as oddities than early examples of a general phenomenon. Moreover, the small regulatory RNAs did not disturb the conceptual framework that most genes encode proteins, but rather fitted comfortably into it. It was later found, however, that while some miRNAs are generated from the introns of pre-mRNAs8, non-coding primary transcripts of miRNAs and of snoRNAs can also have functions9,10 and that rRNAs, tRNAs and snoRNAs are processed to generate small regulatory RNAs, including miRNAs11,12,13,14, in some cases contributing to transgenerational epigenetic inheritance15.

A bigger surprise, and challenge to the reigning understanding of genetic information, came in the early and middle years of the first decade of the twenty-first century, when global transcriptomic analyses, intended to better define the proteome, revealed that most of the genome of animals and plants is dynamically transcribed into longer RNAs that have little or no protein-coding potential16,17,18,19. This surprise was compounded by the associated finding that the number, and to a large extent the repertoire, of protein-coding genes is similar in animals of widely different developmental and cognitive complexity: the nematode worm Caenorhabditis elegans (comprising ~1,000 somatic cells) and humans (~30 × 10¹² somatic cells20) both have ~20,000 protein-coding genes, an observation that was termed the ‘g-value paradox’21. By contrast, the extent of non-coding DNA, and consequently the transcription of non-coding RNAs, has increased with increasing developmental complexity22.

Understandably, the common initial reaction of the molecular biology community was to suspect that these unusual RNAs are transcriptional noise, because of their generally low levels of sequence conservation, low levels of expression and low visibility in genetic screens. Since then, however, there has been an explosion in the number of publications reporting the dynamic expression and biological functions of lncRNAs, aided by extensive technology development that has enabled their identification and characterization, although only a minority of lncRNAs have confident annotations and very few have mechanistic information. The realization that the genomes of plants and animals express large numbers of lncRNAs requires a framework for their classification and understanding of their functions and, more profoundly, a reassessment of the amount and type of information required to programme the development of complex organisms.

Purpose of this Consensus Statement

In this Consensus Statement we present a current and coherent picture of the roles of lncRNAs in cell and developmental biology, identify the key issues in understanding their functions and chart the path forward. We address lncRNA definition, nomenclature, conservation, expression, phenotypic visibility, functional assays and molecular mechanisms encompassing lncRNA connections to chromatin architecture, epigenetic processes, enhancer function and biomolecular condensates, as well as the roles of lncRNAs outside the nucleus. We argue that loci expressing lncRNAs should be recognized as bona fide genes and discuss lncRNA structure–function relationships as the means to parse mechanisms and pathways. Finally, we identify the current challenges and offer recommendations for understanding the relationship of lncRNAs to genome architecture, gene regulation and cellular organization.

The authors of this Consensus Statement were chosen on the recommendation of colleagues. Consensus was reached through group e-mail and discussion.

Definition and nomenclature of lncRNAs

lncRNAs have been arbitrarily defined as non-coding transcripts of more than 200 nucleotides (200 nt), which is a convenient size cut-off in biochemical and biophysical RNA purification protocols that deplete most infrastructural RNAs, such as 5S rRNAs, tRNAs, snRNAs and snoRNAs, as well as miRNAs, siRNAs and piRNAs23. This definition also excludes some other well-known short RNAs such as the primate-specific snaRs (~80–120 nt), which associate with nuclear factor 90 (ref. 24); Y RNAs (~100 nt), which act as scaffolds for ribonucleoprotein (RNP) complexes25; vault RNAs (88–140 nt), which are involved in transferring extracellular stimuli into intracellular signals26; and promoter-associated RNAs and non-canonical small RNAs produced by post-transcriptional processing27,28,29. Other non-coding RNAs lie close to the 200-nt border, such as 7SK (~330 nt in vertebrates), which controls transcription poising and termination, including at enhancers30,31, and 7SL (~300 nt), which is an integral component of the signal recognition particle that targets proteins to cell membranes32 and the evolutionary ancestor of the widespread primate Alu (~280 nt) and rodent B1 (~135 nt) small interspersed nuclear elements33,34,35. Given this grey zone of sizes, we support the suggestion that non-coding RNAs be divided into three categories36: (1) small RNAs (less than 50 nt); (2) RNA polymerase III (Pol III) transcripts (such as tRNAs, 5S rRNA, 7SK, 7SL, and Alu, vault and Y RNAs37), Pol V transcripts in plants and small Pol II transcripts such as (most) snRNAs and intron-derived snoRNAs38,39 (~50–500 nt); and (3) lncRNAs (more than 500 nt), which are mostly generated by Pol II.
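The three size categories proposed above can be sketched as a simple length-based lookup. This is an illustrative simplification only: the function and class names are hypothetical, the boundaries (50 nt and 500 nt) are taken from the text, and the second category is in practice also defined by the transcribing polymerase, which length alone cannot capture.

```python
from enum import Enum


class NcRnaSizeClass(Enum):
    """Size categories for non-coding RNAs, following the text's proposal."""
    SMALL = "small RNA (less than 50 nt)"
    INTERMEDIATE = "intermediate-size RNA (~50-500 nt)"
    LNCRNA = "lncRNA (more than 500 nt)"


def classify_by_length(length_nt: int) -> NcRnaSizeClass:
    """Assign a size category from transcript length alone (an assumption:
    the text also uses polymerase origin for the middle category)."""
    if length_nt <= 0:
        raise ValueError("transcript length must be a positive number of nucleotides")
    if length_nt < 50:
        return NcRnaSizeClass.SMALL
    if length_nt <= 500:
        return NcRnaSizeClass.INTERMEDIATE
    return NcRnaSizeClass.LNCRNA


# Approximate lengths quoted in the text for RNAs near the 200-nt border.
examples = {"7SK": 330, "7SL": 300, "Alu": 280, "B1": 135, "Y RNA": 100}
for name, length in examples.items():
    print(name, length, classify_by_length(length).name)
```

Under these boundaries, every RNA in the grey zone around 200 nt falls into the intermediate category rather than being split across the small RNA and lncRNA classes, which is the motivation for the three-way division.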

Many lncRNAs are spliced and polyadenylated, which has led to their description as ‘mRNA-like’. However, other lncRNAs are not polyadenylated or 7-methylguanosine capped19,40,41,42, are expressed from Pol I (5.8S, 28S and 18S rRNAs) or Pol III promoters, or are processed from precursors, including from introns and repetitive elements, leading to the more agnostic descriptor ‘transcripts of unknown function’43. With respect to protein-coding genes, lncRNAs can be ‘intergenic’, antisense or intronic. They are also derived from ‘pseudogenes’, which occur commonly in metazoan genomes44, with more than 10,000 pseudogenes identified in the mouse genome45 and almost 15,000 identified in the human genome46, some of which have been shown to be functional44,47. lncRNAs also include circular RNAs generated by back-splicing of coding and non-coding transcripts, also with demonstrated functions48, and trans-acting regulatory RNAs derived from sequences that conventionally act as the 3′ untranslated regions of mRNAs49.

There have been many attempts at nomenclature and classification of lncRNAs, by the HUGO Gene Nomenclature Committee, the GENCODE consortium and others, predominantly based on their genomic position and orientation relative to protein-coding genes46,50,51,52,53. Linking to nearby genes has been useful, as it provides context and has sometimes provided clues to lncRNA function, for example in regulating the expression of these genes, as is often the case with enhancers (see later), although enhancer activity should not be assumed to be directed to the most proximal genes.

Many early studies focused on long intergenic non-coding RNAs (lincRNAs), whose sequences do not trespass on nearby protein-coding loci, owing to the need to distinguish their function from that of proteins. However, many other lncRNAs overlap protein-coding loci or are expressed from enclosed introns. Moreover, the traditional view of genomes as linear arrangements of discrete protein-coding genes fails to accommodate the discovery that eukaryotic transcription, best characterized in human and model organisms, is a fuzzy continuum54, with ‘genes’ within genes, genes interleaved with other genes and non-coding transcripts overlapping or originating within them18,43,55, together posing a growing problem for genome annotations.

In both humans and D. melanogaster, for example, many protein-coding genes have 5′ exons that are incorporated into mRNA in early embryogenesis and lie hundreds of kilobases upstream of the usual first exon, bypassing many other genes in the intervening region56. Indeed, any base may be exonic, intronic or ‘intergenic’, depending on the transcriptional output of the cell at any point in its developmental trajectory or physiological state55. For this reason, unless a lncRNA is antisense to a protein-coding gene, we recommend naming lncRNAs for their own sake with allusion to a discerned characteristic or function (as has been traditional for proteins), such as XIST, antisense IGF2R non-protein-coding RNA57 (AIRN), HOX antisense intergenic RNA58 (HOTAIR), Gomafu (‘spotted pattern’ in Japanese; also known as Miat)59, COOLAIR (referring to plant vernalization)60 and auxin-regulated promoter loop61 (APOLO), for easy recollection, preferably accompanied by complete exon–intron structures and genomic coordinates. If no biological context is available, we recommend naming the lncRNA according to the GENCODE system46.

The wide range of functions of ‘non-coding’ RNAs precludes straightforward classification as specific RNA classes, with some acting locally and some at a distance, or both62. In the absence of more specific categorization, we recommend retention of the general descriptor ‘lncRNA’, noting that most have some type of regulatory or architectural, often related, role in cell and developmental biology, and because there are so many historical articles that use this term or variations thereof. Non-coding RNAs come in all shapes and sizes, and the territory is huge, covering most of the genome and a plethora of functions. Some RNAs have dual functions as coding and regulatory RNAs, and some, perhaps many, cytosolic lncRNAs encode small peptides63,64,65,66. Protein-coding loci also express lncRNAs through alternative splicing67,68,69, and, surprisingly, the major transcript produced by ~17% of human protein-coding loci is non-coding70. Indeed, both lncRNA genes and mRNA genes can produce transcripts that function following different levels of processing. Unspliced transcripts, spliced transcripts, circular RNAs, intronic RNAs and stable small RNAs generated from them can all have a function48,71,72. Any RNA can be regulatory, and any locus can encode both protein-coding and regulatory RNAs.

Well in excess of 100,000 human lncRNAs have been recorded52,73, many of which are specific to the primate lineage74. This is a vastly incomplete list due to the limited analysis of different cells at different developmental stages (see later). There are now hundreds of thousands of catalogued lncRNAs and dozens of databases (and databases of databases) with curated information75,76,77,78,79,80. Over the past decade, there have been ~50,000 publications with ‘long non-coding RNA’ as a key term and more than 2,000 publications reporting validated lncRNA functions81, although most have yet to be followed up in any detail.

From here on, we focus on lncRNAs derived from Pol II primary transcription units (and use the term in that context), as opposed to other non-coding RNAs that are expressed from Pol I or Pol III promoters, processed from introns (which, it should be noted, constitute a major fraction of the non-coding RNA in mammals and other organisms41,82,83,84) or formed by back-splicing, although many of the same considerations apply.

Conservation of lncRNAs

Most lncRNAs are less conserved among species than the mRNA sequences encoding the proteome. Initially, most of the mammalian genome (which included most lncRNA loci) was thought to be evolving neutrally, using the yardstick of the rate of divergence of common ‘ancient repeats’ (derived from transposons) between the human and mouse genomes, on the assumption that these sequences are non-functional and representative of the original distribution in the ancestor85. However, there is increasing evidence that transposable elements are widely co-opted as functional elements of gene expression and structure, forming promoters, regulatory networks, exons and splice junctions in protein-coding genes and lncRNAs86,87,88,89, and therefore cannot be used as indices of neutral evolution.

Regulatory sequences, including promoters and lncRNAs, are known to evolve rapidly due to more relaxed structure–function constraints than protein-coding sequences and due to positive selection during adaptive radiation85,90,91,92. Many lncRNAs are cell lineage specific. Indeed, given their association with developmental enhancers (see later), variation in the complement and sequences of lncRNAs may be a major factor in species diversity.

Loci expressing lncRNAs exhibit many of the characteristics of protein-coding genes, including promoters, multiple exons, alternative splicing, characteristic chromatin signatures, regulation by morphogens and conventional transcription factors, altered expression in cancer and other diseases74,93,94,95,96,97,98, and a range of half-lives similar to those of mRNAs99.

The promoters of lncRNAs exhibit levels of conservation comparable to those of protein-coding genes18,74. lncRNAs also have conserved exon structures, splice junctions and sequence patches18,74,93,97, and they retain orthologous functions despite rapid sequence evolution100,101,102. Indeed, low sequence conservation can be misleading.

The lncRNA telomerase RNA template component (TERC), which is required for telomere maintenance — a vital cellular function — differs widely in size and sequence, but has conserved structural topology from yeast to mammals, albeit with some variation, and a conserved catalytic core103,104,105,106,107,108 (see also later). X chromosome dosage compensation in Drosophila spp. requires the formation of a nuclear domain through phase separation by the lncRNAs roX1 and roX2 interacting with the intrinsically disordered region (IDR) of a specific partner protein, male sex lethal 2 (MSL2). Replacing the IDR of the mammalian orthologue of MSL2 with that of the D. melanogaster protein and expression of roX2 is sufficient to nucleate ectopic X chromosome dosage compensation in mammalian cells, showing that the roX–MSL2 IDR interaction is the primary determinant of compartmentalization of the X chromosome and that such interactions are preserved over vast evolutionary distances109. Similar processes are involved in the regulation of X chromosome dosage compensation in placental mammals by XIST, which performs several functions, including repulsion of euchromatic factors, scaffolding of new heterochromatic factors and reorganization of chromosome structure110,111,112,113.

Expression

Although there are exceptions (such as metastasis-associated lung adenocarcinoma transcript 1 (MALAT1; also known as NEAT2), which is one of the most abundant Pol II transcripts in vertebrate cells114, and nuclear paraspeckle assembly transcript 1 (NEAT1); see later), lncRNAs generally show more restricted expression patterns than mRNAs74,115, and are often highly cell specific116, which is consistent with a role in the definition of cell state and developmental trajectory. They also have specific subcellular locations, often nuclear, although a large fraction is cytoplasmic75. Although it is sometimes asserted that there are a few hundred cell types in a human, broad classifications obscure the fact that each cell occupies a precise place in a developmental ontogeny, illustrated by the differential expression of HOX genes in superficially similar skin cells in different regions of the body117, and by the expression of lncRNAs in various regions of the brain118,119,120,121 and at different stages of development122. lncRNAs are also dynamically expressed during differentiation of mammalian stem, muscle, mammary gland, immune and neural cells, among many others81,116, with a transition during development from broadly expressed and conserved lncRNAs towards an increasing number of lineage-specific and organ-specific lncRNAs123. lncRNA expression can also be strongly influenced by environmental factors, a feature that is especially prominent in plants124,125,126; in animals, such influences include a range of stress responses and drug resistance in cancer127,128,129,130,131,132,133.

The restricted expression of lncRNAs in different cells at different stages of development and their generally low copy number (owing to their regulatory nature) accounts for their sparse representation in bulk-tissue RNA sequencing datasets134, whereas many lncRNAs are relatively easy to detect in particular cells118. The undersampling of lncRNAs is now being rectified by targeted capture98,135, advanced imaging136,137,138, spatial transcriptomics139 and, in some cases, single-cell sequencing120,121,140, which make it clear that, whereas ~20,000 human lncRNA loci have been identified by GENCODE46 and ~30,000 by the FANTOM consortium141, there is likely at least an order of magnitude more.

Due to the high complexity and the variation in transcription initiation and termination sites, expression levels and splicing, comprehensive characterization of transcriptomes is extremely challenging. A recent study showed that the low expression of a lncRNA can be essential for its functional role by ensuring specificity to its regulated targets, suggesting that low abundance levels may be an essential feature of how lncRNAs work142. To fully catalogue the universe of lncRNAs, and properly record their exon–intron organization and splice variants, high-depth sequencing will need to be performed on cells at all stages of differentiation and development, undergoing different neural, immunological and other physiological processes, and in various disease states. This is a huge task, but we recommend that future gene expression profiling should include full transcript analysis not just of mRNAs but also of small RNAs and lncRNAs that are intergenic, antisense and intronic to the annotated genes, and their stoichiometry143.

Phenotypic visibility

Like miRNAs, most lncRNAs have not been identified in genetic screens. There are two reasons for this. First, most genetic screens historically focused on protein-coding mutations, which often have severe consequences that are easy to track; by contrast, regulatory mutations often have subtle consequences that affect quantitative traits. Second, it is difficult to identify causal mutations among the many variations that occur in non-coding sequences. Indeed, most variations that influence human quantitative traits and complex disorders occur in non-coding regions, which are replete with genes expressing lncRNAs144,145 that are transcribed in cell types relevant to the associated trait141,146.

There are exceptions of lncRNAs that have been identified genetically, notably the roX1 and roX2 RNAs involved in X chromosome activation in male fruitflies5, mammalian parentally imprinted H19, Airn and Kcnq1ot1 RNAs in mice6,57,147,148 and others such as Tug1 in mice149, MAENLI (ref. 150) and HELLP (named for ‘haemolysis, elevated liver enzyme levels and low platelet count’; also known as HELLPAR)151, which are associated with disorders or developmental processes. In Arabidopsis thaliana, non-coding intronic single-nucleotide polymorphisms important for flowering-time adaptation were found to alter the splicing of the lncRNA COOLAIR152.

Many lncRNAs have been associated with the cause and progression of cancers, through altered expression of and/or mutations (including translocation breakpoints) in lncRNAs that act as oncogenes or tumour suppressors153,154,155. Other lncRNAs are involved in human genetic disorders81,156,157, including DiGeorge syndrome and other neurodevelopmental and craniofacial defects158,159,160. Phenylketonuria, one of the first documented human genetic disorders, is caused mostly by mutations in the gene encoding the enzyme phenylalanine hydroxylase, but can also be caused by mutations in a lncRNA, the effects of which can be treated with modified RNA mimics161.

A route to analysing lncRNA biological function is to silence or delete, or (less commonly) ectopically express, lncRNAs that have been identified in RNA sequencing datasets, usually as being differentially expressed. There have been problems with the interpretation of such experiments, however, particularly the difficulty of disentangling the loss of lncRNA expression from the loss of DNA regulatory elements162,163, which has been addressed by strategies such as inserting polyadenylation sites for early transcription termination or transcription repression by CRISPR interference (CRISPRi), replacement of the lncRNA with a reporter gene that leaves the promoter intact or deletion of lncRNA exons (although loss of downstream regulatory elements cannot be ruled out), antisense-mediated blockade of lncRNA splice sites, CRISPR–Cas13 targeting of the lncRNA (rather than its DNA sequence) and transgene rescue163,164. There are now many studies that have demonstrated the biological roles of lncRNAs163, and high-throughput loss-of-function reverse genetic screens are increasing the search speed, identifying, for example, lncRNAs that are required for mammalian cell growth and migration, brain, skeletal, lung, muscle and heart development, immune function, epidermal homeostasis and cancer drug responses or lncRNAs that have fitness effects81,165,166,167,168,169,170 (Fig. 1). CRISPRi-mediated transcription repression of more than 16,000 lncRNAs in seven human cell lines identified almost 500 lncRNAs required for normal cellular proliferation, 89% of which were expressed in only one cell type167.
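As a rough back-of-the-envelope reading of the CRISPRi screen figures above, the following sketch computes the implied hit rate and the number of hits restricted to one cell type. The inputs are the rounded values quoted in the text ("more than 16,000" and "almost 500"), so the results are approximations, not the exact counts reported in the cited study; all variable names are illustrative.

```python
# Rounded figures quoted in the text; treat the outputs as estimates only.
lncrnas_targeted = 16_000                # "more than 16,000 lncRNAs"
hits_required_for_proliferation = 500    # "almost 500 lncRNAs"
fraction_single_cell_type = 0.89         # "89% ... in only one cell type"

# Fraction of targeted lncRNAs scored as required for normal proliferation.
hit_rate = hits_required_for_proliferation / lncrnas_targeted

# Approximate number of hits confined to a single cell type.
single_cell_type_hits = round(
    hits_required_for_proliferation * fraction_single_cell_type
)

print(f"approximate hit rate: {hit_rate:.1%}")
print(f"approximate hits in one cell type only: {single_cell_type_hits}")
```

On these rounded inputs, roughly 3% of targeted lncRNAs score as hits, and the large majority of those hits are specific to one cell type, which is consistent with the restricted expression patterns discussed earlier.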

Fig. 1: Visible phenotypes of mutations in long non-coding RNA genes in mice163.

The following long non-coding RNAs (lncRNAs) are listed in the figure underneath their associated phenotypes: Airn, antisense of IGF2R non-protein-coding RNA147,435; Charme, chromatin architect of muscle expression436; Chaserr, CHD2 adjacent, suppressive regulatory RNA437; Fendrr, FOXF1 adjacent non-coding developmental regulatory RNA165,438; Firre, functional intergenic repeating RNA element316; Gaplinc, gastric adenocarcinoma predictive long intergenic non-coding RNA200; H19, clone pH19 (ref. 439); Handsdown, downstream of the protein-coding gene Hand2 (ref. 440); Kcnq1ot1, Kcnq1 overlapping antisense transcript 1 (ref. 441); linc-Brn1b, long intergenic non-coding RNA (lincRNA) downstream of the Brn1 protein-coding gene165; linc-Epav, endogenous retrovirus-derived lncRNA positively regulates antiviral responses442; lincRNA-Cox2, lincRNA downstream of the inflammation response gene Cox2 (ref. 443); lincRNA-Eps, lincRNA involved in erythroid prosurvival201; lnc-Lsm3b, interferon-inducible non-coding splice variant of the U6 small nuclear RNA-associated Sm-like protein lsm3 gene444; Maenli, master activator of engrailed1 in the limb165; Mdgt, midget165; Meg3, maternally expressed gene 3 (also known as Gtl2)445,446; Norad, non-coding RNA activated by DNA damage447; Peril, perinatal lethal long non-coding RNA165; Pnky, pinky (also known as lnc-Pou3f2)448; Tug1, taurine upregulated gene 1 (refs. 165,166,449); Upperhand, lncRNA upstream of the Hand2 cardiomyocyte transcription factor locus318; Xist, X-inactive-specific transcript450. Figure courtesy of Daniel Andergassen and John Rinn.

Phenotypic consequences of mutations in regulatory RNAs, like some protein-coding mutations, may be context dependent and not evident in laboratory conditions, and may be obscured by the robustness of biological systems171. Loss of Malat1, which localizes in nuclear speckles and associates with splicing factors, has no major phenotypes in mice114,172,173,174; however, it does affect cancer progression and synapse formation, among other physiological and pathophysiological processes175,176. Neat1, which is required for the assembly and function of enigmatic, mammal-specific nuclear organelles called ‘paraspeckles’177,178,179, does not appear to be required for normal development in mice but is important for the differentiation of reproduction-related female tissues such as corpus luteum and mammary gland180. Deletion of brain cytoplasmic RNA 1 (BC1), a highly expressed brain lncRNA, is seemingly harmless in mice but results in behavioural changes that would be lethal in the wild181. So extensive phenotyping is important, especially for cognitive functions. Organoid models may help to identify phenotypes in vitro182,183.

Functional annotation of lncRNAs can also be undertaken by molecular phenotyping184. Analysis of expression patterns, lncRNA–chromatin interactions and other molecular indices following CRISPR–Cas13-mediated depletion of more than 400 lncRNAs in culture indicated that lncRNAs regulate many genes involved in development, cell cycle and cellular adhesion, among other processes185.

Biological functions of lncRNAs

Characterized examples have indicated that RNAs participate in virtually all levels of genome organization, cell structure and gene expression, through RNA–RNA, RNA–DNA and RNA–protein interactions, often involving repeat elements88,186,187, including small interspersed nuclear elements in 3′ untranslated regions188. These interactions are involved in the regulation of chromatin architecture and transcription (see later), splicing (especially by antisense lncRNAs)189,190,191, protein translation and localization188,192,193, and other forms of RNA processing, editing, localization and stability194,195.

Many lncRNAs are involved in the regulation of cell differentiation and development in animals and plants23,81,116,124,196. They also have roles in physiological processes such as (in mammals) the p53-mediated response to DNA damage197, V(D)J recombination and class switch recombination in immune cells198, cytokine expression199, endotoxic shock200, inflammation and neuropathic pain201,202,203, cholesterol biosynthesis and homeostasis204,205, growth hormone and prolactin production206, glucose metabolism207,208, cellular signal transduction and transport pathways209,210,211,212, synapse function213,214 and learning215, and have roles in the response to various biotic and abiotic stresses in plants124,125. There is also an emerging association of lncRNAs with the cell membrane216 and with ribozymes217.

Presently, a growing number of lncRNAs have their own stories, and the literature is becoming replete with them. However, several convergent themes are emerging, which explain lncRNA ubiquity and importance in differentiation and development: the association of lncRNAs with chromatin-modifying proteins; the expression of lncRNAs from developmental ‘enhancers’; and the formation of RNA-nucleated phase-separated coacervates.

Control of chromatin architecture

Epigenetic modifications of chromatin supervise differentiation and development in complex organisms218. DNA methylation is known to be directed by small non-coding RNAs in plants219, and the RNAi pathway is required for heterochromatin formation and epigenetic gene silencing in fungi and animals220. The mammalian de novo DNA (cytosine 5)-methyltransferase 3A (DNMT3A) and DNMT3B, but not the maintenance DNA methylase DNMT1, bind siRNAs with high affinity221. In turn, DNMT1 (which restores methylation at hemimethylated CpG dinucleotides following DNA replication) binds lncRNAs to alter DNA methylation patterns at their cognate loci222,223,224, but this is still largely unexplored territory.

There are more than 100 different histone modifications that are differentially established by enzymes at a myriad of different positions in plant and animal genomes to control gene expression during development. The most studied are Polycomb repressive complex 1 (PRC1) and PRC2, which catalyse monoubiquitylation of histone H2A Lys119 (ref. 225) and dimethylation and trimethylation of histone H3 Lys27 (H3K27), respectively, but in mammals neither complex contains sequence-specific DNA-binding proteins218. Early studies suggested that PRC2 and/or the associated H3K9 methyltransferase G9a are recruited during mouse X chromosome inactivation by Xist186, and the control of parental imprinting in mice by Airn226 and Kcnq1ot1 (ref. 227), although these associations involve complexities and uncertainties228,229.

A subsequent survey of more than 3,300 lncRNAs in human cells showed that ~20% (but only ~2% of mRNAs) interact with PRC2, and that other lncRNAs are associated with other chromatin-modifying complexes230. Moreover, depletion of a selection of these RNAs caused derepression of genes normally silenced by PRC2 (ref. 230). PRC2 associates with many RNAs228,231,232, more than 9,000 in embryonic stem cells233. There are conflicting reports of whether these associations are nonspecific (‘promiscuous’)228,234 or specific high-affinity interactions with different RNAs232,235, although these alternatives are not mutually exclusive229. Some recent studies have shown that RNA is required for PRC2 chromatin occupancy, PRC2 function and cell state definition236, and that the interaction of PRC2 with RNA can regulate transcription elongation232. PRC1 function also appears to be controlled by RNA237,238. However, deconvoluting RNA–protein interactions is complicated by the low affinity of many antibodies used in pulldown assays and the fact that PRC2, for example, has at least two subunits that bind RNA228. The recent development of denaturing crosslinked immunoprecipitation (dCLIP), which is based on high-affinity biotin–streptavidin pulldowns, has indicated that PRC2 interacts with G-rich RNA motifs, including RNA G-quadruplexes, to achieve specificity of RNA-mediated recruitment232,239,240.

Other lncRNAs associate with the gene-activating Trithorax complexes (which methylate H3K4), including enhancer RNAs involved in the maintenance of stem cell fates and lineage specification241,242,243,244,245. H3K9 dimethylation is regulated by lncRNAs during the formation of long-term memory in mice246. lncRNAs also control methylation of a number of non-histone proteins involved in animal cell signalling, gene expression and RNA processing247.

Many other proteins involved in modulating chromatin architecture, including HOX proteins, pioneer transcription factors such as NANOG, OCT4 (also known as POU5F1), SOX2 and other high mobility group (HMG) proteins, and proteins of SWI/SNF chromatin remodelling complexes, have only vague or promiscuous DNA sequence specificity248,249,250,251, which indicates that other factors are involved in determining their targets at different stages of cell differentiation and development. Moreover, binding-site selection by the zinc-finger transcription factor CTCF, which, together with cohesin complexes, anchors chromosome loops252, was shown to be controlled by the lncRNA just proximal to Xist (Jpx) during early cell differentiation, thereby regulating chromatin topology on a genome-wide scale253. CTCF binds thousands of RNAs, including Xist, Jpx and the lncRNA Xist antisense RNA (Tsix), which targets CTCF to the X inactivation centre254.

There is abundant evidence that RNA may guide chromatin remodelling complexes, although accessibility dictated by DNA and histone modifications (which are also likely directed by regulatory RNAs) may also have a role. The D. melanogaster Hox protein Bicoid (which controls anterior–posterior patterning) binds RNA through its homeodomain255. SOX2 binds RNA with high affinity through its HMG domain256,257, as do other members of the HMGB family257,258,259.

During mouse embryogenesis, the Sox2 locus also expresses an overlapping lncRNA260, and there are well-documented examples of lncRNAs that interact with SOX2 to regulate pluripotency, neurogenesis, neuronal differentiation and brain development257,261,262,263,264. SWI/SNF nucleosome remodelling complexes are directed to specific sites in chromatin or are antagonized by lncRNAs, including XIST and enhancer RNAs, in a wide range of differentiation processes and cancers251,265,266,267,268,269,270.

The lncRNA MaTAR25, which is overexpressed in mammary cancers, acts in trans to regulate the tensin 1 gene through interaction with the transcription co-activator PURB271. The master transcription factor myoblast determination protein (MYOD), which can reprogramme mammalian fibroblasts into muscle cells and is central to muscle differentiation in vivo, is regulated by lncRNAs272,273,274, as are other aspects of muscle gene expression275. The pioneer transcription factor CBP also binds RNAs, including those transcribed from enhancers, to stimulate histone acetylation and consequently transcription276. Some transcription factors (OCT4, NANOG, SOX2 and SOX9) are also regulated by lncRNAs, including pseudogene-derived lncRNAs277,278,279,280,281, and reciprocally regulate the expression of lncRNAs282. Enhancer-derived lncRNAs also regulate the expression of the nuclear hormone receptor ESR1 (ref. 283) and of CCAAT/enhancer-binding protein-α (CEBPA)284.

Enhancer action

Enhancers are non-coding genomic loci that control the spatiotemporal expression of other genes during development. There appear to be ~400,000 (±100,000) enhancers in the mammalian genome285,286,287,288, sometimes clustered into ‘super-enhancers’ or ‘enhancer jungles’288,289,290,291. Enhancers are thought to function by juxtaposing transcription factors bound at the enhancer promoters with the promoters of target genes292,293.

There is no question that enhancer action alters chromatin topology and may be responsible for the formation of chromatin-loop domains that act as local transcription and splicing hubs294,295. Enhancers are transcribed in the cells in which they are active141,289,296,297,298,299, which has led to uncertainty about whether the resulting RNAs are by-products of the binding of transcription factors or have a role in enhancer activity298.

The latter appears to be the case. The epigenetic landscape of and the features of transcription initiation at the promoters of protein-coding genes and enhancers are almost indistinguishable296,297,298,299,300. Enhancers express bidirectional promoter-associated short RNAs301,302,303, termed ‘eRNAs’, although such short RNAs are not specific to enhancers, as similar bidirectional transcripts are produced from the promoters of protein-coding genes304,305. Also analogously to mRNAs produced from protein-coding genes, enhancers express long (non-coding) RNAs (confusingly also referred to as ‘eRNAs’298,306), and transcription is considered the best molecular indicator of enhancer activity in developmental processes296,297,306,307,308 and cancers288. Moreover, enhancer-lncRNA splicing has been shown to modulate enhancer activity309,310.

Although the extent of congruency of combined genetic and high-depth transcriptomic data is uncertain, as their availability is still limited, the data suggest that many if not most lncRNAs are derived from enhancers141,298 and that lncRNAs are required for enhancer activity163,284,311,312,313,314, examples including the lncRNAs Evf2 (also known as Dlx6os1)315, Firre316, Peril317, Upperhand (also known as Hand2os1)318 and Maenli150 in mice. Enhancer RNA function is fertile ground for investigation, but if enhancer loci are considered bona fide ‘genes’, the g-value paradox (the perceived lack of increase in gene number with developmental complexity) is resolved. It also means that a key development in the evolution of complex organisms was the use of RNA to organize developmental trajectories319. It appears that “every cell type expresses precise lncRNA signatures to control lineage-specific regulatory programs”270, and that cell state during ontogeny is likely directed by lncRNAs.

Formation of biomolecular condensates

The past decade has seen the growing appreciation of the role of biomolecular condensates, or phase-separated domains (PSDs), in the organization of cells and chromatin. These condensates are highly dynamic assemblies with high local concentrations of macromolecules, a feature that promotes functional interactions. The condensates usually contain both RNA and proteins320,321,322, the latter having IDRs, which are the major sites of post-translational modifications323. IDRs interact with and are tunable by many partners324. The fraction of the proteome containing IDRs has expanded with cellular and developmental complexity323, and nearly all proteins involved in the regulation of development, including most transcription factors, histones, histone-modifying proteins, other chromatin-binding proteins, RNA-binding proteins, splicing factors, nuclear hormone receptors, cytoskeletal proteins and membrane receptors, contain IDRs323,325,326,327,328,329,330,331,332.

RNA is crucial for the form, composition and function of phase-separated RNA–protein condensates320,321,322. Specific ‘architectural’ lncRNAs333 associate with nuclear condensates of different half-lives and functionalities, including in centrosomes334, nucleoli335 (the lncRNAs SLERT138 and LETN336), nuclear speckles (the lncRNA MALAT1 (refs. 173,337)) rich in RNA-processing factors, speckle-related condensates that contain the lncRNA Gomafu in mice338,339 and paraspeckles (the lncRNA NEAT1 (refs. 340,341)) (Fig. 2) in vertebrates, as well as polyadenylation complexes342 and other condensates in plants343. RNP condensates also include cytoplasmic membraneless organelles such as P-granules344,345, subcellular-localized translational messenger RNP assemblies346 and synaptic compartments320,322,347. The mammalian cytoplasmic lncRNA NORAD, which is induced by DNA damage and required for genome stability, prevents aberrant mitosis by sequestering Pumilio proteins (which bind many RNAs to regulate stem cell fate, development and neurological functions) into PSDs through its repeat sequences137,348.

Fig. 2: Roles of long non-coding RNAs in nuclear organization.

a, 5′ small nucleolar RNA-capped and 3′-polyadenylated long non-coding RNAs (lncRNAs) (SPAs)42 and small nucleolar RNA-related lncRNAs (sno-lncRNAs)41 accumulate at their sites of transcription and interact with several splicing factors such as RNA-binding protein FOX-1 homologue 2 (RBFOX2), TAR DNA-binding protein 43 (TDP43) and heterogeneous nuclear ribonucleoprotein M (hnRNPM) to form a microscopically visible nuclear body that is involved in the regulation of alternative splicing42. b, The lncRNA functional intergenic repeating RNA element (Firre) is transcribed from the mouse X chromosome and interacts with the nuclear matrix factor hnRNPU to tether chromosome X (chrX), chr2, chr9, chr15 and chr17 into a nuclear domain451,452. c, The lncRNA nuclear paraspeckle assembly transcript 1 (NEAT1) is essential for the formation of paraspeckles178. NEAT1 sequesters numerous paraspeckle proteins to form a highly organized core–shell (dark and light purple, respectively) spheroidal nuclear body453. The middle region of NEAT1 is localized in the centre of paraspeckles, and the 3′-end and 5′-end regions are localized in the periphery453. Different paraspeckle proteins are embedded by NEAT1 into the spheroidal structure in the core region (non-POU domain-containing octamer-binding protein (NONO), fused in sarcoma (FUS) and splicing factor, proline- and glutamine-rich (SFPQ)) or in the shell region (RNA-binding motif protein 14 (RBM14))453. d, The lncRNA metastasis-associated lung adenocarcinoma transcript 1 (MALAT1) is localized at the periphery of nuclear speckles172,454 and is involved in the regulation of pre-mRNA splicing339,455. MALAT1 interacts with the U1 small nuclear RNA (U1 snRNA)428, whereas proteins such as SON DNA- and RNA-binding protein and splicing component 35 kDa (SC35) are localized at the centre of nuclear speckles456. e, The lncRNA CHD2 adjacent, suppressive regulatory RNA (Chaserr) forms a compartment within a region of the mouse chromosome corresponding to a topologically associating domain that includes its own gene as well as the Chd2 gene (encoding chromodomain DNA helicase protein 2 (CHD2))437. Chaserr limits in cis the expression of Chd2, which is important for proper regulation of many genes (not shown). f, The perinucleolar compartment contains the lncRNA pyrimidine-rich non-coding transcript (PNCTR), which sequesters pyrimidine tract-binding protein 1 (PTBP1) and thus suppresses PTBP1-mediated pre-mRNA splicing elsewhere in the nucleoplasm369. The size of nuclear bodies is indicated where relevant457. Figure adapted from ref. 80, Springer Nature. Part e courtesy of Inna-Marie Strazhnik and Mitch Guttman.

It has been proposed that RNAs have a central role in organizing the genome and gene expression by the formation of spatial compartments and transcriptional condensates349,350,351,352,353. Phase separation appears to drive chromatin long-range interactions and to be required for the action of enhancers and super-enhancers328,351,354,355,356,357 as well as for transcription, transcription factors and polyadenylation complexes342,358,359,360,361, although transcription factor hubs have been reported to operate in the absence of detectable phase separation362. PSDs scaffolded by lncRNAs, including repeat-rich RNAs363,364, mediate the formation of heterochromatin353,365,366, euchromatin367, Polycomb bodies368 and alternative splicing369. lncRNAs are a substantial component of rapidly renaturing, repeat-rich RNA (technically termed ‘CoT-1 RNA’), and high-resolution imaging shows many repeat-containing RNAs bound to chromatin, indicating that the collective presence of thousands of lncRNAs serves to counter chromatin condensation364. High-resolution imaging also shows the localization of many lncRNAs in compartments in the nucleus that resemble PSDs136,353. These data all suggest that there are thousands of low copy number lncRNAs involved in the organization of chromosome territories.

lncRNA structure–function relationships

lncRNAs generally range in size from around 1 kb to longer than 100 kb (refs. 370,371) and have a modular structure372,373,374,375. They are often multi-exonic and highly alternatively spliced (Fig. 3a), a feature that was not obvious before the advent of high-depth sequencing98. They also contain a higher proportion of GC–AG splice sites376 and are therefore less efficiently spliced than protein-coding transcripts377,378, which are properties associated with alternative splicing379. Alternative splicing has, unsurprisingly, been shown to alter the function of lncRNAs42,152,380,381.
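
To make the notion of a 'proportion of GC–AG splice sites' concrete, the sketch below classifies introns by their terminal dinucleotides and tallies the fraction in each class. It is a simplified illustration under stated assumptions: intron sequences are given 5′ to 3′ on the transcribed strand, the function names are hypothetical, and the toy introns are invented rather than drawn from any annotation.

```python
from collections import Counter

# Intron classes recognised in this sketch (donor-acceptor dinucleotides).
KNOWN_CLASSES = {"GT-AG", "GC-AG", "AT-AC"}


def splice_site_class(intron: str) -> str:
    """Classify an intron by its first two and last two nucleotides."""
    seq = intron.upper().replace("U", "T")  # accept RNA-style input as well
    if len(seq) < 4:
        return "other"
    pair = f"{seq[:2]}-{seq[-2:]}"
    return pair if pair in KNOWN_CLASSES else "other"


def splice_site_proportions(introns: list[str]) -> dict[str, float]:
    """Return the fraction of introns that fall into each class."""
    counts = Counter(splice_site_class(i) for i in introns)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


# Hypothetical toy introns, invented for illustration only.
toy_introns = ["GTAAGTCCTTTCAG", "GCAAGTTTTTACAG", "GTGAGTCTCCCTAG", "GUAAGUUUUCAG"]
proportions = splice_site_proportions(toy_introns)
```

Applied separately to annotated lncRNA and protein-coding introns, a tally of this kind is what a comparison of GC–AG usage between the two transcript classes would rest on.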

Fig. 3: Modular structures of long non-coding RNAs.

a, Targeted RNA sequencing has revealed that human chromosome 21 (chr21) is pervasively transcribed into long non-coding RNAs (lncRNAs) and that lncRNA exons are almost universally (but not randomly) alternatively spliced to form diverse and complex isoforms98. The circle indicates the fraction of non-coding exons across all chr21 transcripts that are alternatively or constitutively spliced. b, Modular structural domains in lncRNAs that fulfil a range of functions372,373,374,375, including targeting DNA, such as in the case of auxin-regulated promoter loop (APOLO)61; binding other RNAs — for example, terminal differentiation-induced non-coding RNA (TINCR)458, potentially involving RNA-binding proteins such as Staufen 1; and recruitment of proteins — for example, pyrimidine-rich non-coding transcript (PNCTR) recruiting of pyrimidine tract-binding protein 1 (PTBP1) through special RNA motifs369 and X-inactive-specific transcript (XIST) recruiting split ends homologue (SPEN) and Polycomb repressive complex 2 (PRC2), perhaps in concert, which is the subject of active exploration and debate142,397,399,423,424,459. Modular functional domains can be repeated within a lncRNA or in multiple different lncRNAs7,87,186,369,388,391,393,394,395,396,397,398,399,400,401. Figure courtesy of Tim R. Mercer.

Some lncRNAs also exhibit common motifs and motif combinations101. At least 18% of the human genome is conserved among mammals at the level of predicted RNA structure382, and similar and potentially paralogous RNA structures occur at many places throughout the genome383,384. Chemical probing has shown that lncRNAs, including Xist, form complex multidomain structures108,385,386,387,388,389, with chemical data matching data predicted by evolutionary conservation of secondary structure389. Moreover, lncRNAs with similar k-base oligonucleotide (short motif) content have related functions despite their lack of general homology, implying that small sequence elements are also key determinants of lncRNA function390.
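
The idea of comparing lncRNAs by short-motif content rather than by alignment can be sketched as follows. This is a minimal illustration, assuming plain k-mer frequency vectors compared by Pearson correlation; the published approach (ref. 390) may normalize and weight k-mers differently, and the function names and toy sequences here are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq: str, k: int = 3) -> list[float]:
    """Relative frequency of every possible k-mer over A, C, G, U, in fixed order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy sequences: a and b share motifs but are offset relative to
# each other; c is built from unrelated motifs.
a = "AUGGCUAUGGCUAUGGCU"
b = "GGCUAUGGCUAUGGCUAU"
c = "CCCCCACCCCCACCCCCA"
sim_ab = pearson(kmer_profile(a), kmer_profile(b))
sim_ac = pearson(kmer_profile(a), kmer_profile(c))
```

In this toy case the two sequences that share motifs score as similar even though their motifs are not in register, whereas the unrelated sequence does not, which is the property that makes k-mer content useful for transcripts lacking general homology.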

Many lncRNA exons are derived from transposable elements187,391. The most highly conserved sequences in Xist, which has been intensively studied, are its repeats7, whereas its unique sequences have evolved rapidly392, and many of its biological functions, including recruitment of gene-repressive complexes and gene silencing, are mediated through its modular repeat elements142,186,388,393,394,395,396,397,398,399. Transposable element-derived sequences participate in many RNA–protein interactions369,400,401, which leads to the conclusion that repeat structures are common building blocks of lncRNAs87,391,396 and essential components of their function391.

The molecular mechanisms of lncRNA action are unclear. In most well-characterized cases of RNA regulation, such as RNAi, snoRNAs, CRISPR and telomerase, RNA acts as a guide to target effector protein complexes to complementary RNA or DNA sequences. Data on selected lncRNAs (for example, HOTAIR, roX1, roX2, Meg3, Tug1, PARTICLE (also known as PARTCL), PAPAS and KHPS1) indicate that they form triplex structures with DNA at purine-rich GA stretches to recruit chromatin modifiers to specific loci across the genome402,403,404,405,406,407,408, with evidence that triplex formation by lncRNAs is a widespread phenomenon409,410,411. Others, especially antisense lncRNAs, appear to function through RNA–DNA hybrid formation61,412,413, but detail is presently lacking.

lncRNA RNP structure and function have been well characterized in only one instance, the telomerase complex, which has been studied for decades. Telomerase reverse transcriptase (TERT) catalyses the addition of telomere repeats to chromosome ends, and other proteins in the complex provide nuclear localization, stability or recruitment to telomeres or to Cajal bodies. The lncRNA TERC provides the scaffold for assembly of the RNP and the template for DNA polymerization by TERT, and mutations in TERT and TERC are major contributors to the aetiology of cancer and the cause of hereditary disorders such as dyskeratosis congenita103,104,105,106,107,414,415,416.

By contrast, while we know the phenotypes caused by the loss of some lncRNAs, we know almost nothing about how most of them work, although, considering that as recently as 2010 the very existence of pervasive transcription was still a matter of contention417,418,419 and the sheer number of lncRNAs, substantial progress has been made. It is assumed, in our view reasonably, that generally lncRNAs will engage in multilateral interactions similarly to TERC and the telomerase complex108, and there is some evidence to support this assumption in cases such as XIST (Fig. 3b), but the assumption has not yet been rigorously tested. There are promising discoveries, such as the demonstration that conserved pseudoknots in lncRNA Meg3 are essential for stimulation of the p53 pathway420. There is also growing evidence of discrete structural organization in lncRNAs421. Nonetheless, there is a long journey ahead to understand the structure and function of the many thousands of lncRNAs, and their splice variants, in the context of their associated RNP complexes and biomolecular condensates in both the nucleus and the cytoplasm.

Challenges

If the complex ontogenies of animals and, to a lesser extent, plants require a large number of RNAs to guide the epigenetic decisions at each cell division, then it is not surprising that many lncRNAs have common protein-binding modules and specific targeting sequences that vary between different stages of development. The challenge is to define which lncRNAs and modules within them interact with effector proteins and which convey target (DNA or RNA) specificity. The former is complicated by the multisubunit nature of many RNP complexes, but is being addressed by technologies such as iCLIP422, RAP–MS423, ChIRP-MS388 and iDRiP424. Determining target specificity is even more difficult, as specific targeting requires only short stretches of nucleotide complementarity given the strength of RNA–RNA and RNA–DNA interactions425, but it may be tackled by new methods that analyse RNA–chromatin and RNA–RNA interactions, such as GRID-seq426, RADICL-seq427, RIC-seq428 and RD-SPRITE353. Other lncRNAs are localized in cytoplasmic compartments, whose components also need to be characterized.

Understanding the roles of lncRNAs and how they function in dynamic assemblies with other macromolecules will provide a more comprehensive understanding of cell and developmental biology and of gene–environment interactions. Emerging challenges include understanding the roles of lncRNAs and RNA modifications in functional plasticity, especially in the brain, and the dysregulation of these lncRNA-mediated pathways in neurological disorders, cancer and other diseases.

Recommendations

  1. In the absence of more specific categorization, we recommend retention of the general descriptor ‘lncRNA’ for non-coding RNAs greater than 500 nt in length.

  2. Unless a lncRNA is antisense to a protein-coding gene (in which case the designation ‘gene name-AS’ should be used), we recommend naming lncRNAs for their own sake with allusion to a discerned characteristic or function (as has been traditional for proteins), preferably accompanied by complete exon–intron structures and genomic coordinates. If no biological context is available, we recommend naming the lncRNA according to the GENCODE system46.

  3. We recommend that future gene expression profiling should include full transcript analysis of the isoforms and stoichiometry of mRNAs, lncRNAs and small RNAs in cells at different stages of differentiation, and in various physiological and disease states, learning and stress conditions.

  4. These efforts should be complemented by cell-based, organoid-based and in vivo studies using strategies for conditional and tissue-specific or cell type-specific gain-of-function and loss-of-function of lncRNAs.

More broadly, identifying and understanding the roles of lncRNAs and RNA regulatory networks in multicellular development, cell biology and disease will require the following:

  1. The determination of the interplay between lncRNAs, chromatin modifications, proteins and the genome in the assembly of the nuclear domains essential for chromatin organization, enhancer function, transcription and splicing. This effort will require the development of antibodies with high specificity for protein–RNA complexes, and of intracellular RNA-tracking methods429.

  2. The determination of lncRNA localization, structure–function relationships and interactions using a range of sequencing, chemical probing, imaging methods430,431,432,433 and cryogenic electron microscopy434.

  3. The identification and characterization of the many unknown nuclear and cytoplasmic compartments decorated by specific lncRNAs.

  4. Harnessing the power of machine learning to interrogate large genomic, epigenomic, transcriptomic, proteomic and phenomic datasets to identify causal links and pathways.