Learning 3D Semantic Scene Graphs with Instance Embeddings

A 3D scene is more than the geometry and classes of the objects it comprises. An essential aspect beyond object-level perception is the scene context, described as a dense semantic network of interconnected nodes. Scene graphs have become a common representation to encode the semantic richness of images, where nodes in the graph are object entities connected by edges, so-called relationships. Such graphs have been shown to be useful in achieving state-of-the-art performance in image captioning, visual question answering and image generation or editing. While scene graph prediction methods so far focused on images, we propose instead a novel neural network architecture for 3D data, where the aim is to learn to regress semantic graphs from a given 3D scene. With this work, we go beyond object-level perception, by exploring relations between object entities. Our method learns instance embeddings alongside a scene segmentation and is able to predict semantics for object nodes and edges. We leverage 3DSSG, a large scale dataset based on 3RScan that features scene graphs of changing 3D scenes. Finally, we show the effectiveness of graphs as an intermediate representation on a retrieval task.


Introduction
Rapid progress has been made in digitizing the real world in 3D with data obtained from cameras, scanners and depth sensors. Advanced 3D reconstruction algorithms paired with recent 3D sensor technology are able to robustly scan complex environments. Naturally, the focus of the research community shifted from capturing basic geometric properties towards extracting more abstract scene representations, motivated by the wealth of applications that require such high-level understanding. The fields of application range from robotics in unstructured environments and autonomous driving, through Augmented and Mixed Reality for gaming or education, to generating scene layouts for interior design and architecture. Understanding the 3D surroundings to a degree that allows autonomous interaction or sophisticated augmentation requires robustly extracting semantic details, such as scene parts and objects, together with their geometry and attributes (e.g., pose), as well as the relationships among them. This aspect has often been overlooked due to its inherent complexity.
Google Inc., Zürich, Switzerland
The research community recently focused on a variety of perception tasks, including 3D object detection (Zhou and Tuzel 2017) and recognition (Su et al. 2015; Song et al. 2015), instance segmentation (Hou et al. 2018; Lahoud et al. 2019; Thomas et al. 2019; Yi et al. 2019), 3D shape prediction (Najibi et al. 2020) as well as classification and semantic segmentation (Rosinol et al. 2020a; Qi et al. 2017a, b; Dai and Nießner 2018; Rethage et al. 2018; Liu et al. 2020). While these methods have the objective of obtaining object knowledge, contextual data is mainly used to advance object-level understanding and the semantics of the relationships themselves are mostly neglected. A direction worth noting here is methods that either estimate scene layouts or perform holistic scene parsing (Nie et al. 2020). Rather than focusing on the semantic aspects, they estimate the geometric properties of the environment as well as the individual pose of the scene entities. Scene graphs are abstract representations that store the semantics of a scene, where the graph nodes are scene entities and their connections are meaningful relationships between them, e.g. support relations (Nathan Silberman Derek Hoiem and Fergus 2012). Such a representation is frequently used in the image domain for higher-level tasks such as partial (Wang et al. 2014) and full image retrieval (Johnson et al. 2015), image generation (Johnson et al. 2018) or even manipulation (Mittal et al. 2019; Dhamo et al. 2020). While 2D scene graph datasets such as Visual Genome (Krishna et al. 2017) or NYUv2 (Nathan Silberman Derek Hoiem and Fergus 2012) are widely available and feature relationships between scene instances and often instance attributes, scene graphs in 3D have not been explored much.
Fig. 1 ... we leverage a 3D network to learn semantics and instance embeddings (center) that encode the points in the scene. We then infer a scene graph G by feeding these features into a graph prediction module that predicts class labels for instances and edges (right)
Although 3D graphs have been used in computer graphics for decades to store 3D mesh data, the respective edges usually do not represent semantic connections but rather relative transformations, such that when a parent node is relocated, the change is applied in a hierarchical fashion to all child nodes. Only recently, semantic scene graphs have started to emerge in the 3D context (Gay et al. 2018; Armeni et al. 2019; Rosinol et al. 2020b). Armeni et al. (2019) construct graphs for buildings, including rooms, major objects, camera views and the relations between these entities. Rosinol et al. (2020b) incorporate dynamics into this representation by additionally considering moving humans. Both Armeni et al. (2019) and Gay et al. (2018) propose multi-view graph prediction methods, based on 2D masks (Armeni et al. 2019) and object detection networks (Gay et al. 2018), respectively. They estimate graphs from images while we operate on 3D data directly.
In this work, we explore semantically rich 3D graphs similar to what has been successfully proposed and implemented in the image domain. We introduce a novel method based on sparse convolutions to predict 3D scene graphs from 3D data directly, to ultimately gain high-level knowledge that goes beyond object understanding. Our method learns an instance embedding alongside semantic segmentation and is capable of predicting the class labels of both object nodes and edges by directly feeding the scene features into a graph prediction module, see Fig. 1. Notably, our network does not require any knowledge of the scene, e.g. any segmentation, at test time. For training and evaluation purposes, we utilize 3DSSG, a large-scale dataset based on 3RScan that features rich scene graphs of changing 3D scenes. 3DSSG describes the semantics of scene entities and their attributes as nodes and relationships as edges. For research purposes, the dataset is publicly available for download. Furthermore, we open source our scene graph prediction method.
The scene graphs in this work are semantically rich and particularly dense. This implies that all object instances, e.g. chairs, sofas or bags, as well as the structural components of a room, e.g. the floor, different walls or the ceiling, are represented as independent nodes in the graph. For structures, this specifically means that each planar entity is represented as a different instance in the scene. A regular room, see Fig. 4, consists of 1 floor and 4 walls while a multi-floor scan has at least two floor instances and several walls. The nodes are described by attributes such as the color, shape or affordances and the connections between them are semantically meaningful relationships, e.g. lying on or same as, see Fig. 2. Notably, this scene graph representation is inspired by the image graphs proposed by Johnson et al. (2015). In contrast to images, the dimensionality and context of 3D data are quite comprehensive, resulting in large-scale scene graphs. Despite this, or because of it, we believe graphs are particularly suited for 3D since they are a human-readable, compact representation that includes all major scene information. However, learning scene graphs from 3D data turns out to be quite challenging since it requires handling not only real-world data with noise but also ambiguities in the node and relationship descriptions. This includes various scanning patterns and clutter as well as data labels that might not be unique and distinctive. The surface of a blanket that covers a tidy bed is technically also part of the bed and might sometimes even be labelled as such. In fact, the blanket in Fig. 2 lying on the couch gives us a hint about its class, while it might be identified as a towel if found in a bathroom. On the other hand, a set of chairs that look alike could vary drastically in appearance if occluded, and their neighborhood and connections differ depending on where they are positioned. We believe graphs can be particularly beneficial in changing indoor scenes, e.g. when matching a single 2D image against a pool of 3D scenes taken at a different time, possibly including lighting and object changes such as rigid and non-rigid changes and even the (dis-)appearance of scene entities. We demonstrate how they are effectively used as an intermediate representation when computing scene similarity. Furthermore, our experiments show how they are fundamentally resilient to dynamic environments.
Fig. 2 Scene Graph Representation in 3DSSG consists of hierarchical class labels and attributes of scene nodes as well as semantic relationships between them. A scene graph tuple connects a subject with an object node with a predicate
In summary, our contributions are three-fold: (a) we propose an embedding-based method to learn semantic scene graphs from a raw 3D point cloud. (b) We further publish scene graphs for the localization benchmark RIO10, complementary to the large-scale 3D scene graph dataset 3DSSG (Wald et al. 2020a); see https://waldjohannau.github.io/RIOSG. The datasets are an extension of RIO10 (Wald et al. 2020b) and 3RScan (Wald et al. 2019) and include graph annotations in the form of relationships, instance attributes and class label hierarchies for each instance. (c) We finally show the effectiveness of such graphs in a retrieval application. Compared to our previous publication (Wald et al. 2020a), we propose a new method that can predict 3D scene graphs from 3D scenes directly, not requiring any prior knowledge such as segmentation masks. While Wald et al. (2020a) use a PointNet backbone to encode objects and relationships based on the segmented point cloud, we utilize a 3D backbone architecture based on sparse convolutions. We incorporate surface normals and color to learn semantic features and an embedding space for node segmentation. While Wald et al. (2020a) assume a ground truth class-agnostic segmentation, we initialize graph nodes with segmented clusters, making the method applicable in real-world setups.

Related Work
Semantic Scene Graphs and Images. Scene graphs were originally introduced by Johnson et al. (2015) with a novel dataset of 5,000 images and are today, also thanks to the success of Visual Genome (Krishna et al. 2017), a common, compact representation for many scene understanding tasks. By definition, the scene entities are grounded to different regions of the image and, while Visual Genome is quite large, the edges that describe the connections between nodes are rather sparse. On top of that, attributes highlight the properties of the object in more detail but are rarely used in practice. The effectiveness of scene graphs has been demonstrated when solving different scene understanding tasks including image retrieval (Liu et al. 2007; Johnson et al. 2015), scene captioning, visual question answering (Teney et al. 2017) or image generation from graphs alone (Johnson et al. 2018), interactively (Ashual and Wolf 2019) or for image editing tasks (Mittal et al. 2019; Dhamo et al. 2020). Many of these methods either rely on or build upon image-based scene graph prediction, a particularly well studied problem (Lu et al. 2016a; Peyre et al. 2017; Xu et al. 2017; Newell and Deng 2017; Li et al. 2017; Yang et al. 2018; Zellers et al. 2018; Li et al. 2018c; Herzig et al. 2018; Zareian et al. 2020). Classical approaches usually follow a multi-stage process: first, nodes are initialized with an off-the-shelf object detector, such as Faster R-CNN (Ren et al. 2015). In the second stage, the predicates are predicted based on object proposals. This stage is commonly designed as a predicate/relationship classification task that takes features of the entities as input. The features for the nodes and edges are either low-level features, e.g. the bounding boxes and their relative configuration, features directly extracted from a CNN (Xu et al. 2017; Yang et al. 2018) or a combination of these (Peyre et al. 2017). Lu et al. (2016a) go beyond visual features and incorporate linguistic knowledge by leveraging language priors when predicting relationships. To improve efficiency, Yang et al. (2018) prune relationships and only keep meaningful tuples before computing the predicates. Li et al. (2018c) instead propose a bottom-up clustering method to factorize the scene into sub-graphs while maintaining spatial information. Relationship prediction, often simply termed visual relationship detection (Lu et al. 2016a; Peyre et al. 2017), is commonly implemented as a local process and computed independently for object pairs.
Recently, some methods build a graph to iteratively refine the edge and/or node features, e.g. using attentional graph neural networks (Yang et al. 2018) or message passing with a recurrent neural network (Xu et al. 2017). Similarly, in Li et al. (2017), parallel and sequential message passing is used for information propagation among objects and relationships, while other works demonstrate the importance of permutation invariance (Herzig et al. 2018), or suggest embedding-based architectures (Newell and Deng 2017) or a graphical contrastive loss. In contrast to all of the above, Zareian et al. (2020) propose to learn how to bridge scene graphs and knowledge graphs by means of an iterative graph-based neural network.
Notably, datasets such as Visual Genome (Krishna et al. 2017) or Visual Relationship Detection (VRD) (Lu et al. 2016a) offer scene graphs for a fairly large number of images, enabling the implementation and evaluation of the aforementioned methods. Many leverage deep learning and therefore require large quantities of training data. Even though impressive progress has been made in the image domain in the last few years, scene graph prediction is still considered a challenging task, due to the complexity and interdependence of object detection and relationship/predicate prediction. Besides the lack of a large-scale 3D graph dataset, many of the presented concepts are not directly transferable to 3D due to the complexity and memory restrictions when using higher dimensional data.
Learning 3D Semantics and Instances. 3D scene understanding involves the extraction of knowledge from 3D environments, including their objects and structure, their categories and their spatial and semantic relationships with each other. One of the most common 3D scene understanding tasks is 3D semantic segmentation, where a single label from a fixed set of classes is assigned to each voxel or point of the 3D scene. Early methods process the dense volumetric data directly in the form of occupancy grids or TSDF volumes (Dai et al. 2017). Dai and Nießner (2018) show that incorporating multi-view features is beneficial, while Huang et al. (2019) utilize the textured 3D mesh directly. More recently, Kundu et al. (2020) have proposed a virtual multi-view fusion technique that, compared to previous approaches, achieves significantly improved segmentation accuracy.
Another popular line of research has focused on lightweight point network architectures (Qi et al. 2017a; Li et al. 2018b) while incorporating hierarchical context (Qi et al. 2017b; Engelmann et al. 2017), hybrid architectures (Rethage et al. 2018), 3D capsule networks and efficient 3D sparse convolutions (Graham et al. 2018; Choy et al. 2019), which enable effective processing of large-scale 3D data. Notably, these methods are among the state of the art on challenging benchmarks (Dai et al. 2017). We utilize the sparse convolutions proposed by Graham et al. (2018) as backbone features of our method.
It is important to note that semantic segmentation alone does not enable reasoning about object instances. In contrast, 3D instance segmentation focuses on foreground objects, where, in addition to the semantic label, instance masks are computed. Instance segmentation methods can roughly be categorised into bottom-up and top-down approaches, where in the latter proposals are generated from the input data. These proposals are often filtered, e.g. via non-maximum suppression (NMS), and are leveraged to compute bounding boxes and/or 3D masks. Hou et al. (2018) suggest a proposal-based instance segmentation, similar to 2D Mask R-CNN (He et al. 2017), that in contrast uses dense, volumetric 3D data paired with multi-view features. VoteNet, on the other hand, directly predicts the center of the object bounding boxes via a novel voting scheme. Alternatively, proposal-free methods were suggested (Lahoud et al. 2019; Han et al. 2020; Jiang et al. 2020), which use metric learning to generate embeddings that are trained to be similar on points/voxels of the same instance and different for other instances. For this purpose, multi-task learning (Lahoud et al. 2019) or an occupancy loss were proposed. Given rich instance embeddings, instances are obtained by clustering similar features. PointGroup (Jiang et al. 2020) proposes an improved grouping scheme while 3D-MPA (Engelmann et al. 2020) combines bottom-up and top-down approaches by learning object centers that are grouped via a graph neural network, avoiding the classical NMS.
Notably, instance segmentation methods usually ignore background elements such as walls and the floor. In panoptic segmentation (Kirillov et al. 2019; Narita et al. 2019), walls are segmented but are still not recognized as instances. In contrast to the aforementioned works, we include those structural entities since they are required to identify the majority of support relationships, e.g. a picture that hangs on a particular wall plane.

3D Object Context and Scene Layout
The works discussed so far are mostly object-focused and incorporate context only to improve overall segmentation performance. In contrast, holistic scene understanding perceives scenes as a whole by combining several related tasks, such as the prediction of scene layout (Song et al. 2015; Avetisyan et al. 2020), camera pose, or object pose and shape or reconstruction (Nie et al. 2020; Najibi et al. 2020). To parse a 3D scene, different representations have been proposed, including stochastic grammars (Zhao et al. 2011; Liu et al. 2014), dependency graphs or tree structures where, by definition, leaf nodes are independent scene entities (or object parts) and intermediate parent nodes represent functional entities (Liu et al. 2014). Zhao et al. (2011) use a stochastic grammar with three production rules (AND, OR and SET) to model the scene layout, detected objects, planes and the background. Scene synthesis methods that aim to generate realistic scene models and layouts incorporate knowledge about an object's context and the scene composition either directly or indirectly (Jiang et al. 2018; Shi et al. 2019). Jiang et al. (2018) describe a configurable 3D scene synthesis pipeline based on stochastic grammars, so-called spatial and-or graphs. GRAINS (Li et al. 2018a) combines a recursive VAE with object retrieval to iteratively generate a layout and objects. Shi et al. (2019) also suggest an iterative approach based on a novel variational recursive autoencoder. Kulkarni et al. (2019), on the other hand, create a 3D scene given a 2D image. They show that predicting relative transformations between objects improves their pose predictions compared to a neighbourhood-independent computation. Fisher et al. (2011) use kernel functions to compare and retrieve similar 3D scenes by incorporating relationships such as support and proximity. Similarly, Ma et al. (2018) parse natural language into graphs and retrieve 3D scenes that fulfill the requested compositions.
Graph structures have also been used for object understanding. In such a setup, different nodes represent different object parts, e.g. the leg or backrest of a chair. Te et al. (2018) solve semantic part segmentation by using a graph neural network. StructureNet (Mo et al. 2019) goes even further and represents a shape as a hierarchical graph of embeddings, where each object is a latent graph of its composing parts, to ultimately be able to sample and interpolate with the aim of generating novel shapes. They, however, learn a graph for each object category, and the relationships they use are restricted to relative transformations and known physical connections.
Only a few works have explored scene graphs in 3D. Gay et al. (2018) propose a 2.5D graph dataset based on ScanNet (Dai et al. 2017). Armeni et al. (2019), on the other hand, suggest hierarchical 3D scene graphs. They split the different components of a scene into 4 different layers: cameras, objects, buildings and rooms. Rosinol et al. (2020b) propose an additional dynamic layer to model humans. Armeni et al. (2019) do not include RGB-D sequences and, more importantly, structural components such as walls or floors are not included in their graphs, which therefore lack some inter-instance relationships such as support. A comparison of these, as well as related 2D datasets (Johnson et al. 2015; Krishna et al. 2017; Nathan Silberman Derek Hoiem and Fergus 2012), is given in Table 1.
In addition to the data, Armeni et al. (2019) and Gay et al. (2018) propose graph prediction methods. Armeni et al. (2019) sample images from a panoramic camera and apply a regularization technique to 2D mask predictions, aiming to obtain improved 3D object nodes. Gay et al. (2018), on the other hand, feed object features extracted from a continuous image sequence into a recurrent neural network. They operate in 2.5D on a static setup, while 3RScan (Wald et al. 2019) has dynamically changing scenes, which enables new, challenging tasks such as the newly introduced 2D-3D scene retrieval.
In our previous publication (Wald et al. 2020a), we proposed a method to predict 3D scene graphs by incorporating the ground truth (class-agnostic) segmentation. In this work, we remove this assumption and operate on the raw point cloud directly. We use a 3D network on the full scene and extract object-level features from point features instead of parsing each ground truth object one by one. This makes our method more scalable, especially on dense graphs. In contrast to Wald et al. (2020a), where only scene graphs are predicted, our method additionally estimates 3D semantic and node segmentation and does not require any prior knowledge about the scene; it is therefore directly applicable in real-world scenarios.
3D Scene Retrieval. Visual retrieval has a long history in computer vision, and is often embedded in other tasks such as object detection, object or scene alignment or camera pose estimation (Gálvez-López and Tardós 2011; Torii et al. 2015; Glocker et al. 2015; Anoosheh et al. 2019; Arandjelović et al. 2016; Deng et al. 2016; Lu et al. 2016b). Retrieving a source given some query data becomes most challenging when they do not share the same domain (Abdul-Rashid et al. 2018; Avetisyan et al. 2019, 2020), including natural scene changes (Wald et al. 2019, 2020b). An extensive literature exists on the retrieval of CAD models given an image (Izadinia et al. 2017; Sun et al. 2018) or a 3D scene (Avetisyan et al. 2019, 2020). Reviewing the full literature goes beyond the scope of this paper. To motivate the usage and suitability of our 3D graphs in high-level tasks, we use them as an intermediate representation for 2D-3D scene retrieval of changing indoor scenes.
Table 1 Comparison of 3DSSG with related 2D and 3D scene graph datasets.

3D Semantic Scene Graphs
The method proposed in this work is trained and evaluated on 3DSSG, which is based on 3RScan (Wald et al. 2019), a collection of everyday 3D indoor scenes comprising 1482 sequences with semantically segmented 3D models. It features approximately 450 unique, diverse indoor environments, captured over a long period of time with change annotations (Wald et al. 2019). Unlike any other dataset, this allows reasoning about object instances and their changes but also about their relationships. Besides the full scene graphs, a smaller dataset that features a subset of objects and predicates is also made available. In addition to 3DSSG, which was released in Wald et al. (2020a), we also provide 2D semantic scene graphs for the camera re-localization benchmark RIO10 (Wald et al. 2020b). In summary, we offer (a) scene graphs, (b) RGB-D sequences with camera poses and intrinsics, (c) textured 3D models with point coordinates and surface normals, where $p_i = (x, y, z, n_x, n_y, n_z) \in \mathbb{R}^6$, and an instance-level semantic segmentation defined as $\{l_i\}$, where $l_i$ describes the label of $p_i$. Finally, the data provides (d) change annotations such as scene and object alignments and bounding boxes.
Formally, semantic scene graphs $G$ are defined by a set of nodes $N$ with attributes $A$ and edge triplets $E$, such that $G = (N, E)$. The nodes $n_i \in N$ are consistent across re-scans of the same environment and correspond to the instance IDs in $\{l_i\}$. Specifically, each node $n_i$ is described by a set of properties $A$, as well as by a hierarchy of classes $c = (c_1, ..., c_d)$, where each $c_j \in C$ and $C$ is the set of all valid class labels. Given the 3D instance segmentation, 3D geometry and depth can be obtained for each node. Some of the nodes are connected by means of edges based on a set of predicates $P$ such as standing on, hanging on or more comfortable than. An edge $e \in E$ (see Fig. 2) directionally connects a subject node $n_s \in N$ to an object node $n_o \in N$ via a predicate $p \in P$ such that
$$e = (n_s, p, n_o). \tag{1}$$
In the following we provide a detailed overview of the nodes and their attributes (Sect. 3.1) as well as the graph edges (Sect. 3.2). More details about the annotation procedure of 3RScan and 3DSSG can be found in Wald et al. (2019) and Wald et al. (2020a), respectively.
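The definition above can be mirrored in a small data structure. The following is a minimal sketch and not the 3DSSG file format: the names `Node`, `Edge` and `SceneGraph`, as well as the example labels and attribute values, are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A scene entity n_i; instance_id matches the segmentation label l_i."""
    instance_id: int
    class_hierarchy: tuple  # fine-to-coarse labels (c_1, ..., c_d)
    attributes: frozenset = frozenset()


@dataclass(frozen=True)
class Edge:
    """A directed triplet (subject, predicate, object), stored by instance id."""
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class SceneGraph:
    """G = (N, E) with nodes indexed by instance id."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.instance_id] = node

    def add_edge(self, subject_id, predicate, object_id):
        # Both endpoints must already be nodes of the graph.
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("edge endpoints must be existing nodes")
        self.edges.append(Edge(subject_id, predicate, object_id))

    def triplets(self):
        """Return readable (subject, predicate, object) tuples of finest labels."""
        return [
            (self.nodes[e.subject_id].class_hierarchy[0], e.predicate,
             self.nodes[e.object_id].class_hierarchy[0])
            for e in self.edges
        ]


# Hypothetical two-node example: an armchair standing on the floor.
graph = SceneGraph()
graph.add_node(Node(1, ("floor",)))
graph.add_node(Node(2, ("armchair", "chair", "seat", "furniture"), frozenset({"rigid"})))
graph.add_edge(2, "standing on", 1)
print(graph.triplets())
```

Storing edges by instance id rather than by object reference keeps them valid across re-scans, as long as the instance ids stay consistent.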

Nodes and Attributes
The most important graph entities are the nodes described by coarse-to-fine class labels e.g. armchair → chair → seat → furniture → artefact. The human annotation represents the lowest and finest level of our class hierarchy which is recursively parsed based on a lexical dictionary extracted from WordNet (Fellbaum 1998).
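The recursive expansion of an annotated label into its coarser parents can be sketched as follows. The `HYPERNYM` dictionary is a hand-written stand-in for the WordNet-derived lexical dictionary and only covers the example chain from the text; both it and the function name are assumptions.

```python
# Hand-written stand-in for a WordNet-derived lexical dictionary
# (child label -> parent label); an assumption for illustration only.
HYPERNYM = {
    "armchair": "chair",
    "chair": "seat",
    "seat": "furniture",
    "furniture": "artefact",
}


def class_hierarchy(label, hypernym=HYPERNYM):
    """Recursively expand an annotated label into a fine-to-coarse chain."""
    parent = hypernym.get(label)
    if parent is None:
        # No coarser label is known, so the chain ends here.
        return (label,)
    return (label,) + class_hierarchy(parent, hypernym)


print(" -> ".join(class_hierarchy("armchair")))
```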
Object properties give more details about the visual and physical appearance of an instance. Overall, 3DSSG contains 93 different attributes, which are split into 3 different groups: static properties that stay the same over time, dynamic attributes that possibly change, and affordances that describe the functionality of an object. While some properties require manual annotation, others are obtained automatically from the object's geometry. In the following paragraphs, more details are given about the different types of attributes.
Static and Dynamic Properties. The first category describes the visual appearance of an entity. This includes geometric properties such as shape, size and rigidity, as well as color and texture. The size of an object relative to all other objects of the same category is computed automatically by comparing instance masks, while other more complex attributes, including the texture, material, color or shape, are manually annotated using a custom annotation interface. While static properties, e.g. an object's appearance, usually do not change, dynamic attributes can change over time. They describe the state of an entity such as open/closed or on/off. Some state categories are class specific, e.g. appliances such as televisions, refrigerators or ovens can be turned on and off while a bed cannot. Interestingly, dynamic properties provide insights about potential human activity, see Fig. 3.
Affordances. The interaction possibilities of a scene entity can be described by using affordances (Gibson 1979; Xia et al. 2018; Armeni et al. 2019). In 3DSSG they are associated with object classes, e.g. a seat is for sitting. Notably, some affordances are only viable if the object is in a specific state, e.g. only a closed door can be opened, which is a direct link to the dynamic attributes and is of relevance in the presence of scene dynamics.

Relationships
2D scene graph datasets often describe a human action occurring in an image, e.g. girl-throwing-frisbee or boy-reading-book. Since our scenes do not include humans directly, the attention shifts away from those actions to the following three main relationship categories: (a) support, (b) proximity and (c) comparative relationships, all described in the following.
Support Relationships are important connections between objects (Nathan Silberman Derek Hoiem and Fergus 2012) as they give hints about physical stability and object dependencies and are therefore of relevance in robotics applications, where robot-scene interaction is carried out. By definition, all entities are supported by at least one other node, excluding the floor, which is the root node in our representation and, as such, does not require any support. The support relationships in 3DSSG are assigned by automatically computing a list of support candidates, followed by a manual correction and semantic annotation to produce the desired relationship tuples, so-called semantic support relations, such as chair-standing on-floor or cabinet-hanging on-wall.
Proximity Relationships describe the spatial arrangement of objects on the same support level. Proximity relationships such as left or right require a reference view, which we choose to be a top-down bird's-eye view with +x as right and +y as front. Spatial relationships are automatically computed and are only valid in 3D; they therefore require re-computation in 2D.
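To make the reference-view convention concrete, the sketch below derives a coarse predicate from two top-down centroids. It is not the annotation procedure of 3DSSG: the function name, the dominant-axis rule, the `front`/`behind` labels and the `max_distance` cut-off are all assumptions.

```python
def proximity_predicate(subject_xy, object_xy, max_distance=2.0):
    """Return a coarse spatial predicate of the subject relative to the object.

    Coordinates are top-down centroids with +x as right and +y as front.
    The dominant-axis rule and max_distance are illustrative assumptions.
    """
    dx = subject_xy[0] - object_xy[0]
    dy = subject_xy[1] - object_xy[1]
    if dx == 0 and dy == 0:
        return None  # coincident centroids carry no direction
    if (dx * dx + dy * dy) ** 0.5 > max_distance:
        return None  # too far apart to be considered in proximity
    # Pick the axis along which the two centroids differ the most.
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "front" if dy > 0 else "behind"


print(proximity_predicate((1.0, 0.2), (0.0, 0.0)))
```

Because the rule depends on the chosen axes, the same pair of objects can receive a different predicate once the graph is viewed from a camera, which is why these edges have to be recomputed in 2D.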
Comparative Relationships are connections related to object properties, see Sect. 3.1. They are computed from node annotations and include, but are not limited to, comparisons of size (e.g. bigger/higher than), shape or material (e.g. same shape/material), color (e.g. darker than, same color) or state (e.g. cleaner than).
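As an illustration of how such edges can follow from node annotations, the sketch below derives two comparative predicates from per-node values. The annotation keys `volume` and `color` and the function name are hypothetical; the actual attribute schema of 3DSSG is not reproduced here.

```python
from itertools import permutations


def comparative_edges(nodes):
    """Derive comparative triplets from per-node annotations.

    `nodes` maps an instance id to a dict with the hypothetical keys
    "volume" (float) and "color" (str); both are assumptions.
    """
    edges = []
    # Every ordered pair is visited, so symmetric predicates appear twice.
    for (sid, s), (oid, o) in permutations(nodes.items(), 2):
        if s["volume"] > o["volume"]:
            edges.append((sid, "bigger than", oid))
        if s["color"] == o["color"]:
            edges.append((sid, "same color", oid))
    return edges


annotations = {
    1: {"volume": 0.9, "color": "brown"},
    2: {"volume": 0.3, "color": "brown"},
}
print(comparative_edges(annotations))
```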

2D Scene Graphs
Since ground truth instance segmentation as well as RGB-D image sequences with corresponding camera poses $P_i$ are provided, 2D graphs $G_{2D}$ are directly obtainable from the 3D counterpart $G_{3D}$ using a simple rendering procedure, see Fig. 4. Given the 2D instance image $I_{s,i}$, rendering the 3D graph implies filtering out nodes and edges that do not include a visible entity. Support and comparative relationships can be directly transferred while proximity relationships need to be recomputed, since they are viewpoint-dependent.
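The filtering step can be sketched as follows, assuming the instance image is already rendered. The function name, the list-of-lists image layout and the `min_pixels` visibility threshold are assumptions; viewpoint-dependent proximity edges would still have to be recomputed separately.

```python
from collections import Counter


def render_graph_2d(instance_image, nodes, edges, min_pixels=1):
    """Keep the part of a 3D graph that is visible in one rendered view.

    instance_image: 2D list of instance ids per pixel (0 = background).
    nodes: set of instance ids; edges: list of (subject, predicate, object).
    min_pixels is an assumed visibility threshold.
    """
    counts = Counter(pid for row in instance_image for pid in row)
    visible = {n for n in nodes if counts[n] >= min_pixels}
    # An edge survives only if both of its endpoints are visible.
    kept = [(s, p, o) for s, p, o in edges if s in visible and o in visible]
    return visible, kept


image = [
    [1, 1, 2],
    [1, 1, 2],
]
nodes_3d = {1, 2, 3}
edges_3d = [(2, "standing on", 1), (3, "standing on", 1), (2, "bigger than", 3)]
print(render_graph_2d(image, nodes_3d, edges_3d))
```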
The rendering pipeline and 2D graphs for the camera re-localization benchmark RIO10 (Wald et al. 2020b) will be made publicly available (https://waldjohannau.github.io/RIOSG). Compared to some other 2D scene graph datasets (Krishna et al. 2017), our data provides depth images and semantic instance masks in addition to bounding boxes, similar to the much smaller NYUv2 (Nathan Silberman Derek Hoiem and Fergus 2012).

Methodology
In the following section we first introduce the problem statement (see Sect. 4.1) giving a high-level overview of the different tasks and challenges involved when predicting 3D semantic scene graphs, before diving into details of our proposed method (see Sect. 4.2).

Problem Statement
Scene graph prediction from 3D data is a complex problem that requires solving several interdependent tasks. While our 3D reconstructions do not capture on-camera motion, a few other factors make this task challenging: the 3D graphs are, compared to the 2D counterpart, quite large and densely connected, while the underlying data usually covers a relatively large space. Simply applying techniques developed for images to the 3D domain is therefore often infeasible.
3D scene graph prediction, first and foremost, requires the 3D space to be encoded with meaningful features that incorporate long-range semantic information (P1). This feature space is the foundation of the scene graph prediction, which relies on the identification of scene entities including objects and scene structure. Our scene graphs are dense; therefore, every single 3D point has to be assigned to a node in the graph while the number of nodes is unknown (P2). Finally, semantics are obtained by classifying the detected nodes and their connecting edges given a list of object class and predicate labels (P3). Notably, 3DSSG features a long list of different object classes with an unbalanced long-tail distribution: a few labels occur regularly, while the majority is relatively rare (P4). Specifically, the top-12 most common object classes appear approximately as often as all the remaining classes together.
Ultimately, the goal is to end up with scene graphs that are rich and meaningful enough for high-level tasks e.g. visual question answering or scene retrieval (P5).

3D Scene Graph Embedding Network
An overview of our method is given in Fig. 5. It operates end-to-end and consists of two main parts: a 3D network and a graph network. Given the 3D model of a scene, we identify scene graph nodes N by learning a semantic instance segmentation (see Sect. 4.2.1). The network assigns the 3D points of the scene to an entity by processing their coordinates, surface normals and texture colors. Our segmentation aims to produce similar features for points on the same instance and different features for points on different instances (Fig. 5b). In the second stage (see Sect. 4.2.2), a graph is built from the extracted scene nodes in a fully connected fashion, E = N × N (Fig. 5d). When constructing the graph, the output features from the first stage are aggregated for the corresponding nodes and edges respectively. The object classes are refined and predicates (if any) are predicted for the object pairs, see Fig. 5c. The output of our method is a 3D semantic instance segmentation as well as a 3D scene graph (Fig. 5e).
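The fully connected construction E = N × N can be illustrated with a minimal sketch. It assumes that self-pairs (a node related to itself) are skipped, which the text does not state, and the function and node names are hypothetical.

```python
from itertools import permutations


def build_fully_connected_edges(node_ids):
    """Return every ordered (subject, object) pair of distinct nodes.

    Assumption: self-pairs are omitted; the formal product N x N would
    also contain (n, n) entries.
    """
    return list(permutations(node_ids, 2))


# Hypothetical node identifiers for a tiny scene.
nodes = ["chair_1", "table_1", "floor_1"]
edges = build_fully_connected_edges(nodes)
print(len(edges))  # number of candidate relationships to classify
```

The number of candidate edges grows quadratically with the number of nodes, which is why processing the points only once per scene matters for dense graphs.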

3D Instance and Semantic Network
Instead of computing features for each object node and relationship separately, as in Wald et al. (2020a), we process the entire scene at once and obtain object-level features from the points by grouping them afterwards using instance masks. This makes our features more descriptive than those of Wald et al. (2020a). Furthermore, such a procedure is more scalable, since the points are processed only once and not redundantly for every edge, which is particularly favorable in dense graphs.
We augment each element of the point cloud with its surface normal. Additionally, the color value of our textured 3D models is extracted by querying the image texture at the associated pixel coordinates. Our input is therefore an (N × 9)-dimensional feature, where N is the cardinality of the point cloud. 3RScan consists of high-poly meshes of real-world reconstructions which contain faces and vertices of the scene surface. While the RGB extraction operates on meshes, our network is also able to process colored point clouds directly. The meshes are reconstructions of real-world environments and therefore consist of several thousand polygons and vertices. During training, the data loader randomly samples 3D points from the data. We feed this input into a sparse convolutional encoder-decoder network with a backbone size of 256, similar to the one proposed by Najibi et al. (2020). The output feature dimensionality of the 3 × 3 × 3 convolutional layers is 64, 96, 128, 160, 192 and 224 respectively. After each layer of the encoder a max pooling operation is applied.
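As a minimal sketch of the (N × 9)-dimensional input described above, the following concatenates coordinates, surface normals and colors per point. The function name is hypothetical, and scaling 8-bit colors to [0, 1] is an assumption that the text does not specify.

```python
def build_point_features(positions, normals, colors):
    """Concatenate xyz, normal and RGB into one 9-dimensional row per point."""
    if not (len(positions) == len(normals) == len(colors)):
        raise ValueError("positions, normals and colors must have equal length")
    features = []
    for position, normal, color in zip(positions, normals, colors):
        # Assumption: colors are 8-bit values, scaled here to [0, 1].
        rgb = [channel / 255.0 for channel in color]
        features.append(list(position) + list(normal) + rgb)
    return features


# Two example points: a point on the floor and a point on a wall.
features = build_point_features(
    positions=[(0.0, 0.0, 0.0), (1.0, 2.0, 0.5)],
    normals=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    colors=[(255, 0, 0), (0, 255, 0)],
)
print(len(features), len(features[0]))
```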
In the decoder, features are upsampled and corresponding encoder features are concatenated via skip connections à la U-Net (Ronneberger et al. 2015; Çiçek et al. 2016). In contrast to other instance segmentation works, we do not consider any points as background (e.g. walls and floors); instead we purposely include them as unique instances, since we want to represent structural components as independent nodes in the graph. We predict semantic logits $f_i$ and instance embeddings $e_i$, $\{(f_i, e_i)\}_{i=1}^{N}$, for all given points without any masking. They are obtained by applying two sparse convolutional layers with ReLU and batch norm on the output of the decoder.
Inspired by the success of bottom-up instance segmentation, we similarly learn an embedding space and obtain instances by clustering its features. During training, we uniformly sample point indices |V| on all object instances to counteract the data imbalance caused by objects of varying sizes before an N-pair metric learning loss (Sohn 2016) is computed. The loss uses a pairwise similarity metric s(i, j),
where i and j are sampled point indices in the input and $e_i$ and $e_j$ represent their respective embedding vectors. Figures 5b and 6b visualize the learned point features on example scenes. For visualization purposes, we map the embedding vectors to RGB color space by applying PCA. It can be seen that our features are able to distinguish points on different object instances. In addition to learning the embedding space, we use a cross-entropy loss $L_s$ to learn semantic classes per point. We jointly learn the semantic and instance embeddings and weight their losses equally. The final semantic segmentation (Sect. 5.1) is obtained by applying a regularization technique that averages the semantic outputs of regions with similar embeddings. During training we randomly sample points on the point clouds and process them in a voxelized fashion with a resolution of 0.02 m × 0.02 m × 0.02 m within our sparse convolutional network. While our network processes data in a discretized fashion, voxels are mapped back to the input points at test time. Translation augmentation is applied with a maximum offset of 0.1 m, scenes are scaled with a factor between 0.9 and 1.1, and rotational augmentation around the z axis ranges from −180° to 180°. We implement our network with TensorFlow (Abadi et al. 2015).
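The voxelized processing and the mapping back to input points can be sketched as follows. This is only an illustration of the bookkeeping at a 0.02 m resolution; the function names are hypothetical and the actual sparse convolutional implementation is not reproduced here.

```python
import math

VOXEL_SIZE = 0.02  # metres, the resolution stated in the text


def voxelize(points, voxel_size=VOXEL_SIZE):
    """Assign each point to a voxel; return voxel keys and a point-to-voxel map."""
    voxel_index = {}
    point_to_voxel = []
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        if key not in voxel_index:
            voxel_index[key] = len(voxel_index)
        point_to_voxel.append(voxel_index[key])
    # Dictionaries keep insertion order, so keys line up with voxel indices.
    return list(voxel_index), point_to_voxel


def devoxelize(voxel_values, point_to_voxel):
    """Copy each voxel's prediction back to the points that fell into it."""
    return [voxel_values[voxel] for voxel in point_to_voxel]


points = [(0.001, 0.001, 0.001), (0.015, 0.005, 0.019), (0.05, 0.0, 0.0)]
voxels, mapping = voxelize(points)
per_point_labels = devoxelize(["wall", "chair"], mapping)
```

Keeping the point-to-voxel map is what allows per-voxel outputs to be reported per input point at test time.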

Scene Graph Network
During training the scene graph is constructed with the ground truth instance segmentation $l = \{l_i\}_{i=1}^{N}$. In contrast, at test time, a clustering function ρ is used to cluster points into instance estimates $\hat{l}$. In our implementation we use k-means++ (Arthur and Vassilvitskii 2007), though any clustering function could be applied to our embedding space. Similar to other graph prediction works (Lu et al. 2016a; Xu et al. 2017; Yang et al. 2018), visual features are extracted for each node $\phi_k$ and edge $\phi_r$ respectively. We experiment with different node encodings obtained from aggregated and concatenated features of the 3D network. In our experiments we set $\phi_k = [\bar{e}_k, \bar{f}_k]$, where $\bar{e}_k$ is the average of the 3D embedding features over all points of an instance k and $\bar{f}_k$ is computed in a similar fashion. We arrange the nodes in a graph structure, building relationship triples (subject, predicate, object) to form a fully connected graph with subject $\phi_s$, object $\phi_o$ and predicate units $\phi_r$. $\phi_r$ is derived from the subject $\phi_s$ and object $\phi_o$ node features, their relative position (computed from the centroids $\bar{p}_s$ and $\bar{p}_o$) as well as their relative bounding boxes. Class labels are learned with a cross-entropy loss. Instead of predicting only a single semantic class per node, the network has two semantic output heads that use different semantic class sets and are learned jointly in a coarse-to-fine fashion. This means our network learns a very coarse semantic class, e.g. other furniture (from NYUv2 (Silberman et al. 2012)), as well as uncommon but more descriptive classes such as baby bed or storage unit. In a real-world setup, certain object pairs might have multiple valid relationships that describe their interactions. In Fig. 2, for example, one chair can be behind the other and simultaneously have the same visual appearance, represented as a same as relationship. We therefore formulate $L_{pred}$ as a per-class binary cross entropy.
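A per-class binary cross entropy for one edge can be sketched as below. This is a generic formulation, not necessarily the exact loss used here: averaging over classes, the sigmoid activation and the 0.5 decision threshold are assumptions, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def per_class_bce(logits, targets, eps=1e-12):
    """Binary cross entropy, evaluated independently per predicate class.

    Assumption: the per-class terms are averaged (a sum is equally common).
    """
    total = 0.0
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        total -= target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps)
    return total / len(logits)


def predict_predicates(logits, names, threshold=0.5):
    """Return every predicate whose probability passes the (assumed) threshold."""
    chosen = [name for name, logit in zip(names, logits) if sigmoid(logit) >= threshold]
    return chosen or ["none"]


names = ["behind", "same as", "standing on"]
# A multi-label target: the edge is both "behind" and "same as".
loss = per_class_bce([2.0, 1.5, -3.0], [1, 1, 0])
labels = predict_predicates([2.0, 1.5, -3.0], names)
```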
This way, it is judged independently whether an edge should be assigned a certain label (e.g. standing on) or no label (none). In summary, we train our graph model end-to-end and optimize the object classification loss $L_{obj}$ and the predicate classification loss $L_{pred}$ jointly, where λ is a weighting factor between object and relationship prediction, set to 0.5 in our experiments. We ensure consistency of proximity relationships, like left or right, across re-scans by avoiding rotation augmentation during training. Instead, re-scans are aligned with the respective references using the provided scan-to-scene transformations. Notably, our method does not require any filtering of the ground truth graph data and is able to process all nodes and edges of a scene at once. The object and relationship predictors have four and six fully connected layers respectively, followed by batch norm and ReLU. For training we use an SGD optimizer with a learning rate of $10^{-2}$. Please note that, while in practice both networks can be trained jointly, we train them separately for the sake of easier convergence. Specifically, we first train the 3D network and then freeze its layers when training the scene graph prediction.
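The node encoding used in this section, in which per-point embeddings and semantic outputs are averaged over the points of each instance and then concatenated, can be sketched as follows. The function names are hypothetical and plain Python lists stand in for the network tensors.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def aggregate_node_features(instance_ids, embeddings, semantic_logits):
    """Build one feature per instance: [mean embedding, mean semantic output]."""
    grouped = defaultdict(lambda: ([], []))
    for instance, embedding, logits in zip(instance_ids, embeddings, semantic_logits):
        grouped[instance][0].append(embedding)
        grouped[instance][1].append(logits)
    return {
        instance: mean_vector(embs) + mean_vector(sems)
        for instance, (embs, sems) in grouped.items()
    }


# Three points: two on instance 0, one on instance 1.
node_features = aggregate_node_features(
    instance_ids=[0, 0, 1],
    embeddings=[[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]],
    semantic_logits=[[0.0], [2.0], [4.0]],
)
```

During training the instance ids would come from the ground truth segmentation, at test time from the clustering of the embedding space.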

Evaluation
In this section we present different experiments to analyse the main aspects of our proposal. First, we evaluate our 3D semantic segmentation network on 3RScan (Wald et al. 2019), see Sect. 5.1, to validate the quality of our underlying features (see Sect. 4.1, P1), and then ablate the node segmentation quality (see Sect. 4.1, P2) and the effect of the embedding dimension, see Sect. 5.2. We then compare our scene graph prediction with Wald et al. (2020a) and report a per-class evaluation and an analysis of rare and frequently occurring class labels, see Sect. 5.3. Finally, we show the application of scene graph prediction (see Sect. 4.1, P5) to the task of scene retrieval in Sect. 5.4.

3D Semantic Segmentation
Following the evaluation scheme of Liu et al. (2020), Tables 2 and 4 list the average IoU of the 3D semantic segmentation on 3RScan (Wald et al. 2019) using 27 object categories. We compare our method against Liu et al. (2020), as it was also evaluated on this dataset, and outperform it by a small margin. Table 3 reports the per-class performance using the NYU40 (Silberman et al. 2012) class set. Notably, our method is able to successfully segment challenging classes such as doors and windows. Qualitative results of our 3D semantic segmentation on 3RScan are shown in Fig. 6: (a) is the input of our method, (b) visualizes the learned point embeddings mapped to color space, and (c) and (d) are the predicted and ground truth semantic segmentation respectively. This experiment shows that our 3D network produces meaningful features to be further processed within the graph prediction network.

3D Node Segmentation
In the following we evaluate how well scene nodes are detected in 3D scenes using different embedding dimensions, see Table 4. It can be observed that the dimensionality of the embedding vector has only a small impact on the segmentation quality. The embedding dimension of 256 gives the best performance, although the margin is relatively small. In the experiment, we adapt the commonly used Mean Average Precision metric, where the Average Precision (AP) is the area under the precision-recall curve. First, the IoU is computed between each ground truth and predicted segment of the same class. Each prediction mask is compared with the ground truth to obtain an IoU and is considered a true positive or false positive based on a threshold (25% for mAP25 and 50% for mAP50). The evaluation scheme was adapted from ScanNet (Dai et al. 2017) and Cityscapes (Cordts et al. 2016), which is inspired by the evaluation scheme of COCO (Lin et al. 2014). The corresponding precision and recall per class are given in Table 5. As expected, the method achieves the best scores for the common and distinctive classes chairs, tables and toilets, and the worst for the categories counter, desk and bookshelf, which are either ambiguous (desk vs. table, counter vs. cabinet) or cluttered (shelf, counter).
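The thresholded matching behind mAP25 and mAP50 can be sketched for a single class as follows. This is a simplified illustration: masks are sets of point indices, predictions are matched greedily in the given order, and the full AP (area under the precision-recall curve over confidence-ranked predictions) is not computed. The function names and the greedy matching are assumptions rather than the exact benchmark code.

```python
def mask_iou(pred_points, gt_points):
    """Intersection over union of two masks given as sets of point indices."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def precision_recall(pred_masks, gt_masks, iou_threshold=0.5):
    """Count a prediction as true positive if it overlaps an unmatched
    ground truth mask with IoU at or above the threshold."""
    matched_gt = set()
    true_positives = 0
    for pred in pred_masks:
        best_iou, best_gt = 0.0, None
        for gt_index, gt in enumerate(gt_masks):
            if gt_index in matched_gt:
                continue
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_gt = iou, gt_index
        if best_gt is not None and best_iou >= iou_threshold:
            matched_gt.add(best_gt)
            true_positives += 1
    precision = true_positives / len(pred_masks) if pred_masks else 0.0
    recall = true_positives / len(gt_masks) if gt_masks else 0.0
    return precision, recall


gt_masks = [{0, 1, 2, 3}, {4, 5, 6, 7}]
pred_masks = [{0, 1, 2}, {4, 5, 8}]
strict = precision_recall(pred_masks, gt_masks, iou_threshold=0.5)
loose = precision_recall(pred_masks, gt_masks, iou_threshold=0.25)
```

In this example the second prediction overlaps its ground truth with an IoU of 0.4, so it counts as a true positive only under the 25% threshold.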
We further evaluate the effect of different features on segmentation quality. In Tables 6 and 8 we cluster instances based on point embeddings alone (1), embeddings and the semantic features (2), embeddings and the point cloud coordinates (3) as well as all features combined (4) while keeping the embedding dimension 256 fixed.
The clustering seems to be most accurate when only embedding features are used, which might sound counterintuitive at first. Our intuition is that spatial information may cause over-segmentation of bigger objects, e.g. couches or walls, and that adding semantic information might result in unwanted merging of close-by instances of the same class, e.g. chairs, see Fig. 7. Please note that the embedding network was specifically trained to produce a distinctive feature space by parsing (a) color and (b) geometry, and by jointly learning (c) semantics and (d) instance embeddings.
Even for challenging scenes with many semantically and visually similar objects, the feature space is quite distinctive, see also Fig. 8.
An additional qualitative evaluation of the node features can be found in Fig. 9, where we use t-SNE (van der Maaten and Hinton 2008) to visualize the features of our scene nodes. Each element in the plot represents a scene entity in the validation set and its color corresponds to the respective NYU40 class. We can observe that objects of the same category are clustered together and away from categories with different shapes, e.g. (a) toilets or (b) curtains, while similar classes (desks and tables) are closer together.

Semantic Scene Graph Prediction
In the following we compare the performance of our proposed scene graph embedding network ➂ with two variants of our scene graph point network ➁ (Wald et al. 2020a) and a baseline ➀ inspired by the visual relationship prediction proposed by Lu et al. (2016a). We re-implemented Lu et al. (2016a) and adapted the method to operate on 3D using an underlying PointNet backbone. Similar to Wald et al. (2020a), node and edge features are extracted from the point cloud for each node and edge respectively. We follow the same train and validation split proposed by Wald et al. (2019). Evaluating semantic scene graphs is non-trivial as it involves several interdependent tasks: detection and segmentation of object instances, prediction of the semantic class labels as well as visual relationship detection. When evaluating large 3D scene graphs, we evaluate objects, predicates and relations (triples) independently, as commonly done in the 2D scene graph literature. Even though our method does not require any ground truth segmentation, we utilize it in this experiment to compare fairly against the methods in Wald et al. (2020a).

Table 5 Precision and recall per NYU40 category of 3D node segmentation on the validation split of 3RScan (columns: Average, Wall, Floor, Cabinet, Bed, Chair, Sofa, Table, Door)

Table 6
(1) embedding: 9.4, 21.7, 44.0
(2) embedding, semantic: 3.8, 9.8, 27.7
(3) embedding, spatial: 6.9, 16.4, 41.3
(4) embedding, semantic, spatial: 4.9, 11.7, 28.0
Bold indicates the best performing model/method

Fig. 7 (a) Colored 3D model and (b) embedding, (c) spatial and (d) semantic input features used for 3D node segmentation of an example scene, see Table 6
Previous graph prediction works propose complex evaluation schemes (Lu et al. 2016a; Xu et al. 2017; Yang et al. 2018; Wald et al. 2020a) that consider a match correct if it ranks within the top-n predictions. Such an evaluation scheme helps in challenging scenarios or when dealing with ambiguous class categories and large label sets. In this work, we only adopt the strictest top-1 metric, where a sample is considered correct if and only if it exactly matches the ground truth. Table 7 reports the top-1 score for objects, predicates and relations, where the latter evaluates (subject, predicate, object) triplets. In this experiment, our network outperforms all other methods when predicting object and predicate labels as well as relations, but still leaves some room for improvement due to the challenging setup and high variability of the graphs. To further analyse the network's predictions we report a per-predicate breakdown of the performance, see Table 8. Rare predicates, e.g. cover, hanging in and lying in, are the most challenging, which is likely related to the unbalanced data distribution and the complexity of the graph data.
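The strict top-1 metric can be sketched as an exact-match accuracy. The sketch below assumes one predicted label (or triple) per sample; how multi-label predicates are compared is not specified here and is left out. The function name and the example triples are illustrative only.

```python
def top1_accuracy(predictions, ground_truth):
    """Fraction of samples whose prediction exactly equals the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        return 0.0
    correct = sum(1 for pred, gt in zip(predictions, ground_truth) if pred == gt)
    return correct / len(ground_truth)


# Relations are compared as whole (subject, predicate, object) triples.
predicted = [("chair", "standing on", "floor"), ("pillow", "lying on", "sofa")]
reference = [("chair", "standing on", "floor"), ("box", "supported by", "sofa")]
relation_score = top1_accuracy(predicted, reference)
object_score = top1_accuracy([t[2] for t in predicted], [t[2] for t in reference])
```

A triple only counts as correct when subject, predicate and object all match, so the relation score can never exceed the scores of its parts.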
We furthermore report the node classification score of the 20 most and least frequently occurring object classes, see Table 9, to better understand the network's behaviour when working with diverse and imbalanced data. It is no surprise that high scores are achieved for common classes while the network fails when confronted with rare entities. Interestingly, small objects, e.g. pan, scale, bread, napkins or papers, are often confused with the categories on which they most commonly appear, e.g. kitchen counter, table or shelf, accidentally taking too much context into consideration.
A similar experiment in which common and uncommon relations are analysed shows the same results, see Table 10. While the network achieves remarkable performance in predicting common relationships such as chair-standing on-floor, it struggles to predict any of the less common relations correctly. Interestingly, the network has learned strong and meaningful priors from the data and is therefore able to produce meaningful relations such as pillow-lying on-sofa or clothes-hanging on-wall for the relations box-supported by-sofa and flowers-hanging on-wall (rare occurrence, noisy reconstruction or an inconsistency in the ground truth).
To complement this set of experiments we analysed the most common mispredictions of relation triplets, see Table 11. Similarly, specific labels such as side table or shower wall are confused with their more general counterparts table and wall. In terms of predicates, the method seems to confuse smaller than with lower than and bigger than with higher than.

Fig. 9 Learned embedding space of the scene graph nodes. Colors correspond to the semantic NYU colors used in Fig. 6: (a) toilets, (b) curtains
Finally, Fig. 10 shows example scene graph predictions generated by our method. Since dense graphs on large scenes are hard to parse and visualise, we only show support edges. For incorrect predictions, the ground truth is appended in round brackets after the predicted label. In the following section we show that our graphs are accurate and representative enough to be used within a retrieval application, see Sect. 5.4.

Scene Retrieval
In the following we utilize graphs for image-based 3D scene retrieval in changing indoor environments as proposed in Wald et al. (2020a). The aim of the task is to identify the corresponding scene from a pool of 3D scans given a single 2D image acquired at a different point in time. The query and target data are not only from different domains but additionally undergo scene changes, e.g. moving objects or different illumination. To bridge the domain gap between images and 3D data, this task is carried out using scene graphs. We use different similarity metrics to match the graphs to the correct 3D scenes using object semantics as well as scene context. Computing the minimum edit distance between two graphs is a complex problem; we therefore map our graphs to node and edge multi-sets, which may contain repeated elements that occur more than once in the scene. Based on two graphs $G$ and $G'$, a similarity function τ is computed. In our tests we explore the Jaccard coefficient $\tau_J$ and the Szymkiewicz-Simpson coefficient $\tau_S$, computed on an augmented graph $\hat{G} = (N, E, R)$ with binary edges E. Table 12 reports the matching accuracy using single 2D images or a 3D re-scan of an indoor scene. To do so, we compute the scene graph similarity between each re-scan (2D or 3D) and the target reference scans. We then order the matches by their similarity and report the top-n metric, i.e. the percentage of true positive assignments placed in the top-n matches of our algorithm. Since the sizes of image and 3D scene graphs are significantly different, the Szymkiewicz-Simpson coefficient is used in 2D-3D matching while $f_S$ is chosen in the 3D-3D scenario. It can be observed that our scene graphs ➂ significantly improve matching accuracy in 2D as well as 3D, compared to our previous work ➁ and the baseline model ➀ (see Sect. 5.3). Note that for the purpose of this experiment, predicted 2D graphs are obtained by rendering the predicted 3D graphs as described in Sect. 3.
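The two coefficients named above have standard set-based definitions, which extend to multi-sets by counting repeated elements. The sketch below applies them to node label multi-sets only and ranks candidate scenes by similarity; how node, edge and relation terms are combined in the full similarity function is not reproduced, and the function names and example scenes are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Multi-set Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def szymkiewicz_simpson(a, b):
    """Multi-set overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    a, b = Counter(a), Counter(b)
    smaller = min(sum(a.values()), sum(b.values()))
    return sum((a & b).values()) / smaller if smaller else 0.0


def rank_scenes(query, scenes, similarity):
    """Order candidate scene names from most to least similar to the query."""
    return sorted(scenes, key=lambda name: similarity(query, scenes[name]), reverse=True)


# A small image graph (query) against two larger 3D scene graphs.
query = ["chair", "chair", "table"]
scenes = {
    "scene_a": ["chair", "chair", "table", "sofa", "floor", "wall"],
    "scene_b": ["bed", "wall"],
}
ranking = rank_scenes(query, scenes, szymkiewicz_simpson)
```

The example shows why the overlap coefficient suits 2D-3D matching: the query is fully contained in scene_a, so the Szymkiewicz-Simpson coefficient is 1.0, whereas the Jaccard coefficient is penalized by the larger size of the 3D graph.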

Future Work
3D semantic scene graphs are a rich and compact representation for holistic 3D scene understanding, and we believe they are an excellent representation for persistent 3D mapping of long-term, dynamic 3D environments. The prediction of entities is an essential requirement of persistent mapping, where the representation of a space is updated based on new observations and detected changes. Dynamics could be captured by learning persistent features for association across time and augmenting the graphs with poses. A localization algorithm would then need to jointly detect changes in the graph and estimate the poses in this dynamic setup. This long-term knowledge could then be stored as additional connections in the graph structure. Another interesting direction is to augment the resulting 3D temporal scene graphs with comprehensive semantics beyond simple class labels and 6DoF object poses. This could provide the representation needed for robust and efficient persistent mapping. In this context, it might be worth exploring hierarchical scene graphs where the labels of parent nodes are carried on to the children, forcing them to be more specific, e.g. using a novel loss. Notably, building scene graphs requires long-range attention; it is therefore important to rely on efficient networks and techniques, e.g. sparse convolutions, especially when handling large-scale outdoor scenes. Ultimately, finding a scalable and generic solution for persistent mapping that is weakly supervised, self-supervised or even unsupervised could solve many of the challenges of long-term and dynamic 3D scene understanding and eventually help bring some of the theoretical models into practice.

Conclusion
This work goes beyond classical object-level scene understanding and explores the regression of 3D scene graphs with a neural network. A novel graph prediction method is proposed based on the semantically rich scene graph dataset 3DSSG, which is built upon 3RScan (Wald et al. 2019). Our method predicts nodes and edges representing the objects' semantic classes and their relationships by directly operating on 3D scans of scenes. Notably, our work explores regressing these graphs from real-world 3D data without any priors. Our experiments show that the features learned with our 3D network enable the detection and segmentation of graph nodes, while the underlying features are very descriptive and therefore useful for semantic scene graph prediction. We show that the unbalanced distribution and large number of different categories in real-world scenes introduce additional challenges, which require learning an even richer, fine-grained feature space given only a few training samples. We believe scene graphs could ultimately serve as a persistent representation for long-term 3D scene understanding and are useful to bridge domain gaps, as shown in the cross-domain task of image-based 3D scene retrieval in changing indoor environments. They could potentially even enable new useful applications of scene understanding, such as text-based search or visual question answering (VQA).
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.