Abstract
3D object geometry reconstruction remains a challenge when working with transparent, occluded, or highly reflective surfaces. While recent methods classify shape features using raw audio, we present a multimodal neural network optimized for estimating an object’s geometry and material. Our networks use spectrograms of recorded and synthesized object impact sounds and voxelized shape estimates to extend the capabilities of vision-based reconstruction. We evaluate our method on multiple datasets of both recorded and synthesized sounds. We further present an interactive application for real-time scene reconstruction in which a user can strike objects, producing sounds that our method instantly uses to classify and segment the struck object, even if the object is transparent or visually occluded.
Keywords
- Impact Sounds
- Object Geometry
- Striking Object
- Vision-based Reconstruction
- VoxNet
1 Introduction
The problems of object detection, classification, and segmentation are central to understanding complex scenes. Detection of objects is typically approached using visual cues [1, 2]. Classification techniques have steadily improved, advancing our ability to accurately label an object by class given its depth image [3], voxelization [4], and/or RGB-D data [5]. Segmentation of objects from scenes provides contextual understanding of scenes [6, 7]. While these state-of-the-art techniques often result in high accuracy for common scenes and environments, there is still room for improvement when accounting for different object materials, textures, lighting, and other variable conditions.
The challenges introduced by transparent and highly reflective objects remain open research areas in 3D object classification. Common vision-based approaches cannot gain information about the internal structure of objects; however, audio-augmented techniques may contribute that missing information. Sound as a modality of input has the potential to close the audio-visual feedback loop and enhance object classification. It has been demonstrated that sound can augment visual information-gathering techniques, providing additional clues for classification of material and general shape features [8, 9]. However, previous work has not focused on identifying complete object geometries. Identifying object geometry from a combined audio-visual approach expands the capabilities of scene understanding.
In this paper, we consider identification of rigid objects such as tableware, tools, and furniture that are common in indoor scenes. Each object is identified by its geometry and its material. A discriminative factor for object classification is the sound that these objects produce when struck, referred to as an impact sound. This sound depends on a combination of the object’s material composition and geometric model. Impact sounds differ from video as object discriminators in that they reflect the internal structure of the object, providing clues about parts of an opaque or transparent object that cannot be seen visually. Impact sounds therefore complement video as an input to object recognition problems by addressing some inherent limitations of incomplete or partial visual data.
Main Results: In this paper, we introduce an audio-only Impact Sound Neural Network (ISNN-A) and a multimodal audio-visual neural network (ISNN-AV). These networks:
- Are the first networks to show high classification accuracy of both an object’s geometry and material based on its impact sound;
- Use impact sound spectrograms as input to reduce overfitting and improve accuracy and generalizability;
- Merge multimodal inputs through bilinear models, which have not previously been applied to audio-visual networks yet result in higher accuracy, as demonstrated in Table 4;
- Provide state-of-the-art results on geometry classification; and
- Enable real-time, interactive scene reconstruction in which users can strike objects to automatically insert the appropriate object into the scene.
2 Previous Work
3D Object Datasets. Thanks to a plethora of 3D scene and object datasets such as BigBIRD [10] and RGB-D Object Dataset [11], neural network models have been trained to label objects based on their visual representation. 3D ShapeNets [3] also provides two sets of object categories for object classification referred to as ModelNet10 and ModelNet40, which are common benchmarks for evaluation [12]. Scene-based datasets have also been built from RGB-D reconstruction scans of entire spaces, allowing for semantic data such as object and room relationships. For instance, NYU Depth Dataset [13] and SUNCG [14] enable indoor segmentation and semantic scene completion from depth images.
2.1 3D Reconstruction
Structure from Motion (SFM) [15], Multi-View Stereo (MVS) [16], and Shape from Shading [17] are all techniques to estimate shape properties of a scanned scene. Although these methods alone do not achieve a segmented representation of the objects within the scene, they serve as a foundation for many algorithms. RGB-D depth-based, active reconstruction methods can also be used to generate 3D geometrical models of static [6, 18] and dynamic [19, 20] scenes using commodity sensors such as the Microsoft Kinect and GPU hardware in real-time. Techniques have also been developed to overcome some limitations of vision-based reconstruction techniques [21] such as scene lighting, occlusions, clutter, and overlapping transparent objects. When limited solely to visual input, these challenges remain. Additional modalities, such as the impact sounds we explore in this paper, have the potential to address these issues.
Alternate Modalities. While image and depth-based techniques cover the majority of reconstruction use cases, edge cases have motivated research to explore alternative modalities that may procure the level of detail that vision-based techniques alone cannot. The dip transform for 3D shape reconstruction [22] uses fluid displacement of an object to obtain shape information. Time-of-Flight cameras introduce another modality to better classify materials and correct the depth of transparent objects [23]. Our work uses both recorded and synthetic audio as an additional modality to complement vision-based reconstruction.
2.2 Environmental Sound Classification
Audio descriptors have been primarily explored in the context of environmental sound classification. Multiple datasets have been established for evaluating classification of various environmental sounds [24,25,26]. Traditional techniques use a variety of features extracted from sounds, such as Mel frequency spectral coefficients and spectral shape descriptors [27, 28]. Similar approaches are used to classify an environment based on the sounds heard within it [29].
Convolutional neural networks have also been applied to these problems, producing improved results [30, 31]. Recently, some interest has been given to exploring the performance of different network structures [32, 33]. Impact sounds are a specific category of environmental sounds, and in this paper we perform fine-grained object classification between perceptually similar sounds.
2.3 Object/Scene Understanding Through Sound
Sound can be used as a source of information for deeper understanding of 3D scenes and objects. Considering impact sounds specifically, the material of an object can be estimated from sound using iterative, optimization-based parameter identification techniques [34, 35]. Sound has also been used to obtain information about the physical properties of objects involved in impact simulations [36]. Shape primitives were included as a part of these physical properties, but were not representative of real-world object geometries. However, it has been proven that a given impact sound may have been produced by any of multiple possible object geometries, so geometry cannot be uniquely reconstructed from sound alone [37]. Previous work has not attempted complete object reconstruction. In this work, we constrain the outputs to known objects, making the problem tractable.
Sound and video are intrinsically linked modalities for understanding the same scene, object, or event. Using visual and audio information, it is possible to predict the sound corresponding to a visual image or video [38, 39]. Sound prediction from video has also been specifically explored for impact sounds [40].
Multimodal Fusion. Other works have fused audio and visual cues to better understand objects and scenes. Sparse auditory clues can supplement the ability of random fields to obtain material labels and perform segmentation [9]. Neural networks have proven valuable in fusing audio-visual input to emulate the sensory interactions of human information processing [8]. While multimodal methods have succeeded in fusing input streams to capture material and low-level shape properties to aid segmentation, they have not attempted to identify specific object geometries.
Early attempts at multimodal fusion in neural networks focused on increasing classification specificity by combining the individual classification results of separate input streams [41]. Bilinear modeling can model the multiplicative interactions of differing input types, and has been applied as a method of pooling input streams in neural networks [42, 43]. Bilinear methods have been further developed to reduce complexity and increase speed, while other approaches to modeling multiplicative interactions have also been explored [44,45,46]. Bilinear methods have not yet been applied to merging audio-visual networks, and our ISNN-AV network is the first to do so.
3 Audio and Visual Datasets
To perform multimodal classification of object geometries, we need datasets containing appropriate multimodal information. Visual object reconstruction can provide a rough approximation of object geometry, serving as one form of input. Impact audio produced from real or simulated object vibrations provides information about internal and occluded object structure, making for an effective second input. Figure 2 provides examples of object geometries, while the corresponding spectrograms model the sounds that provide another input modality.
Appropriate audio can be found in some existing datasets, but the corresponding geometries are difficult to model. AudioSet contains impact sounds in its “Generic impact sounds” and “{Bell, Wood, Glass}” categories [24], while ESC-50 has specific categories including “Door knock” and “Church bells” [25]. The Greatest Hits sound dataset comes closest to our needs, containing impact sounds labeled according to the type of object [40]. However, many of the categories do not contain rigid objects (e.g. cloth, water, grass) or contain complex structures that cannot be represented with one geometric model of one material (e.g. a stump with roots embedded in the ground).
We want to use an impact sound as one input to identify a specific geometric model that could have created that sound. A classifier for this purpose could be trained on a large number of recorded sounds produced from struck objects. However, it is difficult and time-consuming to obtain a representative sample of real-world objects of all shapes and sizes. It is much easier to create a large dataset of synthetic sounds using geometric shapes and materials which can be applied to the objects. We now describe our methodology for generating the data used for training, as visualized in Fig. 3.
3.1 Audio Data
We create a large amount of our training data by simulating the vibrations of rigid-body objects and the sounds that they produce. Modal sound synthesis is an established method for synthesizing these sounds. We refer readers to the Supplemental Document (Sect. 1) for a mathematical overview and to previous work for the full derivation of the algorithm [8, 47, 48]. Modal sound synthesis can be broken up into two steps: a preprocessing modal analysis step to process the inputs and a faster modal synthesis step to synthesize individual sounds.
We build multimodal datasets through separate processing flows. Modal sound synthesis produces spectrograms used for audio input. Voxelization as another modality provides a first estimate of shape. Incorporating audio features improves classification accuracy through understanding of how objects vibrate.
Modal Analysis. Modal analysis is a process for modeling and understanding the vibrations of objects in response to external forces. Vibrations in an object can be modeled with the wave equation [49], but in order to handle arbitrary geometries with unknown analytical solutions, it is more common to perform finite element analysis on a discretized representation of the object [47, 50].
Starting with a watertight triangle mesh representation of the object’s surface, the interior volume of the object is filled with a tetrahedral volumetric mesh. A finite element model can then be constructed to represent the free vibrations of the object. Damping within the object causes vibrations to decay over time; we use Rayleigh damping to model this effect.
Given this representation of a vibrating object, we are interested in determining the frequencies at which it vibrates. This can be accomplished through a generalized eigen-decomposition of the finite element matrices. Finally, the system can be decoupled into linearly separable modes of vibration. Each mode has a solution in the form of a damped sinusoid, each having a different frequency and rate of decay. This modal analysis step is performed once per object, and is a computationally-intensive task. The resulting modes’ frequencies of vibration and damping rates are saved to be used in modal synthesis.
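As a brief reference for the formulation summarized above (a standard presentation consistent with [47, 48]; the paper's own derivation is in the Supplemental Document), the damped finite element system can be written as

```latex
M\ddot{x} + C\dot{x} + Kx = f, \qquad C = \alpha M + \beta K \quad \text{(Rayleigh damping)}.
```

The generalized eigendecomposition \(KU = MU\Lambda\) with the substitution \(x = Uq\) decouples the system into independent modes,

```latex
\ddot{q}_i + (\alpha + \beta\lambda_i)\,\dot{q}_i + \lambda_i q_i = \big(U^{\top} f\big)_i,
\qquad
q_i(t) = a_i\, e^{-d_i t} \sin(2\pi f_i t + \theta_i),
```

where each mode's frequency \(f_i\) and damping rate \(d_i\) follow from \(\lambda_i\), \(\alpha\), and \(\beta\), and the amplitude \(a_i\) and phase \(\theta_i\) are set by the applied impulse.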
Modal Synthesis. Striking an object excites its modes of vibration, causing a change in pressure waves. For a simulated object, an impulse in object-space can be converted to mode impulses to determine initial amplitudes for the corresponding sinusoids. The sinusoids for the modes are then sampled through time and added to produce the final sound. This process can be repeated for different materials, geometries, and hit points to create a set of synthetic impact sounds.
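As a concrete illustration, the synthesis step can be sketched as a sum of damped sinusoids. This is a minimal sketch, not the paper's exact pipeline: it assumes per-mode frequencies, damping rates, and initial amplitudes have already been produced by modal analysis, and the final normalization is an illustrative choice.

```python
import numpy as np

def synthesize_impact(freqs, dampings, amplitudes, duration=1.0, sr=44100):
    """Sum of damped sinusoids from modal analysis output.

    freqs (Hz), dampings (1/s), and amplitudes are per-mode values; the
    amplitudes would come from projecting the object-space impulse onto
    the mode shapes.
    """
    t = np.arange(int(duration * sr)) / sr
    sound = np.zeros_like(t)
    for f, d, a in zip(freqs, dampings, amplitudes):
        sound += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    # Normalize to avoid clipping when writing to a fixed-point audio file.
    return sound / (np.max(np.abs(sound)) + 1e-12)
```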
3.2 Audio Augmentations
Modal sound synthesis produces the set of frequencies, damping rates, and initial amplitudes of an object’s surface vibrations. However, since we are attempting to imitate real-world sounds, there are some additional auditory effects to take into account: acoustic radiance, room acoustics, background noise, and time variance.
Acoustic Radiance. Sound waves produced by the object must propagate through the air to reach a listener or microphone position. Even in an empty space, the resulting sound will change with different listener positions depending on the vibrational mode shapes; this is the acoustic radiance of the object [51]. This effect has a high computational cost for each geometric model, and since we use datasets with relatively large numbers of models, we do not include it in our simulations.
Room Acoustics. In an enclosed space, sound waves bounce off walls to produce early echo-like reflections and noisy late reverberations; this is the effect of room acoustics. We created a set of room impulse responses in rooms of different sizes and materials using a real-time sound propagation simulator, GSound [52]. Each modal sound is convolved with a randomly selected room impulse response.
Background Noise. In most real-world situations, background noise will also be present in any recording. We simulate background noise through addition of a random segment of environmental audio from the DEMAND database [53]. These noise samples come from diverse indoor and outdoor environments and contain around 1.5 h of recordings.
Time Variance. Finally, we slightly randomize the start time of each modal sound. This reflects the imperfect timing of any real-world recording process. Together, these augmentations make the synthesized sounds more accurately simulate recordings that would be taken in the real world.
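A minimal sketch of these augmentations (room-impulse-response convolution, additive background noise, and a random onset shift) is shown below. The maximum onset offset and the noise gain are illustrative assumptions rather than values from the paper, and each noise clip is assumed to be longer than the one-second impact sound.

```python
import numpy as np

def augment(modal_sound, rirs, noise_bank, sr=44100, rng=np.random):
    # Room acoustics: convolve with a randomly selected room impulse response.
    rir = rirs[rng.randint(len(rirs))]
    wet = np.convolve(modal_sound, rir)[: len(modal_sound)]

    # Time variance: randomly delay the onset (up to ~50 ms, an assumed bound).
    out = np.zeros_like(wet)
    offset = rng.randint(0, int(0.05 * sr))
    out[offset:] = wet[: len(wet) - offset]

    # Background noise: add a random segment of environmental audio (e.g. DEMAND).
    noise = noise_bank[rng.randint(len(noise_bank))]
    start = rng.randint(0, len(noise) - len(out))
    out = out + 0.1 * noise[start : start + len(out)]  # noise gain is an assumed value
    return out
```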
3.3 Visual Data
Our visual data consists of datasets of geometric models of rigid objects, ranging from small to large and of varying complexity. Given these geometric models, we can simulate synthesized sounds for a set of possible materials. During evaluation, object classification results were tested using multiple scenarios of voxelization, scale, and material assignment (Sect. 5.2).
4 Impact Sound Neural Network (Audio and Audio-Visual)
Given the impact sounds and representation described in Sect. 3, we now examine their ability to identify materials and geometric models. We begin with an analysis of the distributions of the features themselves as proper feature selection is a key component in classifier construction.
4.1 Input Features and Analysis
Audio Features. In environmental sound classification tasks, classification accuracy can be affected by the input sound’s form of representation [28, 33]. A one-dimensional time series of audio samples over time can be used as features [39], but they do not capture the spectral properties of sound. A frequency dimension can be introduced to create a time-frequency representation and better represent the differentiating features of audio signals.
In this work, we use a mel-scaled spectrogram as input. Spectrograms have demonstrated high performance in CNNs for other tasks [33]. A given sound, originally represented as a waveform of audio samples over time, is first trimmed to one second in length since impact sounds are generally transient. The sound is resampled to 44.1 kHz, the Nyquist rate for the full range of audible frequencies up to 22.05 kHz. We compute the short-time Fourier transform of the sound, using a Hann window function with 2048 samples and an overlap of 25%. The result is squared to produce a canonical “spectrogram”, then the frequencies are mapped into mel-scaled bins to provide appropriate weights matching the logarithmic perception of frequency. Each spectrogram is individually normalized to reduce the effects of loudness and microphone distance. To create the final input features for the classifier, we downsample the mel-spectrogram to a size of 64 frequency bins by 25 time bins.
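The feature extraction described above might look roughly like the following sketch using librosa. The parameters follow the text (2048-sample Hann window, 25% overlap, 64 mel bins, 25 time bins), while the exact normalization and the method of downsampling the time axis are assumptions.

```python
import librosa
import numpy as np

def impact_spectrogram(path, sr=44100, n_mels=64, time_bins=25):
    # Resample to 44.1 kHz and trim to one second; pad shorter recordings.
    y, _ = librosa.load(path, sr=sr, duration=1.0)
    y = librosa.util.fix_length(y, size=sr)
    # 2048-sample Hann window with 25% overlap -> hop length of 1536 samples.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=1536, n_mels=n_mels,
                                         power=2.0)
    # Per-spectrogram normalization to suppress loudness and microphone distance.
    mel = (mel - mel.mean()) / (mel.std() + 1e-12)
    # Downsample the time axis to a fixed number of bins.
    idx = np.linspace(0, mel.shape[1] - 1, time_bins).astype(int)
    return mel[:, idx]  # shape: (64, 25)
```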
We performed principal component analysis on a small sample of synthesized impact sounds to demonstrate the advantage of mel-spectrograms as input features for audio of this type. We used 70 models and 6 materials with a single hit per combination to synthesize a total of 420 impact sounds for this analysis. Figure 4 displays the first two principal components as mel-spectrograms, describing important distinguishing factors in our dataset. The first component identifies damping in higher frequencies, while the second component identifies specific frequency bins.
Visual Features. As in VoxNet [4], visual data serves as an input to our classification models in the form of a 30×30×30 voxelized representation of the object geometry. We voxelize models from our real and synthetic dataset and from the 3D ShapeNets ModelNet10 and ModelNet40 benchmarks [3]. All objects were voxelized using the same voxel and grid size. We generated audio and visual data for our dataset and for up to 200 objects (train and test) per ModelNet class.
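A rough voxelization sketch follows. It builds a surface occupancy grid from sampled points using trimesh; the voxelizer actually used in the paper (matching VoxNet's input) may compute occupancy differently, so this is only an illustrative approximation.

```python
import numpy as np
import trimesh

def voxelize(mesh_path, grid=30, samples=100000):
    """Approximate 30x30x30 occupancy grid from surface samples (a sketch)."""
    mesh = trimesh.load(mesh_path, force='mesh')
    points = mesh.sample(samples)                  # points on the mesh surface
    lo, hi = mesh.bounds                           # axis-aligned bounding box
    occupancy, _ = np.histogramdd(points, bins=grid, range=list(zip(lo, hi)))
    return (occupancy > 0).astype(np.float32)      # binary 30x30x30 grid
```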
4.2 Model Architecture
Our approach to object geometry classification applies convolutional neural networks (CNNs) to these audio and visual features. CNNs achieve high accuracy in a wide variety of tasks, and convolutional kernels should be able to capture the recurring patterns underlying the structure of our sounds.
Audio-Only Network (ISNN-A). We first developed a network structure to perform object classification using audio only. Our audio Impact Sound Neural Network (ISNN-A) is based on optimization performed over a search space combining general network structure, such as the number of convolutional layers, and hyperparameter values. This optimization was performed using the TPE algorithm [54]. We found that a single convolutional layer followed by two dense layers performs optimally on our classification tasks. This network structure utilizes a convolution kernel with increased frequency resolution to more effectively recognize spectral patterns across a range of frequencies [30]. Our generally low number of filters and narrower layer sizes aim to reduce overfitting by encouraging the learning of generalizable geometric properties.
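A minimal Keras sketch of this layout is given below. The filter count, kernel shape, and dense-layer widths are illustrative assumptions (the optimized values come from the TPE search and are not restated here); only the overall structure — one convolution with a frequency-elongated kernel followed by two dense layers — follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_isnn_a(n_classes, input_shape=(64, 25, 1)):
    """Sketch of the audio-only ISNN-A on 64x25 mel-spectrogram inputs."""
    return tf.keras.Sequential([
        # Single convolution with a tall (frequency-elongated) kernel to pick up
        # spectral patterns across a wide range of frequency bins.
        layers.Conv2D(16, kernel_size=(16, 3), activation='relu',
                      input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        # Two dense layers follow the convolution, as described in the text.
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
```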
Figure 5 (caption). Sample activations (a–e) of the ISNN convolution layer. Filters identify characteristic patterns in frequencies (a, d), damping rates (b, c), and high-frequency noise (e). The distinguishing characteristics in these activations match the expected factors discovered in the PCA analysis in Fig. 4. An audio input spectrogram (f) and an activation maximization (g) learned by the ISNN network for the toilet ModelNet10 class show correctly learned patterns.
Figure 5 shows sample activations of a convolutional layer of the ISNN-A network. Based on the PCA and modal analysis we performed, we expect that the differences between geometries primarily manifest as different sets of modal frequencies, as well as different sets of initial mode amplitudes and damping rates. These activations corroborate our expectations. In Fig. 4a, we see that damping is an important discriminating feature, which has been learned by filters (b) and (c) in Fig. 5. Similarly, the frequency patterns that we expected because of Fig. 4b can be seen in filters (a) and (d). This demonstrates that our model is learning statistically optimal kernels with high discriminatory power.
Multimodal Audio-Visual Network (ISNN-AV). Our audio-visual network, as shown in Fig. 1, consists of our audio-only network combined with a visual network based on VoxNet [4] using either a concatenation, addition, multiplicative fusion, or bilinear pooling operation. Concatenation and addition serve as our baseline operations, in which the outputs of the first dense layers are concatenated or added before performing final classification. These operations are not ideal because they fail to emulate the interactions that occur between multiple forms of input. On the other hand, multiplicative interactions allow the input streams to modulate each other, providing a more accurate model.
We evaluate two multiplicative merging techniques to better model such interactions. Multiplicative fusion calculates element-wise products between inputs, while projecting the interactions into a lower-dimensional space to reduce dimensionality [46]. Multimodal factorized bilinear pooling takes advantage of optimizations in size and complexity, and is our final merged model [45]. This method builds on the basic idea of multiplicative fusion by performing a sequence of pooling and regularization steps after the initial element-wise multiplication.
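A sketch of the factorized bilinear merge, in the spirit of multimodal factorized bilinear pooling [45], is given below. The output dimension and factor count are illustrative assumptions; the audio and voxel features are the outputs of the first dense layers of the two streams, and the merged vector would feed the final dense classification layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mfb_merge(audio_feat, visual_feat, out_dim=128, k=5):
    """Factorized bilinear pooling of two feature vectors (illustrative sizes)."""
    a = layers.Dense(out_dim * k)(audio_feat)      # project audio stream
    v = layers.Dense(out_dim * k)(visual_feat)     # project voxel stream
    joint = layers.Multiply()([a, v])              # element-wise interaction
    joint = layers.Reshape((out_dim, k))(joint)
    # Sum-pool over the factor dimension, then power and l2 normalization.
    joint = layers.Lambda(lambda x: tf.reduce_sum(x, axis=-1))(joint)
    joint = layers.Lambda(lambda x: tf.sign(x) * tf.sqrt(tf.abs(x) + 1e-12))(joint)
    joint = layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1))(joint)
    return joint
```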
5 Results
We now present our training and evaluation methodology along with final results. For each of the datasets, we evaluate the network architectures described in Sect. 4.2. We compare against several baselines: a K Nearest Neighbor classifier, a linear SVM trained through SGD [55], VoxNet [4], and SoundNet [39]. Our multimodal networks combine VoxNet with either ISNN-A or SoundNet8, merged through either concatenation (MergeCat), element-wise addition (MergeAdd), multiplicative fusion (MergeMultFuse) [46], or multimodal factorized bilinear pooling (MergeMFB) [45]. Training was performed using an Adam optimizer [56] with a batch size of 64; the remaining hyperparameters were hand-tuned on a validation set before final evaluation on a test set.
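Under these settings, the training configuration could be sketched as follows, reusing the hypothetical build_isnn_a helper from the sketch above. The Adam optimizer and batch size of 64 come from the text; the loss choice, epoch count, and data-array names are assumptions.

```python
model = build_isnn_a(n_classes=len(class_names))
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',   # assumed loss formulation
              metrics=['accuracy'])
model.fit(train_spectrograms, train_labels,
          batch_size=64,
          epochs=50,                                     # epoch count is an assumed value
          validation_data=(val_spectrograms, val_labels))
```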
5.1 RSAudio Evaluation
Our “RSAudio” dataset was constructed from real and synthesized sounds. When performing geometry classification, each geometric model is its own class; given a query sound, the network returns the geometric model that would produce the most similar sound. RSAudio combines real and synthetic sounds to increase dataset size and improve accuracy. Specific details about the recording process for the real sounds can be found in the Supplemental Document (Sect. 4). The dataset is publicly available at http://gamma.cs.unc.edu/ISNN/.
The results for geometry classification are presented in Tables 1, 2, 3, and 4. For RSAudio synthetic (S) and real (R), ISNN-A provides results competitive with all other tested algorithms. For real sounds, where recording issues are most problematic, ISNN-A significantly outperforms all other algorithms, with an accuracy of 92.37%. On the merged RSAudio dataset of real and synthetic sounds, all models produce higher accuracy than on either synthetic or real sounds alone, indicating that training on both sets improves generalizability. As an additional baseline, we classified 100 ImageNet RGB images of transparent objects using a pre-trained VGG16 model and obtained 73.27% top-5 accuracy with an average confidence of 46.64%. While this accuracy is not directly comparable with the ModelNet and RSAudio results, it provides a preliminary suggestion that a second modality could further improve results.
5.2 ModelNet Evaluation
In Tables 2, 3, and 4, ModelNet results are categorized by input: audio (A), voxel (V), or both (AV). The “MN10” dataset consists of 119,620 total synthetic sounds: multiple sounds at different hit points for each geometry and material combination. The “o” suffix (e.g. “MN10o”) indicates that only one sound per model was produced, and all models were assigned one identical material. The “s” suffix (e.g. “MN10os”) indicates that each ModelNet class was assigned a realistic and normally distributed scale before synthesizing sounds. The “m” suffix (e.g. “MN10om”) indicates that each ModelNet class was assigned a realistic material.
By assigning a material and scale to each ModelNet10 class (MN10osm), classification performance achieved 71.50% for ISNN-A. Real-world objects within a class will tend to be made of a similar material and scale, so MN10osm is likely more reflective of performance in real-world settings where these factors provide increased potential for classification. However, for the multimodal ISNN-AV, material and scale assignments do not improve accuracy. In MN10o, larger geometric features will correspond to lower-pitched sounds (i.e. a large object will produce a deeper sound than a small object), and the multimodal fusion of those cues produces higher accuracy. However, when models are given materials and scales in MN10o{s,m,sm}, the voxel inputs remain unchanged, weakening the relationship between voxel and audio inputs. Scaling the voxel representation as well as the model used for sound synthesis may reduce this issue.
Assigning scale and material improves ModelNet40 accuracy (MN40osm) because its object classes differ more in size and material than those of ModelNet10. The merged audio-visual networks outperform the separate audio or visual networks in every case except MN10os, as discussed above. Across all ModelNet10 datasets, ISNN-AV with multimodal factorized bilinear pooling produces the highest accuracy, 91.80% on MN10o. Similarly, ModelNet40 produces optimal results using ISNN-AV with multiplicative fusion, 93.24% on MN40osm. Entries with a “—” were not completed due to prohibitive time or memory costs when using the large MN10 dataset.
5.3 Additional Evaluations
We also evaluated on additional datasets such as that of Arnab et al. [9]. This dataset consists of audio of tabletop objects being struck, with ground-truth object labels provided. ISNN-A produces 89.29% accuracy, the highest of all evaluated algorithms. This accuracy is slightly lower than ISNN-A’s accuracy on RSAudio’s real sounds, likely due to the loosened constraints on the recording environment and striking methodology. The same networks were also considered for material classification, and results can be seen in the Supplemental Document (Sect. 6).
We also evaluate the ability of synthetic sounds to supplement a smaller number of real sounds for training, which would reduce the human effort needed to record sounds. Figure 6 shows classification accuracy on a real subset of our RSAudio dataset for ISNN-A trained on a combination of real and synthetic sounds. The training sets have identical total sizes but are created with specific percentages of real and synthetic sounds; networks are then trained on either the combined dataset or the real sounds independently. We find that the addition of synthetic sounds to the dataset improves accuracy by up to 11%. With only 30% real sounds (Point A), accuracy begins to plateau, reaching over 90% with only 60% real sounds (Point B). These results indicate that synthetic audio can supplement a smaller amount of recorded audio to improve accuracy.
Figure 6 (caption). Classification accuracy on a test set of real sounds using ISNN trained on a combination of real and synthetic sounds. (a) When trained on combined real and synthetic sounds (Real+Synth), classification accuracy is up to 11% higher than when trained on the real sounds alone (Real). (b) When insufficient real sounds are provided, synthetic sounds further reduce loss. (c) Our method correctly classifies impact sounds with voxel data across ModelNet40 classes, as displayed by the MN40osm confusion matrix.
Augmentations in Subsect. 3.2 were designed to enhance the realism of synthetic audio for improved transfer learning from synthetic to real sounds. However, we were unable to find an instance in which these augmentations significantly improved test accuracy on RSAudio real sounds when training on RSAudio synthetic sounds. This indicates that the modal components of sounds (frequencies, amplitudes) are sufficient and most critical for object classification, and that acoustic radiance, noise, and propagation effects have little, if any, impact on accuracy.
5.4 Application: Audio-Guided 3D Reconstruction
A primary use case of the method described in this paper is to improve reconstruction of transparent, occluded, or reflective objects. We have constructed a demo application in which our method enables real-time scene reconstruction and augmentation. We enhance open-source 3D reconstruction software [6, 7] by adding an audio-based selector function. Figure 7 illustrates the application pipeline. Further details are in the Supplemental Document (Sect. 2) and demo video at http://gamma.cs.unc.edu/ISNN/.
6 Conclusion
In this paper, we have presented a novel approach for improving the reconstruction of 3D objects using audio-visual data. Given an impact sound as an additional input, ISNN-A and ISNN-AV have been optimized to achieve high accuracy on object classification tasks. The use of spectrogram representations of the input reduces overfitting by providing spectral information directly to the networks. ISNN further shows higher performance when trained on a dataset combining synthetic and real audio. Sound provides additional cues, allowing us to estimate the object’s material class, provide segmentation, and enhance scene reconstruction.
Limitations and Future Work: While VoxNet serves as a strong baseline for the visual component of ISNN-AV, substituting different visual networks could identify more optimal network pairings. As with existing learning methods, VoxNet is limited to classifying known geometries. However, impact sounds hold the potential to identify the correct geometry even when a model database is not provided, allowing for accurate 3D reconstructions or hole-filling.
References
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Washington, DC, USA, pp. 580–587. IEEE Computer Society (2014)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shape modeling. In: Proceedings of 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Maturana, D., Scherer, S.: VoxNet: a 3D convolutional neural network for real-time object recognition. In: IROS (2015)
Socher, R., Huval, B., Bhat, B., Manning, C.D., Ng, A.Y.: Convolutional-recursive deep learning for 3D object classification. In: Conference on Neural Information Processing Systems (NIPS) (2012)
Golodetz, S., et al.: SemanticPaint: a framework for the interactive segmentation of 3D scenes. Technical report TVG-2015-1, Department of Engineering Science, University of Oxford, Oct 2015. Released as arXiv e-print arXiv:1510.03727
Valentin, J., et al.: SemanticPaint: interactive 3D labeling and learning at your fingertips. ACM Trans. Graph. 34(5) (2015)
Zhang, Z., et al.: Generative modeling of audible shapes for object perception. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
Arnab, A., et al.: Joint object-material category segmentation from audio-visual cues. In: Proceedings of the British Machine Vision Conference (BMVC) (2015)
Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: BigBIRD: a large-scale 3D database of object instances. In: IEEE International Conference on Robotics and Automation (ICRA) (2014)
Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: IEEE International Conference on Robotics and Automation (ICRA) (2011)
Kanezaki, A., Matsushita, Y., Nishida, Y.: RotationNet: learning object classification using unsupervised viewpoint estimation. CoRR abs/1603.06208 (2016)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition (2017)
Westoby, M., Brasington, J., Glasser, N., Hambrey, M., Reynolds, J.: structure-from-motion photogrammetry: a low-cost, effective tool for geoscience applications. Geomorphology 179, 300–314 (2012)
Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2006)
Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape-from-shading: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 21(8), 690–706 (1999)
Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: International Symposium on Mixed and Augmented Reality (ISMAR) (2011)
Newcombe, R., Fox, D., Seitz, S.: DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Dai, A., Niessner, M., Zollhofer, M., Izadi, S., Theobalt, C.: BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. In: SIGGRAPH (2017)
Lysenkov, I., Eruhimov, V., Bradski, G.: Recognition and pose estimation of rigid transparent objects with a kinect sensor. In: Robotics: Science and Systems Conference (RSS) (2013)
Aberman, K., et al.: Dip transform for 3D shape reconstruction. In: SIGGRAPH (2017)
Tanaka, K., Mukaigawa, Y., Funatomi, T., Kubo, H., Matsushita, Y., Yagi, Y.: Material classification using frequency- and depth-dependent time-of-flight distortion. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 79–88, July 2017
Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: Proceedings of the IEEE ICASSP 2017, New Orleans, LA (2017)
Piczak, K.J.: ESC: dataset for environmental sound classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, MM 2015, New York, NY, USA, pp. 1015–1018. ACM (2015)
Salamon, J., Jacoby, C., Bello, J.P.: A dataset and taxonomy for urban sound research. In: Proceedings of the 22nd ACM International Conference on Multimedia, MM 2014, New York, NY, USA, pp. 1041–1044. ACM (2014)
Büchler, M., Allegro, S., Launer, S., Dillier, N.: Sound classification in hearing aids inspired by auditory scene analysis. EURASIP J. Adv. Signal Process. 2005(18), 387845 (2005)
Cowling, M., Sitte, R.: Comparison of techniques for environmental sound recognition. Pattern Recognit. Lett. 24(15), 2895–2907 (2003)
Barchiesi, D., Giannoulis, D., Stowell, D., Plumbley, M.D.: Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Process. Mag. 32(3), 16–34 (2015)
Piczak, K.J.: Environmental sound classification with convolutional neural networks. In: 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, Sept 2015
Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, Mar 2017
Huzaifah, M.: Comparison of time-frequency representations for environmental sound classification using convolutional neural networks. CoRR abs/1706.07156 (2017)
Ren, Z., Yeh, H., Lin, M.C.: Example-guided physically based modal sound synthesis. ACM Trans. Graph. 32(1), 1:1–1:16 (2013)
Sterling, A., Lin, M.C.: Interactive modal sound synthesis using generalized proportional damping. In: Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D 2016, New York, NY, USA, pp. 79–86. ACM (2016)
Zhang, Z., Li, Q., Huang, Z., Wu, J., Tenenbaum, J., Freeman, B.: Shape and material from sound. In: Guyon, I. (eds.) Advances in Neural Information Processing Systems, vol. 30. pp. 1278–1288. Curran Associates, Inc. (2017)
Kac, M.: Can one hear the shape of a drum? Am. Math. Mon. 73(4), 1–23 (1966)
Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_48
Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems, pp. 892–900 (2016)
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413 (2016)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 568–576. Curran Associates, Inc. (2014)
Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Comput. 12(6), 1247–1283 (2000)
Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: International Conference on Computer Vision (ICCV) (2015)
Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 317–326 (2016)
Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: IEEE International Conference on Computer Vision (ICCV), pp. 1839–1848 (2017)
Park, E., Han, X., Berg, T.L., Berg, A.C.: Combining multiple sources of knowledge in deep CNNs for action recognition. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE (2016)
O’Brien, J.F., Shen, C., Gatchalian, C.M.: Synthesizing sounds from rigid-body simulations. In: Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2002, New York, NY, USA, pp. 175–181. ACM (2002)
Raghuvanshi, N., Lin, M.C.: Interactive sound synthesis for large scale environments. In: Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, I3D 2006, New York, NY, USA, pp. 101–108. ACM (2006)
van den Doel, K., Pai, D.K.: The sounds of physical shapes. Presence 7, 382–395 (1996)
Morrison, J.D., Adrien, J.M.: Mosaic: a framework for modal synthesis. Comput. Music. J. 17(1), 45–56 (1993)
James, D.L., Barbič, J., Pai, D.K.: Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. ACM Trans. Graph. (TOG) 25, 987–995 (2006)
Schissler, C., Manocha, D.: GSound: interactive sound propagation for games. In: Audio Engineering Society Conference: 41st International Conference: Audio for Games, Feb 2011
Thiemann, J., Ito, N., Vincent, E.: The diverse environments multi-channel acoustic noise database (demand): a database of multichannel environmental noise recordings. Proc. Meet. Acoust. 19(1), 035081 (2013)
Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 2546–2554. Curran Associates, Inc. (2011)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G., (eds.) Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), Paris, France, pp. 177–187. Springer, Aug 2010. https://doi.org/10.1007/978-3-7908-2604-3_16
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)