Multimedia Analysis in Police–Citizen Communication: Supporting Daily Policing Tasks

  • Peter Leškovský
  • Santiago Prieto
  • Aratz Puerto
  • Jorge García
  • Luis Unzueta
  • Nerea Aranjuelo
  • Haritz Arzelus
  • Aitor Álvarez
Part of the Security Informatics and Law Enforcement book series (SILE)


This chapter describes an approach for improved multimedia analysis as part of an ICT-based tool for community policing. It includes technology for automatic processing of audio, image and video contents sent as evidence by the citizens to the police. In addition to technical details of their development, results of their performance within initial pilots simulating nearly real crime situations are presented and discussed.


More than 50% of Europeans, aged from 16 to 74, benefit from communication over social media (Eurostat 2018). Social media provide them with the possibility of agile communication with individuals and groups. In the context of police–citizen communication, the social media channels play a vital role in strengthening the relations and collaboration between the police authorities and communities. A direct interaction with the citizens over such channels allows police to obtain valuable information that contributes to crime prevention and solving.

In this sense, social media is often used for distributing videos of petty crimes, violent acts and other public security related issues (often in private circles or publicly as a third party observer). Such material, if available, is nowadays collected and used by the police as evidence when resolving crimes. One example is the 2013 Boston Marathon bombing, where the police asked citizens to send photos and videos from the area in order to identify the suspects. Automatic processing of multimedia material may thus help in processing or searching for related evidence.

Nevertheless, the current multimedia sharing with police departments and its processing is limited. The available applications for community policing, such as CitizenCOP (CitizenCOP Foundation n.d.), CommunityOnPatrol (Miami-Dade County 2018), MyPD (WiredBlue n.d.), myRPD (The Reno Police Department n.d.) or AlertCops (Ministerio del Interior 2018), merely allow users to send images, audio and/or video but do not consider any automatic processing of the material.

This chapter presents the INSPEC2T platform, a tool for advanced police and citizen communication via modern information and communication technologies (ICT). The INSPEC2T platform brings instantaneous reporting of safety issues and crimes, via web access or mobile application, to the citizens, and a platform for report intake and case management to the police (see also Chap. 12). Internally, it includes methods for automatic processing of multimedia contents, including images, audio and video. The aim of this platform is threefold:
  • Simplify the information intake process at the time of reporting.

  • Help in assessing the importance of each message.

  • Help to handle the information overload of the police dispatch centre.

To this end, person detection and reidentification, automatic rich transcription and keyword search in spoken contents are the main selected technologies that should aid in achieving the above goals.

Multimedia Processing for Policing Tasks

Within the INSPEC2T project and according to the stakeholders’ preferences (Sargsyan and Stoter 2016), the following types of multimedia analysis were identified as of higher importance and value: audio speech to text transcription (e.g. for interviews or incident intake annotations), audio based security event detection, person detection and reidentification, action recognition (e.g. running, fight, burglary), security event detection (e.g. fire, smoke, panic) and entity specific object detection (e.g. car, policeman).

The final use case scenarios of the INSPEC2T project envisioned that the citizens report events over their mobile phone by photographing, filming or audio-recording them. As this may limit the quality or comprehensibility of the input (e.g. for the cases of action and event recognition) three approaches were selected and implemented:
  • Audio processing for rich transcription of spoken contents.

  • Audio processing for keyword search of spoken contents.

  • Person reidentification across multiple images or videos.

These technologies allow for agile text extraction from recorded speech (supporting goal 1 of the INSPEC2T platform) necessary for indexing and search (goal 3), keyword search for automatic alarm raising (goal 2), person recognition for automatic notification of missing or suspect persons seen (goal 2) and automatic report correlation and grouping based on the “same person seen” rule (goal 3).

In the following sections, the developed technologies are explained in detail.

Audio Processing

Audio forensics involves the acquisition, analysis and evaluation of sound recordings that may ultimately be presented as admissible evidence in a court of law or some other official venue (Maher 2009). This field includes many topics focused on the study of automatic techniques for the processing of audio and spoken contents, such as: establishing the authenticity of audio evidence (Koenig and Lacey 2015), performing speech enhancement to improve audibility and intelligibility (Ikram and Malik 2010), identifying the source device regardless of the speaker and speech content (Garcia-Romero and Espy-Wilson 2010), analyzing the acoustic environment to identify the place where the recording was made (Malik 2013), recognizing speakers (Campbell et al. 2009) and transcribing speech (Mattys et al. 2012). Among these, automatic speech transcription is probably the least applied technology in the forensic domain due to the artefacts involved (adverse acoustic conditions, low audio quality, spontaneous speech, overlapping, etc.), which make it more difficult to obtain suitable results.

Nevertheless, in the field of community policing, where the citizens communicate incidents through voice calls, messaging (SMS, MMS) and other web and social media groups and frameworks, the analysis of spoken content becomes critical for the police, both at transcription and keyword-search level. Automatic transcription provides the complete spoken message in text format for further linguistic processing, whilst keyword search serves as a useful technique for fast indexing and retrieval through the detection of specific words and key phrases in the audio.

Accordingly, technology for automatic rich transcription and keyword search in English and Spanish was implemented and integrated in the INSPEC2T platform. The input audio was assumed to be recorded by the user with the microphone of a mobile device at a short distance, which favours the quality of the audio to be analyzed later. The technology relied on carefully trained deep neural networks for speech and language processing, taking advantage of the latest modelling paradigms in the scientific community. Rich transcription involves the tasks of speech recognition, capitalization and punctuation. Speech recognition focuses only on segments containing speech and transcribes them to raw text. The capitalization module detects named entities and proper names and capitalizes them. Finally, the punctuation module adds full stops and commas to the capitalized text. The keyword-search technique, on the other hand, exploits the lattice generated during the speech recognition process to search for specific terms given by the user. To make the systems robust, they were trained on a wide variety of acoustic and text data, so that the acoustic and language models would perform well under adverse conditions.

Data Compilation for Model Training

The data compilation process involved the acquisition, transformation, processing and normalization of acoustic and text data to (1) train the acoustic and language models and generate the vocabulary of the speech recognition systems (rich transcription and keyword search), (2) build the capitalization components and (3) construct the punctuation modules. The data were collected for both English and Spanish languages.

Regarding the English language, the CSTR VCTK (Veaux et al. 2017) corpus was used for training and evaluation purposes. This corpus fits the needs of the INSPEC2T platform well, since it contains speech data (43 h) uttered by 109 native English speakers with various accents. With the aim of generating more training data and adapting the data to different acoustic environments, noise samples from restaurants, streets and shopping centres were collected from the Freesound website (Freesound Org n.d.) and mixed with the clean audio of the corpus. Thus, the acoustic data for English amounted to a total of 216 h. For the English text data, news articles from the crime domain were crawled from digital newspapers, totalling 47 million words (after normalization). This text corpus was then used to extract more data from a generic English text corpus using data selection techniques. This way, the final English text corpus was composed of a total of 92.8 million words.
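The noise mixing described above can be illustrated with a minimal sketch. The chapter does not state the signal-to-noise ratios or the mixing procedure that were used, so the SNR-based scaling, the function names `rms` and `mix_at_snr`, and the toy signals below are all assumptions made for illustration only.

```python
import math
import random

def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB)."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    repeats = -(-len(clean) // len(noise))  # ceiling division
    noise = (noise * repeats)[:len(clean)]
    # Gain that brings the noise RMS down to clean_rms / 10^(snr_db / 20).
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + gain * n for c, n in zip(clean, noise)]

# Toy data (hypothetical): one second of a 440 Hz tone as "speech",
# a short burst of uniform noise as "street noise".
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1, 1) for _ in range(4000)]
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Repeating the call with several noise types and SNR values is one plausible way to multiply the amount of training audio, as the growth from 43 h to 216 h suggests.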

The English capitalization module was trained using the final text described above (92.8 million words), whilst the English punctuation module was built using the same text and new acoustic data (234 h and 53 min) from already transcribed TED talks (TED Conferences n.d.).

Concerning the Spanish language, the SAVAS corpus (del Pozo et al. 2014) was used as a basis. It is composed of contents from the broadcast news domain, and the acoustic data is divided into clean speech (40 h) and noisy speech (100 h). As in the English case, the clean speech was mixed with the noise samples, obtaining a total acoustic corpus of 220 h for acoustic model training. The Spanish text data was also obtained through data selection techniques, employing text from the crime domain gathered from digital newspapers (84.8 million words) and from the general domain. The final Spanish text was composed of 127.4 million words. Finally, the Spanish capitalization and punctuation modules were trained using the described acoustic (220 h) and text (127.4 million words) data.

Description of the System

The technology for automatic rich transcription and keyword search was constructed using the same tools and modelling paradigms for both English and Spanish. The only difference was the language-dependent grapheme-to-phoneme (G2P) tool. For English, a statistical G2P model trained on the BEEP pronunciation dictionary (Hunt 1996) with the Sequitur G2P tool (Bisani and Ney 2008) was used, whilst a rule-based transcriber inspired by the work of López (2004) was employed for Spanish.

The speech recognition systems for automatic rich transcription and keyword search were built using the Kaldi toolkit (Povey et al. 2011). The acoustic models corresponded to a hybrid implementation combining deep unidirectional long short-term memory (LSTM) networks with Hidden Markov Models (HMM), where the LSTMs were trained to provide posterior probability estimates for the HMM states. Two types of language models (LM) were integrated per language: a bigram ARPA-format LM for decoding and a 5-gram constant ARPA-format LM for rescoring of the lattices. The decoding LMs were estimated with modified Kneser-Ney smoothing using the KenLM toolkit (Heafield 2011).
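In hybrid neural network/HMM systems in general, the state posteriors produced by the network are usually divided by the state priors before decoding, which yields likelihoods up to a constant factor. The sketch below illustrates that common step only. It is not taken from the chapter, and the function name and the toy probabilities are hypothetical.

```python
import math

def scaled_log_likelihoods(posteriors, priors):
    """Per-state log p(x|s) up to a constant: log p(s|x) - log p(s)."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]

# Toy frame with three HMM states (made-up numbers).
posteriors = [0.7, 0.2, 0.1]   # network output for one frame
priors = [0.8, 0.1, 0.1]       # relative state frequencies in the training alignments
scores = scaled_log_likelihoods(posteriors, priors)
best_state = max(range(len(scores)), key=scores.__getitem__)
```

In this toy frame the most probable state by posterior is state 0, but after prior scaling state 1 scores highest, because state 0 is simply very frequent in the training data.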

The capitalization modules were trained using the recasing tool provided by the Moses open-source toolkit (Koehn et al. 2007), whilst the punctuation modules were modelled as unidirectional LSTM Recurrent Neural Networks (Tilk and Alumäe 2015).

Finally, regarding the keyword-search technique, the lattices generated by the above described speech recognition systems were processed using the lattice indexing technique described in Can and Saraclar (2011).
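To give an intuition for searching in lattices rather than in the single best transcript, the following sketch indexes word arcs by their label and chains adjacent arcs for multi-word terms. It is a strong simplification and not the weighted finite-state index of Can and Saraclar (2011). The arc format (word, start frame, end frame, posterior), the threshold and the toy lattice are assumptions.

```python
from collections import defaultdict

def build_index(arcs):
    """Map each word to the lattice arcs (start, end, posterior) that carry it."""
    index = defaultdict(list)
    for word, start, end, posterior in arcs:
        index[word].append((start, end, posterior))
    return index

def search(index, term, threshold=0.5):
    """Return (start, end, score) hits for a term of one or more words."""
    words = term.lower().split()
    hits = list(index.get(words[0], []))
    for word in words[1:]:
        # Extend each partial hit with arcs that start where it ends.
        hits = [(s, e2, p * p2)
                for s, e, p in hits
                for s2, e2, p2 in index.get(word, [])
                if s2 == e]
    return sorted(h for h in hits if h[2] >= threshold)

# Toy lattice (hypothetical): competing hypotheses "car"/"cat" share a time span.
arcs = [
    ("stolen", 0, 40, 0.9), ("car", 40, 70, 0.8), ("cat", 40, 70, 0.2),
    ("stolen", 100, 140, 0.3), ("car", 140, 170, 0.9),
]
index = build_index(arcs)
hits = search(index, "stolen car")
```

Because low-ranked alternatives such as "cat" remain in the index with their posteriors, a keyword can still be found when it is missing from the best transcript, at the cost of more false alarms when the threshold is lowered.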

Image and Video Processing

Mainstream forensic video analysis currently operates at the image level, working on video quality enhancement by sharpening, de-blurring, image stabilization (Amped 2018) and super-resolution (Bevilacqua et al. 2013). These functionalities offer higher quality images to police investigators, who can then proceed with further identification of objects, persons or actions. In the field of video surveillance, the focus has been on person detection and tracking as well as basic action recognition (e.g. persons entering a restricted area, following a specific route, loitering, running, crouching or falling, BOSCH n.d.). Nevertheless, the current commercial solutions are tuned to specific use cases, scenarios and, most importantly, acquisition scenes that are usually well known a priori, considering that static CCTV cameras are used. Unrestricted processing of free-hand acquisitions and evaluation of complex events or actions is still the subject of cutting-edge research. Scene understanding algorithms based on novel deep learning technologies show promising results in scene tagging and categorization (Zhou et al. 2014), event or action classification (Varol and Salah 2015) and object identification or dense image description (Johnson et al. 2016).

Of all these topics, accurate person reidentification is currently one of the most requested features. It has recently achieved its breakthrough in the commercial sphere, with applications that process payments by verifying the identity of the customer from a selfie image (Petroff 2016) and tools that offer face identity management (Animetrics 2018) to law enforcement agents, corporations, night bars, etc. Considering the elevated maturity of these technologies, person reidentification was the primary implementation goal of the visual multimedia analysis within the INSPEC2T project. The envisioned use cases for the police case management system are: detecting missing or suspect persons, detecting repeat offenders and their typical whereabouts, checking for possible implications of given offenders in the recent history of incidents, indirectly finding a common witness of criminal or vandal acts and compiling a concise list of persons present in a given video.

To recognize the same person in images and videos, robust face and person detection models and open world metrics were employed. These capture the characteristics of an individual across different cameras, views and conditions and, most of all, generalize to an unlimited set of inputs. Both face and whole body person recognition were considered. Global appearance measures were used that do not reveal the real identity of the subject, thus respecting the relevant personal data protection issues.

The principal steps of a general object reidentification include object detection, region normalization, feature extraction and final feature comparison for determining the identity of the object (Fig. 13.1). The detection task was solved by a classical machine learning approach by training deterministic classifiers on face and person regions. Currently, Histogram of Oriented Gradients (HOG) as well as convolutional neural network (CNN) features are used. The detection task involves a candidate region proposal that was based on a sliding window approach, run at multiple scales. The detection process can be accelerated if the geometry of the scene is known, by limiting the detection to certain regions and scales. In our case, the scale selection considered the minimal and maximal expected object sizes, set to a quarter of the image size for face detection. In the case of the CNN, a holistic approach was used, skipping the region proposal task, but including the region activation and evaluation within the layers of the neural network.
Fig. 13.1

Workflow of the person reidentification process based on facial images. The columns correspond to new person detection, matched person and no match, respectively. The rows show detected faces, facial landmarks detection, face region normalization via alignment and the extracted feature vectors (displayed as a matrix of grey values), respectively
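The sliding window region proposal used in the detection step can be sketched as follows. The scale step, the stride, the square window shape and the reading of "a quarter of the image size" as the maximum window size are assumptions, since the chapter does not give these details. The function name is hypothetical.

```python
def sliding_windows(img_w, img_h, min_size, max_size,
                    scale_step=1.25, stride_frac=0.25):
    """Yield square candidate regions (x, y, size) over a range of scales."""
    size = float(min_size)
    while size <= max_size:
        s = int(round(size))
        stride = max(1, int(s * stride_frac))  # shift by a fraction of the window
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                yield x, y, s
        size *= scale_step  # move on to the next, larger scale

# Toy setting (hypothetical): 640 x 480 image, faces between 80 px and a
# quarter of the image height.
windows = list(sliding_windows(640, 480, min_size=80, max_size=480 // 4))
sizes = sorted({s for _, _, s in windows})
```

Each candidate region would then be scored by the trained classifier. Restricting `min_size` and `max_size` is what reduces the number of candidates, and with it the detection time.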

After faces or people are detected, they can be analyzed for reidentification purposes. Nevertheless, an indispensable part of the process is the image normalization. Besides colourspace normalization, the facial regions were extracted and rectified so that the faces became centralized and aligned vertically. This process uses multiple facial landmarks (e.g. labial or eye corners) to estimate and correct for the inclination of the head. The detection of the facial landmarks used the method described in Kazemi and Sullivan (2014) and was pretrained using the iBUG 300-W dataset (Sagonas et al. 2013).
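The inclination correction can be illustrated with two eye landmarks. The sketch assumes that the in-plane rotation is estimated from the line joining the eye centres and removed by rotating about their midpoint. The actual system uses multiple landmarks, and all function names and coordinates here are hypothetical.

```python
import math

def roll_angle(left_eye, right_eye):
    """In-plane head inclination (radians) from the line joining the eyes."""
    return math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])

def rotate_about(point, center, angle):
    """Rotate `point` around `center` by `angle` radians."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

def align_landmarks(landmarks, left_eye, right_eye):
    """Rotate all landmarks so that the eye line becomes horizontal."""
    angle = roll_angle(left_eye, right_eye)
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    return [rotate_about(p, center, -angle) for p in landmarks]

# Toy landmarks (made-up pixel coordinates): eyes of a tilted head, mouth corner.
left_eye, right_eye, mouth = (100.0, 120.0), (160.0, 100.0), (140.0, 170.0)
aligned = align_landmarks([left_eye, right_eye, mouth], left_eye, right_eye)
```

The same rotation, followed by cropping and scaling to a fixed size, would be applied to the image pixels before feature extraction.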

Finally, the reidentification task was based on the extraction of feature vectors, specific to each person or face, and their comparison in the multidimensional space. A 256-dimensional feature vector was extracted by a deep CNN for the face reidentification (Wu et al. 2015), whilst the popular histogram based features of 2784 dimensions (including colour and texture patterns obtained via a set of Gabor and Schmidt filterbank responses) were applied for the person reidentification, following the work in Zheng et al. (2016). The CNN features are preferred over handcrafted expert features, since in general they can better represent the input data and their characteristics (i.e. facial image) thanks to the high level of abstraction they consider.

For our purposes, a CNN was trained to extract the characteristic features that help to discern the facial images of two distinct people, thus maximizing their discriminability. The CNN was trained on the CASIA-WebFace database (Yi et al. 2014), consisting of images of more than 10,000 people with more than 15 images captured per person. For whole body reidentification, a distance measure was trained on a set of public datasets. Upon reidentification, the input image is compared with gallery images and the results are interpreted as a ranking list in which the first person of the ranking represents the gallery image with the most similarity. In order to output only one candidate, a distance threshold was adopted for face recognition and a binary distance metric trained through logistic regression for whole-body recognition.
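The ranking and single-candidate decision for faces can be sketched as below. The feature vectors are shortened to four dimensions for readability (the real ones have 256), and the threshold value, the gallery contents and the function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def reidentify(query, gallery, threshold):
    """Rank gallery identities by distance; accept the best one only below threshold."""
    ranking = sorted((cosine_distance(query, feat), pid)
                     for pid, feat in gallery.items())
    best_distance, best_id = ranking[0]
    match = best_id if best_distance <= threshold else None  # None: new person
    return match, ranking

# Toy gallery with made-up 4-dimensional features.
gallery = {"person_A": [1.0, 0.0, 0.0, 0.0], "person_B": [0.0, 1.0, 0.0, 0.0]}
match_known, _ = reidentify([0.9, 0.1, 0.0, 0.0], gallery, threshold=0.3)
match_new, _ = reidentify([0.0, 0.0, 1.0, 0.0], gallery, threshold=0.3)
```

A query that stays above the threshold for every gallery entry is treated as a previously unseen person, which corresponds to the "new person detection" case of the workflow.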


The evaluation of the multimedia processing technologies has been based on publicly available databases as well as on domain-adapted data gathered from the system tests and first pilot executions. In the following, the evaluation of the audio and video processes is described.

Audio Processing Evaluation

For the English language, the automatic speech recognition (ASR) and keyword-search (KWS) technologies were evaluated with a test partition of the CSTR VCTK corpus, which was composed of a total of 2 h and 48 min of synthetically generated noisy speech data. For the evaluation of the KWS system, 100 keywords related to the crime domain were manually selected from this corpus. The English capitalization and punctuation modules were evaluated over a test partition of the text compiled for the LM estimation (194 K words) and over contents from the TED corpus, which lasted 1 h and 31 min, respectively.

For Spanish, a test portion of the SAVAS corpus (1 h) was employed for the evaluation of the ASR, KWS and punctuation systems. Similar to the English language case, a set of 100 keywords close to the crime domain was selected for the KWS evaluation. The capitalization module was tested with a test partition of the text compiled for the Spanish LM estimation (159 K words).

The evaluations were performed using the Word Error Rate (WER) metric for the ASR, Actual Term-Weighted Value (ATWV) for the KWS, and the F1-Score measure for the capitalization and punctuation modules. The results for the punctuation module are presented separately for the period (full stop) and comma punctuation marks. Table 13.1 presents the performance for each system and language.
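For reference, the three metrics can be computed as in the following sketch. The WER is the word-level edit distance divided by the reference length, and the F1-Score is the harmonic mean of precision and recall. The ATWV follows the usual term-weighted value definition with a false-alarm weight of 999.9, which is assumed here because the chapter does not state its scoring settings. All inputs are toy values.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

def atwv(stats, speech_seconds, beta=999.9):
    """stats maps keyword -> (n_true, n_correct, n_false_alarm)."""
    costs = []
    for n_true, n_correct, n_false_alarm in stats.values():
        if n_true == 0:
            continue  # keywords that never occur are skipped
        p_miss = 1 - n_correct / n_true
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        costs.append(p_miss + beta * p_false_alarm)
    return 1 - sum(costs) / len(costs)

wer = word_error_rate("the suspect ran to the car", "the suspect run to car")
f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
twv = atwv({"knife": (10, 9, 0), "robbery": (5, 5, 1)}, speech_seconds=3600)
```

Note how a single false alarm for "robbery" costs more than a miss for "knife": the large weight on false alarms is what makes the ATWV a demanding measure.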
Table 13.1

Performance of the INSPEC2T audio systems using the WER, ATWV and F1-Score metrics for ASR, KWS and Capitalization/Punctuation, respectively

ASR (%) | KWS (%) | Capitalization (%) | Punctuation (%)

As can be seen in Table 13.1, the differences between the two languages are noticeable in terms of speech recognition and keyword-search detection. Even if the English corpus is more challenging given the different accents and acoustic environments involved, the ASR system performed very satisfactorily, with an error rate of only 4.87%. Besides, the KWS system based on the lattices generated by the recognition process reached an ATWV of 96.86%, which demonstrates that almost all the keywords were recovered from the audio. The performance of the Spanish ASR system was mainly influenced by the spontaneous speech included in the test data, which contains many street interviews recorded in adverse acoustic conditions. However, the Spanish KWS system still reached a value of 75.45% on the lattices generated during decoding, that is, roughly three in four keywords were recovered.

Regarding the Capitalization module, it was shown that the technological approach based on the SMT recasing model performed well for both languages. Finally, given that both Punctuation modules were evaluated with data containing spontaneous speech, the results reached through unidirectional LSTM RNN models can be considered interesting, even the ones obtained for the comma mark, usually influenced by the subjectivity of the person who employs it.

Image and Video Processing Evaluation

The face recognition task was tested and fine-tuned on the Labelled Faces in the Wild database (Huang et al. 2007), for four different distance measures: the Chebyshev (INF), Manhattan (L1), Euclidean (L2) and cosine (CS) norms. The results (Fig. 13.2) demonstrate that the cosine norm performed best. Consequently, the cosine norm with the threshold corresponding to the equal error rate was used.
Fig. 13.2

ROC curve of the face reidentification model, considering several distance measures and evaluated on the LFW dataset
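The four distance measures and the choice of a threshold at the equal error rate can be sketched as follows. The search over observed distances as candidate thresholds and the toy genuine and impostor distances are assumptions for illustration. They are not the procedure or the data used on LFW.

```python
import math

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def equal_error_threshold(genuine, impostor):
    """Return (threshold, FAR, FRR) where the two error rates are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(d <= t for d in impostor) / len(impostor)  # impostors accepted
        frr = sum(d > t for d in genuine) / len(genuine)     # genuine pairs rejected
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best

# Toy distances (made up) between same-person and different-person pairs.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
threshold, far, frr = equal_error_threshold(genuine, impostor)
```

Sweeping the threshold over all values and plotting the resulting rates for each distance function is what produces one ROC curve per measure.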

In practice, the reidentification model was trained on face images of resolution 120 × 120 pixels. In order to obtain such resolution on mid-range mobile phones (see footnote 1), the person would need to be at a maximum distance of about 6 m (considering an approximate face region height of 25 cm). Following internal experiments, solid performance was still obtained for lower resolutions, down to 80 × 80 pixels. Such resolution allows images of persons to be captured from up to 9 m. Nevertheless, at these distances the image blur, caused by motion or out-of-focus during the acquisition, and low contrast, due to lighting conditions (e.g. in a garage), present more challenges and limitations for correct face detection and reidentification.
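The quoted distances follow from the pinhole camera model when the reference phone parameters given in the footnote (4.0 mm focal length, 1.4 micron pixels) and the 25 cm face height are used. The sketch assumes a full-resolution image without digital zoom. The function names are hypothetical.

```python
def face_height_px(distance_m, focal_mm=4.0, pixel_um=1.4, face_m=0.25):
    """Height in pixels of a face of `face_m` metres seen from `distance_m`."""
    image_height_mm = focal_mm * face_m / distance_m  # pinhole projection
    return image_height_mm * 1000 / pixel_um          # mm -> micron -> pixels

def max_distance_m(min_face_px, focal_mm=4.0, pixel_um=1.4, face_m=0.25):
    """Largest distance at which the face still spans `min_face_px` pixels."""
    return focal_mm * 1000 * face_m / (pixel_um * min_face_px)

distance_120 = max_distance_m(120)  # about 5.95 m for 120 x 120 pixel faces
distance_80 = max_distance_m(80)    # about 8.93 m for 80 x 80 pixel faces
```

Both values agree with the limits of about 6 m and 9 m stated in the text.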

The detection task is the principal component that influences the correct run of the image and video analysis within the INSPEC2T platform. For face detection four approaches, based on Cascade Classifiers, SVM-HOG, Max Margin Object Detection (MMOD) and CNN MMOD (Davis 2009), were tested on the face detection benchmark FDDB (Jain and Learned-Miller 2010). The results are given in Table 13.2.
Table 13.2

Performance of the face detection algorithms on the FDDB database

Recall (%) | Recall 80 (%) | False positives | Average runtime

The column Recall 80 refers to evaluation limited to face regions bigger than 80 × 80 pixels. The average runtime is expressed as a multiple of the fastest method. Discontinuous measure of the FDDB was applied

a The runtime of the MMOD CNN method, when run on GPU, was 1.25

Considering that the face reidentification module was limited to face images with a resolution of at least 80 × 80 pixels, the corresponding face detection results, evaluated on faces bigger than 80 × 80 pixels, are listed in column Recall 80. The best performance was given by the CNN approach. Nevertheless, this is the most computationally expensive method and gives real-time performance only if GPU processing is applied. The traditional approaches, Cascade and SVM-HOG, as implemented in Itseez Inc (2015), show an unreasonably high false-positive rate. This problem can be attributed to the low precision of the underlying models. The best results, considering also the running time, were provided by the MMOD method as implemented within the public library (Davis 2009).
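The recall figures, including the variant restricted to larger faces, can be computed along the lines of the following sketch. It uses axis-aligned boxes and the discrete FDDB criterion, under which a detection counts only if its intersection over union with an annotation exceeds 0.5. FDDB itself annotates faces as ellipses, so the box format, the greedy matching and the toy data are simplifying assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def evaluate(ground_truth, detections, min_side=0, iou_thr=0.5):
    """Return (recall over faces larger than min_side, number of false positives)."""
    unmatched = list(detections)
    found = []
    for g in ground_truth:
        best = max(unmatched, key=lambda d: iou(g, d), default=None)
        hit = best is not None and iou(g, best) > iou_thr
        if hit:
            unmatched.remove(best)  # each detection may explain one face only
        found.append(hit)
    big = [hit for g, hit in zip(ground_truth, found)
           if min(g[2] - g[0], g[3] - g[1]) > min_side]
    recall = sum(big) / len(big) if big else 0.0
    return recall, len(unmatched)

# Toy image (made-up boxes): two large faces, one 40 x 40 face, two detections.
truth = [(0, 0, 100, 100), (200, 200, 240, 240), (300, 0, 400, 100)]
found_boxes = [(5, 5, 105, 105), (500, 500, 550, 550)]
recall_all, false_pos = evaluate(truth, found_boxes)
recall_80, _ = evaluate(truth, found_boxes, min_side=80)
```

Detections left unmatched after all annotations have been processed are counted as false positives, regardless of the size filter.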

From the use case perspective, few problems were experienced with the face detection task during the pilot execution: no false detections occurred, and missed detections were mostly caused by the limitations on the model size, that is, the minimum resolution of the captured face, and by the blur of the image.

The person recognition task was tested on the VIPeR dataset (Gray and Tao 2008; Fig. 13.3). Nevertheless, three more datasets, the CAVIAR4ReID (Cheng et al. 2011), 3DPes (Baltieri et al. 2011) and GRID (Loy 2017), have been used for training to evaluate the cross-domain influence.
Fig. 13.3

Ranking curves of the person reidentification task in comparison to the method proposed in Zheng et al. (2016) (left) and when trained on additional datasets (right)
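Ranking curves of this kind are commonly computed as cumulative match characteristic (CMC) curves: for each query, the rank of the correct gallery identity is recorded, and the curve gives the share of queries whose correct match appears within the first k positions. The sketch below assumes this standard definition. The function names and the ranks are toy values.

```python
def rank_of_true_match(distances, true_id):
    """1-based position of the correct identity when sorted by distance."""
    ordered = sorted(distances, key=distances.get)
    return ordered.index(true_id) + 1

def cmc_curve(ranks, max_rank):
    """Fraction of queries whose correct match is within the top k, k = 1..max_rank."""
    return [sum(r <= k for r in ranks) / len(ranks)
            for k in range(1, max_rank + 1)]

# One query against a three-person gallery (made-up distances).
example_rank = rank_of_true_match({"a": 0.4, "b": 0.1, "c": 0.9}, true_id="a")

# Ranks of the correct match for five hypothetical queries.
curve = cmc_curve([1, 3, 2, 1, 5], max_rank=5)
```

The first value of the curve is the rank-1 rate, that is, the ratio of best person matches discussed in the text.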

It can be observed that the ratios of the best person match are too low to be applied in practice. The results worsen for input data out of the training set, as would be the case for application within the INSPEC2T platform. The main problems stem from vast backgrounds and pose variations.


This chapter discusses multimedia analysis methods, applied within an ICT-based community policing platform, that help to automate the analysis of audio and video files sent as evidence by the users to the police authorities.

The systems for audio analysis within this specific domain face several difficulties related to (1) the variability in microphones and devices for audio capture, (2) adverse acoustic conditions, (3) the emotional state of the speakers and (4) the different accents, among others. In order to minimize their impact on the system’s performance, multi-environment noisy synthetic data have been generated from corpora composed of utterances of speakers with various accents, in addition to using technology built with the latest modelling paradigms based on deep learning algorithms. Although the developed systems have achieved competitive results with the current test corpora, future work will involve the acquisition of simulated recordings from pilots to evaluate the technology in nearly real environments.

The experience with the person reidentification functionalities within the first pilot executions (cf. Chap. 12) was positive, with little to no classification errors observed. Nevertheless, the real-world testing points to the need for users to be aware of the favourable conditions and guidelines for capturing a video or a photo. The following are of major importance for obtaining the best results on the person reidentification task: (1) the camera should be held still, (2) the camera should have good focus on the captured subject, (3) the person’s view with respect to the camera should be as frontal as possible, (4) the ambient illumination should be uniform and (5) the image resolution of the captured person should be sufficient, with the image preferably taken from within 9 m and with a camera resolution of at least 8 MP. Finally, the whole body appearance recognition system was not applied within the pilot tests due to its high false-positive ratio. In the future, additional image and pose normalization methods will be added in order to boost its performance and bring it closer to practical application.


  1. Google Nexus 5x was selected as a reference for the INSPEC2T project; it includes an 8 MP camera with 1.4 micron pixels and a 4.0 mm focal length.



This work has been supported by the EU project INSPEC2T under the H2020-FCT-2014 programme (GA 653749).


  1. Animetrics. (2018). Advanced 2D-to-3D algorithms for face recognition applications. Animetrics. Retrieved October, 2018, from
  2. Amped. (2018). Amped Five. Amped SRL. Retrieved October, 2018, from
  3. Baltieri, D., Vezzani, R., & Cucchiara, R. (2011). 3DPes: 3D People Dataset for Surveillance and Forensics. In Proceedings of the 1st International ACM Workshop on Multimedia access to 3D Human Objects, pp. 59–64.
  4. Bevilacqua, M., Roumy, A., Guillemot, C., & Marie-Line. A. M. (2013). Video super-resolution via sparse combinations of key-frame patches in a compression context. In: 30th Picture Coding Symposium (PCS).
  5. Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434–451.
  6. BOSCH. (n.d.). Video analytics at the edge. Bosch Sicherheitssysteme GmbH. Retrieved October, 2018, from
  7. Campbell, J. P., Shen, W., Campbell, W. M., et al. (2009). Forensic speaker recognition. IEEE Signal Processing Magazine, 26(2), 95.
  8. Can, D., & Saraclar, M. (2011). Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech, and Language Processing, 19(8), 2338–2347.
  9. Cheng, D. S., Cristani, M., Stoppa, M., Bazzani, L., & Murino, V. (2011). Custom pictorial structures for re-identification. In: British Machine Vision Conference (BMVC).
  10. CitizenCOP Foundation. (n.d.). CitizenCOP APP. CitizenCOP Foundation. Retrieved October, 2018, from
  11. Davis, E. K. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10, 1755–1758.
  12. del Pozo, A., Aliprandi, C., Álvarez, A., Mendes, C., Neto, J., Paulo, S., Piccinini, N., & Raffaelli, M. (2014). SAVAS: Collecting, annotating and sharing audiovisual language resources for automatic subtitling. In: Ninth international conference on language resources and evaluation (LREC).
  13. Eurostat. (2018). Individuals using the internet for participating in social networks, code: tin00127, Eurostat. Retrieved October, 2018, from
  14. Freesound Org. (n.d.). Freesound, Freesound Org. Retrieved October, 2018, from
  15. Garcia-Romero, D., & Espy-Wilson, C. (2010). Speech forensics: Automatic acquisition device identification. The Journal of the Acoustical Society of America, 127(3), 2044–2044.
  16. Gray, D., & Tao, H. (2008). Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: 10th European Conference on Computer Vision (ECCV).
  17. Heafield, K. (2011). KenLM: Faster and smaller language model queries. In: Sixth workshop on statistical machine translation. Association for Computational Linguistics.
  18. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts.
  19. Hunt, A. (1996). BEEP dictionary. Speech Applications Group, Sun Microsystems Laboratories. Retrieved October, 2018, from
  20. Ikram, S., & Malik, H. (2010). Digital audio forensics using background noise. In: IEEE International Conference on Multimedia and Expo (ICME).
  21. Itseez Inc. (2015). Open source computer vision library. Retrieved from
  22. Jain, V., & Learned-Miller, E. (2010). FDDB: A benchmark for face detection in unconstrained settings. Technical report UM-CS-2010-009, University of Massachusetts.
  23. Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. In: IEEE Conf. on Computer Vision and Pattern Recognition.
  24. Kazemi, V., & Sullivan, J. (2014). One Millisecond Face Alignment with an Ensemble of Regression Trees. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  25. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In: 45th annual meeting of the ACL on interactive poster and demonstration sessions, Association for Computational Linguistics.
  26. Koenig, B. E., & Lacey, D. S. (2015). Forensic authentication of digital audio and video files. In Handbook of digital forensics of multimedia data and devices (pp. 133–181).
  27. Loy, C. C. (2017). QMUL underGround re-IDentification (GRID) dataset. School of Computer Science and Engineering, Nanyang Technological University, Singapore. Retrieved October, 2018, from
  28. López Morràs, X. (2004). Transcriptor fonético automático del español. Retrieved October, 2018, from
  29. Maher, R. C. (2009). Audio forensic examination. IEEE Signal Processing Magazine, 26(2), 84–94.
  30. Malik, H. (2013). Acoustic environment identification and its applications to audio forensics. IEEE Transactions on Information Forensics and Security, 8(11), 1827–1837.
  31. Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language and Cognitive Processes, 27(7–8), 953–978.
  32. Miami-Dade County. (2018). Download the COP app. Miami-Dade County. Retrieved October, 2018, from
  33. Ministerio del Interior. (2018). AlertCops: Law Enforcement Agencies App, Ministerio del Interior Gobierno de España. Retrieved October, 2018, from
  34. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  35. Petroff, A. (2016). MasterCard launching selfie payments. Cable News Network. Retrieved October, 2018, from
  36. Povey, D., Ghoshal, A., Boulianne, G., et al. (2011). The Kaldi speech recognition toolkit. In: IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE Signal Processing Society.
  37. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013). 300 Faces in-the-wild challenge: The first facial landmark localization challenge. In: IEEE Intl Conf. on Computer Vision.
  38. Sargsyan, G., & Stoter, A. (2016). D3.4 2nd SAG Meeting Report. INSPEC2T consortium public deliverable.
  39. TED Conferences. (n.d.). TED Ideas worth spreading. TED Conferences. Retrieved October, 2018, from
  40. The Reno Police Department. (n.d.). myRPD App. The Reno police department. Retrieved October, 2018, from
  41. Tilk, O., & Alumäe, T. (2015). LSTM for punctuation restoration in speech transcripts. In: 16th Annual Conf. of the International Speech Communication Association (INTERSPEECH).
  42. Varol, G., & Salah, A. A. (2015). Efficient large-scale action recognition in videos using extreme learning machines. Expert Systems with Applications, 42(21), 8274.
  43. Veaux, C., Yamagishi, J., MacDonald, K., et al. (2017). CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR).
  44. WiredBlue. (n.d.). My Police Department App. WiredBlue. Retrieved October, 2018, from
  45. Wu, X., He, R., & Sun, Z. (2015). A lightened CNN for deep face representation. In: CoRR arXiv:1511.02683.
  46. Yi, D., Lei, Z., Liao, S., & Li, S. Z. (2014). Learning face representation from scratch. In: CoRR arXiv:1411.7923.
  47. Zheng, W. S., Gong, S., & Xiang, T. (2016). Towards open-world person re-identification by one-shot group-based verification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3), 591–606.
  48. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems 27.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Peter Leškovský (1), Email author
  • Santiago Prieto (1)
  • Aratz Puerto (1)
  • Jorge García (1)
  • Luis Unzueta (1)
  • Nerea Aranjuelo (1)
  • Haritz Arzelus (1)
  • Aitor Álvarez (1)

  1. Vicomtech, San Sebastian, Spain