Towards end-to-end synthetic speech detection

Guang Hua, Andrew Beng Jin Teoh, Haijian Zhang

Research output: Contribution to journal, Article, peer-review

19 Citations (Scopus)


The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms to facilitate synthetic speech detection, followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or a deep neural network (DNN) directly for further feature extraction and classification. Despite the rich literature on such a pipeline, we show in this paper that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that by only using standard components, a light-weight neural network could outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model is termed Time-domain Synthetic Speech Detection Net (TSSDNet), having ResNet- or Inception-style structures. We further demonstrate that the proposed models also have attractive generalization capability. Trained on ASVspoof2019, they could achieve promising detection performance when tested on disjoint ASVspoof2015, significantly better than the existing cross-dataset results. This paper reveals the great potential of end-to-end DNNs for synthetic speech detection, without hand-crafted features.
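The end-to-end idea in the abstract can be illustrated with a toy sketch: a raw waveform passes through a few ResNet-style 1-D convolutional blocks, is pooled over time, and is mapped to a single score, with no CQT or cepstral front end. This is not the authors' TSSDNet architecture. The function names, the three-block depth, the kernel length of 3, the pooling size, and the random untrained weights are all assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random


def conv1d(x, kernel):
    """Zero-padded 'same' 1-D convolution (cross-correlation); odd kernel length assumed."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(x))]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, k1, k2):
    """ResNet-style block: two convolutions plus an identity skip connection."""
    h = relu(conv1d(x, k1))
    h = conv1d(h, k2)
    return relu([a + b for a, b in zip(h, x)])


def max_pool(x, size):
    """Non-overlapping max pooling along the time axis."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def detect(waveform, params):
    """Map a raw waveform directly to a score in (0, 1); hypothetical toy model."""
    h = waveform
    for k1, k2 in params["blocks"]:
        h = residual_block(h, k1, k2)
        h = max_pool(h, 2)
    pooled = max(h)  # global max pooling over the remaining time steps
    logit = params["w"] * pooled + params["b"]
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative, untrained parameters (assumed values, not from the paper).
random.seed(0)
params = {
    "blocks": [
        ([random.uniform(-0.5, 0.5) for _ in range(3)],
         [random.uniform(-0.5, 0.5) for _ in range(3)])
        for _ in range(3)
    ],
    "w": 1.0,
    "b": 0.0,
}
waveform = [math.sin(0.3 * n) for n in range(64)]  # stand-in for audio samples
p = detect(waveform, params)
print(f"Score with untrained weights: {p:.3f}")
```

In a real system the kernels and the final linear layer would be learned from labelled bona fide and spoofed utterances, and the model would be far wider and deeper than this single-channel toy.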

Original language: English
Article number: 9456037
Pages (from-to): 1265-1269
Number of pages: 5
Journal: IEEE Signal Processing Letters
Publication status: Published - 2021

Bibliographical note

Funding Information:
Manuscript received April 12, 2021; revised June 3, 2021; accepted June 10, 2021. Date of publication June 15, 2021; date of current version June 28, 2021. This work was supported in part by the 2020–2021 International Scholar Exchange Fellowship (ISEF) Program at the Chey Institute for Advanced Studies, South Korea, and in part by the National Natural Science Foundation of China under Grant 61802284. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yu Tsao. (Corresponding author: Andrew Beng Jin Teoh.) Guang Hua and Haijian Zhang are with the School of Electronic Information, Wuhan University, Wuhan 430072, China.

Publisher Copyright:
© 2021 IEEE.

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Applied Mathematics
  • Electrical and Electronic Engineering

