Robust audio-visual speech recognition based on late integration

Jong Seok Lee, Cheol Hoon Park

Research output: Contribution to journal › Article

33 Citations (Scopus)

Abstract

Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention because of its robustness in noisy environments. In this paper, we present a late integration scheme-based AVSR system whose robustness under various noise conditions is improved by enhancing the performance of the three parts composing the system. First, we improve the performance of the visual subsystem by using the stochastic optimization method for the hidden Markov models as the speech recognizer. Second, we propose a new method of considering dynamic characteristics of speech for improved robustness of the acoustic subsystem. Third, the acoustic and the visual subsystems are effectively integrated to produce final robust recognition results by using neural networks. We demonstrate the performance of the proposed methods via speaker-independent isolated word recognition experiments. The results show that the proposed system improves robustness over the conventional system under various noise conditions without a priori knowledge about the noise contained in the speech.
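
To illustrate what late integration means in general terms, the sketch below combines per-word scores from two independent recognizers with a single weight. It is a minimal example under stated assumptions, not the authors' method: the abstract says the integration is done with neural networks, whereas here the weight `gamma` is supplied by hand. The function name, the weighted-sum form, and the toy scores are all hypothetical.

```python
def late_integration(acoustic_loglik, visual_loglik, gamma):
    """Fuse per-word log-likelihoods from two separately run recognizers.

    gamma (hypothetical parameter) is the weight on the acoustic stream,
    in [0, 1]; the visual stream receives 1 - gamma.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if acoustic_loglik.keys() != visual_loglik.keys():
        raise ValueError("both recognizers must score the same vocabulary")

    # Weighted sum of the two modality scores for each candidate word.
    combined = {
        word: gamma * acoustic_loglik[word] + (1.0 - gamma) * visual_loglik[word]
        for word in acoustic_loglik
    }
    # The recognized word is the one with the highest combined score.
    best = max(combined, key=combined.get)
    return best, combined


# Toy scores, made up for illustration only.
acoustic = {"yes": -12.0, "no": -10.0}
visual = {"yes": -8.0, "no": -15.0}

# Trusting the acoustic stream (e.g. clean audio) follows its preference.
clean_choice, _ = late_integration(acoustic, visual, gamma=0.9)
# Trusting the visual stream (e.g. noisy audio) follows the visual preference.
noisy_choice, _ = late_integration(acoustic, visual, gamma=0.2)
```

In this toy case the decision flips from "no" to "yes" as the weight moves away from the acoustic stream, which is why choosing the weight well matters when the noise level is not known in advance.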

Original language: English
Article number: 4540195
Pages (from-to): 767-779
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 10
Issue number: 5
DOI: 10.1109/TMM.2008.922789
Publication status: Published - 2008 Aug 1

Fingerprint

Speech recognition
Acoustics
Hidden Markov models
Acoustic noise
Neural networks
Experiments

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

Cite this

@article{dcfe91885dd8428e94a7956083ce1754,
title = "Robust audio-visual speech recognition based on late integration",
abstract = "Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention because of its robustness in noisy environments. In this paper, we present a late integration scheme-based AVSR system whose robustness under various noise conditions is improved by enhancing the performance of the three parts composing the system. First, we improve the performance of the visual subsystem by using the stochastic optimization method for the hidden Markov models as the speech recognizer. Second, we propose a new method of considering dynamic characteristics of speech for improved robustness of the acoustic subsystem. Third, the acoustic and the visual subsystems are effectively integrated to produce final robust recognition results by using neural networks. We demonstrate the performance of the proposed methods via speaker-independent isolated word recognition experiments. The results show that the proposed system improves robustness over the conventional system under various noise conditions without a priori knowledge about the noise contained in the speech.",
author = "Lee, {Jong Seok} and Park, {Cheol Hoon}",
year = "2008",
month = "8",
day = "1",
doi = "10.1109/TMM.2008.922789",
language = "English",
volume = "10",
pages = "767--779",
journal = "IEEE Transactions on Multimedia",
issn = "1520-9210",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "5",

}

Robust audio-visual speech recognition based on late integration. / Lee, Jong Seok; Park, Cheol Hoon.

In: IEEE Transactions on Multimedia, Vol. 10, No. 5, 4540195, 01.08.2008, p. 767-779.


TY - JOUR

T1 - Robust audio-visual speech recognition based on late integration

AU - Lee, Jong Seok

AU - Park, Cheol Hoon

PY - 2008/8/1

Y1 - 2008/8/1

N2 - Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention because of its robustness in noisy environments. In this paper, we present a late integration scheme-based AVSR system whose robustness under various noise conditions is improved by enhancing the performance of the three parts composing the system. First, we improve the performance of the visual subsystem by using the stochastic optimization method for the hidden Markov models as the speech recognizer. Second, we propose a new method of considering dynamic characteristics of speech for improved robustness of the acoustic subsystem. Third, the acoustic and the visual subsystems are effectively integrated to produce final robust recognition results by using neural networks. We demonstrate the performance of the proposed methods via speaker-independent isolated word recognition experiments. The results show that the proposed system improves robustness over the conventional system under various noise conditions without a priori knowledge about the noise contained in the speech.

UR - http://www.scopus.com/inward/record.url?scp=47649103796&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=47649103796&partnerID=8YFLogxK

U2 - 10.1109/TMM.2008.922789

DO - 10.1109/TMM.2008.922789

M3 - Article

AN - SCOPUS:47649103796

VL - 10

SP - 767

EP - 779

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

SN - 1520-9210

IS - 5

M1 - 4540195

ER -