MAchine Learning and Interactive Systems

Segmentation and indexing of sound signals
  by Stéphane Rossignol


Over the last few years, Stéphane Rossignol has been designing and
programming sound processing applications for the semi-automatic
and automatic indexing of sound signals.

This work deals mostly with the temporal segmentation and indexing
of musical sound signals. Three interdependent segmentation schemes
are defined, each corresponding to a different level of signal
attributes.

- 1) The first scheme, the so-called "source" scheme, distinguishes
between speech, music, various kinds of noise, etc. These sounds
come, for instance, from movie soundtracks and from radio
broadcasts.

"Features" are examined. They are intended to measure and highlight
distinct properties of speech signals, of musical signals, etc.
They are combined within the multidimensional classification
frameworks described in the literature. The performance obtained
for each combination of features with each classification system
is discussed.

- 2) The second segmentation scheme, the so-called "characteristics"
scheme, refers to labels such as: silence/sound, voiced/unvoiced,
harmonic/inharmonic, monophonic/polyphonic, with vibrato/without
vibrato, with tremolo/without tremolo, violin/piano/etc. Most of
these characteristics are themselves used as features when the
third segmentation scheme, which is described in detail below,
is performed.

Vibrato detection, estimation of the vibrato parameters (its
frequency and its magnitude), and vibrato extraction from the
fundamental frequency trajectory are studied in particular.
Several techniques are developed. The performance of the system
on real-world data is discussed.

The vibrato is extracted from the fundamental frequency
trajectory in order to obtain a "flat" melodic evolution. This
"flat" fundamental frequency trajectory can then be used for
the segmentation of musical excerpts into notes (third
segmentation scheme), and for the modification and/or further
processing of these sounds.

Vibrato detection is performed only if the source "music" has been
identified by the first segmentation scheme.
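
The vibrato processing described for this second scheme can be
illustrated with a minimal sketch. It is not the author's method: it
assumes a fundamental frequency trajectory sampled at a fixed frame
rate for a single sustained note, an assumed vibrato band of 4 to 8 Hz,
and a plain moving average as the smoothing technique. The names
`moving_average`, `analyse_vibrato`, `frame_rate` and `band` are
hypothetical.

```python
import numpy as np


def moving_average(x, width):
    """Moving average over `width` frames; edges are padded so len(output) == len(x)."""
    left = width // 2
    padded = np.pad(x, (left, width - 1 - left), mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")


def analyse_vibrato(f0, frame_rate, band=(4.0, 8.0)):
    """Return (flat_f0, vibrato_rate_hz, vibrato_magnitude_hz) for one sustained note.

    Assumption: the vibrato is roughly sinusoidal and lies inside `band`.
    """
    f0 = np.asarray(f0, dtype=float)

    # 1) Coarse detrending: remove the slow melodic evolution.
    coarse = max(1, int(round(frame_rate / band[0])))
    residual = f0 - moving_average(f0, coarse)

    # 2) Vibrato frequency: strongest spectral peak of the residual inside the band.
    spectrum = np.abs(np.fft.rfft(residual * np.hanning(len(residual))))
    freqs = np.fft.rfftfreq(len(residual), d=1.0 / frame_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    rate = float(freqs[in_band][np.argmax(spectrum[in_band])])

    # 3) "Flat" trajectory: averaging over one vibrato period cancels the vibrato.
    period = max(1, int(round(frame_rate / rate)))
    flat = moving_average(f0, period)

    # 4) Vibrato magnitude: amplitude of a sinusoid with the same standard
    #    deviation, measured away from the padded edges when the note is long enough.
    inner = slice(period, len(f0) - period) if len(f0) > 3 * period else slice(None)
    magnitude = float(np.sqrt(2.0) * np.std((f0 - flat)[inner]))
    return flat, rate, magnitude


# Synthetic example: a 440 Hz note with a 5 Hz vibrato of 3 Hz magnitude.
frame_rate = 100.0
t = np.arange(200) / frame_rate
f0 = 440.0 + 3.0 * np.sin(2.0 * np.pi * 5.0 * t)
flat, rate, magnitude = analyse_vibrato(f0, frame_rate)
```

A moving average is only one possible smoother. It is used here
because averaging over exactly one vibrato period cancels a sinusoidal
vibrato while leaving slower melodic changes largely intact.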

- 3) The third scheme leads to the segmentation into "notes or into
phones or more generally into steady state parts", according to
the nature of the sound: instrumental part, singing voice excerpt,
speech, percussive part, etc.

The analysis can be divided into four steps. This view is
simplified, but informative. First, a large set of features is
extracted. A feature is all the more appropriate as its time
evolution shows strong, short peaks when transitions occur, and
as its variance and its mean remain at very low levels within a
steady state part. Three kinds of transitions exist: fundamental
frequency transients, energy transients and frequency content
transients; each of them corresponds to one of the criteria used
to psychoacoustically differentiate sounds from each other.
Secondly, each of these features is automatically thresholded.
Thirdly, a final decision function, based on the set of
thresholded features, is derived. It provides the segmentation
marks. Within this framework, data fusion techniques are studied.
Lastly, for monophonic and harmonic sounds, automatic
transcription is performed. The performance of the system on
real-world samples is discussed.
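
The four steps of the third scheme can be sketched as follows. This
is an illustration under stated assumptions, not the system described
here: only two of the three transition types are covered (fundamental
frequency and energy; frequency content is omitted), the automatic
threshold is assumed to be "mean plus k standard deviations", the
decision function is assumed to be a simple vote over the thresholded
features, and notes are expressed as MIDI numbers. All function and
parameter names are hypothetical.

```python
import numpy as np


def transition_features(f0, energy):
    """Step 1: features that peak at transitions and stay near zero in steady states."""
    pitch = 12.0 * np.log2(f0)        # pitch in semitones
    level = 10.0 * np.log10(energy)   # energy in dB
    return np.vstack([
        np.abs(np.diff(pitch, prepend=pitch[0])),   # fundamental frequency transients
        np.abs(np.diff(level, prepend=level[0])),   # energy transients
    ])


def auto_threshold(features, k=3.0):
    """Step 2: per-feature threshold (assumed rule: mean + k * standard deviation)."""
    thresholds = (features.mean(axis=1, keepdims=True)
                  + k * features.std(axis=1, keepdims=True))
    return features > thresholds


def segmentation_marks(fired, min_votes=1):
    """Step 3: decision function fusing the thresholded features by voting.

    A frame is a transition if at least `min_votes` features fire; a mark is
    placed on the first frame of each run of transition frames.
    """
    transition = fired.sum(axis=0) >= min_votes
    onsets = transition & ~np.concatenate(([False], transition[:-1]))
    return np.flatnonzero(onsets)


def transcribe(f0, marks):
    """Step 4: one MIDI note number per segment (monophonic, harmonic sounds only)."""
    bounds = [0, *marks, len(f0)]
    return [int(np.round(69 + 12 * np.log2(np.median(f0[a:b]) / 440.0)))
            for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


# Synthetic example: A4 followed by B4, with a short energy dip at the transition.
f0 = np.concatenate([np.full(50, 440.0), np.full(50, 493.88)])
energy = np.ones(100)
energy[50] = 0.5
fired = auto_threshold(transition_features(f0, energy))
marks = segmentation_marks(fired)
notes = transcribe(f0, marks)
```

Raising `min_votes` requires agreement between several features
before a mark is emitted, which trades missed transitions against
false alarms. On real recordings the input trajectory would typically
be the "flat" fundamental frequency obtained after vibrato extraction.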

The data obtained in a given scheme are propagated from lower
numbered to higher numbered schemes in order to improve their

The length of the segments provided by the "source" segmentation
scheme can be of the order of a few minutes. The segments provided
by the "characteristics" segmentation scheme are commonly shorter,
of the order of a few tens of seconds. The segments provided by
the "steady state parts" segmentation scheme are most often
shorter than a second.

The developed software is unified and maintained as new techniques
are introduced. In particular, it is organized into five sets of
programs:

- 1) "Segmentation", with the main goal of performing the
segmentation into notes (music) or into phones (music and speech).

Notably, new "features" are continuously developed and evaluated.
For instance, features based on kernel methods, such as SVMs,
were studied recently (2007).

- 2) "Sources", with the goal of performing the source segmentation.

- 3) Various "pitch-trackers/partial-trackers".

- 4) The "characteristics" processing, and particularly the vibrato

- 5) A "user interface", ideally multimodal, so as to be as
ergonomic as possible and thus make the manual indexing process
as flexible and fast as possible. The main goal of this user
interface is to allow a semi-automatic indexing process, because
fully automatic segmentation/indexing is not yet achievable. The
aim is to help build large databases of indexed musical sound
signals. Such databases exist for speech, but they are much less
numerous for musical sound signals.