John R. Hershey

Ph.D. in Cognitive Science from UCSD
A founding member of the Machine Perception Laboratory
in the Institute for Neural Computation
Currently Research Staff Member at IBM T. J. Watson Research Center

john (at) johnhershey (dot) com
Telephone: (914) 945-1814





Bio

I completed my Ph.D. in the Department of Cognitive Science at the University of California, San Diego, where I was a founding member of the Machine Perception Laboratory (MPLab). My thesis, in the field of machine perception, explores the use of generative graphical models for speech enhancement, face tracking, and combinations of the two. During my time at UCSD, I interned extensively in the Machine Learning and Applied Statistics Group at Microsoft Research in Seattle, and at Mitsubishi Electric Research Lab in Boston. In 2004, I spent a year as a visiting researcher in the Speech Group at Microsoft Research. Since 2005 I have been at IBM T. J. Watson Research Center in New York, where I am a research staff member in the Pervasive Speech Technology group.


Papers

Tim K. Marks, John Hershey, J. Cooper Roddey, Javier R. Movellan
Joint Tracking of Pose, Expression, and Texture (in press)
in Advances in Neural Information Processing Systems 17, 2005

John Hershey, Trausti Kristjansson, Zhengyou Zhang
Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition,
ISCA Workshop on Statistical and Perceptual Audio Processing 2004

Javier Movellan, John Hershey, Tim Marks, and J. Cooper Roddey,
3D Tracking of Morphable Objects Using Conditionally Gaussian Nonlinear Filters,
CVPR Workshop on Generative Models for Vision 2004

Trausti Kristjansson, Hagai Attias, John Hershey,
Stereo Based 3D Tracking and Scene Learning, employing Particle Filtering within EM,
European Conference on Computer Vision (ECCV) 2004

Trausti Kristjansson, John Hershey, Hagai Attias,
Single Microphone Source Separation using High Resolution Signal Reconstruction,
IEEE International Conference on Acoustics, Speech and Signal Processing, 2004

Javier Movellan, Josh Susskind, John Hershey,
Large-Scale Convolutional HMMs for Real-Time Video Tracking,
Computer Vision and Pattern Recognition (CVPR) 2004

John Hershey, Hagai Attias, Nebojsa Jojic, Trausti Kristjansson,
Audio-Visual Graphical Models for Speech Processing,
IEEE International Conference on Acoustics, Speech and Signal Processing, 2004

Trausti Kristjansson, John Hershey,
High Resolution Signal Reconstruction,
Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003

John Hershey and Mike Casey,
Audio-Visual Sound Separation Via Hidden Markov Models,
in Advances in Neural Information Processing Systems 14, 2002

John Hershey and Javier R. Movellan,
Audio Vision: Using Audio-Visual Synchrony to Locate Sounds,
in Advances in Neural Information Processing Systems 12, 2000



Demos


3D Face Tracking:
Here we are tracking three-dimensional face model parameters. This project stems from work I did with Matt Brand on his "flexible flow" algorithm at MERL. The G-flow model unifies optic flow and template tracking using a Rao-Blackwellized particle filter combined with an extended Kalman filter that supplies a proposal distribution for sampling. Some of us like to refer to the combination as a "smarticle filter." (Joint work with Javier Movellan, Tim Marks and J. Cooper Roddey, 2003)
Demo: red dots are superimposed by the algorithm
(MPEG-1)     (MPEG-4)

Paper describing how the system works. (Pdf File)

Single Mic Speech Separation:
We used a factorial mixture model to perform single mic speech separation for our upcoming ICASSP paper. Check out the demo!
spectrogram of mixture of one and two
A related speech-denoising paper with Trausti Kristjansson at MSR was recently accepted to ASRU 2003. The demo is here.

In earlier work I separated speech sounds from a monaural mixture using a factorial excitation-filter model of speech. (Collaboration with Mike Casey at MERL, 2001)
Demo: (html) (ppt)
The paper, which also explores using video lip-reading, is here.

Single Mic Sound Localization:
In some early work I used pixel-level audio-visual synchrony to locate a sound. (with Javier Movellan, 1999)
Demo: (mpeg movie)

The paper is here
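
For readers curious what "pixel-level audio-visual synchrony" can look like in practice, here is a minimal sketch, not the method from the paper. It assumes grayscale video frames and one audio energy value per frame, and it scores each pixel by the Pearson correlation between its intensity over time and the audio signal. The function names `synchrony_map` and `locate_sound`, the array shapes, and the choice of Pearson correlation are all assumptions made for illustration; the published work may use a different synchrony statistic.

```python
import numpy as np

def synchrony_map(video, audio):
    """Per-pixel Pearson correlation between pixel intensity and audio energy.

    video: array of shape (T, H, W), one grayscale frame per time step (assumed).
    audio: array of shape (T,), one audio energy value per frame (assumed).
    Returns an (H, W) array of correlations in [-1, 1].
    """
    video = np.asarray(video, dtype=float)
    audio = np.asarray(audio, dtype=float)
    v = video - video.mean(axis=0)          # center each pixel over time
    a = audio - audio.mean()                # center the audio signal
    cov = np.tensordot(a, v, axes=(0, 0))   # (H, W): sum over time of a[t] * v[t]
    denom = np.sqrt((a ** 2).sum() * (v ** 2).sum(axis=0))
    # Pixels (or audio) with zero variance get a correlation of zero.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

def locate_sound(video, audio):
    """Return the (row, col) of the pixel most synchronized with the audio."""
    sync = np.abs(synchrony_map(video, audio))
    return np.unravel_index(np.argmax(sync), sync.shape)

# Synthetic example: one hypothetical "sounding" pixel follows the audio.
rng = np.random.default_rng(0)
T, H, W = 200, 8, 8
audio = rng.random(T)
video = rng.random((T, H, W))
video[:, 5, 2] = 3.0 * audio + 0.1 * rng.random(T)
print(locate_sound(video, audio))
```

On this synthetic clip the pixel at row 5, column 2 is built to track the audio, so it should receive the highest synchrony score; real video would need more care with frame alignment and noise.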


Patents

Hershey, J., Zhang, Z.,
"IRGB Camera: Multispectral Near-Infrared Red Green Blue CCD/CMOS for Machine Vision" (US Patent Pending)

Hershey, J., Kristjansson, T., Attias, H.,
"Method and Apparatus for High-Resolution Speech Reconstruction" (US Patent Pending)

Kristjansson, T., Attias, H., Hershey, J.,
"Method and Apparatus for Scene Learning and Three-Dimensional Tracking Using Stereo Video Cameras" (US Patent Pending)

Hershey, J., Kristjansson, T., Attias, H., Jojic, N.,
"Speech Detection And Enhancement Using Audio/Video Fusion" (US Patent Pending)