3D visualisation of ultrasound tongue imaging

Conventional ultrasound tongue imaging is powerful, but the speckled greyscale image can sometimes be difficult to interpret, even for experienced users. LinguaSound 3D uses deep learning to estimate tongue surface contours from live ultrasound images and animates a rotatable 3D model of the tongue, hard palate and teeth in real time. The result is a view of tongue position and movement that is easier to interpret for clinicians, researchers and the speakers themselves.

Who uses LinguaSound 3D

LinguaSound 3D can be used by speech professionals, phonetics researchers, pronunciation instructors and human-computer interaction (HCI) engineers.

Clinical


LinguaSound 3D provides speech professionals with a clearer view of tongue position and movement during speech. The 3D model makes it easier to observe articulation patterns and provides real-time visual biofeedback to patients across a range of presentations, including speech sound disorders, apraxia of speech and cleft palate. Synchronised audio recording allows tongue movement to be matched to the corresponding speech sounds, aiding interpretation. Hard palate estimation, which is not possible with conventional ultrasound imaging alone, provides an additional anatomical reference point.

Research


LinguaSound 3D can be used by phoneticians, linguists and speech scientists investigating articulatory phonetics, language-specific tongue gestures and speech production across populations. The software is language independent, making it suitable for research across any language or dialect. Synchronised audio and ultrasound recording supports detailed acoustic and articulatory analysis. Hard palate estimation provides an additional articulatory reference point not available with conventional ultrasound tongue imaging. Real-time tongue surface contour coordinates can also be exported for use in custom research and multimodal data collection.

Pronunciation teaching


LinguaSound 3D gives language instructors a precise view of tongue position during speech. Unlike conventional ultrasound tongue imaging, the 3D model requires no specialist knowledge to interpret, making it accessible to instructors and learners alike. It can be used to demonstrate correct tongue placement in real time and provide visual biofeedback during pronunciation practice. Video recordings of tongue movement can also be screen captured directly from the software for use as teaching materials.

Engineering and HCI


LinguaSound 3D outputs real-time tongue surface contour coordinates and confidence scores in CSV format at up to 100 frames per second, together with the individual ultrasound video frames. This makes it a practical tool for researchers developing assistive technology, silent speech interfaces, and human-computer interaction systems that respond to tongue position and movement.

How LinguaSound 3D works

Ultrasound tongue imaging (UTI) is a non-invasive technique used to view the shape, position and movements of the tongue during speech using high-frequency sound waves. A layer of acoustic gel is applied to an ultrasound probe placed beneath the chin. The probe emits sound waves that travel through the tissues and are reflected back when they encounter different structures within the tongue. The ultrasound scanner analyses these reflections to generate dynamic images of tongue shape and motion, which are transferred to the host computer via USB.

How enhanced ultrasound tongue imaging works

LinguaSound 3D uses a deep learning model to estimate tongue contours from the ultrasound images. The model was trained with the DeepLabCut framework on thousands of hand-labelled midsagittal ultrasound images of both children and adults. The estimated contours then drive an animated 3D model of the tongue, hard palate and teeth in real time. The 3D model is a representation of tongue position and movement derived from the estimated contour, not a direct image of the speaker's tongue.

Hard palate estimation

Because the air above the tongue surface reflects sound waves back to the probe, conventional ultrasound tongue imaging cannot directly capture the hard palate. LinguaSound 3D estimates its position and size by detecting the maximum extent of the tongue contour during contact with the palate. Hard palate estimation is performed as a brief setup sequence at the start of each session and can be repeated to refine the position as required. For best results, the speaker should produce sounds that bring the tongue into contact with the alveolar, palatal and velar regions. This procedure is illustrated in the following video:

Comparing conventional ultrasound tongue imaging with LinguaSound 3D

LinguaSound 3D makes conventional ultrasound tongue imaging easier to interpret. A comparison of both imaging techniques is illustrated below for a typical English speaker:

[Image comparison: conventional ultrasound tongue imaging alongside LinguaSound 3D for each of the following sounds]

  • Velar stop /k/
  • Alveolar stop /d/
  • Retroflex /r/
  • High-front vowel /i/
  • Low-back vowel /a/

Compatible hardware

LinguaSound 3D is currently compatible with the Telemed MicrUs EXT-1H scanner and MC4-2R20S-3 20mm convex probe. The scanner connects to any compatible Windows laptop or desktop via USB. Compact, lightweight and fanless, it is suitable for use in clinical, classroom and laboratory settings. Contact us to discuss compatibility with your existing equipment or for purchasing guidance.

Telemed MicrUs EXT-1H PC-based ultrasound scanner

LinguaSound 3D software

LinguaSound 3D is a Windows 64-bit application that combines real-time ultrasound imaging, tongue contour estimation and 3D visualisation in a single user interface.

  • Real-time 3D tongue, hard palate and teeth visualisation
  • Real-time tongue surface contour estimation
  • Hard palate estimation
  • 360° photorealistic and stylised views
  • Adjustable model transparency
  • Synchronised audio recording and playback
  • Record at up to 100 FPS*
  • Preview, record and playback modes
  • Real-time export of tongue contour coordinates and confidence scores to CSV
  • Export individual ultrasound video frames as JPG
  • Adjustable target for biofeedback guidance
  • Configurable ultrasound scanner and probe settings

* Test system: Windows 11, Intel® Core™ i7 13700HX CPU, NVIDIA® GeForce RTX™ 4060 (8 GB) GPU, 16 GB RAM

System requirements

Parameter             Value
Operating system      Windows 10 and 11 (64-bit)
Supported computers   Desktop and laptop
Mac support           Not currently supported
Processor             Intel Core i7 or i9
Graphics              NVIDIA RTX series GPU (8GB)
Memory                16GB RAM
Connectivity          USB 2.0 or 3.0
Ultrasound scanner    Telemed MicrUs EXT-1H
Ultrasound probe      Telemed MC4-2R20S-3 (20mm convex)

Package contents

LinguaSound 3D is supplied as a perpetual software licence and includes:

  • Secure download link to the latest software version
  • Software registration key
  • Priority technical support
  • Free software updates
  • 30-day money back guarantee

Note: ultrasound scanner and probe sold separately.

Frequently asked questions

LinguaSound 3D is currently compatible with the Telemed MicrUs EXT-1H scanner and MC4-2R20S-3 20mm convex probe. Contact us to discuss compatibility with your existing equipment or for purchasing guidance.
LinguaSound 3D currently requires the Telemed MicrUs EXT-1H scanner. If you have an existing ultrasound setup, contact us to discuss your requirements, as we are actively working to extend compatibility to additional scanners.
LinguaSound 3D does not currently support the import of video files recorded from other ultrasound systems. Contact us to discuss your requirements, as we are actively working to extend compatibility.
LinguaSound 3D is only available for Windows 10 and 11 (64-bit). Mac support is not currently available.
LinguaSound 3D requires an NVIDIA RTX series GPU with 8GB of video memory for real-time performance. Without a compatible GPU, frame rates of 10–20 FPS can still be achieved, which may be sufficient for sustained phonemes and slower articulatory movements.
LinguaSound 3D exports two types of data.

Individual ultrasound images are exported in JPG format. The image size is dependent on the resolution set in the ultrasound settings.

Tongue surface contour data is exported to a CSV file. Eleven labels describe the shape of the tongue surface, and additional labels mark anatomical landmarks including the hyoid bone, mandible base and mental spine. The XY coordinates of each label are exported in pixels relative to the upper-left corner of the associated image. A confidence score between 0 and 1 is also exported for each label, representing the likelihood that the estimated position of that label is correct.

Data export is supported in Preview, Record and Playback modes.
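As an illustration of working with the exported contour data, the following Python sketch parses a small CSV sample. The column layout (a frame column followed by <label>_x, <label>_y and <label>_conf columns), the label names and the image height are all assumptions made for the example; check the header row of a real export before adapting it.

```python
import csv
import io

# Hypothetical export layout (an assumption, not the documented format):
# one row per frame, with <label>_x, <label>_y and <label>_conf columns.
SAMPLE_CSV = """frame,tip_x,tip_y,tip_conf,hyoid_x,hyoid_y,hyoid_conf
0,210.0,95.0,0.98,60.0,180.0,0.91
1,212.5,93.0,0.42,61.0,181.5,0.89
"""

IMAGE_HEIGHT = 240  # assumed image height in pixels


def read_contours(text):
    """Parse CSV text into {frame: {label: (x, y, confidence)}}."""
    reader = csv.DictReader(io.StringIO(text))
    # Recover the label names from the header, preserving column order.
    labels = [name[: -len("_conf")] for name in reader.fieldnames
              if name.endswith("_conf")]
    frames = {}
    for row in reader:
        frames[int(row["frame"])] = {
            label: (float(row[label + "_x"]),
                    float(row[label + "_y"]),
                    float(row[label + "_conf"]))
            for label in labels
        }
    return frames


def flip_y(y, image_height=IMAGE_HEIGHT):
    """Convert a y value measured down from the top edge of the image
    into one measured up from the bottom edge."""
    return image_height - y


contours = read_contours(SAMPLE_CSV)
tip_x, tip_y, tip_conf = contours[0]["tip"]
print(tip_x, flip_y(tip_y), tip_conf)
```

Because coordinates are measured from the upper-left corner of the image, a smaller y value means a higher tongue position. The flip_y helper is only needed when plotting with a conventional upward-pointing y axis.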
LinguaSound 3D can achieve frame rates of up to 100 FPS with a compatible NVIDIA RTX series GPU. Frame rate is dependent on scan depth and system specification. Without a compatible GPU, frame rates of 10–20 FPS can still be achieved, which may be sufficient for sustained phonemes and slower articulatory movements.
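To see why lower frame rates suit sustained sounds better than rapid movements, the short Python sketch below counts how many frames fall within an articulatory event of a given duration. The 50 ms and 500 ms durations are illustrative assumptions, not measurements.

```python
def frames_per_event(fps, event_ms):
    """Number of frames captured during an event lasting event_ms milliseconds."""
    return fps * event_ms / 1000.0


# Illustrative durations only: a brief movement (50 ms) and a sustained sound (500 ms).
for fps in (10, 20, 100):
    print(fps, frames_per_event(fps, 50), frames_per_event(fps, 500))
```

At 10 FPS a 50 ms event may fall entirely between two frames, whereas a 500 ms sustained sound still spans five frames. At 100 FPS the same 50 ms event spans five frames.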
LinguaSound 3D has no language-specific processing. The deep learning model estimates tongue surface contours directly from ultrasound images, making it suitable for use with speakers of any language or dialect.
LinguaSound 3D provides a confidence score for each contour estimation. Frames where the confidence score falls below a user-defined threshold are automatically ignored, ensuring that only reliable contour estimations are used to animate the 3D model. The rejection threshold can be adjusted in the software settings to suit the quality of the ultrasound image. If contour estimation accuracy is consistently poor, we offer a service to add additional hand-labelled frames to the model to improve performance. Contact us for details.
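The same kind of rejection can be applied to exported data. The Python sketch below is a minimal illustration, not the software's internal logic: it assumes a frame is rejected when its lowest label confidence falls below the threshold, and that a rejected frame is replaced by the last accepted contour. Both rules are assumptions made for the example.

```python
def gate_frames(frames, threshold=0.6):
    """Yield the contour to use for each incoming frame.

    frames: iterable of (points, confidences) pairs.
    Assumed rule: accept a frame only if its lowest label confidence
    reaches the threshold; otherwise reuse the last accepted contour.
    """
    last_good = None
    for points, confidences in frames:
        if confidences and min(confidences) >= threshold:
            last_good = points
        yield last_good


stream = [
    ([(10, 20), (30, 25)], [0.9, 0.8]),
    ([(11, 21), (31, 26)], [0.9, 0.3]),   # rejected: one label below threshold
    ([(12, 22), (32, 27)], [0.7, 0.95]),
]
shown = list(gate_frames(stream, threshold=0.6))
```

Raising the threshold discards more frames and gives a steadier but less responsive result. Lowering it keeps more frames at the cost of occasional unreliable contours.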
Yes. LinguaSound 3D uses the DeepLabCut framework, which allows advanced users to train tongue contour estimation models using their own hand-labelled ultrasound images. Custom models can then be loaded directly into LinguaSound 3D. We recommend contacting us before undertaking this process — we can provide guidance on labelling conventions, model training and integration.
Conventional ultrasound tongue imaging produces a greyscale image of the tongue that can be difficult to interpret, even for experienced users. LinguaSound 3D uses deep learning to estimate tongue surface contours from live ultrasound images and animates a rotatable 3D model of the tongue, hard palate and teeth in real time. The result is a view of tongue position and movement that is easier to interpret for clinicians, researchers and speakers alike.
Yes. LinguaSound 3D can be used in academic research and cited in publications. The recommended citation is: LinguaSound 3D, icSpeech, a division of Rose Medical Solutions Ltd., Canterbury, UK.

The software uses the DeepLabCut framework for tongue contour estimation. If you use DeepLabCut as part of your research pipeline, please also cite: Mathis et al. (2018). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21, 1281–1289.

The tongue contour estimation approach is based on: Wrench, A., and Balch-Tomes, J. (2022). Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut. Sensors, 22, 1133.
Yes. icSpeech supplies universities, hospitals, and research institutions globally. We provide formal quotations suitable for institutional procurement processes and grant applications. Please use the Get Quote form and indicate your institution type. We will respond promptly with appropriate pricing and documentation.
Yes. LinguaSound 3D is available worldwide as a software download. Please use the Get Quote form to request pricing and purchasing information for your country.
Because the air above the tongue surface reflects sound waves back to the probe, conventional ultrasound tongue imaging cannot directly capture the hard palate. LinguaSound 3D detects when the tongue contour reaches its maximum extent during contact with the hard palate and uses this position to estimate the location and size of the palate. For best results, the speaker should produce sounds that bring the tongue into contact with the alveolar, palatal and velar regions of the hard palate. This provides an additional anatomical reference point not available with conventional ultrasound tongue imaging.
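One simple way to picture the idea of a maximum extent is as the upper envelope of all tongue contours seen during the setup sequence. The Python sketch below is an assumed illustration of that idea, not the method used by LinguaSound 3D: it groups contour points into horizontal bins and keeps the highest point in each bin. The bin width and sample coordinates are hypothetical.

```python
def palate_envelope(contours, bin_width=10.0):
    """Return [(bin_centre_x, highest_y)] sorted by x.

    contours: list of contours, each a list of (x, y) points in image
    coordinates, where y is measured down from the top of the image,
    so the highest point is the one with the smallest y.
    """
    highest = {}
    for contour in contours:
        for x, y in contour:
            b = int(x // bin_width)
            if b not in highest or y < highest[b]:
                highest[b] = y
    return [((b + 0.5) * bin_width, highest[b]) for b in sorted(highest)]


frames = [
    [(12.0, 80.0), (25.0, 60.0), (38.0, 70.0)],   # one tongue shape
    [(11.0, 90.0), (24.0, 55.0), (37.0, 50.0)],   # a second, higher at the back
]
envelope = palate_envelope(frames)
```

An envelope like this only approaches the palate shape where the tongue actually made contact, which is why sounds covering the alveolar, palatal and velar regions give the best coverage.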
LinguaSound 3D includes three deep learning models trained on thousands of hand-labelled midsagittal ultrasound images recorded at multiple scan depths. The training data covered speakers across a range of ages, including children and adults. The models offer different trade-offs between speed and accuracy:

  • ResNet-50 — highest accuracy, lowest speed
  • MobileNet V2 1.0 — midrange accuracy and speed (default)
  • MobileNet V2 0.35 — lowest accuracy, highest speed
The ResNet-50 model achieves a mean test error of 3.22 pixels at 320x240 resolution on held-out frames. A confidence score is provided for each contour estimation, allowing frames with low confidence to be automatically excluded.
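For readers who want to evaluate contour estimates against their own hand labels, the Python sketch below computes a mean pixel error. It assumes the error is the mean Euclidean distance between each predicted label and its hand-labelled position, which is a common convention for pose estimation. The coordinates are made up for the example.

```python
import math


def mean_pixel_error(predicted, labelled):
    """Mean Euclidean distance, in pixels, between paired (x, y) points."""
    if len(predicted) != len(labelled) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(p, q) for p, q in zip(predicted, labelled)]
    return sum(distances) / len(distances)


# Made-up coordinates for illustration only.
predicted = [(100.0, 50.0), (120.0, 48.0), (140.0, 52.0)]
labelled = [(103.0, 54.0), (120.0, 48.0), (140.0, 50.0)]
error = mean_pixel_error(predicted, labelled)
```

For scale, an error of 3.22 pixels is about 1% of the width of a 320-pixel image.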

Intended use: LinguaSound 3D is speech visualisation software for use by speech and language professionals, researchers, and educators. It is not intended for the diagnosis, prevention, monitoring, or treatment of disease.