3D visualisation of ultrasound tongue imaging
Conventional ultrasound tongue imaging is powerful, but the speckled greyscale image can sometimes be difficult to interpret, even for experienced users. LinguaSound 3D uses deep learning to estimate tongue surface contours from live ultrasound images and animates a rotatable 3D model of the tongue, hard palate and teeth in real time. The result is a view of tongue position and movement that is easier to interpret for clinicians, researchers and the speakers themselves.
Who uses LinguaSound 3D
LinguaSound 3D can be used by speech professionals, phonetics researchers, pronunciation instructors and human-computer interaction (HCI) engineers.
Clinical
LinguaSound 3D provides speech professionals with a clearer view of tongue position and movement during speech. The 3D model makes it easier to observe articulation patterns and provides real-time visual biofeedback to patients across a range of presentations including speech sound disorders, apraxia of speech and cleft palate. Synchronised audio recording allows tongue movement to be correlated with speech sounds, aiding interpretation. Hard palate estimation, which is not possible with conventional ultrasound imaging alone, provides an additional anatomical reference point.
Research
LinguaSound 3D can be used by phoneticians, linguists and speech scientists investigating articulatory phonetics, language-specific tongue gestures and speech production across populations. The software is language independent, making it suitable for research across any language or dialect. Synchronised audio and ultrasound recording supports detailed acoustic and articulatory analysis. Hard palate estimation provides an additional articulatory reference point not available with conventional ultrasound tongue imaging. Real-time tongue surface contour coordinates can also be exported for use in custom research and multimodal data collection.
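For multimodal analysis, exported ultrasound frames usually need to be lined up with the audio track. The sketch below shows only the arithmetic. It assumes a constant frame rate and that audio and ultrasound start at the same instant, neither of which is stated above. The function names and the 44.1 kHz sample rate are hypothetical.

```python
def frame_period_ms(fps: float) -> float:
    """Return the time between consecutive frames in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frame_to_audio_sample(frame_index: int, fps: float, sample_rate: int) -> int:
    """Return the audio sample index nearest to the start of a video frame.

    Assumption: constant frame rate, and audio and video begin together.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(frame_index * sample_rate / fps)


# Hypothetical values: 100 frames per second, 44.1 kHz audio.
period = frame_period_ms(100.0)                   # 10.0 ms per frame
sample = frame_to_audio_sample(250, 100.0, 44100)  # sample index at 2.5 s
```

If the recorded frame rate varies, per-frame timestamps would be a safer basis for alignment than a fixed rate.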
Pronunciation teaching
LinguaSound 3D gives language instructors a precise view of tongue position during speech. Unlike conventional ultrasound tongue imaging, the 3D model requires no specialist knowledge to interpret, making it accessible to instructors and learners alike. It can be used to demonstrate correct tongue placement in real time and provide visual biofeedback during pronunciation practice. Video recordings of tongue movement can also be screen captured directly from the software for use as teaching materials.
Engineering and HCI
LinguaSound 3D outputs real-time tongue surface contour coordinates and confidence scores in CSV format at up to 100 frames per second, together with the individual ultrasound video frames. This makes it a practical tool for researchers developing assistive technology, silent speech interfaces, and human-computer interaction systems that respond to tongue position and movement.
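As an illustration of how a tongue-driven interface might consume this output, the sketch below maps the vertical position of one contour point to a control value between 0 and 1 and ignores frames whose confidence score is low. It is a minimal sketch, not part of LinguaSound 3D: the function name, the calibration values and the 0.8 threshold are hypothetical, and it assumes pixel coordinates with y increasing downwards from the top of the image.

```python
from typing import Optional


def tongue_height_control(
    y_px: float,
    confidence: float,
    y_low_px: float,
    y_high_px: float,
    min_confidence: float = 0.8,
    previous: Optional[float] = None,
) -> Optional[float]:
    """Map a contour point's y coordinate to a control value in [0, 1].

    y_low_px is the pixel row of the lowest expected tongue position
    (larger y) and y_high_px the highest (smaller y). When the confidence
    score is below min_confidence, the previous value is returned unchanged.
    """
    if confidence < min_confidence:
        return previous
    span = y_low_px - y_high_px
    if span <= 0:
        raise ValueError("y_low_px must be greater than y_high_px")
    value = (y_low_px - y_px) / span
    return min(1.0, max(0.0, value))  # clamp to the valid range


# Hypothetical calibration: tongue point moves between rows 400 and 200.
level = tongue_height_control(300.0, 0.95, y_low_px=400.0, y_high_px=200.0)
```

Holding the previous value on low-confidence frames is one simple way to avoid jitter; a real system might smooth or interpolate instead.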
How LinguaSound 3D works
Ultrasound tongue imaging (UTI) is a non-invasive technique used to view the shape, position and movements of the tongue during speech using high-frequency sound waves. A layer of acoustic gel is applied to an ultrasound probe placed beneath the chin. The probe emits sound waves that travel through the tissues and are reflected back when they encounter different structures within the tongue. The ultrasound scanner analyses these reflections to generate dynamic images of tongue shape and motion, which are transferred to the host computer via USB.
LinguaSound 3D uses a deep learning model to generate tongue contour estimations from the ultrasound images. The model was trained on thousands of hand-labelled midsagittal ultrasound images of both children and adults, using the DeepLabCut framework. These contours are then used to drive an animated 3D model of the tongue, hard palate and teeth in real time. The 3D model represents tongue position and movement based on the estimated contour, rather than a direct image of the speaker's tongue.
Hard palate estimation
Because the air above the tongue surface reflects sound waves back to the probe, conventional ultrasound tongue imaging cannot directly capture the hard palate. LinguaSound 3D estimates its position and size by detecting the maximum extent of the tongue contour during contact. Hard palate estimation is performed as a brief setup sequence at the start of each session and can be repeated to refine the position as required. For best results, the speaker should produce sounds that bring the tongue into contact with the alveolar, palatal and velar regions. An accompanying video illustrates this procedure.
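The idea of taking the maximum extent of the tongue contour can be sketched as an upper envelope over many frames. The code below is an illustration of that general idea only and is not the algorithm used by LinguaSound 3D. It assumes contour points are given as (x, y, confidence) in image pixels with y increasing downwards, so the highest tongue position has the smallest y. The function name, bin width and confidence threshold are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, confidence)


def estimate_palate_envelope(
    frames: Iterable[Iterable[Point]],
    bin_width_px: float = 10.0,
    min_confidence: float = 0.8,
) -> List[Tuple[float, float]]:
    """Return the highest confident tongue point seen in each horizontal bin.

    Each result is (bin centre x, smallest y observed in that bin).
    """
    highest: Dict[int, float] = {}
    for contour in frames:
        for x, y, confidence in contour:
            if confidence < min_confidence:
                continue  # skip unreliable points
            bin_index = int(x // bin_width_px)
            if bin_index not in highest or y < highest[bin_index]:
                highest[bin_index] = y
    return [
        ((bin_index + 0.5) * bin_width_px, highest[bin_index])
        for bin_index in sorted(highest)
    ]
```

Repeating the setup sequence corresponds to feeding more frames into the envelope, which can only move each bin upwards or leave it unchanged.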
Comparing conventional ultrasound tongue imaging with LinguaSound 3D
LinguaSound 3D makes conventional ultrasound tongue imaging easier to interpret. A comparison of both imaging techniques is illustrated below for a typical English speaker:
| Speech sound | Conventional ultrasound tongue imaging | LinguaSound 3D |
|---|---|---|
| Velar Stop /k/ | ![]() | ![]() |
| Alveolar Stop /d/ | ![]() | ![]() |
| Retroflex /r/ | ![]() | ![]() |
| High-Front Vowel /i/ | ![]() | ![]() |
| Low-Back Vowel /a/ | ![]() | ![]() |
Compatible hardware
LinguaSound 3D is currently compatible with the Telemed MicrUs EXT-1H scanner and MC4-2R20S-3 20 mm convex probe. The scanner connects to any compatible Windows laptop or desktop via USB. Compact, lightweight and fanless, the scanner is suitable for use in clinical, classroom and laboratory settings. Contact us to discuss compatibility with your existing equipment or for purchasing guidance.
LinguaSound 3D software
LinguaSound 3D is a Windows 64-bit application that combines real-time ultrasound imaging, tongue contour estimation and 3D visualisation in a single user interface.
- Real-time 3D tongue, hard palate and teeth visualisation
- Real-time tongue surface contour estimation
- Hard palate estimation
- 360° photorealistic and stylised views
- Adjustable model transparency
- Synchronised audio recording and playback
- Record at up to 100 FPS*
- Preview, record and playback modes
- Real-time export of tongue contour coordinates and confidence scores to CSV
- Export individual ultrasound video frames as JPG
- Adjustable target for biofeedback guidance
- Configurable ultrasound scanner and probe settings
* Test system: Windows 11, Intel® Core™ i7 13700HX CPU, NVIDIA® GeForce RTX™ 4060 (8 GB) GPU, 16 GB RAM
System requirements
| Parameter | Value |
|---|---|
| Operating system | Windows 10 and 11 (64-bit) |
| Supported computers | Desktop and laptop |
| Mac support | Not currently supported |
| Processor | Intel Core i7 or i9 |
| Graphics | NVIDIA RTX series GPU (8 GB) |
| Memory | 16 GB RAM |
| Connectivity | USB 2.0 or 3.0 |
| Ultrasound scanner | Telemed MicrUs EXT-1H |
| Ultrasound probe | Telemed MC4-2R20S-3 (20 mm convex) |
Package contents
LinguaSound 3D is supplied as a perpetual software licence and includes:
- Secure download link to the latest software version
- Software registration key
- Priority technical support
- Free software updates
- 30-day money-back guarantee
Note: ultrasound scanner and probe sold separately.
Frequently asked questions
Individual ultrasound images are exported in JPG format. The image size is dependent on the resolution set in the ultrasound settings.
Tongue surface contour data is exported to a CSV file. Eleven labels describe the shape of the tongue surface, and additional labels mark anatomical landmarks including the hyoid bone, mandible base and mental spine. The x and y coordinates of each label are exported relative to the upper-left corner of the associated image. A confidence score between 0 and 1 is also exported for each label, representing the likelihood that the estimated position is correct.
Data export is supported in Preview, Record and Playback modes.
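A downstream script might read the exported CSV and discard low-confidence labels before analysis. The sketch below shows one way to do this with the Python standard library. The exact column layout of the export is not described here, so the column names (`frame`, `label`, `x`, `y`, `confidence`), the label names and the sample rows are all hypothetical and would need to be adapted to the real file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

LabelPoint = Tuple[str, float, float, float]  # (label, x, y, confidence)


def load_contours(csv_file: TextIO, min_confidence: float = 0.5) -> Dict[int, List[LabelPoint]]:
    """Group confident label positions by frame number.

    Assumes one row per label per frame with the hypothetical columns
    frame, label, x, y and confidence.
    """
    frames: Dict[int, List[LabelPoint]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        confidence = float(row["confidence"])
        if confidence < min_confidence:
            continue  # drop labels the model is unsure about
        frames[int(row["frame"])].append(
            (row["label"], float(row["x"]), float(row["y"]), confidence)
        )
    return dict(frames)


# Made-up rows for illustration only; not real exported data.
SAMPLE = """frame,label,x,y,confidence
0,tongue_1,120.5,310.0,0.98
0,tongue_2,150.0,290.2,0.41
1,tongue_1,121.0,308.5,0.97
"""

contours = load_contours(io.StringIO(SAMPLE))
```

With a real export, `open(path, newline="")` would replace the in-memory sample, and the confidence threshold would be chosen to suit the analysis.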
The software uses the DeepLabCut framework for tongue contour estimation. If you use DeepLabCut as part of your research pipeline, please also cite: Mathis et al. (2018). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21, 1281–1289.
The tongue contour estimation approach is based on: Wrench, A., and Balch-Tomes, J. (2022). Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut. Sensors, 22, 1133.
The following models can be selected, trading accuracy against speed:
- ResNet-50: highest accuracy, lowest speed
- MobileNet V2 1.0: midrange accuracy and speed (default)
- MobileNet V2 0.35: lowest accuracy, highest speed
Intended use: LinguaSound 3D is speech visualisation software for use by speech and language professionals, researchers, and educators. It is not intended for the diagnosis, prevention, monitoring, or treatment of disease.









