Machine Learning for Voice and Speech Science – NCVS – National Center for Voice and Speech

by Dr. Anil Palaparthi

Machine learning is a subfield of artificial intelligence that enables computers to learn patterns or models from data and to improve with experience without being explicitly programmed¹. In traditional programming, the programmer provides the computer with both the input data and the model (logic or algorithm) for a task. The program then applies the model to the input data to obtain the output (Fig. 1a). In contrast, the goal of machine learning is to develop the model (algorithm) itself: during training, the computer is given both the data and the expected output, and it learns the model that relates them (Fig. 1b)².

Figure 1. (a) Traditional Program (b) Machine Learning model training and inference.
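
The contrast in Fig. 1 can be illustrated with a minimal sketch. It assumes a single hypothetical acoustic feature and made-up numbers that are not clinical values; the function names are illustrative only. In the traditional program the threshold is written by the programmer, whereas in the machine learning version the threshold is learned from inputs paired with expected outputs.

```python
# Toy contrast between a hand-written rule and a rule learned from examples.
# All numbers are invented for illustration; they are not clinical thresholds.


def traditional_program(feature):
    """The programmer supplies the model: a fixed, hand-chosen threshold."""
    return "disordered" if feature > 1.0 else "normal"


def train_threshold(samples, labels):
    """Learn a threshold from (input, expected output) pairs.

    Tries every midpoint between neighbouring sorted inputs and keeps the
    one that classifies the most training samples correctly.
    """
    values = sorted(samples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def correct_count(threshold):
        predictions = ["disordered" if x > threshold else "normal" for x in samples]
        return sum(p == y for p, y in zip(predictions, labels))

    return max(candidates, key=correct_count)


def learned_program(feature, threshold):
    """Inference: apply the learned model to new input data."""
    return "disordered" if feature > threshold else "normal"


# Hypothetical training data: inputs together with their expected outputs.
train_inputs = [0.3, 0.5, 0.6, 0.8, 1.9, 2.2, 2.6, 3.0]
train_labels = ["normal"] * 4 + ["disordered"] * 4

threshold = train_threshold(train_inputs, train_labels)
print("learned threshold:", threshold)
print("prediction for 2.4:", learned_program(2.4, threshold))
```

The learned threshold falls in the gap between the two groups of training inputs, so the programmer never has to choose it by hand.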

Types of Machine Learning

Machine learning can be broadly subdivided into supervised learning, unsupervised learning, and reinforcement learning³. Supervised learning is the most common type of machine learning. It uses both inputs (data or features of the data) and outputs (often in the form of verbal descriptors, or labels) to learn the relationship between inputs and outputs as a model. The learned model is then applied to new data to predict new outputs; the predictions are reliable as long as the new inputs are similar to the inputs that were used to train the model (Fig. 1b). In voice and speech science research, supervised learning has been predominantly used for the automatic detection of disorders⁴⁻⁶, improving the computational efficiency of simulators⁷, and estimation of voice control parameters from acoustic output signals⁸,⁹.

Automatic Detection of Disorders

Continuous and long-term monitoring of patients is now possible with the use of the cloud and the Internet of Things (IoT) in healthcare. At the same time, voice and speech samples are easy to acquire from patients using IoT devices such as mobile phones. Prior research has shown that some disorders can be detected from voice and speech signals⁴⁻⁶. Therefore, machine learning techniques can be applied to features of speech signals for the automatic detection of disorders.

To automatically detect voice disorders using machine learning models, experts first label the speech samples as either normal or belonging to a particular disorder. Supervised learning algorithms then learn the relation between the input samples and their corresponding labels. The trained algorithms then predict whether a new sample belongs to normal phonation or disordered phonation. Such supervised learning is being used to detect laryngeal cancer, dysphonia, vocal fold nodules, polyps, edema, vocal fold paralysis, and neuromuscular disorders from voice and speech samples. Supervised learning is also being used to objectively detect GRBAS (grade, roughness, breathiness, asthenia, and strain) voice quality features¹⁰.
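
The label, train, and predict steps described above can be sketched with a nearest-centroid classifier, one of the simplest supervised methods. This is only an illustration under stated assumptions: the two-element feature vectors and their values are invented, the function names are hypothetical, and published systems use richer features and models.

```python
import math


def fit_centroids(features, labels):
    """Training: average the feature vectors that share an expert label."""
    sums, counts = {}, {}
    for vector, label in zip(features, labels):
        running = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            running[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Inference: return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Invented feature vectors (two hypothetical acoustic features per sample)
# with labels that stand in for expert annotations.
features = [(0.4, 2.0), (0.5, 2.5), (0.6, 3.0), (2.0, 7.0), (2.4, 8.0), (2.2, 9.0)]
labels = ["normal", "normal", "normal", "disordered", "disordered", "disordered"]

centroids = fit_centroids(features, labels)
print(predict(centroids, (0.55, 2.8)))
print(predict(centroids, (2.1, 7.5)))
```

In practice the features would usually be rescaled to comparable ranges before distances are computed, since a feature with large numeric values would otherwise dominate the decision.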

Improving Computational Efficiency of Simulators

Voice and speech simulators are widely used to understand the physiology of voice production, validate therapies, and predict and optimize surgical interventions¹¹. Accurate simulation of voice production for clinical purposes requires patient-specific geometries, anisotropic material properties, and the solution of complex 3D fluid-structure interactions⁷. Solving these interactions in patient-specific geometry is computationally expensive: supercomputers can take multiple weeks to produce one second of speech output. This computational cost prevents widespread clinical use of voice simulators. Faster machine learning models, trained on accurate flow and pressure data obtained by solving the Navier-Stokes equations, can replace the traditional computational methods and provide the necessary speed and accuracy. Machine learning models are also being used to estimate some poorly known physiological control parameters of the vocal system from the outputs it produces. These parameters may include vocal fold geometry, stiffness, and subglottal pressure, estimated from the produced acoustics, aerodynamics, and vocal fold vibration. Such estimates can give clinicians quantitative information for better diagnosis of voice disorders⁷,⁸.
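
The idea of replacing a slow solver with a fast learned model (often called a surrogate) can be sketched as follows. Everything here is an assumption made for illustration: `slow_simulator` is a hypothetical stand-in with an invented power-law form and arbitrary units, not a real flow solver, and the surrogate is an ordinary least-squares line fitted in log-log space rather than the deep neural network used in the cited work⁷.

```python
import math


def slow_simulator(pressure):
    """Hypothetical stand-in for an expensive flow solver (arbitrary units)."""
    return 0.05 * math.sqrt(pressure)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


# Run the "expensive" simulator a few times to build the training data.
train_pressures = [200.0, 400.0, 800.0, 1200.0, 1600.0]
train_flows = [slow_simulator(p) for p in train_pressures]

# Fit in log-log space, where a power law becomes a straight line.
intercept, slope = fit_line(
    [math.log(p) for p in train_pressures],
    [math.log(q) for q in train_flows],
)


def surrogate(pressure):
    """Fast learned approximation of slow_simulator."""
    return math.exp(intercept + slope * math.log(pressure))


print("simulator:", slow_simulator(1000.0))
print("surrogate:", surrogate(1000.0))
```

Because the stand-in is deliberately simple, the surrogate reproduces it almost exactly inside the training range. A real fluid-structure solver would need a far more expressive model and many more training runs.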

Unsupervised Learning

Supervised learning methods require data with accurate labels (verbal descriptors) or outputs to perform well. However, generating accurate labels is often difficult and requires experts, and the resulting labels can be highly subjective and prone to errors. Unsupervised learning, on the other hand, does not use labeled data. Instead, it automatically learns patterns in the input data and groups the data into multiple categories. Unsupervised learning is currently being used for disorder detection¹², emotion recognition¹³, and voice quality detection using voice and speech samples.
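
Grouping unlabeled data can be sketched with k-means clustering, a common unsupervised method. The sketch below is a minimal one-dimensional version on invented values; the function name and the deterministic initialization are choices made for this illustration, and it is not the method of any cited study.

```python
def k_means_1d(values, k=2, iterations=20):
    """Group unlabeled numbers into k clusters (requires k >= 2)."""
    # Deterministic start: spread the initial centers across the sorted data.
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    def nearest(value):
        return min(range(k), key=lambda i: abs(value - centers[i]))

    for _ in range(iterations):
        # Assignment step: attach every value to its closest center.
        groups = [[] for _ in range(k)]
        for value in values:
            groups[nearest(value)].append(value)
        # Update step: move each center to the mean of its group.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]

    return centers, [nearest(value) for value in values]


# Invented, unlabeled feature values; no expert labels are supplied.
samples = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0, 1.2, 4.9]
centers, assignments = k_means_1d(samples)
print("cluster centers:", centers)
print("assignments:", assignments)
```

The algorithm finds two groups without being told what they mean; deciding whether a group corresponds to a disorder, an emotion, or a voice quality still requires interpretation.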

Reinforcement Learning

In reinforcement learning, training data are not needed ahead of time¹⁴. The model interacts with the physical plant (the vocal system) in a trial-and-error manner and learns to control it. This method can be used to learn the neural processes that control the vocal system. The resulting neural control systems try to mimic how the brain controls the vocal system.

When used with voice simulators, such neural control systems can provide insights into neuromuscular disorders such as vocal tremor, Parkinson’s disease, and spasmodic dysphonia. The use of reinforcement learning is still in its infancy in voice and speech science research. Even though the DIVA model¹⁵ and other neural controllers of the vocal system¹⁶⁻¹⁸ do not use reinforcement learning in its true sense, they fall under its broader category. The controllers (reinforcement learning models) take our vocal intentions (how high in pitch the voice should be, how loud the voice should be, how rough or periodic the voice should be, and what syllable to produce) as inputs and generate corresponding muscle activations as output. These muscle activations are provided as input to the vocal system, which then produces phonation that matches the vocal intentions. The auditory and somatosensory feedback from the vocal system is continuously used to train the reinforcement learning models after every interaction. This continuous interaction through feedback allows the model to improve over time (Fig. 2).

Figure 2. Neural control system for the vocal system
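
The trial-and-error loop of intention, muscle activation, plant output, and feedback can be sketched with an epsilon-greedy learner, one of the simplest reinforcement learning settings (a multi-armed bandit). The sketch rests on strong assumptions: `vocal_plant` is a hypothetical linear mapping from a single activation to pitch, the reward is the negative pitch error, and none of this reproduces the DIVA model or the other cited controllers.

```python
import random


def vocal_plant(activation):
    """Hypothetical plant: maps one muscle activation (0 to 1) to pitch in Hz."""
    return 100.0 + 200.0 * activation


def learn_activation(target_pitch, trials=300, epsilon=0.1, seed=0):
    """Learn by trial and error which activation best matches the target."""
    rng = random.Random(seed)
    actions = [i / 10 for i in range(11)]
    # Rewards are never positive, so starting estimates at zero is optimistic
    # and encourages the learner to try every action at least once.
    value = [0.0] * len(actions)
    count = [0] * len(actions)

    for _ in range(trials):
        if rng.random() < epsilon:
            choice = rng.randrange(len(actions))  # explore a random action
        else:
            choice = max(range(len(actions)), key=lambda i: value[i])  # exploit
        pitch = vocal_plant(actions[choice])  # act on the plant
        reward = -abs(pitch - target_pitch)  # feedback: negative pitch error
        count[choice] += 1
        # Incremental mean of the rewards observed for this action.
        value[choice] += (reward - value[choice]) / count[choice]

    tried = [i for i in range(len(actions)) if count[i] > 0]
    return actions[max(tried, key=lambda i: value[i])]


learned = learn_activation(target_pitch=220.0)
print("learned activation:", learned)
print("resulting pitch:", vocal_plant(learned))
```

No training data are prepared in advance: the learner only sees the feedback produced by its own actions, and its estimates improve with every interaction.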

Limitations of Machine Learning

Machine learning models are only as good as their data. Their accuracy relies on the quality and quantity of the data available for training, and their performance degrades if the data are incomplete, biased, or contain errors. They can only learn the patterns in the input data and lack the ability to extrapolate or improvise that is typical of humans. For example, if new data fall outside the ranges of the training data, machine learning models will have a hard time predicting the correct output, even for the simplest of problems. Some machine learning models, such as deep neural networks, can be highly complex, and their results may be difficult to explain. This could be a limitation, especially in the healthcare sector, where the ability to explain how a decision was made is important¹⁹.
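
The difficulty with inputs outside the training range can be shown on a very simple problem. In this sketch, a straight line is fitted to training data that cover only inputs between 0 and 1, while the true relation is assumed to be y = x²; the numbers are chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


# Training data cover only inputs from 0 to 1; the true relation is y = x**2.
train_x = [i / 10 for i in range(11)]
train_y = [x ** 2 for x in train_x]
intercept, slope = fit_line(train_x, train_y)


def model(x):
    """Prediction of the fitted line."""
    return intercept + slope * x


inside_error = abs(model(0.5) - 0.5 ** 2)   # input within the training range
outside_error = abs(model(3.0) - 3.0 ** 2)  # input far outside the range
print("error inside training range:", inside_error)
print("error outside training range:", outside_error)
```

The model is adequate for inputs like those it was trained on, but its error grows sharply once the input leaves that range.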

Conclusion

Machine learning approaches are being used in a wide range of applications, including voice and speech science. The approaches will get better and more powerful in the future with more databases, accurate modeling, and wide distribution of software, allowing researchers and practitioners to take better advantage of them. The profound question is: will human intelligence and human learning be advanced with artificial intelligence, or will trust in machine learning diminish a deeper understanding of the communication sciences and disorders?

References

1. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT press, pp. 1-7, 2018.
2. Turner, R. Machine Learning: The ultimate beginner’s guide to learn machine learning, artificial intelligence and neural networks step by step. Publishing Factory LLC, 2020.
3. Ayodele, T.O. Types of machine learning algorithms. New advances in machine learning, vol. 3, pp. 19-48, 2010.
4. Al-Dhief, F.T., Latiff, N.M.A., Malik, N.N.N.A., Salim, S. N., Baki, M.M., Albadr, M.A.A., et al. A survey of voice pathology surveillance systems based on Internet of Things and machine learning algorithms. IEEE Access, vol. 8, pp. 64514-64533, 2020.
5. Verde, L., De Pietro, G. and Sannino, G. Voice disorder identification by using machine learning techniques. IEEE Access, vol. 6, pp. 16246-16255, 2018.
6. Hegde, S., Shetty, S., Rai, S. and Dodderi, T. A survey of machine learning approaches for automatic detection of voice disorders. Journal of Voice, vol. 33(6), pp. 947.e11-947.e33, 2019.
7. Zhang, Y., Zheng, X. and Xue, Q. A deep neural network based glottal flow model for predicting fluid-structure interactions during voice production. Applied Sciences (Basel), vol. 10(2): 705, 2020.
8. Zhang, Z. Voice feature selection to improve performance of machine learning models for voice production inversion. Journal of Voice, 2021.
9. Zhang, Z. Estimation of vocal fold physiology from voice acoustics using machine learning. The Journal of the Acoustical Society of America, vol. 147(3), pp. EL264-EL270, 2020.
10. Kojima, T., Fujimura, S., Hasebe, K., Okanoue, Y., Shuya, O., Yuki, R. et al. Objective assessment of pathological voice using artificial intelligence based on the GRBAS scale. Journal of Voice,
11. Titze, I.R. and Lucero, J.C. Voice simulation: The next generation. Applied Sciences, vol. 12(22): 11720, 2022.
12. Rueda, A. and Krishnan, S. Clustering Parkinson’s and age-related voice impairment signal features for unsupervised learning. Advances in Data Science and Adaptive Analysis, vol. 10(2): 1840007, 2018.
13. Zhang, Z., Weninger, F., Wollmer, M. and Schuller, B. Unsupervised learning in cross-corpus acoustic emotion recognition. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 523-528, 2011.
14. Sutton, R.S. and Barto, A.G. Reinforcement learning: An introduction. MIT press, 2018.
15. Guenther, F.H. Neural control of speech. Cambridge, MA. MIT press, 2016.
16. Kroger, B.J., Kannampuzha, J. and Rube, C.N. Towards a neurocomputational model of speech production and perception. Speech Communication, vol. 51, pp. 793-808, 2009.
17. Hickok, G. Computational neuroanatomy of speech production. Nature reviews neuroscience, vol. 13(2), pp. 135-145, 2012.
18. Palaparthi, A. Computational motor learning and control of the vocal source for voice production. Ph.D. dissertation, University of Utah, Salt Lake City, UT, 2021.
19. Ribeiro, M.T., Singh, S. and Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135-1144, 2016.

How to Cite

Palaparthi, A. (2023), Machine Learning for Voice and Speech Science. NCVS Insights, Vol 1(1), pp. 3-4. DOI: https://doi.org/10.62736/ncvs189226