The emergence of virtual assistants such as Siri and Alexa has made automatic speech recognition systems more widely used and more actively developed. Automatic speech recognition (ASR) is the process of converting spoken language into text. The technology is now routinely used in instant messaging applications, search engines, in-vehicle systems, and home automation. Although these systems rely on slightly different technical processes, the first step is always the same: capture voice data and convert it into machine-readable text. But how does an ASR system work? How does it learn to recognize speech?

ASR systems: how do they work?

At the most basic level, automatic speech recognition looks like this: audio data goes in, text data comes out. Between input and output, however, the audio has to become machine-readable data. To do that, it is passed through an acoustic model and a language model. The acoustic model determines the relationship between audio signals and the speech units of the language, and the language model matches those sounds to words and word sequences. Together, the two models let the ASR system run a probabilistic check on the audio input to predict the words and sentences it contains. The system then selects the prediction with the highest confidence level. (Sometimes the language model will prioritize certain predictions that are considered more likely because of other factors.)

So, if a phrase is run through an ASR system, the following happens:

1. Take a voice input: "Hey Siri, what time is it?"
2. Run the voice data through the acoustic model, which breaks it down into speech units.
3. Run that data through the language model.
4. Output text data: "Hey Siri, what time is it?"

It is worth mentioning that if the automatic speech recognition system is part of a voice user interface, the ASR model will not be the only machine learning model that is running.
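
The way the acoustic model and the language model jointly pick the most confident prediction, as described above, can be sketched in a few lines of Python. This is only a toy illustration: the candidate transcripts, the probability values, and the names `ACOUSTIC_SCORES`, `LANGUAGE_SCORES`, `combined_log_score`, `decode`, and `lm_weight` are all assumptions made up for this example, and a real ASR system would score far more hypotheses with trained models.

```python
import math

# Hypothetical acoustic-model scores: how well each candidate transcript
# matches the audio. The numbers are invented for illustration only.
ACOUSTIC_SCORES = {
    "hey siri what time is it": 0.40,
    "hey siri what dime is it": 0.45,
    "hey siri watt time is it": 0.15,
}

# Hypothetical language-model scores: how plausible each word sequence is
# in the language, independent of the audio. Also invented.
LANGUAGE_SCORES = {
    "hey siri what time is it": 0.90,
    "hey siri what dime is it": 0.02,
    "hey siri watt time is it": 0.08,
}


def combined_log_score(candidate, lm_weight=1.0):
    """Add the log-probabilities from both models for one candidate."""
    acoustic = math.log(ACOUSTIC_SCORES[candidate])
    language = math.log(LANGUAGE_SCORES[candidate])
    return acoustic + lm_weight * language


def decode(candidates, lm_weight=1.0):
    """Return the candidate transcript with the highest combined score."""
    return max(candidates, key=lambda c: combined_log_score(c, lm_weight))


if __name__ == "__main__":
    print(decode(ACOUSTIC_SCORES))                 # both models combined
    print(decode(ACOUSTIC_SCORES, lm_weight=0.0))  # acoustic model only
```

In this made-up data the acoustic model alone slightly prefers "what dime is it", and it is the language model that tips the result toward the sensible sentence, which is the kind of prioritization mentioned above.
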
Many automatic speech recognition systems are used in conjunction with natural language processing (NLP) and text-to-speech (TTS) systems to do their jobs. An in-depth look at voice user interfaces, however, is a complete topic in itself. So now you know how an ASR system works, but what do you need to build one? The key is data.

Building an ASR system: the importance of data

A good ASR system has to be flexible. It needs to recognize a wide variety of audio inputs (speech samples) and produce accurate text output from them so that it can respond accordingly. To achieve this, an ASR system needs labeled speech samples together with their transcriptions. It is a bit more complicated than that (for example, the data labeling process is very important and often overlooked), but it is simplified here for the sake of clarity.

ASR systems require large amounts of audio data. Why? Because language is complicated. There are many ways to say the same thing, and the meaning of a sentence changes with the position and emphasis of its words. Also consider that there are many different languages in the world, and within each of them pronunciation and word choice can vary with factors such as geographic location and accent. And don't forget that language also varies with age and gender!

With this in mind, the more speech samples an ASR system is given, the better it becomes at recognizing and classifying new speech input. The more varied the voices and environments those samples come from, the better the system can recognize speech in those environments. With specialized fine-tuning and maintenance, an automatic speech recognition system can also keep improving while it is in use.

So, from the most basic point of view, the more data, the better. Research into getting good results from smaller data sets is ongoing, but most models currently require large amounts of data to perform well.
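
To make the idea of labeled speech samples a little more concrete, here is a minimal sketch of how such a data set might be represented and checked for variety. Everything in it is an assumption for illustration: the `SpeechSample` fields, the file paths, the sample values, and the helper names `coverage` and `underrepresented` do not come from any particular ASR toolkit.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSample:
    """One labeled training example: a recording plus its transcription."""
    audio_path: str    # hypothetical path to the recording
    transcript: str    # the text label for the recording
    language: str
    accent: str
    environment: str   # where the recording was made


# A tiny, invented data set. Real data sets contain far more samples.
SAMPLES = [
    SpeechSample("clips/0001.wav", "what time is it", "en", "US", "quiet room"),
    SpeechSample("clips/0002.wav", "turn on the lights", "en", "US", "car"),
    SpeechSample("clips/0003.wav", "what time is it", "en", "UK", "car"),
    SpeechSample("clips/0004.wav", "play some music", "en", "IN", "kitchen"),
]


def coverage(samples, attribute):
    """Count how many samples there are for each value of an attribute."""
    return Counter(getattr(sample, attribute) for sample in samples)


def underrepresented(samples, attribute, min_count):
    """List attribute values that have fewer than min_count samples."""
    counts = coverage(samples, attribute)
    return sorted(value for value, n in counts.items() if n < min_count)


if __name__ == "__main__":
    print(coverage(SAMPLES, "environment"))
    print(underrepresented(SAMPLES, "environment", min_count=2))
```

A check like this is one simple way to spot accents or recording environments for which more samples would need to be collected.
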
Fortunately, thanks to data set repositories and dedicated data collection services, collecting audio data has become easier, which in turn speeds up the development of the technology.

Finally, let's take a brief look at where automatic speech recognition is headed. ASR technology is already integrated into society. Virtual assistants, in-vehicle systems, and home automation all make daily life more convenient, and the range of applications is likely to keep expanding. As more and more people adopt these services, the technology will develop further.