JARVIS Basics and Setup

I think everyone knows J.A.R.V.I.S. (Just A Rather Very Intelligent System), the artificial intelligence from the Iron Man movies. This JARVIS is of course not as cool and intelligent as the version from the movies; it is a relatively simple prototype for command control by speech recognition, based on a pre-trained artificial neural network. The neural network can be downloaded for free from the DeepSpeech project. JARVIS is able to recognize and respond to spoken commands in real time; the commands and the corresponding actions and text responses are stored in a JSON file.

The implementation is done in Python. Thus, JARVIS can be run on any system for which a Python environment is available, i.e. on normal computers as well as on a variety of single-board computers, such as the Raspberry Pi.
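The JSON file with the commands, actions, and text responses mentioned above is not fixed to a particular layout at this point. The following is only a minimal sketch of what such a file (here assumed to be called commands.json, with hypothetical "action" and "response" fields) could look like and how it could be loaded:

import json

# hypothetical layout of commands.json (an assumption for illustration):
# {
#     "hello":  {"action": "greet",     "response": "Hello, how can I help you?"},
#     "lights": {"action": "lights_on", "response": "Turning the lights on."}
# }
with open("commands.json") as f:
    commands = json.load(f)
print(commands["hello"]["response"])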

Two main components are necessary for real-time speech recognition: "PyAudio" and "DeepSpeech".

Installing PyAudio and DeepSpeech

PyAudio provides Python bindings for PortAudio v19, the cross-platform audio I/O library. With PyAudio, you can easily use Python to play and record audio on a variety of platforms, such as GNU/Linux, Microsoft Windows, and Apple macOS. PyAudio is distributed under the MIT License. Use

python -m pip install pyaudio 

to install PyAudio. On Windows, this installs a precompiled PyAudio wheel with PortAudio v19 19.7.0 included, compiled with support for the Windows MME API, DirectSound, WASAPI, and WDM-KS; on other platforms, the PortAudio library may need to be installed separately (e.g. via the system package manager).
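A quick way to check that PyAudio works is to list the audio devices PortAudio can see. The following snippet is only a small sketch for this check and not part of the JARVIS code itself:

import pyaudio

audio = pyaudio.PyAudio()
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    # devices with at least one input channel can be used as a microphone
    print(i, info['name'], "- input channels:", info['maxInputChannels'])
audio.terminate()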

DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow. Use

# Install DeepSpeech
pip3 install deepspeech

# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm

to install DeepSpeech.
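To verify that the model loads and decodes correctly, it can first be tested offline on a short WAV file. The file name test.wav below is an assumption; the recording should be mono, 16-bit, 16 kHz to match the model:

import wave
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
wav = wave.open("test.wav", "rb")
# read all frames and interpret them as 16-bit samples
audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
wav.close()
print(model.stt(audio))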

Basic Structure of the Program

The main class is called JARVIS. In the __init__ method, the language model is loaded, PyAudio is initialized, and an audio input device (a microphone) is searched for.

import deepspeech
import numpy as np
import pyaudio
from queue import SimpleQueue

class JARVIS:
    def __init__(self):
        # load the pre-trained DeepSpeech language model
        self.model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
        self.model.setBeamWidth(512)
        # direct audio processing
        self.audio = pyaudio.PyAudio()
        self.index, name = self.findAudioDevice(self.audio, 'pulse')
        print("selected audio device:", name)

The "findAudioDevice" method searches for all microphones on the system.

    def findAudioDevice(self, audio, device_name):
        ''' find a specific input device or return the default input device '''
        default = audio.get_default_input_device_info()
        for i in range(audio.get_device_count()):
            name = audio.get_device_info_by_index(i)['name']
            if name == device_name:
                return (i, name)
        return (default['index'], default['name'])

The "waitForCommand" method is the core of the system. This method starts an audio stream via PyAudio. The stream is then directly converted to text with the Deepspeech method "intermediateDecode"() and then printed to the console. As soon as the command "stop" is recognized, the method is terminated.

    def waitForCommand(self):
        # important: create a new queue to make sure it is empty
        self.buffer_queue = SimpleQueue()
        self.stream = self.model.createStream()
        # one buffer holds one second of audio (the model's sample rate in samples)
        buffer_size = self.model.sampleRate()
        self.audio_stream = self.audio.open(rate=self.model.sampleRate(),
                                            channels=1,
                                            format=self.audio.get_format_from_width(2, unsigned=False),
                                            input_device_index=self.index,
                                            input=True,
                                            frames_per_buffer=buffer_size,
                                            stream_callback=self.audio_callback)

        while self.audio_stream.is_active():
            # feed the queued audio into the DeepSpeech stream and decode it
            self.stream.feedAudioContent(self.buffer_queue.get())
            text = self.stream.intermediateDecode()
            print(">>", text)
            if text.find('stop') >= 0:
                break
        self.stream.finishStream()
        self.audio_stream.close()

    # callback for new audio data from PyAudio
    def audio_callback(self, in_data, frame_count, time_info, status_flags):
        # convert the raw bytes to 16-bit samples and queue them for decoding
        self.buffer_queue.put(np.frombuffer(in_data, dtype='int16'))
        return (None, pyaudio.paContinue)

Now JARVIS just needs to be instantiated and is ready for a first test.
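A minimal test driver could look like this (the if __name__ guard is only added here for illustration):

if __name__ == '__main__':
    jarvis = JARVIS()
    # listen and print recognized text until the word "stop" is spoken
    jarvis.waitForCommand()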