#SpeechToText #AIinLearning #VoiceInput #eLearningDesign #Storyline #AzureSpeech #AccessibleLearning
AI Create Index 550 (meaning that although the copy is original, AI was used in the research for the content, the copy was refined and improved using AI, and AI was also used in the generation of some graphics). THIS IS 50% ORIGINAL CONTENT.
Introduction
If we are honest about it, we know that a lot of eLearning still boils down to watching a screen and clicking through a set of options. Even when we incorporate branching scenarios or multimedia, learner input is still largely limited to multiple-choice, drag-and-drop, or typing short responses into form fields.
But as voice interfaces become more mainstream – on phones, smart speakers, and even productivity apps – it’s time we ask: why not in eLearning? Voice isn’t just a user interface trend. It represents a more natural, intuitive, and human way for learners to interact with content. And the first step toward voice-powered learning isn’t necessarily full-blown AI conversation — it may be something much simpler and more easily achievable: speech-to-text.
Whilst previous articles in this series have focused on AI-generated speech, we now explore how speech input and AI can enhance learning design by making interactions more engaging, accessible, and reflective of how people communicate in the real world. We’ll share how we’ve implemented this in recent courses, and why we believe this approach opens the door to far more dynamic, learner-centred experiences.
Case Study: Adding Voice Input to a Storyline Module
In one of our recent projects built in Articulate Storyline, we gave learners the option to input notes at various points in the course, and these were incorporated into a PDF handout at the end of the module. We realised that for the notes to be meaningful, typing them out would be time-consuming for the learner. The solution was obvious – allow the user the option of using voice instead of the keyboard to complete free-text tasks. Although our primary motivation was to make it faster for the learner to complete their input, we quickly realised the benefits for accessibility.
Click the play button on the video to see a demonstration:
So, where users were prompted to provide written input to text entry fields, we placed a web object incorporating a microphone icon that activated the user’s microphone and began recording. Here’s how it worked under the hood:
- The web object included a JavaScript event listener that detected when the learner clicked the mic icon to start or stop recording.
- Once recording stopped, the audio file (a temporary .wav) was sent to a short server-side app that used Azure’s Speech-to-Text API for transcription.
- Within seconds, the learner’s spoken input appeared in the text field — editable, just like typed text.
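The server-side flow above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the transcriber is injected as a callable (in the real service it wrapped Azure’s Speech-to-Text API) so the flow can be shown without credentials, and all names are illustrative.

```python
# Sketch of the server-side flow: receive a recorded clip, transcribe it,
# and return text ready to drop into the Storyline text-entry field.
# The transcriber is injected so the flow runs without an Azure key;
# in production it would call Azure's Speech-to-Text API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class TranscriptionResult:
    text: str        # transcript to place in the text field
    editable: bool   # the learner can still edit it, just like typed text

def handle_recording(wav_bytes: bytes,
                     transcribe: Callable[[bytes], str]) -> TranscriptionResult:
    """Transcribe a temporary .wav clip and return editable text."""
    if not wav_bytes:
        # Nothing was recorded; leave the field empty but editable.
        return TranscriptionResult(text="", editable=True)
    return TranscriptionResult(text=transcribe(wav_bytes).strip(), editable=True)

# Demo with a stand-in transcriber (the real one would call Azure):
def fake_transcribe(_: bytes) -> str:
    return " I would start by listening to the customer. "

result = handle_recording(b"RIFF...wav data...", fake_transcribe)
print(result.text)
```

The key design point is that the transcript lands in the same text field the learner would have typed into, so everything downstream of the field is unchanged.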
This was more than a novelty. It gave learners choice and flexibility — and in some cases, reduced the barrier to engagement. Whether for accessibility reasons, convenience, or simple preference, being able to speak instead of type changed how learners interacted with the content.
And critically, it didn’t change the instructional logic. The same branching, feedback, and assessment applied whether the learner typed or spoke their input.
Why Speech-to-Text Matters for Learning Design
This example demonstrates how the ability to accept voice input opens up new layers of possibility for instructional designers — not as a gimmick, but as a genuinely learner-centric design enhancement.
Accessibility For learners with motor impairments, dyslexia, or temporary physical limitations (like an injured hand), typing can be a challenge. Voice input offers a more inclusive alternative that allows them to fully participate.
Engagement and Presence Speaking feels more active and personal than typing. It mimics real-world interactions, especially in scenario-based learning, where the learner might be playing the role of a team leader, customer service rep, or coach. Saying a response out loud creates a sense of presence that typing often lacks.
Efficiency In some cases, speech input is simply faster. For learners on mobile or tablet devices, typing long responses can be awkward. Voice speeds things up and lowers friction — especially for quick exercises or reflection prompts.
Expressiveness When learners speak freely, they tend to be less formal, more direct, and more expressive. That can lead to richer data for formative assessment — and it gives AI tools more natural language to work with when providing feedback or adapting responses.
Beyond Input: Batch Processing and AI Integration
While real-time voice input already enhances engagement, the real power of speech-to-text in eLearning emerges when we look beyond the immediate interaction. Once you’ve transcribed a learner’s spoken response into text, that text becomes data — and data can be processed, analyzed, or even fed into intelligent systems.
This opens up a wide range of design possibilities:
Send to an AI API for Feedback or Response
- A learner explains how they’d handle a customer complaint → GPT analyzes the answer and offers coaching tips.
- A leadership scenario asks for a spoken decision rationale → AI replies with potential outcomes or peer comparisons.
This creates a feedback loop that’s contextual, immediate, and adaptive — a step closer to personalized learning at scale.
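A minimal sketch of that feedback loop, assuming a chat-completions-style API such as OpenAI’s: the function below only assembles the request payload (the scenario text, rubric wording, and model name are illustrative placeholders, not from a real course).

```python
# Sketch: turn a transcribed spoken answer into a coaching-feedback
# request for a chat-completions-style API. Only the payload is built
# here; sending it and parsing the reply is left to the host app.

def build_feedback_request(scenario: str, learner_answer: str) -> dict:
    """Assemble the messages sent to the AI for coaching feedback."""
    return {
        "model": "gpt-4o-mini",  # assumption: any chat-capable model works
        "messages": [
            {"role": "system",
             "content": ("You are a supportive coach. Assess the learner's "
                         "spoken answer against the scenario and give two "
                         "concrete tips for improvement.")},
            {"role": "user",
             "content": f"Scenario: {scenario}\n\n"
                        f"Learner's answer: {learner_answer}"},
        ],
    }

request = build_feedback_request(
    "A customer complains that their order arrived late.",
    "I'd apologise, check the tracking, and offer a discount.",
)
print(request["messages"][0]["role"])
```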
Store and Reuse Responses Later
Transcribed responses can be:
- Reused in future lessons for reflection.
- Analyzed for learning patterns or sentiment.
- Used to personalize branching, content delivery, or peer feedback.
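To make the reuse idea concrete, here is a small in-memory sketch; in production this would be an LMS database or xAPI record store, and the field names are illustrative.

```python
# Sketch: store transcribed responses so a later lesson can surface
# them again as reflection prompts. An in-memory dict stands in for
# what would really be an LMS / xAPI store.

from collections import defaultdict

class ResponseStore:
    def __init__(self):
        self._by_learner = defaultdict(list)

    def save(self, learner_id: str, prompt_id: str, transcript: str) -> None:
        """Record one transcribed answer against a learner and prompt."""
        self._by_learner[learner_id].append(
            {"prompt": prompt_id, "text": transcript})

    def for_reflection(self, learner_id: str, prompt_id: str) -> list:
        """Fetch earlier answers to replay in a follow-up lesson."""
        return [r["text"] for r in self._by_learner[learner_id]
                if r["prompt"] == prompt_id]

store = ResponseStore()
store.save("learner-42", "module1-complaint", "I'd apologise first.")
print(store.for_reflection("learner-42", "module1-complaint"))
```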
Batch Process Voice Inputs
You can collect and convert large volumes of spoken input — from role-play exercises, presentations, or workshops — into searchable transcripts. This is useful for:
- Reviewing learner performance.
- Conducting AI-driven analysis.
- Supporting compliance and auditing in regulated industries.
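A batch run like this reduces to two small steps: transcribe every clip, then search the resulting text. The sketch below injects the transcriber (in our case it wrapped Azure’s Speech-to-Text API) and uses a plain keyword scan for search; file names and transcripts are made up for the demo.

```python
# Sketch: batch-convert a set of recorded clips into searchable
# transcripts. The transcriber is injected so the demo runs offline;
# a real run would call the speech API once per clip.

from typing import Callable

def batch_transcribe(clips: dict,
                     transcribe: Callable[[bytes], str]) -> dict:
    """Map clip name -> transcript for a whole workshop's recordings."""
    return {name: transcribe(audio) for name, audio in clips.items()}

def search_transcripts(transcripts: dict, keyword: str) -> list:
    """Return names of clips whose transcript mentions the keyword."""
    kw = keyword.lower()
    return [name for name, text in transcripts.items()
            if kw in text.lower()]

# Demo with canned transcripts standing in for the speech API:
fake_api = {b"clip-a": "We escalated the complaint.",
            b"clip-b": "Budget was the main constraint."}
clips = {"alice.wav": b"clip-a", "bob.wav": b"clip-b"}

transcripts = batch_transcribe(clips, lambda audio: fake_api[audio])
print(search_transcripts(transcripts, "complaint"))
```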
Voice-Driven Branching
By combining speech recognition with logic triggers, you can design learning paths that respond to spoken keywords, sentiment, or intent. This allows for voice-controlled simulations or assessments — without relying on mouse clicks or menus.
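In its simplest form, voice-driven branching is a lookup from transcript to next scene. The sketch below uses keyword matching with a fallback branch; the keywords and scene names are illustrative, and a production version might swap the keyword scan for an intent classifier.

```python
# Sketch: choose the next scene in a simulation from a spoken answer.
# Keyword matching with a fallback branch; keywords and scene names
# are illustrative only.

BRANCHES = {
    "refund": "scene_process_refund",
    "apologise": "scene_service_recovery",
    "escalate": "scene_manager_handoff",
}

def choose_branch(transcript: str, fallback: str = "scene_clarify") -> str:
    """Return the scene triggered by the first matching keyword."""
    lowered = transcript.lower()
    for keyword, scene in BRANCHES.items():
        if keyword in lowered:
            return scene
    return fallback  # no keyword heard: ask the learner to clarify

print(choose_branch("I think we should escalate this to a manager"))
```

Because the branch decision is just a string, it plugs straight into the same trigger logic a click-driven course would use.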
Final Thoughts
As learning designers, we often look for ways to make content feel more human, more engaging, and more inclusive. Speech-to-text is a deceptively simple tool – but when used well, it opens the door to a more natural and learner-centered experience.
In our own projects, we’ve seen how adding voice input to a standard eLearning course can shift the tone of interaction. It invites participation. It lowers friction. And it gives learners another way to express themselves – one that feels closer to real-world communication.
But perhaps most importantly, it lays the groundwork for what’s coming next.
Once voice becomes a two-way channel – not just for input, but for conversation – everything changes. In our next article, we’ll explore how OpenAI’s new real-time voice API brings this to life, enabling learners to speak with an AI that listens, understands, and responds in real time.
🚀 Want expert advice on integrating AI technology into your eLearning content? Profile Learning Technologies specialises in AI-driven learning solutions tailored to your business needs.
📩 Contact us today to explore how AI speech-to-text can enhance your eLearning strategy.
Please feel free to share this article by clicking the buttons provided, and don’t forget to follow our company page on LinkedIn for news of further articles, or access free courses on this site by using the link in the footer below.