As a privacy-focused company, we implemented an on-premise transcription service instead of a hosted AI solution. Here's why.
Kaweh Ebrahimi-Far
Artificial Intelligence (AI) is transforming industries, and we’ve taken our first step by introducing AI-powered voicemail transcriptions in our communications platform, Freedom. This feature is just the beginning of a broader AI roadmap, and it serves as a pilot project for the other innovations we plan to roll out. However, unlike many companies, we’re prioritizing privacy over speed. This approach allows us to explore options that provide greater data control, such as on-premise AI solutions.
In this blog post, we’ll share insights from our experience implementing an on-premise transcription service, discuss the pros and cons of hosted vs. on-premise AI solutions, and explain why privacy-focused companies should consider an on-premise approach for their AI needs.
When it comes to integrating AI into your product, businesses generally have two options: Software as a Service (SaaS) or on-premise solutions. Each has its own benefits and drawbacks, depending on your organization’s priorities.
Hosted SaaS solutions are widely popular. Major providers offer APIs that process and transcribe audio, making it easy for businesses to add AI features with minimal effort.
Advantages:
- Minimal integration effort: a single API call adds transcription to your product.
- No infrastructure to run or maintain; the provider handles scaling and model updates.
- Pay-per-use pricing with a low cost per minute.
Disadvantages:
- Your users' audio leaves your infrastructure and is processed by a third party.
- You depend on an external provider, its pricing, and its terms.
- Every request strengthens the position of a handful of large AI providers.
On-premise hosting allows you to run AI models on your own infrastructure, giving you complete control over data privacy and processing.
Advantages:
- Complete control over where data is stored and processed.
- Independence from external AI providers and their terms.
- Competitive costs when the hardware's capacity is fully utilized.
Disadvantages:
- You need GPU hardware and the expertise to run and scale it yourself.
- Running costs are higher than the cheapest hosted APIs (2.4x in our case).
- It can become one of the most expensive options if capacity sits idle.
Choosing between SaaS and on-premise depends on your specific needs regarding privacy, control, and costs. For us, privacy was the deciding factor, leading us to select an on-premise solution.
We implemented our transcription service using Holodeck, our Kubernetes-based microservice platform. This setup allowed us to quickly build and test the new service. We chose to incorporate Whisper from OpenAI due to its strong performance, ease of use, and MIT license.
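Part of that ease of use is how little code a first prototype takes. A minimal sketch with the open-source whisper package (the audio file name is just an illustration):

```python
# Transcribe a voicemail with OpenAI's open-source whisper package.
import whisper

model = whisper.load_model("large-v3")   # downloads the weights on first use
result = model.transcribe("voicemail.wav")
print(result["text"])
```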
However, in practice, we discovered that the base OpenAI Whisper model wasn't fast enough to meet our needs. To hit our target throughput of roughly 45 minutes of voicemail audio processed per minute of wall-clock time, the original model required significant GPU resources, leading us to explore faster alternatives.
We found success by switching to Faster-Whisper, which provided a 5x speed improvement over the original model. By converting Whisper-large-v3 to a CTranslate2 model, we achieved our performance goals without needing an extensive GPU farm.
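For anyone following the same path, the conversion and the switch look roughly like this; the paths and decoding options are illustrative, not our production configuration:

```python
# One-off conversion step (run in a shell):
#   ct2-transformers-converter --model openai/whisper-large-v3 \
#       --output_dir whisper-large-v3-ct2 \
#       --copy_files tokenizer.json preprocessor_config.json \
#       --quantization float16
from faster_whisper import WhisperModel

# Load the converted CTranslate2 model on the GPU with float16 weights.
model = WhisperModel("whisper-large-v3-ct2", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus audio metadata.
segments, info = model.transcribe("voicemail.wav", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(segment.text)
```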
To avoid frequent container rebuilds when updating our microservice, we keep the model out of the service repository and manage it through a separate dataset repository. This approach allows us to ship model updates via the dataset repository without constantly rebuilding large containers.
Additionally, we opted for blob storage rather than Git LFS to store and version our model files. This decision was driven by cost-effectiveness and the infrequency of model updates, which makes blob storage a practical solution.
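In the service this amounts to downloading a pinned model version at startup. A minimal sketch, assuming the model lives in an S3-compatible bucket; the bucket name and version prefix below are hypothetical:

```python
# Fetch a pinned model version from blob storage at pod startup, so the
# container image stays small and model updates don't require rebuilds.
import os
import boto3

MODEL_BUCKET = "transcriber-models"         # hypothetical bucket name
MODEL_PREFIX = "whisper-large-v3-ct2/v1/"   # hypothetical version prefix
LOCAL_DIR = "/models/whisper-large-v3-ct2"

def fetch_model() -> str:
    s3 = boto3.client("s3")
    os.makedirs(LOCAL_DIR, exist_ok=True)
    objects = s3.list_objects_v2(Bucket=MODEL_BUCKET, Prefix=MODEL_PREFIX)
    for obj in objects.get("Contents", []):
        if obj["Key"].endswith("/"):        # skip directory placeholder keys
            continue
        target = os.path.join(LOCAL_DIR, os.path.basename(obj["Key"]))
        s3.download_file(MODEL_BUCKET, obj["Key"], target)
    return LOCAL_DIR
```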
Although we didn’t train the model ourselves – OpenAI handled that – there are still opportunities for improvement. One potential enhancement is to fine-tune the model using Hugging Face's transformers library. However, we found that the cost of allocating GPUs for minor Word Error Rate (WER) improvements didn’t justify the investment.
Another potential optimization is refining our tokenizer to give more weight to proper names in voicemail messages. This could help the model more accurately identify names, rather than making incorrect guesses.
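A related, lighter-weight lever that needs no retraining is Whisper's initial prompt, which conditions the decoder toward expected spellings. A sketch, assuming the relevant names are known up front (the names below are purely illustrative):

```python
# Bias the model toward expected proper names via an initial prompt;
# the decoder then prefers these spellings over phonetic guesses.
from faster_whisper import WhisperModel

model = WhisperModel("whisper-large-v3-ct2", device="cuda", compute_type="float16")
segments, _ = model.transcribe(
    "voicemail.wav",
    initial_prompt="Voicemail for Kaweh at Freedom.",  # illustrative names
)
print(" ".join(segment.text.strip() for segment in segments))
```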
Ever been frustrated when your phone’s voice assistant misunderstands you? The accuracy of speech recognition is measured using Word Error Rate (WER): the proportion of words the model got wrong relative to what was actually said.
WER is calculated based on:
- Substitutions (S): words transcribed as a different word
- Deletions (D): spoken words that are missing from the transcript
- Insertions (I): words in the transcript that were never spoken
Dividing the total number of errors by the number of words (N) in the reference transcript gives WER = (S + D + I) / N.
WER provides a general sense of a model’s accuracy, but it’s not perfect. It treats all mistakes equally, whether they’re minor or significant, and doesn’t account for regional accents or dialects. Improving WER remains a goal as we continue refining our service.
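To make this concrete, here is a minimal WER implementation as a word-level edit distance; it assumes simple whitespace tokenization, whereas real evaluations also normalize casing and punctuation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits between the first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions and one insertion across five reference words: WER = 0.6
print(wer("please call me back today", "please call me bag to day"))
```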
Our AI model runs on AWS Kubernetes nodes with NVIDIA T4 GPUs, optimized for AI inferencing. To ensure these nodes are dedicated to transcription tasks, we reserved them solely for transcriber workloads.
We use NVIDIA’s k8s-device-plugin to expose the GPUs to our pods; the plugin also detects how many GPUs each node has and reports that to the Kubernetes scheduler.
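In practice this combination looks roughly like the pod spec below. The taint key and image name are hypothetical stand-ins, not our actual configuration:

```yaml
# Sketch: only pods that tolerate the (hypothetical) GPU-node taint are
# scheduled there, and each pod claims one GPU via the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: transcriber
spec:
  tolerations:
    - key: workload            # hypothetical taint on the T4 nodes
      operator: Equal
      value: transcriber
      effect: NoSchedule
  containers:
    - name: transcriber
      image: registry.example.com/transcriber:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1    # resource advertised by k8s-device-plugin
```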
To optimize performance, we plan to implement a Horizontal Pod Autoscaler that adjusts the number of worker pods based on the queue size of pending voicemail messages. The cluster-autoscaler service will then scale the number of nodes accordingly, ensuring efficient use of resources.
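One way to wire this up, sketched below, is an autoscaling/v2 HorizontalPodAutoscaler driven by an external queue-length metric; the metric name and targets are illustrative and presume an adapter that publishes the queue size to the metrics API:

```yaml
# Scale worker pods on the number of queued voicemails (illustrative values).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: transcriber
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: transcriber
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: pending_voicemails   # hypothetical queue-size metric
        target:
          type: AverageValue
          averageValue: "30"         # ~30 queued messages per worker pod
```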
Currently, we are transcribing 65,800 minutes of voicemail per month, with server costs totaling $960.68, or roughly $0.0146 per minute of audio. The OpenAI API offers the lowest cost per minute ($0.006 at the time of writing, about $395 per month at our volume), but our on-premise solution remains a competitive alternative when its total capacity is considered. However, if you can’t fully utilize on-premise capacity, it can become one of the most expensive options.
Real-world implementation always brings new insights, and our journey with on-premise AI was no exception. Here are a few key learnings:
- The base Whisper model was too slow for production; switching to Faster-Whisper gave us a 5x speedup.
- Keeping the model out of the service container spared us constant rebuilds of large images.
- For infrequently updated model files, blob storage beat Git LFS on cost.
- Fine-tuning for minor WER gains didn’t justify the GPU investment.
- On-premise is only cost-competitive when the hardware stays utilized.
Given the choice between the OpenAI API and our on-premise solution, the API offers 2.4x lower running costs. However, as developers and entrepreneurs shaping the future of the internet, we have a responsibility to consider the broader implications of our technology choices. Supporting the growth of large AI monopolies may not be in the best interests of our users.
For us, on-premise AI is worth the additional effort and cost because it enables us to prioritize privacy without sacrificing user convenience. While it may have been cheaper to use a hosted API, the long-term benefits of safeguarding user data make on-premise the right choice for our company.
As an independent company facing competitive pressures, it’s tempting to adopt the quickest and cheapest solutions available. However, I urge other companies to consider on-premise AI options that prioritize privacy and provide better long-term outcomes for users. By making these choices today, we can create a more secure and equitable future for all.
16 August 2024