MoM AI is a web application that automates the process of turning meeting recordings into structured minutes. You upload an audio or video file, the system transcribes it using a speech provider of your choice (Deepgram, AssemblyAI, Sarvam AI, or ElevenLabs), and then sends the transcript to an LLM to generate organized Minutes of Meeting.
The output includes a meeting summary, key discussion points, decisions made, action items, and next steps. Everything is presented in a clean, readable format inside the browser. The app handles the entire pipeline in the background, so you can upload a recording and come back when it is done.
After attending several team meetings, I noticed that writing minutes manually was repetitive, time-consuming, and often inconsistent. Notes were scattered, key decisions got missed, and action items were forgotten.
I built this as a side project to solve that problem for myself first. It was also a good opportunity to learn how to integrate multiple AI APIs into a single pipeline, work with background task processing, and understand how different speech-to-text providers handle real-world audio.
The project is open source because I believe the pattern of combining speech-to-text with LLM summarization is useful for many developers. Sharing the full implementation felt more valuable than keeping it private.
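Writing meeting minutes manually has a few common issues: the work is repetitive and time-consuming, notes end up scattered and inconsistent, and key decisions or action items are easy to miss.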
MoM AI reduces this work to a single upload. The output follows a consistent structure every time: summary, attendees, discussion points, decisions, action items, and next steps. This makes it easier to review meetings, track responsibilities, and share outcomes with the team.
Supports Deepgram, AssemblyAI, Sarvam AI, and ElevenLabs with speaker diarization
Works with any OpenAI-compatible API - OpenAI, Groq, OpenRouter, Ollama, and others
Celery + Redis for async processing - upload and come back when it is done
Full stack runs with a single docker compose up command
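The processing flow has four clear steps:
1. You upload an audio or video recording through the web interface.
2. The backend uses ffmpeg to convert the uploaded file into a 16 kHz mono WAV file. This standardized format works reliably across all speech providers.
3. The selected speech provider transcribes the WAV file.
4. The LLM turns the transcript into structured Minutes of Meeting.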
The user interface polls the backend for progress updates and displays the final transcript and minutes once processing is complete.
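The polling loop can be sketched in a few lines. This is only an illustration in Python, not the browser code the app ships: `poll_until_done` and `fetch_status` are hypothetical names, and the payload shape is an assumption. The status strings match the ones the background task writes (`processing`, `transcribing`, `summarizing`, `completed`, `failed`).

```python
import time
from typing import Callable, Dict

# Statuses after which polling should stop (taken from the task's status values).
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval: float = 2.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() repeatedly until the meeting reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("meeting did not finish before the timeout")
        sleep(interval)


# Stand-in for the backend: each call returns the next status payload.
responses = iter([
    {"status": "processing", "progress": 10},
    {"status": "transcribing", "progress": 35},
    {"status": "completed", "progress": 100},
])
result = poll_until_done(lambda: next(responses), interval=0, sleep=lambda _: None)
```

In a real client, `fetch_status` would wrap an HTTP request to whatever status endpoint the backend exposes.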
This section shows the key parts of the implementation. The full source is on GitHub.
This is the main task that ties everything together. It converts the audio, transcribes it, sends the transcript to the LLM, and stores the result.
@celery.task(bind=True, name="app.tasks.process.process_meeting", max_retries=2)
def process_meeting(self, meeting_id: int) -> dict:
    meeting = db.session.get(Meeting, meeting_id)
    if meeting is None:
        raise ValueError(f"Meeting {meeting_id} not found")
    tmp_dir = None

    def _update(status, progress, **kwargs):
        # Persist status and progress so the UI can pick them up when it polls.
        meeting.status = status
        meeting.progress = progress
        for k, v in kwargs.items():
            setattr(meeting, k, v)
        db.session.commit()

    try:
        upload_folder = current_app.config["UPLOAD_FOLDER"]
        audio_path = os.path.join(upload_folder, meeting.stored_filename)

        # Step 1: normalize the upload to 16 kHz mono WAV.
        _update("processing", 10)
        tmp_dir = tempfile.mkdtemp()
        wav_path = convert_to_wav(audio_path, output_dir=tmp_dir)

        # Step 2: transcribe with the configured speech provider.
        _update("transcribing", 35)
        speech_provider = get_speech_provider()
        transcript = speech_provider.transcribe_file(wav_path)

        # Step 3: generate Markdown minutes and render them to HTML.
        _update("summarizing", 70)
        llm = get_llm_client()
        minutes_md = llm.generate_minutes(transcript)
        minutes_html = md_lib.markdown(minutes_md, extensions=["tables", "nl2br"])

        _update("completed", 100, transcript=transcript, minutes_of_meeting=minutes_html)
        return {"meeting_id": meeting_id, "status": "completed"}
    except Exception as exc:
        _update("failed", 0, error_message=str(exc))
        raise
    finally:
        # Remove the temporary directory together with the converted WAV.
        # Assumes `import shutil` next to the module's other imports.
        if tmp_dir:
            shutil.rmtree(tmp_dir, ignore_errors=True)
The LLM receives a structured system prompt that guides it to produce consistent, well-organized minutes every time.
def generate_minutes(self, transcript: str) -> str:
    system_prompt = (
        "You are an expert meeting analyst. Given a meeting transcript, "
        "produce well-structured **Minutes of Meeting** in Markdown with:\n\n"
        "## Meeting Summary\n"
        "A concise 2-4 sentence overview.\n\n"
        "## Attendees\n"
        "Names/roles mentioned (or 'Not specified').\n\n"
        "## Key Discussion Points\n"
        "Bullet list of main topics discussed.\n\n"
        "## Decisions Made\n"
        "Numbered list of decisions reached.\n\n"
        "## Action Items\n"
        "Table with columns: | Task | Owner | Due Date |\n\n"
        "## Next Steps\n"
        "What happens after this meeting.\n\n"
        "Be concise, professional, and accurate. "
        "Use only information from the transcript."
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Transcript:\n\n{transcript}"},
    ]
    return self.complete(messages)
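The `complete` method is not shown in this write-up. As a rough sketch of what a call to an OpenAI-compatible chat completions endpoint involves, the helper below builds the HTTP request without sending it and parses a response body. The function names, the `api.example.com` base URL, the key, and the model name are all placeholders of mine, and the real client may well use an SDK instead of raw HTTP.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: List[Dict[str, str]]) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat completions response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Placeholder values for illustration only.
request = build_chat_request(
    base_url="https://api.example.com/v1",
    api_key="placeholder-key",
    model="example-model",
    messages=[
        {"role": "system", "content": "You are an expert meeting analyst."},
        {"role": "user", "content": "Transcript:\n\nAlice: Let's ship on Friday."},
    ],
)
reply = extract_reply(b'{"choices": [{"message": {"content": "## Meeting Summary"}}]}')
```

Because only the base URL, key, and model change between providers, the same request shape can target any service that speaks this protocol.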
Providers are swappable at runtime. The admin picks a provider from the settings panel, and this factory returns the correct implementation.
def get_speech_provider() -> BaseSpeechProvider:
    provider_name = SystemConfig.get("speech_provider", "deepgram").lower()
    # Imports are deferred so only the selected provider's SDK gets loaded.
    if provider_name == "deepgram":
        from app.providers.speech.deepgram_provider import DeepgramProvider
        return DeepgramProvider(api_key=SystemConfig.get("deepgram_api_key", ""))
    elif provider_name == "assemblyai":
        from app.providers.speech.assemblyai_provider import AssemblyAIProvider
        return AssemblyAIProvider(api_key=SystemConfig.get("assemblyai_api_key", ""))
    elif provider_name == "sarvam":
        from app.providers.speech.sarvam_provider import SarvamProvider
        return SarvamProvider(api_key=SystemConfig.get("sarvam_api_key", ""))
    elif provider_name == "elevenlabs":
        from app.providers.speech.elevenlabs_provider import ElevenLabsProvider
        return ElevenLabsProvider(api_key=SystemConfig.get("elevenlabs_api_key", ""))
    else:
        # Fail loudly instead of silently returning None for an unknown name.
        raise ValueError(f"Unsupported speech provider: {provider_name!r}")
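If the list of providers keeps growing, the same lookup could be expressed as a registry instead of an if/elif chain. The sketch below is an alternative I am illustrating, not what the project does: the `Fake...` classes, `PROVIDERS`, and `make_provider` are hypothetical, and a plain dict stands in for `SystemConfig`.

```python
from typing import Callable, Dict


class BaseSpeechProvider:
    """Minimal stand-in for the project's provider base class."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def transcribe_file(self, path: str) -> str:
        raise NotImplementedError


class FakeDeepgram(BaseSpeechProvider):
    def transcribe_file(self, path: str) -> str:
        return f"[deepgram transcript of {path}]"


class FakeAssemblyAI(BaseSpeechProvider):
    def transcribe_file(self, path: str) -> str:
        return f"[assemblyai transcript of {path}]"


# Maps the configured provider name to a constructor taking an API key.
PROVIDERS: Dict[str, Callable[[str], BaseSpeechProvider]] = {
    "deepgram": FakeDeepgram,
    "assemblyai": FakeAssemblyAI,
}


def make_provider(name: str, config: Dict[str, str]) -> BaseSpeechProvider:
    """Look up the provider by name and build it with its API key from config."""
    key = name.strip().lower()
    try:
        factory = PROVIDERS[key]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown speech provider {name!r}; expected one of: {known}") from None
    return factory(config.get(f"{key}_api_key", ""))


provider = make_provider("Deepgram", {"deepgram_api_key": "placeholder-key"})
```

The trade-off is that a registry of classes imports every provider module up front, whereas the deferred imports in the factory above load only the one in use.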
All uploaded files are normalized to 16 kHz mono WAV before being sent to any speech provider. This keeps the transcription step consistent regardless of the input format.
import os
import subprocess
import tempfile
from typing import Optional


def convert_to_wav(input_path: str, output_dir: Optional[str] = None) -> str:
    output_dir = output_dir or tempfile.gettempdir()
    os.makedirs(output_dir, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(input_path))[0]
    output_path = os.path.join(output_dir, f"{base_name}_converted.wav")
    cmd = [
        "ffmpeg", "-y",      # overwrite the output file if it already exists
        "-i", input_path,
        "-ar", "16000",      # resample to 16 kHz
        "-ac", "1",          # downmix to mono
        "-vn",               # drop any video stream
        output_path,
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, timeout=600)
    if result.returncode != 0:
        # errors="replace" keeps odd bytes in ffmpeg's output from hiding the real error.
        raise RuntimeError(
            f"ffmpeg conversion failed:\n{result.stderr.decode(errors='replace')}"
        )
    return output_path
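A quick way to confirm that a converted file really is 16 kHz mono is to read its header with Python's standard `wave` module. This check is not part of the project as far as this write-up shows; `is_16k_mono_wav` and `write_silence` are hypothetical helpers, and the demo generates its own silent test files so it does not need ffmpeg. Note that `wave` only reads uncompressed PCM WAV files.

```python
import os
import tempfile
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if the WAV header reports a 16 kHz sample rate and one channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def write_silence(path: str, rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit PCM WAV file, used here only as test input."""
    frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (frames * channels))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, rate=16000, channels=1)
    write_silence(bad, rate=44100, channels=2)
    good_ok = is_16k_mono_wav(good)
    bad_ok = is_16k_mono_wav(bad)
```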
I designed and built this project end to end.
Handling different speech provider APIs and output formats
Each provider returns transcripts in a different structure. Some give word-level timestamps, others give
utterances, and speaker labels vary across providers. I had to normalize these outputs into a consistent
format that the LLM could work with reliably. This taught me the importance of building a clean abstraction
layer when dealing with multiple external APIs.
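To make the normalization idea concrete, here is a small sketch. The two input shapes (a word-level list and an utterance-level list) are simplified, hypothetical payloads rather than any provider's actual response format, and `Utterance`, `from_words`, `from_utterances`, and `to_transcript` are names I made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str
    text: str


def from_words(words: List[Dict]) -> List[Utterance]:
    """Group word-level entries into utterances, starting a new one when the speaker changes."""
    utterances: List[Utterance] = []
    for word in words:
        speaker = f"Speaker {word['speaker']}"
        if utterances and utterances[-1].speaker == speaker:
            utterances[-1].text += " " + word["word"]
        else:
            utterances.append(Utterance(speaker, word["word"]))
    return utterances


def from_utterances(items: List[Dict]) -> List[Utterance]:
    """Map utterance-level entries directly, trimming stray whitespace."""
    return [Utterance(f"Speaker {item['speaker']}", item["text"].strip()) for item in items]


def to_transcript(utterances: List[Utterance]) -> str:
    """Render the common format as plain text, one speaker turn per line."""
    return "\n".join(f"{u.speaker}: {u.text}" for u in utterances)


word_level = [
    {"word": "Let's", "speaker": 0},
    {"word": "start.", "speaker": 0},
    {"word": "Sure.", "speaker": 1},
]
utterance_level = [
    {"speaker": "A", "text": " Let's start. "},
    {"speaker": "B", "text": "Sure."},
]
text_from_words = to_transcript(from_words(word_level))
text_from_utterances = to_transcript(from_utterances(utterance_level))
```

Whatever the provider returns, the LLM then always sees the same "Speaker X: text" layout.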
Managing long-running tasks without blocking the web server
Audio files can be large, and transcription plus summarization can take minutes. I set up Celery with Redis
to handle this asynchronously. Getting the progress tracking right (so the UI could show real-time status
updates via polling) required careful state management on the database side.
Prompt engineering for consistent LLM output
Early versions of the summarization prompt produced inconsistent results. Sometimes the LLM would skip
sections, use different headings, or add unnecessary commentary. Iterating on the system prompt to get a
reliable, structured output taught me that small wording changes in prompts can significantly affect the
quality and consistency of the result.
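One way to catch skipped or renamed sections while iterating on a prompt is to check the generated Markdown for the expected headings. The helper below is a hypothetical sketch, not something the project is known to include; the heading list mirrors the sections named in the system prompt.

```python
import re
from typing import List

# Section headings the system prompt asks for, in order.
REQUIRED_SECTIONS = [
    "Meeting Summary",
    "Attendees",
    "Key Discussion Points",
    "Decisions Made",
    "Action Items",
    "Next Steps",
]


def missing_sections(minutes_md: str) -> List[str]:
    """Return the required level-2 headings that do not appear in the Markdown."""
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##[ \t]+(.+?)[ \t]*$", minutes_md, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


sample = (
    "## Meeting Summary\nShort overview.\n\n"
    "## Attendees\nNot specified\n\n"
    "## Action Items\n| Task | Owner | Due Date |\n"
)
missing = missing_sections(sample)
```

A non-empty result could be logged during prompt experiments or used to trigger a retry.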
Audio format normalization across different input types
Users upload files in all sorts of formats (MP4, WebM, M4A, OGG). Not all speech providers handle all
formats well. Converting everything to a standardized 16 kHz mono WAV using ffmpeg before transcription
solved compatibility issues across all providers.
# Clone the repository
git clone https://github.com/inboxpraveen/LLM-Minutes-of-Meeting.git
cd LLM-Minutes-of-Meeting
# Start the full stack (Flask, Celery worker, Celery beat, Redis)
docker compose up
Once running, open the web interface in your browser. Configure your speech provider and LLM settings from the admin panel, then upload a meeting recording to get started.
MoM AI is open source and built to be useful. Whether you want to automate your own meeting notes, learn how to integrate speech-to-text and LLM APIs, or just explore the codebase for ideas, you are welcome to use, fork, or improve it.
If you find it helpful or have suggestions, feel free to open an issue or contribute on GitHub.