The term ‘Multi-Modal’ refers to the ability to support more than just text, encompassing images, videos, audio and files.

Voice Assistant

Chainlit let’s you access the user’s microphone audio stream and process it in real-time. This can be used to create voice assistants, transcribe audio, or even process audio in real-time.

The user will only be able to use the microphone if you implemented the @cl.on_audio_chunk decorator.

Voice Assistant Example

Check the Audio Assistant cookbook example to see how to implement a voice assistant.

Audio capture settings

You can configure audio capture the au through the Chainlit config file.

Spontaneous File Uploads

Within the Chainlit application, users have the flexibility to attach any file to their messages. This can be achieved either by utilizing the drag and drop feature or by clicking on the attach button located in the chat bar.

Attach files to a message

As a developer, you have the capability to access these attached files through the cl.on_message decorated function.

import chainlit as cl

async def on_message(msg: cl.Message):
    if not msg.elements:
        await cl.Message(content="No file attached").send()

    # Processing images exclusively
    images = [file for file in msg.elements if "image" in file.mime]

    # Read the first image
    with open(images[0].path, "r") as f:

    await cl.Message(content=f"Received {len(images)} image(s)").send()

Image Processing with Transformers

Multi-modal capabilities are being added to Large Language Model (effectively making them Large Multi Modal Models). OpenAI’s vision API and the LLaVa cookbook are good places to start for image processing with transformers.

Disabling Spontaneous File Uploads

If you wish to disable this feature (which would prevent users from attaching files to their messages), you can do so by setting features.spontaneous_file_upload.enabled=false in your Chainlit config file.