Enable streaming


Is it possible to stream the response of the following request? I’m trying to achieve the same behaviour as generative chat AIs like OpenAI’s, where the response arrives in chunks instead of waiting for the entire response to complete.

curl --location 'https://api.vectara.io/v1/query' \
--header 'x-api-key: API_KEY' \
--header 'customer-id: CUSTOMER_ID' \
--header 'Content-Type: application/json' \
--data '{
  "query": [
    {
      "query": "Is a retrospective study allowed?",
      "start": 0,
      "numResults": 10,
      "corpusKey": [
        {
          "customerId": CUSTOMER_ID,
          "corpusId": 1
        }
      ],
      "context_config": {
        "sentences_before": 2,
        "sentences_after": 2,
        "start_tag": "%START_SNIPPET%",
        "end_tag": "%END_SNIPPET%"
      },
      "summary": [
        {
          "summarizerPromptName": "vectara-summary-ext-v1.2.0",
          "responseLang": "en",
          "maxSummarizedResults": 5
        }
      ]
    }
  ]
}'

Hi Kassen! Thanks for bringing this up. Just so I’m clear on your goals, are you trying to build a user experience like the “streaming response” in ChatGPT, where a few words appear at a time until the response is complete? Are you trying to build this in a web app, mobile app, or something else? If a web app, would you prefer to call the Query API directly from the client-side code, or proxied behind an endpoint on a server? Thanks!

Yes, you’re right on the goal. I need it for a web app initially and later for mobile. I would prefer to call the API directly, but if that’s not possible I’m OK with building a web service for it.

Got it! It’s not yet possible to consume the REST API this way. However, one solution we’re exploring is to build a client-side library that you can install via NPM.

Just spitballing here, but say you could do something like this. Would you find this useful? Would it meet your needs for the web app or would it miss the mark?

import { VectaraStore } from '@vectara/stores';

const vectaraStore = new VectaraStore({
  accountId: 'your account id',
});

const stream = vectaraStore.queryStream({
  corpusId: ['corpus id 1', 'corpus id 2'],
  apiKey: 'apiKey'
});

stream.on('data', (data) => {
  // The first chunk will be the results and
  // subsequent chunks will be sequences of words
  // that compose the summary.
});

Yes, that would be very useful, especially for long generative text responses, because you don’t want the client app waiting a long time for the AI to complete its generated response.

FYI, we do support this via gRPC today, just not via REST (yet) as @cjcenizal notes. It was a bit faster for us to implement via gRPC since gRPC is “already” streaming-capable natively. We’re putting together docs for this now and I hope to have them out in the next week, but the short of it in gRPC is you use the streaming APIs, get a future_id, and then you can correlate the Summary future_id.
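Until the docs are out, here is a rough sketch of what correlating by future_id could look like on the client. The message shapes (`kind`, `summaryFutureId`, `futureId`, `text`, `done`) and the `SummaryCorrelator` class are hypothetical names made up for illustration, not the actual gRPC messages.

```typescript
// Hypothetical message shapes; the real gRPC messages may differ.
type StreamMessage =
  | { kind: "results"; summaryFutureId: number }
  | { kind: "summaryChunk"; futureId: number; text: string; done: boolean };

class SummaryCorrelator {
  // Partial summary text, keyed by the future_id announced with the results.
  private readonly pending = new Map<number, string>();

  // Returns the full summary once its final chunk arrives, otherwise undefined.
  handle(message: StreamMessage): string | undefined {
    if (message.kind === "results") {
      // The results announce which future_id the summary will arrive under.
      this.pending.set(message.summaryFutureId, "");
      return undefined;
    }
    const soFar = this.pending.get(message.futureId);
    if (soFar === undefined) {
      return undefined; // chunk for a future we were never told about: ignore
    }
    const text = soFar + message.text;
    if (message.done) {
      this.pending.delete(message.futureId);
      return text;
    }
    this.pending.set(message.futureId, text);
    return undefined;
  }
}
```

A UI would typically also render the partial text on every chunk; this sketch only shows the bookkeeping that ties summary chunks back to the query that produced them.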

On REST, do you have a preferred mechanism for async responses if you had a choice (e.g. WebSockets, server-sent events, etc.), or would you be pretty indifferent as long as the client-side library supported the functionality natively? (And would you mind sharing any details on your preferred language(s)?)

Yeah, I noticed the console website uses a streaming API to get the Summary on the Search page.


It looked private and I couldn’t find any documentation for it. That’s why I was asking about it here. That’s fine, though; for now I can wait for the documentation.

For the REST API, my preference would be a keep-alive connection with JSON responses. That’s what OpenAI (ChatGPT) uses, and I find it easy to implement in the client app. Below is a mock chat API I built in Node.js and TypeScript, hosted as a Firebase Function. If the library supports streaming, I would write something similar to the code below and have it stream the responses back to the client app.

import { onRequest } from 'firebase-functions/v2/https';

/// curl --location 'chatMock' \
/// --data '{"question":"My question here", "history": []}'
export const chatMock = onRequest(async (request, response): Promise<void> => {
    const { question, history } = request.body;

    if (!question) {
        response.status(400).json({ message: 'No question in the request' });
        return;
    }

    // Keep the connection open and send the response as an event stream
    response.writeHead(200, {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache, no-transform',
        'Connection': 'keep-alive',
    });

    // Each message is one `data:` line followed by a blank line
    const sendData = (data: string) => {
        response.write(`data: ${data}\n\n`);
    };

    sendData(JSON.stringify({ token: '' }));

    let currentIndex = 0;
    const fullMessage = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis eu tortor fermentum, dictum nulla sit amet, scelerisque felis. Sed volutpat rutrum lorem, nec fermentum tellus laoreet sit amet. Nulla facilisi. Etiam maximus elit ut sagittis vestibulum. Fusce hendrerit, nisl vitae vestibulum aliquet, nisl neque volutpat odio, nec interdum diam tellus a enim. Nunc feugiat pharetra iaculis. Suspendisse dignissim metus non elit dictum, vitae congue erat commodo. Donec vulputate justo et purus dictum blandit. In eleifend elit in mauris feugiat, id viverra metus semper.";

    // Send the message 10 characters at a time to imitate token streaming
    function sendBatch() {
        const endIndex = Math.min(currentIndex + 10, fullMessage.length);
        const batchText = fullMessage.slice(currentIndex, endIndex);

        sendData(JSON.stringify({ token: batchText }));

        if (endIndex < fullMessage.length) {
            currentIndex = endIndex;

            // Schedule the next batch after the delay
            setTimeout(sendBatch, 50);
        } else {
            const lastContent = {
                text: fullMessage,
                sources: [
                    {
                        pageContent: "PAGE CONTENT HERE",
                        metadata: { source: "Sample.pdf" }
                    }
                ]
            };

            // Assumed final step: send the complete message with its
            // sources, then close the stream
            sendData(JSON.stringify(lastContent));
            response.end();
        }
    }

    // Start sending the batches
    sendBatch();
});
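
On the client side, a stream like this could be consumed roughly as follows. This is only a sketch: it assumes each message arrives as a `data: <json>` line followed by a blank line, as the mock's `sendData` writes it, and that network chunks can split a message at any point. The names `parseSseFrames` and `collectTokens` are made up for illustration.

```typescript
interface ParsedFrames {
  payloads: string[]; // `data:` payloads of the complete messages
  rest: string;       // incomplete trailing text, kept for the next chunk
}

// Split buffered text into complete messages (each terminated by a blank line).
function parseSseFrames(buffer: string): ParsedFrames {
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? ""; // the last piece has no terminator yet
  const payloads = parts
    .filter((part) => part.startsWith("data: "))
    .map((part) => part.slice("data: ".length));
  return { payloads, rest };
}

// Feed network chunks in arrival order and join the streamed tokens.
function collectTokens(chunks: string[]): string {
  let buffer = "";
  let text = "";
  for (const chunk of chunks) {
    const { payloads, rest } = parseSseFrames(buffer + chunk);
    buffer = rest;
    for (const payload of payloads) {
      const message = JSON.parse(payload) as { token?: string };
      // Messages without a `token` field (such as a final summary) are skipped.
      if (message.token !== undefined) {
        text += message.token;
      }
    }
  }
  return text;
}
```

In a real app the chunks would come from the HTTP response body as it is read, and the UI would be updated after every token rather than once at the end.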

On the backend, my preferred language is JavaScript or TypeScript on Node.js. I’m currently building a web app using Flutter and will later expand to a mobile app, so Dart would be my preferred language for now.