Cannot index files via /upload

I tried using the curl API as described here: FileUpload | Vectara Docs

On my first attempt, I tried to add some metadata and got: {"httpCode":400,"internalCode":400,"details":"Metadata is not properly formatted json.","status":{"code_":400,"statusDetail_":"Metadata is not properly formatted json.","memoizedIsInitialized":1,"unknownFields":{"fields":{}},"memoizedSize":-1,"memoizedHashCode":0}}

I got this when I tried omitting the metadata: {"httpCode":500,"internalCode":13,"details":"Request failed; contact support if the error persists."}

I'm having a lot of trouble getting anything indexed.

Full code:

INPUT_VECTARA_ACCOUNT_NUMBER=XXX
INPUT_VECTARA_CORPUS_NUMBER=1
INPUT_VECTARA_CLIENT_ID=XXX
INPUT_VECTARA_CLIENT_SECRET=XXX
FULL_FILE_PATH=/some/file

AUTH_ENDPOINT="https://vectara-prod-$INPUT_VECTARA_ACCOUNT_NUMBER.auth.us-west-2.amazoncognito.com"
echo "::debug::Auth Endpoint: $AUTH_ENDPOINT"
VECTARA_RESPONSE=$(curl -XPOST -H 'Content-type: application/x-www-form-urlencoded' -d "grant_type=client_credentials&client_id=$INPUT_VECTARA_CLIENT_ID&client_secret=$INPUT_VECTARA_CLIENT_SECRET" "$AUTH_ENDPOINT/oauth2/token" | jq -r '.access_token' )
JWT_TOKEN=$VECTARA_RESPONSE

curl -L -X POST 'https://api.vectara.io/v1/upload?c=2958240080&o=1' \
    -H "Authorization: Bearer $JWT_TOKEN" \
    -F 'doc_metadata="{\"source\":\"git\",\"url\":\"foo\"}"' \
    -F "file=@$FULL_FILE_PATH"
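
One thing worth ruling out before debugging the upload itself is an empty token: `jq -r '.access_token'` prints the literal string `null` when the auth response has no such field, and the upload would then be sent with `Bearer null`. A minimal sketch of a guard, assuming the script above; `token_looks_valid` is a hypothetical helper name, not part of any Vectara tooling:

```shell
# Hypothetical helper: succeeds only if the argument looks like a usable token.
# jq -r prints the literal string "null" when .access_token is absent.
token_looks_valid() {
    [ -n "$1" ] && [ "$1" != "null" ]
}

# Usage (commented out so the block has no side effects):
# token_looks_valid "$JWT_TOKEN" || { echo "::error::No access token returned" >&2; exit 1; }
```

This only checks that something token-shaped came back; it does not verify that the token is accepted by the API.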

Hi Steve,
I tried this locally (with metadata) on a PDF file and it seems to work fine. Is the file you’re trying to upload something you can share, so I can try to replicate with exactly that file?

Oh weird, I was able to upload a PDF just fine too. I was trying to upload a Markdown file (.md). Not something I want to share publicly.

When I try to upload the Markdown file, I get this now:

{"httpCode":400,"internalCode":400,"details":"Metadata is not properly formatted json.","status":{"code_":400,"statusDetail_":"Metadata is not properly formatted json.","memoizedIsInitialized":1,"unknownFields":{"fields":{}},"memoizedSize":-1,"memoizedHashCode":0}}

But if I try the same command with the SAME JSON payload and a PDF file, it works.

As far as I can tell it fails on all markdown files.

Hey Steve. I tested out your code using two different markdown files, and it worked for both of them. The first one is private, but the second one is public: vectara-answer/README.md at main · vectara/vectara-answer · GitHub

Can you test this one out and see if it works for you? If so, then it seems to be something to do with your markdown files.

Your markdown file uploaded just fine as is… BUT, if I modify its contents to be just the text “Hello”, it fails. There seems to be some required markdown format?

Weird, it works if I use an API key instead of a JWT.

curl -v -L "https://api.vectara.io/v1/upload?c=$INPUT_VECTARA_ACCOUNT_NUMBER&o=$INPUT_VECTARA_CORPUS_NUMBER" \
    -H "x-api-key: $API_KEY" \
    -F "file=@$FULL_FILE_PATH;type=text/markdown"

But I'd much rather use OAuth.
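
The command above sets the file part's content type explicitly with curl's `;type=` suffix. If the upload step handles several kinds of files, that choice can be made per extension. The sketch below assumes only `.md` and `.markdown` files need the override, and `file_form_arg` is a hypothetical helper name. Whether the content type has any effect on the failure in this thread is not established.

```shell
# Hypothetical helper: prints the value for curl's -F option for a given path.
# Markdown files get an explicit content type; others use curl's default.
file_form_arg() {
    case "$1" in
        *.md|*.markdown) printf 'file=@%s;type=text/markdown\n' "$1" ;;
        *)               printf 'file=@%s\n' "$1" ;;
    esac
}

# Usage (commented out; needs network access and credentials):
# curl -L "https://api.vectara.io/v1/upload?c=$INPUT_VECTARA_ACCOUNT_NUMBER&o=$INPUT_VECTARA_CORPUS_NUMBER" \
#     -H "x-api-key: $API_KEY" \
#     -F "$(file_form_arg "$FULL_FILE_PATH")"
```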

This means this GitHub Action is broken: vectara-index-git-docs/action.yml at main · vectara/vectara-index-git-docs · GitHub

Hi Steve. We identified a bug where the indexing behavior differs depending on whether the doc_metadata field is included. If I remove that field, both the short markdown file and a longer one upload successfully. We’ll be resolving that issue ASAP.

Following up here. We just pushed a fix out that resolves the issue where processing of the doc_metadata field fails in some scenarios.