Website_crawler Amazon

I’m trying to parse a page with new products on Amazon, but for some reason it doesn’t work.

vectara:
  corpus_id: 6
  customer_id: 2252938633
  reindex: true

crawling:
  crawler_type: website

website_crawler:
  root_domain: https://www.amazon.com/
  urls:
    - 'https://www.amazon.com/s?i=fashion-girls-intl-ship&bbn=16225020011&rh=n%3A7141123011%2Cn%3A16225020011%2Cn%3A1040664&s=date-desc-rank&ds=v1%3A3%2BolR%2FqXkl2wvwlA54O25aP8MqJM9RF34uzK8YeEIDI&qid=1706978479&ref=sr_st_date-desc-rank'
  pos_regex: ['https://www.amazon.com/s\?.*ref=sr_st_date-desc-rank.*']
  max_depth: 6
  delay: 1
  pages_source: crawl
  extraction: playwright

What am I doing wrong? I need to parse the new products, which sit up to 6 pages deep in the listing. How can this be done?

Yes, it looks like the extracted title was “Something went wrong!”, as we can see in the results.
I tried to replicate this locally and it works for me. Can you please remove the indexed documents (you can do this in the Vectara console) and try to index again?

I removed the documents and indexed again, and I still get: “Something went wrong!”

How can I fix this?
In this video you somehow got a list of products from Amazon.

I still think it’s something to do with the crawling and Amazon blocking it.
May I suggest two options:

  1. Do you have access to another machine (e.g. on EC2 or similar) where you could run the vectara-ingest job?
  2. As a manual override - perhaps download the HTML of the page directly and then upload via the console to Vectara - to see if that overcomes the problem initially. How many product pages are you looking to ingest in this way?
  1. I have an account, so I can try that.
  2. My task is to monitor new products on marketplaces, so the pages are three levels deep. I managed to parse them, but I want to enable crawling for neural search.
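The manual override suggested in option 2 above could be scripted; a minimal sketch (the `fetch_page` helper and the User-Agent string are just illustrations of a desktop-browser request, and Amazon may still block datacenter IPs regardless of headers):

```shell
# Hypothetical helper: download one results page with a desktop browser
# User-Agent so the HTML can be uploaded to Vectara via the console.
fetch_page() {
  local url=$1 out=$2
  curl -sL --compressed \
    -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36" \
    -o "$out" "$url"
}

# Example (using the search URL from the config above):
# fetch_page 'https://www.amazon.com/s?i=fashion-girls-intl-ship&...' page1.html
```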

I tried to run the project through Amplify, but got an error.

Dockerfile

FROM --platform=linux/amd64 ubuntu:22.04

ENV DEBIAN_FRONTEND noninteractive
RUN sed 's/main$/main universe/' -i /etc/apt/sources.list
RUN apt-get update
RUN apt-get upgrade -y

# Install Python
RUN apt-get install -y python3
RUN apt-get install -y python3-pip
RUN pip3 install --upgrade pip

RUN apt-get install -y libpq-dev

# Install psycopg2
RUN pip install psycopg2-binary
RUN pip3 install psycopg2

# Download and install stuff
RUN apt-get install -y build-essential xorg libssl-dev libxrender-dev wget git curl
RUN apt-get install -y --no-install-recommends xvfb libfontconfig libjpeg-turbo8 xfonts-75dpi fontconfig
RUN apt-get update
RUN apt-get install -y vim
RUN apt install -y unixodbc

RUN wget http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.0g-2ubuntu4_amd64.deb
RUN dpkg -i libssl1.1_1.1.0g-2ubuntu4_amd64.deb

# Install wkhtmltopdf stuff
RUN wget --no-check-certificate https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-1/wkhtmltox_0.12.6-1.focal_amd64.deb
RUN dpkg -i wkhtmltox_0.12.6-1.focal_amd64.deb
RUN rm wkhtmltox_0.12.6-1.focal_amd64.deb



RUN apt-get update
RUN apt-get install -y poppler-utils tesseract-ocr libtesseract-dev 

ENV HOME /home/vectara
ENV XDG_RUNTIME_DIR=/tmp
WORKDIR ${HOME}

RUN pip3.10 install poetry

COPY poetry.lock pyproject.toml $HOME/
RUN poetry config virtualenvs.create false
RUN poetry install --no-dev

RUN python3 -m spacy download en_core_web_lg
RUN playwright install --with-deps firefox

COPY *.py $HOME/
COPY core/*.py $HOME/core/
COPY crawlers/ $HOME/crawlers/

EXPOSE 80

ENTRYPOINT ["/bin/bash", "-l", "-c"]
#CMD ["tail -f /dev/null"]
CMD ["python3 ingest.py $CONFIG $PROFILE"]
$ amplify push
Fetching updates to backend environment: dev from the cloud.
✔ Successfully pulled backend environment dev from the cloud.

    Current Environment: dev
    
┌──────────┬───────────────────┬───────────┬───────────────────┐
│ Category │ Resource name     │ Operation │ Provider plugin   │
├──────────┼───────────────────┼───────────┼───────────────────┤
│ Api      │ container3ee454cf │ Create    │ awscloudformation │
└──────────┴───────────────────┴───────────┴───────────────────┘
✔ Are you sure you want to continue? (Y/n) · yes

In a few moments, you can check image build status for container3ee454cf at the following URL:
https://us-east-1.console.aws.amazon.com/codesuite/codepipeline/pipelines/${Token[TOKEN.804]}-container3ee454cf-service-api-80/view

It may take a few moments for this to appear. If you have trouble with first time deployments, please try refreshing this page after a few moments and watch the CodeBuild Details for debugging information.

Deployment failed.
Deploying root stack amplifyecommerce
        amplify-amplifyecommerce-dev-… AWS::CloudFormation::Stack
        apicontainer3ee454cf           AWS::CloudFormation::Stack

🛑 The following resources failed to deploy:
🛑 Resource is not in the state stackUpdateComplete

LOGS

[Container] 2024/02/05 14:19:00.773673 Running command aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

[Container] 2024/02/05 14:19:01.412697 Running command COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | md5sum | cut -c 1-7)

[Container] 2024/02/05 14:19:01.424327 Running command IMAGE_TAG=${COMMIT_HASH:=latest}

[Container] 2024/02/05 14:19:01.429444 Phase complete: PRE_BUILD State: SUCCEEDED
[Container] 2024/02/05 14:19:01.429458 Phase context status code:  Message: 
[Container] 2024/02/05 14:19:01.462805 Entering phase BUILD
[Container] 2024/02/05 14:19:01.463403 Running command echo Build started on `date`
Build started on Mon Feb 5 14:19:01 UTC 2024

[Container] 2024/02/05 14:19:01.471217 Running command echo Building the Docker image...
Building the Docker image...

[Container] 2024/02/05 14:19:01.475911 Running command docker build -t $api_REPOSITORY_URI:latest ./.
Sending build context to Docker daemon   1.18MB

Step 1/38 : FROM --platform=linux/amd64 ubuntu:22.04
22.04: Pulling from library/ubuntu
toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

[Container] 2024/02/05 14:19:01.788403 Command did not exit successfully docker build -t $api_REPOSITORY_URI:latest ./. exit status 1
[Container] 2024/02/05 14:19:01.795364 Phase complete: BUILD State: FAILED
[Container] 2024/02/05 14:19:01.795384 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: docker build -t $api_REPOSITORY_URI:latest ./.. Reason: exit status 1
[Container] 2024/02/05 14:19:01.833981 Entering phase POST_BUILD
[Container] 2024/02/05 14:19:01.834582 Running command echo Build completed on `date`
Build completed on Mon Feb 5 14:19:01 UTC 2024

[Container] 2024/02/05 14:19:01.840808 Running command echo Pushing the Docker images..
Pushing the Docker images..

[Container] 2024/02/05 14:19:01.845527 Running command docker push $api_REPOSITORY_URI:latest
The push refers to repository [123118798050.dkr.ecr.us-east-1.amazonaws.com/amplify-amplifyecommerce-dev-185819-api-container5e2ffb5e-api]
An image does not exist locally with the tag: 123118798050.dkr.ecr.us-east-1.amazonaws.com/amplify-amplifyecommerce-dev-185819-api-container5e2ffb5e-api

[Container] 2024/02/05 14:19:01.867469 Command did not exit successfully docker push $api_REPOSITORY_URI:latest exit status 1
[Container] 2024/02/05 14:19:01.871089 Phase complete: POST_BUILD State: FAILED
[Container] 2024/02/05 14:19:01.871107 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: docker push $api_REPOSITORY_URI:latest. Reason: exit status 1
[Container] 2024/02/05 14:19:01.964497 Expanding base directory path: .
[Container] 2024/02/05 14:19:01.966494 Assembling file list
[Container] 2024/02/05 14:19:01.966508 Expanding .
[Container] 2024/02/05 14:19:01.968280 Expanding file paths for base directory .
[Container] 2024/02/05 14:19:01.968293 Assembling file list
[Container] 2024/02/05 14:19:01.968297 Expanding imagedefinitions.json
[Container] 2024/02/05 14:19:01.969909 Skipping invalid file path imagedefinitions.json
[Container] 2024/02/05 14:19:01.969932 Phase complete: UPLOAD_ARTIFACTS State: FAILED
[Container] 2024/02/05 14:19:01.969968 Phase context status code: CLIENT_ERROR Message: no matching artifact paths found

I’m totally not an expert on Amplify, but the error seems to be:

“toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit”

which might mean you are hitting Docker Hub’s rate limit on image pulls. Maybe you need to log in to a Docker account? Not sure.
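If Amplify lets you customize the CodeBuild buildspec for the container build, the usual fix is to authenticate to Docker Hub before the image is pulled. A hedged sketch (I am not sure where Amplify exposes this buildspec; `DOCKERHUB_USER` and `DOCKERHUB_TOKEN` are placeholder environment variables you would store yourself, e.g. in Secrets Manager):

```yaml
# Hypothetical pre_build addition to the generated buildspec.
phases:
  pre_build:
    commands:
      - echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
```

Alternatively, if I remember correctly, Canonical publishes the same base image on the public ECR gallery, so changing the Dockerfile to `FROM public.ecr.aws/ubuntu/ubuntu:22.04` would bypass Docker Hub entirely.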

I only meant to set up a simple VM in EC2 (not via Amplify), then SSH in, clone the vectara-ingest repo, and try from there. Is that possible on your end?

Error in VM EC2

 ---> 9235f512bd06
Step 30/37 : RUN poetry install --no-dev
 ---> Running in 1da2f92038b2
Skipping virtualenv creation, as specified in config file.
The `--no-dev` option is deprecated, use the `--only main` notation instead.
Installing dependencies from lock file
Warning: poetry.lock is not consistent with pyproject.toml. You may be getting improper dependencies. Run `poetry lock [--no-update]` to fix it.

Package operations: 228 installs, 18 updates, 0 removals

  • Installing nvidia-nvjitlink-cu12 (12.3.101)
  • Installing markupsafe (2.1.3)
  • Installing mpmath (1.3.0)
  • Installing nvidia-cublas-cu12 (12.1.3.1)
  • Installing nvidia-cusparse-cu12 (12.1.0.106)
Killed
The command '/bin/sh -c poetry install --no-dev' returned a non-zero code: 137
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'yaml'
Unable to find image 'vectara-ingest:latest' locally
docker: Error response from daemon: pull access denied for vectara-ingest, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
Ingest container failed to start. Please check the messages above.

Looks like it got killed in generating the Docker container. How much RAM is in the EC2 instance? Perhaps try with more RAM?

[ec2-user@ip-172-31-31-69 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           952M        167M        584M        436K        200M        638M
Swap:            0B          0B          0B

I’ll try to set up another instance.
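For what it’s worth, exit code 137 is the kernel’s OOM kill, so as an alternative to resizing the instance you could try adding swap; a sketch, assuming a stock Linux EC2 host with sudo (the 4G size is illustrative, and `enable_swap` is a hypothetical helper name):

```shell
# Hypothetical helper: create and enable a swap file (requires sudo).
enable_swap() {
  local swapfile=${1:-/swapfile} size=${2:-4G}
  sudo fallocate -l "$size" "$swapfile"   # reserve the file
  sudo chmod 600 "$swapfile"              # swap must not be world-readable
  sudo mkswap "$swapfile"                 # format it as swap space
  sudo swapon "$swapfile"                 # enable it immediately
}

# enable_swap /swapfile 4G   # then re-run the Docker build
```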

I can’t get this working on Amazon’s EC2. Please, we need a deployment tutorial; otherwise the work is blocked.

The following error occurred when trying to handle this error:


  ChunkedEncodingError

  ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

  at ~/.local/share/pypoetry/venv/lib64/python3.9/site-packages/requests/models.py:818 in generate
       814│             if hasattr(self.raw, "stream"):
       815│                 try:
       816│                     yield from self.raw.stream(chunk_size, decode_content=True)
       817│                 except ProtocolError as e:
    →  818│                     raise ChunkedEncodingError(e)
       819│                 except DecodeError as e:
       820│                     raise ContentDecodingError(e)
       821│                 except ReadTimeoutError as e:
       822│                     raise ConnectionError(e)

Cannot install nvidia-cusparse-cu12.

This error seems to indicate a network timeout error.
What region of EC2 is this deployed to?

EC2 Instance - us-west-2
Amazon Linux 2023 AMI 2023.3.20240131.0 x86_64 HVM kernel-6.1
t2.micro
30GB

Can you please try a t2.xlarge with ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230325?
(and please try the latest version of vectara-ingest)

Ofer

Finally the project started on the server

However, Amazon apparently blocks the crawler’s requests.

I got Amazon parsing working through Puppeteer + https://www.browserless.io/. Maybe browserless could be used in website_crawler.py?
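For reference, browserless also exposes a plain HTTP API, so no Puppeteer code is strictly needed. A hedged sketch based on my reading of their docs (the `/content` endpoint and `token` parameter should be verified against browserless’s current documentation; `BROWSERLESS_TOKEN` is a placeholder for your API key):

```shell
# Hypothetical helper: ask browserless to render a page and return its HTML.
fetch_rendered() {
  local url=$1
  curl -s "https://chrome.browserless.io/content?token=$BROWSERLESS_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"url\": \"$url\"}"
}

# fetch_rendered 'https://www.amazon.com/s?...' > page1.html
```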

One thing you can try is use the “pdf” option instead of “playwright” in the extraction method with vectara-ingest: vectara-ingest/crawlers/CRAWLERS.md at 14b2fa1511e6b904414d449260273256eeb791dc · vectara/vectara-ingest · GitHub

Does changing this help?
(What that does is download the page locally as a PDF and then index it as if it were a PDF file.)
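Concretely, that would be a one-line change to the config at the top of this thread:

```yaml
website_crawler:
  # ...same settings as before...
  extraction: pdf   # render each page to a local PDF, then index the PDF
```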

I decided to use Apify. You can close the question.