How to crawl a protected site

Sid_Ravinutala · March 24, 2023, 4:40pm

Hi,

I am trying to use the web crawler to ingest data. The site is a Google Site and is protected.

https://sites.google.com/idinsight.org/thehub/

Is it possible to pass the username and password to the web crawler? If not, is there another way to authenticate and use the crawler?

shane · March 24, 2023, 6:27pm

Hi @Sid_Ravinutala , and welcome!

At this point, the web crawler doesn’t provide any authentication mechanism. I’m sure it’d be possible to modify the crawler to add authentication, but the code isn’t really set up to handle that right now. You’d need to modify crawler.py (on the main branch of the vectara/web-crawler repository on GitHub) by either importing an existing session or logging in, and then persisting that session throughout the crawler instance. The crawler uses pyhtml2pdf for PDF generation, and that library may need some modifications as well, depending on how the authentication is set up.
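
As a rough illustration of the "log in, then persist the session" idea, here is a minimal sketch using only the Python standard library. It is not taken from crawler.py: the `AuthenticatedFetcher` class, its method names, and the `username`/`password` form fields are all hypothetical, and it assumes a simple form-based login that sets a session cookie. A Google-protected site would likely need a different sign-in flow, and pyhtml2pdf would still need to be given the same session separately.

```python
import http.cookiejar
import urllib.parse
import urllib.request


class AuthenticatedFetcher:
    """Hypothetical helper: log in once, then reuse the cookies for every fetch."""

    def __init__(self):
        # One cookie jar shared by every request made through this opener.
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )

    def login(self, login_url, username, password):
        # Assumption: the site accepts a form POST with these field names.
        form = urllib.parse.urlencode(
            {"username": username, "password": password}
        ).encode("utf-8")
        with self.opener.open(login_url, data=form, timeout=30) as resp:
            return resp.status

    def fetch(self, url):
        # Cookies stored during login() are sent automatically here.
        with self.opener.open(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
```

A crawler would create one `AuthenticatedFetcher`, call `login(...)` once, and then route every page request through `fetch(...)` so the session cookie travels with each request.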

In general, I suggest that folks use web crawling as a last resort. If you happen to have the source data in some other format (a JSON file, etc.), it’s almost always better, faster, and more robust to ingest that.

Sid_Ravinutala · March 25, 2023, 1:35am

Thanks @shane! Since it is a Google Site, it’s not easy to extract as JSON. I’ll explore modifying the web-crawler (or other options).

Thanks again for the prompt reply. Excited to use Vectara!
