How to crawl a protected site


I am trying to use the web crawler to ingest data. It is a google site and is protected.

Is it possible to pass a username and password to the web crawler? If not, is there another way to authenticate and use the crawler?

Hi @Sid_Ravinutala , and welcome!

At this point, the web crawler doesn’t provide an authentication mechanism. It’d be possible to add one, but the code isn’t really set up for that right now. You’d need to modify web-crawler/ at main · vectara/web-crawler · GitHub to either import an existing session or log in and then persist that session throughout the crawler instance. The crawler uses pyhtml2pdf for PDF generation, and that library may need modifications as well, depending on how the authentication is set up.
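For anyone exploring this, here’s a rough sketch of the “persist the session” idea using only the Python standard library: login cookies are kept in a cookie jar file on disk, so every page fetch in a crawl (and later crawler runs) reuses them. The file name and helper names here are my own placeholders, not part of vectara/web-crawler, and an actual Google-hosted site would need its own login flow on top of this.

```python
# Sketch: persisting authenticated-session cookies across crawler runs.
# COOKIE_FILE and the helper names are hypothetical placeholders; the
# real crawler code would need to be adapted to route its requests
# through an opener like this one.
import http.cookiejar
import urllib.request

COOKIE_FILE = "crawler_cookies.txt"


def make_opener(cookie_file: str = COOKIE_FILE):
    """Build a URL opener whose cookies survive between runs."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    try:
        jar.load(ignore_discard=True)  # reuse cookies from a prior login
    except FileNotFoundError:
        pass  # first run: nothing saved yet
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def save_cookies(jar: http.cookiejar.MozillaCookieJar) -> None:
    """Persist session cookies so later fetches stay authenticated."""
    jar.save(ignore_discard=True)
```

After logging in once through the opener (however the target site requires), calling `save_cookies(jar)` means the next `make_opener()` picks up where you left off.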

In general, I suggest that folks use web crawling as a last resort. If you happen to have the source data in some other format (a JSON file, etc.), it’s almost always better, faster, and more robust to ingest that directly.
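To illustrate the direct-ingestion route: if you can export the content as JSON, a small mapping step turns the records into documents ready for upload. The record fields (`title`, `body`) and the output shape below are purely illustrative assumptions, not Vectara's actual indexing schema; check the API docs for the real format.

```python
# Sketch: turning exported JSON records into simple documents for
# ingestion, instead of crawling the rendered pages. The field names
# and output shape are assumptions for illustration only.
import json


def records_to_documents(json_text: str) -> list:
    """Map raw JSON records to plain document dicts."""
    records = json.loads(json_text)
    return [
        {
            "id": str(i),
            "title": record.get("title", ""),
            "text": record.get("body", ""),
        }
        for i, record in enumerate(records)
    ]
```

Each resulting dict can then be sent to whatever indexing endpoint you use, with no headless browser or PDF step involved.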

Thanks @shane! Since it’s a Google site, it’s not easy to extract the data as JSON. I’ll explore modifying the web-crawler (or other options).

Thanks again for the prompt reply. Excited to use Vectara!