ModuleNotFoundError: No module named 'crawlers.docusaurus_crawler'

Run script

vectara:
  corpus_id: 4
  customer_id: 2252938633
  reindex: false

crawling:
  crawler_type: docusaurus

docs_crawler:
  docs_repo: 'https://github.com/gHashTag/jscamp'
  docs_homepage: 'https://www.jscamp.app/docs'
  docs_system: 'docusaurus'
  ray_workers: 0df or playwright

Run Docker

=> => writing image sha256:40bdff873d9242cc94d6cf624d2619075d290477e1ee7ab88d22  0.0s
 => => naming to docker.io/library/vectara-ingest:latest                          0.0s

What's Next?
  View summary of image vulnerabilities and recommendations → docker scout quickview
vingest
e2d90896857f1994209a04227b037bc11760a5be6ed55db9899f50219485d268
Success! Ingest job is running.
You can try 'docker logs -f vingest' to see the progress.

[I] ➜ docker logs -f vingest
Traceback (most recent call last):
  File "/home/vectara/ingest.py", line 151, in <module>
    main()
  File "/home/vectara/ingest.py", line 129, in main
    crawler = instantiate_crawler(Crawler, 'crawlers', f'{crawler_type.capitalize()}Crawler', cfg, endpoint, customer_id, corpus_id, api_key)
  File "/home/vectara/ingest.py", line 20, in instantiate_crawler
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'crawlers.docusaurus_crawler'

What is the problem?

Hi there,
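The crawler type should be “crawler_type: docs” instead of “crawler_type: docusaurus”. There is no crawler module named after Docusaurus, which is why the import fails; the docs crawler is the one to use, with “docs_system: 'docusaurus'” set under docs_crawler.
Please see config/docs-langchain.yaml in the vectara/vectara-ingest repository on GitHub for a working example.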
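
To make the failure easier to see, here is a minimal sketch of how the crawler_type value appears to turn into a module and class name. The class-name expression is taken from the traceback above; the module-name pattern (`crawlers.<crawler_type>_crawler`) is an assumption inferred from the error message, and `crawler_names` and `crawler_exists` are hypothetical helpers, not part of vectara-ingest.

```python
import importlib


def crawler_names(crawler_type: str, package: str = "crawlers") -> tuple:
    """Return (module_name, class_name) for a given crawler_type.

    Assumption: the module is named '<package>.<crawler_type>_crawler',
    as suggested by the ModuleNotFoundError in the traceback.
    """
    module_name = f"{package}.{crawler_type}_crawler"
    # Same expression as in the traceback: f'{crawler_type.capitalize()}Crawler'
    class_name = f"{crawler_type.capitalize()}Crawler"
    return module_name, class_name


def crawler_exists(crawler_type: str, package: str = "crawlers") -> bool:
    """Return True if the derived module can be imported, False otherwise."""
    module_name, _ = crawler_names(crawler_type, package)
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError:
        # Raised when the module (or its parent package) cannot be found,
        # and also when the module itself imports something that is missing.
        return False
    return True
```

With “crawler_type: docusaurus” this yields the module name 'crawlers.docusaurus_crawler', which is exactly the name in the error. Running `crawler_exists("docs")` from the vectara-ingest directory would be one way to check a value before building the Docker image.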

Yes, that worked. Here is the config that runs:

vectara:
  corpus_id: 4
  customer_id: 2252938633
  reindex: false

crawling:
  crawler_type: docs

docs_crawler:
  base_urls: ['https://www.jscamp.app']
  pos_regex: ['https://www.jscamp.app/docs.*']
  docs_repo: 'https://github.com/gHashTag/jscamp'
  docs_homepage: 'https://www.jscamp.app/docs'
  docs_system: 'docusaurus'
  neg_regex: []
  extensions_to_ignore: ['md', 'rst', 'ipynb', 'txt']
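
For anyone adapting this config, here is a rough sketch of how the pos_regex and neg_regex lists could be checked against a URL before crawling. It assumes a URL is kept when it matches at least one positive pattern and no negative pattern; that reading of the two settings is an assumption, and `url_allowed` is a hypothetical helper, not a vectara-ingest function.

```python
import re


def url_allowed(url: str, pos_regex: list, neg_regex: list) -> bool:
    """Return True if url passes the positive and negative filters.

    Assumed semantics: keep the URL when it matches at least one pattern
    in pos_regex (or pos_regex is empty) and none of the neg_regex patterns.
    """
    pos = [re.compile(p) for p in pos_regex]
    neg = [re.compile(n) for n in neg_regex]
    if pos and not any(p.match(url) for p in pos):
        return False
    return not any(n.match(url) for n in neg)


# Values copied from the config above.
POS = ["https://www.jscamp.app/docs.*"]
NEG = []

docs_ok = url_allowed("https://www.jscamp.app/docs/intro", POS, NEG)
blog_ok = url_allowed("https://www.jscamp.app/blog", POS, NEG)
```

Under these assumptions `docs_ok` is True and `blog_ok` is False, so only pages under /docs would be picked up.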

Glad to hear!
Let me know if I can help with anything else.
