Knowledge Base Details

This page contains the key information about your web-based / manual data ingestions.

What this page does

Define content to ingest: You provide the web addresses (URLs) that make up your knowledge base.
Start & monitor ingestion: You run a crawl and check its live status from here.

Which URLs would you like to select?

Enter one URL per line, optionally followed by the product and version it relates to, with the parts separated by a pipe character (|). Each URL tells the application where to gather information. The first part is the URL to ingest. The optional second and third parts are the related product and version. If a URL is not specific to a product, leave those labels empty for that URL.

Format

https://docs.solverox.com/TSAIAgent/|TSAIAgent|V3.1
https://docs.solverox.com/EasyReports/|EasyReports|V1.0

Tip: All URLs must end with a trailing / (slash), regardless of how they appear on the web.
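
As an illustration of the format above, the following Python sketch splits one line into its URL, product, and version parts and checks the trailing slash. The names UrlEntry and parse_url_line are hypothetical helpers for checking your list before you paste it in; they are not part of the application.

```python
from typing import NamedTuple, Optional


class UrlEntry(NamedTuple):
    url: str
    product: Optional[str]
    version: Optional[str]


def parse_url_line(line: str) -> UrlEntry:
    """Split one 'URL|Product|Version' line; product and version are optional."""
    parts = [part.strip() for part in line.strip().split("|")]
    if len(parts) > 3:
        raise ValueError(f"too many '|' separators: {line!r}")
    url = parts[0]
    if not url.endswith("/"):
        raise ValueError(f"URL must end with a trailing slash: {url!r}")
    # Pad missing labels, then treat empty labels as "not product specific".
    parts += [""] * (3 - len(parts))
    return UrlEntry(url, parts[1] or None, parts[2] or None)


if __name__ == "__main__":
    print(parse_url_line("https://docs.solverox.com/TSAIAgent/|TSAIAgent|V3.1"))
    print(parse_url_line("https://docs.solverox.com/EasyReports/"))
```

Running a check like this over your list first can catch a missing trailing slash or a stray separator before an ingestion run.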

Tip: The AI Agent classifies and prioritizes product and version information from both users and the knowledge base. For the best results, we strongly recommend providing this information before data ingestion.

Encrypt knowledge base artifacts with my own AWS KMS key

You can protect your information (content chunks and any JQLs that might include your customer names) with your own encryption key. If you enable this option, provide your AWS KMS ARN so the AI Agent can encrypt with your key.

Tip: Your data is encrypted even if you do not use your own key. For the strongest control (so only you can decrypt), select this option and supply your AWS KMS key.

Tip: If you change your AWS KMS key ARN, you must run a new ingestion within 48 hours for the new key to take effect. Re-ingesting even a single page that is already in your knowledge base is sufficient. If you miss this window, contact Solverox support for help re-ingesting your entire knowledge base with the new key ARN.
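
Before pasting a key ARN, it can help to confirm that it has the general shape of an AWS KMS key ARN (arn:partition:kms:region:account-id:key/key-id). The Python sketch below is a rough local check only. The function name is hypothetical, the pattern is deliberately loose, and whether the application also accepts alias ARNs is not stated here, so this sketch assumes key ARNs only.

```python
import re

# Loose pattern for a KMS *key* ARN; it does not verify that the key exists
# or that the AI Agent is permitted to use it.
KMS_KEY_ARN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$"
)


def looks_like_kms_key_arn(arn: str) -> bool:
    """Return True if the string has the general shape of a KMS key ARN."""
    return KMS_KEY_ARN.match(arn.strip()) is not None


if __name__ == "__main__":
    sample = "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    print(looks_like_kms_key_arn(sample))
```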

Include information only under the base domains above

Limits ingestion to the same domain. For example, if you specified https://docs.solverox.com/TSAIAgent/ as one of your URLs, links pointing outside docs.solverox.com will be ignored.

Include information only under the same URL paths

Limits ingestion to the same path. For example, if you specified https://docs.solverox.com/TSAIAgent/ as one of your URLs, links pointing outside docs.solverox.com/TSAIAgent/ will be ignored.

Tip: These two options prevent off-topic or external pages from entering your knowledge base and improve answer quality.

Tip: If these two options are not enabled, ingested information is not classified by product and version, which lowers the quality of the answers.
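
The two scoping options can be pictured as simple checks on each discovered link. The Python sketch below is one plausible reading of them, not the crawler's actual logic: the function names are hypothetical, and it assumes "same domain" means an exact host name match (how subdomains are treated is not specified here).

```python
from urllib.parse import urlsplit


def same_domain(seed: str, link: str) -> bool:
    """True if the link is on the same host as the seed URL."""
    return urlsplit(seed).hostname == urlsplit(link).hostname


def same_path(seed: str, link: str) -> bool:
    """True if the link is on the same host and under the seed URL's path."""
    seed_parts, link_parts = urlsplit(seed), urlsplit(link)
    return (
        seed_parts.hostname == link_parts.hostname
        and link_parts.path.startswith(seed_parts.path)
    )


if __name__ == "__main__":
    seed = "https://docs.solverox.com/TSAIAgent/"
    for link in (
        "https://docs.solverox.com/TSAIAgent/settings/",
        "https://docs.solverox.com/EasyReports/",
        "https://www.example.com/",
    ):
        print(link, same_domain(seed, link), same_path(seed, link))
```

Because the path check relies on a prefix match, the trailing slash on the seed URL matters: it keeps /TSAIAgent/ from also matching a sibling path such as /TSAIAgentOld/.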

Which URLs would you like to exclude?

Provide specific URLs you do not want ingested. If the crawler encounters them, those pages are skipped. Example:

https://docs.solverox.com/TSAIAgent/settings/

Tip: Any path under the specified path is also excluded (in the example above, https://docs.solverox.com/TSAIAgent/settings/ and all of its sub-paths).
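
The exclusion rule behaves like a prefix match on the URL. A minimal Python sketch, assuming plain string prefixes (the helper name is hypothetical and the application's exact matching rules may differ):

```python
from typing import Iterable


def is_excluded(url: str, excluded_prefixes: Iterable[str]) -> bool:
    """True if the URL equals, or sits under, any excluded URL."""
    return any(url.startswith(prefix) for prefix in excluded_prefixes)


if __name__ == "__main__":
    excluded = ["https://docs.solverox.com/TSAIAgent/settings/"]
    print(is_excluded("https://docs.solverox.com/TSAIAgent/settings/advanced/", excluded))
    print(is_excluded("https://docs.solverox.com/TSAIAgent/intro.html", excluded))
```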

Which languages would you like to include?

If your site uses language codes (hreflang) and you want only selected languages ingested, list them here.

Tip: The application supports multiple languages, but the most consistent results are typically achieved in English.

Tip: Even if your site has only one language or does not use language codes, you must still enter a language code. For English, enter "en".

Which keywords would you like to exclude?

Use this field to filter out patterns via regular expressions (regex). Separate each expression with the three-character sequence [;].

Copyright 2025 Solverox Inc. All rights reserved.[;]Copyright[;]previous[;]next[;]Solverox[;]Skip to main page

Tip: Cleaner input produces better answers. Exclude repeated boilerplate such as footers, cookie notices, or branding blocks. Common text to exclude includes "Copyright", "Previous Page", and "All rights reserved".
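
To preview what a keyword list might do, you can try it on a sample of your own page text. The Python sketch below splits the field on [;], compiles each part as a regular expression, and removes every match. It is an approximation under stated assumptions: the function names are hypothetical, and it assumes matches are removed from the text, are case-sensitive, and are applied in the order listed, none of which is confirmed here.

```python
import re
from typing import List, Pattern

SEPARATOR = "[;]"


def compile_exclusions(field: str) -> List[Pattern[str]]:
    """Split the field on the [;] separator and compile each expression."""
    return [re.compile(expr) for expr in field.split(SEPARATOR) if expr]


def strip_boilerplate(text: str, patterns: List[Pattern[str]]) -> str:
    """Remove every match of every pattern, in the order listed."""
    for pattern in patterns:
        text = pattern.sub("", text)
    return text


if __name__ == "__main__":
    field = (
        "Copyright 2025 Solverox Inc. All rights reserved.[;]Copyright[;]"
        "previous[;]next[;]Solverox[;]Skip to main page"
    )
    sample = (
        "Skip to main page Install the agent. "
        "Copyright 2025 Solverox Inc. All rights reserved."
    )
    print(strip_boilerplate(sample, compile_exclusions(field)).strip())
```

Under these assumptions, listing the longer expression first matters: if "Copyright" ran before the full copyright sentence, the sentence would no longer match and its remainder would stay in the text. Also note that characters such as . and ( have special meaning in regular expressions, so escape them when you need a literal match.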

Should we include the text on your images?

If enabled, the application will extract text from all images (OCR) found during the ingestion process and add it to your knowledge base.

Tip: Use OCR only when images contain meaningful text (e.g., a screenshot of a paragraph). OCR increases ingestion time and is usually unnecessary if your images already have good metadata.

The depth from the URL provided

Choose how many link levels to follow from the starting URL. Example for https://docs.solverox.com/TSAIAgent/ with depth = 1:

https://docs.solverox.com/TSAIAgent/something.html                 -> OK
https://docs.solverox.com/TSAIAgent/depth1/something1.html         -> OK
https://docs.solverox.com/TSAIAgent/depth1/something2.html         -> OK
https://docs.solverox.com/TSAIAgent/depth1/depth2/something.html   -> NOT OK

Tip: Start shallow (e.g., 1) to validate results quickly, then increase if needed.
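
The depth example above can be read as counting directory levels below the starting URL. The Python sketch below implements that reading only. It is an interpretation of the example, not confirmed crawler behavior, and the function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def path_depth(seed: str, url: str) -> Optional[int]:
    """Count directory levels below the seed path; None if outside it."""
    seed_path = urlsplit(seed).path
    path = urlsplit(url).path
    if not path.startswith(seed_path):
        return None
    # Each "/" after the seed path marks one more directory level.
    return path[len(seed_path):].count("/")


def within_depth(seed: str, url: str, max_depth: int) -> bool:
    """True if the URL is under the seed path and not deeper than max_depth."""
    depth = path_depth(seed, url)
    return depth is not None and depth <= max_depth


if __name__ == "__main__":
    seed = "https://docs.solverox.com/TSAIAgent/"
    for tail in ("page.html", "a/page.html", "a/b/page.html"):
        print(tail, within_depth(seed, seed + tail, max_depth=1))
```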

How many pages should be included?

Set a maximum number of pages to ingest per run (e.g., 100). This caps the volume and helps you stay within your plan.

Tip: At most 9999 pages can be ingested at once. Enter a number close to the number of pages you expect to have, to keep the ingestion under control.

Tip: These settings apply to the current run only; you can change them for future runs.

Extraction Controls

Ingest Data: Starts a new ingestion. Content matching your settings starts being ingested.
Delete Data: Deletes the data from your knowledge base. Deletion still applies the settings defined above.
Check Status: Displays real-time progress. When ingestion finishes, the status turns to "idle".
Added URLs, Removed URLs and Error URLs tables: Show the results of the ingestion or data cleaning.

Tip: When ingestion finishes, you'll see the tables and start-finish times populated, and the status returns to idle.

Tip: Include only the content you create; exclude noise and boilerplate. Control depth and page limits to manage cost, time, and relevance.

Tip: Treat crawls as repeatable jobs. Adjust scope, run again, and review the delta to keep your knowledge current.

If you want to start small

Configure a single, well-scoped URL so you can validate the ingestion pipeline before scaling out:

https://docs.solverox.com/TSAIAgent/|TSAIAgent|V3.1

Stick with the same base path and domain.
List keywords to exclude recurring boilerplate.
Avoid OCR unless the documents truly require it.
Provide realistic depth and max-page values so the crawl completes.

Once you have worked through this guide, switch the app ON and start ingestion. This small experiment gives you confidence in how URL ingestion behaves before you expand coverage.

Save your settings

Click Save Settings to keep the changes.