Sunday, June 10, 2018

Manage crawl rules

You can add a crawl rule to include or exclude specific paths when you crawl content. When you include a path, you can optionally provide alternative account credentials to crawl it. In addition to adding new crawl rules, you can test, edit, delete, or reorder existing crawl rules.

Crawl rules are applied in the order that they are listed.

Note: To manage crawl rules, you must first open the Manage Crawl Rules page. On the Search Administration page, under Crawling, click Crawl Rules.

What do you want to do?

Add a crawl rule

Test crawl rules on a URL

Edit a crawl rule

Delete a crawl rule

Reorder crawl rules

Add a crawl rule

  1. On the Manage Crawl Rules page, click New Crawl Rule.

  2. On the Add Crawl Rule page, in the Path box in the Path section, type the path affected by the rule. You can use standard wildcard characters in the path, as in the following examples:

    • http://server1/folder* matches all Web resources whose URL starts with http://server1/folder.

    • *://*.txt matches every document with a .txt extension.

  3. In the Crawl Configuration section, select one of the following:

    • Exclude all items in this path. Select this option if you want all items in the specified path to be excluded from the crawl. If you select this option, you can further refine the exclusion by selecting the following:

      • Exclude complex URLs (URLs that contain question marks (?)). Select this option if you want to exclude URLs that contain parameters that use the question mark (?) notation.

    • Include all items in this path. Select this option if you want all items in the path to be crawled. If you select this option, you can further refine the inclusion by selecting any combination of the following:

      • Follow links on the URL without crawling the URL itself. Select this option if you want to crawl links contained within the URL, but not the URL itself.

      • Crawl complex URLs (URLs that contain a question mark (?)). Select this option if you want to crawl URLs that contain parameters that use the question mark (?) notation.

      • Crawl SharePoint content as HTTP pages. Normally, SharePoint sites are crawled by using a special protocol. Select this option if you want SharePoint sites to be crawled as HTTP pages instead. When the content is crawled by using the HTTP protocol, item permissions are not stored.

  4. In the Specify Authentication section, do one of the following:

    • To use the default content access account, select Use the default content access account.

    • If you want to use a different account, select Specify a different content access account and then do the following:

      1. In the Account box, type the account name that can access the paths defined by this crawl rule.

      2. In the Password and Confirm Password boxes, type the password for this account.

      3. To prevent Basic authentication from being used, select the Do not allow Basic Authentication check box.

        The server attempts to use Windows NTLM authentication. If NTLM authentication fails, the server attempts to use Basic authentication unless the Do not allow Basic Authentication check box is selected.

    • To use a client certificate for authentication, select Specify client certificate, and then click a certificate on the Certificate menu.

    • To use form credentials for authentication, select Specify form credentials, then enter the form URL (the location of the page that accepts credentials information) in the Form URL box, and click the Enter Credentials button.

      1. When the logon prompt from the remote server opens in a new window, enter the form credentials and log on.

      2. You are asked whether the logon was successful. If it was, the credentials that are required for authentication are stored on the remote site.

    • To use cookie authentication, select Use cookie for crawling, and then do either of the following:

      1. Click Obtain cookie from a URL to fetch a cookie from a Web site or server.

      2. Click Specify cookie for crawling to import a cookie from your local file system or a file share. You can optionally specify error pages in the Error pages (semi-colon delimited) box.

  5. Click OK.
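The wildcard patterns in step 2 behave like shell-style globbing, where * matches any run of characters. As a rough illustration (a minimal sketch using Python's standard fnmatch module, not SharePoint's actual matcher), the two example patterns can be checked like this:

```python
from fnmatch import fnmatchcase

def path_matches(url: str, pattern: str) -> bool:
    """Hypothetical helper: shell-style wildcard match, case-insensitive,
    roughly mirroring the crawl-rule path patterns described above."""
    return fnmatchcase(url.lower(), pattern.lower())

# The two example patterns from step 2:
print(path_matches("http://server1/folder/page.aspx", "http://server1/folder*"))  # True
print(path_matches("http://server2/docs/readme.txt", "*://*.txt"))                # True
print(path_matches("http://server2/docs/readme.doc", "*://*.txt"))                # False
```

Note that * here crosses path-segment boundaries (it also matches slashes), which is why http://server1/folder* covers everything beneath that folder.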

Test crawl rules on a URL

You can test crawl rules on a URL to determine what rules will be applied when the URL is crawled and what the result of applying those rules will be (either inclusion or exclusion of content). Be aware that testing crawl rules on a URL does not actually crawl the URL.

  1. On the Manage Crawl Rules page, in the Type a URL and click test to find out if it matches a rule box, type the URL that you want to test.

  2. Click Test.

  3. The result of the test is listed below the Type a URL and click test to find out if it matches a rule box.

Edit a crawl rule

If you edit a crawl rule, the changes do not take effect until the next full crawl is started.

  • On the Manage Crawl Rules page, in the crawl rules list, click Edit on the menu of the crawl rule that you want to edit.

    You can find information about the settings for crawl rules in the Add a crawl rule section.

Delete a crawl rule

If you delete a crawl rule, the deletion is not reflected until the next full crawl is started.

  1. On the Manage Crawl Rules page, in the crawl rules list, click Delete on the menu of the crawl rule that you want to delete.

  2. Click OK in the message box to confirm that you want to delete the crawl rule.

Reorder crawl rules

  • On the Manage Crawl Rules page, in the Order column in the list of crawl rules, select a value in the drop-down list that specifies the position you want the rule to occupy. Other values are shifted accordingly.

    Crawl rules are applied in the order that they are listed. Therefore, if two rules cover the same or overlapping content, the first rule that is listed is applied.
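The shifting behavior when you pick a new Order value amounts to removing the rule from its old position and reinserting it at the new one. A minimal sketch (illustrative only; not the SharePoint implementation):

```python
def reorder(rules: list, old_pos: int, new_pos: int) -> list:
    """Move the rule at 1-based old_pos to new_pos; other rules shift accordingly."""
    rules = list(rules)               # work on a copy
    rule = rules.pop(old_pos - 1)     # remove from the old position
    rules.insert(new_pos - 1, rule)   # reinsert at the new position
    return rules

rules = ["rule A", "rule B", "rule C", "rule D"]
print(reorder(rules, old_pos=4, new_pos=1))  # ['rule D', 'rule A', 'rule B', 'rule C']
```

Moving a rule earlier in the list gives it priority over any later rule that matches the same content.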
