Designs of 1st Generation 3D Search Engine 


I have worked with these before designing my 3D Search Tools - oogle

Basic Algorithms for Search Robots

A popular use for spiders is searching for information. WebCrawler and InfoSeek are two search engines that return results gathered by a spider-type program.

Algorithm for a Search Engine

A search engine starts off with a user query, usually a keyword or two. The server-side program then takes the query and sends it to the robot. The robot starts with a set of starter pages and parses them for anchors and the keywords. It then traverses the anchors, looks for keywords there, and repeats this process recursively until either a maximum depth of anchors is reached, a time-out period is reached, or a specified number of matches is met.
The titles are returned to the user as hypertext links, and are often ranked by the occurrence of the query words.
Search engines, such as WebCrawler, actually assign a rank to each page they find for you. Most of these ranking systems assign a higher rank for a higher incidence of keywords found, so if your keyword is "HTML" and a document has the string "HTML" in it 100 times, that document would be ranked higher than a different document containing the string "HTML" only 40 times.
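The incidence-based ranking just described can be sketched in a few lines of Visual Basic. This is only an illustration: CountOccurrences is a hypothetical helper name, and ignoring case is an assumption on my part rather than something any particular search engine is known to do.

```vb
' Hypothetical helper (not part of the original listings): counts how many
' times sKey occurs in sText, ignoring case, so pages can be ranked by it.
Function CountOccurrences(sText As String, sKey As String) As Long
    Dim lPos As Long
    Dim lCount As Long
    Dim sHaystack As String
    Dim sNeedle As String

    sHaystack = LCase$(sText)
    sNeedle = LCase$(sKey)
    If Len(sNeedle) = 0 Then Exit Function   ' nothing to count; returns 0

    lPos = InStr(1, sHaystack, sNeedle)
    Do While lPos > 0
        lCount = lCount + 1
        ' resume just past this match so the same text isn't counted twice
        lPos = InStr(lPos + Len(sNeedle), sHaystack, sNeedle)
    Loop
    CountOccurrences = lCount
End Function
```

A page's rank could then simply be the value returned for its text; for instance, CountOccurrences("HTML and html", "HTML") returns 2.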

A searching spider can be implemented with these steps:
With a database of root pages, search for the keyword. If the keyword is found, rank the page on the number of occurrences of that keyword, and add it to a temporary list. Check for time-out and number of matches requirements; keep going if necessary; otherwise, wrap the results in HTML and return the results to the user.
Go to the anchors of the root pages; load their children; again search for the keyword. Again, if the keyword is found, add it and its rank to the temporary list. Check for time-out, search depth, and number of matches; repeat this step as many times as necessary, continually building the list of matches.
After one of the conditions is met, the list of URLs, their titles, and their rankings according to number of keywords per page can be translated to HTML. After this page is built, it can be returned to the user as the search results.
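The steps above can be sketched as a small recursive routine. To keep the sketch runnable without a network connection, the "Web" here is three hard-coded pages; GetPageText, GetPageLinks, Crawl, and the two constants are all hypothetical names, and the routine walks depth-first instead of level by level, which is a simplification of the steps as written. Ranking is left out to keep it short.

```vb
' Limits that end the search (values are arbitrary for the sketch)
Const MAX_DEPTH = 2
Const MAX_MATCHES = 10

' Stand-in for fetching a page's text; a real robot would request the URL
Function GetPageText(sURL As String) As String
    Select Case sURL
        Case "a": GetPageText = "HTML tips"
        Case "b": GetPageText = "nothing here"
        Case "c": GetPageText = "more HTML"
    End Select
End Function

' Stand-in for parsing anchors; returns a space-separated list of URLs
Function GetPageLinks(sURL As String) As String
    Select Case sURL
        Case "a": GetPageLinks = "b c"
        Case "b": GetPageLinks = "c"
    End Select
End Function

Sub Crawl(sURL As String, sKey As String, nDepth As Integer, _
          sVisited As String, colFound As Collection, sngDeadline As Single)
    Dim sLinks As String
    Dim nPos As Integer

    ' stop conditions: depth, time-out, number of matches, already seen
    If nDepth > MAX_DEPTH Then Exit Sub
    If Timer > sngDeadline Then Exit Sub
    If colFound.Count >= MAX_MATCHES Then Exit Sub
    If InStr(sVisited, " " & sURL & " ") > 0 Then Exit Sub
    sVisited = sVisited & sURL & " "

    ' record the page if it contains the keyword (case ignored)
    If InStr(LCase$(GetPageText(sURL)), LCase$(sKey)) > 0 Then
        colFound.Add sURL
    End If

    ' follow each anchor on this page one level deeper
    sLinks = Trim$(GetPageLinks(sURL))
    Do While Len(sLinks) > 0
        nPos = InStr(sLinks, " ")
        If nPos = 0 Then nPos = Len(sLinks) + 1
        Crawl Left$(sLinks, nPos - 1), sKey, nDepth + 1, _
              sVisited, colFound, sngDeadline
        sLinks = Trim$(Mid$(sLinks, nPos))
    Loop
End Sub

Sub CrawlDemo()
    Dim colFound As New Collection
    Dim sVisited As String
    sVisited = " "                     ' visited list starts empty
    Crawl "a", "html", 0, sVisited, colFound, Timer + 30
    ' colFound now holds "a" and "c"
End Sub
```

Note that Timer counts seconds since midnight, so a deadline computed this way misbehaves for a search that runs across midnight; a production robot would want a sturdier clock.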
-------------------------------------------------------------------------------
A few tips: First, don't forget to look for the robots.txt file, and search for the user-agent field that matches your robot. With this, you know where and where not to aim your robot.
Another good idea, if your robot takes in a number of root nodes that live on different servers, is to use a round-robin type schedule to increase the time between the robot's successive hits on any one server. A hit every few seconds, or better yet, every few minutes, is much more polite to the owner of the server than making a few thousand requests in a minute or two.
--------------------------------------------------------------------------------
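As a rough illustration of the robots.txt tip, the function below decides whether a path is off limits. It assumes the text of robots.txt has already been retrieved, and it handles only the basic User-agent and Disallow fields with simple prefix matching; IsPathAllowed and its parameters are hypothetical names. It also treats a "*" record and a record naming your robot alike, which is a simplification.

```vb
' Sketch of a robots.txt check (hypothetical helper, simplified rules)
Function IsPathAllowed(sRobots As String, sAgent As String, _
                       sPath As String) As Boolean
    Dim sRest As String, sLine As String
    Dim sField As String, sValue As String
    Dim nPos As Long
    Dim fApplies As Boolean

    IsPathAllowed = True
    sRest = sRobots
    Do While Len(sRest) > 0
        ' peel off one line; lines end in LF, optionally preceded by CR
        nPos = InStr(sRest, Chr$(10))
        If nPos = 0 Then nPos = Len(sRest) + 1
        sLine = Left$(sRest, nPos - 1)
        sRest = Mid$(sRest, nPos + 1)
        If Right$(sLine, 1) = Chr$(13) Then sLine = Left$(sLine, Len(sLine) - 1)

        ' drop comments, which start with "#"
        nPos = InStr(sLine, "#")
        If nPos > 0 Then sLine = Left$(sLine, nPos - 1)

        nPos = InStr(sLine, ":")
        If nPos > 0 Then
            sField = LCase$(Trim$(Left$(sLine, nPos - 1)))
            sValue = Trim$(Mid$(sLine, nPos + 1))
            Select Case sField
                Case "user-agent"
                    ' the record applies to everyone ("*") or to this robot
                    fApplies = (sValue = "*")
                    If Len(sValue) > 0 Then
                        If InStr(LCase$(sAgent), LCase$(sValue)) > 0 Then fApplies = True
                    End If
                Case "disallow"
                    ' an empty Disallow value means nothing is off limits
                    If fApplies And Len(sValue) > 0 Then
                        If Left$(sPath, Len(sValue)) = sValue Then
                            IsPathAllowed = False
                            Exit Function
                        End If
                    End If
            End Select
        End If
    Loop
End Function
```

With a file containing "User-agent: *" followed by "Disallow: /private/", a call such as IsPathAllowed(sRobots, "MyBot", "/private/x.htm") returns False, while a path outside /private/ returns True.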
With this and a good parsing program, you can build a Visual Basic front end that enables the user to specify the root URLs, the number of times to recur, and the information the user seeks.
Behind this front end, you need to set up a system of requesting, parsing, and analyzing pages for content, as well as a system to determine if the search, time-out, and depth conditions and restrictions have been met. Tack the results into a textbox or a browser custom control, add a status bar indicating the progression of the search, and you've built a full-featured Web robot.

WebSearcher: A Simple Search Tool

The vast amount of information available on the World Wide Web is its greatest strength. Millions of resources scattered across the world are available any day, at any time, to anyone with a connection and a browser... Great, right?
Try finding something you need. Since the creation of the World Wide Web, search tools have come full circle from non-existent to extremely prevalent. I can think of ten different ones off the top of my head, and I'm sure many more exist.
Search engines provide the compass to the Web traveler, who either goes through a series of narrower and narrower categories or just types in a keyword.
The search engine then looks through its database of registered URLs and returns to the user a listing of what it found that lines up with the keywords entered.
Right now most search engines depend on the use of a Web browser such as Netscape Navigator: You go to the site of the search engine, enter your query, and in a moment or two, the browser displays a set of pages that match your request.
Now, using a custom control such as the Sax Webster control described in Chapter 4, "Using Custom Controls for WWW Programming," you can put this search engine on the user's desktop.

Designing the Application
There are a few basic capabilities you'll want this search engine to have:
Allow the user to specify base URLs to be searched, place these base URLs in a list box, and make sure that any pages linked to these root URLs are searched also.
Show the pages in a Webster control as they're received. The Webster control is used because it provides a GetLinkURL method, which allows you to iterate through an array containing the URLs of all of the hypertext links on a specified page.
--------------------------------------------------------------------------------
For some reason the Webster control seems to have trouble communicating over certain dial-up TCP/IP stacks, including the stack included with Internet in a Box. If at all possible, use the Microsoft TCP/IP stack that ships with Windows 95. It is very easy to configure the Dial-Up Networking feature of Windows 95 to work with just about any Internet service provider. There are two Microsoft Knowledge Base articles available at the Microsoft Web site that explain how to do this. Point your browser to
http://www.microsoft.com/kb/peropsys/win95/q138789.htm and http://www.microsoft.com/kb/peropsys/win95/q153038.htm
to view these articles.
--------------------------------------------------------------------------------
Show any hypertext links found on the base pages in a listbox, using the aforementioned GetLinkURL method.
Take a user-specified keyword and record the URL of any pages that contain the keyword in a listbox.
When the search engine isn't performing a search, allow the user to view a URL in the Webster control by double-clicking a URL in any of the list boxes.
Keep the user informed of the status of the Internet activity (connecting, sending, and receiving of information) using the status bar control.
With these ideas in place, you can start thinking about the processes involved in setting up the form and the flow of the program.
Knowing the base functionality of the program, you can choose the controls for the program. Figure 14.1 shows how the form should look when you're finished with this section.
--------------------------------------------------------------------------------
The code for this chapter is included on the CD-ROM accompanying the book. You may enter it as you follow along in this chapter or copy it from the CD.
CD-ROM: I've included a file named CODE-17.ZIP that contains the code for this chapter. –Craig
--------------------------------------------------------------------------------
Figure 14.1. Design time view of the Web search application.
To create this form, follow these steps:
Start a new project in Visual Basic. Add the Webster control using Tools | Custom Controls. You'll use this control to load the URLs and parse the links found on the loaded pages.
Increase the form's size to allow enough room for all the controls.
--------------------------------------------------------------------------------
This application was originally designed on a monitor set at 800 x 600 resolution. If you are limited to, or prefer, 640 x 480, you'll probably have to shrink some of the controls on the left hand side of the form to get everything to fit.
--------------------------------------------------------------------------------
Add a Webster control to the form, positioning it so it takes up the right half of the form.
Set the HomePage property to an empty string. Set the LoadImages property to False since the program is only searching for text.
Set the PagesToCache property to 1 to remove any page caching. This is necessary to ensure that Webster loads and parses each page. Otherwise, a cached page would not be properly searched because Webster merely re-displays cached pages without firing the LoadComplete event, or any event for that matter.
Using the control's custom properties page (select the Custom property in VB's Property Window and click the button with the ellipses), select the Display tab and turn off all the Buttons checkboxes except for the Back/Forth checkbox. The tab should appear as in Figure 14.2.
--------------------------------------------------------------------------------
You can leave more of the buttons enabled if you wish, but you must make sure that while the application is performing a search, the user is unable to change the URL that Webster is attempting to load. The best method for doing so is to modify the control's ButtonMask property right before a search to turn off all buttons. Then, upon completion of a search, turn your buttons back on using the same property.
--------------------------------------------------------------------------------
Add a status bar (Status1) and two textboxes: one for entering URLs (txtURL) and one for entering the keyword for the search (txtKey).
Add three listboxes: one for the user-specified URLs to search (lbURL), another for the anchors found within those user-specified URLs (lbAnchor), and one listbox for URLs of pages containing the keyword specified (lbFound).
Add the labels as shown in Figure 14.1.
Add command buttons for adding (cmdURL, Index = 0) and deleting (cmdURL, Index = 1) user-specified URLs, starting the search (cmdSearch), resetting the application (cmdReset), and exiting (cmdExit).
With these controls in place, the form should look like Figure 14.1. If so, you can start adding code to the project. Otherwise, retrace your steps and make it look similar.

Coding the Application
Now that the form's controls are in place, it's time to add some code. This section provides all the code necessary to make the search application operate.
The Declarations Section
There are a few form-level variables defined in the Declarations section of our form's code. Open the form's code module by clicking the View Code button on the VB Project window or by pressing the F7 key. In the Declarations section, enter the following lines:
Option Explicit
Dim fPageLoaded% 'set true when a page is loaded
Dim LBFlag As Integer 'which list box to process
The first line specifies that all variables within the application must be declared before they can be used. The first variable defined is a flag that is used to determine if a page is loaded in the Webster control. Because Webster caches pages, if the first searched page is the current page loaded in the Webster control, Webster won't re-load the page. If Webster doesn't re-load the page, the LoadComplete event won't fire and the program won't process the page.
The second variable, LBFlag, is a flag that tracks which URL list box is being processed. The application only searches the URLs specified by the user and then any URLs linked on those pages. If you're processing URLs from the user-specified list, you'll load the anchor list box with all the links found on the pages. If you're processing the anchor list box, you'll only be searching for text, not links.
The AddMatch Subroutine
This subroutine adds the URL of a page containing the search string to the lbFound listbox. The URL to be added is provided as a string parameter to the subroutine. The code, shown in Listing 14.1, first searches all the URLs currently in the listbox to verify that the URL specified doesn't already exist in the list. If a duplicate URL is not found, the URL is added to the list box.
Listing 14.1. The AddMatch subroutine.
Sub AddMatch(sMatchURL As String)
    Dim x As Integer
    'check every existing entry for a duplicate before adding
    For x = 0 To lbFound.ListCount - 1
        If lbFound.List(x) = sMatchURL Then
            Exit Sub
        End If
    Next x
    lbFound.AddItem sMatchURL
End Sub
Specifying the URLs to Search
Let's start off by coding the adding and removing of user-specified URLs from the lbURL listbox. Each of the URLs the user adds is searched for the keyword when the Search command button is pressed.
The Add URL and Remove URL buttons are in a control array. The code behind the Click event is found in Listing 14.2.
Listing 14.2. The cmdURL_Click event code.
Private Sub cmdURL_Click(Index As Integer)
    'allows user to build URL listbox
    Select Case Index
        Case 0 'add url
            If Trim(txtURL.Text) = "" Then
                Exit Sub
            End If
            lbURL.AddItem (txtURL.Text)
            txtURL.Text = ""
        Case 1 'remove url
            If lbURL.ListIndex ...
... tags is not considered pure text, and you're not interested in searching it anyway. Once the size is determined, the GetText method is used to retrieve all the pure text from the current page. The code then uses the InStr() function to determine if the search string is contained within the text. If the search string is found, AddMatch is called to add the current URL to the found listbox (lbFound). If the string is not found within the text, the code checks for the string in the page's title. Again, if the search string is found, the URL is added to lbFound using AddMatch.
And that, finally, concludes the majority of our search engine.
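The text-then-title check described above comes down to two InStr() calls. Here is a minimal sketch of that logic on plain strings; PageContainsKey is a hypothetical helper, and the case-insensitive comparison is my assumption, not necessarily what the chapter's own listing does.

```vb
' Hypothetical helper: looks for the keyword in the page text first and
' falls back to the title, mirroring the order described in the text.
Function PageContainsKey(sPageText As String, sTitle As String, _
                         sKey As String) As Boolean
    Dim sNeedle As String

    sNeedle = LCase$(Trim$(sKey))
    If Len(sNeedle) = 0 Then Exit Function   ' empty keyword never matches

    If InStr(LCase$(sPageText), sNeedle) > 0 Then
        PageContainsKey = True
    ElseIf InStr(LCase$(sTitle), sNeedle) > 0 Then
        PageContainsKey = True
    End If
End Function
```

Given the page's text and title in two string variables, a True result would be the cue to call AddMatch with the current URL.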
Viewing Pages
Another feature of the Web Search Tool is that it allows you to load any of the URLs from any of the listboxes into the Webster browser. If a search is not in progress, you can double-click a URL in any of the listboxes and it will be loaded by the Webster control. The code to make this happen is contained in Listing 14.6 and needs little explanation.
Also, because the Webster control allows the user to click on hypertext links and load the page the link points to, you'll want to disable this feature while a search is in progress. The best way to do this is by using the Webster control's DoClickURL event.
By setting the Cancel parameter to True within the event's code, which is done whenever the Search button is disabled, you prevent Webster from loading the page pointed to by the URL that was clicked. Although not applicable for this application, this event can also be used to trap URLs that you want to prevent the user from accessing.
Listing 14.6. Code to load pages from the list boxes.
Private Sub lbURL_DblClick()
If cmdSearch.Enabled Then Webster1.LoadPage lbURL.Text, False
End Sub
Private Sub lbAnchor_DblClick()
If cmdSearch.Enabled Then Webster1.LoadPage lbAnchor.Text, False
End Sub
Private Sub lbFound_DblClick()
If cmdSearch.Enabled Then Webster1.LoadPage lbFound.Text, False
End Sub
Private Sub Webster1_DoClickURL(SelectedURL As String, Cancel As Boolean)
'if the search button is off, don't allow clicks
' (the program is still searching)
If Not (cmdSearch.Enabled) Then Cancel = True
End Sub
Testing The Application
This application is simple to test. After all the code is entered or copied from the CD-ROM, run the application. Make sure you have either an active Internet connection or have a Web server running locally.
--------------------------------------------------------------------------------
If you're running Windows 95 and would like to run a local Web server, I'd recommend O'Reilly and Associates WebSite server. An evaluation copy is included on the CD-ROM accompanying this book. Another good choice is the FrontPage Personal Web Server that ships with Microsoft's FrontPage Web site editor.
--------------------------------------------------------------------------------
Enter a URL into the URL To Add text box and click the Add URL button. Next, enter a string to search for on the page specified by the URL.
Click the Search button and watch the action. You should see the page you specified load into the Webster browser. Then, if there are any links on that page, the Anchor List Box is filled with them and each page is loaded and searched. The URLs for any pages with matches are added to the Matched URLs list box.
Once the search is completed (the Search button turns back on), you can double-click any of the URLs to load the page into the Webster control. Then, use the Webster browser just like any other Web browser to surf to your heart's content.
For example, Figure 14.3 shows the results of searching the URL http://www.infi.net for the string cool. After the search filled the listboxes, I double-clicked the URL in the URL Search List box to re-load the starting page into the Webster control.
Figure 14.3. The Web Search Tool in action.
Other Directions
This sample application is not meant to be a fire-and-forget solution for searching the Web. Quite a few areas could be pursued, and I leave a few suggestions that you can add.
Robots.txt support: This would allow you to target any server and find out whether it permits robots (such as this one) and what directories are off limits (see Chapter 12, "A Brief Introduction to Web Spiders and Agents").
Merciful timing: As discussed in Chapter 12, you should not rapid-fire requests to servers. This may or may not happen during the execution of this sample application depending on the URLs you specify and the links contained within those URLs.
Recursion: Add the functionality to dig as deep as you want. This application searches the user-specified URLs and the pages linked from them. Without too much effort, you could add a textbox that takes a number and instructs the program to search that many levels deep into the anchors of the user-specified URLs.
Ranked searching: Add the ability to perform Soundex and other types of searches so that matches can be made based on criteria other than just finding the exact text within the document. Note that this is a very advanced topic but mastery of such a searching mechanism can have very high rewards.
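To give a feel for the Soundex idea mentioned in the last suggestion, here is a sketch of the classic algorithm as it is commonly described: keep the first letter, code the remaining consonants as digits, collapse repeated codes, and pad to four characters. SoundexDigit and Soundex are hypothetical function names, and nothing here is taken from the sample application.

```vb
' Maps an uppercase letter to its Soundex digit ("" for uncoded letters)
Function SoundexDigit(sCh As String) As String
    Select Case sCh
        Case "B", "F", "P", "V": SoundexDigit = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z": SoundexDigit = "2"
        Case "D", "T": SoundexDigit = "3"
        Case "L": SoundexDigit = "4"
        Case "M", "N": SoundexDigit = "5"
        Case "R": SoundexDigit = "6"
        Case Else: SoundexDigit = ""
    End Select
End Function

' Returns the four-character Soundex code for a word ("" for empty input)
Function Soundex(sWord As String) As String
    Dim sUp As String, sCh As String
    Dim sCode As String, sLast As String
    Dim sResult As String
    Dim i As Integer

    sUp = UCase$(Trim$(sWord))
    If Len(sUp) = 0 Then Exit Function

    sResult = Left$(sUp, 1)              ' first letter is kept as-is
    sLast = SoundexDigit(sResult)
    For i = 2 To Len(sUp)
        sCh = Mid$(sUp, i, 1)
        sCode = SoundexDigit(sCh)
        If Len(sCode) > 0 Then
            ' adjacent letters with the same code are coded once
            If sCode <> sLast Then sResult = sResult & sCode
            sLast = sCode
        ElseIf sCh <> "H" And sCh <> "W" Then
            sLast = ""                   ' a vowel separates repeated codes
        End If
        If Len(sResult) = 4 Then Exit For
    Next i
    Soundex = Left$(sResult & "000", 4)  ' pad short codes with zeros
End Function
```

Two words are treated as a match when their codes are equal; Soundex("Robert") and Soundex("Rupert") both return "R163".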

Designing the Link Checker Application

The usual flow of the application would look like this: The user enters a URL to validate. The app loads that URL into a Sax Webster control (first described in Chapter 4, "Using Web Browser Custom Controls") and parses all of its anchors using the control's GetLinkCount and GetLinkURL methods as described in Chapter 14, "WebSearcher: A Simple Search Tool."
Then the Microsoft Internet Control Pack's HTTP client control is used to retrieve the HTTP header information for the anchors (see Chapter 2, "HTTP: How To Speak on the Web," for information on HTTP messages, and Chapter 5, "Retrieving Information From the Web," for information on the HTTP client control).
If you don't have access to the Sax Webster control, there's a short section at the end of this chapter describing how to accomplish the link checking using the Microsoft Internet Control Pack's HTML control (also introduced in Chapter 4). Using the Microsoft control requires a good deal more code so this chapter only describes how to modify the Webster-based code to work with the Microsoft control.
Each response received by the HTTP client control is then checked for the previously listed error conditions, and if present, the URL should be marked as not valid. If a valid HTTP header information response is received for the link, the link is marked as valid. Also, if the Web server associated with the link provides the Last-Modified HTTP response header field, the field's value is displayed in the grid.
--------------------------------------------------------------------------------
The Last-Modified HTTP response message header field specifies the date and time the resource represented by the requested URL was last updated. However, not all HTTP servers provide this field when returning information to HTTP client applications.
--------------------------------------------------------------------------------
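Because the Last-Modified field may or may not be present, whatever code reads it has to cope with its absence. The sketch below works on a block of raw response header text, one "Name: value" pair per line; it makes no assumption about how the HTTP client control hands you the headers, and GetHeaderValue is a hypothetical helper name.

```vb
' Hypothetical helper: returns the value of one header field from raw
' HTTP response headers, or "" if the server did not send that field.
Function GetHeaderValue(sHeaders As String, sField As String) As String
    Dim sRest As String, sLine As String
    Dim nPos As Long, nColon As Long

    sRest = sHeaders
    Do While Len(sRest) > 0
        ' peel off one line; lines end in LF, optionally preceded by CR
        nPos = InStr(sRest, Chr$(10))
        If nPos = 0 Then nPos = Len(sRest) + 1
        sLine = Left$(sRest, nPos - 1)
        sRest = Mid$(sRest, nPos + 1)
        If Right$(sLine, 1) = Chr$(13) Then sLine = Left$(sLine, Len(sLine) - 1)

        ' the status line has no colon and is skipped
        nColon = InStr(sLine, ":")
        If nColon > 0 Then
            ' header field names are not case sensitive
            If LCase$(Trim$(Left$(sLine, nColon - 1))) = LCase$(sField) Then
                GetHeaderValue = Trim$(Mid$(sLine, nColon + 1))
                Exit Function
            End If
        End If
    Loop
End Function
```

Calling GetHeaderValue(sHeaders, "Last-Modified") yields the date string when the server supplies one and an empty string otherwise, so the grid cell can simply be left blank in the second case.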
You'll also want the user to be able to specify whether to check only links local to that site or all links referenced within the page.
You add this functionality via a frame and two option buttons.
The form has a tab control (see Figures 15.4 and 15.5) that allows the user to select either Link View or Web View: the Link View is where the grid is placed, and the Web View is where the Webster control is placed. There is also a checkbox on the Web View to enable and disable loading of embedded images by the Webster control. With image loading turned off, the page to be verified will load faster.

Designing the User Interface

The final result of this section is shown in Figure 15.4, which shows the Link View tab, and Figure 15.5, which shows the Web View tab. Most of the controls use the default properties, but a few have their properties modified to meet the needs of this application.
Figure 15.4. Design time view of the Link View tab.
Figure 15.5. Design time view of the Web View tab.
Start a new project in Visual Basic. View the currently available custom controls by selecting the Tools | Custom Controls menu item. This project requires the following controls be included in the list:
Microsoft Grid Control (GRID32.OCX)
Microsoft HTTP Client Control (HTTPCT.OCX)—this is available in the Microsoft Internet Control Pack (ICP) discussed in Chapter 5. The ICP is available on the Web at http://www.microsoft.com/icp
Microsoft Windows Common Controls (COMCTL32.OCX)
Sax Webster Control (WEBSTER.OCX)
Sheridan Tabbed Dialog Control (TABCTL32.OCX)
All controls except the Webster control and the HTTP client control ship with the Visual Basic Professional Edition. After you add the controls, you must also add a reference to the Microsoft Internet Support Objects. Use the Tools | References menu and select this item in the list. If it is not in the list box, click the Browse button and locate the file NMOCOD.DLL. If you can't find this file on your system, you probably need to re-install the Microsoft Internet Control Pack.
Once the proper controls have been added to the project, you can begin to populate the form. To create the controls directly on the form, follow these steps:
Add a label to the top left corner and give it a Caption of Web Link Verifier. Set its FontSize to 13.5 and its BackStyle to 0 (transparent).
Add the label for the URL text box near the top center of the form. Set its Caption to URL to Verify and its BackStyle to 0.
Add a text box below this label. Set its Name to txtURL.
Add the HTTP client control. It's not visible at run time so it can be placed anywhere. In Figures 15.4 and 15.5 it's in the top left corner. Set its Method property to 2 (HEAD method). The default control name, HTTP1, is used in the code. If this is not the name provided as the default, change the control's Name property to HTTP1.
Add a StatusBar control. Set Align to 2 (align bottom) and Style to 1 (single pane simple text). The control's name should default to StatusBar1. If it doesn't, change it so that it matches the code.
Add an SSTab control above the status bar control. Size it similar to what's shown in Figure 15.4. Set its Tabs and TabsPerRow properties to 2. Select Custom on the VB Properties window and click the button with the ellipses to access the tab control's custom properties page (shown in Figure 15.6). On the General tab, enter Link View as the TabCaption for tab 0.
Click the button with the ">" symbol to change the current tab to 1. Enter Web View as the caption and click the OK button.
Again, the default name should be SSTab1. If it's not, change the control's Name property to SSTab1.
Figure 15.6. The SSTab control's custom properties page.
Now that the form's shell has been created, it's time to add controls to the tabs. Bring the tab back to the Link View by clicking on its tab caption. Refer to Figure 15.4 for control placement. Follow these steps to add the controls for this tab:
Add the Verify command button. Set its Caption to &Verify. Set its name to cmdVerify. Because you don't want the user to be able to start a verification without a URL entered in txtURL, set the button's Enabled property to False.
Add the Reset button. Set its Caption to &Reset. Set its name to cmdMain.
Add the Exit button by copying and pasting the Reset button. When asked by Visual Basic if you wish to create a control array, answer Yes. Set the new button's Caption to &Exit.
Add the Microsoft grid control. Set its Cols property to 3. Leave the rest of its properties set at their default values. The Name property should be Grid1.
Add the frame that appears above the command buttons. Set its Caption to Links To Verify. The default name, Frame1, should be used.
Add an option button to the frame. Set its Name to optLocal and its Caption to &Local Only.
Copy and paste the option button to the frame, below the first. Answer Yes when asked to create a control array. Set the new option button's Caption to &All Links.
You're almost there. Click the Web View tab caption to move to the other tab. To add the controls, refer to Figure 15.5 and follow these steps:
Add a CheckBox control. Assign chkImages to its Name property. Set Value to 1 and its Caption to Load &Images.
Add a Webster control and size it to fill most of the tab. Change the PagesToCache property to 1 and the HomePage property to an empty string. The Name should default to Webster1 but if not, change the Name property to Webster1.
Now that the controls are in place, it's time to start entering some code. The next section covers all the code necessary to make the Link Verifier work.

Coding the Application
The task of this application is to retrieve the user-specified URL, gather all of the anchors out of that page, add those anchors to the grid, and then for each URL in the grid, attempt to retrieve the HTTP header information. Finally, the app marks the URL verified or not verified accordingly.
A lot of the code here is also used in Chapter 14, "WebSearcher: A Simple Search Tool." The link checker, as you will see, is a customized version of the Web search tool. There, instead of looking for the invalid link conditions, you look for a user-specified keyword.
The Declarations Section
The Declarations section contains the following code:
Option Explicit
Dim Conn_Done As Integer
Dim Grid_Pos As Integer
The Conn_Done variable is a flag used by the HTTP control to signal the end of an HTTP request. The Grid_Pos variable stores the row in the grid where the next URL checked will be inserted.
The AddAnchor Subroutine
The AddAnchor subroutine is used to add URLs to the grid control. The routine goes through the grid row by row making sure that the URL to be added doesn't already exist. If the URL doesn't exist in the grid the routine adds it to the grid. The code is shown in Listing 15.1.
Listing 15.1. AddAnchor subroutine.
Sub AddAnchor(sNewAnchor As String)
    Dim X As Integer
    For X = 1 To Grid_Pos
        Grid1.Row = X
        Grid1.Col = 0
        If Grid1.Text = sNewAnchor Then
            Exit Sub
        End If
    Next X
    Grid1.AddItem sNewAnchor
    Grid_Pos = Grid_Pos + 1
End Sub
The GetHostFromURL Function
This routine was first introduced in Chapter 5. It is taken from the dsWeb sample that ships with the Dolphin Systems dsSocket control discussed in that chapter. The function parses the host name from a URL and depends on the URL being valid.
The GetHostFromURL() function (Listing 15.2) retrieves the host name from the URL. The host name is the portion of the URL that occurs between the "//" and the first "/" character that follows it. If the "//" is not present, GetHostFromURL() considers the URL to be invalid and returns an empty string.
Listing 15.2. GetHostFromURL() Function
Private Function GetHostFromURL(szURL As String) As String
    ' parse out the hostname from a valid URL
    ' the URL should be of the format: http://www.microsoft.com/index.html
    ' the returned hostname would then be: www.microsoft.com
    Dim szHost As String
    Dim lPos%
    szHost = szURL
    ' invalid URL
    If InStr(szHost, "//") = 0 Then
        GetHostFromURL = ""
        Exit Function
    End If
    szHost = Mid(szHost, InStr(szHost, "//") + 2)
    lPos% = InStr(szHost, "/")
    If lPos% = 0 Then
        GetHostFromURL = szHost
        Exit Function
    Else
        GetHostFromURL = Left(szHost, lPos% - 1)
        Exit Function
    End If
End Function
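One plausible use of the host name is deciding whether a link is local to the site being verified, as the Local Only option requires. The sketch below is a guess at how that comparison might look, not the chapter's own code: IsLocalLink is a hypothetical name, it calls the GetHostFromURL() function from Listing 15.2 and so must live in the same form module, and it assumes every link is an absolute URL.

```vb
' Hypothetical helper: True when both URLs name the same host.
' Relies on GetHostFromURL() from Listing 15.2 being in this module.
Function IsLocalLink(sPageURL As String, sLinkURL As String) As Boolean
    Dim sPageHost As String, sLinkHost As String

    ' host names are not case sensitive, so compare in lowercase
    sPageHost = LCase$(GetHostFromURL(sPageURL))
    sLinkHost = LCase$(GetHostFromURL(sLinkURL))

    ' an unparseable URL yields "", which is never treated as local
    If Len(sPageHost) > 0 And sPageHost = sLinkHost Then
        IsLocalLink = True
    End If
End Function
```

For example, IsLocalLink("http://www.microsoft.com/index.html", "http://WWW.MICROSOFT.COM/kb/") returns True, while a link to any other host returns False. A relative link has no "//" and would be reported as not local, so such links would need to be resolved against the page URL first.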
