Adaptive Search
Red Adaptive Search
Everybody has a Knowledgesphere – what they know, and what they understand. Most of the time, it’s stuck in people’s heads. Red Adaptive Search combines a personal search engine, Knowledgebase, Information Gatherer and the intelligence to learn what the user wants. With this combination, users can extend and exchange their Knowledgespheres.
This section outlines what the adaptive search portion of Red Piranha can do. It is a fully working system that demonstrates the core capabilities and can be easily extended. It can be installed by a person with minimal knowledge of the Tomcat Web server, and used by anyone able to use a browser and the Google Search engine.
Problem
- You have the information, but you cannot find it.
- Information is in disparate data sources and systems.
- A lot of the knowledge of what counts as ’good information’ is in people’s heads.
1. Section Layout
1.1. Reminder of what it does
1.2. Sections a-b are business. Sections c-d are technical. It is done this way so that you can evaluate Red FC first, then decide to take the time to investigate further.
1.3. Tech Setup link
1.4. Start with business user overview
Update with simple / adaptive search
Business Case
Red-Piranha is an open source search system that can actually ’learn’ what you are looking for. It lets you go everywhere, find anything, understand everything.
Because it is open source, it can integrate with any system. Because you can use it as a web page, command line or XML WebService, it will work with most languages, including Java, Perl, C#/.Net and PHP. As a Java based program, it will run on any platform including Windows, Linux / Unix and Mac.
SUGGESTED USES
- Personal Search Engine for your Desktop (Windows, Linux and Mac).
- Intranet Search Engine - Search your Company or College Intranet.
- Part of your Development Project - Have search abilities up and running in a few minutes.
- To provide Search facilities on your website.
- As a P2P search engine.
- In conjunction with a wiki, as a knowledge / document management solution.
- Scan a set of websites for the data you want (e.g. search job sites on an hourly basis).
- Explore the Semantic web using RDF.
- Search RSS feeds for the information you want.
- Search your company’s systems (including SAP, Oracle or any other Database / Data source).
- Provide a back end for searching in your App (Web, Swing, SWT, Flash, Mozilla-XUL, PHP, Perl or even C#/.Net).
- Document Management for PDF, Word and other Docs.
- As a Webservice to provide search information.
- As a command line tool, to give searching power to your scripts.
- Provide a Search facility for your project documentation.
Red-Piranha allows you to search information with a minimum of effort. With a little effort, it can search *anything*, including Oracle Databases, XML Webservices (including Java/J2EE and .Net webservices), RSS and even Web based XML feeds such as Google and Amazon.
GETTING STARTED
First of all download Red-Piranha from here. If you’re not sure which one you want, download the ready-to-deploy (bin) file. The other files contain either the ready-to-deploy plus full source (bin_src_lib) or the source only (src).
If you have not already installed Java and Tomcat, you can get them from the Sun Java and Apache Tomcat websites. Red-Piranha should work with Java 1.3 and Tomcat 3, although we recommend Java 1.4 or higher and Tomcat 4/5.
Unzip the file you have downloaded - there should be a file called RP.war. Copy this file into the ’webapps’ folder of your Tomcat. Within a few seconds you should see a new folder called ’RP’ created.
Congratulations - your copy of Red-Piranha has now deployed and is ready to use.
USING RED-PIRANHA
To use Red-Piranha, open your favourite web browser and point it at http://localhost:8080/RP . Within a few seconds, you should see the Red-Piranha start screen. This will have three items of interest:
- A Text box , where we enter the information to add or search
- An ’add information’ button - to tell Red-Piranha about new information
- A ’Search’ button - to carry out a search.
Before we can search, we must tell Red-Piranha what information we are interested in. This is as easy as putting the piece of information we want to add (e.g. the folder c:\temp\) in the search box and pressing the ’Add information’ button. A message will be displayed saying that your information is being added and will be available to search shortly. For more information, look in the logs at TOMCAT_HOME\Webapps\RP\logs\rp.log
Examples of things we can add to Red-Piranha are:
- A folder (e.g. C:\Temp\). All files in both this folder and *all* its subfolders will be added.
- An individual file. This file can be text, a web page, a Word document or a PDF document. For binary files (like Word, which are not plain text), Red-Piranha will scan the file for recognizable text and add that.
- A Web page. Red-Piranha will add this web page, *and* web pages it links to.
- A Google Search (e.g. http://www.google.com/search?q=some+thing&num=100). Red-Piranha will get the results of the Google search, and add information on the pages it links to.
- An XML file (including RSS feeds), either on disk or over the web.
- Favourites / Bookmarks folders - Red-Piranha will index the web pages that these favourites point to.
Adding information can take anything from a few milliseconds upwards, depending on the amount of information being added. Once added, Red-Piranha will check on a regular basis to see if the information added has changed and re-index if required. Your information is now available to be searched.
To do a search, put the item you want to search for into the textbox and press ’Search’. Red-Piranha will show the search results on the screen. Clicking on the link beside the search results will show you the original information (as long as you have access to it).
From version 0.3 onwards, Red-Piranha can ’learn’ what search results you are interested in and improve your future searches. To give Red-Piranha feedback and help it ’learn’ what you are interested in, click on any of the links on the ’search results’ page. Red-Piranha makes a note of your choice, which is used to adjust the search results later.
Using in the Enterprise
1.5. How it can be extended
1.6. How it can link to the other samples
Differences between simple and adaptive
- feedback
- simpleCategoryManager v CategoryManager
- BareCategoryStore v BasicCategoryStore
1.7.
Screenshot – Default Search Screen
Screenshot – Add Information
Screenshot – Search Results
RUNNING RED-PIRANHA SEARCH
Get the samples binary (as per...)
The steps below can all be done from the command line. If you are using the command line, we’ll presume you know what you’re doing and are able to take the information required from the Eclipse notes below. The process will be similar for other IDEs (Netbeans, Websphere developer).
Troubleshooting
Security Notes
- For this simple deploy, there are no restrictions on who can add items to be searched.
Security on documents found during a search is managed outside of the RP application
Red Adaptive Search and Web 2.0
- Sharing of user knowledge
- Social Networking (and tools)
- Service Orientated Architecture
- Demonstration of mashup
- Searching RDF
- User adding value by his / her actions
- Knowledge Management
- Unconventional data sources and adaptation to user preferences.
Deployment
- Install of Red-Piranha Project (http://red-piranha.sourceforge.net)
- Does not have Web 2.0 interfaces – but the focus is elsewhere
- User guide: how to use the system to learn
- Developer guide: how to extend the system to capture new information
- (i) How to set up the Engine against existing data sources, for example an HTML / wiki based knowledge base (ii) How to extend it to handle new sources of data, such as RDF and RSS.
1.8. Download as per chapter x
1.9. Build ant file
1.10. Deploy Deploy the attached RP.war onto your WebServer
- For Tomcat, this is a copy into the webapps directory
1. Open the RP Web page in a Browser
- For Tomcat on your local PC this is http://localhost:8080/RP
2. Add the directory that you wish to search later (e.g. C:\Temp\ or http://www.iib.ie)
1. Wait a few moments to allow for indexing
1.11. Go ahead and Search!
Technical – Behind the Scenes
Here is a list of what Red-Piranha uses under the covers.
- Spring - a J2EE lite framework that gives us a lot of functionality
- Tomcat - the Java Web Server we can run in.
- Lucene - the Apache project to give a searching and indexing engine.
- RDF / XML - Jena , to store all our information in RDF (aka the Semantic Web)
- Xerces - for XML manipulation
- PDF box - for reading of PDF documents.
- Rp Core (to give ...)
Alternative runtime configurations
How to build and run (or just use simple tests ...)
- on web
- from command line
- in eclipse
- ADD Run ... rpCommand line ... <arguments tab> (arguments). ADD C:\Temp\searchdata .... (working directory) ${workspace_loc:red-piranha/red-adaptive-search/war}
- SEARCH Run ... rpCommand line ... <arguments tab> (arguments). SEARCH enterprise .... (working directory) ${workspace_loc:red-piranha/red-adaptive-search/war}
- Otherwise the Spring appcontext will fail to load
- simple version build(for web)
Project Aims
The main items that Red Piranha adaptive search currently provides
- All necessary source code, including Java, configuration files, JUnit tests and build scripts. All files are marked with a copyright notice as per appendix C.
- The project can be built and deployed (using build scripts from 2.1) on Apache Tomcat, Java 1.3, running on Windows XP / 2000 / NT and Redhat Linux (Fedora 2).
- Once deployed on these platforms, all User Stories in this document (Section 3) can be carried out using the system.
- The system works with users running Internet Explorer versions 4/5/6 and Firefox 1.0 (Section 4 and Appendix B).
User Stories
The user stories list the different ways in which the user can interact with the search application.
- Story: Application Start
The steps taken when the application is first deployed (Tomcat Hot Deploy) or when Tomcat is (re)started. No user output, only to log files.
(START) Tomcat is Started
- Application loads the plugins as stated in PluginManager
- Get all Classes implementing IPlugin Interface from
- rp.war (the war file that contains the RP application)
- Plugins Directory (as specified in directory structure in Section 7)
- For each Plugin that has been loaded
- Start a Background thread.
- Call the onLoad method on each plugin
(END)
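The last two steps of this story (one background thread per plugin, each calling onLoad) could be sketched roughly as below. This is an illustration only, written for a current JDK: `SketchPlugin` and `PluginStartupSketch` are hypothetical names standing in for IPlugin and the PluginManager startup code, and plugin discovery is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the IPlugin interface; onLoad() follows the story above. */
interface SketchPlugin {
    void onLoad();
}

class PluginStartupSketch {

    /** Starts one background (daemon) thread per plugin and calls onLoad() on it. */
    static List<Thread> startAll(List<SketchPlugin> plugins) {
        List<Thread> threads = new ArrayList<>();
        for (SketchPlugin plugin : plugins) {
            Thread t = new Thread(plugin::onLoad, "plugin-onLoad");
            t.setDaemon(true); // assumption: plugin threads should not block server shutdown
            t.start();
            threads.add(t);
        }
        return threads;
    }
}
```

The threads are returned so that a caller (or a test) can wait for them; the startup story itself does not wait.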
- Story: Show Search Page
The user opens the default url : http://localhost:8080/rp
- User opens page in browser
- Show search screen
- Story: Add Information
Details how the user can add information to the system
(Start) user presses ’Add Information’ button
Get list of Plugins implementing IInterestedInAdd from Plugin Manager.
For Each Plugin …
- Start low priority thread
- call add() on interface
Return to Search screen, showing the message "You can continue to search while we add your Information"
Examples of resources / information that can be added to the system are
- Local Directory in the format C:\SomeDir\SomeSubDir – or other drive letter.
- Local File in the format C:\SomeDir\SomeSubDir\Somefile.extension
- Remote file in format http://someurl/somedir/somepage
- Special files (local or remote) e.g. *.xml , *.html , *.rss
- Text Files and Binary Files (e.g. *.doc *.pdf)
- Add the url of another (remote) RP application. This will (1) do the search on the remote RP and (2) add the search results (html page) to the (local) Knowledgebase.
- Add the url of a Google search, to index the Google search results page.
- Add a local directory containing bookmarks (IE / Mozilla format)
- Add a local directory containing History (IE / Mozilla format)
- Story: Normal Search
Details how the user can search for information in the Knowledgebase
User enters a search term and presses the ’Search’ button. The search term can be simple, e.g. (java j2ee x), or as complex as Lucene allows (e.g. java AND j2ee NOT xml).
Get Search Results
- Get list of Plugins implementing IInterestedInSearch from Plugin Manager.
- For Each Plugin returned…
- Start low priority thread
- call search() on interface
- loop until either ’isReady’ returns true or reaches timeout
- Timeout set in global / plugin properties file
- call getResults() to get search results
- Combine into Collection of Search Results
- If no results throw RP exception (to display error message on search page)
Filter Search Results
- Get the preferred Plugins implementing IInterestedInFilter from Plugin Manager / as set in config. file.
- For phase 1, this is BasicIntelligence, or its delegates.
- Use this class to sort search results
Display search results
Display search results(Sample search results).
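The "Get Search Results" steps above (start a low priority thread per plugin, poll isReady until a timeout, then combine the results) might look something like the following. `SearchPluginSketch` is a hypothetical stand-in for IInterestedInSearch. The results are plain strings, the 10 ms polling interval is arbitrary, and a standard exception stands in for RPException; all of these are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for IInterestedInSearch; method names follow the story above. */
interface SearchPluginSketch {
    void search(String query);   // starts the search
    boolean isReady();           // true once results are available
    List<String> getResults();   // results as plain strings, for illustration only
}

class SearchGatherSketch {

    /** Runs every plugin, waits up to timeoutMillis overall, and merges what arrived. */
    static List<String> gather(String query, List<SearchPluginSketch> plugins, long timeoutMillis) {
        for (SearchPluginSketch plugin : plugins) {
            Thread worker = new Thread(() -> plugin.search(query));
            worker.setPriority(Thread.MIN_PRIORITY); // low priority, as in the story
            worker.start();
        }
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> combined = new ArrayList<>();
        for (SearchPluginSketch plugin : plugins) {
            // Loop until the plugin is ready or the timeout is reached.
            while (!plugin.isReady() && System.currentTimeMillis() < deadline) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            if (plugin.isReady()) {
                combined.addAll(plugin.getResults());
            }
        }
        if (combined.isEmpty()) {
            // The story throws an RPException here; a standard exception stands in for it.
            throw new IllegalStateException("No results found for: " + query);
        }
        return combined;
    }
}
```

Plugins that miss the deadline are simply skipped, so one slow data source does not hold up the results page.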
- Story: Feedback from Search Results
How the user can help RP ’learn’ what he or she wants. Subsequent searches return different results in line with what the user requests here.
(Start) The user clicks on one of the feedback links / buttons on the screen to trigger feedback. These are detailed in Appendix B, but examples are:
- (1) Search query (associates terms like Java J2ee together)
- Search result (main url link) clicked on
- Positive feedback (I like this)
- Negative feedback (not for me)
- (2)More from this category
- Category X Use More | Use less
Get the plugins implementing IInterestedInFeedback as defined in the global properties file. (This will be BasicIntelligence, which then uses other classes as required for phase 1.)
- Call giveFeedback / update on the interface, passing in the feedback.
- A note of the user feedback is made in FeedbackDatastore.
- The BasicIntelligence class update() method does a quick adjustment of the score.
- When the update method completes, the original search is run again and the results are displayed.
Note (1): The original search (as per user story 3.4) automatically triggers feedback and a (re)search; the user is unaware of having given feedback.
Note (2): After giving this feedback, only search results coming from the category that the user clicked on will be displayed. These can be identified by Category name, which should be stored via the BasicIndex class.
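As a rough illustration of the "quick adjustment of score" idea, the sketch below keeps a running adjustment per result url and re-orders results after each piece of feedback. The +1 / -1 rule and the class name `FeedbackSketch` are assumptions made for this example; the document does not say how BasicIntelligence actually weights feedback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FeedbackSketch {
    // Accumulated adjustment per result url; a stand-in for the FeedbackDatastore.
    private final Map<String, Integer> adjustments = new HashMap<>();

    /** Records one piece of feedback: positive raises a result, negative lowers it. */
    void giveFeedback(String url, boolean positive) {
        adjustments.merge(url, positive ? 1 : -1, Integer::sum);
    }

    /** Returns the urls ordered by base score plus adjustment, highest first. */
    List<String> rerank(Map<String, Integer> baseScores) {
        List<String> urls = new ArrayList<>(baseScores.keySet());
        urls.sort(Comparator.comparingInt(
                (String u) -> baseScores.get(u) + adjustments.getOrDefault(u, 0)).reversed());
        return urls;
    }
}
```

Calling rerank() again after giveFeedback() corresponds to the last step of the story, where the original search is repeated and the adjusted results are shown.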
- Story (Exceptions)
- What to do when something goes wrong
(Start)
If an RPException / other Exception is thrown:
3. If it is an RPException, see if it has details of a UserFriendlyMessage(), then log and display it.
4. If it is another type of exception, log the details and display a generic error message to the user. The generic error message can be configured via the global config file.
(End)
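The two branches of this story can be sketched as a single helper that logs the error and picks the text to show. `RpExceptionSketch` is a hypothetical stand-in for RPException (its real constructor and accessor names may differ), and the generic message is passed in as a parameter rather than read from the global config file.

```java
import java.util.logging.Logger;

/** Hypothetical stand-in for RPException: a chained exception with a user friendly message. */
class RpExceptionSketch extends Exception {
    private final String userFriendlyMessage;

    RpExceptionSketch(String userFriendlyMessage, Throwable cause) {
        super(userFriendlyMessage, cause);
        this.userFriendlyMessage = userFriendlyMessage;
    }

    String getUserFriendlyMessage() {
        return userFriendlyMessage;
    }
}

class ErrorMessageSketch {
    private static final Logger LOG = Logger.getLogger("rp");

    /** Logs the error and returns the message that should be displayed to the user. */
    static String messageFor(Throwable error, String genericMessage) {
        LOG.warning(String.valueOf(error)); // both branches log the details
        if (error instanceof RpExceptionSketch) {
            String friendly = ((RpExceptionSketch) error).getUserFriendlyMessage();
            if (friendly != null && !friendly.isEmpty()) {
                return friendly;
            }
        }
        return genericMessage; // any other exception type gets the generic text
    }
}
```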
User Interface
Screens
Search Screen – bare
Search Screen – with results / allowing for feedback.
- Browser Output is
- HTML output is to be compatible with IE 4/5/6 and Mozilla Firefox 1.0 upwards.
- HTML pages use no JavaScript.
- HTTP Post / Get Info
- All interaction with the browser is by HTTP GET, so that params form part of the url visible in the address bar of the browser.
- Bookmarking a url (used to access the RP application) and recalling it later causes RP to do the same search.
- Adding the url of another (remote) RP application causes (1) the remote RP to do the search and (2) the local RP to add the search results to its knowledgebase.
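Because every interaction is an HTTP GET, a search can be captured entirely in a url. The sketch below builds such a url; the parameter name `q` and the use of the application root as the search address are assumptions, since the actual parameter names are not given here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlSketch {

    /** Builds a bookmarkable GET url; the parameter name "q" is an assumption. */
    static String searchUrl(String baseUrl, String query) {
        try {
            return baseUrl + "?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `SearchUrlSketch.searchUrl("http://localhost:8080/RP", "java AND j2ee")` gives `http://localhost:8080/RP?q=java+AND+j2ee`, which could be bookmarked or added to another RP instance.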
- Java API
- All the functionality of the system is available via a Java API (the main class being KnowledgeBase manager).
- 3rd party programs can use the RP application as a library via this API. The Javadoc that is provided as part of the product on the KnowledgeBase manager class gives full instructions on how to interact with the system in this manner.
- Command Line
All the functionality defined for the HTML interface is also available via the command line. A full readme file is available at xxxx giving details of how to drive the RP system via the command line.
Core Classes, Interfaces and Concepts
- Plugins are the means by which the system can be easily extended. Plugins are dynamic in that they are discovered and reloaded at runtime (i.e. when the system starts). This section defines the various interfaces that a plugin implements.
- The main plugin interfaces are:
- IPlugin – marks a class as being a plugin.
- IInterestedInAdd – register to be notified when new info is added.
- IInterestedInFeedback – register to be notified when the user gives feedback.
- IInterestedInResultsFilter – register as being able to sort and filter search results.
- IInterestedInSearch – register as being able to carry out a search.
- Other (utility) plugins are:
- IDataExtractor
- IIndexManager
- Concrete Implementations of Interfaces
- The following concrete classes are used in managing plugins that implement these interfaces.
- KnowledgeSphereManager
First point of contact for the RP System, and the point at which all the user interfaces converge (it is the controller in the MVC pattern). It provides access to all the RP core functionality. As such it does things such as catching exceptions, managing threads, etc.
- PluginManager
Responsible for locating and loading plugins. On Application startup (inc Deploy of rp.war)
- Search for classes implementing IPlugin in rp.war
- Search for classes in Plugins Directory (specified in Section 7)
- If no plugins found , log the reason, throw RPException.
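One way to picture the "search for classes implementing IPlugin" step is a filter over already-loaded candidate classes, as below. How the candidates are found in rp.war or the Plugins Directory is not covered, and `PluginFilterSketch` is a hypothetical name. A standard exception stands in for RPException.

```java
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class PluginFilterSketch {

    /** Keeps the concrete classes that implement (or extend) the given marker type. */
    static List<Class<?>> pluginsAmong(List<Class<?>> candidates, Class<?> marker) {
        List<Class<?>> found = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            boolean concrete = !candidate.isInterface()
                    && !Modifier.isAbstract(candidate.getModifiers());
            if (concrete && marker.isAssignableFrom(candidate)) {
                found.add(candidate);
            }
        }
        if (found.isEmpty()) {
            // The text above logs the reason and throws RPException at this point.
            throw new IllegalStateException("No plugins implementing " + marker.getName());
        }
        return found;
    }
}
```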
- The Diagram below outlines how plugins relate to each other.
| <<UI>> | Programmatic (Java API), Command Line, HTML (Servlet) |
| <<Singleton>> | KnowledgeSphere Manager | <- 1..1 -> relation | PluginManager, CategoryManager |
| <<IPlugin>> | L-> Core Plugins | L-> Utility Plugins: IDataExtractor, IIndexManager |
- Other Interfaces in System
- These interfaces are not exposed externally (unlike the plugin interfaces) but are used internally to ensure a good, configurable design.
- ICategory – Basic unit of info; many categories make up the database.
- IFeedback – Feedback is how the user teaches the System.
- IBasicCategoryStore – Persistent storage of Data as part of the system.
- INewInformation – items that the user is adding to the RP system.
- ISearchQuery – something the user wants to find.
- ISearchResult – what RP finds in response to a search query.
- Plugins Implementing the following interfaces:
- IInterestedInAdd:
- CategoryManager, using
- IDataExtractor, sees which concrete implementation can handle this type of data
- BasicCategory
Handle to the IDataExtractor that formed it
Saves Data using BasicCategoryStore.
- IInterestedInFeedback:
- BasicIntelligence uses
- BasicCategory
- BasicCategoryStore
- BasicIndex
- FeedbackDataStore
- IInterestedInResultsFilter:
- BasicIntelligence
- IInterestedInSearch:
- BasicIndex uses
- BasicCategory
- BasicCategoryStore
- IDataExtractor:
- FileDataExtractor
- XmlDataExtractor
- UrlDataExtractor
- WebQueryDataExtractor
- IIndexManager:
- BasicIndex uses BasicCategoryStore
- Other Core Classes in the System
- RPException – Extensible / Chained Exception for the RP System. Contains a user friendly message (for example, how to display errors as per Screen 2, Appendix B)
- RPCommandLine – Command line entry point to the RP system
- RP Struts Classes needed to implement HTML interface.
Basic Plugin Implementations
- The previous section detailed the interfaces by which a plugin could extend the system. This section details the plugins currently implemented and supplied as part of phase 1.
- Additional / modified classes needed for the system to function as specified are also provided.
- Where background processes are specified , their priority can be set via the config files.
- User Events
User events and the (main) classes that handle them are:
- Add Information
- CategoryManager (delegating to Categories)
- Search
- BasicIndex
- Feedback
- FeedbackDatastore (and BasicIndex to update)
- Startup (onLoad)
- CategoryManager (refreshing / updating Categories)
- BasicIntelligence (relinking / rescoring Category and FeedbackDataStore information)
- BasicIndex – reindexing updated information
- Category Manager
onLoad() method
- Get all known Categories
- Check disk in Dir (section 7) and load all the Categories found there.
- Persistence Mechanism (uses BasicCategoryStore )
- Refresh the Category Data (as Background process)
- For each Category found
- Get data as per add() method below , save into tmp category. (the original url given by the user is stored in the category , so calling add() again is easy)
- When ready , copy tmp category over old Category.
- Notify BasicIntelligence to rescore
- Notify BasicIndex to reindex
add() method
- Get a list of all available IDataExtractor plugins from PluginManager.
- If no IDataExtractor plugins are returned, throw RPException.
- For each IDataExtractor
- call the canHandle() method and make a note of the int value returned
- Using the IDataExtractor that returned the highest int value
- Construct a new BasicCategory class, passing in IndexManager (one for the entire RP app), BasicCategoryStore (as IMetaDataStore) and the IDataExtractor
- Call the construct() method to start conversion from Information pulled by DataSource to Data as stored by BasicCategoryStore (using a common class / interface produced by DataExtractor, consumed by BasicCategoryStore).
- BasicCategoryStore also stores Category info, such as the IDataExtractor that created it, the URL provided during ’Add information’, etc.
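The selection step in add() (ask every extractor, keep the one with the highest canHandle() value) can be sketched as follows. `ExtractorSketch` is a hypothetical, cut-down stand-in for IDataExtractor that takes a plain String instead of the real new-information type, and a standard exception stands in for RPException.

```java
import java.util.List;

/** Hypothetical stand-in for IDataExtractor; only canHandle() is sketched. */
interface ExtractorSketch {
    int canHandle(String newInformation); // higher is better, -1 means "cannot handle"
}

class ExtractorChooserSketch {

    /** Returns the extractor with the highest non-negative canHandle() value. */
    static ExtractorSketch choose(String newInformation, List<ExtractorSketch> extractors) {
        if (extractors.isEmpty()) {
            throw new IllegalStateException("No IDataExtractor plugins available");
        }
        ExtractorSketch best = null;
        int bestScore = -1;
        for (ExtractorSketch candidate : extractors) {
            int score = candidate.canHandle(newInformation);
            if (score > bestScore) { // ties keep the first extractor seen
                best = candidate;
                bestScore = score;
            }
        }
        if (best == null) {
            throw new IllegalStateException("No extractor can handle: " + newInformation);
        }
        return best;
    }
}
```

With a generic file extractor that always returns 1 and an XML extractor that returns 5 for `.xml` names and -1 otherwise, `feed.xml` goes to the XML extractor while `notes.txt` falls back to the file extractor.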
- Data Extractor implements IDataExtractor
- Basic Tasks
- Recognises whether it can or cannot handle a new piece of information
- Converts the original data format into a format that can be stored by BasicCategoryStore (e.g. as Nodes / Tuples)
- If adding a file / piece of information with the same name, just create a new category with this info (e.g. SameName and SameName1)
- Methods
- canHandle(INewInfo as added by user)
- returns an int depending on how suitable it is to handle the information (or –1 if it cannot handle the info)
- addData (INewInfo as added by user)
- extract data from the data source into / convert to nested tuple class.
- Where possible data extractors should be configurable using local / global properties files e.g. the amount of data per Node/ tuple after parsing.
- Some sample DataExtractor implementations are below. Additional / modified implementations may be needed to fully implement phase 1.
- File Data Extractor
- Handles generic text files.
- canHandle(INewInfo) - returns 1 if it can open the file using the standard Java File() object, -1 if it cannot
- Converts the text file into Tuples / nodes as follows: (object, subject, relation)
- 1st Pass: Anything like a URL, convert to Keyword=Name, Value=href
- follow to one level (the index file found at this url, but do not follow any of the links therein)
- 2nd Pass:
- Tokenise files into words: groups of letters with characters A-z, 0-9, -
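The second-pass tokenising rule can be illustrated with a regular expression. This sketch reads "A-z" as the letters A-Z and a-z (an assumption) and treats any run of letters, digits and hyphens as one word; `TokeniserSketch` is a hypothetical name and not part of the FileDataExtractor itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {
    // A word is a run of letters, digits or hyphens (reading "A-z" as A-Z plus a-z).
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9-]+");

    /** Splits text into words; every other character acts as a separator. */
    static List<String> tokenise(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

Under this rule "Red-Piranha" stays one word, while "1.4" splits into "1" and "4" because the full stop is not in the allowed set.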
