Biobots description
From EdwardsLab
Bots for Bioinformatics
This is a new research area at the interface of bioinformatics, semantic web technology, and human-computer interaction. We will design and implement several "Bots for Bioinformatics". These bots will be able to answer simple queries and retrieve standard data. The bots will be semi-autonomous and will update regularly.
Note that this has been heavily influenced by my recent interaction with Sandy. It's an email scheduling service that I really like, in part because of the way the emails are phrased, and in part because it works!
Basic Details
- Instantiate a web server and database that allows a user to register with a name, email address, and phone number
- Create a simple web form that allows a user to submit a sequence ID. Retrieve the following information about the ID:
  - Using the SEED published WSDL
    - List of aliases
    - Function
    - Protein sequence
  - Based on the aliases
    - Function at other sources (TIGR, NCBI, etc.)
  - Based on the protein sequence
    - BLAST results from NCBI
- Create a summary of the above data in XML initially, and then present the XML to the end user.
- Provide the service as a WSDL with web services so that the user can make a request in the form of a URL and retrieve the XML
- Store the results in the database, including a timestamp to note the last time the searches were run
- Allow the searches to be automatically updated at given time intervals, but also to be frozen, deleted, and copied to a new search.
- See the databases section below for a suggested schema
- Databases must be generalizable to allow easy extension of new features
- Email and text (SMS) a short version of the summary to the user. The summary email is described below.
This provides the basic usability of a bot that will go out and search for data at routine intervals. We can add tables to include other data as required.
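As a rough illustration of the XML summary step, here is a minimal sketch using Python's standard library. The element names (`biobot_summary`, `seed`, `other_sources`, `blast`) and the shape of the input dictionaries are assumptions made for this example; nothing above fixes the XML layout.

```python
import xml.etree.ElementTree as ET


def build_summary_xml(query_id, seed, external_functions, blast_hits):
    """Return an XML string summarising the results for one sequence ID.

    All element and attribute names here are hypothetical placeholders.
    """
    root = ET.Element("biobot_summary", id=query_id)

    # Data retrieved from SEED: aliases, function, protein sequence.
    seed_el = ET.SubElement(root, "seed")
    ET.SubElement(seed_el, "function").text = seed["function"]
    ET.SubElement(seed_el, "protein_sequence").text = seed["sequence"]
    aliases_el = ET.SubElement(seed_el, "aliases")
    for alias in seed["aliases"]:
        ET.SubElement(aliases_el, "alias").text = alias

    # Functions reported by other sources, looked up via the aliases.
    sources_el = ET.SubElement(root, "other_sources")
    for source, function in external_functions.items():
        ET.SubElement(sources_el, "function", source=source).text = function

    # Identifiers of BLAST hits found with the protein sequence.
    blast_el = ET.SubElement(root, "blast")
    for hit in blast_hits:
        ET.SubElement(blast_el, "hit").text = hit

    return ET.tostring(root, encoding="unicode")
```

The same string could be stored in the database and returned by the web service, so the web form and the URL interface would share one representation.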
Email Service
Instantiate an email service at request@biorobots.org
Note that we also need an email service at help@biorobots.org that goes to a human or several humans
Initially the goal is to be able to receive emails, parse them for specific items, and take actions on those items. The actionable items at the moment should be:
- search for term
- get results for term
- get results for search #
- rerun search #
- delete search #
- freeze search #
- copy search #
In these cases, term should be an ID, like a FIG id, GI, or something similar, and # should be a previous search. The small words (for, results for search, search, etc.) should be optional. Surrounding words, as in "hey, would you get me the results for xxx", should also be ignored, to allow natural language queries.
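One possible way to recognise these commands is a single case-insensitive regular expression that finds the verb, skips the optional small words, and takes the next token as the target. This is only a sketch: the `Command` type, the list of filler words, and the rule that "get" refers to a previous search only when the word "search" is present are assumptions for this example, since a bare number could otherwise be either a GI or a search number.

```python
import re
from typing import NamedTuple, Optional


class Command(NamedTuple):
    action: str  # search, get, rerun, delete, freeze or copy
    kind: str    # "term" for an identifier, "search" for a previous search number
    target: str


_COMMAND_RE = re.compile(
    r"\b(?P<verb>search|get|rerun|delete|freeze|copy)\b"
    r"(?P<filler>(?:\s+(?:me|the|results?|for|search|#))*)"  # optional small words
    r"\s+#?(?P<target>\S+)",
    re.IGNORECASE,
)


def parse_command(text: str) -> Optional[Command]:
    """Return the first command found in text, or None if there is none."""
    match = _COMMAND_RE.search(text)
    if match is None:
        return None
    verb = match.group("verb").lower()
    # Drop sentence punctuation that may trail the identifier.
    target = match.group("target").rstrip(".,;:!?\"'")
    if not target:
        return None
    if verb == "search":
        kind = "term"
    elif verb == "get":
        # Assumption: "get ... search 12" means a previous search,
        # anything else is treated as an identifier.
        kind = "search" if "search" in match.group("filler").lower() else "term"
    else:
        kind = "search"
    if kind == "search" and not target.isdigit():
        return None
    return Command(verb, kind, target)
```

With this sketch, `parse_command("Hey, would you get me the results for xxx?")` gives `Command("get", "term", "xxx")` and `parse_command("RERUN 7")` gives `Command("rerun", "search", "7")`. Emails that yield `None` could be forwarded to the human-staffed help address.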
Databases
User Information
The user information needs to contain the pertinent user information and whether they have validated their email address/phone number
- Userid int, primary key, unique
- Username varchar(255)
- Password varchar(255) or password
- Email address 1 varchar(255)
- Email address 2 varchar(255)
- Phone number int or varchar(12)
- Carrier
- Validated email1 boolean
- Validation code for email1 varchar(10)
- Validated email2 boolean
- Validation code for email2 varchar(10)
- Validated phone boolean
- Validation code for phone varchar(10)
- Timestamp created (int/date)
The validation code needs to be a 5-10 digit code that is sent via email, SMS, etc. There should be a reset function to allow resetting the password provided a valid username and email are supplied. Only Username, password and email1 are required. Userid is internally generated (as are the validation codes). The phone number and carrier are there so that we can start using an SMS service to send results.
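Generating the validation code could look something like the sketch below. The function names and the default length of 8 are assumptions; the only requirement taken from the text is a code of 5 to 10 digits.

```python
import secrets


def make_validation_code(length: int = 8) -> str:
    """Return a random numeric code of the given length (5 to 10 digits)."""
    if not 5 <= length <= 10:
        raise ValueError("length must be between 5 and 10")
    # secrets draws from the operating system's secure random source.
    return "".join(secrets.choice("0123456789") for _ in range(length))


def codes_match(stored: str, submitted: str) -> bool:
    """Compare the stored code with what the user typed, ignoring stray spaces."""
    return secrets.compare_digest(stored, submitted.strip())
```

The code is kept as a string rather than an int so that leading zeros survive, which matches the varchar(10) columns in the table above.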
Query Information
This information should just capture the information submitted by the user:
- InternalID int, primary key, unique
- UserID int
- Query varchar(255)
- Frequency to update search int
- Timestamp submitted (int/date)
- Timestamp of last search (int/date)
- Notify Email1 boolean
- Notify Email2 boolean
- Notify Mobile boolean
- Deleted boolean
- Freeze boolean
For the timestamps we should use Unix time, and for the frequency we should use seconds. We can then use SQL of the form select InternalID, Query from <query table> where timestamp_of_last_search + frequency < current_time to retrieve those searches that need to be repeated. On the form, however, we will only accept searches every n days (e.g. daily, weekly, monthly).
The freeze is a switch to allow a user to start/stop the search at will. If it is set, the search will not be run anymore.
The Notify {email1, email2, mobile} fields determine where results are sent. If nothing is checked (a possibility), the search is run but no notification is sent.
Deleted is a boolean. If it is set, the search is considered deleted and will not show up on the results page.
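To make the scheduling query concrete, here is a small sketch using SQLite. The table and column names (`query_information`, `last_search`, and so on) are assumed spellings of the fields listed above, and skipping deleted searches as well as frozen ones is an assumption; the text only says that deleted searches are hidden from the results page.

```python
import sqlite3
import time

# Hypothetical SQLite rendering of the Query Information table.
SCHEMA = """
CREATE TABLE query_information (
    internal_id   INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    query         VARCHAR(255) NOT NULL,
    frequency     INTEGER NOT NULL,   -- seconds between runs
    submitted     INTEGER NOT NULL,   -- unix time
    last_search   INTEGER NOT NULL,   -- unix time
    notify_email1 BOOLEAN DEFAULT 0,
    notify_email2 BOOLEAN DEFAULT 0,
    notify_mobile BOOLEAN DEFAULT 0,
    deleted       BOOLEAN DEFAULT 0,
    freeze        BOOLEAN DEFAULT 0
)
"""


def due_searches(conn, now=None):
    """Return (internal_id, query) for every search that should be rerun."""
    now = int(time.time()) if now is None else now
    rows = conn.execute(
        "SELECT internal_id, query FROM query_information "
        "WHERE last_search + frequency < ? AND freeze = 0 AND deleted = 0 "
        "ORDER BY internal_id",
        (now,),
    )
    return rows.fetchall()
```

A periodic job could call `due_searches`, rerun each query, and then update `last_search` to the current time. A daily search from the web form would be stored with a frequency of 86400.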
Basic Results Information
- InternalID int, primary key, unique
- QueryID (equal to InternalID in table Query Information) int
- List of SEED aliases varchar(255)
- SEED Function varchar(255)
- Protein Sequence varchar(255) or pointer to a file
- Timestamp last retrieved (int/date)
Extended Results Information
- InternalID int, primary key, unique
- QueryID (equal to InternalID in table Query Information) int
- Subquery ID (equal to InternalID in table Basic Results Information) int
- Alias used in search (parsed from list in aliases in table Basic Results Information) varchar(255)
- Database searched varchar(50)
- URL used varchar(255)
- Function varchar(255)
- Other data retrieved from NCBI etc records??
- Timestamp (int/date)
BLAST Results Information
- InternalID int, primary key, unique
- QueryID (equal to InternalID in table Query Information) int
- Database searched varchar(50)
- URL used varchar(255)
- File of BLAST results pointer to a file
- Top 20 hits from BLAST search varchar(255) [use ; separated list of identifiers]
- Timestamp (int/date)
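The ";"-separated list of top hits is simple to read and write, but 20 identifiers may not always fit in varchar(255). A sketch of helpers that make the limit explicit follows; the function names and the choice to raise an error rather than truncate are assumptions.

```python
MAX_HITS = 20      # number of top hits kept, as in the table above
MAX_LENGTH = 255   # size of the varchar column


def pack_hits(hit_ids):
    """Join the top hit identifiers into one ';'-separated string."""
    top = list(hit_ids)[:MAX_HITS]
    for hit in top:
        if ";" in hit:
            raise ValueError(f"identifier contains the separator: {hit!r}")
    packed = ";".join(top)
    if len(packed) > MAX_LENGTH:
        raise ValueError("packed hits do not fit in varchar(255)")
    return packed


def unpack_hits(packed):
    """Split a stored string back into a list of identifiers."""
    return packed.split(";") if packed else []
```

If the length check fails often in practice, a wider column or a separate hits table keyed on the BLAST result's InternalID may be the better fit.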
Google Results
- InternalID int, primary key, unique
- QueryID (equal to InternalID in table Query Information) int
- Database searched varchar(50)
- URL used varchar(255)
- Some results from Google - we need to explore the google api for this
Summary Email
The user should set the level of warning that they want:
- Major change only (new SEED alias, new function)
- Change in secondary database (new function,
- New BLAST hits
- Any new data (warn that may occur frequently).
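The warning levels could be modelled as an ordered scale, as in this sketch. It assumes the levels are cumulative (a user who asks for new BLAST hits also gets major and secondary-database changes), which the list above suggests but does not state. The dictionary keys are hypothetical names for the stored results.

```python
from enum import IntEnum


class WarningLevel(IntEnum):
    MAJOR = 1      # new SEED alias or new SEED function
    SECONDARY = 2  # change in a secondary database
    BLAST = 3      # new BLAST hits
    ANY = 4        # any new data at all


def classify_change(old, new):
    """Return the most important kind of change between two result sets.

    Returns None when nothing changed.
    """
    if set(new["aliases"]) - set(old["aliases"]) or new["function"] != old["function"]:
        return WarningLevel.MAJOR
    if new["other_functions"] != old["other_functions"]:
        return WarningLevel.SECONDARY
    if set(new["blast_hits"]) - set(old["blast_hits"]):
        return WarningLevel.BLAST
    if new != old:
        return WarningLevel.ANY
    return None


def should_notify(user_level, old, new):
    """True if the change is important enough for the user's chosen level."""
    change = classify_change(old, new)
    return change is not None and change <= user_level
```

For example, a user who chose `WarningLevel.MAJOR` would not be notified when only the BLAST hits change, while a user who chose `WarningLevel.ANY` would.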
The summary email should include:
- The ID that they are searching against -- both the submitted id and the internal id
- The SEED function
- The change that triggered the results
- A list of actions, or a reminder of some actions, that they could take, e.g.:
  - rerun search #
  - delete search #
  - freeze search #
  - copy search #
In all cases the search # should be the internal ID. The use of the word search should be optional, and the commands should be case-insensitive.
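Putting the pieces of the summary email together might look like the following sketch, built on Python's standard `email` package. The function name, subject line, and wording are assumptions; the sender is the request address named in the Email Service section, so that a reply containing one of the commands reaches the bot.

```python
from email.message import EmailMessage


def build_summary_email(to_addr, submitted_id, internal_id, seed_function, change):
    """Build (but do not send) the summary email for one search."""
    msg = EmailMessage()
    msg["From"] = "request@biorobots.org"
    msg["To"] = to_addr
    msg["Subject"] = f"Biobot search {internal_id}: {change}"
    body = "\n".join([
        f"Submitted ID: {submitted_id}",
        f"Search number: {internal_id}",
        f"SEED function: {seed_function}",
        f"What changed: {change}",
        "",
        "Reply to this email with one of:",
        f"  rerun search {internal_id}",
        f"  delete search {internal_id}",
        f"  freeze search {internal_id}",
        f"  copy search {internal_id}",
    ])
    msg.set_content(body)
    return msg
```

The resulting message could be handed to `smtplib.SMTP.send_message` for delivery; a shorter variant of the same body would serve for the SMS version.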
