4 Gene Cluster With Literature Profiles
5 Literature Mining Gene Networks
Given a set of genes, for example from high-throughput experiments, it can be helpful to know which biological functions and molecular networks may be involved, or whether genes from a given list or all human genes are related to certain topics, such as various biological and pathological processes. Pre-defined annotation databases such as GO (Gene Ontology), pathway databases such as KEGG, and PPI (protein-protein interaction) databases such as HPRD and IntAct can be used as a gold-standard description. Annotation tools that integrate these manually curated databases, such as DAVID and EGAN, are convenient and practical. However, because they rely on structured vocabularies and manual curation, pre-defined annotations are inevitably limited in scope, quantity and flexibility. Here, we developed a web server, GenCLiP 2.0 (http://ci.smu.edu.cn), from our previous stand-alone software GenCLiP. Compared with similar tools such as DAVID, PubGene, iHOP, and STRING, the distinguishing features of GenCLiP 2.0 are:
i) analyzing gene functions with free terms generated by literature mining or provided by the user;
ii) precisely identifying and integrating comprehensive molecular interactions from the entire PubMed to construct interaction networks and sub-networks related to free terms.
For analysis, the user needs to upload a list of gene IDs. The Gene Cluster With Literature Profiles module generates statistically over-represented keywords, grouped by a fuzzy cluster algorithm, to annotate the input genes. The keywords are generated from the occurrence frequencies of free terms in gene-related literature, or can be provided by the user. The relationships between genes and keywords are linked to the relevant Medline abstracts, in which co-occurrences of genes and keywords are highlighted. This module can also generate a heat map of genes and selected keywords using the average linkage hierarchical clustering algorithm. The Literature Mining Gene Networks module constructs a gene network of the input genes and generates sub-networks based on user-defined query terms, and at the same time calculates the probability of random occurrence of the networks through random simulation. Each edge of a network represents the association of two genes and links to the related literature. In addition, the Word Related Gene Search module can query genes that co-occur with input terms in a sentence or abstract.
Clicking the Login text link at the top of the page opens a new page requesting your email (username) and password. If you have not previously registered with GenCLiP 2.0, there is a link to the registration page. If you have already registered but have forgotten your password, you can recover it.
We try to make GenCLiP 2.0 as easy as possible to use, so registration is not required. Unregistered users have access to the same functionality as registered users. Registered users can additionally retrieve and continue their recent analyses by clicking the job name at the right-hand side of the webpage. If a registered user's analysis archive is inactive for more than one month, the system will delete all of its information (gene list, results and so on) from the server to conserve disk space.
GenCLiP 2.0 accepts four types of gene accessions/IDs. Users can view all the gene accession options in the drop-down selection menu on the input page.
For user-defined up- and down-regulated genes, the software distinguishes gene names by color: red represents up-regulated genes and green represents down-regulated genes.
Please note that gene lists of more than 3000 genes are not accepted.
Registered users can name the job, which makes it easier to retrieve later; otherwise the system automatically names it using a random number and the submission time. Unregistered users cannot retrieve former analyses and can ignore this option.
Click the grey icon and the module name to begin further analysis.
This module focuses on gene annotation based on free words. Keywords can be applied in two ways, automatically and manually. The user can add relevant keywords and remove inappropriate ones. The purpose of keyword editing is to let you find the topics that relate to a particular part of your research. Users themselves should play a critical role in judging whether the results make sense for the expected biology.
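As an illustration of what "statistically over-represented" can mean for a keyword, the sketch below uses a one-sided hypergeometric test. This is an assumption made for illustration: this guide does not state which statistical test GenCLiP 2.0 applies, and the function name and all counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k keyword-linked genes in a list of n genes.

    N: number of genes in the background (assumed, e.g. all human genes)
    K: number of background genes linked to the keyword in the literature
    n: size of the uploaded gene list
    k: number of genes in the list linked to the keyword
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 12 of 50 uploaded genes co-occur with a keyword
# that is linked to 400 of 20000 background genes.
p = hypergeom_pvalue(12, 50, 400, 20000)
print(f"p = {p:.3g}")
```

A small p-value suggests that the keyword appears with the uploaded genes more often than expected by chance, but the biological plausibility still has to be judged by the user.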
Manual editing (Filter Keywords):
Manual editing (Delete Keywords):
To save your annotated results to your hard drive, click the "Save" link, then type a filename and save the file. You can then open this file in Microsoft Excel.
For ambiguous keywords, assistant terms help with identification. In this case, "Th1" or "Th2", together with both "T cell" and "immune", must appear in an abstract.
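The rule in this example can be read as: at least one keyword AND every assistant term. A minimal sketch of that reading follows; the function name is hypothetical, and the plain case-insensitive substring matching is a simplification, not a description of how GenCLiP 2.0 matches terms internally.

```python
def abstract_matches(abstract, keywords, assistant_terms):
    """Return True if any keyword and all assistant terms occur in the abstract."""
    text = abstract.lower()
    has_keyword = any(k.lower() in text for k in keywords)
    has_all_assistants = all(a.lower() in text for a in assistant_terms)
    return has_keyword and has_all_assistants

keywords = ["Th1", "Th2"]
assistants = ["T cell", "immune"]

relevant = "Th2 cytokines shape the immune response of each T cell subset."
ambiguous = "Th1 was measured in serum samples."

print(abstract_matches(relevant, keywords, assistants))   # True
print(abstract_matches(ambiguous, keywords, assistants))  # False
```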
GenCLiP 2.0 clusters the gene list and selected keywords using the average linkage hierarchical clustering algorithm. The cluster results are displayed in the image box and can be checked easily. Users can download the result as a txt file and use it as input for public software such as the Cluster/TreeView programs.
Select terms to proceed with cluster analysis:
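For readers unfamiliar with average linkage, the following is a small pure-Python sketch of the idea. The Euclidean distance, the toy gene-by-keyword scores and all names are assumptions for illustration; this guide does not specify the distance measure or data layout used by GenCLiP 2.0.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2, profiles):
    # Average linkage: mean distance over all cross-cluster gene pairs.
    pairs = [(g1, g2) for g1 in c1 for g2 in c2]
    return sum(euclidean(profiles[g1], profiles[g2]) for g1, g2 in pairs) / len(pairs)

def average_linkage(profiles):
    """Agglomerative clustering of genes by their keyword profiles.

    profiles: dict mapping a gene name to a list of keyword scores.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], profiles))
        merges.append((a, b, cluster_distance(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy profiles: 1 means the gene co-occurs with the keyword, 0 means it does not.
profiles = {"G1": [1, 1, 0], "G2": [1, 1, 0], "G3": [0, 0, 1]}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Genes with identical keyword profiles are merged first, and the merge history corresponds to the tree drawn next to a heat map.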
Heat map type of 'Strict'.
Heat map type of 'Trend'.
Based on the user-defined gene list, red represents up-regulated genes and green represents down-regulated genes.
There are three input boxes in this Gene Network module. The first and second input fields (Network related with keyword(s)) are used to search for and construct gene co-occurrence networks. An interacting gene pair is confirmed when the two genes appear together in a sentence and the keywords co-occur with them in the same sentence or abstract (depending on the option selected). All terms are highlighted in the relevant abstracts. The third input box (Gene(s) in the network related with keyword(s)) determines whether a gene in the resulting networks is shown with a purple border (related to the input keyword) or not. A gene is considered related to this keyword if the two co-occur in a sentence. All three input boxes support synonyms: multiple terms are separated by commas, which mean "OR". For example, "apoptosis,apoptotic" means that "apoptosis" and "apoptotic" are treated the same when either appears in the corresponding field.
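The co-occurrence rule and the comma-separated "OR" syntax can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, sentences are split with a naive regular expression, and genes are matched as plain substrings, whereas a real system would need proper gene name recognition.

```python
import re

def parse_terms(field):
    """Split a comma-separated input field into lower-cased alternatives (OR)."""
    return [t.strip().lower() for t in field.split(",") if t.strip()]

def split_sentences(abstract):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]

def pair_confirmed(abstract, gene_a, gene_b, keyword_field, scope="sentence"):
    """Check whether two genes share a sentence and a keyword co-occurs with them.

    scope="sentence": the keyword must be in the same sentence as the genes.
    scope="abstract": the keyword may be anywhere in the abstract.
    """
    terms = parse_terms(keyword_field)
    for sentence in split_sentences(abstract):
        low = sentence.lower()
        if gene_a.lower() in low and gene_b.lower() in low:
            region = low if scope == "sentence" else abstract.lower()
            if any(term in region for term in terms):
                return True
    return False

abstract = "TP53 activates BAX in damaged cells. This promotes apoptotic death."
print(pair_confirmed(abstract, "TP53", "BAX", "apoptosis,apoptotic", "sentence"))  # False
print(pair_confirmed(abstract, "TP53", "BAX", "apoptosis,apoptotic", "abstract"))  # True
```

The example shows why the scope option matters: the gene pair and the keyword are in different sentences, so only the abstract-level option confirms the pair.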
Gene network construction is combined with a limited number of random simulations, so it is a slow process. Nevertheless, it should take no longer than 1 minute if the server is running properly.
To simplify the network or remove a wrong gene pair, users can delete genes and edges. Right-click a gene and choose Select First Neighbors to select the genes associated with it, or hold Shift and click multiple genes and edges. Then right-click a selected element to delete the selection.
For up- and down-regulated genes.
Random simulation is performed to determine whether the analyzed gene list is related to the specified keywords, and/or engaged in the same gene networks related to the specified keywords.
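One way such a random simulation can work is sketched below: random gene lists of the same size are drawn repeatedly from a background, and the fraction of draws with at least as many network edges as the analyzed list gives an empirical probability. The sampling scheme, the edge-count statistic, the add-one correction and all names are assumptions for illustration; the procedure used by GenCLiP 2.0 may differ.

```python
import random

def count_edges(genes, edges):
    """Count literature-derived edges whose two genes are both in the list."""
    gene_set = set(genes)
    return sum(1 for a, b in edges if a in gene_set and b in gene_set)

def simulate_pvalue(gene_list, background, edges, n_iter=1000, seed=0):
    """Empirical probability that a random list has at least as many edges."""
    rng = random.Random(seed)
    observed = count_edges(gene_list, edges)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(background, len(gene_list))
        if count_edges(sample, edges) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_iter + 1)

# Hypothetical background of 100 genes and a small keyword-related network.
background = [f"G{i}" for i in range(100)]
edges = [("G0", "G1"), ("G1", "G2"), ("G0", "G2")]
p_connected = simulate_pvalue(["G0", "G1", "G2"], background, edges)
p_unconnected = simulate_pvalue(["G50", "G51"], background, edges)
print(p_connected, p_unconnected)
```

The number of iterations trades precision against run time, which is consistent with network construction being a slower step.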
Generally, Word Related Gene Search will allow you to identify genes that are related to your specific query. This feature is useful for identifying genes associated with diseases, phenotypes, function, etc. In many cases, your search term will correspond to various genes or proteins.
You can use this function to see the list of genes matching an arbitrary query; clicking the "number of articles" or "official symbol" then links to a particular gene and the relevant literature. The result is a set of sentences matching both your search terms and the indicated gene. The gene list is sorted by frequency: genes mentioned in the highest number of articles are shown at the top of the list.
The synonyms section displays the list of aliases known to GenCLiP 2.0 for each gene. The summary section displays each gene's summary provided by NCBI. It is worth noting that many gene or protein synonyms are ambiguous, so one and the same synonym is often used for different genes. Even human experts can have difficulty resolving such ambiguities, and automatic systems, like iHOP, will therefore always exhibit certain errors. In many cases, experts will recognize incorrectly identified synonyms simply by scanning a sentence.
(i). Searches are case-insensitive: MYC and myc will find the same set of results.
(ii). To search for sentences or articles containing more than one search term (logical "AND" searches), separate the terms with a space. As an example, cancer stem cell finds sentences mentioning cancer, stem and cell.
(iii). To search for a phrase that contains spaces, place the phrase in quotation marks. For example, "cancer stem cell" finds articles mentioning the multi-word phrase "cancer stem cell", so the results are more specific. Without quotation marks around cancer stem cell, the system finds more articles, but they are less specific.
(v). Boolean searching (the use of AND, OR, and NOT operators) is not currently supported. To perform an "AND" search, enter the search terms separated by spaces. For example, mir target identifies articles mentioning both mir and target, which makes it easy to find microRNAs and their targets.
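The search rules above (case-insensitive matching, spaces as "AND", quotation marks for phrases) can be summarized in a short sketch. The function names are hypothetical and the substring matching is a simplification of whatever indexing the server actually uses.

```python
import re

def parse_query(query):
    """Split a query into lower-cased terms, keeping quoted phrases whole."""
    tokens = re.findall(r'"([^"]+)"|(\S+)', query)
    return [(phrase or word).lower() for phrase, word in tokens]

def matches(text, query):
    """All terms must be present (implicit AND), ignoring case."""
    low = text.lower()
    return all(term in low for term in parse_query(query))

sentence = "The stem of a cancer cell line was analyzed for MYC expression."

print(parse_query('"cancer stem cell" MYC'))     # ['cancer stem cell', 'myc']
print(matches(sentence, "cancer stem cell"))     # True: all three words occur
print(matches(sentence, '"cancer stem cell"'))   # False: the exact phrase is absent
print(matches(sentence, "myc"))                  # True: case is ignored
```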
GenCLiP 2.0 provides three modules to identify the most relevant (over represented) biological keywords associated with a given gene list.
DAVID proposed a novel clustering algorithm, named fuzzy cluster, to address this problem efficiently. The fuzzy cluster algorithm classifies highly related terms into functionally related groups. Typically, a biological term involves the cooperation of a set of genes. For example, if two or more biological processes are carried out by a similar set of genes, the processes are probably related within the biological network.
This type of grouping of functional annotation can give a more insightful view of the relationships between annotation categories and terms than the traditional linear list of enriched terms, in which highly related annotation terms may be dispersed among hundreds, if not thousands, of other terms. This reduces the burden of connecting different terms associated with a similar biological process, allowing the biological interpretation to focus on the biological module level. An empty report means that no annotations passed the specified threshold. It does not mean that no annotation exists.
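The idea that two terms are related when they involve a similar set of genes can be made concrete with an agreement score. The sketch below uses Cohen's kappa, the measure that DAVID's documentation describes for its functional annotation clustering; whether GenCLiP 2.0 uses exactly this measure is an assumption, and the function name and gene sets are hypothetical.

```python
def kappa(genes_a, genes_b, universe):
    """Cohen's kappa between two terms, each given as a set of annotated genes.

    universe: set of all genes in the analysis (both gene sets must lie within it).
    Returns 1.0 for identical gene sets and values near 0 for chance-level overlap.
    """
    n = len(universe)
    both = len(genes_a & genes_b)
    only_a = len(genes_a - genes_b)
    only_b = len(genes_b - genes_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Agreement expected if the two annotations were independent.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / n ** 2
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

universe = {f"G{i}" for i in range(10)}
apoptosis = {"G0", "G1", "G2", "G3"}
cell_death = {"G0", "G1", "G2", "G3"}
migration = {"G6", "G7"}

print(kappa(apoptosis, cell_death, universe))  # identical gene sets
print(kappa(apoptosis, migration, universe))   # no shared genes
```

Terms whose pairwise score exceeds a chosen threshold can then be placed in the same group, and a term may belong to more than one group, which is what makes the clustering "fuzzy".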