Networking

Version 2.01
October 1996

David Steffen, Ph.D.
President, Biomedical Computing, Inc.
6626 Westchester
Houston, Texas 77005
USA

Introduction

The Internet has become an important tool for biological and biomedical research scientists. Using the Internet, it is possible to perform a number of kinds of analyses on research data and to search for and obtain information. Over the last several years, the number of tools and the amount of information relevant to biologists available on the Internet have grown, and the ease of use of these tools has grown as well. As a result of both of these trends, the value of Internet resources for most biologists now significantly outweighs the costs in time and money of using them. The overall goal of this chapter is twofold: to help biologists use the Internet effectively, and to show computer scientists working in biocomputing how biologists are currently using the Internet, in order to indicate what is working well and ought to be expanded and what is not working well and needs improvement.

This chapter has three specific goals:

  1. To provide background information which will help demystify computer network usage.
  2. To provide an introduction to the resources available to biologists over the network in sufficient detail to allow the students in this course[1] to explore and learn how to use these resources on their own.
  3. To provide practical instruction to these students on using the specific network resources needed during the remainder of the course.

It is assumed that the students already can use an http (www; World Wide Web) client to connect to the course text and its linked resources. (Examples of clients include Mosaic, Netscape and lynx.)

Note from the author: The bulk of this chapter was completed in April of 1996, but minor revisions (version 2.0 -> 2.01) were made in October 1996. Much has changed between April and October but time constraints prevented me from making all the updates I would like. Please:






An overview of computer networking

Presumably, you are reading this description of computer networking over the Internet and thus do not need instruction on how to connect to the Internet. If this assumption is incorrect, this section will not help you, nor will the rest of this chapter. The purpose of this section is to provide a very brief and informal theoretical description of computer networks, internets, and THE Internet to help demystify it for the person already using it.

I have no idea what kind of computer you are using or what kind of network connection it has. This is as it should be; I don't need to know. The current state of the Internet is such that connections between computers all over the world function seamlessly and easily, requiring little or no understanding of them by their users. In fact, however, the seamless connection that allows me to effortlessly retrieve documents from an http server or to chat on BioMOO is actually quite complex "under the hood".

There are many different kinds of networks on which your computer might reside. (The computer on which you are reading this might not even be directly connected to a network, but rather might be connected as a terminal to a computer which is on a network.) These networks can vary both in terms of their physical and electrical properties (for example, RS485 or Ethernet) and in terms of how data is encoded on these media. For example, Ethernet can carry data encoded as TCP/IP, AppleTalk, or Novell NetWare. Similarly, AppleTalk can be carried over Ethernet or an RS485 network (which Apple calls LocalTalk). Most likely, the network to which your computer is connected is a Local Area Network (LAN) as opposed to a Wide Area Network (WAN). LANs interconnect a limited number of computers within a limited area. For example, the Sun computer to which I am connected is directly connected to a few tens of computers in the Molecular Biology Computing Resource and the Department of Cell Biology at Baylor. Most of the computers at Baylor are on different networks. Connecting to these other computers at Baylor and to computers all over the world requires interconnections between LANs, which are usually accomplished with a WAN. WAN connections can be made via T1 lines or ISDN connections, for example.

Connections between LANs are accomplished by specialized pieces of hardware generically called gateways. (Bridges and routers are specific classes of gateways. A bridge just passes packets of information from one network to another whereas a router examines the address in each packet and intelligently routes the packet to the correct network.) Gateways are special purpose computers whose jobs might include determining which packets of information to transmit from one LAN to another, or reformatting packets of data as required by differences between the LANs. A collection of networks interconnected by gateways is referred to as an internet. One internet has grown to include many computers around the world and, in honor of its dominant role in worldwide computing, is referred to as THE Internet. Commonly, internet (with a lower case i) refers generically to any interconnected set of networks, and Internet (with an upper case I) refers to THE Internet.

For two computers on a network or on an internet to communicate with each other, they need to have a unique way of referring to each other, or addresses. On the Internet, addresses have the form of four numbers, each number having a value between 0 and 255. An example of such an address is:

129.106.28.111

These addresses are hierarchical; all of the addresses of the form 129.106.28.### might be at one institution, or within one department at that institution, for example. Gateways on the Internet contain maps of how the Internet's constituent networks are connected one to another, so that given such an address, they can determine one or more routes from where they are on the network to the appropriate destination, and thus which gateway(s) to hand any given packet off to.
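
The hierarchy described above can be sketched with Python's standard ipaddress module. This is only an illustration: the choice of the first three numbers as the shared "network" part (a prefix length of 24 bits) is an assumption made to match the 129.106.28.### example, and real networks may be divided differently.

```python
import ipaddress

def octets(addr):
    """Split a dotted address into its four numbers (each 0-255)."""
    return list(ipaddress.ip_address(addr).packed)

def same_network(addr_a, addr_b, prefix_len=24):
    """Return True if both addresses share their first prefix_len bits.

    prefix_len=24 (the first three numbers) is an assumption for this sketch.
    """
    # strict=False lets us derive the network from an ordinary host address.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network

print(octets("129.106.28.111"))
print(same_network("129.106.28.111", "129.106.28.7"))   # same 129.106.28.### block
print(same_network("129.106.28.111", "129.106.30.7"))   # different block
```

A gateway makes a similar comparison, using its own map of the networks, when it decides where to hand off a packet.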

One can (and sometimes does) use numeric addresses such as the above. More commonly, however, one uses an address consisting of words, such as:

merlin.bcm.tmc.edu

This form of address is converted to the numeric form of the address either by your local computer, or more commonly by a computer to which it is connected, called a "nameserver".

The advantages of using the "name" form of addresses are first, they are more user friendly (easier to remember and understand for humans) and second, they make for more reliable connections. Sometimes it is necessary to change the numeric address of a computer, or to move services from one computer to another, and when this happens, connections to the old numeric address will no longer work. However, nameservers can be automatically updated to associate the old name with a new numeric address, so that connections to the name will continue to work.
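
The reliability argument above can be sketched in a few lines of Python. By default the function below asks the real resolver through socket.gethostbyname; so that the sketch runs without a network, the demonstration substitutes a small dictionary as a stand-in nameserver. The pairing of merlin.bcm.tmc.edu with the numeric addresses shown is assumed purely for illustration.

```python
import socket

def connect_address(name, resolver=socket.gethostbyname):
    """Translate a host name to a numeric address just before connecting."""
    return resolver(name)

# Stand-in nameserver table; the numeric addresses are illustrative assumptions.
table = {"merlin.bcm.tmc.edu": "129.106.28.111"}
print(connect_address("merlin.bcm.tmc.edu", resolver=table.__getitem__))

# The machine is renumbered: only the nameserver entry changes,
# and anyone connecting by name is unaffected.
table["merlin.bcm.tmc.edu"] = "129.106.28.200"
print(connect_address("merlin.bcm.tmc.edu", resolver=table.__getitem__))
```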

To connect to a remote computer, you need permission from that computer to connect, and you need to specify a kind of connection; for example telnet, gopher, http, ftp, or email. These different kinds of connections are characterized by different capabilities, different protocols for communication, different client software (that which the connecting user uses) and different server software (that which the host computer uses). Permission to use a computer is controlled on a "service by service" as well as a "user by user" basis. For example, anyone may make an http connection to merlin.bcm.tmc.edu, only those users with an account may make a telnet connection, and nobody may make a gopher connection. The way that you specify what kind of connection you want is by specifying a "port". This port is not physical, but rather can be thought of as a sub-address.

On merlin.bcm.tmc.edu, Port 23 is connected to a telnet server, and Ports 8001 and 8080 are connected to different http servers. There are standard ports for different services, port 80 for http and port 23 for telnet, for example, but these standards are just a convenience; any service can be connected to any port. The convenience of using the standard port is that users will know how to connect without being told. In fact, this frequently is where a connection will go by default. For example, most http (web) clients can connect to any port, but will connect to port 80 if no port is specified.
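
The default-port behavior just described can be sketched as follows, using Python's standard urllib.parse. The table of conventional ports is a small illustrative subset, and the function name port_for is simply a label chosen for this sketch.

```python
from urllib.parse import urlsplit

# Conventional ports for a few services; a server may listen elsewhere.
DEFAULT_PORTS = {"http": 80, "telnet": 23, "ftp": 21, "gopher": 70}

def port_for(url):
    """Return the port a client would use for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port              # the URL names a port explicitly
    return DEFAULT_PORTS[parts.scheme]  # otherwise fall back to the convention

print(port_for("http://merlin.bcm.tmc.edu:8001/"))
print(port_for("http://merlin.bcm.tmc.edu/"))
print(port_for("telnet://merlin.bcm.tmc.edu"))
```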




How Different Internet Services are Used


telnet

Telnet is one of the oldest of the network services and perhaps the easiest to understand. Telnet allows one computer to "log on" to another computer as if it were a terminal. Once logged on, you frequently will have all the privileges of a local user; you can run programs, create and delete files. This is probably the most common way that users with accounts will use a computer.

Although "full service logins" as described above are perhaps the most common use of the telnet protocol, the host's system administrator may impose as much control as desired on a telnet connection. Thus, a telnet service may be advertised with a public login name and password. A login with this name, however, is likely to be restricted to a limited number of commands. The National Institutes of Health in the United States uses such a telnet login to disseminate information on the membership of study sections.

Finally, a telnet client can sometimes be used to connect to a different kind of server. Most people will use a telnet client the first time connecting to a MOO, and some people will continue to use telnet as their client, although most of us find dedicated clients to be significantly more convenient. Also, it is possible to connect to a gopher server with a telnet client if you understand the required syntax. This is almost never done to use a gopher server, but is commonly done when debugging a gopher system.
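
The "required syntax" for talking to a gopher server by hand is small enough to sketch. As I understand the gopher protocol (RFC 1436), the client sends a selector string followed by a carriage return and linefeed (an empty selector asks for the top-level menu), and each menu line that comes back holds a one-character item type and then tab-separated fields. The sample menu line and the host gopher.example.org below are invented for illustration.

```python
def gopher_request(selector=""):
    """Bytes a client sends after connecting: the selector, then CR LF."""
    return selector.encode("ascii") + b"\r\n"

def parse_menu_line(line):
    """Split one menu line into (type, display text, selector, host, port)."""
    item_type, rest = line[0], line[1:]
    display, selector, host, port = rest.rstrip("\r\n").split("\t")[:4]
    return item_type, display, selector, host, int(port)

print(gopher_request())  # what you would "type" to ask for the top-level menu

# Hypothetical menu line: type "1" marks a directory (a further menu).
sample = "1Sequence databases\t/databases\tgopher.example.org\t70\r\n"
print(parse_menu_line(sample))
```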

From a practical point of view, every telnet host will be different, and thus you will need to learn about each one as you have occasion to use it.

ftp

Telnet is useful for interactive computer access, but is much less useful for transferring files. Ftp is an older service designed specifically for file transfer. Originally it, like telnet, was intended for account owners. However, as it became apparent that it was useful to make files available to the world at large without giving all those wanting the files an account, the variant of "anonymous ftp" developed. In this variant, logging in with a "magic" user name (most commonly "anonymous" or "ftp") eliminates the requirement for a password.

Once logged on via ftp, access to the host filesystem is accomplished by a series of commands. On a unix ftp client, the commands are unix-like; cd to Change Directory and ls to LiSt the files in that directory. To transfer files, you execute either get [2] a file from the host computer or put a file onto it (where allowed). These commands do not depend on the host computer running UNIX! They do, however, depend on the client. A graphical user interface (GUI) for example, might not have typed commands at all, but buttons. These are ftp commands, some of which happen to be similar to unix commands.

Two ftp commands that are especially important to understand are binary and ascii (the command that selects text mode on most clients). Ftp transfers occur in text mode by default. In text mode, the file received may not be identical to the one on the host, as ftp may make changes in the file during transfer to allow for differences in how different operating systems handle text. For example, UNIX terminates lines with the linefeed character (ASCII 10 decimal), the Macintosh operating system with a carriage return (ASCII 13 decimal), and MSDOS uses one of each. These differences are corrected for during a text transfer. This is highly desirable for text files, but catastrophic for binary files like program object code and pictures. Thus, before getting such a file, it is important to issue the binary command. This instructs ftp to transfer files unmodified.
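
Why a text (ASCII) mode transfer is harmless for text but catastrophic for binary files can be seen in a deliberately simplified sketch. The function below is not how any ftp program is implemented; it only mimics the net effect of moving a file from a UNIX host (linefeed line endings) to an MSDOS machine (carriage return plus linefeed).

```python
def text_mode_transfer(data, sender_eol=b"\n", receiver_eol=b"\r\n"):
    """Mimic a text-mode transfer: rewrite the sender's line endings."""
    return data.replace(sender_eol, receiver_eol)

# A text file: the line endings are adapted and the content stays readable.
text = b"ATGC\nGGCC\n"
print(text_mode_transfer(text))

# A binary file can contain byte 10 by coincidence; it is not a line ending,
# but text mode rewrites it anyway.
image = bytes([0x89, 10, 0x00, 10, 0xFF])
damaged = text_mode_transfer(image)
print(len(image), len(damaged))  # the lengths differ: the file is corrupted
```

In binary mode, the equivalent of this function would return its input untouched.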

To a large extent, use of the World Wide Web has rendered (direct) ftp access obsolete. In the first place, www has its own file transfer protocols which are much more intuitive than the pseudo-unix of ftp. In the second place, html documents can contain links to ftp servers. This means that files which have not been made available on the web but only on an ftp server can nonetheless be retrieved by a www client if someone has placed a link to that ftp server in an html document. In the third place, www clients can be used as ftp clients for anonymous ftp by entering the appropriate URL. For example, suppose you are given the following instructions to retrieve a file:

"The file is available by anonymous ftp.
 ftp to ftp.bcm.tmc.edu
 and retrieve mbcr/pub/file.txt"

...you could accomplish this with your www client by using the URL:

ftp://ftp.bcm.tmc.edu/mbcr/pub/file.txt
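
The translation from written ftp instructions to a URL is mechanical, as this small sketch suggests. The helper name anonymous_ftp_url is an invention for this example; urlsplit, from Python's standard library, is used only to show the pieces a www client extracts from the URL.

```python
from urllib.parse import urlsplit

def anonymous_ftp_url(host, path):
    """Combine an ftp host and a file path into an ftp URL."""
    return "ftp://" + host + "/" + path.lstrip("/")

url = anonymous_ftp_url("ftp.bcm.tmc.edu", "mbcr/pub/file.txt")
print(url)

# The client recovers the service, the host, and the file from the URL.
parts = urlsplit(url)
print(parts.scheme, parts.hostname, parts.path)
```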

email

Both ftp and telnet are interactive, more or less real time programs. Sometimes it is useful, however, to communicate with another computer, or more commonly, a user on another computer, by leaving them a message which they can read and respond to at their convenience. This is done over the Internet by using the email system. Email is almost always set up by a system administrator and the individual user has little they can or need to do to get it working. Thus, I will discuss simple email here no further.

Many computers which do not have "direct Internet access" nonetheless can exchange email. Thus, email represents a "lowest common denominator" for computer interconnection. That being the case, email has been pressed into service for uses in addition to simple user to user mail-like communication. One such category of uses is provided by what are called "mailservers". Mailservers provide a service that one might expect to perform using a telnet, ftp, gopher, or www system via email. The way this is done is that email is sent to a program on the host computer rather than to a user and this program responds to information in the Subject or Body of the email message. For example, it is possible to retrieve sequences from Genbank or to use blast to search genbank using mailservers. Although many people see this as a primitive fossil to be used only as a last resort, use of mailservers actually has some advantages that might mandate their continued role well into the future. Because they are mail based, they are asynchronous. A user can make a request at their convenience, and then go on to other tasks while waiting for their request to be fulfilled. On the server end, requests can be queued to be filled as the host machine has the resources to do so. I personally choose to retrieve public domain Macintosh software via a mailserver rather than one of the many ftp, gopher, or www sites because these latter, interactive sites tend to be very busy and thus difficult to log onto.

The biggest disadvantage of mailservers is that communication with them requires a very precise syntax in the email message. Further, there are no standards for this syntax and thus each different mailserver has a different syntax for us to learn. The homework requires that you learn at least a basic subset of the commands for the retrieve server and I recommend that you do so for blast email server as well.

Another use to which email has been put is for group communication rather than one to one communication. This is accomplished via a software package called a listserver. Use of listserver-based mailing lists is, in fact, an important tool used by this course. Mail sent to a listserver is resent to all members of a group. In addition to sending and receiving email from the group, one can send messages to the listserver itself to subscribe, unsubscribe, retrieve archived messages, and so on. There is a relatively small number of listserver software packages in use, so that there is a reasonable hope (but no certainty) that if you are familiar with one listserver, commands for another may be the same. Two of the most common listserver software packages are listserv and Majordomo. Majordomo is the listserver used in this course. Although you do not need a sophisticated knowledge of Majordomo to participate in the course, you should at least learn how to unsubscribe from lists when you are no longer interested in them.
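
As a sketch of what a command message to a listserver looks like, the following builds (but does not send) an unsubscribe request of the kind Majordomo accepts: the command goes in the body of a message addressed to the Majordomo program itself rather than to the list. The list name course-list and both email addresses are hypothetical; substitute the ones given to you for the list in question.

```python
from email.message import EmailMessage

def majordomo_command(command, list_name, sender, server_address):
    """Build a command message for a Majordomo server (not sent here)."""
    msg = EmailMessage()
    msg["From"] = sender          # should match the address you subscribed with
    msg["To"] = server_address    # the Majordomo program, not the list itself
    # Majordomo reads commands from the body; "end" stops it from trying
    # to interpret anything after it, such as a signature.
    msg.set_content(f"{command} {list_name}\nend\n")
    return msg

msg = majordomo_command(
    "unsubscribe",
    "course-list",                 # hypothetical list name
    "student@example.edu",         # hypothetical subscriber address
    "majordomo@example.org",       # hypothetical server address
)
print(msg)
```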

I have been both a user and an administrator of listservers, and personally find them clumsy to use for several reasons. First, remembering the commands and email addresses (one for each group to which you belong and one for the listserver to issue commands) is difficult. Second, listservers are completely dependent upon and very picky about email addresses. At Baylor, a user's email address is different depending on how they log onto the system, and these addresses change with some regularity. This introduces no problems for receiving mail from a listserver as the Baylor system automagically resolves these changing names, but produces recurring problems sending mail to a listserver, as the listserver may require that your posting come from precisely the same address as is present in the subscription list. Third, because listservers use the email system, messages from a listserver group are intermixed with your private email and with messages from all the other listservers you subscribe to. Fourth, the email program you are likely to be using to read the intermixed mess of messages lacks many commands, extremely useful for efficiently following a group, that even the most primitive Usenet software will have. It might be thought that Usenet software would make listservers obsolete in the same way that www ought to make ftp obsolete. Why this has not occurred will be discussed below.

Usenet

The alternative to listservers for group communication is newsgroups. Newsgroups use entirely different protocols and software from email (and thus from listservers). The distinction between listservers and Usenet is made less apparent, however, by the fact that one can typically send email from within a newsgroup client to the author of a newsgroup message, and that as a result one might receive email in response to a message posted to a newsgroup.

To read messages posted to newsgroups, one runs any one of many newsgroup client programs, subscribes or unsubscribes to groups, and reads the messages one group at a time. The advantages of newsgroups over listservers are seemingly overwhelming. Messages from different groups are kept separate from each other, and all of them are completely separate from your personal email. Subscribing and unsubscribing and posting to and from groups typically involves a keystroke, and help files are another keystroke away. Within a group, it is possible to read the messages by topic rather than in the order they are posted. If a topic is uninteresting to you, it is possible to delete all the messages on that topic, and this can even be set up to occur automatically (via a "kill file"). Finally, as a "moral" issue, when the number of readers becomes large, newsgroups consume fewer system resources worldwide than do listservers. The reason that listservers still exist, however, is that a proliferation of the number of newsgroups causes a variety of problems, including consumption of world wide system resources, and, partially as a result, it takes significant effort and interest within the internet community to create a new group. Thus, listservers are used to create small and/or temporary and/or casual groups whereas newsgroups are set up to allow conversation on more general issues of widespread interest.
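
The idea behind a kill file can be sketched as a simple filter. Real newsreaders each have their own kill file syntax and can match on authors and other headers as well; the version below, which matches phrases in the subject line only, is a simplification, and the sample messages are invented.

```python
def apply_kill_file(messages, kill_subjects):
    """Drop messages whose subject contains any killed phrase (case-insensitive)."""
    killed = [phrase.lower() for phrase in kill_subjects]
    return [
        m for m in messages
        if not any(phrase in m["subject"].lower() for phrase in killed)
    ]

# Invented sample messages for illustration.
messages = [
    {"subject": "PCR primer design question"},
    {"subject": "Re: MAKE MONEY FAST"},
    {"subject": "Re: PCR primer design question"},
]

for m in apply_kill_file(messages, ["make money fast"]):
    print(m["subject"])
```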

The classic collection of newsgroups is Usenet. Usenet consists of seven groups (and thus is also known as the "big seven"); sci (science), comp (computers), soc (social or sociology - I am not sure), talk, misc, news, and rec (recreational). It is important to remember, however, that not all newsgroups are part of Usenet. Newsgroups which are not part of Usenet include Bionet, Clarinet, biz, alt, bcm, and many, many others. There are about 1300 groups in Usenet but about 10,000 groups overall!

Most users will not notice the difference between Usenet and non-Usenet groups that are received by their site. However, not all newsgroups are transmitted to all sites. bcm, for example, is a group set up by and for Baylor College of Medicine and is only received within Baylor. In fact, what characterizes Usenet is the rules used therein for group creation. Thus, Usenet is an assurance of general interest (if not quality) which is intended to encourage more system administrators to carry these groups.

Newsgroups use, by convention, a hierarchical naming scheme. Consider two examples:

sci.bio.microbiology
bionet.microbiology

sci.bio.microbiology is the microbiology subgroup of the bio(logy) subgroup of the sci(ence) group of Usenet. Bionet.microbiology is the microbiology subgroup of bionet. Bionet is not part of Usenet. Bionet (like the other non-Usenet groups) has its own system for group creation, however, and provides its own assurance of quality. I personally find the Bionet groups to be the most useful of the newsgroups.
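
The naming scheme can be taken apart mechanically, as in the sketch below. The in_usenet test uses this chapter's definition of Usenet as the "big seven" top-level groups; it is a rule of thumb for illustration, not an authoritative check.

```python
# The "big seven" top-level groups named in this chapter.
BIG_SEVEN = {"sci", "comp", "soc", "talk", "misc", "news", "rec"}

def hierarchy(group_name):
    """List a newsgroup's ancestors, from the top-level group down."""
    parts = group_name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def in_usenet(group_name):
    """True if the group's top-level name is one of the big seven."""
    return group_name.split(".")[0].lower() in BIG_SEVEN

print(hierarchy("sci.bio.microbiology"))
print(in_usenet("sci.bio.microbiology"))
print(in_usenet("bionet.microbiology"))
```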

From a practical point of view, I suggest that newsgroup newcomers of the biological persuasion look over the list of bionet groups and the sci.bio subgroups and subscribe to those that seem interesting. Follow them for a while, and unsubscribe from less interesting groups until a balance between the time required to follow the groups and the value of the information retrieved is reached. In addition, for biologists interested in computing, subscription to a very limited and specific subset of groups in the comp group of Usenet can be invaluable.

Information can be obtained from newsgroups both by "lurking" (reading the group without posting) and by asking specific questions and waiting for the replies. Good citizenship requires, however, that in addition to posting questions one answers when appropriate, though too many answers are more often a problem than too few.

As a final warning, the two biggest problems with newsgroups are a very low signal to noise ratio and an exceedingly low level of common courtesy. (Both of these problems are less on Bionet, in my opinion.) If you post to newsgroups, expect to be gratuitously insulted ("flamed") to an extent you may have never before experienced. Also remember that more than one career has been destroyed by the black hole time sink of Usenet.

WAIS, gopher, and HTTP (World Wide Web)

Although ftp provides the capability of transferring files over the Internet, it is not, for a number of reasons, a good tool for sharing information. The first attempt at such an information sharing system was gopher. Gopher allowed convenient viewing of files on line, the use of meaningful names rather than cryptic filenames for these files, and the linking of one gopher site to another.

At about the same time, WAIS (an acronym for Wide Area Information Server) provided a solution to another problem: that of searching for information rather than browsing for it. WAIS is a powerful, free, easy to use package for the indexing, serving, and searching of free-text databases. As a result, a large number of databases useful for biologists sprang into being. Soon after, the WAIS searching capability was integrated into the gopher server such that a biologist, by learning one system (gopher), had the capabilities provided by both gopher and WAIS. As a result, gopher became a standard and extremely useful tool for biologists.

At about the same time, the concept of linking files on physically separate computers was combined with the concept of hypertext to produce a system for making information available over the Internet called HTTP (HyperText Transfer Protocol) or the World Wide Web (www, W3, or the Web). The World Wide Web system initially spread more slowly because it made greater demands on the client hardware and as a result was more platform dependent than gopher. By now, however, the World Wide Web has almost entirely replaced gopher as the standard for information exchange between biologists. Among the capabilities unique to the Web are:

Another reason for the web's success is that many (most? all?) web clients have the ability to access gopher and some other servers, so that a biologist, by installing and learning a web client, gains access to web servers and linked gopher and ftp servers. Ironically, although most gopher clients contained built-in WAIS clients and thus can be used to access both gopher and WAIS servers, most web clients cannot access WAIS servers (directly). The fact that this precludes access to existing WAIS servers is becoming less important with time as these servers are replaced with web servers. What is unfortunate is that WAIS represents a powerful way of quickly and simply setting up searchable databases. One explanation for the absence of WAIS support in web browsers is that the WWW community is focusing its efforts on links between the web and powerful database management systems (dbms). I strongly support this effort; developing such databases and their web links is what I do for a living. It is the case, however, that developing such dbms-based web sites is a significant effort well beyond the capabilities of a typical biologist. In contrast, many biologists were able to set up WAIS-indexed databases after a few hours' work.

Two solutions to this problem are available. The first takes advantage of a feature present in many (most? all?) web clients, the ability to access "proxy servers". A WAIS proxy server has the ability to take input from a web client, reformat that input and use it to access a WAIS server, and convert the output of that server into a form usable by the web client. This approach allows one to access any WAIS server, but requires that a proxy server be installed, typically on a local machine, and that each client be configured to use that proxy server, something not difficult but perhaps beyond the patience and skill of many biologists. The second solution, which does not allow access to existing WAIS servers but does allow use of the powerful indexing and searching features of the WAIS protocol, is to install an http-compatible WAIS server. A number of such servers exist and are in use.

The lack of platform independence remains a problem for the web. It is probably the case that three web browsers now account for most of the clients currently in use: the America Online (AOL) browser, Netscape, and Microsoft's Internet Explorer. (Netscape is a product of Netscape Communications Corporation, and the AOL browser is produced by AOL specifically to allow access to the web from their commercial computer site.) Both Netscape Communications Corporation and Microsoft have chosen to include many features in their browsers which are not included in the current html standards and thus which are not present in other browsers. Many of these features are very attractive and thus html authors have tended to use them, resulting in sites only fully usable with one specific browser. The problem with this is that an open standard is being converted into a proprietary system. An additional problem is that clients for different platforms (e.g. X-terminals, Microsoft Windows, Macintosh) do not necessarily have the same features at the same time.

Most recently, a powerful new feature was added to some clients: the ability to download and execute programs written in a language called Java. This ability obviously expands the power of the web enormously. Only time will tell what uses can be made of this feature. (The Netscape client supports Java, but in addition uses another, similar feature, called JavaScript.) There are, at present, some clear disadvantages of this feature, however. First, until Java support is included in all browsers, this further moves html/http away from an open standard towards a proprietary system. Second, there are obvious security concerns with this new feature. Netscape Communications Corporation put significant effort into protecting Java users from security problems and, based on their history, can be expected to be most aggressive about attacking security holes as they appear, but users need to be aware of the potential for such problems and to act accordingly. Thus, major security problems in Netscape 2.0 have been corrected in Netscape version 2.01, but if a user does not upgrade to 2.01, the security problems persist. Also, severe problems persist at least into Netscape 2.01. (This current problem is not unique to Netscape but rather applies to all clients using Java.) I don't know if the current version fixes this particular security problem, but I do know that Netscape Communications is likely to fix it eventually. On the other hand, I also know that as old security problems are fixed, new ones may appear. What can a working scientist do? My advice is 1) keep in touch with the author of your browser and upgrade to new versions as seems warranted, 2) at present, upgrade from Netscape 2.0 to the current version, 3) consider turning off support of Java, and 4) above all, be cautious, alert, and informed.

(Thanks to Paula Burch of Baylor College of Medicine's Academic Informatics Services for help with this section. Any problems with it are, however, completely my responsibility.)

MOOs

MOO stands for MUD, Object Oriented, where MUD stands for Multi-User Dungeon. Dungeon is one of the first of the computer games, a text-based game loosely derived from the (non-computer) game Dungeons and Dragons. In Dungeon, one types a series of commands into the computer to cause an imaginary self to maneuver through an imaginary environment to try to avoid being killed by imaginary monsters, to solve imaginary puzzles, and to accumulate imaginary treasures. In a MUD, many players participate in the same imaginary environment so that they can interact with each other as well as with the computer, adding to the gaming environment. Besides MOO, there are many forms of MUD. They are, in general, systems for creating these games and allowing multiple people to play them over the internet. MOO, however, was created at Xerox because it was felt that text-based virtual reality could be used for serious conferencing as well as for gaming. (The Object Oriented in the name describes the built-in programming language for creating and modifying the virtual reality.) I assume most of you are more or less familiar with the concept, given that this course is conducted on a MOO, and in any case, description is a much less efficient way of conveying what this is all about than participation. The links below connect to three (non-gaming, more or less serious) MOOs of particular interest to biologists. Use them and log on as "guest" to explore these environments.

BioMOO: Connect, Home Page
DU MOO: Connect, Home Page
CollegeTown MOO: Connect, Information

Some Practical Considerations for Using Internet Services

  1. Virtually all Internet resources are free. It is unlikely in the extreme that you would spend money by accident while exploring the Internet.
  2. Although the Internet is becoming increasingly important to biologists, it is still not a sufficient resource for keeping up with biology. In fact, it is far from the most important resource for keeping up with biology. With access to a good (or even adequate) science library, you could do without Internet, albeit with a significantly greater amount of work on your part. On the other hand, Internet cannot come close to making up for the absence of a library. It is, of course, also true, that no matter what you do you cannot hope to keep up with progress in even a narrow specialty within biology. Some of us are devoting our lives to struggling with this problem, and the solutions are likely to be computer based and to involve the Internet. I hope and expect to be able to reverse the sense of this paragraph within the next 20 years.
  3. Network resources are not as reliable as one would like. If you select a resource and receive only an error message, the fault is likely to be with the server. It is even the case that if you reach the server and don't receive the results you expect (e.g. search for a common term and get nothing in return) this might be due to the fact that the server is misbehaving rather than you doing the search incorrectly.
  4. Just because it is on a computer doesn't mean it is correct. This sounds like a banal truism, but it is startling how good the beauty of computer output can make bad data look. Genbank is loaded with author errors (both in the sequence and comments sections) and probably contains some archivist-introduced errors as well. The same is true of Medline. Any study which assumes perfection in such data will produce an incorrect result.
  5. Compared to the well developed conventions for referencing data from the paper literature, conventions for acknowledging Internet resources in a thesis or paper are in utter chaos. Surfing the net for a few months will uncover a number of competing standards for how to accomplish this. For resources available on the web, a URL is my reference of choice. Dave Kristofferson has argued in favor of a standard for newsgroup messages, for which URLs are inapplicable.



How to Find Things on the Internet

So how do I find all these wonderful tools and data that are supposed to be on the Internet? The bad news is that there is no perfect way of doing so. The good news is that you don't need a perfect way; finding half the useful resources on the Internet is well worth while, and a LOT better than ignoring the Internet altogether. It is also true that the tools available for finding things on the Internet are a lot better than they used to be. Unfortunately it is also true that the task required of these tools is a lot greater than it used to be. The Internet is vast and disorganized, and the overwhelming majority of what is there is irrelevant to you. Further, the Internet changes constantly; new resources appear, old resources become outdated or disappear, and the paths and techniques used to access resources change.

In my opinion, the overwhelmingly most useful approach to finding resources on the Internet is to let someone else do it. This is not immoral or lazy but just good sense. If someone has already gone to the effort of identifying the best biology resources on the Internet, it is wasteful for you to try to reinvent the wheel. Fortunately for you, someone has done so. In fact, several lists of Internet resources exist so that a lack in one is likely to be covered by another. Mostly, I rely on the lists assembled by Paula Burch and the other good folks at MBCR. These do not try to be definitive, but the interests of MBCR tend to match mine. Among those lists that do try to be definitive, some of my favorites are:

  1. EXPASY.
  2. Pedro's Biological Research Tools[3].
  3. USGS NETWORK RESOURCES: BIOLOGY SERVERS.
  4. The World Wide Web Virtual Library: Biosciences: simple List, or Searchable Index.

For the first three of the above resources, you will see a simple list of links which you can select and explore. If you find something you like, you should add it to your Hotlist/Bookmark List.

For the fourth resource, you can get the same thing by selecting "List" in the resource. However, if you select "Searchable Index" you can search for keywords in the list rather than just trying to scroll through it. Searchable indexes will become more and more important as the number of resources grows.

This resource also illustrates an unfortunate problem with searching on the web, however: its search syntax is not standard. (This is not a problem with this particular resource, but rather a problem common to all web searches.) Note that the search allows for "perl regular expressions." I know and love both regular expressions and perl, but if you don't, figuring out how to do this can be daunting. Making matters worse is that there are several flavors of regular expression on the loose (thus the qualifier PERL regular expression) and knowing one can mislead you when using another. In my opinion, standardization is necessary in this area.
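As a rough illustration of what a regular-expression search over a list of resources does, here is a minimal Python sketch. Python's re module follows Perl-style syntax for the constructs used here. The resource titles and the function name are invented for the example; this is not the Virtual Library's actual index or search code.

```python
import re

# Hypothetical resource titles, invented for this example.
TITLES = [
    "ExPASy Molecular Biology Server",
    "Pedro's Biological Research Tools",
    "USGS Biology Servers",
    "Genome Database (GDB)",
]

def search_titles(pattern, titles):
    """Return the titles matching a Perl-style regular expression, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [t for t in titles if regex.search(t)]

# "biolog(y|ical)" matches either "biology" or "biological".
hits = search_titles(r"biolog(y|ical)", TITLES)

# \b marks a word boundary in Perl-style syntax; other flavors (POSIX basic
# regular expressions, for instance) spell grouping and alternation differently.
gdb_hits = search_titles(r"\bGDB\b", TITLES)
```

A pattern written for one flavor may fail, or silently match something else, in another, which is the practical cost of the lack of standardization noted above.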

How do you know when a new list of resources appears, or how do you hear about resources that, for one reason or another are not on these lists? You may hear about them from friends or colleagues, with whom you may communicate in person (e.g. when you bump into someone in the hall), on the telephone, by email or on a MOO, just as you hear about new reagents or techniques. Email and MOO are, of course, Internet-based, but work much like telephone and personal communication respectively.

Newsgroups, which are another way of finding out about resources, are unlike most conventional modes of scientific communication. (In this discussion I will focus on Bionet, because that is what I use, but some of the sci.bio groups might be appropriate for such communications as well.) By posting a question to bionet.software, for example, you are asking your question of hundreds or thousands of people. The advantages of this are obvious. The disadvantage is that you can easily waste the time of hundreds of people with an inappropriate[4] question.

What if the above approaches fail, or what if you take upon yourself the duty of preparing a list of resources? There are a number of resources available for searching the Internet. Paula Burch at MBCR has generated a nice list of these. Among these are servers that search the http servers, ftp servers, and usenet. Finally, in addition to the usenet-search resources listed by Paula, Bionet has its own web site for searching only its postings - an excellent way for a biologist to reduce noise.




Use of World Wide Web Resources Required in this Course


Resources Needed Throughout the Course

There are two resources that you will use over and over again in the course, and hopefully for a long time thereafter: Entrez and the BCM Search Launcher.

Entrez is an extremely powerful tool for searching nucleotide sequence databases, protein sequence databases, or a subset of Medline. For the sequence databases, what you search is the comment fields of the sequences, not the sequences themselves. To search for sequences, use the BCM search launcher or another similar tool. Entrez is constantly being redesigned, so that if you are reading this chapter some time after I wrote it, many of the details of how to use Entrez are likely to have changed. A year ago, the home page was a bit confusing, but at present it is, in my opinion, completely intuitive. There are two lists of links on the home page. The first are links to various help files which you ought to take full advantage of. The second are links to the actual searches, including searches of Medline, Genbank, or protein sequence databases required for this course. (Medline is listed twice; "Search the molecular biology subset of MEDLINE" and "Find MEDLINE articles that match a given text". You want the first for this chapter.)

What makes using these Entrez searches confusing (to me, at least) is that Entrez has a rather unique way of doing things, and it is not always clear (to me, at least) how various search options work, even after reading the help files and experimenting. I don't try to cover all of the capabilities of Entrez in this section, and some of what I do cover I probably have wrong. I strongly urge you to read the help files and experiment to learn about the many other powerful features of Entrez.

Practical Use of Entrez

This section should not be taken as a definitive explanation of the best way to use Entrez, but rather as one way that I find to be convenient and useful.

The nucleotide, protein, and Medline databases are all searched in the same basic way. From the home page, select which of these you want to search, and searching thereafter is very similar.

Once you select a database from the home page, you come to a search page. Entrez is designed to build searches progressively. You perform an initial search and then either decrease (or increase) the number of hits retrieved by adding search terms which are ANDed (or ORed) to what you have. This is potentially very powerful, but has some drawbacks as well. The major one is that typically I want to do a series of unrelated searches. Because of the progressive nature of Entrez, however, searches done one after another all build on one another. On the search page are "Accept" and "Clear" buttons, but also a link (not a button) labeled "clear all". Select this link to perform a totally new search.
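The progressive narrowing and broadening described above can be pictured as set operations on the lists of hits. The following is a minimal sketch under that assumption; the index, the record identifiers, and the function names are all hypothetical, and nothing here reflects how Entrez is actually implemented.

```python
# Hypothetical index: search term -> set of record identifiers.
INDEX = {
    "kinase": {"rec1", "rec2", "rec3"},
    "human": {"rec2", "rec3", "rec4"},
    "mouse": {"rec5"},
}

def start_query(term):
    """Begin a fresh query (the equivalent of "clear all" followed by a search)."""
    return set(INDEX.get(term, set()))

def and_term(current, term):
    """AND a term onto the current query: the number of hits can only decrease."""
    return current & INDEX.get(term, set())

def or_term(current, term):
    """OR a term onto the current query: the number of hits can only increase."""
    return current | INDEX.get(term, set())

hits = start_query("kinase")           # three records
narrowed = and_term(hits, "human")     # only records matching both terms
broadened = or_term(narrowed, "mouse") # adds records matching the new term
```

Because each step starts from the previous result, an unrelated second search must begin with start_query again, which is what the "clear all" link accomplishes.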

Once you have completed your first search, (which can take one or two steps, see examples below) the page will be divided into three parts, named Current Query, Add Term(s) to Query, and Modify Current Query. These are used to conduct progressive searches. Such progressive searches are not needed for this course, so I will not explain them here, but I urge you to explore them by experimentation and using the help files.

The BCM Search Launcher


The BCM Search Launcher allows you to launch a number of different kinds of searches from a common front end. To deal with the vast number of options that result, a rather cryptic but powerful series of pages was designed.

From the home page, select Protein or Nucleic Acid searches or Multiple Alignments. The pages you go to, in each case, will all be similar in structure. Near the top of the page will be a text box into which you paste (or type) the sequence(s) to be analyzed. In the middle of the page are the buttons you use to launch (or clear) the search, and on the remainder of the page is where you describe what kind of search you want to do, implemented as radio buttons. Each radio button represents, in general, a server (although two buttons might represent the same server with different options). For each server, there are a few words of explanation, including where the server is located, and most importantly, three links:

  1. One labeled [H] which links you to a help file for the server.
  2. One labeled [P] which tells you the parameters the search will use.
  3. One labeled [O] which takes you to an alternative page from which to launch the search which allows you to set options to values different from the default.

Resources Required for Chapter 1: Pairwise Sequence Alignments

Although the BCM Search Launcher includes the various kinds of BLAST and FASTA searches, for Chapter 1 of the course you will be adjusting parameters of these searches which are not normally adjusted, and thus which are not adjustable from BCM Search Launcher. As a result, you will use the additional resources to do BLAST and FASTA searches required for Chapter 1.

The BLAST server is implemented as an HTML form you fill out. As such, it is largely self-explanatory. I do, however, note the following points which might be helpful.

  1. Not all options apply for all searches; choosing the program BLASTN, for example, makes choice of a matrix irrelevant.
  2. Depending on your browser, some of the data entry fields work in a confusing way. For example, the field in which you specify the maximum number of sequences to return contains a default of 250. When you select it, the field may display 0 even though the 250 is still there. If this happens, you have to remove it by backspacing to enter the desired value.
  3. In order to specify values for S and W you click on the Additional Options: YES radio button. Having done that, you need to specify which additional options you want in the next field as if you were supplying them on the command line to the UNIX-based BLAST program. To find out how to do that, click on the phrase "Additional Options" (which is a link) which will take you to a page explaining that.
  4. One example: For a nucleic acid sequence, the default value for W (the word size) is 12. The value for S (the cutoff score) is normally calculated from E (the number of chance matches expected from the search) and the other characteristics of the search. Normally one does not change W at all, and changes S only indirectly by changing E, but for educational purposes you will set both directly here. To accomplish this, you might type into the field:

    s=50 w=10

    Values of W over 12 (the default for nucleic acids) are not allowed. Further, although the search is performed with non-default values of W, a warning message is generated.
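To make the NAME=VALUE form of the additional options concrete, here is a small Python sketch that parses such a string and applies the one limit mentioned above (W no greater than 12 for nucleic acids). The parser and its names are hypothetical and are not the server's own code; the server may accept other options and enforce other limits.

```python
MAX_W_NUCLEIC = 12  # limit stated in the text for nucleic acid searches

def parse_blast_options(text):
    """Parse a string such as 's=50 w=10' into a dict of upper-case names to ints."""
    options = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"expected NAME=VALUE, got {token!r}")
        options[name.upper()] = int(value)
    # Reject word sizes above the stated limit.
    if options.get("W", 0) > MAX_W_NUCLEIC:
        raise ValueError(f"W may not exceed {MAX_W_NUCLEIC}")
    return options

opts = parse_blast_options("s=50 w=10")
```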

The Fasta server is suggested here for a number of reasons:

  1. It allows you to set ktup.
  2. It is the only server I am aware of that offers the latest version of FASTA, version 2.0. Version 2.0 offers a number of advantages over previous versions, including being more sensitive and providing a statistical estimate of the probability that each match might have occurred by chance.
  3. Unlike the GenQuest server (which is the server accessed by the BCM Search Launcher) it allows you to do FASTA searches of DNA sequences. (The GenQuest server only allows FASTA searches of peptide sequences.)
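For readers unfamiliar with ktup: it is the length of the short words that FASTA looks up to find candidate regions of similarity before doing any more careful comparison. The sketch below illustrates only that word-lookup idea on toy sequences; the function names are invented, and this is not FASTA's actual implementation, which goes on to score and join the matching regions.

```python
from collections import defaultdict

def ktup_index(seq, ktup):
    """Map every word of length ktup in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def shared_words(query, subject, ktup):
    """Return (query_pos, subject_pos) pairs where the same ktup-long word occurs."""
    index = ktup_index(subject, ktup)
    matches = []
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            matches.append((i, j))
    return matches

query, subject = "ACGTAC", "TACGTT"
loose = shared_words(query, subject, 2)   # short words: many candidate matches
strict = shared_words(query, subject, 4)  # long words: few candidate matches
```

A small ktup finds more word matches and so is more sensitive but slower; a large ktup is faster but can miss weaker similarities.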

Use of this server is largely self-explanatory. From the home page, linked above, scroll down the page and pick the (FASTA) search you wish to perform. There are different links for DNA and protein searches, for example.

Unfortunately, the above FASTA server was unreliable at the time of this writing. Thus, this Fasta server is given as a backup. It is not FASTA 2.0 and is rather bare-bones compared to many of the servers on the web. For this server, you initiate your search on the web and the result is emailed to you. The form is mostly self-explanatory, with the exception of one part I find confusing: the choice of a library to search. At this point you pick one from a list of cryptic 4 or 5 letter codes from a pop-up. I have been unable to find help for this, but after some thinking and experimenting came to the conclusion that they represent EMBL, new (recent) entries (EMNEW); EMBL, all entries (EMALL); the equivalents for Genbank (GBNEW, GBALL); and EMBL divided up by organism: EVRL = Viral sequences in EMBL, etc.

You will be doing the pairwise alignments for Chapter 1 using the alignment tool built into BioMOO. The reason for this is that we were unable to locate a net-based alignment tool which had the characteristics required for this chapter. You might, however, want to explore the following Pairwise Alignment server.

Resources Required for Chapter 3: Multiple Alignment

SRSWWW is a complex server which has a number of properties which I find confusing. First, there are a lot of apparently similar options which do rather different things. For example, on the home page there is a conventional link named Databanks, and below that a series of buttons labeled "Search sequence libraries", "Search libraries with protein structure information", "Search a library linked to sequence libraries", etc. If you know you want to search PDB, the database of protein structures generated by Brookhaven National Laboratory, which of these should you choose? In general, it depends. For this chapter of the course, the answer is the link to Databanks.

The advantages of SRSWWW are that it is more comprehensive than other servers of this kind, and that it contains relationships between data that are not easily accessible from the original data and which, among other things, allow you to link between databases. Learning how to use the full capabilities of this server is well beyond the scope of this course, but fortunately, there are only two features of SRSWWW which you will need for this course: PDBFINDER and ALI (a.k.a. 3dALI). ALI is used only for a few optional exercises and thus will not be treated in detail. To access it, select the:

       Network Browser for Databanks in Molecular Biology

page. This gives you a list of databases, from which you select ALI. Note that for the ALI database, only searches by ID work. An ALI ID is a cryptic letter code. The easiest way to find a code of interest (if someone has not given it to you) is to do a wildcard search (which you get if you search without entering a search term) and browse the list of all IDs returned.

The other database you will use in this chapter is a reformatted PDB called PDBFINDER. PDB is a database of protein 3D structures. Each PDB entry contains a list of amino acids with their 3D coordinates which can be converted into a protein structure. Our use of PDB in this course is surprising in that we make minimal use of this structural information. Rather, we mostly use the amino acid sequence stripped of its 3D information. The reason for this is that you will be repeating a published alignment, and the structural information was used in the publication. Thus, the authors referred to the sequences they used by their PDB filenames, and in order to retrieve these sequences, you must do so from PDB.

PDB is not an ideal database from which to retrieve sequences. In the first place, the sequence is embedded within structural information and it would take a fair bit of work with a text editor (or a program) to cleanly excise the sequence. Further, PDB uses three letter amino acid codes which would have to be converted to 1 letter codes before the sequence could be used by most programs. Use of SRSWWW/PDBFINDER solves these problems for you: when you retrieve a file from PDBFINDER, the peptide sequence is presented as one (long) line of 1 letter amino acid codes, trivial to excise. In addition, in many cases SRSWWW is able to link the PDBFINDER file to the equivalent Swissprot file.
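To give a sense of the work that PDBFINDER saves, here is a minimal Python sketch of excising a sequence from a PDB file by hand. It assumes the sequence is listed in SEQRES records with residue names starting at column 20, concatenates all chains, and maps any unrecognized residue name to X; the function name and the example lines are invented for illustration.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def seqres_to_one_letter(pdb_lines):
    """Collect residue names from SEQRES records and return a one-letter string.

    Assumes residue names begin at column 20 of each SEQRES record.
    All chains are concatenated; unrecognized names become 'X'.
    """
    residues = []
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            residues.extend(line[19:70].split())
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)

# Invented example lines in the assumed layout.
example = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    "SEQRES   1 A    8  GLY ILE VAL GLU GLN CYS CYS THR",
    "END",
]
sequence = seqres_to_one_letter(example)
```

A real file with several chains or modified residues would need more care than this, which is part of why a pre-processed source is convenient.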

In summary, retrieving a sequence from a PDB file requires the following steps:

       Network Browser for Databanks in Molecular Biology

When you perform the above steps the first time you are requested to do so in this chapter, your search will return no results. This is due to another property of PDB which makes it difficult to use, that PDB ID numbers are not guaranteed to persist. It is characteristic of database entries that they can become superseded by newer entries. How to handle this is a major philosophical issue in database design. In Genbank, the ACCESSION field contains the accession number of the current file and following that the accession numbers of all sequences it supersedes. Thus, if you search Genbank with a superseded ID number, you will retrieve the more recent record. This is not the case for PDB. The older ID is contained in the record, but only in a superseded field, which can only be searched using a free text search. Unfortunately, SRSWWW cannot do full text searches of PDB. Thus, another resource you will need in this chapter is the PDB Gopher. In general, if you search for an ID and do not find it, do a free text search for it on this gopher. Use of this gopher is self-explanatory.
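The Genbank approach described above, in which a search on a superseded accession number still reaches the current record, can be sketched as a lookup table that indexes each record under both its current and its older identifiers. The records and accession numbers below are invented for illustration, and the real databases are of course not implemented this way in detail.

```python
# Hypothetical records: each has one current accession and a list of the
# older accessions it supersedes. All identifiers are invented.
RECORDS = [
    {"accession": "X00003", "supersedes": ["X00001", "X00002"], "title": "record A"},
    {"accession": "X00010", "supersedes": [], "title": "record B"},
]

def build_accession_index(records):
    """Map every current or superseded accession to its current record."""
    index = {}
    for record in records:
        index[record["accession"]] = record
        for old in record["supersedes"]:
            index[old] = record
    return index

index = build_accession_index(RECORDS)
found = index.get("X00001")    # an old accession still reaches the current record
missing = index.get("X99999")  # an unknown accession returns None
```

If only the current identifier were indexed, as in the PDB case, the lookup for the old identifier would come back empty and a free text search would be the only way to find the record.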

Although Entrez contains a copy of the Swiss-Prot database, the copy of the Swiss-Prot database available from the SWISS-PROT Server itself contains sequences that are absent from the Entrez copy. These sequences are expired (that is, they have been replaced by updated versions of these sequences) and thus have been removed from the Entrez version of Swiss-Prot. Since these expired sequences are required for this chapter, you will use the SWISS-PROT server directly to obtain them. Help is available for the SWISS-PROT server.

The following will also be used, but are only described briefly at this point. They will be considered in more detail in the chapter itself.

  1. MSA is a good tool for aligning a small number (e.g. 5) of sequences in order to gain insight into their relationships. This server uses the algorithm described in Altschul, Lipman, Kececioglu (1989).
  2. The BCM Search Launcher (which we have covered earlier), especially the Clustal facility of that server.
  3. The MaxHom Alignment server[5], which is offered together with structure prediction. This server also has an information facility available.
  4. The All-All related peptide sequences server[5], discussed below under the resources for Chapter 4.
  5. The AMAS server [5] which you can use to Analyse Multiply Aligned Sequences.
  6. WebLogo Sequence Logo Generation [5], a tool for the Analysis of Multiple Alignments. It requires its input in FASTA format.
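Since FASTA format comes up as a required input, here is a brief Python sketch of producing it: each sequence gets a header line beginning with ">" followed by the sequence itself, wrapped over one or more lines. The function name, the sequence names, and the line width are choices made for this example; individual servers may be stricter about what they accept.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"

# Two invented peptide fragments, wrapped at 5 residues to show the layout.
text = to_fasta([("seq1", "MKTAYIAKQR"), ("seq2", "MKTAYLAKQR")], width=5)
```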


Resources Required for Chapter 4: Mathematical Analysis of Molecular Phylogenetics

The sequences you will need in this Chapter can all be retrieved from Entrez, as is described above.

You will need to do multiple sequence alignments using the CLUSTAL program in the BCM Search Launcher, also described above. Select Multiple Alignments from the first page, and CLUSTAL will be an option on the next page.

The Tree of Life provides a linked phylogeny of species which will be useful for interpreting sequence comparisons. Its use is obvious.

All-All is a server which does "all vs. all" alignments of a list of sequences[6]. By looking at the offered example, you ought to be able to easily figure out how to use this server. Note that all of the sequences are listed one after another in one entry field.
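An "all vs. all" comparison simply means that every sequence in the list is compared with every other one, so n sequences give n(n-1)/2 pairs. The sketch below enumerates those pairs in Python. The scoring function is a deliberately crude stand-in (fraction of identical positions, no gaps) and is not the alignment method the All-All server uses; the names and sequences are invented.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def all_vs_all(seqs):
    """Score every unordered pair of named sequences exactly once."""
    return {
        (n1, n2): identity(s1, s2)
        for (n1, s1), (n2, s2) in combinations(seqs.items(), 2)
    }

scores = all_vs_all({"a": "MKTAY", "b": "MKTAF", "c": "MRTAY"})
# Three sequences yield three pairs: (a, b), (a, c) and (b, c).
```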




Other Web Resources

Medline is an online database of "medical" references. Because "medical" is interpreted extremely broadly (all articles in the journal "Cell", for example, are in there) it is of value to virtually anybody having anything to do with biology. It contains a complete literature citation, a complete abstract in most cases, and a variety of other useful kinds of information (institution of first author, grant support for the research described, and key words). This database is indexed and searchable. Searches normally occur in real time and take seconds. Papers back to 1966 are included, although the earliest references lack abstracts.

One factor that dramatically increases the value of Medline is the fact that it is constructed by professional abstractors who assign key words from a controlled and highly structured vocabulary. Thus, if you can define key words that describe what you are interested in, you can be sure that you will not miss articles because the author uses terminology different from what you use in your search.
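The benefit of a controlled vocabulary can be shown with a toy Python example: if every article is indexed under one agreed heading, then any synonym that maps to that heading retrieves the same articles. The synonym table, the articles, and the function name below are invented for illustration and are not taken from Medline's actual vocabulary files.

```python
# Illustrative synonym table: free-text term -> controlled heading.
CONTROLLED = {
    "cancer": "Neoplasms",
    "tumor": "Neoplasms",
    "tumour": "Neoplasms",
    "neoplasm": "Neoplasms",
}

# Hypothetical articles, each indexed under controlled headings only.
ARTICLES = {
    "article 1": {"Neoplasms"},
    "article 2": {"Neoplasms", "Mice"},
    "article 3": {"Mice"},
}

def search(term):
    """Translate a free-text term to its heading, then find articles indexed under it."""
    heading = CONTROLLED.get(term.lower(), term)
    return sorted(title for title, headings in ARTICLES.items() if heading in headings)

by_cancer = search("cancer")
by_tumour = search("Tumour")
```

Both searches return the same articles regardless of which word the authors themselves used in their text.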

So far as I know, there is no complete set of Medline that can be accessed for free. However, in my opinion, paying for access to a complete version of Medline is practically a necessity. Access to Medline can be obtained on an individual basis either from The National Library of Medicine (NLM) or from other vendors, such as BRS or Dialog. In many cases, an institution will obtain a Medline site license. Baylor, for example, is a part of Texas Medical Center, which makes Medline available to all of its members for no charge.

ENTREZ, which we talked about as a sequence retrieval tool above, is a database which links a subset of Medline (including abstracts!), all of Genbank, and a comprehensive protein sequence database. In addition to being able to search Medline in powerful ways (e.g. by words, MeSH heading, etc.), and being able to search the sequence databases both for sequences and for key words in the comments section, links are provided from Medline references to the sequences reported in them and, inversely, from references in the comments section back to Medline. Finally, both for references and sequences, links to other references similar in topic or to other sequences related by sequence similarity are provided.

Linking of databases is being promoted as one of the important future directions in biocomputing. Unfortunately, there are few concrete examples of this presently available. The best of these, in my opinion, is Entrez.

OMIM stands for Online Mendelian Inheritance in Man. Mendelian Inheritance in Man has as its organizing concept human genetics, especially human genetic diseases, but its author, Victor McKusick, takes such a broad view of this topic that this resource is almost a global review of (human) biology. Everyone who cares about biology should have a link to OMIM on their Hotlist/Bookmark List!

GDB stands for Genome Database. It is a massive relational database of human genetic mapping information. Because genetic mapping is the next step up the size scale from sequencing, this is a resource that biologists interested in sequence analysis ought to explore.




VSNS-BCD Copyright

David Steffen
steffen@blkbox.com