March 23, 2011, 4:05 a.m.
posted by franni
HTTP Technology Primer
HTTP is the basis for Web browsing. HTTP is built upon TCP/IP and is considered an application-level protocol for distributed, collaborative, hypermedia information systems. It is a request/response-oriented protocol where an HTTP client makes a request, and an HTTP server services that request and subsequently responds.
When looking at HTTP from an application-programming point of view, the first thing to understand is that HTTP is connectionless and stateless. HTTP is based upon a Web server (sometimes referred to as HTTPD [HTTP daemon]) receiving a request and then formulating a response to a client. It is stateless because neither the client nor the server is required by the protocol to retain any state information regarding the application data between requests. It is up to the application programmer to maintain any state information necessary to the application.
In most cases, the client is a Web browser but could be an application, a Java applet, or another Web server. While this request/response protocol is not as sophisticated as the newer connection-oriented protocols such as Internet Inter-ORB (Object Request Broker) Protocol (IIOP), it has proved very flexible in allowing a wide variety of vendors to create Web servers, Web browsers, and other HTTP-based systems.
1 Uniform Resource Identifiers
URIs have gone by many names: Universal Resource Identifiers, Universal Resource Locators, WWW addresses, Uniform Resource Locators (URL) and Uniform Resource Names (URN). URLs and URNs are kinds of URIs. A URL identifies a resource by its location and is not specific to the HTTP scheme (ftp and mailto URLs exist as well), while a URN identifies a resource by a name that is independent of its location. As far as HTTP is concerned, URIs are simply formatted strings that identify a resource via name, location, or any other characteristic. URIs in HTTP can be represented in absolute form or relative to some known base, depending upon the context of their use. The two forms differ in that an absolute URI always begins with a scheme name (such as http) followed by a colon.
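The difference between absolute and relative URIs can be illustrated with Python's standard urllib.parse module. This is only a sketch: the helper names is_absolute and resolve and the path images/logo.gif are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(uri: str) -> bool:
    """Return True when the URI begins with a scheme name followed by a colon."""
    return bool(urlsplit(uri).scheme)


def resolve(base: str, reference: str) -> str:
    """Resolve a (possibly relative) reference against a known base URI."""
    return urljoin(base, reference)


# Hypothetical base document and relative reference.
base = "http://www.abc.com/docs/index.html"
print(is_absolute(base))                   # True
print(is_absolute("images/logo.gif"))      # False
print(resolve(base, "images/logo.gif"))    # http://www.abc.com/docs/images/logo.gif
```

A relative URI only has meaning once a base is known, which is why a browser resolves the links on a page against the URI of the page itself.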
HTTP does not place any limits on the length of a URI. HTTP servers should therefore be able to handle the URI of any resource they serve. However, programmers formulating a URI ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths.
1.1 HTTP URL
Each Web resource (an HTML page, a JSP page, a servlet, etc.) that can be requested from a Web server must have a unique name associated with it. That unique name is called a URL. For discussion purposes, let's consider a URL as a way to uniquely identify a Web page, which exists on a particular Web server. For example, to access the index.html page on the www.abc.com Web server, the absolute and explicit URL would be http://www.abc.com/index.html. The format of a URL is as follows:
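In general form (following RFC 2616), an http URL looks like http://host[:port][/path[?query]], where the port defaults to 80 when it is omitted. The sketch below uses Python's urllib.parse to pull those parts out of a URL. The function name describe, the port 8080, and the query string are hypothetical and are not taken from the text above.

```python
from urllib.parse import urlsplit


def describe(url: str) -> dict:
    """Split an http URL into the parts named in the general form."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # Assumption: an http URL without an explicit port uses port 80.
        "port": parts.port or 80,
        # An empty path is treated as "/".
        "path": parts.path or "/",
        "query": parts.query,
    }


print(describe("http://www.abc.com/index.html"))
print(describe("http://www.abc.com:8080/search?q=http"))
```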
1.2 What is the difference between a URI and URL?
According to the specification [RFC 2396], all URLs are URIs, but not every URI is a URL. Because a URI may identify a resource by name rather than by location, URIs allow Web services to be defined in such a way that they are not bound to a specific server. This has many advantages.
2 Requests, Responses, and Headers
HTTP is a simple protocol based on a client sending a request to a Web server and then getting a response. When the client sends a request, the request contains all of the information that the Web server needs to process the request. Both the request and the response contain a start-line, zero or more header fields (also known as "headers"), an empty line (i.e., a line with nothing preceding the CRLF) indicating the end of the header fields, and possibly a message-body.
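The message layout described above (start-line, header fields, empty line, optional body) can be sketched with a few lines of Python. The function name split_message is made up, and the sketch deliberately ignores details such as repeated or folded header fields.

```python
def split_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body).

    Simplified sketch: repeated header fields overwrite each other and
    folded (multi-line) header values are not handled.
    """
    # The first empty line (CRLF CRLF) ends the header fields.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header field names are case-insensitive, so normalize them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.abc.com\r\n"
    b"Accept-Charset: utf-8\r\n"
    b"\r\n"
)
start_line, headers, body = split_message(raw)
print(start_line)        # GET /index.html HTTP/1.1
print(headers["host"])   # www.abc.com
print(body)              # b''
```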
The headers section of a message contains a general-header section (headers that are applicable to both the request and the response), an entity-header section, and either a request-header section or a response-header section, depending upon the type of message. The general-header section contains items such as Cache-Control, Date, and Transfer-Encoding. The Transfer-Encoding header can affect the message length, because the transfer encoding applied (for example, chunked) may increase the size of the message body as transmitted.
The request-header section contains headers such as Host, Accept-Charset, and Referer. The Referer header specifies the URL of the page from which the request came, while the Host header contains the name of the target host specified in the request (the host that is processing the request).
The response-header section contains headers such as Age, Location, and Server. The Server header specifies the name of the server that generated the response.
Entity headers define metainformation about the entity-body or, if no body is present, about the resource identified by the request. Some entity-headers include Allow, Content-Encoding and Last-Modified.
In the case of a request message, the start-line is the request itself. An HTTP request is characterized by a method token, followed by a Request-URI and a protocol version, ending with a CRLF. The method token is one of GET, POST, OPTIONS, HEAD, PUT, DELETE, TRACE, CONNECT, or some extension method as defined by the implementation.
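As a rough illustration of the request start-line, the following Python sketch splits a request line into its method token, Request-URI, and protocol version. The names RequestLine and parse_request_line are hypothetical, and the validation is intentionally minimal.

```python
from typing import NamedTuple

# The method tokens listed in the text; anything else is treated as an extension method.
KNOWN_METHODS = frozenset(
    {"OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE", "TRACE", "CONNECT"}
)


class RequestLine(NamedTuple):
    method: str
    request_uri: str
    version: str
    is_extension: bool


def parse_request_line(line: str) -> RequestLine:
    """Parse 'Method SP Request-URI SP HTTP-Version CRLF'."""
    if not line.endswith("\r\n"):
        raise ValueError("request line must end with CRLF")
    parts = line[:-2].split(" ")
    if len(parts) != 3:
        raise ValueError("expected method, Request-URI, and version")
    method, request_uri, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError("unrecognized protocol version")
    return RequestLine(method, request_uri, version, method not in KNOWN_METHODS)


print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
```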
When using HTTP methods to create a request, the application programmer should understand that the authors of the HTTP specification treat some methods as safe and others as unsafe. The specification defines safe methods so that user agents can warn a user when a possibly unsafe action is being requested. Safe methods are expected not to generate side effects when called. The protocol does not, and cannot, enforce this idea, as implementers are free to create servers that handle these requests in any way they see fit. Two HTTP request methods are particularly important to the programmer: GET and POST. GET is a safe method, while POST is unsafe because it is expected to cause side effects by posting new data.
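A user agent could apply the safe/unsafe distinction along the following lines. This is a sketch only: RFC 2616 names GET and HEAD as the methods that ought to be considered safe, and the function name needs_confirmation is made up.

```python
# Methods that RFC 2616 says ought to be considered safe (retrieval only).
SAFE_METHODS = frozenset({"GET", "HEAD"})


def needs_confirmation(method: str) -> bool:
    """Return True when a user agent might warn the user before sending.

    Method tokens are case-sensitive, so "get" is not treated as GET here.
    """
    return method not in SAFE_METHODS


for method in ("GET", "HEAD", "POST", "DELETE"):
    print(method, needs_confirmation(method))
```

Because the protocol cannot enforce safety, a check like this only reflects what the client expects; a server is still free to perform side effects on a GET.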
After receiving and interpreting a request, the server must respond. The response message begins with a start-line, called the status-line, which reports the status of the request. The status-line contains the HTTP protocol version followed by a numeric status code and its associated textual phrase, with each element separated by a space. The status code is a 3-digit integer result code of the attempt to understand and satisfy the request. The textual phrase is a short description of the status code intended for human readers; clients act on the numeric code, not the phrase.
The first digit of the 3-digit code defines the class of the response. The last two digits do not have any categorization role but instead help to uniquely identify the response. There are 5 values for the first digit: 1xx (Informational), 2xx (Success), 3xx (Redirection), 4xx (Client Error), and 5xx (Server Error).
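To make the status-line concrete, here is a small Python sketch that splits a status-line into its parts and derives the response class from the first digit. The class names are the ones used by the HTTP specification; the function name parse_status_line is hypothetical.

```python
# Response classes keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line: str):
    """Parse 'HTTP-Version SP Status-Code SP Reason-Phrase CRLF'.

    Returns (version, code, reason, status_class).
    """
    parts = line.rstrip("\r\n").split(" ", 2)
    if len(parts) < 2:
        raise ValueError("expected a version and a status code")
    version, code = parts[0], parts[1]
    # The reason phrase may be empty and may itself contain spaces.
    reason = parts[2] if len(parts) > 2 else ""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("status code must be a 3-digit integer")
    status_class = STATUS_CLASSES.get(int(code[0]), "Unknown")
    return version, int(code), reason, status_class


print(parse_status_line("HTTP/1.1 404 Not Found\r\n"))
# ('HTTP/1.1', 404, 'Not Found', 'Client Error')
```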
As with any HTTP message, after the start-line (status-line in the response case), the message headers are given, followed by the message-body. The message-body contains the actual data, which will be displayed in the Web browser.
3 Pulling It All Together
When using HTTP, there are numerous scenarios. In an effort to pull together the ideas presented here about URIs and messages over HTTP, we need to take a look at the GET and POST requests and how the interaction between the client and server occurs.
We show a GET request round-trip in Figure. The request is for the URL of http://webserver/index.html. Note that when using a Web browser and HTML to make HTTP requests, a GET request could be made in several ways. Here are the well-known ways:
The Web server, upon receiving this request, maps the request to a file located on the Web server file system (e.g., C:\www\html\index.html) and then responds with the contents of that file to the browser. The entire transaction involves one connection to the Web server and an almost immediate response.
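The GET round-trip can be tried out in a single Python process using the standard http.server and http.client modules. This is a sketch under several assumptions: the page is served from memory rather than from a file on disk, the server listens on the loopback address on a port picked by the operating system, and the names PAGE, PageHandler, and get_round_trip are made up.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the contents of index.html (assumption: served from memory).
PAGE = b"<html><body>Hello</body></html>"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Map the Request-URI to a resource, then respond with its contents.
        if self.path == "/index.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # suppress the default per-request logging


def get_round_trip(path: str = "/index.html"):
    """Start a throwaway server, send one GET request, return the response."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        conn.close()
        return response.status, response.reason, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(get_round_trip())
```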
In Figure, we show a POST request round-trip. This request is for the URL http://webserver/servlet/Register. Note that when using a Web browser and plain HTML to make HTTP requests, a POST request can only be made by submitting a FORM that appears inside an HTML page, typically by clicking a button on it. The FORM needs to specify a method of POST, as in the tag <FORM method="POST" action="url…">.
The Web server, upon receiving this request, transfers it to the servlet engine, which loads the requested servlet by searching the classpath and then runs the servlet. Next, the Web server (or Web container) reads the posted data and performs the requested operation. Lastly, the Web server (or servlet engine) responds with a message that is displayed in the browser. The entire transaction involves at least one connection to the Web server with an almost immediate response.
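To show what travels over the wire in such a POST, the sketch below builds the request a browser would typically send for a submitted form and then recovers the posted fields, as the server side would before running its logic. It assumes the default form encoding (application/x-www-form-urlencoded). The field names and values and the function names build_post_request and read_posted_fields are hypothetical, and no servlet engine is involved.

```python
from urllib.parse import parse_qs, urlencode


def build_post_request(host: str, path: str, fields: dict) -> bytes:
    """Build the raw POST request for a submitted form."""
    # Form fields travel in the message-body, not in the Request-URI.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def read_posted_fields(raw: bytes) -> dict:
    """Recover the posted fields from a raw request (first value per field)."""
    _, _, body = raw.partition(b"\r\n\r\n")
    parsed = parse_qs(body.decode("ascii"))
    return {name: values[0] for name, values in parsed.items()}


# Hypothetical registration form data.
raw = build_post_request(
    "webserver", "/servlet/Register", {"name": "Ann Lee", "email": "ann@example.com"}
)
print(raw.decode("ascii"))
print(read_posted_fields(raw))
```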