Scout Module API
PlanetLab Team

The Scout kernel module provides non-privileged users access to a restricted form of raw IP datagram sockets on PlanetLab. Additionally, it tracks per-slice network usage and allows ports to be reserved on a node for the exclusive use of a particular slice. This document describes the safe raw socket interface, as well as the accounting and port management features of the module.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Chapter 1. Safe Raw Sockets

Overview

Safe raw sockets are used to access raw network data, including IP, ICMP, UDP and TCP headers, while enforcing protection between different users (slices). No user is able to interfere with others by sending or receiving data on ports that have been registered to, or are currently being used by, other users. Sending data from an unregistered port or non-local IP address is also not allowed. Access to safe raw sockets does not require super-user privileges or the corresponding Linux capabilities.

Socket API

The safe raw socket API uses the standard Linux socket API with some minor semantic differences. Just as in standard Linux, the socket must first be created with the socket system call. To create a safe raw socket, the domain of the call must be set to PF_INET, the type to SOCK_RAW, and the protocol to one of IPPROTO_UDP, IPPROTO_TCP, IPPROTO_ICMP, IPPROTO_ICMP_TCP, or IPPROTO_ICMP_UDP (the set listed in Example 1-1). For example, to create a safe raw TCP socket:

sock = socket(PF_INET, SOCK_RAW, IPPROTO_TCP);

Once the socket is created, it is necessary to bind it to a particular local port of the specified protocol (or identifier in the case of an ICMP socket). The standard Linux bind system call is used. To bind the socket created above to local TCP port 9090:
struct sockaddr_in sin;
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(9090);
bind(sock, (struct sockaddr *)&sin, sizeof(sin));
After the socket has been bound to a local port, it is ready to send and receive data. The usual sendto, sendmsg, recv, recvfrom, recvmsg and select calls can be used (note that the send call is not supported for raw sockets).

Packets received on a safe raw socket include the IP and TCP/UDP/ICMP headers, but not the link-layer header. By default, packets sent on a raw socket include the TCP/UDP/ICMP header but not the IP header. To send packets that include the IP header, the IP_HDRINCL socket option must be set on the socket. This call will succeed only after a successful bind:
int tmp = 1;
setsockopt(sock, IPPROTO_IP, IP_HDRINCL, &tmp, sizeof(tmp));
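Because packets received on a safe raw socket begin with the IP header, a receiver has to step past it before reaching the transport header. The following is a minimal sketch of that step for UDP, not code from the Scout distribution: the parsed_udp struct and the read_be16 and parse_raw_udp helpers are hypothetical names introduced here, and the offsets assume the standard IPv4 and UDP header layouts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: fields recovered from a received raw UDP packet. */
struct parsed_udp {
    uint16_t src_port;              /* host byte order */
    uint16_t dst_port;              /* host byte order */
    const unsigned char *payload;   /* points into the caller's buffer */
    size_t payload_len;
};

/* Read a 16-bit big-endian (network order) value. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if buf is not a well-formed IPv4/UDP packet. */
static int parse_raw_udp(const unsigned char *buf, size_t len,
                         struct parsed_udp *out)
{
    size_t ip_hdr_len, udp_len;

    if (len < 20 || (buf[0] >> 4) != 4)
        return -1;                          /* too short, or not IPv4 */
    ip_hdr_len = (size_t)(buf[0] & 0x0f) * 4;   /* IHL counts 32-bit words */
    if (ip_hdr_len < 20 || len < ip_hdr_len + 8 || buf[9] != 17)
        return -1;                          /* bad IHL, truncated, or not UDP */

    out->src_port = read_be16(buf + ip_hdr_len);
    out->dst_port = read_be16(buf + ip_hdr_len + 2);
    udp_len = read_be16(buf + ip_hdr_len + 4);  /* UDP header plus payload */
    if (udp_len < 8 || ip_hdr_len + udp_len > len)
        return -1;
    out->payload = buf + ip_hdr_len + 8;
    out->payload_len = udp_len - 8;
    return 0;
}
```

A caller would pass the buffer and byte count returned by recvfrom, then compare dst_port against the port the socket was bound to before using the payload.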
To close a safe raw socket, use the usual close system call.

ICMP Sockets

ICMP packets can be sent and received through safe raw ICMP sockets. To protect users from interference, each ICMP socket is allowed to send and receive only packets of the registered type bound to the socket. As with standard sockets, the bind system call specifies which packets are received and sent through a socket.

To receive ICMP error messages associated with a specific local TCP/UDP port (e.g., Destination Unreachable, Source Quench, Redirect, Time Exceeded, Parameter Problem), the ICMP socket needs to be bound to that port.
#include <planetlab.h>
struct sockaddr_in sin;
sock = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP_UDP);
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(9090);
bind(sock, (struct sockaddr *)&sin, sizeof(sin));
For example, the above code fragment creates an ICMP socket and binds it to local UDP port 9090. Only ICMP error messages associated with local UDP port 9090 can be received through this socket. This type of ICMP socket is read-only.

To send and receive ICMP messages that are not associated with a specific TCP/UDP port number (e.g., Echo, Echo Reply, Timestamp, Timestamp Reply, Information Request, Information Reply), the socket has to be bound to a specific ICMP identifier. The ICMP identifier is a 16-bit field occupying bytes 5 and 6 of the header of these messages. Only messages containing the right identifier can be sent or received through a safe raw ICMP socket of this type.
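To make the position of the identifier field concrete, here is a small sketch that assembles an ICMP Echo message with a chosen identifier. It is an illustration under assumptions rather than part of the Scout API: build_echo_request and inet_checksum are hypothetical helper names, the layout follows the standard ICMP Echo header, and the checksum is computed in user space as is usual for raw ICMP sockets on Linux. Whether the Scout module verifies or rewrites the checksum is not stated in this document.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_TYPE_ECHO 8    /* ICMP type value for Echo */
#define ICMP_HDR_LEN   8

/* Standard Internet checksum over len bytes, treating data as
 * big-endian 16-bit words. */
static uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);  /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);     /* fold carries */
    return (uint16_t)~sum;
}

/* Writes an Echo message into buf. Returns its length, or 0 if buf is
 * too small. */
static size_t build_echo_request(unsigned char *buf, size_t buflen,
                                 uint16_t id, uint16_t seq,
                                 const void *payload, size_t payload_len)
{
    size_t total = ICMP_HDR_LEN + payload_len;
    uint16_t csum;

    if (buflen < total)
        return 0;
    buf[0] = ICMP_TYPE_ECHO;                /* type */
    buf[1] = 0;                             /* code */
    buf[2] = buf[3] = 0;                    /* checksum, filled in below */
    buf[4] = (unsigned char)(id >> 8);      /* identifier: bytes 5 and 6 */
    buf[5] = (unsigned char)(id & 0xff);    /* when counting from 1 */
    buf[6] = (unsigned char)(seq >> 8);     /* sequence number */
    buf[7] = (unsigned char)(seq & 0xff);
    if (payload_len > 0)
        memcpy(buf + ICMP_HDR_LEN, payload, payload_len);
    csum = inet_checksum(buf, total);
    buf[2] = (unsigned char)(csum >> 8);
    buf[3] = (unsigned char)(csum & 0xff);
    return total;
}
```

The identifier passed to build_echo_request would need to match the value the socket was bound to, since only messages carrying that identifier can be sent through the socket.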
struct sockaddr_in sin;
sock = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP);
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(23456);
bind(sock, (struct sockaddr *)&sin, sizeof(sin));
For example, the above code fragment creates an ICMP socket and binds it to identifier 23456. Only ICMP messages with this identifier and of the proper type can be sent and received through this socket.

No two users are allowed to bind an ICMP socket to the same local UDP/TCP port, or the same identifier. Chapter 3, Port Management, covers port and identifier ownership in more detail.

Restrictions on Sent Packets

Packets sent on a safe raw socket will be rejected if any of the following is true:
Sniffer Sockets

A slice can bind a raw "sniffer" socket to a port owned by the slice to snoop IP datagrams sent and received on that port. Sniffer sockets are read-only and do not interfere with traffic. Creating a sniffer socket on a free port makes the slice the owner of that port, meaning no other slice can bind a socket to that port. A current limitation is that only one sniffer socket can be created per port.

To create a sniffer socket, call setsockopt with the SO_RAW_SNIFF option on the socket before binding it to the port. Example:
#include <planetlab.h>
int tmp, sock;
struct sockaddr_in sin;
sock = socket(PF_INET, SOCK_RAW, IPPROTO_TCP);
tmp = 1;
setsockopt(sock, 0, SO_RAW_SNIFF, &tmp, sizeof(tmp));
memset(&sin, 0, sizeof(sin));   /* sin must not contain stack garbage */
sin.sin_family = AF_INET;
sin.sin_port = htons(1234);
bind(sock, (struct sockaddr *)&sin, sizeof(sin));
Examples
Example 1-1. Creating a safe raw socket
/* protocol = IPPROTO_UDP, IPPROTO_TCP, IPPROTO_ICMP, IPPROTO_ICMP_TCP,
 * or IPPROTO_ICMP_UDP */
if ((sock = socket(PF_INET, SOCK_RAW, protocol)) < 0) {
    perror("socket");
    exit(1);
}
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(local_port);
if (bind(sock, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
    perror("bind");
    exit(1);
}
Example 1-2. PING-PONG program using Linux or safe raw sockets
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/ip.h>
#include <linux/udp.h>

#define BUFFER_SIZE 1500
#define PING 1
#define PONG 2

int
main(int argc, char *argv[])
{
    int sock;
    struct sockaddr_in sin;
    unsigned short local_port;
    unsigned short remote_port;
    unsigned char protocol;
    char *buffer;
    struct iphdr *ip_header;
    struct udphdr *udp_header = NULL;
    char *remote_ip_str;
    unsigned char ping = 0;
    unsigned int *count = NULL;
    unsigned int this_count = 0;
    int semantics = 0;
    int linux_socket = 0;
    unsigned short buffer_size = 0;
    int tmp;
    socklen_t len;

    if (argc < 4 || argc > 6) {
        fprintf(stderr, "USAGE: %s remote_ip local_port remote_port"
                " [PING|PONG] [LINUX]\n", argv[0]);
        return 1;
    }
    protocol = IPPROTO_UDP;
    remote_ip_str = argv[1];
    local_port = atoi(argv[2]);
    remote_port = atoi(argv[3]);

    /* Without a fourth argument the program behaves as PONG. */
    if (argc >= 5) {
        if (strncmp(argv[4], "PONG", 4) == 0) {
            ping = 0;
        }
        else if (strncmp(argv[4], "PING", 4) == 0) {
            ping = 1;
        }
        else {
            fprintf(stderr, "PING or PONG ?.\n");
            return 1;
        }
    }
    if (argc == 6) {
        if (strncmp(argv[5], "LINUX", 5) == 0) {
            linux_socket = 1;
        }
        else {
            linux_socket = 0;
        }
    }
    printf("Remote IP %s, local port %d, remote port %d%s%s\n",
           remote_ip_str, local_port, remote_port,
           ping ? ", PING" : ", PONG",
           linux_socket ? ", LINUX_SOCKET" : "");

    /* LINUX selects a standard UDP socket; otherwise use a safe raw socket. */
    if (linux_socket) {
        semantics = SOCK_DGRAM;
    } else {
        semantics = SOCK_RAW;
    }
    if ((sock = socket(PF_INET, semantics, protocol)) < 0) {
        perror("socket");
        exit(1);
    }

    /* Bind to the local port. */
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(local_port);
    if (bind(sock, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("bind");
        exit(1);
    }

    /* On the raw socket we supply our own IP header (only valid after bind). */
    if (!linux_socket) {
        tmp = 1;
        if (setsockopt(sock, IPPROTO_IP, IP_HDRINCL, &tmp, sizeof(tmp)) < 0) {
            perror("setsockopt");
            exit(1);
        }
    }

    /* From here on, sin holds the remote address. */
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(remote_port);
    sin.sin_addr.s_addr = inet_addr(remote_ip_str);

    /* Keep the on-wire packet size the same in both modes. */
    buffer_size = BUFFER_SIZE
        - (linux_socket ?
           (sizeof(struct iphdr) + sizeof(struct udphdr))
           : 0);
    printf("buff %d, %d\n", buffer_size, BUFFER_SIZE);
    buffer = (char *) calloc(1, buffer_size);   /* zeroed payload */
    if (buffer == NULL) {
        perror("calloc");
        exit(1);
    }

    while (1) {
        if (!linux_socket) {
            ip_header = (struct iphdr *) buffer;
            ip_header->ihl = 5;
            ip_header->version = 4;
            ip_header->tos = 0;
            ip_header->tot_len = htons(buffer_size);
            ip_header->id = 0;
            ip_header->ttl = 64;
            ip_header->frag_off = htons(0x4000);    /* Don't Fragment */
            ip_header->protocol = protocol;
            ip_header->check = 0; /* This will be done in the kernel */
            ip_header->daddr = inet_addr(remote_ip_str);
            /* Leave src IP address blank, kernel will fill it out. */
            ip_header->saddr = 0;

            udp_header = (struct udphdr *) (ip_header + 1);
            udp_header->source = htons(local_port);
            udp_header->dest = htons(remote_port);
            udp_header->len = htons(buffer_size - sizeof(struct iphdr));
            udp_header->check = 0;      /* no UDP checksum */
        }
        if (ping) {
            /* The counter is the first word of the UDP payload. */
            if (linux_socket) {
                count = (unsigned int *) buffer;
            } else {
                count = (unsigned int *) (udp_header + 1);
            }
            *count = this_count++;
            printf("%u\n", this_count);
            if (sendto(sock, buffer, buffer_size, 0,
                       (struct sockaddr *) &sin, sizeof(sin)) < 0) {
                perror("sendto");
            }
        }
        /* A PONG node starts sending once the first packet has arrived. */
        ping = 1;
        len = sizeof(sin);
        if (recvfrom(sock, buffer, buffer_size, 0,
                     (struct sockaddr *) &sin, &len) < 0) {
            perror("recvfrom");
            return 1;
        }
    }
    close(sock);
    return 0;
}
Chapter 2. Accounting

Overview

The Scout module tracks per-slice network usage information and reports this information in /proc/scout/accounts/. Relevant files:

/proc/scout/accounts/summary: Summarizes network usage for all slices
/proc/scout/accounts/[slice id]: Detailed per-slice network accounting

Summary File

The file /proc/scout/accounts/summary provides an overview of usage per slice. An example:

[princeton8@planetlab-3] cat /proc/scout/accounts/summary
slice     sent       recvd      sockcnt
735       2612       406        0
906       101173     769796     0
630       128152     31494      0
816       87215779   63074403   0
38        19952      14616      0
752       20715441   32329457   0
74        309936     210930     0
642       1156702    54851      1
900       2329907    6298539    7
28        14112      33401      0
99        1926602    14897404   5
0         1044464    15327866   7
unknown   5962227    2771272    0

The slice column identifies the slice, and the sent and recvd columns show how many bytes have been sent and received by the slice since the machine was booted. The sockcnt column shows how many sockets the slice currently has open. Loopback sockets are counted in the sockcnt column, but packets sent to the loopback address are not charged against the sent and recvd columns.

The unknown slice ID is a catch-all category for packets that cannot be matched to a slice. An outgoing packet may be charged to unknown if it is sent by a socket not managed by the Scout module, for instance, one internal to the kernel (e.g., an ICMP socket that sends Echo Reply packets) or one created during the boot process before the Scout module was loaded. Likewise, an incoming packet is charged to unknown if it does not demux to a socket managed by the Scout module.

Per-Slice Information

More detailed traffic breakdowns for each slice can be found in /proc/scout/accounts/[slice id]. Each row in the table gives the counts for a particular socket type:
For instance, in the sent_pkts column, the raw_udp row counts the packets sent by the slice on safe raw sockets bound to UDP ports, and the udp row counts the packets sent on standard UDP sockets. Together they reflect the total number of UDP packets sent by the slice. The misc socket type catches packets sent by the slice that do not fall into one of the other socket categories. Currently drop_pkts and drop_bytes are only used for safe raw sockets.

Chapter 3. Port Management

Overview

The Scout module manages all TCP and UDP ports and ICMP IDs to ensure that there are no collisions between safe raw sockets and TCP/UDP/ICMP sockets. For each IP address, every port is either free or "owned" by a slice. This means that two slices may split ownership of a port by binding it to different IP addresses. Currently only two IP addresses are supported: the external and loopback addresses. A port/IP address pair that is owned by one slice is unavailable to all other slices.

A slice can claim ownership of a port in two ways:
A slice that owns a port bound to the external IP address can open three sockets on that port. First, it can open one "consumer" socket. A consumer socket is a communication endpoint, and may be either a standard TCP/UDP socket or a safe raw socket (these sockets consume packets, in contrast to a "sniffer" socket). Second, it can open one ICMP error socket to receive ICMP Destination Unreachable messages on a TCP/UDP port. Third, it can open one sniffer socket. A current limitation of the module is that only one ICMP error socket and one sniffer socket are allowed per port.

A slice that owns a port bound to the loopback address can only open one standard TCP/UDP socket on that port. A TCP/UDP socket bound to INADDR_ANY binds to both the external and loopback IP addresses.

Relevant files:

/proc/scout/ports/summary: Summarizes port ownership/usage
/proc/scout/ports/reserve: Write to reserve a port
/proc/scout/ports/release: Write to release a port reservation

Summary File

The file /proc/scout/ports/summary shows the current status of all ports managed by the Scout module. For example:

[princeton8@planetlab-3] cat /proc/scout/ports/summary
prot   port    slice   types
tcp    33301   758     l
tcp    33301   759     c
icmp   11234   900     C
tcp    11234   900     CI
udp    11234   900     CI
tcp    12521   642     CSR
tcp    22      0       C
udp    123     0       C
tcp    80      0       C
tcp    79      0       C
udp    32768   99      C

The prot and port columns together identify the port. The slice column shows the owner of the port, and the types column shows the references that have been placed on the port. Note that in the example above, TCP port 33301 has split ownership. Values for types are:
The [C] reference implies [c] and [l], and so only one of [Ccl] will be present in the types column. The [I] and [S] references only apply to the external IP interface. In the above example, slice 758 has bound TCP port 33301 to the loopback IP address and slice 759 has bound the same port to the external address. Slice 759 will be able to open ICMP error and sniffer sockets on the port but slice 758 will not.

Reserving Ports

Ports can be reserved by writing to the file /proc/scout/ports/reserve. Only the Node Manager (i.e., root) can write to this file. Run 'cat' on this file to see the port reservation syntax:
[princeton8@planetlab-3] cat /proc/scout/ports/reserve
Write to this file to reserve a port
Format: [vserver id] u|t|i [port #]
For the second argument, u = udp, t = tcp, i = icmp
Example: 758 t 12345
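The reservation format shown above can be produced programmatically. The sketch below is one possible way to do so and is not part of the Scout distribution: format_port_reservation and write_port_request are hypothetical helper names, and terminating the line with a newline is an assumption, since the format description does not say whether one is required.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats "<slice id> <u|t|i> <port>" plus a newline (the newline is an
 * assumption) into out. Returns the string length, or -1 if the protocol
 * letter or port is invalid or the buffer is too small. */
static int format_port_reservation(char *out, size_t outlen,
                                   unsigned int slice_id, char proto,
                                   unsigned int port)
{
    int n;

    if (proto != 'u' && proto != 't' && proto != 'i')
        return -1;                  /* u = udp, t = tcp, i = icmp */
    if (port > 65535)
        return -1;                  /* ports and ICMP IDs are 16 bits */
    n = snprintf(out, outlen, "%u %c %u\n", slice_id, proto, port);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return n;
}

/* Writes one request line to the file at path.
 * Returns 0 on success, -1 on any error. */
static int write_port_request(const char *path, unsigned int slice_id,
                              char proto, unsigned int port)
{
    char line[64];
    FILE *fp;
    int ok;

    if (format_port_reservation(line, sizeof(line), slice_id, proto, port) < 0)
        return -1;
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;                  /* e.g. caller is not root */
    ok = (fputs(line, fp) != EOF);
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, write_port_request("/proc/scout/ports/reserve", 758, 't', 12345) would submit the line from the example above. Because only root can write to the file, the call is expected to fail for ordinary slice users.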
A reserved port is owned by the slice, and only this slice can open sockets on that port, bound to any IP address. The reservation takes effect immediately upon writing to /proc/scout/ports/reserve. If a port is already owned by a slice for any IP address, an attempt to reserve the port for another slice has no effect. A port reservation remains in effect until it is explicitly released. To remove a reservation, the Node Manager writes the same string used to reserve the port to /proc/scout/ports/release.

Chapter 4. Packet Tagging

The Scout module tags every outgoing packet with the ID of the sending slice, by placing the slice ID in the nfmark field of the sk_buff containing the packet. This tag is used by the Hierarchical Token Bucket traffic controller to assign the packet to the correct token bucket; currently, each slice with an ID of at least 500 has its own token bucket, and packets sent by other users end up in the "default" bucket. A limitation of this approach is that other modules that use the netfilter interface (e.g., iptables) could try to write to the nfmark field as well; currently, no such conflicts are known in PlanetLab.

Chapter 5. Notes

What's New

Version 2.0.5:
Files

planetlab.h - a useful header file.
plkmodutil-1.0.4-planetlab.i386.rpm - RPM with traceroute and ping using safe raw sockets. The normal versions of these programs will not work in vservers. Also contains plabdump, a wrapper for tcpdump that can be used to observe traffic on a TCP or UDP port.
plkmodutil-1.0.4.tgz - The same, as a tarball.