Scout Module API

PlanetLab Team

The Scout kernel module gives non-privileged users access to a restricted form of raw IP datagram sockets on PlanetLab. It also tracks per-slice network usage and allows ports on a node to be reserved for the exclusive use of a particular slice. This document describes the safe raw socket interface, as well as the accounting and port management features of the module.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.



Chapter 1. Safe Raw Sockets


Overview

Header file: planetlab.h

Safe raw sockets are used to access raw network data, including IP, ICMP, UDP and TCP headers, while enforcing protection between different users (slices). No user is able to interfere with others by sending or receiving data on ports that have been registered to, or are currently being used by, other users. Sending data from an unregistered port or non-local IP address is also not allowed. Access to safe raw sockets does not require super-user privileges or the corresponding Linux capabilities.


Socket API

The safe raw socket API uses the standard Linux socket API with some minor semantic differences. Just as in standard Linux, first the socket must be created with the socket system call. To create a safe raw socket, the domain of the call must be set to PF_INET, the type to SOCK_RAW, and the protocol can be one of the following:

  • IPPROTO_TCP or IPPROTO_UDP for a socket that will be used to send/receive TCP or UDP packets

  • IPPROTO_ICMP for an ICMP socket that sends Echo Request packets and receives Echo Reply

  • IPPROTO_ICMP_TCP or IPPROTO_ICMP_UDP for an ICMP socket that can receive ICMP Destination Unreachable messages for a particular TCP or UDP port

The following example creates a TCP safe raw socket:

    sock = socket(PF_INET, SOCK_RAW, IPPROTO_TCP);

Once the socket is created, it is necessary to bind it to a particular local port of the specified protocol (or identifier in the case of an ICMP socket). The standard Linux bind system call is used. To bind the socket created above to local TCP port 9090:


    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(9090);

    bind(sock, (struct sockaddr *)&sin, sizeof(sin));

After the socket has been bound to a local port, it is ready to send and receive data. The usual sendto, sendmsg, recv, recvfrom, recvmsg and select calls can be used (note that the send call is not supported for raw sockets). Packets received on a safe raw socket include the IP and TCP/UDP/ICMP headers, but not the link-layer header. By default, packets sent on a raw socket include the TCP/UDP/ICMP header but not the IP header. To send packets that include the IP header, the IP_HDRINCL socket option must be set on the socket. This call will succeed only after a successful bind:


    int tmp = 1;

    /* IPPROTO_IP is the option level for IP_HDRINCL (its value is 0) */
    setsockopt(sock, IPPROTO_IP, IP_HDRINCL, &tmp, sizeof(tmp));

To close a safe raw socket, use the usual close system call.


ICMP Sockets

ICMP packets can be sent and received through safe raw ICMP sockets. To protect users from interference, each ICMP socket is allowed to send and receive only packets of the registered type bound to the socket. As with standard sockets, the bind system call specifies which packets are to be received and sent through a socket. To receive ICMP error messages associated with a specific local TCP/UDP port (e.g., Destination Unreachable, Source Quench, Redirect, Time Exceeded, Parameter Problem), the ICMP socket needs to be bound to that port.


    #include <planetlab.h>

    struct sockaddr_in sin;

    sock = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP_UDP);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(9090);

    bind(sock, (struct sockaddr *)&sin, sizeof(sin));

For example, the above code fragment creates and binds the ICMP socket to local UDP port 9090. Only ICMP error messages associated with local UDP port 9090 can be received through this socket. This type of ICMP socket is read-only.

To send and receive ICMP messages that are not associated with a specific TCP/UDP port number (e.g., Echo, Echo Reply, Timestamp, Timestamp Reply, Information Request, Information Reply), the socket has to be bound to a specific ICMP identifier. The ICMP identifier is a 16-bit field occupying bytes 5 and 6 of the header of these messages. Only messages containing the right identifier can be sent or received through a safe raw ICMP socket of this type.


    struct sockaddr_in sin;

    sock = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(23456);

    bind(sock, (struct sockaddr *)&sin, sizeof(sin));

For example, the above code fragment creates and binds an ICMP socket to identifier 23456. Only ICMP messages with this identifier and of the proper type can be sent and received through this socket.
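
As an illustration of where the identifier sits in an outgoing message, the following sketch builds an ICMP Echo Request whose identifier matches the one the socket was bound to. The helper names (inet_checksum, build_echo_request) are hypothetical and not part of planetlab.h, and the sketch assumes the module compares the identifier in network byte order, matching the htons() used in the bind call above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_ECHO_REQUEST 8
#define ICMP_HEADER_LEN   8

/* Standard Internet checksum (RFC 1071) over len bytes. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Build an ICMP Echo Request (header plus payload) in buf.
 * Returns the message length, or 0 if buf is too small.
 */
static size_t build_echo_request(uint8_t *buf, size_t buf_len,
                                 uint16_t id, uint16_t seq,
                                 const void *payload, size_t payload_len)
{
    size_t total = ICMP_HEADER_LEN + payload_len;
    uint16_t sum;

    if (buf_len < total)
        return 0;

    buf[0] = ICMP_ECHO_REQUEST;     /* type */
    buf[1] = 0;                     /* code */
    buf[2] = buf[3] = 0;            /* checksum, filled in below */
    buf[4] = (uint8_t)(id >> 8);    /* identifier: bytes 5 and 6 */
    buf[5] = (uint8_t)(id & 0xff);
    buf[6] = (uint8_t)(seq >> 8);   /* sequence number */
    buf[7] = (uint8_t)(seq & 0xff);
    if (payload_len > 0)
        memcpy(buf + ICMP_HEADER_LEN, payload, payload_len);

    sum = inet_checksum(buf, total);
    buf[2] = (uint8_t)(sum >> 8);
    buf[3] = (uint8_t)(sum & 0xff);
    return total;
}
```

Because packets sent without IP_HDRINCL carry the ICMP header but no IP header, a buffer built this way can be passed directly to sendto on the socket bound to identifier 23456.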

No two users are allowed to bind an ICMP socket to the same local UDP/TCP port, or to the same identifier. See Chapter 3, Port Management, for more information.


Restrictions on Sent Packets

Packets sent on a safe raw socket will be rejected if any of the following is true:

If IP_HDRINCL was set on the socket:

  • The source IP address of the packet is a non-local address. (If the source address is left blank, then the kernel will fill it in.)

  • The protocol in the IP header is not the protocol specified when the socket was created

If the safe raw socket is a TCP/UDP socket:

  • The source port in the TCP/UDP header is not the same local port to which the socket was bound

If the safe raw socket is an ICMP socket:

  • The socket was created with a protocol of IPPROTO_ICMP_TCP or IPPROTO_ICMP_UDP

  • The identifier in the ICMP header is not the same one to which the socket was bound
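
An application can apply the same rules to a packet before handing it to sendto, which makes rejected sends easier to diagnose. The sketch below covers the IP_HDRINCL and TCP/UDP cases only. It is a user-space approximation, not the module's own code: the function and enum names are hypothetical, and it assumes the node has a single local address, passed in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

enum send_check {
    SEND_OK = 0,
    SEND_TOO_SHORT,
    SEND_BAD_SOURCE_ADDR,
    SEND_BAD_PROTOCOL,
    SEND_BAD_SOURCE_PORT
};

/*
 * pkt points at an IPv4 header followed by a TCP or UDP header, as sent
 * on a socket with IP_HDRINCL set. sock_protocol is the protocol given to
 * socket(), local_addr and bound_port are in host byte order.
 */
static enum send_check check_outgoing(const uint8_t *pkt, size_t len,
                                      uint8_t sock_protocol,
                                      uint32_t local_addr,
                                      uint16_t bound_port)
{
    size_t ihl;
    uint32_t saddr;
    uint16_t sport;

    if (len < 20)
        return SEND_TOO_SHORT;
    ihl = (size_t)(pkt[0] & 0x0f) * 4;      /* IP header length in bytes */
    if (ihl < 20 || len < ihl + 2)
        return SEND_TOO_SHORT;

    saddr = ((uint32_t)pkt[12] << 24) | ((uint32_t)pkt[13] << 16) |
            ((uint32_t)pkt[14] << 8) | (uint32_t)pkt[15];
    /* A blank source address is allowed: the kernel fills it in. */
    if (saddr != 0 && saddr != local_addr)
        return SEND_BAD_SOURCE_ADDR;

    /* The IP protocol field must match the socket's protocol. */
    if (pkt[9] != sock_protocol)
        return SEND_BAD_PROTOCOL;

    /* TCP and UDP both carry the source port in their first two bytes. */
    sport = (uint16_t)((pkt[ihl] << 8) | pkt[ihl + 1]);
    if (sport != bound_port)
        return SEND_BAD_SOURCE_PORT;

    return SEND_OK;
}
```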


Sniffer Sockets

A slice can bind a raw "sniffer" socket to a port owned by the slice to snoop IP datagrams sent and received on that port. Sniffer sockets are read-only and do not interfere with traffic. Creating a sniffer socket on a free port makes the slice the owner of that port, meaning no other slice can bind a socket to that port. A current limitation is that only one sniffer socket can be created per port.

To create a sniffer socket, it is necessary to call setsockopt on the socket before binding it to the port. Example:


    #include <planetlab.h>

    int tmp, sock;
    struct sockaddr_in sin;

    sock = socket(PF_INET, SOCK_RAW, IPPROTO_TCP);

    /* Mark the socket as a sniffer before binding it */
    tmp = 1;
    setsockopt(sock, 0, SO_RAW_SNIFF, &tmp, sizeof(tmp));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(1234);

    bind(sock, (struct sockaddr *)&sin, sizeof(sin));
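
Datagrams read from a safe raw socket begin with the IP header, so a program reading from the TCP sniffer socket above has to step over that header to reach the TCP ports. The following sketch shows one way to do so; the struct and function names are hypothetical, and the code assumes each buffer returned by recvfrom holds exactly one IPv4 datagram.

```c
#include <stddef.h>
#include <stdint.h>

struct tcp_ports {
    uint16_t src;   /* source port, host byte order */
    uint16_t dst;   /* destination port, host byte order */
};

/*
 * Extract the TCP ports from a datagram that starts with an IPv4 header.
 * Returns 0 on success, -1 if the buffer is not a usable IPv4/TCP packet.
 */
static int parse_tcp_ports(const uint8_t *pkt, size_t len,
                           struct tcp_ports *out)
{
    size_t ihl;

    if (len < 20 || (pkt[0] >> 4) != 4)     /* too short or not IPv4 */
        return -1;
    ihl = (size_t)(pkt[0] & 0x0f) * 4;      /* IP header length in bytes */
    if (ihl < 20 || len < ihl + 4)
        return -1;
    if (pkt[9] != 6)                        /* protocol 6 is TCP */
        return -1;

    out->src = (uint16_t)((pkt[ihl] << 8) | pkt[ihl + 1]);
    out->dst = (uint16_t)((pkt[ihl + 2] << 8) | pkt[ihl + 3]);
    return 0;
}
```

Since a sniffer socket sees datagrams both sent and received on its port, comparing src and dst against the bound port (1234 above) distinguishes the two directions.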

Examples

Example 1-1. Creating a safe raw socket


    /* protocol = IPPROTO_UDP, IPPROTO_TCP, IPPROTO_ICMP, IPPROTO_ICMP_TCP,
     *   or IPPROTO_ICMP_UDP */

    if ((sock = socket(PF_INET, SOCK_RAW, protocol)) < 0) {
	perror("socket");
	exit(1);
    }

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(local_port);

    if (bind(sock, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
	perror("bind");
	exit(1);
    }

Example 1-2. PING-PONG program using Linux or safe raw sockets


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* bzero */
#include <unistd.h>    /* close */

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/ip.h>
#include <linux/udp.h>

#define BUFFER_SIZE 1500
#define PING 1
#define PONG 2

int
main(int argc, char * argv[])
{
    int sock;
    struct sockaddr_in sin;
    unsigned short local_port;
    unsigned short remote_port;
    unsigned char protocol;
    char * buffer;
    struct iphdr * ip_header;
    struct udphdr * udp_header;
    char * remote_ip_str;
    unsigned char ping = 0;
    unsigned int * count = 0;
    unsigned int this_count = 0;
    int semantics = 0;
    int linux_socket = 0;
    unsigned short buffer_size = 0;
    int tmp;
    socklen_t len;

    if (argc < 4 || argc > 6) {
	fprintf(stderr, "USAGE: %s remote_ip local_port remote_port"
		" [PING|PONG] [LINUX]\n", argv[0]);
	return 1;
    }

    protocol = IPPROTO_UDP;

    remote_ip_str = argv[1];
    local_port = atoi(argv[2]);
    remote_port = atoi(argv[3]);

    if (argc >= 5) {
	if (strncmp(argv[4], "PONG", 4) == 0) {
	    ping = 0;
	}
	else if (strncmp(argv[4], "PING", 4) == 0) {
	    ping = 1;
	}
	else {
	    fprintf(stderr, "Expected PING or PONG.\n");
	    return 1;
	}
    }

    if (argc == 6) {
	if (strncmp(argv[5], "LINUX", 5) == 0) {
	    linux_socket = 1;
	}
	else {
	    linux_socket = 0;
	}
    }

    printf("Remote IP %s, local port %d, remote port %d%s%s\n",
	   remote_ip_str, local_port, remote_port,
	   ping ? ", PING" : ", PONG",
	   linux_socket ? ", LINUX_SOCKET" : "");
	 
    if (linux_socket) {
	semantics = SOCK_DGRAM;
    } else {
	semantics = SOCK_RAW;
    }

    if ((sock = socket(PF_INET, semantics, protocol)) < 0) { 
	perror("socket");
	exit(1);
    }

    bzero((char *)& sin, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(local_port);

    if ((bind(sock, (struct sockaddr *)& sin, sizeof(sin))) < 0) {
	perror("bind");
	exit(1);
    }

    if (! linux_socket) {
	tmp = 1;
	setsockopt(sock, 0, IP_HDRINCL, & tmp, sizeof(tmp));
    }

    bzero((char *)& sin, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(remote_port);
    sin.sin_addr.s_addr = inet_addr(remote_ip_str);
  
    buffer_size = BUFFER_SIZE 
	- (linux_socket ? 
	   (sizeof (struct iphdr) + sizeof (struct udphdr)) 
	   : 0);

    printf("buff %d, %d\n", buffer_size, BUFFER_SIZE);

    buffer = (char *) malloc(buffer_size);
    if (buffer == NULL) {
	perror("malloc");
	exit(1);
    }

    while ( 1 )
    {
	if (!linux_socket)
	{
	    ip_header = (struct iphdr *) buffer;
	    ip_header->ihl = 5;
	    ip_header->version = 4;
	    ip_header->tos = 0;
	    ip_header->tot_len = htons(buffer_size);
	    ip_header->id = 0;
	    ip_header->ttl = 64;
	    ip_header->frag_off = htons(0x4000); /* Don't Fragment flag */
	    ip_header->protocol = protocol;
	    ip_header->check = 0; /* This will be done in the kernel */
	    ip_header->daddr = inet_addr(remote_ip_str);
            /* Leave src IP address blank, kernel will fill it out. */
	    ip_header->saddr = 0; 

	    udp_header = (struct udphdr *) (ip_header + 1);

	    udp_header->source = htons(local_port);
	    udp_header->dest = htons(remote_port);
	    udp_header->len = htons(buffer_size - sizeof(struct iphdr));
	    udp_header->check = 0;
	}

	if (ping)
	{
	    if (linux_socket) {
		count = (unsigned int *) buffer;
	    } else {
		count = (unsigned int *) (udp_header + 1);
	    }
	    * count = this_count ++;
	    if (! (this_count % 1)) {
		printf("%d\n", this_count);
	    }

	    if (sendto(sock, buffer, buffer_size, 0, 
		       (struct sockaddr *) & sin, sizeof(sin)) < 0) {
		perror("sendto");
	    }

	}

	ping = 1;

	len = sizeof(sin);
	if (recvfrom(sock, buffer, buffer_size, 0, 
		     (struct sockaddr *) & sin, & len) < 0) {
	    perror("recvfrom");
	    return 1;
	}
    }

    close(sock);

    return 0;
}



Chapter 2. Accounting


Overview

The Scout module tracks per-slice network usage information and reports this information in /proc/scout/accounts/. Relevant files:


/proc/scout/accounts/summary:     Summarizes network usage for all slices
/proc/scout/accounts/[slice id]:  Detailed per-slice network accounting


Summary File

File /proc/scout/accounts/summary provides an overview of usage per slice. An example:


  [princeton8@planetlab-3] cat /proc/scout/accounts/summary 
  slice    sent       recvd      sockcnt
  735      2612       406        0
  906      101173     769796     0
  630      128152     31494      0
  816      87215779   63074403   0
  38       19952      14616      0
  752      20715441   32329457   0
  74       309936     210930     0
  642      1156702    54851      1
  900      2329907    6298539    7
  28       14112      33401      0
  99       1926602    14897404   5
  0        1044464    15327866   7
  unknown  5962227    2771272    0

The slice column identifies the slice, and the sent and recvd columns show how many bytes have been sent/received by the slice since the machine was booted. The sockcnt column shows how many sockets the slice currently has open. Loopback sockets are counted in the sockcnt column, but packets sent to the loopback address are not charged against the sent and recvd columns.

The unknown slice ID is a catch-all category for packets that cannot be matched to a slice. An outgoing packet may be charged to unknown if it is sent by a socket not managed by the Scout module, for instance, one internal to the kernel (e.g., an ICMP socket that sends Echo Reply packets) or one created during the boot process before the Scout module was loaded. Likewise, an incoming packet is charged to unknown if it does not demux to a socket managed by the Scout module.
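
A monitoring tool can read the summary file one line at a time. The sketch below parses a single line into a struct, treating the unknown row as a special case and rejecting the header row. The struct and function names are hypothetical, and the code assumes only what the example output shows: four whitespace-separated columns, with byte counters that fit in an unsigned long long.

```c
#include <stdio.h>
#include <string.h>

struct slice_usage {
    int known;                  /* 0 for the "unknown" catch-all row */
    unsigned int slice;         /* slice ID, valid when known != 0 */
    unsigned long long sent;    /* bytes sent since boot */
    unsigned long long recvd;   /* bytes received since boot */
    unsigned int sockcnt;       /* sockets currently open */
};

/*
 * Parse one line of /proc/scout/accounts/summary.
 * Returns 0 on success, -1 for the header row or a malformed line.
 */
static int parse_summary_line(const char *line, struct slice_usage *out)
{
    char id[32];

    if (sscanf(line, "%31s %llu %llu %u", id,
               &out->sent, &out->recvd, &out->sockcnt) != 4)
        return -1;

    if (strcmp(id, "unknown") == 0) {
        out->known = 0;
        out->slice = 0;
        return 0;
    }
    if (sscanf(id, "%u", &out->slice) != 1)
        return -1;
    out->known = 1;
    return 0;
}
```

A caller would typically open the file with fopen, pass each line from fgets to parse_summary_line, and skip lines for which it returns -1.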


Per-Slice Information

More detailed traffic breakdowns for each slice can be found in /proc/scout/accounts/[slice id]. Each row in the table gives the counts for a particular socket type:

  • tcp: standard TCP sockets

  • udp: standard UDP sockets

  • raw_tcp: safe raw TCP sockets

  • raw_udp: safe raw UDP sockets

  • raw_icmp: safe raw ICMP sockets

  • misc: unclassified sockets

  • loopback: all loopback sockets for any protocol

Each column in the table lists counts for that socket type. Counts maintained are:

  • sent_pkts: packets sent by the slice

  • rcvd_pkts: packets received by the slice

  • drop_pkts: packets destined for the slice but discarded by the Scout module (currently only raw sockets)

  • sent_bytes: bytes sent by the slice

  • rcvd_bytes: bytes received by the slice

  • drop_bytes: byte count of packets discarded by the Scout module (currently only raw sockets)

  • opened_sock: number of sockets that the slice has opened

  • closed_sock: number of sockets that the slice has closed

For instance, in the sent_pkts column, the raw_udp row counts the packets sent by the slice on safe raw sockets bound to UDP ports, and the udp row counts the packets sent on standard UDP sockets. Together they reflect the total number of UDP packets sent by the slice. The misc socket type catches packets sent by the slice that do not fall into one of the other socket categories. Currently drop_pkts and drop_bytes are only used for safe raw sockets.


Chapter 3. Port Management


Overview

The Scout module manages all TCP and UDP ports and ICMP IDs to ensure that there are no collisions between safe raw sockets and TCP/UDP/ICMP sockets. For each IP address, all ports are either free or "owned" by a slice. This means that two slices may split ownership of a port by binding it to different IP addresses. Right now only two IP addresses are supported: the external and loopback addresses. A port/IP address pair that is owned by one slice is unavailable to all other slices. A slice can claim ownership of a port in two ways:

  1. It can bind a socket to that port and IP address

  2. It can reserve the port. This means that only this slice can bind the port to any IP address.

A slice that owns a port bound to the external IP address can open three sockets on that port. First, it can open one "consumer" socket. A consumer socket is a communication endpoint, and may be either a standard TCP/UDP socket or a safe raw socket (these sockets consume packets, in contrast to a "sniffer" socket). Second, it can open one ICMP error socket to receive ICMP Destination Unreachable messages on a TCP/UDP port. Third, it can open one sniffer socket. A current limitation of the module is that only one ICMP error socket and one sniffer socket are allowed per port. A slice that owns a port bound to the loopback address can open only one standard TCP/UDP socket on that port. A TCP/UDP socket bound to INADDR_ANY binds to both the external and loopback IP addresses.

Relevant files:


/proc/scout/ports/summary:  Summarizes port ownership/usage
/proc/scout/ports/reserve:  Write to reserve a port
/proc/scout/ports/release:  Write to release a port reservation


Summary File

The file /proc/scout/ports/summary shows the current status of all ports managed by the Scout module. For example:


  [princeton8@planetlab-3] cat /proc/scout/ports/summary 
  prot    port    slice   types
  tcp     33301   758     l
  tcp     33301   759     c
  icmp    11234   900     C
  tcp     11234   900     CI
  udp     11234   900     CI
  tcp     12521   642     CSR
  tcp     22      0       C
  udp     123     0       C
  tcp     80      0       C
  tcp     79      0       C
  udp     32768   99      C

The prot and port columns together identify the port. The slice column shows the owner of the port, and the types column shows the references that have been placed on the port. Note that in the example above, TCP port 33301 has split ownership. Values for types are:

  • [C]onsumer: This means the slice has bound a socket (either a standard TCP/UDP socket or a safe raw socket) on this port to both the external and loopback IP address

  • [c]onsumer: This means the slice has bound a socket (either a standard TCP/UDP socket or a safe raw socket) on this port to only the external IP address

  • [l]oopback: This means the slice has bound a TCP/UDP socket on this port to only the loopback IP address.

  • [I]CMP errors: The slice has opened a safe raw socket to receive ICMP Destination Unreachable messages for the corresponding TCP/UDP port

  • [R]eserved: The port has been reserved for this slice for all IP addresses (currently there is no way to reserve only the external or loopback IP address for a slice)

  • [S]niffer: A sniffer socket is open on this port

The [C] reference implies [c] and [l], and so only one of [Ccl] will be present in the types column. The [I] and [S] references only apply to the external IP interface. In the above example, slice 758 has bound TCP port 33301 to the loopback IP address and 759 has bound the same port to the external address. Slice 759 will be able to open ICMP error and sniffer sockets on the port but slice 758 will not.
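
The letters in the types column map naturally onto a set of flags. The sketch below decodes a types string such as "CSR" into a bitmask, expanding [C] into its external and loopback parts as described above. The flag and function names are hypothetical and are not defined by the module.

```c
enum {
    PORT_EXTERNAL = 1 << 0,  /* consumer bound to the external address */
    PORT_LOOPBACK = 1 << 1,  /* socket bound to the loopback address */
    PORT_ICMP_ERR = 1 << 2,  /* ICMP error socket open */
    PORT_RESERVED = 1 << 3,  /* port reserved for the slice */
    PORT_SNIFFER  = 1 << 4   /* sniffer socket open */
};

/*
 * Decode the "types" column of /proc/scout/ports/summary.
 * Returns a bitmask of PORT_* flags, or -1 on an unrecognized letter.
 */
static int decode_port_types(const char *types)
{
    int flags = 0;

    for (; *types != '\0'; types++) {
        switch (*types) {
        case 'C': flags |= PORT_EXTERNAL | PORT_LOOPBACK; break;
        case 'c': flags |= PORT_EXTERNAL; break;
        case 'l': flags |= PORT_LOOPBACK; break;
        case 'I': flags |= PORT_ICMP_ERR; break;
        case 'R': flags |= PORT_RESERVED; break;
        case 'S': flags |= PORT_SNIFFER; break;
        default:  return -1;
        }
    }
    return flags;
}
```

With this encoding, the row for TCP port 12521 in the example (types CSR) decodes to the external, loopback, sniffer and reserved flags, while the two rows for TCP port 33301 decode to disjoint flags, which is what split ownership looks like.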


Reserving Ports

Ports can be reserved by writing to file /proc/scout/ports/reserve. Only the Node Manager (i.e., root) can write to this file. Run 'cat' on this file to see the port reservation syntax:


  [princeton8@planetlab-3] cat /proc/scout/ports/reserve 
  Write to this file to reserve a port
  Format: [vserver id] u|t|i [port #]
    For the second argument, u = udp, t = tcp, i = icmp
  Example: 758 t 12345

A reserved port is owned by the slice, and only this slice can open sockets on that port bound to any IP address. The reservation takes place immediately upon writing to /proc/scout/ports/reserve. If a port is already owned by a slice for any IP address, an attempt to reserve the port for another slice has no effect.

A port reservation remains in effect until it is explicitly released. To remove a reservation, the Node Manager writes the same string used to reserve the port to /proc/scout/ports/release.
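
A program running as the Node Manager could build the request string and write it to either file as sketched below. Both function names are hypothetical. The string follows the Format and Example lines printed by the reserve file; whether the module expects a trailing newline is not stated, so this sketch writes the string exactly as in the example.

```c
#include <stdio.h>

/*
 * Format a request such as "758 t 12345" into buf.
 * proto is 'u' (udp), 't' (tcp) or 'i' (icmp).
 * Returns the string length, or -1 on bad arguments or a short buffer.
 */
static int format_reservation(char *buf, size_t size, unsigned int slice,
                              char proto, unsigned int port)
{
    int n;

    if (proto != 'u' && proto != 't' && proto != 'i')
        return -1;
    if (port > 65535)
        return -1;
    n = snprintf(buf, size, "%u %c %u", slice, proto, port);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}

/*
 * Write a request to /proc/scout/ports/reserve or
 * /proc/scout/ports/release. Returns 0 on success, -1 on error.
 */
static int write_port_request(const char *path, const char *request)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fputs(request, f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Because a reservation is released by writing the same string that created it, the output of one format_reservation call can be passed to write_port_request for both the reserve and the release file.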


Chapter 4. Packet Tagging

The Scout module tags every outgoing packet with the ID of the sending slice, by placing the slice ID in the nfmark field of the sk_buff containing the packet. This tag is used by the Hierarchical Token Bucket traffic controller to assign the packet to the correct token bucket; currently, each slice with an ID of at least 500 has its own token bucket, and packets sent by other users end up in the "default" bucket. A limitation of this approach is that other modules that use the netfilter interface (e.g., iptables) could try to write to the nfmark field as well; currently, no such conflicts are known in PlanetLab.


Chapter 5. Notes


What's New

Version 2.0.5:

  • Fixed a kernel Oops triggered by tasks exiting without the CPU scheduler being aware of it.

  • Fixed a bug where the system would lock up due to a process maintaining RT priority (used by SILK to control the Linux scheduler) after its slice was removed.

Version 2.0.4:

  • Fixed a bug triggered when assigning more shares than are available to a slice.

Version 2.0.3:

  • Fixed a race condition in the CPU scheduler that caused some processes to never be scheduled even though they are runnable.

  • Multiple slices can bind to the same port on different IP addresses. For example, one slice can bind to port 53 on the loopback address (127.0.0.1) while another binds to the external IP address.

  • Restored per-slice packet counts, in a single table in file /proc/scout/accounts/[slice id]/.

  • Added the ability to remove packet counts for an expired slice by writing the slice id to /proc/scout/accounts/remove.

Version 2.0.2:

  • Removed per-slice packet counts in /proc/scout/accounts/[slice id]/*. A bug in SILK caused a kernel Oops when the system-wide limit on the number of /proc files was reached.

  • Fixed problem with assert() that causes a kernel Oops.

Version 2.0.1:

  • Bug fix to improve performance seen by users running large numbers of Java threads.

Version 2.0.0:

  • Proportional share CPU scheduling can provide resource isolation to individual slices. More information on the API can be found here. Note that individual users cannot assign or change shares. Rather, this is a low-level mechanism to support ongoing research in resource economies on PlanetLab.

  • Packets sent and received on a loopback socket are now counted per slice, in files:

    /proc/scout/accounts/slice_id/loopback_*

    Loopback traffic does not appear in the per-slice totals in /proc/scout/accounts/summary.

  • Fixed a bug where some TCP control packets were not assigned to the correct user.

  • Fixed a bug where state was not correctly freed on a bind() error, resulting in "pathAddKey: failed!" messages in the log.

  • Vserver root disallowed from writing to port and CPU reservation files in /proc/scout.


Files

planetlab.h - a useful header file.

plkmodutil-1.0.4-planetlab.i386.rpm - RPM with traceroute and ping using safe raw sockets. The normal versions of these programs will not work in vservers. Also contains plabdump, a wrapper for tcpdump that can be used to observe traffic on a TCP or UDP port.

plkmodutil-1.0.4.tgz - The same, as a tarball.


Known Problems