Managing VMs like a Data Scientist

9 min read · Apr 16, 2019

Managing virtual machines (VMs) as a data scientist can be tedious. If you are like me and work in a team that is not fortunate enough to have a data engineer cleaning, prepping and giving you your data on a plate with some garnish on the side, then you have to manage, extract and manipulate files sitting on various VMs. Logging into each of these VMs to see if all the necessary files were dumped, all the necessary packages were installed and all the cron jobs executed on time can be a time-consuming, inefficient and downright laborious task.

Luckily, Python comes to the rescue with a package called paramiko. This post explains how you can wrap your VMs in a DataFrame and execute the same command on all of them, saving the returned output in a pandas DataFrame.

The code for this post can be found on this git repo. Although this post relates to managing VMs — the underlying hack applied here is to use your current knowledge of DataFrames, with all their great functionalities that we all have come to know and love, and combine Python Classes to abstract and make inefficient tasks more efficient.

Background

Below is some background for those stumbling onto this post with no clue about any of the topics. Feel free to skip the sections you know a lot about; the snippets give only a high-level overview to ensure the post makes sense as a whole.

VMs

As I work at a corporate and want to avoid disclosing its IPs, I’ve opted to spin up 4 VMs on Google Cloud Platform (GCP), but the methodology for any VM on any domain is the same. If you are using GCP and you have the gcloud sdk installed, spinning up a GCP VM via the command line can be achieved with a one-liner, as shown below.

gcloud compute instances create vm1 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm2 --custom-cpu 2 --custom-memory 2
gcloud compute instances create vm3 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm4 --custom-cpu 2 --custom-memory 2

The above bash commands use the gcloud sdk to spin up 4 VMs named vm1, vm2, vm3 and vm4, with either 1 vCPU and 1 GB of RAM or 2 vCPUs and 2 GB of RAM. If you log into your GCP console, you’ll see the VMs created, each with its own public IP address that we’ll be using to ssh into.

SSH

If you’ve never worked in a shell before, you should give it a bash… For those unfamiliar with what the shell even is, it’s the screen that looks like the matrix that all the techies use at work; I show an example below.

Example of logging into a remote server using ssh in a Bash shell.

Similar to Windows 10 or macOS, the shell is just another way to interact with hardware and comes preinstalled with a plethora of programs like ls, cp, lscpu, top and of course ssh. SSH, short for Secure Shell, comes installed with practically every Unix-like system, such as macOS or Linux (Ubuntu, Red Hat, Debian). The ssh program runs in a shell and starts an SSH client that opens a secure connection to an SSH server on a remote machine.

The ssh command can be used to log into a remote machine or to execute commands on it, and companion tools such as scp transfer files between two machines over the same protocol. To log into a VM, all you need is the IP address of the VM, a username and, depending on how your user is configured, either a set of ssh-keys or a password. The syntax for the ssh command for the user root to log into a VM with IP ip then looks as follows:

ssh root@ip

In our example, for vm1 with IP 35.204.226.178 sitting on GCP this would translate into:

ssh louwjlabuschagne_gmail_com@35.204.226.178

Running this command in a shell logs us into vm1 and the shell we are working in will now not be a local shell anymore, but rather, a remote shell logged into the remote VM identified by public IP 35.204.226.178.

louwjlabuschagne_gmail_com@vm1:~$
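Because ssh can also run a single command on the remote machine and then return, the login above can be scripted from Python. Below is a minimal sketch, assuming the OpenSSH client is on your PATH and key-based login already works. The helper names build_ssh_command and run_remote are hypothetical, made up for this illustration.

```python
import subprocess


def build_ssh_command(username, ip, remote_command):
    """Return the argument list for: ssh username@ip remote_command."""
    return ["ssh", f"{username}@{ip}", remote_command]


def run_remote(username, ip, remote_command, timeout=30):
    """Run one command on the remote machine and return its stdout as text.

    Needs network access to the VM and a working key-based login.
    """
    completed = subprocess.run(
        build_ssh_command(username, ip, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


# Only the argument list is built here; nothing runs until run_remote() is called.
print(build_ssh_command("louwjlabuschagne_gmail_com", "35.204.226.178", "lscpu"))
```

Passing the command as a list rather than one long string avoids an extra layer of local shell quoting.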

Python Classes

Python is an object oriented programming (OOP) language. Almost everything in Python is an object, with its properties and methods.

A Class is like an object constructor, or a “blueprint” for creating objects. Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. A toy example is shown below where we create a Person class with 2 attributes: name and age.

class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

print(p1.name)
print(p1.age)

What makes the object-oriented paradigm amazing is that we only have to write a great blueprint (Class) once, then we can reuse the hard work we’ve done again and again. For our Person class above, we might want to construct a list containing many people. An example of this is shown below:

people = [Person("Jane", 29), Person("John", 36), Person("Blake", 10)]

As a side note, Class names in Python conventionally use the CapWords style (each word capitalised) as per PEP 8, but I digress. We can now iterate over the people list and access the class attributes for each Person in the list. For example, if we wanted to print out the names of all the people in the list, we can do the following:

for person in people:
    print(person.name)

Isn’t that nice and modular? Using Classes to abstract away complicated logic from the end user is a critical pillar in OOP and a great mindset to adopt to write scalable, reproducible and maintainable code.

If you ever see a function in Python starting and ending with two underscores (__), like the __init__() function above, know that these functions are “special”. The __init__() function is usually called a method rather than a function simply because it is defined inside a class, but that is just nomenclature. The __init__() method is used to initialise a new object of the class and is sometimes also called the constructor method.

Similarly to the “special” __init__() method in our Class, there is another “special” method, __str__(), which returns the friendly string representation of the object. The __str__() method is called by Python when you print an object using the print() function. For example, say we code up print(Person('Jane', 29)): what should print? Just the name, just the age, or both? The __str__() method tells Python what it should print.
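To make that concrete, here is a small sketch that extends the toy Person class with a __str__() method. The "name (age)" format is just one arbitrary choice for the illustration.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        # Decide what print() shows for a Person: here, both attributes.
        return f"{self.name} ({self.age})"


# print() calls str() on the object, which in turn calls __str__().
print(Person("Jane", 29))
```

Without the __str__() method, the same print() call would fall back to Python’s default representation, which shows the class name and a memory address.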

Pandas

Ok, the theory is almost done. Just one more topic — Pandas.

Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating tables of data called DataFrames. DataFrames are a data scientist’s bread and butter and are most likely the most used data type (Class) in the Pandas package. For this post you need only know two things about the DataFrame Class, viz.

what is a DataFrame (just a table) and,
what does the apply() function do to a column of a DataFrame.

I show an example of the Iris dataset as a DataFrame below. To my dismay, pandas has no built-in datasets, so I’ve consulted another essential data science package: seaborn. Go check it out if you can.

import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

So what is a DataFrame? It’s just a table — that’s it. However, it’s got some pretty cool built-in methods to make your data manipulation, interrogation and cleaning a much, much more pleasurable experience.

If we run the code below, which calls the apply() method on the sepal_length column, we get the output shown in the table. I hope the functionality of the apply() method is clear from the example… If not, stare at it a bit, then read on.

iris.sepal_length.apply(lambda row: 'tall' if row >= 5 else 'short')

There is a weird lambda keyword thrown into the example, which in short is just a “phantom” function, formally called an anonymous function. Basically it is a function that doesn’t have a name but runs some code. In our example, this anonymous lambda function checks each value in our column, and if the sepal length is greater than or equal to 5 it returns tall, otherwise it returns short.
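If the lambda still looks odd, it may help to see it next to an ordinary named function doing the same job. The sketch below uses a tiny hand-made Series instead of the real iris data (the four values are made up for the illustration), and label_length is a hypothetical name.

```python
import pandas as pd

# A small stand-in for the iris sepal_length column (values made up).
sepal_length = pd.Series([5.1, 4.9, 4.7, 5.0], name="sepal_length")


def label_length(value):
    """Named equivalent of the lambda: 'tall' at 5 or above, else 'short'."""
    return "tall" if value >= 5 else "short"


# apply() calls the function once per value in the column.
with_lambda = sepal_length.apply(lambda value: "tall" if value >= 5 else "short")
with_named = sepal_length.apply(label_length)

print(with_named.tolist())
print(with_lambda.equals(with_named))
```

Both calls give the same result; the lambda simply saves you from naming a function you only need once.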

Bring it all together

I hear you saying: “OK cool Louwrens, nice background, but so what?” Well, we’ve covered all the theory needed to understand what is about to happen, which is:

Create a VM Class which gets initialised with an IP and username,
the init method then checks if we can connect to the remote VM and uses the ✅ and ⛔️ emoticons to show a successful or unsuccessful connection.
I then create a DataFrame containing all the IPs of our 4 remote VMs on GCP,
then we can use the apply() method to run bash commands on these VMs and return a DataFrame.
I then display a summary DataFrame containing the specs for these 4 VMs sitting on GCP.

Below I create the VM Class.

import os

from paramiko import SSHClient
from paramiko.auth_handler import AuthenticationException


class VM(object):
    def __init__(self, ip, username, pkey='~/.ssh/id_rsa.pub'):
        self.ip = ip
        self.username = username
        # Expand '~' so paramiko receives a full path to the key file.
        # Note: paramiko's key_filename expects a private key (or certificate)
        # file, so point pkey at the matching private key if login fails.
        self.pkey = os.path.expanduser(pkey)

        # Assume the login works until the connection attempt proves otherwise.
        self.logged_in_emoj = '✅'
        self.logged_in = True

        try:
            ssh = SSHClient()
            ssh.load_system_host_keys()
            ssh.connect(hostname=self.ip,
                        username=self.username,
                        key_filename=self.pkey)
            ssh.close()
        except AuthenticationException as exception:
            print(exception)
            print('Login failed for %s' % (self.username + '@' + self.ip))
            self.logged_in_emoj = '⛔️'
            self.logged_in = False

    def __str__(self):
        return self.username + '@' + self.ip + ' ' + self.logged_in_emoj

I then create a DataFrame, VMs, which holds all the IPs for our 4 VMs on GCP.

VMs = pd.DataFrame(dict(IP=['35.204.255.178',
                            '35.204.96.40',
                            '35.204.213.24',
                            '35.204.115.95']))

We can then call the apply() method on the DataFrame, which iterates through each host and creates a VM Class object for each VM which gets stored in the VM column of the VMs DataFrame.

VMs['VM'] = VMs.apply(lambda row: VM(row.IP, USERNAME, PUB_KEY), axis=1)

Note that the __str__() method of our VM Class is used to represent the VM Class in a DataFrame, as seen below. Each VM is represented as username + ip + ✅, exactly how we defined it in the __str__() method.
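If you want to try the pattern without any real VMs, here is an offline sketch. FakeVM is a hypothetical stand-in that mimics the attributes of the VM Class but never opens an SSH connection, and the IPs and username are placeholders.

```python
import pandas as pd


class FakeVM:
    """Offline stand-in for the VM Class: same attributes, no SSH connection."""

    def __init__(self, ip, username):
        self.ip = ip
        self.username = username
        self.logged_in_emoj = '✅'  # pretend every login succeeded

    def __str__(self):
        return self.username + '@' + self.ip + ' ' + self.logged_in_emoj


vms = pd.DataFrame(dict(IP=['10.0.0.1', '10.0.0.2']))  # placeholder IPs
vms['VM'] = vms.apply(lambda row: FakeVM(row.IP, 'demo_user'), axis=1)

# The column holds the objects themselves, so their attributes stay reachable.
print(vms.VM.apply(str).tolist())
print(vms.VM.apply(lambda vm: vm.ip).tolist())
```

The point to notice is that the VM column stores full objects rather than strings, which is what lets us call methods on every VM with apply() later on.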

Ok great, we’ve created a DataFrame with a bunch of connected VMs inside. What can we do with these?

For those of you who don’t know, there is a command called lscpu on Linux which displays all the information about the CPUs on a machine. Below is an example output for vm1.

louwjlabuschagne_gmail_com@vm1:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:              3
CPU MHz:               2000.170
BogoMIPS:              4000.34
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0

We are now looking to get the output of the lscpu command for each of our 4 VMs on GCP. We can pass the lscpu command to the exec_command() method of the VM Class (see the github repo) to return the output of each VM’s lscpu command.

lscpu = VMs.VM.apply(lambda vm: vm.exec_command('lscpu'))
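The actual exec_command() method lives in the repo and is not reproduced in this post. As a rough idea of what such a helper could look like, here is a sketch written as a standalone function instead of a method. The name run_on_vm and the client_factory hook are hypothetical; the paramiko calls used are SSHClient.connect() and SSHClient.exec_command(), the latter returning file-like stdin, stdout and stderr objects.

```python
def run_on_vm(vm, command, client_factory=None):
    """Run `command` on the machine described by `vm` and return its stdout.

    `vm` is any object with ip, username and pkey attributes.
    `client_factory` is a hypothetical hook so a fake client can be swapped
    in for testing; by default paramiko's SSHClient is used.
    """
    if client_factory is None:
        from paramiko import SSHClient  # imported lazily; needs paramiko installed
        client_factory = SSHClient

    ssh = client_factory()
    ssh.load_system_host_keys()
    ssh.connect(hostname=vm.ip, username=vm.username, key_filename=vm.pkey)
    try:
        # paramiko returns file-like objects for stdin, stdout and stderr.
        _stdin, stdout, _stderr = ssh.exec_command(command)
        return stdout.read().decode('utf-8')
    finally:
        ssh.close()  # always release the connection, even on errors
```

With a helper like this, the line above could equally be written as VMs.VM.apply(lambda vm: run_on_vm(vm, 'lscpu')). Opening a fresh connection per command keeps the sketch simple, at the cost of a new SSH handshake on every call.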

This gives us a DataFrame like the one shown below.
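The raw lscpu output is one long string per VM, so it needs a little parsing before it turns into tidy columns. Below is a minimal sketch of one way to do that; parse_key_value_output is a hypothetical helper name, and the sample text is a shortened copy of the vm1 output above.

```python
def parse_key_value_output(text):
    """Turn 'Key:   value' lines (as printed by lscpu) into a dict of strings."""
    parsed = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons too.
        key, separator, value = line.partition(':')
        if separator:  # skip lines without a colon
            parsed[key.strip()] = value.strip()
    return parsed


sample = """Architecture:          x86_64
CPU(s):                1
Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
L1d cache:             32K"""

specs = parse_key_value_output(sample)
print(specs['CPU(s)'], '|', specs['Model name'])
```

A list of such dicts, one per VM, can be handed straight to pd.DataFrame() to get one column per lscpu field.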

Another useful command is the cat /proc/meminfo command, shown below, which returns the current state of the RAM for a Unix machine.

louwjlabuschagne_gmail_com@my-vm1:~$ cat /proc/meminfo
MemTotal:        1020416 kB
MemFree:          871852 kB
MemAvailable:     835736 kB
Buffers:           10164 kB
Cached:            53504 kB
SwapCached:            0 kB
Active:            92012 kB
Inactive:          17816 kB
Active(anon):      46308 kB
Inactive(anon):     4060 kB
Active(file):      45704 kB
Inactive(file):    13756 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                28 kB
Writeback:             0 kB
AnonPages:         46176 kB
Mapped:            25736 kB

I’ve extracted the most relevant columns from the lscpu and cat /proc/meminfo commands and display an overview of our 4 VMs below. We can plot this information quickly with a library like seaborn or plotly that works great out of the box with DataFrame objects, or we can get summary statistics for all our VMs using the built-in methods pandas has.
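For the memory figures, the values first have to be turned from text like "1020416 kB" into numbers. Here is a minimal sketch, assuming we only care about a handful of fields; meminfo_to_gib is a hypothetical helper name, and the sample is a shortened copy of the output above.

```python
def meminfo_to_gib(text, fields=('MemTotal', 'MemFree', 'MemAvailable')):
    """Extract selected /proc/meminfo fields and convert them from kB to GiB."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(':')
        if separator and key.strip() in fields:
            kilobytes = int(rest.split()[0])  # e.g. '1020416 kB' -> 1020416
            # /proc/meminfo reports units of 1024 bytes, so divide by 1024 twice.
            values[key.strip()] = round(kilobytes / 1024 ** 2, 2)
    return values


sample = """MemTotal:        1020416 kB
MemFree:          871852 kB
MemAvailable:     835736 kB
Buffers:           10164 kB"""

memory = meminfo_to_gib(sample)
print(memory)
```

Once the values are numeric, the usual pandas tools such as describe() or a quick plot work on them directly.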

Conclusion

This post has only scratched the surface of how using Classes and DataFrames in conjunction with each other can ease your life. Be sure to check out the Jupyter notebook on the github repo to fill in some coding gaps I’ve alluded to in this post.

The next time you are doing data wrangling with pandas, I encourage you to take a step back and consider wrapping some of the functionality you need in a Class and seeing how that could improve your workflow. Once written, you can always reuse the Class in your subsequent analysis or productionise it with your code. As the Python mindset goes: “Don’t reinvent the wheel every time.”