Managing VMs like a Data Scientist

Managing virtual machines (VMs) as a data scientist can be tedious. If you are like me and work in a team that is not fortunate enough to have a data engineer cleaning, prepping and giving you your data on a plate with some garnish on the side, then you have to manage, extract and manipulate files sitting on various VMs. Logging into each of these VMs to check that all the necessary files were dumped, all the necessary packages were installed and all the cron jobs executed on time can be a time-consuming, inefficient and downright laborious task.

Background

Below is some background for those stumbling onto this post with no clue about any of the topics. Feel free to skip the sections you know a lot about; the snippets below give a high-level overview so that the post makes sense as a whole.

VMs

Google Cloud Platform (GCP)
gcloud compute instances create vm1 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm2 --custom-cpu 2 --custom-memory 2
gcloud compute instances create vm3 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm4 --custom-cpu 2 --custom-memory 2

SSH

If you’ve never worked in a shell before, you should give it a bash… For those unfamiliar with what the shell even is, it’s the screen that looks like The Matrix that all the techies use at work; I show an example below.

Example of logging into a remote server using ssh in a Bash shell.
ssh root@ip
ssh louwjlabuschagne_gmail_com@35.204.226.178
louwjlabuschagne_gmail_com@vm1:~$

Python Classes

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)

print(p1.name)
print(p1.age)
people = [Person("Jane", 29), Person("John", 36), Person("Blake", 10)]
for person in people:
    print(person.name)

Pandas

  • what the apply() function does to a column of a DataFrame.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
# apply() on a single column passes each value (not a whole row) to the function
iris.sepal_length.apply(lambda length: 'tall' if length >= 5 else 'short')

Bring it all together

I hear you saying: “OK cool Louwrens, nice background, but so what?” Well, we’ve covered all the theory needed to understand what is about to happen, which is:

  • I define a VM class whose __init__ method checks if we can connect to the remote VM and uses the ✅ and ⛔️ emoji to show a successful or unsuccessful connection.
  • I then create a DataFrame containing all the IPs of our 4 remote VMs on GCP,
  • then we can use the apply() method to run bash commands on these VMs and return a DataFrame.
  • I then display a summary DataFrame containing the specs for these 4 VMs sitting on GCP.
import os

from paramiko import AuthenticationException, SSHClient


class VM(object):
    def __init__(self, ip, username, pkey='~/.ssh/id_rsa'):
        self.ip = ip
        self.username = username
        # key_filename expects a private key file; expand '~' explicitly.
        self.pkey = os.path.expanduser(pkey)

        # Assume success, then downgrade if authentication fails.
        self.logged_in_emoj = '✅'
        self.logged_in = True
        try:
            ssh = SSHClient()
            ssh.load_system_host_keys()
            ssh.connect(hostname=self.ip,
                        username=self.username,
                        key_filename=self.pkey)
            ssh.close()
        except AuthenticationException as exception:
            print(exception)
            print('Login failed for %s' % (self.username + '@' + self.ip))
            self.logged_in_emoj = '⛔️'
            self.logged_in = False

    def __str__(self):
        return self.username + '@' + self.ip + ' ' + self.logged_in_emoj
VMs = pd.DataFrame(dict(IP=['35.204.255.178',
                            '35.204.96.40',
                            '35.204.213.24',
                            '35.204.115.95']))
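The step where apply() runs a bash command on every VM is not shown in this post, so here is a minimal sketch of how it could look. The run_command helper, the client_factory parameter, the FakeClient stand-in and the 'some_user' username are all assumptions made for illustration, not code from the notebook. FakeClient only echoes the command back so the example runs offline; leave client_factory out to use paramiko's real SSHClient.

```python
import io

import pandas as pd


def run_command(ip, username, command, client_factory=None):
    """Run `command` on the host at `ip` over SSH and return its stdout as text."""
    if client_factory is None:
        # Imported lazily so the offline demo below works without paramiko.
        from paramiko import SSHClient
        client_factory = SSHClient
    ssh = client_factory()
    ssh.load_system_host_keys()
    try:
        ssh.connect(hostname=ip, username=username)
        _stdin, stdout, _stderr = ssh.exec_command(command)
        return stdout.read().decode().strip()
    finally:
        ssh.close()


class FakeClient:
    """Hypothetical offline stand-in for SSHClient that echoes the command back."""

    def load_system_host_keys(self):
        pass

    def connect(self, hostname, username):
        self.hostname = hostname

    def exec_command(self, command):
        reply = f'{command} ran on {self.hostname}\n'.encode()
        return None, io.BytesIO(reply), None

    def close(self):
        pass


vms = pd.DataFrame(dict(IP=['35.204.255.178', '35.204.96.40']))
# One SSH round trip per row; the result becomes a new column.
vms['output'] = vms.IP.apply(
    lambda ip: run_command(ip, 'some_user', 'nproc', FakeClient))
print(vms)
```

Because apply() calls the function once per IP, each row costs a full connect and disconnect. That is fine for a handful of VMs, but it would probably need batching or connection reuse for a larger fleet.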
louwjlabuschagne_gmail_com@vm1:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping: 3
CPU MHz: 2000.170
BogoMIPS: 4000.34
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0
louwjlabuschagne_gmail_com@my-vm1:~$ cat /proc/meminfo
MemTotal: 1020416 kB
MemFree: 871852 kB
MemAvailable: 835736 kB
Buffers: 10164 kB
Cached: 53504 kB
SwapCached: 0 kB
Active: 92012 kB
Inactive: 17816 kB
Active(anon): 46308 kB
Inactive(anon): 4060 kB
Active(file): 45704 kB
Inactive(file): 13756 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 28 kB
Writeback: 0 kB
AnonPages: 46176 kB
Mapped: 25736 kB
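
Raw lscpu and /proc/meminfo output like the above still has to be boiled down before it fits into a summary DataFrame. The sketch below shows one way that could be done. The helpers parse_key_values and summarise_specs are hypothetical names of my own, and the sample strings are a few lines copied from the output above.

```python
def parse_key_values(text):
    """Turn 'Key: value' lines (as printed by lscpu or /proc/meminfo) into a dict."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep:  # ignore lines that contain no colon
            pairs[key.strip()] = value.strip()
    return pairs


def summarise_specs(lscpu_text, meminfo_text):
    """Pick out the CPU count, CPU model and total memory for one VM."""
    cpu = parse_key_values(lscpu_text)
    mem = parse_key_values(meminfo_text)
    # MemTotal looks like '1020416 kB'; keep only the number.
    mem_total_kb = int(mem['MemTotal'].split()[0])
    return {
        'cpus': int(cpu['CPU(s)']),
        'model': cpu['Model name'],
        'mem_total_gib': round(mem_total_kb / 1024 ** 2, 2),
    }


lscpu_sample = """Architecture: x86_64
CPU(s): 1
On-line CPU(s) list: 0
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz"""

meminfo_sample = """MemTotal: 1020416 kB
MemFree: 871852 kB"""

specs = summarise_specs(lscpu_sample, meminfo_sample)
print(specs)
```

Collecting one such dict per VM and passing the list to pd.DataFrame would give a table with one row of specs per machine.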

Conclusion

This post has only scratched the surface of how using classes and DataFrames in conjunction with each other can ease your life. Be sure to check out the Jupyter notebook on the GitHub repo to fill in some of the coding gaps I’ve alluded to in this post.

Data Scientist, Co-Founder Automator Plus, Musician
