Comparing Dataframes with Python

I created a "class" that will help compare two dataframes in Python. This is a easy-to-use version compared to the one that was created earlier.
Calling the functions in this class is easy.[3:8], df[3:4])
import pandas as pd

class Compare(object):
    #e.g., compare(df[3:8], df[3:4])
    def compare(x, y):
        return pd.merge(x, y, on = list(x), how = 'outer', indicator = '_difference_')

    #e.g Compare.readxl("c:\\users\\vprab\downloads\\Financial Sample.xlsx")[3:8]
    def readxl(file, sheet_name = 0, header = 0, skiprows = 0):
        return pd.read_excel(file, sheet_name = sheet_name, header = header, skiprows = skiprows)
    #e.g., readcboard(sep = ',')
    def readcboard(sep = ','):
        return pd.read_clipboard(sep = sep)
    #e.g., writexl(pd.read_clipboard(sep='a'), "prp.xlsx")
    def writexl(df, file):
        writer = pd.ExcelWriter(file)
    #e.g., writecboard(pd.read_clipboard(sep='a'))
    def writecboard(df, excel = True):
        df.to_clipboard(excel = excel, index = False)

Creating a Table of Word Frequencies in R

Function word_freq_table() takes a text and returns a table with the words within the text and the number of occurrences of that word. A list of optional attributes are also available to cleanse the resulting list of words:

  • to_upper: a logical parameter to inform if the text should be converted to upper case.
  • remove_punct: informs the punctuation/special characters to be removed.
  • remove_numbers: a logical parameter to inform if the numbers/letters 0 through 9 must be removed.
  • replace_CR: informs if a “carriage return” character must be replaced with something else.
  • replace_LF: informs if a “line feed” characters must be replaced with something else.
  • remove_repetitive_space: a logical parameter to inform if repetitive blank spaces must be replaced with a single blank space.
  • wordlength_atleast: informs the minimum length of words to be listed in the resulting table.

Here is the code block to call the function.

> word_freq_table(txt, replace_LF = " ", wordlength_atleast = 5)   %>%
+   head()
# Words Total: 228, Unique: 202
# String Length Minimum: 5, Maximum: 14
          Word Freq
2    ACCORDING    1
3        AFTER    2
4      AGAINST    1
5    AGREEMENT    1
6    APPOINTED    2

The function is listed below.

word_freq_table <-
           to_upper = TRUE,
           remove_punct = "[[:punct:]]",
           remove_numbers = TRUE,
           replace_CR = "\r",
           replace_LF = "\n",
           remove_repetitive_space = TRUE,
           wordlength_atleast = 1) {
    if (to_upper)
      txt <- stringr::str_to_upper(txt)
    if (remove_punct != "")
      txt <- stringr::str_replace_all(txt, remove_punct, "")
    if (remove_numbers)
      txt <- stringr::str_replace_all(txt, "[0-9]", "")
    txt <- stringr::str_replace_all(txt, "[\r]", replace_CR)
    txt <- stringr::str_replace_all(txt, "[\n]", replace_LF)
    if (remove_repetitive_space)
      txt <- stringr::str_replace_all(txt, "[ ]{2,}", " ")
    Word <- stringr::str_split(txt, " ")
    freq <- table(Word)
    freq <-
    freq <- dplyr::filter(freq, stringr::str_length(Word) >= wordlength_atleast)
    cat(sprintf("# Words Total: %s, Unique: %s\r\n", sum(freq$Freq), nrow(freq)))
      "# String Length Minimum: %s, Maximum: %s\r\n",

Graphs with GrViz through R-DiagrammeR

It is very easy to draw flowcharts or graphs through DiagrammeR. Below is a quick snippet to encourage you to look for more about GrViz and DiagrammeR.


digraph nicegraph {
graph[rankdir = TB]
node [fontname = Helvetica,
shape = rectangle, fixedsize = true, width = 2]
node [fillcolor = 'YellowGreen', style = filled]
A[label = 'ABC']
D[label = 'XYZ']

node [fillcolor = 'Orange', style = filled]
B[label = 'ABC']
C[label = 'ABD']
E[label = 'DEF']
F[label = 'MNO']

edge [arrowhead = vee]

Retrieving Oracle table results from Python

A simple two liner to retrieve data results from Oracle.

# 1. Download and unzip 
#      Instant Client Package - Basic (64 bit or 32 bit same as your Python client) 
#      from
# 2. Add ..\instantclient_12_2 to environment variable "Path"
# 3. Install module cx_Oracle: 
#      py!pip install C:/cx_Oracle-6.0.2-cp36-cp36m-win_amd64.whl

import cx_Oracle as ora
import pandas as pd
conn = ora.connect("uname", "pwd", "host:port/sid")
dset = pd.read_sql("SELECT * FROM TABLE", conn)

Combining multiple files to one Excel file

This python code below allows you to choose multiple files (.csv, .xls, .xlsx) and create a multi-worksheet Excel file.

# -*- coding: utf-8 -*-

from tkinter import filedialog
from tkinter import *
from pathlib import Path
import pandas as pd

root = Tk()
infiles = filedialog.askopenfilenames(
                title = "Select the file you wish to convert", 
                filetypes = (("Excel files", "*.xls;*.xlsx"),("all files","*.*")))
counter = 1
if (len(infiles) > 0):
    p = Path(infiles[0])
    outfile = str(p.parent) + "\\newfile.xlsx"
    writer = pd.ExcelWriter(outfile, engine='xlsxwriter')

for infile in infiles:
    p = Path(infile)
    if (p.suffix == ".csv"):
        inp=pd.read_csv(infile, sep = ",")
    elif (p.suffix in [".xlsx", ".xls"]):
        inp=pd.read_excel(infile, dype = str)

    counter = counter + 1
    sheetname =
    sheetname = sheetname.replace(p.suffix, "")
    inp.to_excel(writer, index = False, sheet_name = sheetname[0:30])

except NameError:
    print("No file selected")

Comparing Excel files with Python

A quick Python program to compare two excel files and create an output file.

from tkinter import filedialog
from tkinter import *
import pandas as pd

root = Tk()
df1=pd.read_excel(filedialog.askopenfilename(title = "Select the file you wish to compare", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))))
df2=pd.read_excel(filedialog.askopenfilename(title = "Select the file with which you wish to compare", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))))
df3=pd.merge(df1, df2)
writer = pd.ExcelWriter(filedialog.asksaveasfilename(title = "Select the folder where you wish to save the file", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))), engine='xlsxwriter')

Suspicious Activity Report Statistics

A concise view of the SAR stats.

Tagged , , , ,

Converting CSV to Table rows in SQL Server and Oracle

Another useful block of code that will help in converting CSV to a table row.

with temp as
    select 108 Id, 'australia, australia, united states of america, iran' str
    select 109, 'india, iran, iraq'
    SELECT 109, 'israel, australia'
), xmlData as
     SELECT Id,  
         CAST ('<M>' + REPLACE(str, ',', '</M><M>') + '</M>' AS XML) AS Data  
     FROM  temp
SELECT xmlData.Id,  
     ltrim(rtrim(Split.a.value('.', 'VARCHAR(100)'))) AS Data  
FROM  xmlData CROSS APPLY Data.nodes ('/M') AS Split(a)

and in Oracle,

with temp as
    select 108 Id, 'australia, australia, united states of america, iran' str  from dual
    select 109, 'india, iran, iraq' from dual
    SELECT 109, 'israel, australia' FROM dual
SELECT DISTINCT ID, ltrim(regexp_substr(str, '[^' || ',' || ']+', 1, LEVEL)) str
FROM temp T
CONNECT BY regexp_substr(str, '[^' || ',' || ']+', 1, LEVEL) IS NOT NULL