Comparing Dataframes with Python

I created a "class" that will help compare two dataframes in Python. This is a easy-to-use version compared to the one that was created earlier.
Calling the functions in this class is easy.

Compare.compare(df[3:8], df[3:4])
import pandas as pd

class Compare(object):
    #e.g., compare(df[3:8], df[3:4])
    @staticmethod
    def compare(x, y):
        return pd.merge(x, y, on = list(x), how = 'outer', indicator = '_difference_')

    #e.g Compare.readxl("c:\\users\\vprab\downloads\\Financial Sample.xlsx")[3:8]
    @staticmethod
    def readxl(file, sheet_name = 0, header = 0, skiprows = 0):
        return pd.read_excel(file, sheet_name = sheet_name, header = header, skiprows = skiprows)
    
    #e.g., readcboard(sep = ',')
    @staticmethod
    def readcboard(sep = ','):
        return pd.read_clipboard(sep = sep)
    
    #e.g., writexl(pd.read_clipboard(sep='a'), "prp.xlsx")
    @staticmethod
    def writexl(df, file):
        writer = pd.ExcelWriter(file)
        df.to_excel(writer,'Sheet1',index=False)
        writer.save()
    
    #e.g., writecboard(pd.read_clipboard(sep='a'))
    @staticmethod
    def writecboard(df, excel = True):
        df.to_clipboard(excel = excel, index = False)

Creating a Table of Word Frequencies in R

Function word_freq_table() takes a text and returns a table with the words within the text and the number of occurrences of that word. A list of optional attributes are also available to cleanse the resulting list of words:

  • to_upper: a logical parameter to inform if the text should be converted to upper case.
  • remove_punct: informs the punctuation/special characters to be removed.
  • remove_numbers: a logical parameter to inform if the numbers/letters 0 through 9 must be removed.
  • replace_CR: informs if a “carriage return” character must be replaced with something else.
  • replace_LF: informs if a “line feed” characters must be replaced with something else.
  • remove_repetitive_space: a logical parameter to inform if repetitive blank spaces must be replaced with a single blank space.
  • wordlength_atleast: informs the minimum length of words to be listed in the resulting table.

Here is the code block to call the function.

> word_freq_table(txt, replace_LF = " ", wordlength_atleast = 5)   %>%
+   head()
# Words Total: 228, Unique: 202
# String Length Minimum: 5, Maximum: 14
          Word Freq
1 ACCELERATION    1
2    ACCORDING    1
3        AFTER    2
4      AGAINST    1
5    AGREEMENT    1
6    APPOINTED    2

The function is listed below.

word_freq_table <-
  function(txt,
           to_upper = TRUE,
           remove_punct = "[[:punct:]]",
           remove_numbers = TRUE,
           replace_CR = "\r",
           replace_LF = "\n",
           remove_repetitive_space = TRUE,
           wordlength_atleast = 1) {
    if (to_upper)
      txt <- stringr::str_to_upper(txt)
    if (remove_punct != "")
      txt <- stringr::str_replace_all(txt, remove_punct, "")
    if (remove_numbers)
      txt <- stringr::str_replace_all(txt, "[0-9]", "")
    txt <- stringr::str_replace_all(txt, "[\r]", replace_CR)
    txt <- stringr::str_replace_all(txt, "[\n]", replace_LF)
    if (remove_repetitive_space)
      txt <- stringr::str_replace_all(txt, "[ ]{2,}", " ")
    
    Word <- stringr::str_split(txt, " ")
    freq <- table(Word)
    freq <- as.data.frame(freq)
    
    freq <- dplyr::filter(freq, stringr::str_length(Word) >= wordlength_atleast)
    cat(sprintf("# Words Total: %s, Unique: %s\r\n", sum(freq$Freq), nrow(freq)))
    cat(sprintf(
      "# String Length Minimum: %s, Maximum: %s\r\n",
      min(stringr::str_length(freq$Word)),
      max(stringr::str_length(freq$Word))
    ))
    freq
  }

Graphs with GrViz through R-DiagrammeR

It is very easy to draw flowcharts or graphs through DiagrammeR. Below is a quick snippet to encourage you to look for more about GrViz and DiagrammeR.


library(DiagrammeR)

grViz("
digraph nicegraph {
graph[rankdir = TB]
node [fontname = Helvetica,
shape = rectangle, fixedsize = true, width = 2]
node [fillcolor = 'YellowGreen', style = filled]
A[label = 'ABC']
D[label = 'XYZ']

node [fillcolor = 'Orange', style = filled]
B[label = 'ABC']
C[label = 'ABD']
E[label = 'DEF']
F[label = 'MNO']

edge [arrowhead = vee]
A->B;A->C;D->E;D->F
}
")

Retrieving Oracle table results from Python

A simple two liner to retrieve data results from Oracle.

# 1. Download and unzip 
#      Instant Client Package - Basic (64 bit or 32 bit same as your Python client) 
#      from http://www.oracle.com/technetwork/topics/winx64soft-089540.html
# 2. Add ..\instantclient_12_2 to environment variable "Path"
# 3. Install module cx_Oracle: 
#      py!pip install C:/cx_Oracle-6.0.2-cp36-cp36m-win_amd64.whl

import cx_Oracle as ora
import pandas as pd
conn = ora.connect("uname", "pwd", "host:port/sid")
dset = pd.read_sql("SELECT * FROM TABLE", conn)

Combining multiple files to one Excel file

This python code below allows you to choose multiple files (.csv, .xls, .xlsx) and create a multi-worksheet Excel file.


# -*- coding: utf-8 -*-

from tkinter import filedialog
from tkinter import *
from pathlib import Path
import pandas as pd

root = Tk()
infiles = filedialog.askopenfilenames(
                title = "Select the file you wish to convert", 
                filetypes = (("Excel files", "*.xls;*.xlsx"),("all files","*.*")))
counter = 1
if (len(infiles) > 0):
    p = Path(infiles[0])
    outfile = str(p.parent) + "\\newfile.xlsx"
    print(outfile)
    writer = pd.ExcelWriter(outfile, engine='xlsxwriter')

for infile in infiles:
    p = Path(infile)
    if (p.suffix == ".csv"):
        inp=pd.read_csv(infile, sep = ",")
    elif (p.suffix in [".xlsx", ".xls"]):
        inp=pd.read_excel(infile, dype = str)

    counter = counter + 1
    sheetname = p.name
    sheetname = sheetname.replace(p.suffix, "")
    inp.to_excel(writer, index = False, sheet_name = sheetname[0:30])

try:
    writer.save()
except NameError:
    print("No file selected")

Comparing Excel files with Python

A quick Python program to compare two excel files and create an output file.


from tkinter import filedialog
from tkinter import *
import pandas as pd

root = Tk()
df1=pd.read_excel(filedialog.askopenfilename(title = "Select the file you wish to compare", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))))
df2=pd.read_excel(filedialog.askopenfilename(title = "Select the file with which you wish to compare", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))))
df3=pd.merge(df1, df2)
writer = pd.ExcelWriter(filedialog.asksaveasfilename(title = "Select the folder where you wish to save the file", filetypes = (("Excel files", "*.xls;*.xlsx"), ("Text files", "*.txt;*.csv;*.tsv"),("all files","*.*"))), engine='xlsxwriter')
df3.to_excel(writer)
writer.save()

Suspicious Activity Report Statistics

A concise view of the SAR stats.


Tagged , , , ,

Converting CSV to Table rows in SQL Server and Oracle

Another useful block of code that will help in converting CSV to a table row.

with temp as
(
    select 108 Id, 'australia, australia, united states of america, iran' str
    UNION ALL
    select 109, 'india, iran, iraq'
    UNION ALL
    SELECT 109, 'israel, australia'
), xmlData as
(
     SELECT Id,  
         CAST ('<M>' + REPLACE(str, ',', '</M><M>') + '</M>' AS XML) AS Data  
     FROM  temp
)
SELECT xmlData.Id,  
     ltrim(rtrim(Split.a.value('.', 'VARCHAR(100)'))) AS Data  
FROM  xmlData CROSS APPLY Data.nodes ('/M') AS Split(a)

and in Oracle,

with temp as
(
    select 108 Id, 'australia, australia, united states of america, iran' str  from dual
    UNION ALL
    select 109, 'india, iran, iraq' from dual
    UNION ALL
    SELECT 109, 'israel, australia' FROM dual
)
SELECT DISTINCT ID, ltrim(regexp_substr(str, '[^' || ',' || ']+', 1, LEVEL)) str
FROM temp T
CONNECT BY regexp_substr(str, '[^' || ',' || ']+', 1, LEVEL) IS NOT NULL
ORDER BY ID;