Fuzzy Joins in Python with d6tjoin


By Norman Niemer, Chief Data Scientist

 

Combining different data sources is a time suck!

 
Combining data from different sources can be a big time suck for data scientists. d6tjoin is a python library that lets you join pandas dataframes quickly and efficiently.

Coauthored with Haijing Li, Data Analyst in Financial Services, MS Business Analytics@Columbia University.

 

df_price
df_score

To prepare for backtesting, I need to merge “score” column to df_price. Obviously, ticker name and date should be the merge keys. But there are two problems: 1.Values in “ticker” of df_price and of df_valuation are not identical; 2.Scores were recorded on a monthly basis and I want each row in df_price to be assigned with the most recent assuming next score would not be available until next update date.

 

match_quality()

 

top1

The second line gives back the merged dataset. merged

The original df_price dataset contains only 1536 rows but why after merged it becomes 5209 rows? Well, that’s because in the original datasets we parse “date” as strings. d6tjoin use Levenshtein distance to calculate difference for strings by default and so one date from left would be matched with several dates from right. That tells you everytime dealing with date you should check its datatype first.

Change “date” datatype from string to datetime object and set top_limit=3 for “ticker”. Let’s check the result again.

import datetime
df_price["date"]=df_price["date"].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
df_score["date"]=df_score["date"].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
result=d6tjoin.top1.MergeTop1(df_price,df_score,fuzzy_left_on=['ticker','date'],fuzzy_right_on=['ticker','date'],top_limit=[3,None]).merge()
result['top1']['ticker']
result['merged']

top1
result

Looks good! All the tickers in left are perfectly matched and dates from left are matched to the closest from the right.

 

top1
result

Now we have our final merged dataset: price data without a previously assigned score are ignored and the others are each assigned with a previously most recent score.

 

d6t library. It provides solutions to common data science problems including:

  • d6tflow: build data science workflow
  • d6tjoin: quickly fuzzy joins
  • d6tpipe: quickly share and distribute data

 
Bio: Norman Niemer is the Chief Data Scientist at a large asset manager where he delivers data-driven investment insights. He holds a MS Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).

Original. Reposted with permission.

Related:



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *