An Algorithm for Matching Heterogeneous Financial Databases: a Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

Irene Rodriguez-Lujan, Ramon Huerta

Abstract


Rigorous linking of financial databases is a necessary step in testing trading strategies that incorporate multimodal sources of information. This paper proposes a machine learning solution for matching companies across heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, one for each of the two interrelated tasks commonly involved in heterogeneous database matching: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution learns a company distance that is flexible enough to handle the numeric and string attribute links found by the schema matching algorithm and to incorporate different string matching approaches, such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD matches the joint Compustat/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score over 0.94 using only a hundred manually labeled company links.
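To make the schema matching idea concrete, the following is a minimal sketch of linking numeric attributes by Kullback-Leibler divergence. It is not the FASD procedure itself, whose details are not given in this abstract: the histogram binning, the additive smoothing, the symmetrized score, and all function and column names (`histogram`, `kl_divergence`, `match_numeric_attributes`, `price`, `field_1`, and so on) are assumptions made for illustration, and the data are toy values.

```python
import math

def histogram(values, lo, hi, bins=10, eps=1e-9):
    """Normalized histogram of values over [lo, hi] with additive smoothing."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the top edge
        counts[idx] += 1
    total = sum(counts) + eps * bins
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def match_numeric_attributes(table_a, table_b, bins=10):
    """Link each column of table_a to the column of table_b whose value
    distribution is closest under a symmetrized KL divergence (an assumed
    scoring rule, chosen here only for illustration)."""
    links = {}
    for name_a, col_a in table_a.items():
        best = None
        for name_b, col_b in table_b.items():
            # Use a shared range so both histograms have comparable bins.
            lo = min(min(col_a), min(col_b))
            hi = max(max(col_a), max(col_b))
            p = histogram(col_a, lo, hi, bins)
            q = histogram(col_b, lo, hi, bins)
            score = kl_divergence(p, q) + kl_divergence(q, p)
            if best is None or score < best[1]:
                best = (name_b, score)
        links[name_a] = best
    return links

# Toy tables with hypothetical column names; the second uses opaque names.
table_a = {"price": [10, 12, 11, 13, 12], "shares": [1000, 1200, 900, 1100, 1050]}
table_b = {"field_1": [950, 1150, 1000, 1250, 1100], "field_2": [11, 12, 10, 13, 11]}

links = match_numeric_attributes(table_a, table_b, bins=4)
for name_a, (name_b, score) in links.items():
    print(f"{name_a} -> {name_b} (symmetrized KL = {score:.3f})")
```

In this sketch a column is linked to whichever candidate has the most similar value distribution, regardless of how the columns are named. String attributes would need a different representation (for example, character or token frequency distributions), which is not shown here.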







DOI: https://doi.org/10.11114/aef.v3i1.1164




Applied Economics and Finance    ISSN 2332-7294 (Print)   ISSN 2332-7308 (Online)

Copyright © Redfame Publishing Inc.
