Please use this identifier to cite or link to this item: https://idr.l1.nitk.ac.in/jspui/handle/123456789/7286
Title: An Empirical Study to Detect the Collision Rate in Similarity Hashing Algorithm Using MD5
Authors: Gangavarapu, T.
Jaidhar, C.D.
Issue Date: 2019
Citation: 2019 International Conference on Data Science and Engineering, ICDSE 2019, 2019, pp. 11-14
Abstract: Similarity Hashing (SimHash) is a widely used locality-sensitive hashing algorithm employed in the detection of similarity in large-scale data processing, including plagiarism detection and near-duplicate web document detection. Collision resistance is a crucial property of cryptographic hash algorithms that are used to verify message integrity in internet security applications. A hash function is said to be collision-resistant if it is hard to find two different inputs that hash to the same output. In this paper, we present an empirical study to detect the collision rate when SimHash is employed to check message integrity. The analysis was performed using bit sequences with length varying from 2 to 32 and Message Digest 5 (MD5) as the internal hash function. Furthermore, to enable faster collision detection with more significant speedup and efficient space utilization, we parallelized the process using a distributed data-parallel approach with synchronous computation and optimum load balancing. Collision detection is desirable, owing to its applicability in digital signature systems, proof-of-work systems, and distributed content systems. Our empirical study revealed a collision rate ranging from 0% to 0.048% in SimHash (with MD5) as the length of the bit sequence varied. © 2019 IEEE.
URI: http://idr.nitk.ac.in/jspui/handle/123456789/7286
Appears in Collections: 2. Conference Papers
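
The abstract describes computing SimHash fingerprints with MD5 as the internal hash and counting collisions over bit sequences of a given length. The record does not say how the authors turn a bit sequence into features or how they define the collision rate, so the following is only a minimal sketch under stated assumptions: each feature is a hypothetical position-tagged bit ("index:bit"), the fingerprint is 128 bits wide, and the collision rate is taken to be the fraction of inputs that share a fingerprint with at least one other input. The names `md5_int`, `simhash`, and `collision_rate` are illustrative and are not taken from the paper, and the sketch is not expected to reproduce the reported figures.

```python
import hashlib
from collections import Counter
from itertools import product


def md5_int(feature: str) -> int:
    """Return the 128-bit MD5 digest of a feature string as an integer."""
    return int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")


def simhash(bit_string: str, bits: int = 128) -> int:
    """SimHash fingerprint of a bit string, using MD5 as the internal hash.

    Assumption: each feature is "position:bit"; the paper may use a
    different feature scheme.
    """
    counts = [0] * bits
    for index, bit in enumerate(bit_string):
        digest = md5_int(f"{index}:{bit}")
        for b in range(bits):
            # Vote +1 if this digest bit is set, otherwise -1.
            counts[b] += 1 if (digest >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:  # a tie (count of 0) maps to bit 0
            fingerprint |= 1 << b
    return fingerprint


def collision_rate(length: int) -> float:
    """Fraction of all bit strings of `length` whose fingerprint is shared.

    Assumption: this definition of "collision rate" is illustrative only.
    """
    fingerprints = Counter(
        simhash("".join(p)) for p in product("01", repeat=length)
    )
    colliding = sum(n for n in fingerprints.values() if n > 1)
    return colliding / (2 ** length)


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, collision_rate(n))
```

Exhaustive enumeration visits 2^n inputs, so this single-process sketch is only practical for short lengths; the upper end of the 2 to 32 range mentioned in the abstract is where a parallel approach becomes relevant.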
