Title: Schema-Aware Indexes for JSON Document Collections
Authors: D, Uma Priya
Supervisors: Thilagam, P. Santhi
Keywords: JSON;Schema extraction;Schema variants;JSON Indexing
Issue Date: 2023
Publisher: National Institute Of Technology Karnataka Surathkal
Abstract: Web applications, IoT devices, and other real-time applications generate an abundance of multi-structured data every day, increasing the complexity of data storage and management. Large organizations such as Amazon, Google, and Facebook use NoSQL databases to store these large sets of diverse data. NoSQL databases offer an efficient architecture for meeting the performance and scale requirements of big data compared to relational databases. NoSQL document stores adopt the JSON format as the de facto standard for storing multi-structured data. The "data first, schema later" approach of document stores greatly enhances the use of the JSON data format in modern applications. However, this flexibility poses several challenges for data management and knowledge discovery tasks. A JSON collection does not have an explicit schema to describe the internal structures of documents; instead, the schema is implicit in the data, allowing the documents to have various structures. Therefore, knowledge of the implicit schemas is essential to understand the data stored in the collection. This schema information can be helpful for efficient data retrieval, data integration, query formulation, etc. In this direction, existing research extracts schemas from JSON documents using their structural relatedness and generates either a global schema or schema variants. The global schema is the structural representation of the whole collection that summarises the unique attributes in a collection. This information is generally used for JSON document validation, query formulation, etc. As the global schema does not capture the different sets of attributes available in each document, it does not support various data management tasks such as data integration, query optimization, etc. To overcome this limitation, a few studies focus on extracting schema variants from the collections.
Schema variants represent the schema versions or distinct schemas of JSON collections that support the above-mentioned data management tasks effectively. Most literature focuses on extracting the schema versions from a collection using schema class types (entities) manually embedded in the documents. Due to the dynamic nature and sheer size of JSON documents, the manual embedding of class types in each document is not feasible in a real-time scenario. To address this issue, researchers employ clustering approaches to automatically identify the class types of a JSON collection in two steps: first, the schemas are extracted from a collection, and then the documents are clustered using the structural similarity of the extracted schemas. However, differently annotated JSON schemas are not only structurally heterogeneous but also semantically heterogeneous. Literature shows that the automatic identification of class types of JSON documents based on structural and semantic similarity of JSON schemas is still in its infancy. To address these research gaps, this research employs both syntactic and semantic relationships of JSON schemas to capture the contextual information. In this work, we propose (i) the Schema Embeddings for JSON Documents (SchemaEmbed) model to capture the contextually similar JSON schemas, (ii) an Embedding-based Clustering approach to group the contextually similar JSON documents, and (iii) the Schema Variants Tree (SVTree) to represent the schema variants of each cluster. As SVTree contains information about the core (common) and schema-specific attributes in a cluster, it supports efficient data retrieval. The proposed approach is evaluated with real-world and synthetic datasets. The results and findings demonstrate that the proposed approach outperforms the current approaches significantly in grouping the contextually similar JSON documents. In addition, the impact of clustering in constructing a compact SVTree is also studied.
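To make the notions of implicit schema, schema variant, and core versus schema-specific attributes concrete, the following is a minimal illustrative sketch. It is not the SchemaEmbed model or the SVTree structure proposed in the thesis: it assumes that a document's schema is simply its set of dotted attribute paths, and that the documents shown already belong to one cluster. All function names and sample documents are hypothetical.

```python
from collections import defaultdict


def attribute_paths(value, prefix=""):
    """Return the set of dotted attribute paths in a parsed JSON value.

    Assumption: array elements share the path of their parent attribute.
    """
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            paths |= attribute_paths(item, prefix)
    return paths


def schema_variants(documents):
    """Group document positions by their distinct attribute-path set."""
    groups = defaultdict(list)
    for position, doc in enumerate(documents):
        groups[frozenset(attribute_paths(doc))].append(position)
    return dict(groups)


def core_and_specific(variants):
    """Split the variants into shared (core) and per-variant attributes."""
    schemas = list(variants)
    core = frozenset.intersection(*schemas) if schemas else frozenset()
    specific = {schema: schema - core for schema in schemas}
    return core, specific


# Hypothetical documents, assumed to be in the same cluster.
docs = [
    {"title": "A", "author": {"name": "X"}, "year": 2020},
    {"title": "B", "author": {"name": "Y"}, "tags": ["x"]},
    {"title": "C", "author": {"name": "Z"}, "year": 2021},
]

variants = schema_variants(docs)
core, specific = core_and_specific(variants)
print("core attributes:", sorted(core))
for schema, extra in specific.items():
    print("documents", variants[schema], "add", sorted(extra))
```

In this sketch the first and third documents share one schema variant, and only the attributes outside the core distinguish the variants from each other.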
The heterogeneous nature of JSON documents increases the complexity of the efficient retrieval of data. Indexes have traditionally been used to improve the speed of data retrieval. Existing indexing techniques for JSON data use a global schema to identify the unique attributes in a collection and support exact (lexical) matching of path-based queries. However, they suffer from large index sizes and long data retrieval times. As JSON schemas are annotated differently, providing semantic support increases the search relevancy. Existing work on the semantic search of JSON documents uses knowledge bases such as WordNet. However, they capture the abstract meaning of JSON attributes rather than their context. To bridge these research gaps, this research proposes efficient and compact index structures, namely JSON Index (JIndex) and Embedding-based JIndex (EJIndex), to support both lexical and semantic matching of path-based queries. With the help of core and schema-specific attributes of schema variants stored in SVTree, the proposed indexes reduce the index size by storing only a subset of attributes rather than all the attributes in a collection. Experimental results demonstrate that the proposed indexes outperform the existing approaches in retrieving both lexically and semantically relevant results, significantly reducing index size and data retrieval time. As JSON documents evolve and change over time, the implicit schemas must be extracted and updated in the database to support dynamic data retrieval. Existing approaches focus either on maintaining the history of schema versions in data lakes or updating the global schema. Nevertheless, the schema variants must be updated to provide the latest documents for the user queries. In this work, we propose an Incremental SchemaEmbed model to generate schema embeddings for new schema variants of the latest documents while preserving the knowledge of old schema variants.
The Incremental Embedding-based Clustering approach assigns the latest documents to the respective clusters based on the contextual similarity of their schema variants. Consequently, the JIndex and EJIndex are updated incrementally to support the retrieval of the latest documents for the user queries. The experimental results on diverse datasets show that the proposed work is efficient in updating the schema variants and the indexes.
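For readers unfamiliar with path-based indexing, the sketch below shows the baseline idea the abstract refers to: exact (lexical) matching of attribute paths, with documents added one at a time. It is a generic inverted index written for illustration only, not JIndex or EJIndex; it indexes every attribute path, whereas the proposed indexes store only a subset. The class name `PathIndex` and the sample documents are assumptions.

```python
from collections import defaultdict


class PathIndex:
    """Generic inverted index from dotted attribute path to document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, document):
        """Index one document; can be called again as new documents arrive."""
        for path in self._paths(document):
            self._postings[path].add(doc_id)

    def lookup(self, path):
        """Return the sorted ids of documents containing exactly this path."""
        return sorted(self._postings.get(path, ()))

    @staticmethod
    def _paths(value, prefix=""):
        # Assumption: array elements share the path of their parent attribute.
        if isinstance(value, dict):
            for key, child in value.items():
                path = f"{prefix}.{key}" if prefix else key
                yield path
                yield from PathIndex._paths(child, path)
        elif isinstance(value, list):
            for item in value:
                yield from PathIndex._paths(item, prefix)


# Hypothetical collection, followed by one later incremental update.
index = PathIndex()
index.add(1, {"user": {"name": "a"}, "tags": [{"label": "x"}]})
index.add(2, {"user": {"email": "b"}})
index.add(3, {"user": {"name": "c"}})  # arrives after the initial build

print(index.lookup("user.name"))
```

A lexical index of this kind returns nothing for a differently named but related path, such as a query on `user.fullname`, which is the gap that semantic matching is meant to address.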
