Abstract: Functional dependency discovery is widely used in distributed big data analysis and is an important tool for data cleaning, quality assessment, and semantic analysis. Existing functional dependency discovery algorithms mainly target centralized data and are poorly suited to cloud computing data distributed across different nodes. Gathering the distributed data onto a single central node is time-consuming, while processing the data on each node separately with a traditional single-machine method may produce inaccurate results. Existing distributed algorithms, in turn, suffer from excessive memory consumption. This paper therefore proposes a fast, low-memory distributed functional dependency discovery algorithm built on Spark, a cloud computing data processing platform. The algorithm introduces multiple distributed task allocation strategies and a deduplication strategy for the elements of maximal equivalence classes, based on identifier set consistency. While preserving correctness, these strategies reduce the number of set intersection operations and speed up processing. Experimental results show that, compared with the traditional centralized algorithm, the proposed distributed algorithm reduces the average execution time by about 50% in the experimental environment used here, and the deduplication strategy reduces the execution time by a further 30% or so. Compared with an existing distributed functional dependency discovery algorithm, the proposed algorithm saves about 75% of memory in some instances.
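
As background for the equivalence classes and set intersection operations mentioned above, the following is a minimal single-machine sketch in Python of the standard partition-based test for a functional dependency. It is not the algorithm proposed in this paper and does not use Spark. The function names (`partition`, `refine`, `holds`) and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row identifiers into equivalence classes by their values on attrs."""
    classes = defaultdict(set)
    for row_id, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        classes[key].add(row_id)
    return list(classes.values())


def refine(p1, p2):
    """Product of two partitions: intersect classes pairwise, keep non-empty ones."""
    out = []
    for c1 in p1:
        for c2 in p2:
            common = c1 & c2  # the set intersection that dominates the cost
            if common:
                out.append(common)
    return out


def holds(rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs splits no equivalence class of lhs."""
    p_lhs = partition(rows, lhs)
    p_both = refine(p_lhs, partition(rows, [rhs]))
    return len(p_lhs) == len(p_both)


# Hypothetical sample data, not taken from the paper's experiments.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
]

zip_determines_city = holds(rows, ["zip"], "city")
name_determines_city = holds(rows, ["name"], "city")
```

In this sketch, `refine` performs one intersection per pair of classes, which illustrates why reducing the number of set intersections matters for running time as the data grows.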