Showing 1102 open source projects for "data quality"

View related business solutions
  • AestheticsPro Medical Spa Software Icon
    AestheticsPro Medical Spa Software

    Our new software release will dramatically improve your medspa business performance while enhancing the customer experience

    AestheticsPro is the most complete Aesthetics Software on the market today. HIPAA Cloud Compliant with electronic charting, integrated POS, targeted marketing and results driven reporting; AestheticsPro delivers the tools you need to manage your medical spa business. It is our mission To Provide an All-in-One Cutting Edge Software to the Aesthetics Industry.
    Learn More
  • Skillfully - The future of skills based hiring Icon
    Skillfully - The future of skills based hiring

    Realistic Workplace Simulations that Show Applicant Skills in Action

    Skillfully transforms hiring through AI-powered skill simulations that show you how candidates actually perform before you hire them. Our platform helps companies cut through AI-generated resumes and rehearsed interviews by validating real capabilities in action. Through dynamic job specific simulations and skill-based assessments, companies like Bloomberg and McKinsey have cut screening time by 50% while dramatically improving hire quality.
    Learn More
  • 1
    DQO Data Quality Operations Center

    DQO Data Quality Operations Center

    Data Quality Operations Center

    DQO is an DataOps friendly data quality monitoring tool with customizable data quality checks and data quality dashboards. DQO comes with around 100 predefined data quality checks which helps you monitor the quality of your data. Table and column-level checks which allows writing your own SQL queries. Daily and monthly date partition testing. Data segmentation by up to 9 different data streams. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Synthetic Data Kit

    Synthetic Data Kit

    Tool for generating high quality Synthetic datasets

    Synthetic Data Kit is a CLI-centric toolkit for generating high-quality synthetic datasets to fine-tune Llama models, with an emphasis on producing reasoning traces and QA pairs that line up with modern instruction-tuning formats. It ships an opinionated, modular workflow that covers ingesting heterogeneous sources (documents, transcripts), prompting models to create labeled examples, and exporting to fine-tuning schemas with minimal glue code.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Data-Juicer

    Data-Juicer

    Data processing for and with foundation models

    Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    Cookiecutter Data Science

    Cookiecutter Data Science

    Project structure for doing and sharing data science work

    A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. When we think about data analysis, we often think just about the resulting reports, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making the products look nice and ignore the quality of the code that generates them. Because these end products are created programmatically, code quality is still important! ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Outbound sales software Icon
    Outbound sales software

    Unified cloud-based platform for dialing, emailing, appointment scheduling, lead management and much more.

    Adversus is an outbound dialing solution that helps you streamline your call strategies, automate manual processes, and provide valuable insights to improve your outbound workflows and efficiency.
    Learn More
  • 5
    Synthetic Data Generator

    Synthetic Data Generator

    SDG is a specialized framework

    Synthetic Data Generator is an open-source framework designed to generate high-quality synthetic tabular datasets that replicate the statistical characteristics of real data while avoiding privacy risks. The platform enables developers and data scientists to create artificial datasets that preserve important relationships between variables without containing sensitive personal information.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Book2_Beauty-of-Data-Visualization

    Book2_Beauty-of-Data-Visualization

    Machine Learning, Criticism and Correction

    ...By combining theory with hands-on plotting exercises, the book helps readers build both analytical and presentation skills. Overall, it is intended as a foundational guide for anyone seeking to produce professional-quality data visualizations.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    Qualitis

    Qualitis

    Qualitis is a one-stop data quality management platform

    Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasource. It is used to solve various data quality problems caused by data processing. Based on Spring Boot, Qualitis submits quality model task to Linkis platform. It provides functions such as data quality model construction, data quality model execution, data quality verification, reports of data quality generation and so on. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    data-diff

    data-diff

    Efficiently diff rows across two different databases

    ...Replicating data at scale, across hundreds of tables, with low latency and at a reasonable infrastructure cost is a hard problem, and most data teams we’ve talked to, have faced data quality issues in their replication processes. The hard truth is that the quality of the replication is the quality of the data. Since copying entire datasets in batch is often infeasible at the modern data scale, businesses rely on the Change Data Capture (CDC) approach of replicating data using a continuous stream of updates.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    DataQualityDashboard

    DataQualityDashboard

    A tool to help improve data quality standards in data science

    The quality checks were organized according to the Kahn Framework1 which uses a system of categories and contexts that represent strategies for assessing data quality. Using this framework, the Data Quality Dashboard takes a systematic-based approach to running data quality checks. Instead of writing thousands of individual checks, we use “data quality check types”.
    Downloads: 0 This Week
    Last Update:
    See Project
  • SoftCo: Enterprise Invoice and P2P Automation Software Icon
    SoftCo: Enterprise Invoice and P2P Automation Software

    For companies that process over 20,000 invoices per year

    SoftCo Accounts Payable Automation processes all PO and non-PO supplier invoices electronically from capture and matching through to invoice approval and query management. SoftCoAP delivers unparalleled touchless automation by embedding AI across matching, coding, routing, and exception handling to minimize the number of supplier invoices requiring manual intervention. The result is 89% processing savings, supported by a context-aware AI Assistant that helps users understand exceptions, answer questions, and take the right action faster.
    Learn More
  • 10
    CSV Lint

    CSV Lint

    CSV Lint plug-in for Notepad++ for syntax highlighting

    CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files. Use CSV Lint for metadata discovery, technical data validation, and reformatting on tabular data files. It is not meant to be a replacement for spreadsheet programs like Excel or SPSS, but rather it's a quality control tool to examine, verify or polish up a dataset before further processing.
    Downloads: 36 This Week
    Last Update:
    See Project
  • 11
    lakeFS

    lakeFS

    lakeFS - Git-like capabilities for your object storage

    ...Easily Collaborate on production data with your team. Automate data quality checks within data pipelines.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    Arize Phoenix

    Arize Phoenix

    Uncover insights, surface problems, monitor, and fine tune your LLM

    Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality. Phoenix is an Open Source ML Observability library designed for the Notebook. The toolset is designed to ingest model inference data for LLMs, CV, NLP and tabular datasets. It allows Data Scientists to quickly visualize their model data, monitor performance, track down issues & insights, and easily export to improve. Deep Learning Models (CV, LLM, and Generative) are an amazing technology that will power many of future ML use cases. ...
    Downloads: 15 This Week
    Last Update:
    See Project
  • 13
    Encord Active

    Encord Active

    The toolkit to test, validate, and evaluate your models and surface

    Encord Active is an open-source toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling to supercharge model performance. Encord Active has been designed as a all-in-one open source toolkit for improving your data quality and model performance. Use the intuitive UI to explore your data or access all the functionalities programmatically. Discover errors, outliers, and edge-cases within your data - all in one open source toolkit. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Synthetic Data Vault (SDV)

    Synthetic Data Vault (SDV)

    Synthetic Data Generation for tabular, relational and time series data

    The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset. Synthetic data can then be used to supplement, augment and in some cases replace real data when training Machine Learning models. Additionally, it enables the testing of Machine Learning or other data dependent...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Dagster

    Dagster

    An orchestration platform for the development, production

    Dagster is an orchestration platform for the development, production, and observation of data assets. Dagster as a productivity platform: With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early. Dagster as a robust orchestration engine: Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 16
    FiftyOne

    FiftyOne

    The open-source tool for building high-quality datasets

    The open-source tool for building high-quality datasets and computer vision models. Nothing hinders the success of machine learning systems more than poor-quality data. And without the right tools, improving a model can be time-consuming and inefficient. FiftyOne supercharges your machine learning workflows by enabling you to visualize datasets and interpret models faster and more effectively.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Diffgram

    Diffgram

    Training data (data labeling, annotation, workflow) for all data types

    From ingesting data to exploring it, annotating it, and managing workflows. Diffgram is a single application that will improve your data labeling and bring all aspects of training data under a single roof. Diffgram is world’s first truly open source training data platform that focuses on giving its users an unlimited experience. This is aimed to reduce your data labeling bills and increase your Training Data Quality.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 18
    Mumble

    Mumble

    Mumble is an open-source, low-latency, high quality voice chat

    Mumble is an open-source, low-latency, high-quality voice chat software. There are two modules in Mumble; the client (mumble) and the server (murmur). The client works on Windows, Linux, FreeBSD, OpenBSD, and macOS, while the server should work on anything Qt can be installed on. Low-latency and high-quality voice-chat program written on top of Qt and Opus. Administrators appreciate Mumble for being able to self-host and have control over data security and privacy. ...
    Downloads: 21 This Week
    Last Update:
    See Project
  • 19
    Gretel Synthetics

    Gretel Synthetics

    Synthetic data generators for structured and unstructured text

    Unlock unlimited possibilities with synthetic data. Share, create, and augment data with cutting-edge generative AI. Generate unlimited data in minutes with synthetic data delivered as-a-service. Synthesize data that are as good or better than your original dataset, and maintain relationships and statistical insights. Customize privacy settings so that data is always safe while remaining useful for downstream workflows. Ensure data accuracy and privacy confidently with expert-grade reports....
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    CleanVision

    CleanVision

    Automatically find issues in image datasets

    CleanVision automatically detects potential issues in image datasets like images that are: blurry, under/over-exposed, (near) duplicates, etc. This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning. CleanVision is super simple -- run the same couple lines of Python code to audit any image dataset! The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Deequ

    Deequ

    Deequ is a library built on top of Apache Spark

    ...It also includes a little domain-specific language called DQDL (Data Quality Definition Language) which allows declarative specification of quality rules. Users typically run Deequ before feeding data downstream (to ML pipelines, analytics, or production systems), enabling early detection and isolation of data errors. There is also a Python wrapper, PyDeequ, for users who prefer working from Python environments.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    ODD Platform

    ODD Platform

    First open-source data discovery and observability platform

    ...Know the impact of each code change with automatic testing. Enjoy lineage and alerts powered with data quality information.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    SDGym

    SDGym

    Benchmarking synthetic data generation methods

    ...You also customize the process to include your own work. Select any of the publicly available datasets from the SDV project, or input your own data. Choose from any of the SDV synthesizers and baselines. Or write your own custom machine learning model. In addition to performance and memory usage, you can also measure synthetic data quality and privacy through a variety of metrics. Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 24
    Cleanlab

    Cleanlab

    The standard data-centric AI package for data quality and ML

    cleanlab helps you clean data and labels by automatically detecting issues in a ML dataset. To facilitate machine learning with messy, real-world data, this data-centric AI package uses your existing models to estimate dataset problems that can be fixed to train even better models. cleanlab cleans your data's labels via state-of-the-art confident learning algorithms, published in this paper and blog. See some of the datasets cleaned with cleanlab at labelerrors.com. This package helps you...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25
    Pandas Profiling

    Pandas Profiling

    Create HTML profiling reports from pandas DataFrame objects

    ...Mostly global details about the dataset (number of records, number of variables, overall missigness and duplicates, memory footprint). Comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, between others).
    Downloads: 1 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB