
Data Engineer

Full-Time

Remote

Apply Using Form Below

Overview:

Join Azra AI on its mission to improve healthcare through innovative applications of natural language processing (NLP). At Azra AI, we enable health systems to enhance clinical workflows by analyzing pathology and radiology reports in real-time, identifying the presence and type of cancer, and automating registry abstraction through text extraction. These reports are presented to clinicians in an intuitive workflow tool, allowing them to provide timely care to patients while focusing on what they do best—saving lives.

About the Data Engineer Role:

We are seeking a Data Engineer to design, implement, and optimize a modern cloud-based data platform using Google BigQuery and GCP-native tools. This role will be responsible for transforming data into high-quality, structured datasets to enable self-service analytics in Tableau and other BI tools. 

You will ensure that our BigQuery data warehouse is scalable, cost-efficient, and aligned with business intelligence needs. 

Key Responsibilities:

  • BigQuery Data Warehouse Management and Operations 

    • Design and implement scalable data pipelines using GCP-native tools. 
    • Develop real-time and batch data pipelines using Dataflow (Apache Beam) and Pub/Sub for streaming and structured data ingestion. 
    • Optimize performance with BigQuery partitioning, clustering, materialized views, and optimized SQL transformations (a brief sketch follows this list). 
    • Automate and schedule workflows with tools like dbt/Dataform, Airflow/Composer, and/or Cloud Workflows. 
    • Define and manage fact tables (transactions, events, KPIs) and dimension tables (customers, providers, hospitals, products, locations). 
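
    A minimal sketch of the table design and optimization work described above, using the BigQuery Python client. The dataset, table, and column names (analytics.fact_events, hospital_id, event_ts) are illustrative assumptions, not an existing schema.

        from google.cloud import bigquery

        client = bigquery.Client()  # authenticates with application-default credentials

        # Hypothetical fact table: date-partitioned on the event timestamp and
        # clustered on the columns most often filtered in BI queries.
        ddl = """
        CREATE TABLE IF NOT EXISTS analytics.fact_events (
          event_id    STRING,
          hospital_id STRING,
          event_type  STRING,
          event_ts    TIMESTAMP
        )
        PARTITION BY DATE(event_ts)
        CLUSTER BY hospital_id, event_type
        """
        client.query(ddl).result()  # run the DDL and wait for completion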

     

  • Streaming & Real-Time Analytics 

    • Develop streaming ingestion pipelines using Dataflow (Apache Beam) and Pub/Sub (a brief sketch follows this list). 
    • Enable event-driven transformations for real-time data processing. 
    • Optimize performance of real-time dashboards in Tableau, Looker, or Data Studio, balancing both our compute [ financial ] costs and the dashboard-user experience. 
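
    As a rough sketch of the streaming ingestion described above: a Dataflow (Apache Beam) pipeline reading JSON messages from Pub/Sub and appending them to BigQuery. The topic and table names are assumptions, and the destination table is assumed to already exist.

        import json

        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        TOPIC = "projects/example-project/topics/report-events"  # assumed topic
        TABLE = "example-project:analytics.report_events"        # assumed table

        options = PipelineOptions(streaming=True)

        with beam.Pipeline(options=options) as pipeline:
            (
                pipeline
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
                | "ParseJson" >> beam.Map(json.loads)  # payloads assumed to be JSON
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    TABLE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )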

     

  • Data Governance, Quality & Security 

    • Implement schema validation, deduplication, anomaly detection, and reconciliation across multiple sources (see the deduplication sketch after this list). 
    • Define access controls, row-level security (RLS), and column-level encryption to ensure data protection and compliance. 
    • Maintain data lineage and metadata tracking using tools like OpenLineage and Dataplex Catalog. 
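
    One common shape for the deduplication work mentioned above is keep-latest-record selection keyed on a business identifier. The sketch below assumes hypothetical staging and output tables and an ingest_ts column.

        from google.cloud import bigquery

        client = bigquery.Client()

        # Keep only the most recent row per event_id (all names are illustrative).
        dedup_sql = """
        CREATE OR REPLACE TABLE analytics.reports_clean AS
        SELECT * EXCEPT (row_num)
        FROM (
          SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingest_ts DESC) AS row_num
          FROM analytics.reports_staging
        )
        WHERE row_num = 1
        """
        client.query(dedup_sql).result()  # run the query and wait for completion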

     

  • Optimize & Automate Data Pipelines 

    • Develop incremental data refresh strategies to optimize cost and performance. 
    • Automate data transformation workflows with dbt, Dataform, Cloud Composer (Apache Airflow), and Python (a brief sketch follows this list). 
    • Monitor pipeline performance and cloud cost efficiency with Cloud Logging, Monitoring, and BigQuery BI Engine. 
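
    A minimal sketch of the orchestration side, assuming dbt is invoked from a Cloud Composer (Airflow) environment; the DAG id, schedule, and project path are placeholders.

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator

        # Nightly dbt build orchestrated by Composer/Airflow (paths are assumptions).
        with DAG(
            dag_id="nightly_dbt_build",
            start_date=datetime(2024, 1, 1),
            schedule_interval="0 5 * * *",  # daily at 05:00 UTC
            catchup=False,
        ) as dag:
            BashOperator(
                task_id="dbt_build",
                bash_command="cd /home/airflow/gcs/data/dbt && dbt build --profiles-dir .",
            )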

     

  • Enable Self-Service BI & Analytics 

    • Ensure that tables and views are structured for fast and efficient queries in Tableau, Looker, and self-service BI tools. 
    • Work with data analysts to optimize SQL queries, views, and datasets for reporting. 
    • Provide data documentation and best practices to business teams for efficient self-service analytics. 
    • Collaborate with data producers so that data is well understood at the point of production, ahead of ingestion. 
    • Curate and maintain data dictionaries and a data catalog so users understand what they are accessing. 

Qualifications:

  • Experience in Data Architecture & Engineering 

    • 2+ years of experience in analytics/data engineering, cloud data architecture, or ELT development. 
    • Strong hands-on experience with SQL and cloud-based data processing. 
    • Hands-on development experience with Python [ or other programming language(s) ]. 

     

  • Expertise in GCP & BigQuery Data Processing 

    • Deep understanding of ELT/ETL principles. 
    • Proficiency in dbt, Dataform, or SQL-based transformation tools for data modeling. 
    • Experience with GCP services: BigQuery, Dataflow (Apache Beam), Pub/Sub, Cloud Storage, and Cloud Functions. 

     

  • BigQuery Optimization & Performance Tuning 

    • Experience optimizing BigQuery partitioning, clustering, materialized views, and query performance. 
    • Expertise in cost-efficient query design and workload optimization strategies. 

     

  • Experience in Streaming & Real-Time Processing 

    • Hands-on experience with streaming data pipelines using Dataflow (Apache Beam), Apache Flink, Pub/Sub, or Kafka. 
    • Familiarity with real-time data transformations and event-driven architectures. 

     

  • Experience Supporting BI & Analytics 

    • Strong knowledge of Tableau, Looker, and other BI tools, with a focus on well-optimized reporting. 
    • Ability to collaborate with data analysts and business teams to define data models and metrics. 

     

Bonus Skills (Preferred but Not Required):

  • Knowledge of Cloud Composer (Apache Airflow) for data orchestration. 
  • Familiarity with AI/ML model deployment and machine learning pipelines in GCP Vertex AI, Jupyter Notebooks, Pandas, etc. 
  • Understanding of and experience with development/deployment patterns: dependency management, CI/CD, testing, code quality, devcontainers or nixpkgs, poetry/uv. 
  • Programming abilities beyond Python: Golang and/or Java/Kotlin/JVM. 
  • Database administration experience with varied database systems [ NoSQL, graph, etc. ]. 

Why Join Azra AI? 

  • Work on a next-generation data platform built on Google BigQuery and GCP-native tools. 
  • Drive real-time data processing and self-service BI enablement in Tableau, Looker, and advanced analytics. 
  • Work with modern cloud-based technologies such as BigQuery, dbt, Dataflow, and Cloud Functions. 
  • Fully remote opportunity with a high-impact data engineering role.