{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "1b737711",
      "metadata": {},
      "source": [
        "# Using MongoDB filters\n",
        "\n",
        "This guide shows how to filter NMDC records from Python with NMDC API Utilities.\n",
        "Filters use MongoDB query syntax and can be passed directly as JSON strings or built with helper methods.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "27c22901",
      "metadata": {},
      "source": [
        "## Quick Start"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "93b36405",
      "metadata": {},
      "outputs": [],
      "source": [
        "from nmdc_api_utilities import BiosampleSearch\n",
        "\n",
        "client = BiosampleSearch()\n",
        "results = client.get_record_by_filter('{\"id\": \"nmdc:bsm-11-006pnx90\"}')\n",
        "\n",
        "print(f\"Records found: {len(results)}\")\n",
        "if results:\n",
        "    print(f\"First record ID: {results[0].get('id')}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "04521778",
      "metadata": {},
      "source": [
        "## Filter Formats\n",
        "\n",
        "MongoDB-style JSON filter strings are accepted by `get_record_by_filter` and `get_records`.\n",
        "\n",
        "Examples:\n",
        "\n",
        "- Exact match: `{\"id\": \"nmdc:sty-11-8fb6t785\"}`\n",
        "- Case-insensitive partial match: `{\"name\": {\"$regex\": \"forest\", \"$options\": \"i\"}}`\n",
        "- Multiple criteria (implicit AND): `{\"ecosystem_category\": \"Plants\", \"lat_lon\": {\"$exists\": true}}`\n",
        "- Nested field (dot notation): `{\"env_broad_scale.has_raw_value\": \"Forest biome\"}`"
      ]
    },
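    {
      "cell_type": "markdown",
      "id": "c4e1a0f2",
      "metadata": {},
      "source": [
        "Writing filter strings by hand makes quoting mistakes easy. One possible approach, sketched here with only the standard library, is to write the filter as a Python dictionary and serialize it with `json.dumps`, which emits double quotes and converts Python `True` to JSON `true`. The variable names are illustrative and nothing in this cell calls the NMDC API; the resulting string can be passed wherever this guide passes a filter string."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9d3b7e61",
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "# Write the filter as a Python dict, then serialize it to a JSON string.\n",
        "filter_dict = {\n",
        "    \"ecosystem_category\": \"Plants\",\n",
        "    \"lat_lon\": {\"$exists\": True},  # Python True is serialized as JSON true\n",
        "}\n",
        "filter_str = json.dumps(filter_dict)\n",
        "\n",
        "print(filter_str)"
      ]
    },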
    {
      "cell_type": "markdown",
      "id": "d785429f",
      "metadata": {},
      "source": [
        "## Supported MongoDB Operators\n",
        "\n",
        "- `$regex` for pattern matching\n",
        "- `$options` for regex options (for example, `\"i\"`)\n",
        "- `$exists` for field presence\n",
        "- `$in` for matching any value in an array\n",
        "- `$gte` and `$lte` for range filters\n",
        "- `$and` and `$or` for compound logic"
      ]
    },
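    {
      "cell_type": "markdown",
      "id": "5a8f2c17",
      "metadata": {},
      "source": [
        "The operators listed above can be combined in one filter. The next cell is a minimal sketch that only builds and prints a filter string; it does not query the API. The field `depth.has_numeric_value`, the numeric bounds, and the second `ecosystem_category` value are assumptions chosen for illustration, so check them against the schema documentation before relying on them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e07b49d3",
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "# Illustrative compound filter:\n",
        "# (category is one of a list) AND (lat_lon is present OR depth is within a range).\n",
        "compound = {\n",
        "    \"$and\": [\n",
        "        {\"ecosystem_category\": {\"$in\": [\"Plants\", \"Terrestrial\"]}},\n",
        "        {\n",
        "            \"$or\": [\n",
        "                {\"lat_lon\": {\"$exists\": True}},\n",
        "                # Assumed field name and bounds, for illustration only.\n",
        "                {\"depth.has_numeric_value\": {\"$gte\": 0, \"$lte\": 10}},\n",
        "            ]\n",
        "        },\n",
        "    ]\n",
        "}\n",
        "compound_filter = json.dumps(compound)\n",
        "\n",
        "print(compound_filter)"
      ]
    },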
    {
      "cell_type": "markdown",
      "id": "0ea5a586",
      "metadata": {},
      "source": [
        "## Direct Filter Usage"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ddaac20e",
      "metadata": {},
      "outputs": [],
      "source": [
        "from nmdc_api_utilities import BiosampleSearch\n",
        "\n",
        "client = BiosampleSearch()\n",
        "\n",
        "filter_str = '{\"name\": {\"$regex\": \"forest\", \"$options\": \"i\"}}'\n",
        "records = client.get_record_by_filter(filter_str)\n",
        "\n",
        "print(f\"Matching biosamples: {len(records)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "17bd5be0",
      "metadata": {},
      "source": [
        "## Build Filters Programmatically\n",
        "\n",
        "Use `DataProcessing.build_filter` to create filter strings from Python dictionaries.\n",
        "By default, it builds case-insensitive regex filters and escapes special characters; pass `exact_match=True` for strict equality."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a32233a7",
      "metadata": {},
      "outputs": [],
      "source": [
        "from nmdc_api_utilities import BiosampleSearch, DataProcessing\n",
        "\n",
        "client = BiosampleSearch()\n",
        "dp = DataProcessing()\n",
        "\n",
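        "# Default: case-insensitive regex match; special characters such as \"(\" and \")\" are escaped.\n",
        "filter_str = dp.build_filter({\"name\": \"GC-MS (2009)\"})\n",
        "records = client.get_record_by_filter(filter_str)\n",
        "\n",
        "# exact_match=True: strict equality instead of a regex.\n",
        "exact_filter = dp.build_filter({\"ecosystem_category\": \"Plants\"}, exact_match=True)\n",
        "exact_records = client.get_record_by_filter(exact_filter)\n",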
        "\n",
        "print(f\"Regex-style matches: {len(records)}\")\n",
        "print(f\"Exact matches: {len(exact_records)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "71a55345",
      "metadata": {},
      "source": [
        "## Attribute-Based Query Helper\n",
        "\n",
        "For straightforward attribute lookups, use `get_record_by_attribute`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8ca80548",
      "metadata": {},
      "outputs": [],
      "source": [
        "from nmdc_api_utilities import StudySearch\n",
        "\n",
        "client = StudySearch()\n",
        "\n",
        "# Default behavior: partial match on the attribute value.\n",
        "partial = client.get_record_by_attribute(\n",
        "    attribute_name=\"name\",\n",
        "    attribute_value=\"tropical soil\",\n",
        ")\n",
        "\n",
        "# exact_match=True: the attribute value must match exactly.\n",
        "exact = client.get_record_by_attribute(\n",
        "    attribute_name=\"ecosystem_category\",\n",
        "    attribute_value=\"Plants\",\n",
        "    exact_match=True,\n",
        ")\n",
        "\n",
        "print(f\"Partial matches: {len(partial)}\")\n",
        "print(f\"Exact matches: {len(exact)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a23863f7",
      "metadata": {},
      "source": [
        "## Pagination and Performance\n",
        "\n",
        "- Use `max_page_size` to limit how many records each page returns during iterative exploration.\n",
        "- Use `all_pages=True` only when a full export is required.\n",
        "- Use narrow filters, and request only the fields you need through the `fields` parameter."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "793ccab0",
      "metadata": {},
      "outputs": [],
      "source": [
        "from nmdc_api_utilities.biosample_search import BiosampleSearch\n",
        "\n",
        "client = BiosampleSearch()\n",
        "\n",
        "records = client.get_records(\n",
        "    filter='{\"ecosystem_category\": \"Plants\"}',\n",
        "    fields=\"id,name,lat_lon\",  # return only these fields\n",
        "    max_page_size=50,  # at most 50 records per page\n",
        "    all_pages=False,  # do not follow pagination\n",
        ")\n",
        "\n",
        "print(f\"Page records fetched: {len(records)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2e29423a",
      "metadata": {},
      "source": [
        "## Troubleshooting\n",
        "\n",
        "Filter returns no results:\n",
        "\n",
        "- Confirm field names against schema documentation.\n",
        "- Try a case-insensitive regex (`\"$options\": \"i\"`) instead of strict equality.\n",
        "- Confirm the selected collection class matches your target records.\n",
        "\n",
        "JSON syntax errors:\n",
        "\n",
        "- Use double quotes for keys and string values; JSON does not accept single quotes.\n",
        "- Validate the JSON structure before passing filter strings, for example with Python's `json.loads`.\n",
        "\n",
        "Special character issues:\n",
        "\n",
        "- Prefer `build_filter` for automatic escaping.\n",
        "- If writing raw regex filters, escape regex metacharacters such as `(`, `)`, and `.` with a backslash, and double that backslash inside the JSON string."
      ]
    }
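    ,
    {
      "cell_type": "markdown",
      "id": "b26c8f05",
      "metadata": {},
      "source": [
        "The JSON and special-character checks above can be sketched with the standard library. `check_filter` and `literal_regex_filter` are hypothetical helpers written for this guide, not part of `nmdc_api_utilities`. The second helper uses Python's `re.escape`, on the assumption that its escaping is also accepted by the server-side regex engine. Nothing in this cell calls the NMDC API."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3f91d6a8",
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "import re\n",
        "\n",
        "\n",
        "def check_filter(filter_str):\n",
        "    \"\"\"Return the parsed filter, or raise ValueError with the parser's message.\"\"\"\n",
        "    try:\n",
        "        return json.loads(filter_str)\n",
        "    except json.JSONDecodeError as err:\n",
        "        raise ValueError(f\"Invalid filter JSON: {err}\") from err\n",
        "\n",
        "\n",
        "def literal_regex_filter(field, text):\n",
        "    \"\"\"Build a case-insensitive regex filter that matches `text` literally.\"\"\"\n",
        "    return json.dumps({field: {\"$regex\": re.escape(text), \"$options\": \"i\"}})\n",
        "\n",
        "\n",
        "# A raw regex filter with metacharacters escaped, validated before use.\n",
        "raw_filter = literal_regex_filter(\"name\", \"GC-MS (2009)\")\n",
        "print(check_filter(raw_filter))\n",
        "\n",
        "try:\n",
        "    check_filter(\"{'name': 'forest'}\")  # single quotes are not valid JSON\n",
        "except ValueError as err:\n",
        "    print(err)"
      ]
    }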
  ],
  "metadata": {
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}