{ "cells": [ { "cell_type": "markdown", "id": "1b737711", "metadata": {}, "source": [ "# Using MongoDB filters\n", "\n", "This guide covers Python-first filtering with NMDC API Utilities.\n", "Filters use MongoDB query syntax and can be passed directly or built with helper methods.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "27c22901", "metadata": {}, "source": [ "## Quick Start" ] }, { "cell_type": "code", "execution_count": null, "id": "93b36405", "metadata": {}, "outputs": [], "source": [ "from nmdc_api_utilities import BiosampleSearch\n", "\n", "client = BiosampleSearch()\n", "results = client.get_record_by_filter('{\"id\": \"nmdc:bsm-11-006pnx90\"}')\n", "\n", "print(f\"Records found: {len(results)}\")\n", "if results:\n", " print(f\"First record ID: {results[0].get('id')}\")" ] }, { "cell_type": "markdown", "id": "04521778", "metadata": {}, "source": [ "## Filter Formats\n", "\n", "MongoDB-style JSON filter strings are accepted by `get_record_by_filter` and `get_records`.\n", "\n", "Examples:\n", "\n", "- Exact match: `{\"id\": \"nmdc:sty-11-8fb6t785\"}`\n", "- Case-insensitive partial match: `{\"name\": {\"$regex\": \"forest\", \"$options\": \"i\"}}`\n", "- Multiple criteria (implicit AND): `{\"ecosystem_category\": \"Plants\", \"lat_lon\": {\"$exists\": true}}`\n", "- Nested field (dot notation): `{\"env_broad_scale.has_raw_value\": \"Forest biome\"}`" ] }, { "cell_type": "markdown", "id": "d785429f", "metadata": {}, "source": [ "## Supported MongoDB Operators\n", "\n", "- `$regex` for pattern matching\n", "- `$options` for regex options (for example, `\"i\"`)\n", "- `$exists` for field presence\n", "- `$in` for matching any value in an array\n", "- `$gte` and `$lte` for range filters\n", "- `$and` and `$or` for compound logic" ] }, { "cell_type": "markdown", "id": "0ea5a586", "metadata": {}, "source": [ "## Direct Filter Usage" ] }, { "cell_type": "code", "execution_count": null, "id": "ddaac20e", "metadata": {}, "outputs": [], "source": [ "from 
nmdc_api_utilities import BiosampleSearch\n", "\n", "client = BiosampleSearch()\n", "\n", "filter_str = '{\"name\": {\"$regex\": \"forest\", \"$options\": \"i\"}}'\n", "records = client.get_record_by_filter(filter_str)\n", "\n", "print(f\"Matching biosamples: {len(records)}\")" ] }, { "cell_type": "markdown", "id": "17bd5be0", "metadata": {}, "source": [ "## Build Filters Programmatically\n", "\n", "Use `DataProcessing.build_filter` to create filters from Python dictionaries.\n", "By default, it builds case-insensitive regex filters and escapes special characters." ] }, { "cell_type": "code", "execution_count": null, "id": "a32233a7", "metadata": {}, "outputs": [], "source": [ "from nmdc_api_utilities import BiosampleSearch, DataProcessing\n", "\n", "client = BiosampleSearch()\n", "dp = DataProcessing()\n", "\n", "filter_str = dp.build_filter({\"name\": \"GC-MS (2009)\"})\n", "records = client.get_record_by_filter(filter_str)\n", "\n", "exact_filter = dp.build_filter({\"ecosystem_category\": \"Plants\"}, exact_match=True)\n", "exact_records = client.get_record_by_filter(exact_filter)\n", "\n", "print(f\"Regex-style matches: {len(records)}\")\n", "print(f\"Exact matches: {len(exact_records)}\")" ] }, { "cell_type": "markdown", "id": "71a55345", "metadata": {}, "source": [ "## Attribute-Based Query Helper\n", "\n", "For straightforward attribute lookups, use `get_record_by_attribute`." 
] }, { "cell_type": "code", "execution_count": null, "id": "8ca80548", "metadata": {}, "outputs": [], "source": [ "from nmdc_api_utilities import StudySearch\n", "\n", "client = StudySearch()\n", "\n", "partial = client.get_record_by_attribute(\n", " attribute_name=\"name\",\n", " attribute_value=\"tropical soil\",\n", ")\n", "\n", "exact = client.get_record_by_attribute(\n", " attribute_name=\"ecosystem_category\",\n", " attribute_value=\"Plants\",\n", " exact_match=True,\n", ")\n", "\n", "print(f\"Partial matches: {len(partial)}\")\n", "print(f\"Exact matches: {len(exact)}\")" ] }, { "cell_type": "markdown", "id": "a23863f7", "metadata": {}, "source": [ "## Pagination and Performance\n", "\n", "- Use `max_page_size` to tune result size for iterative exploration.\n", "- Use `all_pages=True` only when full export is required.\n", "- Use narrow filters and projection fields where possible." ] }, { "cell_type": "code", "execution_count": null, "id": "793ccab0", "metadata": {}, "outputs": [], "source": [ "from nmdc_api_utilities.biosample_search import BiosampleSearch\n", "\n", "client = BiosampleSearch()\n", "\n", "records = client.get_records(\n", " filter='{\"ecosystem_category\": \"Plants\"}',\n", " fields=\"id,name,lat_lon\",\n", " max_page_size=50,\n", " all_pages=False,\n", ")\n", "\n", "print(f\"Page records fetched: {len(records)}\")" ] }, { "cell_type": "markdown", "id": "2e29423a", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "Filter returns no results:\n", "\n", "- Confirm field names against schema documentation.\n", "- Try regex + `$options: i` instead of strict equality.\n", "- Confirm the selected collection class matches your target records.\n", "\n", "JSON syntax errors:\n", "\n", "- Use double quotes for keys and string values.\n", "- Validate JSON structure before passing filter strings.\n", "\n", "Special character issues:\n", "\n", "- Prefer `build_filter` for automatic escaping.\n", "- If writing raw regex filters, escape special 
characters such as `(`, `)`, and `.` so they are matched literally." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }