Author Archives: david

Notes for logging with Python on App Engine

These are my notes for how logging works for Python applications on Google App Engine and how it integrates with Cloud Logging.

I used the logging_tree package to understand how Python’s logging system gets initialized in different scenarios. Read Brandon Rhodes’ article introducing logging_tree.

Logging on the Python 2.7 runtime

Python’s logging system is automatically configured to use the handler in the google.appengine.api.logservice package from the SDK. No code is required in your application to enable this integration.
The default logging level is DEBUG.
Messages are buffered for a maximum of 60 seconds.
The log name is "projects/[PROJECT-ID]/logs/appengine.googleapis.com%2Frequest_log".
Messages can have multiple lines (records), which can be used by the handler to combine more than one log record.

Logging on the Python 3.9 runtime (default setup)

No automatic configuration for Python logging.
The default handler is the STDERR handler.
The default logging level is WARNING (this is the default level for Python’s logging module).
App Engine reads the application output from STDERR and sends it to Cloud Logging.
The log name is "projects/[PROJECT-ID]/logs/stderr".
Messages are created with a "textPayload" but with no other structured information from the Python log record.
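A quick way to check those defaults using only the standard library. This is a sketch, and it assumes the runtime ends up relying on `logging.lastResort`, the handler Python falls back to when no handler has been configured:

```python
import logging
import sys

# The "handler of last resort" is what Python uses when nothing has
# configured logging. It writes to STDERR at WARNING and above.
last_resort = logging.lastResort

default_stream_is_stderr = last_resort.stream is sys.stderr
default_level = logging.getLevelName(last_resort.level)

print(default_stream_is_stderr, default_level)
```

Anything logged below WARNING is dropped before it reaches STDERR, so it never reaches Cloud Logging either.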

Logging with google-cloud-logging on the Python 3.9 runtime

Add the google-cloud-logging package to "requirements.txt".
Enable it with import google.cloud.logging; google.cloud.logging.Client().setup_logging().
The default logging level is INFO.
Use setup_logging(log_level=logging.DEBUG) to set a DEBUG level.
The log name is "projects/[PROJECT-ID]/logs/app".
Messages are created with a "jsonPayload" and with the correct log level (the "severity" field in log records).
If Flask is installed, the logging handler gets the trace ID from the request.
If Django is installed, and you enabled google.cloud.logging.handlers.middleware.RequestMiddleware, the logging handler gets the trace ID from the request.

Logging for applications that don’t use Flask or Django

For Python applications on App Engine, the important thing is to enable the AppEngineHandler logging handler provided by the google-cloud-logging package when the application starts:

# main.py
import google.cloud.logging

google.cloud.logging.Client().setup_logging()

This will give you log messages such that you can filter by level in the logs explorer.

However if your application does not use Flask or Django, log messages will not have the request’s trace ID, and that makes it harder to identify which messages are associated with a request.

I wrote a demo application with a logging handler to use with the Bottle web framework. The handler adds the correct logging trace ID and other request information.
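This is not the demo application's code, just a minimal sketch of the idea. It assumes the trace arrives in the `X-Cloud-Trace-Context` request header (formatted as `TRACE_ID/SPAN_ID;o=OPTIONS`) and that Cloud Logging wants the trace named `projects/[PROJECT-ID]/traces/[TRACE-ID]`. The `get_header` callable and the `record.trace` attribute are hypothetical names for illustration; how the value reaches your handler depends on the version of google-cloud-logging you use.

```python
import logging
import re

# Header format: TRACE_ID/SPAN_ID;o=OPTIONS (span and options are optional).
_TRACE_RE = re.compile(r'^(?P<trace_id>[0-9a-f]+)', re.IGNORECASE)


def trace_name(header_value, project_id):
    """Return 'projects/<project>/traces/<id>', or None if there is no trace."""
    match = _TRACE_RE.match(header_value or '')
    if match is None:
        return None
    return f"projects/{project_id}/traces/{match.group('trace_id')}"


class TraceFilter(logging.Filter):
    """Annotate every log record with the current request's trace.

    get_header is a hypothetical callable that returns the trace header
    for the request being handled (or None outside a request).
    """

    def __init__(self, get_header, project_id):
        super().__init__()
        self.get_header = get_header
        self.project_id = project_id

    def filter(self, record):
        record.trace = trace_name(self.get_header(), self.project_id)
        return True  # Never drop the record, only annotate it.
```

A filter keeps the request lookup in one place, so the rest of the application can keep calling `logging.info(...)` as usual.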

It does not take much code to extend the logging system this way, but reading the source code for google-cloud-logging reminded me that sending messages to the logging system has an overhead and can only make your application a little slower. Make sure to use the logging API features that avoid unnecessary work (loggers, levels and positional message arguments) and even better just don’t log a message unless you know you will need it to debug a problem later.
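To illustrate the point about positional message arguments, here is a small sketch using only the standard library. The `Expensive` class and the `demo.lazy` logger name are made up for the example. When the level is disabled, arguments passed separately are never rendered to a string, whereas a message formatted up front always pays the cost.

```python
import logging


class Expensive:
    """Counts how many times it is rendered to a string."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return 'expensive value'


logger = logging.getLogger('demo.lazy')
logger.setLevel(logging.INFO)  # DEBUG messages are disabled.
logger.propagate = False
logger.addHandler(logging.NullHandler())

lazy = Expensive()
logger.debug('state: %s', lazy)  # Dropped before any formatting happens.

eager = Expensive()
logger.debug('state: %s' % eager)  # Formatted first, then dropped.

print(lazy.renders, eager.renders)
```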

Find your website’s URL on Cloud Run

If you have a Python app on Google Cloud Run, how can your app determine its own website URL?

When you deploy to Cloud Run, you specify a service name, and every app deployed to Cloud Run gets a unique URL. The domain in the URL looks something like "my-foo-service-8oafjf26aq-uc.a.run.app". That part in the middle is weird ("8oafjf26aq" in my example), and until you have deployed your first service in a project, it is not obvious how to determine what your app’s domain will be.

Here’s one way for your app to discover its own URL after it is deployed, saving you having to hard-code the value somewhere:

# Tested with Python 3.7.
import os

# Requires google-api-python-client google-auth
import googleapiclient.discovery
import google.auth
import google.auth.exceptions

def get_project_id():
    """Find the GCP project ID when running on Cloud Run."""
    try:
        _, project_id = google.auth.default()
    except google.auth.exceptions.DefaultCredentialsError:
        # Probably running a local development server.
        project_id = os.environ.get('GOOGLE_CLOUD_PROJECT', 'development')

    return project_id

def get_service_url():
    """Return the URL for this service, depending on the environment.

    For local development, this will be http://localhost:8080/. On Cloud Run
    this is https://{service}-{hash}-{region}.a.run.app.
    """
    # https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services/list
    try:
        service = googleapiclient.discovery.build('run', 'v1')
    except google.auth.exceptions.DefaultCredentialsError:
        # Probably running the local development server.
        port = os.environ.get('PORT', '8080')
        url = f'http://localhost:{port}'
    else:
        # https://cloud.google.com/run/docs/reference/container-contract
        k_service = os.environ['K_SERVICE']
        project_id = get_project_id()
        parent = f'namespaces/{project_id}'

        # The global end-point only supports list methods, so you can't use
        # namespaces.services/get unless you know what region to use.
        request = service.namespaces().services().list(parent=parent)
        response = request.execute()

        for item in response['items']:
            if item['metadata']['name'] == k_service:
                url = item['status']['url']
                break
        else:
            raise EnvironmentError('Cannot determine service URL')

    return url

This code uses the metadata service to find the Google Cloud Platform project ID, and the K_SERVICE environment variable to find the Cloud Run service name. With that, it makes a request to the namespaces.services.list API, which returns a list of all the services deployed in a project. Looping through the list, it finds the matching service definition, and returns the URL for the service.

Maybe there’s a simpler approach, because this is a bunch of code that one really shouldn’t need to write. I wish Cloud Run would expose the same sort of environment variables that App Engine provides.

Finding the Cloud Tasks location from App Engine

When using the Google Cloud Tasks API you need to specify the project ID and location. It would be good to not hard-code these for your app, and instead determine the values when the application starts or on first using the API.

Code for this post is available on GitHub.

For an App Engine service, the project ID is readily available, both as a runtime environment variable and from the metadata service. The GOOGLE_CLOUD_PROJECT environment variable is your GCP project ID.

The tasks API location is a bit harder to determine. The App Engine region name (us-central, europe-west, etc.) is not exposed as an environment variable, and there’s no end-point for it in the App Engine metadata service.

However on App Engine the GAE_APPLICATION environment variable exposes the application ID (same as the project ID) prefixed by a short region code. We can use this cryptic region code to identify a Cloud Tasks API location. For example, all App Engine services deployed in the us-central region have a GAE_APPLICATION value that starts with s~, such as s~dbux-test.

Google’s documentation lists all the App Engine regions, but as far as I know there is no Google documentation for these short region code prefixes. So here is the list of App Engine regions, taken from the gcloud app regions list command, along with the short region prefix that appears when an App Engine application is deployed in each.

App Engine region	Short prefix
asia-east1	zde
asia-east2	n
asia-northeast1	b
asia-northeast2	u
asia-northeast3	v
asia-south1	j
asia-southeast1	zas
asia-southeast2	zet
australia-southeast1	f
europe-central2	zlm
europe-west	e
europe-west2	g
europe-west3	h
europe-west6	o
northamerica-northeast1	k
southamerica-east1	i
us-central	s
us-east1	p
us-east4	d
us-west1	zuw
us-west2	m
us-west3	zwm
us-west4	zwn

N.B. App Engine’s europe-west and us-central region names are equivalent to the Cloud Tasks locations europe-west1 and us-central1 respectively.

So from your Python code you can determine the Cloud Tasks location using this list of short region prefixes.

import os

# Hard-coded list of region prefixes to location names.
REGIONCODES_LOCATIONS = {
    'e': 'europe-west1',  # App Engine region europe-west.
    's': 'us-central1',  # App Engine region us-central.
    'p': 'us-east1',
    'j': 'asia-south1',
    # And others.
}

def get_project_and_location_for_tasks():
    # This works on App Engine, won't work on Cloud Run.
    app_id = os.environ['GAE_APPLICATION']
    region_code, _, project_id = app_id.partition('~')

    return project_id, REGIONCODES_LOCATIONS[region_code]

Nice! Does feel a little hacky to hard-code that region/locations map. And this won’t handle App Engine regions not in the list.
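For completeness, here is a sketch that fills in the whole map from the table above and returns `None` for the location when the prefix is unknown, instead of raising `KeyError`. It assumes every App Engine region other than europe-west and us-central has a Cloud Tasks location of the same name, which I have not verified for each region. The function and constant names are mine.

```python
# Short prefix -> App Engine region, copied from the table above.
PREFIX_TO_REGION = {
    'zde': 'asia-east1', 'n': 'asia-east2', 'b': 'asia-northeast1',
    'u': 'asia-northeast2', 'v': 'asia-northeast3', 'j': 'asia-south1',
    'zas': 'asia-southeast1', 'zet': 'asia-southeast2',
    'f': 'australia-southeast1', 'zlm': 'europe-central2',
    'e': 'europe-west', 'g': 'europe-west2', 'h': 'europe-west3',
    'o': 'europe-west6', 'k': 'northamerica-northeast1',
    'i': 'southamerica-east1', 's': 'us-central', 'p': 'us-east1',
    'd': 'us-east4', 'zuw': 'us-west1', 'm': 'us-west2',
    'zwm': 'us-west3', 'zwn': 'us-west4',
}

# The two App Engine regions whose Cloud Tasks location has a different name.
REGION_TO_TASKS_LOCATION = {
    'europe-west': 'europe-west1',
    'us-central': 'us-central1',
}


def tasks_location_from_app_id(app_id):
    """Split a GAE_APPLICATION value into (project_id, tasks_location).

    The location is None when the region prefix is not in the table.
    """
    region_code, separator, project_id = app_id.partition('~')
    if not separator:
        raise ValueError(f'No region prefix in {app_id!r}')

    region = PREFIX_TO_REGION.get(region_code)
    if region is None:
        return project_id, None

    # Assumption: other regions share their name with the Tasks location.
    return project_id, REGION_TO_TASKS_LOCATION.get(region, region)
```

Calling `tasks_location_from_app_id(os.environ['GAE_APPLICATION'])` on App Engine gives the same result as the shorter version for the regions it knows about.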

A more robust solution is to get the location from the Cloud Tasks API. This has the advantage of also working on Cloud Run, but requires 3 more HTTP requests (although those should be super quick). From the command-line, one can use gcloud --project=[project_id] tasks locations list (docs).

The equivalent API method is projects.locations.list.

# pip install google-auth google-api-python-client
import google.auth
import googleapiclient.discovery

def get_project_and_location_for_tasks():
    # Get the project ID from the metadata service. Works on
    # Cloud Run and App Engine.
    _, project_id = google.auth.default()

    name = f'projects/{project_id}'
    service = googleapiclient.discovery.build('cloudtasks', 'v2')
    request = service.projects().locations().list(name=name)
    # Fails if the Cloud Tasks API is not enabled.
    response = request.execute()

    # Grab the first location (there's never more than 1).
    # The response also includes 'name' which is a full location ID like
    # 'projects/[project_id]/locations/[locationId]'.
    first_location = response['locations'][0]

    return project_id, first_location['locationId']

That will fail with an exception if the tasks API is not enabled on the project. When running the application in your local development environment, you will probably want to avoid making requests to public APIs, so that will add some complexity that this code ignores.
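One way to handle the local development case is to wrap the lookup so it is only made once, and can be skipped entirely with an environment variable. This is a sketch: `CLOUD_TASKS_LOCATION` is a hypothetical variable name I made up, and `fetch` stands for a function like `get_project_and_location_for_tasks` above.

```python
import functools
import os


def make_location_lookup(fetch, environ=os.environ):
    """Wrap fetch() so the API is called at most once, and not at all
    when CLOUD_TASKS_LOCATION (a made-up name) is set in the environment.
    """

    @functools.lru_cache(maxsize=None)
    def lookup():
        override = environ.get('CLOUD_TASKS_LOCATION')
        if override:
            project_id = environ.get('GOOGLE_CLOUD_PROJECT', 'development')
            return project_id, override
        return fetch()

    return lookup
```

On your laptop you would set `CLOUD_TASKS_LOCATION=us-central1` (or whatever suits) and no request leaves the machine. In production the variable is unset and the first call pays for the API request.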

The projects.locations.list response is a list of locations. I don’t know if it is currently possible for there to be more than 1 location in the list. The docs suggest that the Cloud Tasks service always follows whichever region the App Engine service is deployed to, and an App Engine service is always tied to 1 region (and cannot change its region later).

What is the location if you use the tasks API from a Cloud Run service, in a project which has never deployed an App Engine service? I don’t know.

I did a test with a project that had an existing App Engine service deployed to region us-central. In the same GCP project I deployed a Cloud Run service in the europe-west1 region, and from the Cloud Run service the call to the projects.locations.list API returned 1 location: us-central1.

Metadata on App Engine

Additional fields in the WSGI request environ

A request to your app results in your code being invoked with a WSGI environ object that captures the request as a series of key / value pairs describing the requested URL, the headers sent by the client, and any data sent by the client.

As well as standard variables such as HTTP_HOST and QUERY_STRING, apps on App Engine receive additional headers that are inserted by the App Engine service itself (not sent by the client). When present, you can rely on these headers being set by App Engine itself; they cannot be set by a malicious client trying to trick your app.

These extra header names all start with HTTP_X_APPENGINE. Here’s a table of the names and example values:

Key	Example value
HTTP_X_APPENGINE_CITY	london
HTTP_X_APPENGINE_CITYLATLONG	51.507351,-0.127758
HTTP_X_APPENGINE_COUNTRY	GB
HTTP_X_APPENGINE_DEFAULT_VERSION_HOSTNAME	app-engine-example.appspot.com
HTTP_X_APPENGINE_HTTPS	on
HTTP_X_APPENGINE_REGION	eng
HTTP_X_APPENGINE_REQUEST_LOG_ID	5ec7d80400ff0736739830b90c0001737e64627578746f6e2d706573740001696e626a756e64000820
HTTP_X_APPENGINE_USER_IP	33.53.215.240

Those location-related headers are cool! In my experience they tend to be fairly accurate, but there are times Google is unable to determine a location. Note that HTTP_X_APPENGINE_REGION describes the requesting user’s region, not the region/zone where your app is hosted.
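Here is a small sketch for pulling these values out of the WSGI environ in one go. The function name is mine, and it assumes the lat/long value is always two comma-separated numbers when present, as in the example above.

```python
def appengine_request_info(environ):
    """Collect the HTTP_X_APPENGINE_* values from a WSGI environ.

    Keys are lower-cased with the prefix removed, e.g. 'city', 'country'.
    """
    prefix = 'HTTP_X_APPENGINE_'
    info = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }

    # Turn '51.507351,-0.127758' into a pair of floats when possible.
    latlong = info.get('citylatlong')
    if latlong:
        lat, _, lng = latlong.partition(',')
        try:
            info['citylatlong'] = (float(lat), float(lng))
        except ValueError:
            info['citylatlong'] = None

    return info
```

Since Google cannot always determine a location, use `info.get('city')` rather than indexing directly.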

The HTTP_X_APPENGINE_DEFAULT_VERSION_HOSTNAME header is interesting because in February 2020 App Engine moved to having a short region code in *.appspot.com host names. I wonder if projects created prior to the introduction of the new scheme all have the old-style host names in this header.

How come the headers from the request are prefixed with HTTP_? Because WSGI extends CGI.
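The naming rule inherited from CGI is simple enough to write down. A sketch (the function name is mine):

```python
def wsgi_environ_key(header_name):
    """Return the WSGI environ key for an HTTP request header name."""
    key = header_name.upper().replace('-', '_')
    # CGI gives these two headers their own variables, without the prefix.
    if key in ('CONTENT_TYPE', 'CONTENT_LENGTH'):
        return key
    return 'HTTP_' + key
```

So a header that App Engine adds as `X-Appengine-City` shows up in the environ as `HTTP_X_APPENGINE_CITY`.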

Environment variables

Your code has access to a regular Unix-style environment. App Engine uses this to share details of the runtime. Here’s a table with some of those keys and example values. See the documentation for environment variables for a complete list.

Key	Example value
GAE_APPLICATION	s~app-engine-example
GAE_DEPLOYMENT_ID	426619129872753442
GAE_ENV	standard
GAE_INSTANCE	0c61b117cf53359b13df64f50b43903717d4d1d624f37d602146c9642e9ab832ea12a3a
GAE_MEMORY_MB	256
GAE_RUNTIME	python37
GAE_SERVICE	default
GAE_VERSION	20200221t144119
GOOGLE_CLOUD_PROJECT	app-engine-example
PYTHONDONTWRITEBYTECODE	1

Interesting things:

GAE_APPLICATION is your project name, but prefixed with a letter and tilde. That prefix identifies the App Engine region your app belongs to, which you must choose the first time you create the App Engine service in your project. You can’t change the region later. I swear there’s documentation listing these 1 letter prefixes, but I can’t find it now.
GAE_SERVICE and GAE_VERSION identify the version of your app that is running. Well useful if you think everything must be divided into micro-services (it should not) or if you deploy multiple versions to test things before setting the default version (you should).
PYTHONDONTWRITEBYTECODE is a Python-specific thing that prevents the creation of *.pyc files and __pycache__ directories. In theory your app would start a little faster on the first request if this option was turned off, but in practise it doesn’t matter, and this makes life easier for how App Engine runs your code. Given that, I don’t get why the Google Cloud blog highlights the parallel filesystem cache feature in their Python 3.8 beta announcement. Maybe it is enabled on that runtime, I haven’t tested it myself.

For the older Python 2.7 standard runtime, the OS environment includes all the request things as well. Again, because CGI.

The metadata service

Google’s Compute Engine metadata service is there on App Engine too (but not for the older Python 2.7 standard runtime). Except App Engine doesn’t get all the same things, and it is read-only.

The metadata service runs as an HTTP server. Your code makes requests to http://metadata.google.internal/computeMetadata/v1/ and descendant paths. On App Engine the metadata service exposes information about service accounts, the project’s zone, and the project’s ID.
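If you would rather not add a dependency, a request to the metadata service can be made with the standard library. This is a sketch with helper names of my own. The one thing that is easy to forget is the `Metadata-Flavor: Google` header, which every request to the service needs.

```python
import urllib.request

METADATA_ROOT = 'http://metadata.google.internal/computeMetadata/v1/'


def metadata_request(path):
    """Build a request for a path relative to the metadata root."""
    return urllib.request.Request(
        METADATA_ROOT + path,
        headers={'Metadata-Flavor': 'Google'},
    )


def read_metadata(path, timeout=2):
    """Fetch a metadata value as text. Only works on Google's platform."""
    request = metadata_request(path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

For example `read_metadata('project/project-id')` returns the project ID when running on App Engine, and fails with a `URLError` on your laptop.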

The service account path is particularly useful, covering some of what used to be available with the google.appengine.api.app_identity APIs. If your code uses the google-auth package, you can get credentials with a short-lived access token for the default service account via google.auth.default() which in turn gets the token from the metadata service.

But if you want to change the OAuth scopes of the access token, you can go get it yourself. This Python snippet gets a token with scopes for the Google Spreadsheets API:

import requests

url = 'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token'
headers = {'Metadata-Flavor': 'Google'}
scopes = ['https://www.googleapis.com/auth/spreadsheets']
params = {'scopes': ','.join(scopes)}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
# The response body is JSON; the token itself is under "access_token".
token = response.json()['access_token']

These are the paths currently exposed by the App Engine metadata service, relative to http://metadata.google.internal/computeMetadata/v1/, with example values:

Path	Example value
instance/service-accounts/default/aliases	default
instance/service-accounts/default/email	[email protected]
instance/service-accounts/default/identity?audience=foo	JWT token
instance/service-accounts/default/scopes	https://www.googleapis.com/auth/appengine.apis
	https://www.googleapis.com/auth/cloud-platform
	https://www.googleapis.com/auth/cloud_debugger
	https://www.googleapis.com/auth/devstorage.full_control
	https://www.googleapis.com/auth/logging.write
	https://www.googleapis.com/auth/monitoring.write
	https://www.googleapis.com/auth/trace.append
	https://www.googleapis.com/auth/userinfo.email
instance/service-accounts/default/token	{"access_token": "token", "expires_in": 1799, "token_type": "Bearer"}
instance/zone	projects/123456789012/zones/us16
project/numeric-project-id	123456789012
project/project-id	app-engine-example

If your project has multiple service accounts, there will be multiple entries under the instance/service-accounts path.

You can get all this (except for the access and identity tokens) with 1 request:

import requests

url = 'http://metadata.google.internal/computeMetadata/v1/'
headers = {'Metadata-Flavor': 'Google'}
params = {'recursive': 'true'}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
data = response.json()

It annoys me that the zone is provided here, but not the region, and that there is no documented way to derive the longer region identifier from a short zone identifier (as far as I am aware). But looks like that will change soon!

Conclusion

Getting information about the App Engine runtime environment turns out to be incredibly useful for your app because you can remove hard-coded assumptions from your code. The metadata service is particularly interesting because it allows more flexible response types (as opposed to strings in Unix environment variables) and a clear path for Google to extend it without fear of breaking your crazy code.

Tiny speed-ups for Python code

Jinja2 templates and Bottle

Although [Bottle’s][bottle] built-in mini-template language is remarkably useful, I nearly always prefer to use [Jinja2 templates][jinja2] because the syntax is very close to [Django’s template syntax][django] (which I am more familiar with) and because the Bottle template syntax for filling in blocks from a parent template is a bit limiting (but that’s kind of the point).

Bottle provides a nice jinja2_view decorator that makes it easy to use Jinja2 templates, but it isn’t that obvious how to configure the template environment for auto-escaping and default context, etc.

(The rest of this relates to Bottle version 0.11 and Jinja2 version 2.7.)

Template paths
--------------

Bottle’s view decorator takes an optional `template_lookup` keyword argument. The default is to look in the current working directory and in a ‘views’ directory, i.e. `template_lookup=('.', './views/')`.

You can override the template path like so:

from bottle import jinja2_view, route

@route('/', name='home')
@jinja2_view('home.html', template_lookup=['templates'])
def home():
    return {'title': 'Hello world'}

Which will load `templates/home.html`.

Most likely you will want to use the same template path for every view, which can be done by wrapping `jinja2_view`:

import functools
from bottle import jinja2_view, route

view = functools.partial(jinja2_view, template_lookup=['templates'])

@route('/', name='home')
@view('home.html')
def home():
    return {'title': 'Hello world'}

@route('/foo', name='foo')
@view('foo.html')
def foo():
    return {'title': 'Foo'}

That would have loaded `templates/home.html` and `templates/foo.html`.

Another way of setting a global template path for the view decorator is to fiddle with Bottle’s global default template path:

from bottle import TEMPLATE_PATH, jinja2_view, route

TEMPLATE_PATH[:] = ['templates']

@route('/', name='home')
@jinja2_view('home.html')
def home():
    return {'title': 'Hello world'}

N.B. I used `TEMPLATE_PATH[:]` to update the global template path directly rather than re-assigning it with `TEMPLATE_PATH = ['templates']`.
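Why the slice assignment matters: re-assigning the name only changes what your own module sees, while Bottle keeps looking at its original list. Here is a sketch that shows the difference using a stand-in module (`fake_bottle` is made up so the example runs without Bottle installed):

```python
import types

# A stand-in for the bottle module and its module-level list.
fake_bottle = types.ModuleType('fake_bottle')
fake_bottle.TEMPLATE_PATH = ['./', './views/']

# Equivalent of: from fake_bottle import TEMPLATE_PATH
TEMPLATE_PATH = fake_bottle.TEMPLATE_PATH

# Re-assigning binds our name to a new list. The module is unaffected.
TEMPLATE_PATH = ['templates']
after_rebinding = list(fake_bottle.TEMPLATE_PATH)

# Slice assignment changes the list object the module still refers to.
TEMPLATE_PATH = fake_bottle.TEMPLATE_PATH
TEMPLATE_PATH[:] = ['templates']
after_slice_assignment = list(fake_bottle.TEMPLATE_PATH)

print(after_rebinding, after_slice_assignment)
```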

Template defaults
-----------------

Bottle has a useful `url()` function to generate urls in your templates using named routes. But it isn’t in the template context by default. You can modify the default context on the Jinja2 template class provided by Bottle:

from bottle import Jinja2Template, url

Jinja2Template.defaults = {
    'url': url,
    'site_name': 'My blog',
}

Jinja2 version 2.7 by default does *not* escape variables. This surprises me, but it is easy to configure a template environment to auto-escape variables.

from bottle import Jinja2Template

Jinja2Template.settings = {
    'autoescape': True,
}

Any of [the Jinja2 environment keyword arguments][environment] can go in this settings dictionary.

Using your own template environment
-----------------------------------

Bottle’s template wrappers make a new instance of a Jinja2 template environment for each template (although if two views use the same template then they will share the compiled template and its environment).

You can avoid this duplication of effort by creating the Jinja2 template environment yourself, however this approach means you also need to write your own view decorator to use the custom template environment. No biggie.

Setting up a global Jinja2 template environment to look for templates in a “templates” directory:

from bottle import url
import jinja2

env = jinja2.Environment(
    loader=jinja2.FileSystemLoader('templates'),
    autoescape=True,
)
env.globals.update({
    'url': url,
    'site_name': 'My blog',
})

You then need a replacement for Bottle’s view decorator that uses the previously configured template environment:

import functools

from bottle import route

# Assuming env has already been defined in the module's scope.
def view(template_name):
    def decorator(view_func):
        @functools.wraps(view_func)
        def wrapper(*args, **kwargs):
            response = view_func(*args, **kwargs)

            # Only render when the view returned a context dictionary.
            if isinstance(response, dict):
                template = env.get_or_select_template(template_name)
                return template.render(**response)
            else:
                return response

        return wrapper

    return decorator

@route('/', name='home')
@view('home.html')
def home():
    return {'title': 'Hello world'}

Conclusion
----------

It’s easy to customize the template environment for Jinja2 with Bottle and keep compatibility with Bottle’s own view decorator, but at some point you may decide it is more efficient to bypass things and set up a custom Jinja2 environment.

Bottle is nice like that.

[bottle]: http://bottlepy.org/
[jinja2]: http://jinja.pocoo.org/
[django]: https://www.djangoproject.com/
[environment]: http://jinja.pocoo.org/docs/api/#jinja2.Environment

Grouping URLs in Django routing

One of the things I liked (and still like) about [Django][django] is that request routing is configured with regular expressions. You can capture positional and named parts of the request path, and the request handler will be invoked with the captured strings as positional and/or keyword arguments.

Quite often I find that the URL patterns repeat a lot of the regular expressions with minor variations for different but related view functions. For example, suppose you want CRUD-style URLs for a particular resource, you would [write an `urls.py`][urls] looking something like:

from django.conf.urls import url, patterns

urlpatterns = patterns('myapp.views',
    url(r'^(?P<slug>[-\w]+)/$', 'detail'),
    url(r'^(?P<slug>[-\w]+)/edit/$', 'edit'),
    url(r'^(?P<slug>[-\w]+)/delete/$', 'delete'),
)

The `detail`, `edit` and `delete` view functions (defined in `myapp.views`) all take a `slug` keyword argument, so one has to repeat that part of the regular expression for each URL.

When you have more complex routing configurations, repeating the `(?P<slug>[-\w]+)/` portion of each route can be tedious. Wouldn’t it be nice to declare that a bunch of URL patterns all start with the same capturing pattern and avoid the repetition?

It _would_ be nice.

I want to be able to write an URL pattern that defines a common base pattern that the nested URLs extend:

from django.conf.urls import url, patterns, group
from myapp.views import detail, edit, delete

urlpatterns = patterns('',
    group(r'^(?P<slug>[-\w]+)/',
        url(r'^$', detail),
        url(r'^edit/$', edit),
        url(r'^delete/$', delete),
    ),
)

Of course there is no `group` function defined in Django’s `django.conf.urls` module. But if there were, it would function [like Django’s `include`][include] but act on locally declared URLs instead of a separate module’s patterns.

It happens that this is trivial to implement! Here it is:

from django.conf.urls import url, patterns, RegexURLResolver
from myapp.views import detail, edit, delete

def group(regex, *args):
    # Resolve the nested URL patterns relative to the common prefix.
    return RegexURLResolver(regex, args)

urlpatterns = patterns('',
    group(r'^(?P<slug>[-\w]+)/',
        url(r'^$', detail),
        url(r'^edit/$', edit),
        url(r'^delete/$', delete),
    ),
)

This way the `detail`, `edit` and `delete` view functions still get invoked with a `slug` keyword argument, but you don’t have to repeat the common part of the regular expression for every route.
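To see why this works, here is a simplified illustration in plain Python of what a nested resolver does. This is not Django's code, only a sketch using `re`: match the prefix, then match what is left of the path against each nested pattern, and merge the captured keyword arguments.

```python
import re


def resolve(path, prefix, routes):
    """Return (view_name, kwargs) for the first matching route, or None.

    routes is a list of (pattern, view_name) pairs that are matched
    against whatever remains of the path after the prefix.
    """
    prefix_match = re.search(prefix, path)
    if prefix_match is None:
        return None

    remainder = path[prefix_match.end():]
    for pattern, view_name in routes:
        match = re.search(pattern, remainder)
        if match is not None:
            # Captures from the prefix are passed along with the route's own.
            kwargs = dict(prefix_match.groupdict(), **match.groupdict())
            return view_name, kwargs

    return None


routes = [(r'^$', 'detail'), (r'^edit/$', 'edit'), (r'^delete/$', 'delete')]
result = resolve('hello-world/edit/', r'^(?P<slug>[-\w]+)/', routes)
print(result)
```

The `slug` is captured once by the prefix and handed to whichever nested view matches.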

There is a problem: it won’t work if you want to use a module prefix string (the first argument to `patterns(…)`). You either have to give a full module string, or use the view objects directly. So you can’t do this:

urlpatterns = patterns('myapp.views',
    # Doesn't work.
    group(r'^(?P<slug>[-\w]+)/',
        url(r'^$', 'detail'),
    ),
)

Personally I don’t think this is much of an issue since I prefer to use the view objects, and if you are using [class-based views][cbv] you will likely be using the view objects anyway.

I don’t know if “group” is a good name for this helper function. Other possibilities: “prefix”, “local”, “prepend”, “buxtonize”. You decide.

[django]: https://www.djangoproject.com/
[include]: https://docs.djangoproject.com/en/1.5/ref/urls/#include
[cbv]: http://ccbv.co.uk/
[urls]: https://docs.djangoproject.com/en/1.5/topics/http/urls/

Testing with Django, Haystack and Whoosh

The problem: you want to test a [Django][django] view for results of a search query, but [Haystack][haystack] will be using your real query index, built from your real database, instead of an index built from your test fixtures.

Turns out you can generalise this for any Haystack back-end by replacing the `haystack.backend` module with the simple back-end.

from myapp.models import MyModel
from django.test import TestCase
from haystack.query import SearchQuerySet
import haystack

class SearchViewTests(TestCase):
    fixtures = ['test-data.json']

    def setUp(self):
        # Swap in the simple back-end, which queries the test database.
        self._haystack_backend = haystack.backend
        haystack.backend = haystack.load_backend('simple')

    def tearDown(self):
        haystack.backend = self._haystack_backend

    def test_search(self):
        results = SearchQuerySet().all()
        assert results.count() == MyModel.objects.count()

My first attempt at this made changes to the project settings [and did `HAYSTACK_WHOOSH_STORAGE = "ram"`][ram]. That works but was complicated: you have to re-build the index with the fixtures loaded, except the fixtures [don’t get loaded in `TestCase.setUpClass`][setupclass], so the choice was to load the fixtures myself or to re-index for each test. And it was specific to [the Whoosh back-end][whoosh] of course.

(This is for Django 1.4 and Haystack 1.2.7. In my actual project I get to deploy on Python 2.5. Ain’t I lucky? On a fucking PowerMac G5 running OS X Server 10.5 [for fuck sacks][bug].)

[django]: https://www.djangoproject.com/
[whoosh]: http://bitbucket.org/mchaput/whoosh
[haystack]: http://haystacksearch.org/
[setupclass]: http://docs.python.org/2/library/unittest.html#unittest.TestCase.setUpClass
[ram]: https://django-haystack.readthedocs.org/en/v1.2.7/settings.html#haystack-whoosh-storage
[bug]: http://www.youtube.com/watch?v=XZtpAxDEzl8

Optimizing queries in Haystack results

How to fix “ghost” files in the Finder

Sometimes the Mac Finder can get its knickers in a twist about files that you *ought* to be able to open just fine but Finder says no. You may see a message that says “Item XYZ is used by Mac OS X and can’t be opened.”

This can happen if the Finder is in the middle of a copy and the source disk is suddenly disconnected. During a copy the Finder sets a special type/creator code on files, and when the copy completes the proper type/creator code is restored. But if the copy is interrupted then sometimes instead of recovering gracefully the Finder leaves the files with their temporary attributes.

So if you do have these “ghost” files, here is a bit of command-line magic to remove the Finder’s temporary attributes (you would need to change the “`/path/to/files`” bit):

mdfind -onlyin /path/to/files -0 "kMDItemFSTypeCode==brok && kMDItemFSCreatorCode==MACS" | xargs -0 -n1 xattr -d com.apple.FinderInfo

(This is a single command, all on one line, which you type in Terminal.app.)

What this does is use [the Spotlight tool][mdfind] to find only files that have a Mac file type of “brok” and a creator code set to “MACS”. Then for each file we remove the Finder attributes (which means the Finder reverts to using the file-name’s extension when deciding how to open a file).

The `-onlyin` flag is used to restrict the search to a particular folder and its sub-folders, a belt and braces approach to making sure we only fix the files we are interested in.

This post was suggested by [@hobbsy][hobbsy], who had probs with RAW pictures he had copied across disks.

Although the way I showed him how to do this originally was needlessly convoluted because I forgot what one can do with `mdfind`.

Although the way I showed him how to do this was so much simpler because it was wrapped in a drag-and-drop Mac application created using [Sveinbjörn Þórðarson’s Platypus][playtpus], so no need to type the wrong thing into the shell and accidentally delete everything.

[hobbsy]: http://twitter.com/hobbsy
[playtpus]: http://sveinbjorn.org/platypus/
[codes]: http://arstechnica.com/staff/2009/09/metadata-madness/
[mdfind]: https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/mdfind.1.html

Reliably Broken

It's a blog: let's do funch!